TREX Workshop at CALMIP: StarPU Tutorial, November, 22nd 2022
This tutorial is part of the TREX Workshop at CALMIP taking place on November, 22nd 2022
The general presentation slides are available as PDF.
The tutorial slides are available as PDF.
The C version of the tutorial practice session is also available as a webpage.
Environment preparation
We have preinstalled StarPU and its tools on Olympe, so you can start right away by loading the modules.
First, add this at the end of your $HOME/.bashrc:
MODULEPATH=$MODULEPATH:/usr/local/trex/modulefiles_toadd
source /usr/local/trex/modulefiles_toadd/starpu/starpu_env
Log out and log back into Olympe.
And then you can load the modules:
module load cuda/11.2
module load fxt
module load hwloc
module load gcc/10.3.0
module load starpu
You can check that your environment is ready with:
starpu-volta lstopo -
starpu-volta starpu_machine_display
(StarPU will calibrate performance models for the machine.)
Session Part 1: Task-based Programming Model
Application Example: Vector Scaling
A vector scaling example is available in the fortran folder of the archive file.
Base version
The original non-StarPU Fortran version (nf_vector_scal0.f90) is available in the material tarball and shows the basic example that we will be using to illustrate how to use StarPU. It simply allocates a vector, and calls a scaling function over it.
subroutine vector_scal_cpu(val, n, factor)
    implicit none
    real, dimension(:), intent(inout) :: val
    integer, intent(in) :: n
    real, intent(in) :: factor
    integer :: i
    do i = 1, n
        val(i) = val(i)*factor
        !write(*,*) i, val(i)
    end do
end subroutine vector_scal_cpu

program nf_vector_scal0
    use nf_vector_scal_cpu
    implicit none
    real, dimension(:), allocatable :: vector
    integer :: i, NX = 2048
    real :: factor = 3.14
    allocate(vector(NX))
    vector = 1
    write(*,*) "BEFORE: First element was", vector(1)
    call vector_scal_cpu(vector, NX, factor)
    write(*,*) "AFTER: First element is", vector(1)
end program nf_vector_scal0
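Since this version does not depend on StarPU, it can be built and run with a plain Fortran compiler, for instance (assuming gfortran and that the listing above is the whole nf_vector_scal0.f90 file):
gfortran -o nf_vector_scal0 nf_vector_scal0.f90
./nf_vector_scal0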
StarPU Fortran version
The StarPU Fortran version of the scaling example is available in the material tarball:
- The main application (nf_vector_scal.f90)
- The CPU implementation of the codelet (nf_vector_scal_cl.f90)
- The CUDA implementation of the codelet (nf_vector_scal_cuda.cu)
Computation Kernels
Examine the source code, starting from nf_vector_scal_cl.f90: codelets always use the same Fortran prototype, which takes a series of DSM (Distributed Shared Memory) interfaces and a non-DSM parameter:
recursive subroutine cl_cpu_func_vector_scal (buffers, cl_args) bind(C)
    type(c_ptr), value, intent(in) :: buffers, cl_args
The code first gets the size of the vector data, and extracts the base pointer:
real, dimension(:), pointer :: val
integer :: n_val
n_val = fstarpu_vector_get_nx(buffers, 0)
call c_f_pointer(fstarpu_vector_get_ptr(buffers, 0), val, shape=[n_val])
It then gets the factor value from the non-DSM parameter:
real, target :: factor
call fstarpu_unpack_arg(cl_args, (/ c_loc(factor) /))
and finally performs the vector scaling:
integer :: i
do i = 1, n_val
    val(i) = val(i)*factor
end do
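Putting these three pieces together, the CPU codelet is shaped roughly as follows (a sketch assembled from the snippets above; the enclosing module and use statements are assumed here, so check nf_vector_scal_cl.f90 for the exact code):

recursive subroutine cl_cpu_func_vector_scal(buffers, cl_args) bind(C)
    use iso_c_binding    ! c_ptr, c_f_pointer, c_loc
    use fstarpu_mod      ! StarPU Fortran API (module name assumed, from fstarpu_mod.f90)
    type(c_ptr), value, intent(in) :: buffers, cl_args
    real, dimension(:), pointer :: val
    real, target :: factor
    integer :: n_val, i
    ! extract the vector size and base pointer from the DSM interface
    n_val = fstarpu_vector_get_nx(buffers, 0)
    call c_f_pointer(fstarpu_vector_get_ptr(buffers, 0), val, shape=[n_val])
    ! unpack the non-DSM factor argument
    call fstarpu_unpack_arg(cl_args, (/ c_loc(factor) /))
    do i = 1, n_val
        val(i) = val(i)*factor
    end do
end subroutine cl_cpu_func_vector_scal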
The GPU implementation, in nf_vector_scal_cuda.cu, is basically the same as the C implementation: the host part (vector_scal_cuda) extracts the actual CUDA pointer from the DSM interface and passes it to the device part (vector_mult_cuda), which performs the actual computation.
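Its shape is roughly the following (a sketch, not the exact tarball code; the starpu.h helpers used here are the standard C ones from the C version of this tutorial):

#include <starpu.h>

/* device part: each thread scales one element */
static __global__ void vector_mult_cuda(unsigned n, float *val, float factor)
{
    unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        val[i] *= factor;
}

/* host part: extract the CUDA pointer from the DSM interface and launch */
extern "C" void vector_scal_cuda(void *buffers[], void *cl_args)
{
    float factor;
    unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
    float *val = (float *)STARPU_VECTOR_GET_PTR(buffers[0]);
    starpu_codelet_unpack_args(cl_args, &factor);

    unsigned threads_per_block = 64;
    unsigned nblocks = (n + threads_per_block - 1) / threads_per_block;
    vector_mult_cuda<<<nblocks, threads_per_block, 0,
                       starpu_cuda_get_local_stream()>>>(n, val, factor);
}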
Main Code
Now examine nf_vector_scal.f90.
- The main program starts by initializing StarPU with the default parameters:
  err = fstarpu_init(C_NULL_PTR)
- It then allocates the vector and fills it like the original code:
  allocate(vector(NX))
  vector = 1
- The cl (codelet) structure simply gathers pointers to the functions mentioned above, and notes that the functions take only one DSM parameter. An empty codelet structure has to be allocated before adding the CPU function and setting the codelet fields:
  type(c_ptr) :: scal_cl
  scal_cl = fstarpu_codelet_allocate()
  call fstarpu_codelet_set_name(scal_cl, C_CHAR_"vector_scal_codelet"//C_NULL_CHAR)
  call fstarpu_codelet_add_cpu_func(scal_cl, C_FUNLOC(cl_cpu_func_vector_scal))
  call fstarpu_codelet_add_buffer(scal_cl, FSTARPU_RW)
- It then registers the data with StarPU, and gets back a DSM handle. From now on, the application is not supposed to access vector directly, since its content may be copied and modified by a task on a GPU, leaving the main-memory copy outdated:
  type(c_ptr) :: vector_handle
  call fstarpu_vector_data_register(vector_handle, 0, c_loc(vector), NX, c_sizeof(vector(1)))
- It then submits a task to StarPU:
  call fstarpu_task_insert((/ scal_cl, &
      FSTARPU_RW, vector_handle, &
      FSTARPU_VALUE, c_loc(factor), FSTARPU_SZ_C_FLOAT, &
      C_NULL_PTR /))
- It waits for task completion:
  call fstarpu_task_wait_for_all()
- It unregisters the vector from StarPU, which brings the modified version back to main memory, so the result can be read:
  call fstarpu_data_unregister(vector_handle)
- It frees the codelet structure:
  call fstarpu_codelet_free(scal_cl)
- Finally, it shuts down StarPU and deallocates the vector:
  call fstarpu_shutdown()
  deallocate(vector)
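Assembled, the main program is shaped roughly as follows (a sketch built from the snippets above; the codelet module name is assumed, and error checking is omitted):

program nf_vector_scal
    use iso_c_binding
    use fstarpu_mod               ! StarPU Fortran API (from fstarpu_mod.f90)
    use nf_vector_scal_cl         ! codelet module name assumed
    implicit none
    integer(c_int), parameter :: NX = 2048
    real, dimension(:), allocatable, target :: vector
    real, target :: factor = 3.14
    type(c_ptr) :: scal_cl, vector_handle
    integer(c_int) :: err

    err = fstarpu_init(C_NULL_PTR)
    allocate(vector(NX))
    vector = 1

    scal_cl = fstarpu_codelet_allocate()
    call fstarpu_codelet_set_name(scal_cl, C_CHAR_"vector_scal_codelet"//C_NULL_CHAR)
    call fstarpu_codelet_add_cpu_func(scal_cl, C_FUNLOC(cl_cpu_func_vector_scal))
    call fstarpu_codelet_add_buffer(scal_cl, FSTARPU_RW)

    call fstarpu_vector_data_register(vector_handle, 0, c_loc(vector), NX, c_sizeof(vector(1)))
    call fstarpu_task_insert((/ scal_cl, &
        FSTARPU_RW, vector_handle, &
        FSTARPU_VALUE, c_loc(factor), FSTARPU_SZ_C_FLOAT, &
        C_NULL_PTR /))
    call fstarpu_task_wait_for_all()
    call fstarpu_data_unregister(vector_handle)
    call fstarpu_codelet_free(scal_cl)
    call fstarpu_shutdown()
    deallocate(vector)
end program nf_vector_scal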
Building and running the StarPU Fortran version
Building
Let us look at how this should be built. A typical Makefile for Fortran applications using StarPU is the following:
PROG = nf_vector_scal
STARPU_VERSION=1.3
FSTARPU_MOD = $(shell pkg-config --cflags-only-I starpu-$(STARPU_VERSION)|sed -e 's/^\([^ ]*starpu\/$(STARPU_VERSION)\).*$$/\1/;s/^.* //;s/^-I//')/fstarpu_mod.f90
SRCSF = nf_vector_scal_cl.f90
FC = gfortran
CC = gcc
CFLAGS = -g $(shell pkg-config --cflags starpu-$(STARPU_VERSION))
FCFLAGS = -J. -g
LDLIBS = $(shell pkg-config --libs starpu-$(STARPU_VERSION))
OBJS = fstarpu_mod.o $(SRCSF:%.f90=%.o)

all: $(PROG)

$(PROG): %: %.o $(OBJS)
	$(FC) $(LDFLAGS) -o $@ $^ $(LDLIBS)

fstarpu_mod.o: $(FSTARPU_MOD)
	$(FC) $(FCFLAGS) -c -o $@ $<

%.o: %.f90
	$(FC) $(FCFLAGS) -c -o $@ $<

nf_vector_scal.o: nf_vector_scal_cl.o fstarpu_mod.o
nf_vector_scal_cl.o: fstarpu_mod.o
The Fortran module fstarpu_mod.f90 must be compiled with the same compiler as the application itself, and the resulting fstarpu_mod.o object file must be linked into the application executable.
The provided Makefile additionally detects whether CUDA is available in StarPU, and adds the corresponding files and link flags.
Running
Run make nf_vector_scal, and run the resulting nf_vector_scal executable.
It should work: it simply scales a given vector by a given factor.
make nf_vector_scal
./nf_vector_scal
Note that if you are using the simulation version of StarPU, the computation
will not be performed, and thus the final value will be equal to the initial
value, but the timing provided by fstarpu_timing_now()
will correspond
to the correct execution time.
You can set the environment variable STARPU_WORKER_STATS to 1 when running your application to see the number of tasks executed by each device. You can see the whole list of environment variables here.
STARPU_WORKER_STATS=1 ./nf_vector_scal
To run with CUDA support (or for large runs), you need to submit a job to
Slurm with the starpu-volta
script we provide here:
STARPU_WORKER_STATS=1 starpu-volta ./nf_vector_scal
# force execution on the GPU devices (by default, StarPU uses both the CPUs and the CUDA devices)
STARPU_WORKER_STATS=1 STARPU_NCPU=0 starpu-volta ./nf_vector_scal
Data Partitioning
In the previous section, we submitted only one task. Here we discuss how to partition data so as to submit multiple tasks which can be executed in parallel by the various CPUs.
Let’s examine partition/nf_mult.f90.
- The computation kernel, cl_cpu_mult_func, is a trivial matrix multiplication kernel, which operates on 3 given DSM interfaces. These will actually not be whole matrices, but only small parts of matrices.
- init_problem_data initializes the whole A, B and C matrices.
- partition_mult_data does the actual registration and partitioning. Matrices are first registered completely, then two partitioning filters are declared. The first one, filter_vert, is used to split B and C vertically. The second one, filter_horiz, is used to split A and C horizontally. We thus end up with a grid of pieces of C to be computed from stripes of A and B (a rough sketch follows this list).
- launch_tasks submits the actual tasks: for each piece of C, it takes the appropriate pieces of A and B to produce the piece of C.
- The access mode is interesting: A and B just need to be read from, and C will only be written to. This means that StarPU will copy the pieces of A and B across the machine to where tasks need them, and will give the tasks uninitialized buffers for the pieces of C, since they will not be read from.
- The main code initializes StarPU and the data, launches tasks, unpartitions the data, and unregisters it. Unpartitioning is an interesting step: until then, the pieces of C reside on the various GPUs where they were computed. Unpartitioning collects all the pieces of C into main memory to form the whole C result matrix.
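As a rough sketch of the partitioning step (the matrix filter allocators below are named by analogy with the fstarpu_df_alloc_vector_filter_block() filter used later in this tutorial, and the nchildren/partition/map calls are assumed to mirror the C API starpu_data_partition/starpu_data_map_filters; check partition/nf_mult.f90 for the exact calls):

type(c_ptr) :: filter_vert, filter_horiz
! declare the two filters and the number of pieces each produces (names assumed)
filter_vert = fstarpu_df_alloc_matrix_filter_vertical_block()
call fstarpu_data_filter_set_nchildren(filter_vert, nslicesx)
filter_horiz = fstarpu_df_alloc_matrix_filter_block()
call fstarpu_data_filter_set_nchildren(filter_horiz, nslicesy)
! B is split vertically, A horizontally, and C both ways
call fstarpu_data_partition(B_handle, filter_vert)
call fstarpu_data_partition(A_handle, filter_horiz)
call fstarpu_data_map_filters(C_handle, 2, (/ filter_vert, filter_horiz /))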
Run the application, enabling some statistics:
make nf_mult
STARPU_WORKER_STATS=1 ./nf_mult
It shows how the computation was distributed over the various processing units.
Other example
gemm/nf_xgemm.f90 is a very similar matrix-matrix product example, but which makes use of BLAS kernels for much better performance. The mult_kernel_common interface shows how we call SGEMM (single precision) or DGEMM (double precision) on the DSM interface.
Let’s execute them.
cd gemm
make nf_sgemm
STARPU_WORKER_STATS=1 ./nf_sgemm
make nf_dgemm
STARPU_WORKER_STATS=1 ./nf_dgemm
Exercise
Take the vector example again, and add partitioning support to it, using the
matrix-matrix multiplication as an example. Here we will use the
fstarpu_df_alloc_vector_filter_block()
filter function. You can see the list of
predefined filters provided by
StarPU here.
We provide a solution for the
exercise here.
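As a starting point, the partitioned main loop could be shaped like this (a sketch: fstarpu_df_alloc_vector_filter_block() is the filter mentioned above, but the nchildren setter, the partition/unpartition calls and the sub-data accessor are assumed to mirror the C API starpu_data_partition/starpu_data_unpartition/starpu_data_get_sub_data; check the provided solution for the exact names):

type(c_ptr) :: filter
integer(c_int), parameter :: NPARTS = 4
integer(c_int) :: i
filter = fstarpu_df_alloc_vector_filter_block()
call fstarpu_data_filter_set_nchildren(filter, NPARTS)  ! name assumed
call fstarpu_data_partition(vector_handle, filter)      ! split into NPARTS pieces
do i = 0, NPARTS-1
    ! one task per piece (sub-data accessor name assumed)
    call fstarpu_task_insert((/ scal_cl, &
        FSTARPU_RW, fstarpu_data_get_sub_data(vector_handle, 1, i), &
        FSTARPU_VALUE, c_loc(factor), FSTARPU_SZ_C_FLOAT, &
        C_NULL_PTR /))
end do
call fstarpu_task_wait_for_all()
call fstarpu_data_unpartition(vector_handle, 0)         ! 0 = main memory node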
Session Part 2: Optimizations
This is based on StarPU’s documentation optimization chapter.
Data Management
We have explained how StarPU can overlap computation and data transfers
thanks to DMAs. This is however only possible when CUDA has control over the
application buffers. The application should thus use fstarpu_memory_pin()
after allocating its buffer, to permit asynchronous DMAs from and to
it.
Take the vector example again, and after the allocation, make it use fstarpu_memory_pin().
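In the vector example, this could look as follows (a sketch: fstarpu_memory_pin()/fstarpu_memory_unpin() are assumed here to take the buffer address and its size in bytes, mirroring the C starpu_memory_pin/starpu_memory_unpin, and to return 0 on success):

integer(c_int) :: err
allocate(vector(NX))
! pin the buffer so CUDA can perform asynchronous DMAs from and to it
err = fstarpu_memory_pin(c_loc(vector), int(NX, c_size_t) * c_sizeof(vector(1)))
! ... register, submit tasks, unregister ...
! unpin before deallocating (counterpart name assumed)
err = fstarpu_memory_unpin(c_loc(vector), int(NX, c_size_t) * c_sizeof(vector(1)))
deallocate(vector)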
Task Submission
To let StarPU reorder tasks, submit data transfers in advance, etc., task submission should be asynchronous whenever possible. Ideally, the application should submit the whole graph of tasks, then wait for their termination.
The CUDA execution should be submitted asynchronously, so as to let kernel computation and data transfer proceed independently:
call fstarpu_codelet_add_cuda_func(scal_cl, C_FUNLOC(cl_cuda_func_vector_scal))
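In the C API, asynchronous CUDA execution is additionally requested by setting the STARPU_CUDA_ASYNC flag on the codelet. A matching Fortran call is sketched here (the name is hypothetical; check fstarpu_mod.f90 for the actual one):

call fstarpu_codelet_add_cuda_flags(scal_cl, FSTARPU_CUDA_ASYNC)  ! name assumed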
Performance Model Calibration
Inspection
Performance prediction is essential for proper scheduling decisions; the
performance models thus have to be calibrated. This is done automatically by
StarPU when a codelet is executed for the first time. Once this is done, the
result is saved to a file in $STARPU_PERF_MODEL_DIR
for later re-use. The
starpu_perfmodel_display
tool can be used to check the resulting
performance model.
STARPU_PERF_MODEL_DIR
specifies the main directory in which
StarPU stores its performance model files. The default is
$STARPU_HOME/.starpu/sampling
.
STARPU_HOME
specifies the main directory in which StarPU
stores its configuration files. The default is $HOME
on Unix
environments, and $USERPROFILE
on Windows environments.
In this tutorial we provide some pre-calibrated performance models to use with the SimGrid version of StarPU. You can run
source ./perfmodels.sh
to enable using them (it sets STARPU_PERF_MODEL_DIR to the perfmodels directory available in the archive). Then you can use starpu_perfmodel_display to get the performance model details:
$ starpu_perfmodel_display -l # Show the list of codelets that have a performance model
file: <nf_mult_perf.conan>
file: <nf_vector_scal_perf.conan>
$ starpu_perfmodel_display -s nf_vector_scal_perf.conan
# performance model for cpu0_impl0 (Comb0)
Regression : #sample = 200
Linear: y = alpha size ^ beta
alpha = 4.670765e-04
beta = 9.461948e-01
Non-Linear: y = a size ^b + c
a = 3.353878e-04
b = 9.649029e-01
c = 2.874142e+01
# hash size flops mean (us or J) stddev (us or J) n
...
09be3ca9 1048576 0.000000e+00 2.762759e+02 7.657556e+01 10
...
a3d3725e 4096 0.000000e+00 2.352400e+00 1.392095e+00 10
# performance model for cuda0_impl0 (Comb1)
Regression : #sample = 139
Linear: y = alpha size ^ beta
alpha = 1.991936e+00
beta = 2.785852e-01
Non-Linear: y = a size ^b + c
a = 6.510932e-07
b = 1.104311e+00
c = 4.811140e+01
# hash size flops mean (us or J) stddev (us or J) n
...
09be3ca9 1048576 0.000000e+00 7.092850e+01 3.152889e+01 10
...
a3d3725e 4096 0.000000e+00 6.897490e+01 7.239839e+01 10
# performance model for cuda1_impl0 (Comb2)
Regression : #sample = 99
Linear: y = alpha size ^ beta
alpha = 4.848004e-01
beta = 4.274543e-01
Non-Linear: y = a size ^b + c
a = 2.668639e-06
b = 1.066538e+00
c = 1.301450e+02
# hash size flops mean (us or J) stddev (us or J) n
...
09be3ca9 1048576 0.000000e+00 1.573870e+02 0.000000e+00 1
...
# performance model for cuda2_impl0 (Comb3)
Regression : #sample = 362
Linear: y = alpha size ^ beta
alpha = 4.137257e-02
beta = 5.631607e-01
Non-Linear: y = a size ^b + c
a = 1.459716e-06
b = 1.093299e+00
c = 1.194248e+02
# hash size flops mean (us or J) stddev (us or J) n
...
09be3ca9 1048576 0.000000e+00 1.874546e+02 3.837819e+02 19
...
This shows that for the vector_scal kernel with a 4KB size, the average execution time on CPUs was about 2.35µs, with a 1.39µs standard deviation, over 10 samples, while it took about 69µs on GPU CUDA0, with a 72µs standard deviation. With a 1MB size, execution time on CPUs is 276µs, while it is only 71µs on GPU CUDA0.
The performance model can also be drawn by using starpu_perfmodel_plot
,
which will emit a gnuplot file in the current directory:
$ starpu_perfmodel_plot -s nf_vector_scal_perf.conan
Non-Linear: y = a size ^b + c
a = 3.353878e-07
b = 9.649029e-01
c = 2.874142e-02
Non-Linear: y = a size ^b + c
a = 6.510932e-10
b = 1.104311e+00
c = 4.811140e-02
Non-Linear: y = a size ^b + c
a = 2.668639e-09
b = 1.066538e+00
c = 1.301450e-01
Non-Linear: y = a size ^b + c
a = 1.459716e-09
b = 1.093299e+00
c = 1.194248e-01
4096 8192 16384 32768 65536 131072 262144 524288 1048576 2097152 4194304 8388608 16777216 33554432 67108864 134217728 268435456 536870912 1073741824 2147483648
[starpu][he-XPS-13-9370][main] Gnuplot file <.//starpu_nf_vector_scal_perf.conan.gp> generated
$ gnuplot starpu_nf_vector_scal_perf.conan.gp
We have also measured the performance of the nf_mult
kernel example, which can be
drawn with
starpu_perfmodel_plot -s nf_mult_perf.conan
gnuplot starpu_nf_mult_perf.conan.gp
If we define the number of flops per task, and set it in the task's flops field:
real(KIND=C_DOUBLE), target :: flops
flops = 2 * (X / X_parts) * (Y / Y_parts) * Z
call fstarpu_task_insert((/ cl_mult, &
FSTARPU_R, sub_handleA, &
FSTARPU_R, sub_handleB, &
FSTARPU_W, sub_handleC, &
FSTARPU_FLOPS, c_loc(flops), &
C_NULL_PTR /))
This allows plotting GFlop/s instead of just execution time:
starpu_perfmodel_plot -f -s nf_mult_perf.conan
gnuplot starpu_gflops_nf_mult_perf.conan.gp
(The energy measurement part is not yet available in Fortran.)
Measurement
Now rebuild and run the application with:
make clean
make nf_mult
STARPU_CALIBRATE=1 ./nf_mult
The performance model can then be inspected with:
starpu_perfmodel_display -s nf_mult_perf
starpu_perfmodel_plot -s nf_mult_perf
gnuplot starpu_nf_mult_perf.gp
It is a good idea to check the variation before relying on the measurements. If the kernel's performance varies, you can force StarPU to keep calibrating the performance model by using:
export STARPU_CALIBRATE=1
If the code of a computation kernel is modified, its performance changes, and the performance model thus has to be recalibrated from scratch. To do so, use:
export STARPU_CALIBRATE=2
Task Scheduling Policy
By default, StarPU uses the lws (locality work stealing) simple greedy scheduler. This is because it provides correct load balance even if the application codelets do not have performance models: idle workers simply steal queued tasks from other workers. This however does not permit prefetching data, since the scheduling decision is taken late.
If the application codelets have performance models, the scheduler should be changed to benefit from them. StarPU will then take scheduling decisions in advance according to the performance models, and issue data prefetch requests, to overlap data transfers and computations.
For instance, compare the lws
(default) and dmdar
scheduling
policies:
STARPU_BUS_STATS=1 STARPU_WORKER_STATS=1 STARPU_SCHED=lws ./gemm/nf_sgemm -xy $((2048*4)) -nblocks 4
with:
STARPU_BUS_STATS=1 STARPU_WORKER_STATS=1 STARPU_SCHED=dmdar ./gemm/nf_sgemm -xy $((2048*4)) -nblocks 4
You can see that most (if not all) of the computation has been done on the GPUs, leading to better performance.
Try other schedulers: use STARPU_SCHED=help to get the list.
Also try with various sizes (keeping a 256 tile size, i.e. increase both occurrences of 4 above) and draw curves.
You can also try the double precision version, nf_dgemm, and notice that the GPUs achieve noticeably lower performance there.
Session Part 3: MPI Support
StarPU provides support for MPI communications. It does so in two ways. Either the application specifies MPI transfers by hand, or it lets StarPU infer them from data dependencies.
Manual MPI transfers
Basically, StarPU provides equivalents of the MPI_* functions, but which operate on DSM handles instead of void* buffers. The difference is that the source data may reside on a GPU where it has just been computed; StarPU automatically handles copying it back to main memory before submitting it to MPI.
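For instance, passing a registered token around a ring could use a detached, handle-based send/receive pair like this (a sketch: the Fortran binding names and argument orders are assumed to mirror the C starpu_mpi_isend_detached/starpu_mpi_irecv_detached):

! send the token handle to the next rank; StarPU fetches it back from a GPU if needed
err = fstarpu_mpi_isend_detached(token_dh, next_rank, tag, comm, C_NULL_FUNPTR, C_NULL_PTR)
! post the matching receive into the token handle on the neighbour
err = fstarpu_mpi_irecv_detached(token_dh, prev_rank, tag, comm, C_NULL_FUNPTR, C_NULL_PTR)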
In the fortran/mpi/ subdirectory,
nf_ring_async_implicit.f90
shows an example of mixing MPI communications and task submission. It
is a classical ring MPI ping-pong, but the token which is being passed
on from neighbour to neighbour is incremented by a StarPU task at each
step.
This is written very naturally by simply submitting all MPI communication requests and task submission asynchronously in a sequential-looking loop, and eventually waiting for all the tasks to complete.
cd mpi
make nf_ring_async_implicit
mpirun -np 2 $PWD/nf_ring_async_implicit
starpu_mpi_task_insert
A stencil example shows a basic MPI task model application. The data distribution over MPI
nodes is decided by the my_distrib
function, and can thus be changed
trivially.
It also shows how data can be migrated to a
new distribution.
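For instance, a my_distrib along these lines spreads the blocks cyclically over the ranks (a hypothetical shape for illustration; the actual function in nf_stencil5.f90 may differ):

integer function my_distrib(x, y, nb_nodes)
    implicit none
    integer, intent(in) :: x, y, nb_nodes
    ! owner of block (x, y): a simple 2D cyclic distribution
    my_distrib = modulo(x + y, nb_nodes)
end function my_distrib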
make nf_stencil5
mpirun -np 2 $PWD/nf_stencil5 -display
More Performance Optimizations
The StarPU performance feedback chapter provides more optimization tips for further reading after this tutorial.
FxT Tracing Support
In addition to online profiling, StarPU provides offline profiling tools, based on recording a trace of events during execution, and analyzing it afterwards.
The trace file is stored in /tmp
by default. To tell StarPU to store
output traces in the home directory, one can set:
export STARPU_FXT_PREFIX=$HOME/
The application should be run again, for instance:
make clean
make nf_mult
./nf_mult
This time a prof_file_XX_YY
trace file will be generated in your home directory. This can be converted to
several formats by using:
starpu_fxt_tool -i ~/prof_file_*
This will create:
- a paje.trace file, which can be opened using the ViTE tool. It shows a Gantt diagram of the tasks which were executed, and thus the activity and idleness of the processing units, as well as dependencies, data transfers, etc. You may have to zoom in to actually focus on the computation part, and not the lengthy CUDA initialization.
- a dag.dot file, which contains the graph of all the tasks submitted by the application. It can be opened using Graphviz.
- an activity.data file, which records the activity of all processing units over time.
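For instance, the Gantt diagram and the task graph can then be opened with:
vite paje.trace
dot -Tpdf dag.dot -o dag.pdf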
Contact
For any questions regarding StarPU, please contact the StarPU developers mailing list starpu-devel@inria.fr