StarPU Docker Tutorial

(last updated on 2024-08-20)

The tutorial slides are available as PDF.

Setup

Using the StarPU docker image

docker run --rm -it registry.gitlab.inria.fr/starpu/starpu-docker/starpu:1.4.7

If you have GPUs on your machine, use the following command instead:

docker run --rm -it --gpus all registry.gitlab.inria.fr/starpu/starpu-docker/starpu:1.4.7

StarPU is installed in /usr/local, and StarPU-simgrid is installed in /usr/local/starpu-simgrid.

Testing the installation

Make sure that StarPU finds your hardware with:

starpu_machine_display

Note that the first time starpu_machine_display is executed, it calibrates the performance model of the bus; the results are then stored in different files in the directory $HOME/.starpu/sampling/bus.

Using the simulation version of StarPU

STARPU_PATH=/usr/local/starpu-simgrid
export PATH=/usr/local/starpu-simgrid/bin:$PATH
export PKG_CONFIG_PATH=/usr/local/starpu-simgrid/lib/pkgconfig:$PKG_CONFIG_PATH

You can for instance run some examples provided by StarPU:

export STARPU_HOSTNAME=attila
STARPU_WORKER_STATS=1 $STARPU_PATH/lib/starpu/examples/cholesky_implicit -size $((960*20)) -nblocks 20

and notice that a lot of the tasks went to the virtual GPU.

Example

To get the simulation version working, you need to set the path to the performance models:

cd $HOME/tutorial/C
export STARPU_PERF_MODEL_DIR=$PWD/../perfmodels
export STARPU_HOSTNAME=conan
starpu_machine_display
make
./vector_scal_task_insert

Session Part 1: Task-based Programming Model

Application Example: Vector Scaling

A vector scaling example is available both in the $HOME/tutorial/C folder and in the $HOME/tutorial/fortran folder of the docker image.

Base version

The original non-StarPU versions, in C (vector_scal0.c) and in Fortran (vector_scal0.f90), show the basic example that we will be using to illustrate how to use StarPU. It simply allocates a vector and calls a scaling function over it.

C code:
void vector_scal_cpu(float *val, unsigned n, float factor)
{
	unsigned i;

	for (i = 0; i < n; i++)
		val[i] *= factor;
}






int main(int argc, char **argv)
{
	float *vector;
	unsigned i;

	vector = malloc(sizeof(vector[0]) * NX);
	for (i = 0; i < NX; i++)
		vector[i] = 1.0f;

	float factor = 3.14;
	vector_scal_cpu(vector, NX, factor);

	free(vector);
	return 0;
}
Fortran code:

subroutine vector_scal_cpu(val, n, factor)
  implicit none

  real, dimension(:), intent(inout):: val
  integer, intent(in) :: n
  real, intent(in) :: factor
  integer :: i

  do i=1, n
          val(i) = val(i)*factor
  end do
end subroutine vector_scal_cpu

program vector_scal0
  use vector_scal_cpu
  implicit none

  real, dimension(:), allocatable :: vector
  integer :: i, NX = 2048
  real :: factor = 3.14

  allocate(vector(NX))
  vector = 1

  write(*,*) "BEFORE : First element was", vector(1)

  call vector_scal_cpu(vector, NX, factor)

  write(*,*) "AFTER First element is", vector(1)
end program vector_scal0

StarPU version

The StarPU version of the scaling example is available in the directory $HOME/tutorial/C/.

The StarPU Fortran version of the scaling example is available in the directory $HOME/tutorial/fortran/.

Computation Kernels

Examine the source code, starting from vector_scal_cpu.c and vector_scal_cl.f90: this is the same vector_scal_cpu computation code …

C code:
// which is now wrapped into a function which takes a series of DSM handles and a non-DSM parameter.
void vector_scal_cpu(void *buffers[], void *cl_arg)
{
  // The code first gets the vector data, and extracts the pointer and size of the
  // vector data:
  struct starpu_vector_interface *vector = buffers[0];
  float *val = (float *)STARPU_VECTOR_GET_PTR(vector);
  unsigned n = STARPU_VECTOR_GET_NX(vector);

  // It then gets the factor value from the non-DSM parameter:
  float factor;
  starpu_codelet_unpack_args(cl_arg, &factor);

  // and it eventually performs the vector scaling:
  int i;
  for (i = 0; i < n; i++) val[i] *= factor;
}
Fortran code:

! which is now wrapped into a function which takes a series of DSM handles and a non-DSM parameter.
recursive subroutine vector_scal_cpu(buffers, cl_args) bind(C)
type(c_ptr), value, intent(in) :: buffers, cl_args
  ! The code first gets the vector data, and extracts the size and
  ! the pointer of the vector data:
  real, dimension(:), pointer :: val
  integer :: n_val
  n_val = fstarpu_vector_get_nx(buffers, 0)
  call c_f_pointer(fstarpu_vector_get_ptr(buffers, 0), val, shape=[n_val])

  ! It then gets the factor value from the non-DSM parameter:
  real, target :: factor
  call fstarpu_unpack_arg(cl_args, (/ c_loc(factor) /))

  ! and it eventually performs the vector scaling:
  integer :: i
  do i=1,n_val
    val(i) = val(i)*factor
  end do
end subroutine vector_scal_cpu

The GPU implementation, in vector_scal_cuda.cu (and the same in Fortran), is basically the same: the host part (vector_scal_cuda) extracts the actual CUDA pointer from the DSM interface and passes it to the device part (vector_mult_cuda), which performs the actual computation.

The OpenCL implementation in vector_scal_opencl.c and vector_scal_opencl_kernel.cl is more hairy due to the low-level aspect of the OpenCL standard, but the principle remains the same. (Please note it is not available in the Fortran code).

Main Code

Now let’s examine the main code in C and in Fortran

C code:
  • The cl (codelet) structure simply gathers pointers to the functions mentioned above, and notes that the functions take only one DSM handle, in Read-Write mode. It also notes that a performance model should be used (a possible definition of perfmodel is sketched below).
    static struct starpu_codelet cl = {
      .cpu_funcs = {vector_scal_cpu},
      .cuda_funcs = {vector_scal_cuda},
      .opencl_funcs = {vector_scal_opencl},
    
      .nbuffers = 1,
      .modes = {STARPU_RW},
    
      .model = &perfmodel,
    };
    
  • The main function starts with initializing StarPU with the default parameters:
      starpu_init(NULL);
    
  • It then allocates the vector and fills it like the original code:
      vector = malloc(sizeof(vector[0]) * NX);
      for (i = 0; i < NX; i++)
          vector[i] = 1.0f;
    
  • It then registers the data to StarPU, and gets back a DSM handle. From now on, the application is not supposed to access vector directly, since its content may be copied and modified by a task on a GPU, the main-memory copy then being outdated.
      starpu_data_handle_t vector_handle;
      starpu_vector_data_register(&vector_handle, 0, (uintptr_t)vector, NX, sizeof(vector[0]));
    
  • It then submits an (asynchronous) task to StarPU.
      starpu_task_insert(&cl,
          STARPU_VALUE, &factor, sizeof(factor),
          STARPU_RW, vector_handle,
          0);
    
  • It waits for the task completion:
      starpu_task_wait_for_all();
    
  • It unregisters the vector from StarPU, which brings back the modified version to main memory, so the result can be read.
      starpu_data_unregister(vector_handle);
    
  • Finally, it shuts down StarPU:
      starpu_shutdown();
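
The codelet above references a perfmodel structure which is not shown in this excerpt. A minimal history-based definition could look as follows (the symbol name vector_scal is an assumption matching the model files used later in this tutorial):

    static struct starpu_perfmodel perfmodel = {
        /* Build the model from the measured execution times of previous runs */
        .type = STARPU_HISTORY_BASED,
        /* Name under which the calibration data is stored on disk */
        .symbol = "vector_scal",
    };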
    
Fortran code:

  • The main function starts with initializing StarPU with the default parameters:
      err = fstarpu_init(C_NULL_PTR)
    
  • It then allocates the vector and fills it like the original code:
      allocate(vector(NX))
      vector = 1
    
  • The cl (codelet) structure simply gathers pointers to the functions mentioned above, and notes that the functions take only one DSM parameter. One needs to allocate an empty codelet structure before adding the CPU function and setting the codelet fields:
      type(c_ptr) :: scal_cl
      scal_cl = fstarpu_codelet_allocate()
      call fstarpu_codelet_set_name(scal_cl, C_CHAR_"vector_scal_codelet"//C_NULL_CHAR)
      call fstarpu_codelet_add_cpu_func(scal_cl, C_FUNLOC(cl_cpu_func_vector_scal))
      call fstarpu_codelet_add_buffer(scal_cl, FSTARPU_RW)
    
  • It then registers the data to StarPU, and gets back a DSM handle. From now on, the application is not supposed to access vector directly, since its content may be copied and modified by a task on a GPU, the main-memory copy then being outdated.
      type(c_ptr) :: vector_handle
      call fstarpu_vector_data_register(vector_handle, 0, c_loc(vector), NX, c_sizeof(vector(1)))
    
  • It then submits a task to StarPU.
      call fstarpu_task_insert((/ scal_cl, &
          FSTARPU_RW, vector_handle, &
          FSTARPU_VALUE, c_loc(factor), FSTARPU_SZ_C_FLOAT, &
          C_NULL_PTR /))
    
  • It waits for task completion:
      call fstarpu_task_wait_for_all()
    
  • It unregisters the vector from StarPU, which brings back the modified version to main memory, so the result can be read.
      call fstarpu_data_unregister(vector_handle)
    
  • It frees the codelet structure:
    call fstarpu_codelet_free(scal_cl)
    
  • Finally, it shuts down StarPU and deallocates the vector:
      call fstarpu_shutdown()
    
      deallocate(vector)
    

Compilation and execution

Building

Let us look at how this should be built. Typical makefiles for applications using StarPU are available in C and in Fortran:

C code:
STARPU_VERSION=1.4
CPPFLAGS += $(shell pkg-config --cflags starpu-$(STARPU_VERSION))
LDLIBS += $(shell pkg-config --libs starpu-$(STARPU_VERSION))
%.o: %.cu
	nvcc $(CPPFLAGS) $< -c -o $@

vector_scal_task_insert: vector_scal_task_insert.o vector_scal_cpu.o vector_scal_cuda.o vector_scal_opencl.o

Additionally, to avoid having to set LD_LIBRARY_PATH one can add an rpath:

LDLIBS += -Wl,-rpath -Wl,$(shell pkg-config --variable=libdir starpu-$(STARPU_VERSION))
Fortran code:

PROG = vector_scal_task_insert
OBJS = vector_scal_cl.o

STARPU_VERSION=1.4
FC = gfortran
CC = gfortran

CFLAGS = $(shell pkg-config --cflags starpu-$(STARPU_VERSION))
LDLIBS =  $(shell pkg-config --libs starpu-$(STARPU_VERSION))
FCFLAGS = -J. -g

FSTARPU_MOD = $(shell pkg-config --cflags-only-I starpu-$(STARPU_VERSION)|sed -e 's/^\([^ ]*starpu\/$(STARPU_VERSION)\).*$$/\1/;s/^.* //;s/^-I//')/fstarpu_mod.f90

all: $(PROG)

$(PROG): %: %.o $(OBJS)
	$(FC) $(LDFLAGS) -o $@ $^ $(LDLIBS)

fstarpu_mod.o: $(FSTARPU_MOD)
	$(FC) $(FCFLAGS) -c -o $@ $<

%.o: %.f90
	$(FC) $(FCFLAGS) -c -o $@ $<

vector_scal_task_insert: vector_scal_cl.o fstarpu_mod.o vector_scal_task_insert.o
vector_scal_cl.o: fstarpu_mod.o

The Fortran module fstarpu_mod.f90 must be compiled with the same compiler as the application itself, and the resulting fstarpu_mod.o object file must be linked with the application executable.

The provided Makefiles additionally detect whether CUDA or OpenCL are available in StarPU, and add the corresponding files and link flags.

Simulation

If your system does not have a CUDA or OpenCL GPU, you can use the simulation version of StarPU by setting some environment variables; to do so, run in your shell:

 . ./../simu.sh

If you ever want to get back to the non-simulated version of StarPU, you can run in your shell:

 . ./../native.sh

Note that after switching between the simulated and the non-simulated versions of StarPU, you have to rebuild completely:

 make clean
 make

Running

make clean vector_scal_task_insert
./vector_scal_task_insert

Note that if you are using the simulation version of StarPU, the computation will not be performed, and thus the final value will be equal to the initial value, but the timing provided by starpu_timing_now() or fstarpu_timing_now() will correspond to the correct execution time.
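
For instance, a minimal sketch of how one might measure the (simulated or real) execution time around the task submission, reusing the names from the vector example above:

double start = starpu_timing_now();

starpu_task_insert(&cl,
	STARPU_VALUE, &factor, sizeof(factor),
	STARPU_RW, vector_handle,
	0);
starpu_task_wait_for_all();

/* starpu_timing_now() returns a timestamp in microseconds */
double end = starpu_timing_now();
printf("elapsed: %.2f us\n", end - start);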

You can set the environment variable STARPU_WORKER_STATS to 1 when running your application to see the number of tasks executed by each device. You can see the whole list of environment variables here.

STARPU_WORKER_STATS=1 ./vector_scal_task_insert

To make the program use OpenCL rather than CUDA on the GPU devices, one can set STARPU_NCUDA to 0:

# to force the implementation on an OpenCL device
STARPU_WORKER_STATS=1 STARPU_NCPU=0 STARPU_NCUDA=0 ./vector_scal_task_insert

Data Partitioning

In the previous section, we submitted only one task. We here discuss how to partition data so as to submit multiple tasks which can be executed in parallel by the various CPUs and GPUs.

Let’s examine mult.c or mult.f90.
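
Whatever the exact code of mult.c, the general pattern is to declare filters, apply them to the registered matrix handles, and submit one task per sub-matrix. Here is a sketch, assuming square partitioning, with hypothetical names A_handle, B_handle, C_handle, a block count nslices and a codelet mult_cl:

/* Cut B into nslices vertical slices, A into nslices horizontal slices,
 * and C into nslices x nslices blocks. */
struct starpu_data_filter vert = {
	.filter_func = starpu_matrix_filter_vertical_block,
	.nchildren = nslices,
};
struct starpu_data_filter horiz = {
	.filter_func = starpu_matrix_filter_block,
	.nchildren = nslices,
};

starpu_data_partition(B_handle, &vert);
starpu_data_partition(A_handle, &horiz);
starpu_data_map_filters(C_handle, 2, &vert, &horiz);

/* One task per block of C, reading the corresponding slices of A and B. */
for (unsigned x = 0; x < nslices; x++)
	for (unsigned y = 0; y < nslices; y++)
		starpu_task_insert(&mult_cl,
			STARPU_R, starpu_data_get_sub_data(A_handle, 1, y),
			STARPU_R, starpu_data_get_sub_data(B_handle, 1, x),
			STARPU_W, starpu_data_get_sub_data(C_handle, 2, x, y),
			0);

starpu_task_wait_for_all();

/* Gather the pieces back into the original handles. */
starpu_data_unpartition(A_handle, STARPU_MAIN_RAM);
starpu_data_unpartition(B_handle, STARPU_MAIN_RAM);
starpu_data_unpartition(C_handle, STARPU_MAIN_RAM);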

Run the application, enabling some statistics:

make -C partition clean all
STARPU_WORKER_STATS=1 ./partition/mult

The worker statistics show how the computation was distributed over the various processing units.

Another example

gemm/xgemm.c and gemm/xgemm.f90 are very similar matrix-matrix product examples, which make use of BLAS kernels for much better performance. The mult_kernel_common functions show how we call DGEMM (CPUs) or cublasDgemm (GPUs) on the DSM interface. Also note the presence of the starpu_cublas_init() call in the main function so as to more efficiently connect cublas with StarPU.

Let’s execute it.

make -C gemm clean all
STARPU_WORKER_STATS=1 ./gemm/sgemm
STARPU_WORKER_STATS=1 ./gemm/dgemm

Exercise

Take the vector example again, and add partitioning support to it, using the matrix-matrix multiplication as an example. Here we will use the starpu_vector_filter_block() filter function, or its Fortran equivalent fstarpu_df_alloc_vector_filter_block().

You can see the list of predefined filters provided by StarPU here.
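
As a hint, here is a minimal sketch of what the partitioning part could look like in C, reusing cl, factor and vector_handle from the example above, with an arbitrary number of parts NPARTS; the provided solutions may differ in the details:

struct starpu_data_filter f = {
	.filter_func = starpu_vector_filter_block,
	.nchildren = NPARTS,
};
starpu_data_partition(vector_handle, &f);

/* One task per sub-vector; they can run in parallel on the available workers. */
for (unsigned part = 0; part < NPARTS; part++)
	starpu_task_insert(&cl,
		STARPU_VALUE, &factor, sizeof(factor),
		STARPU_RW, starpu_data_get_sub_data(vector_handle, 1, part),
		0);

starpu_task_wait_for_all();
starpu_data_unpartition(vector_handle, STARPU_MAIN_RAM);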

We provide a solution for the exercise in C or in Fortran.

Session Part 2: Optimizations

This is based on StarPU’s documentation optimization chapter.

Data Management

We have explained how StarPU can overlap computation and data transfers thanks to DMAs. This is however only possible when CUDA has control over the application buffers. The application should thus use starpu_malloc() when allocating its buffer, to permit asynchronous DMAs from and to it. In Fortran, the application should use fstarpu_memory_pin() after allocating its buffer, for the same effect.

Take the vector example again, and fix the allocation to make it use starpu_malloc(). In Fortran, after allocating the buffer, you need to call fstarpu_memory_pin().
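
In C, the fix is essentially the following sketch (NX and vector as in the example above):

float *vector;

/* Allocate pinned memory so that StarPU can perform asynchronous DMAs
 * from and to the buffer. */
starpu_malloc((void **)&vector, NX * sizeof(vector[0]));

for (i = 0; i < NX; i++)
	vector[i] = 1.0f;

/* ... register the data, submit tasks, unregister ... */

/* Release the pinned buffer (starpu_free_noflag in recent releases,
 * starpu_free in older ones). */
starpu_free_noflag(vector, NX * sizeof(vector[0]));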

Task Submission

To let StarPU reorder tasks, submit data transfers in advance, etc., task submission should be asynchronous whenever possible. Ideally, the application should behave as follows: submit the whole graph of tasks, then wait for termination.
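
Schematically, with StarPU's default sequential data consistency inferring the dependencies, the submission loop can look like this (a sketch with hypothetical names niter, cl and handle):

/* Submit the whole task graph without ever blocking... */
for (unsigned iter = 0; iter < niter; iter++)
	starpu_task_insert(&cl, STARPU_RW, handle, 0);

/* ...and synchronize only once, at the very end. */
starpu_task_wait_for_all();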

The CUDA and OpenCL kernel execution themselves should be submitted asynchronously, so as to let kernel computation and data transfer proceed independently:

	.cuda_flags = {STARPU_CUDA_ASYNC},
	call fstarpu_codelet_add_cuda_func(scal_cl, C_FUNLOC(cl_cuda_func_vector_scal))
	.opencl_flags = {STARPU_OPENCL_ASYNC},

Performance Model Calibration

Inspection

Performance prediction is essential for proper scheduling decisions; the performance models thus have to be calibrated. This is done automatically by StarPU when a codelet is executed for the first time. Once this is done, the result is saved to a file in $STARPU_PERF_MODEL_DIR for later re-use. The starpu_perfmodel_display tool can be used to check the resulting performance model.

STARPU_PERF_MODEL_DIR specifies the main directory in which StarPU stores its performance model files. The default is $STARPU_HOME/.starpu/sampling.

STARPU_HOME specifies the main directory in which StarPU stores its configuration files. The default is $HOME on Unix environments, and $USERPROFILE on Windows environments.

In this tutorial we provide some pre-calibrated performance models for use with the Simgrid version of StarPU. Note that the following curves are based on the C version of the code, but the experiments can also be made with the Fortran version; only the numbers will differ.

To enable the pre-calibrated models, run

source $HOME/tutorial/perfmodels.sh

(it sets STARPU_PERF_MODEL_DIR to the perfmodels directory shipped in the docker image). Then you can use starpu_perfmodel_display to get the performance model details:

$ starpu_perfmodel_display -l         # Show the list of codelets that have a performance model
file: <mult_energy_model.conan>
file: <mult_perf_model.conan>
file: <starpu_dgemm_gemm.conan>
file: <starpu_sgemm_gemm.conan>
file: <vector_scal.conan>

$ starpu_perfmodel_display -s vector_scal.conan     # Show the details for one codelet
# performance model for cuda0_impl0 (Comb0)
	Regression : #sample = 132
	Linear: y = alpha size ^ beta
		alpha = 7.040874e-01
		beta = 3.326125e-01
	Non-Linear: y = a size ^b + c
		a = 6.207150e-05
		b = 9.503886e-01
		c = 1.887639e+01
# hash		size		flops		mean (us)	stddev (us)		n
a3d3725e	4096           	0.000000e+00   	1.902150e+01   	1.639952e+00   	10
870a30aa	8192           	0.000000e+00   	1.971540e+01   	1.115123e+00   	10
48e988e9	16384          	0.000000e+00   	1.934910e+01   	8.406537e-01   	10
...
09be3ca9	1048576        	0.000000e+00   	5.483990e+01   	7.629412e-01   	10
...
# performance model for cuda1_impl0 (Comb1)
...
09be3ca9	1048576        	0.000000e+00   	5.389290e+01   	8.083156e-01   	10
...
# performance model for cuda2_impl0 (Comb2)
...
09be3ca9	1048576        	0.000000e+00   	5.431150e+01   	4.599005e-01   	10
...
# performance model for cpu0_impl0 (Comb3)
...
a3d3725e	4096           	0.000000e+00   	5.149621e+00   	7.096558e-02   	66
...
09be3ca9	1048576        	0.000000e+00   	1.218595e+03   	4.823102e+00   	66
...

This shows that for the vector_scal kernel with a 4KB size, the average execution time on CPUs was about 5.1µs, with a 0.07µs standard deviation, over 66 samples, while it took about 19µs on GPU CUDA0, with a 1.6µs standard deviation. With a 1MB size, execution time on CPUs is 1.2ms, while it is only 54µs on GPU CUDA0.

The performance model can also be drawn by using starpu_perfmodel_plot, which will emit a gnuplot file in the current directory:

$ starpu_perfmodel_plot -s vector_scal.conan
...
[starpu][main] Gnuplot file <starpu_vector_scal.conan.gp> generated
$ gnuplot starpu_vector_scal.conan.gp

The measurements were made on CPUs, but also on GPUs that support both OpenCL and CUDA. The graph shows that GPUs become more efficient for vector sizes beyond 20000 bytes.

We have also measured the performance of the mult kernel example, which can be drawn with

starpu_perfmodel_plot -s mult_perf_model.conan
gnuplot starpu_mult_perf_model.conan.gp

We can see a slight bump after 2MB.

The task submission included the number of flops per task; this allows plotting GFlop/s instead of just time:

starpu_perfmodel_plot -f -s mult_perf_model.conan
gnuplot starpu_gflops_mult_perf_model.conan.gp

We indeed notice a performance drop after 2MB, which corresponds to the cache size.

Here is an example of how to set the number of flops per task in Fortran code.

real(KIND=C_DOUBLE), target :: flops
flops = 2 * (X / X_parts) * (Y / Y_parts) * Z

call fstarpu_task_insert((/ cl_mult, &
        FSTARPU_R, sub_handleA, &
        FSTARPU_R, sub_handleB, &
        FSTARPU_W, sub_handleC, &
        FSTARPU_FLOPS, c_loc(flops), &
        C_NULL_PTR /))
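
In C, the same information is passed through the STARPU_FLOPS argument of starpu_task_insert() (a sketch with the same hypothetical handle names):

double flops = 2.0 * (X / X_parts) * (Y / Y_parts) * Z;

starpu_task_insert(&cl_mult,
	STARPU_R, sub_handleA,
	STARPU_R, sub_handleB,
	STARPU_W, sub_handleC,
	/* STARPU_FLOPS takes a double, used to convert times into GFlop/s */
	STARPU_FLOPS, flops,
	0);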

We can also draw the energy used by tasks:

starpu_perfmodel_plot -e -s mult_energy_model.conan
gnuplot starpu_mult_energy_model.conan.gp

We can again notice the bump after 2MB.

Again, instead of the energy, one can observe the computation efficiency thanks to the flops information:

starpu_perfmodel_plot -f -e -s mult_energy_model.conan
gnuplot starpu_gflops_mult_energy_model.conan.gp

This is much more interesting! We do indeed notice the efficiency drop after 2MB, but we can also notice an efficiency maximum around 100KB.

Measurement

To calibrate the performance models, one needs to re-compile StarPU without simgrid support and re-compile the application against it, to enable the actual execution of the kernels.

It is a good idea to check the variation before doing actual performance measurements. If the kernel has varying performance, it may be a good idea to force StarPU to continue calibrating the performance model, by using export STARPU_CALIBRATE=1. If the code of a computation kernel is modified, its performance changes, and the performance model thus has to be recalibrated from scratch; to do so, use export STARPU_CALIBRATE=2.

Energy measurement

CPUs can report their energy usage through performance counters, and NVIDIA devices can report it through the CUDA interface. StarPU provides an interface to abstract the measurement for the application. The available measurement precision is however quite coarse. The principle is thus that the application should submit a series of tasks of the same kind, and put measurement calls before and after the series, so StarPU can compute an average over the whole set.

mult_bench.c achieves this: it prepares matrices so as to generate a fair number of tasks according to the number of CPUs, so that the measurement is long enough.

Unfortunately, the performance counters cannot be read from within docker due to permission restrictions. Running the benchmark on the raw system (possibly requiring root access) would allow performing the measurement.

Task Scheduling Policy

By default, StarPU uses the lws simple greedy scheduler. This is because it provides correct load balance even if the application codelets do not have performance models: it uses a single central queue, from which workers draw tasks to work on. This however does not permit prefetching data, since the scheduling decision is taken late.

If the application codelets have performance models, the scheduler should be changed to take advantage of them. StarPU will then really take scheduling decisions in advance according to the performance models, and issue data prefetch requests, to overlap data transfers and computations.

For instance, compare the lws (default) and dmdar scheduling policies:

STARPU_BUS_STATS=1 STARPU_WORKER_STATS=1 STARPU_SCHED=lws gemm/sgemm -xy $((256*4)) -nblocks 4

with:

STARPU_BUS_STATS=1 STARPU_WORKER_STATS=1 STARPU_SCHED=dmdar gemm/sgemm -xy $((256*4)) -nblocks 4

You can see that most (if not all) of the computation has been done on GPUs, leading to better performance.

Try other schedulers; use STARPU_SCHED=help to get the list.

Also try various sizes (keeping the 256 tile size, i.e. increasing both occurrences of 4 above) and draw curves.

You can also try the double-precision version, dgemm, and notice that GPUs achieve less impressive performance there.

Session Part 3: MPI Support

StarPU provides support for MPI communications. It does so in two ways. Either the application specifies MPI transfers by hand, or it lets StarPU infer them from data dependencies.

The simulation way is a bit tricky to set up (a virtual cluster, etc.), so it is preferable to re-compile StarPU without simgrid support.

Manual MPI transfers

Basically, StarPU provides equivalents of the MPI_* functions, which operate on DSM handles instead of void* buffers. The difference is that the source data may be residing on a GPU where it was just computed; StarPU automatically handles copying it back to main memory before submitting it to MPI.

Both the C example and the Fortran example show how to mix MPI communications and task submission. It is a classical MPI ring ping-pong, but the token which is passed from neighbour to neighbour is incremented by a StarPU task at each step.

This is written very naturally by simply submitting all MPI communication requests and task submission asynchronously in a sequential-looking loop, and eventually waiting for all the tasks to complete.
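
Schematically, one step of the loop could look as follows in C (a simplified sketch with hypothetical rank and tag variables, not the exact code of the example):

/* Receive the token from the previous node, increment it with a task,
 * and forward it to the next node; all three calls return immediately. */
starpu_mpi_irecv_detached(token_handle, prev_rank, tag, MPI_COMM_WORLD, NULL, NULL);
starpu_task_insert(&increment_cl, STARPU_RW, token_handle, 0);
starpu_mpi_isend_detached(token_handle, next_rank, tag, MPI_COMM_WORLD, NULL, NULL);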

cd $HOME/tutorial/C/mpi
make clean ring_async_implicit
mpirun -np 2 $PWD/ring_async_implicit
cd $HOME/tutorial/fortran/mpi
make ring_async_implicit
mpirun -np 2 $PWD/ring_async_implicit

starpu_mpi_task_insert

A stencil application (C version or Fortran version) shows a basic MPI task-model application. The data distribution over MPI nodes is decided by the my_distrib function, and can thus be changed trivially. It also shows how data can be migrated to a new distribution.
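
The core pattern, sketched in C (stencil_cl, the data_handles array and the tag numbering are hypothetical names here):

/* Attach an MPI tag and an owner rank to each handle; the owner is
 * computed by my_distrib() and can be changed freely. */
starpu_mpi_data_register(data_handles[x][y], tag++, my_distrib(x, y, nb_nodes));

/* Every node submits the same task graph; each task runs on the node
 * owning the data it writes, and the needed transfers are generated
 * automatically from the data dependencies. */
starpu_mpi_task_insert(MPI_COMM_WORLD, &stencil_cl,
	STARPU_RW, data_handles[x][y],
	STARPU_R,  data_handles[x-1][y],
	STARPU_R,  data_handles[x+1][y],
	0);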

cd $HOME/tutorial/C/mpi
make clean all
mpirun -np 2 $PWD/stencil5 -display
cd $HOME/tutorial/fortran/mpi
make clean all
mpirun -np 2 $PWD/stencil5 -display

More Performance Optimizations

The StarPU performance feedback chapter provides more optimization tips for further reading after this tutorial.

FxT Tracing Support

In addition to online profiling, StarPU provides offline profiling tools, based on recording a trace of events during execution, and analyzing it afterwards.

The StarPU docker image has been compiled with FxT support. Trace generation is disabled by default; enable it by setting the variable STARPU_FXT_TRACE:

export STARPU_FXT_TRACE=1

The trace file is stored in /tmp by default. To tell StarPU to store output traces in the home directory, one can set:

export STARPU_FXT_PREFIX=$HOME/

The application should be run again, for instance:

./mult

This time a prof_file_XX_YY trace file will be generated in your home directory. This can be converted to several formats by using:

starpu_fxt_tool -i ~/prof_file_*

This will create several output files in the current directory, notably a Paje trace (paje.trace) that can be opened with a trace visualizer such as ViTE, and the task graph in GraphViz format (dag.dot).

One can also automatically generate all the different formats by setting the variable STARPU_GENERATE_TRACE before running the application.

export STARPU_GENERATE_TRACE=1
./mult

Contact

For any questions regarding StarPU, please contact the StarPU developers mailing list: starpu-devel@inria.fr.