EXA2PRO 2019, StarPU Tutorial - Linköping, May 2019

Presentation

The presented slides are available.

Setup

Installing StarPU on your system

Instructions to build, install, and check that StarPU is running fine are available in the StarPU handbook.

Please use version 1.3.1 of StarPU.

Testing the installation

After building and installing StarPU, you can make sure that StarPU finds your hardware with:

starpu_machine_display

Note that the first time starpu_machine_display is executed, it calibrates the performance model of the bus; the results are then stored in different files in the directory $HOME/.starpu/sampling/bus.

Building the simulation version of StarPU

To be able to exercise accelerator support on your laptop, which probably does not have accelerators, we will build a simulation version of StarPU. For this you will need SimGrid, which you can get with apt-get install libsimgrid-dev, or by installing it by hand: first get SimGrid from the SimGrid website, for instance the latest version of SimGrid, then build it:

sudo apt-get install cmake python3 libboost-dev
cd SimGrid-3.21/
cmake . -Denable_documentation=off
make -j 4
sudo make install
sudo ldconfig

You can then build a simulation version of StarPU:

cd starpu-1.3
./configure --enable-simgrid
make clean
make
sudo make install
sudo ldconfig
export STARPU_PATH=/usr/local

You can for instance run some examples provided by StarPU:

$ export STARPU_PERF_MODEL_DIR=$STARPU_PATH/share/starpu/perfmodels/sampling
$ export STARPU_HOSTNAME=attila
$ STARPU_WORKER_STATS=1 $STARPU_PATH/lib/starpu/examples/cholesky_implicit -size $((960*20)) -nblocks 20

and notice that a lot of the tasks went to the virtual GPU.

Tutorial Material

All files needed for the lab works are available in the tar archive material.tar.gz.

To get the simulation version working, you need to set the path to the performance models:

$ cd files/
$ export STARPU_PERF_MODEL_DIR=$PWD/perfmodels
$ export STARPU_HOSTNAME=conan
$ starpu_machine_display
$ make
$ ./vector_scal_task_insert

Session Part 1: Task-based Programming Model

Application Example: Vector Scaling

This example is at the root of the material.tar.gz file.

Making it and Running it

A typical Makefile for applications using StarPU is the following:

CFLAGS += $(shell pkg-config --cflags starpu-1.3)
LDLIBS += $(shell pkg-config --libs starpu-1.3)
%.o: %.cu
	nvcc $(CFLAGS) $< -c -o $@

vector_scal_task_insert: vector_scal_task_insert.o vector_scal_cpu.o # vector_scal_cuda.o vector_scal_opencl.o

If you have CUDA or OpenCL available on your system, you can uncomment the corresponding object files on the last line, and uncomment the corresponding link flags. The source files for the application are available in the material tarball.

Run make vector_scal_task_insert, and run the resulting vector_scal_task_insert executable using the given script vector_scal.sh. It should work: it simply scales a given vector by a given factor.

make vector_scal_task_insert

./vector_scal_task_insert

Note that if you are using the simulation version of StarPU, the computation will not be performed, and thus the final value will be equal to the initial value, but the timing provided by starpu_timing_now() will correspond to the correct execution time.

Computation Kernels

Examine the source code, starting from vector_scal_cpu.c: this is the actual computation code, which is wrapped in a vector_scal_cpu function that takes a series of DSM interfaces and a non-DSM parameter. The code simply gets the factor value from the non-DSM parameter, gets an actual pointer from the first DSM interface, and performs the vector scaling.
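The structure of such a CPU kernel can be sketched as follows (a sketch for illustration only, with hypothetical details; the actual code is in vector_scal_cpu.c):

```c
#include <starpu.h>

/* Sketch of a CPU kernel: buffers[] holds the DSM interfaces,
 * cl_arg carries the non-DSM parameters packed at submission time. */
void vector_scal_cpu(void *buffers[], void *cl_arg)
{
    /* Get the vector length and an actual pointer from the DSM interface. */
    struct starpu_vector_interface *vector = buffers[0];
    unsigned n = STARPU_VECTOR_GET_NX(vector);
    float *val = (float *)STARPU_VECTOR_GET_PTR(vector);

    /* Get the scaling factor from the non-DSM parameter. */
    float factor;
    starpu_codelet_unpack_args(cl_arg, &factor);

    for (unsigned i = 0; i < n; i++)
        val[i] *= factor;
}
```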

The GPU implementation, in vector_scal_cuda.cu, is basically the same, with the host part (vector_scal_cuda) which extracts the actual CUDA pointer from the DSM interface, and passes it to the device part (vector_mult_cuda) which performs the actual computation.

The OpenCL implementation, in vector_scal_opencl.c and vector_scal_opencl_kernel.cl, is more involved due to the low-level nature of the OpenCL standard, but the principle remains the same.

#+begin_comment Modify the source code of the different implementations (CPU, CUDA, and OpenCL) to see which one gets executed. You can force the execution of one of the implementations simply by disabling a type of device when running your application, e.g.:

# to force execution on a GPU device (by default, CUDA will be used)
STARPU_NCPUS=0 ./vector_scal_task_insert

# to force execution on an OpenCL device
STARPU_NCPUS=0 STARPU_NCUDA=0 ./vector_scal_task_insert

#+end_comment

You can set the environment variable STARPU_WORKER_STATS to 1 when running your application to see the number of tasks executed by each device. You can see the whole list of environment variables here.

STARPU_WORKER_STATS=1 ./vector_scal_task_insert

# to force execution on a GPU device (by default, CUDA will be used)
STARPU_WORKER_STATS=1 STARPU_NCPUS=0 ./vector_scal_task_insert

# to force execution on an OpenCL device
STARPU_WORKER_STATS=1 STARPU_NCPUS=0 STARPU_NCUDA=0 ./vector_scal_task_insert

Main Code

Now examine vector_scal_task_insert.c: the cl (codelet) structure simply gathers pointers to the functions mentioned above.

The main function
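A minimal sketch of what such a main function does (sizes and names are illustrative, not the actual code in vector_scal_task_insert.c):

```c
#include <starpu.h>

extern void vector_scal_cpu(void *buffers[], void *cl_arg);

static struct starpu_codelet cl =
{
    .cpu_funcs = { vector_scal_cpu },
    /* .cuda_funcs and .opencl_funcs would list the GPU implementations. */
    .nbuffers = 1,
    .modes = { STARPU_RW },
};

#define NX 2048

int main(void)
{
    float vector[NX];
    float factor = 3.14f;
    starpu_data_handle_t handle;

    if (starpu_init(NULL) != 0)
        return 1;

    for (unsigned i = 0; i < NX; i++)
        vector[i] = 1.0f;

    /* Register the vector with StarPU's DSM, submit one task, then
     * unregister, which waits for the task and fetches the data back. */
    starpu_vector_data_register(&handle, STARPU_MAIN_RAM,
                                (uintptr_t)vector, NX, sizeof(vector[0]));
    starpu_task_insert(&cl,
                       STARPU_VALUE, &factor, sizeof(factor),
                       STARPU_RW, handle,
                       0);
    starpu_data_unregister(handle);
    starpu_shutdown();
    return 0;
}
```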

Data Partitioning

In the previous section, we submitted only one task. Here we discuss how to /partition/ data so as to submit multiple tasks that can be executed in parallel by the various CPUs and GPUs.

Let’s examine mult.c.

Run the application with the script mult.sh, enabling some statistics:

#!/bin/bash
make mult
STARPU_WORKER_STATS=1 ./mult

The figures show how the computation was distributed over the various processing units.

Another example

gemm/xgemm.c is a very similar matrix-matrix product example, but one which makes use of BLAS kernels for much better performance. The mult_kernel_common function shows how we call DGEMM (on CPUs) or cublasDgemm (on GPUs) on the DSM interface.

Let’s execute it.

#!/bin/bash
make gemm/sgemm
STARPU_WORKER_STATS=1 ./gemm/sgemm

#+begin_comment We can notice that StarPU gave many more tasks to the GPU. You can also try to set num_gpu=2 to run on the machine which has two GPUs (there is only one of them, so you may have to wait a long time; submit this in the background in a separate terminal). The interesting thing here is that, with no application modification beyond using a task-based programming model, we get multi-GPU support for free! #+end_comment

#+begin_comment

More Advanced Examples

examples/lu/xlu_implicit.c is a more involved example: this is a simple LU decomposition algorithm. The dw_codelet_facto_v3 is actually the main algorithm loop, in a very readable, sequential-looking way. It simply submits all the tasks asynchronously, and waits for them all.

examples/cholesky/cholesky_implicit.c is a similar example, but which makes use of the starpu_insert_task helper. The _cholesky function looks very much like dw_codelet_facto_v3 of the previous paragraph, and all task submission details are handled by starpu_insert_task.

Thanks to already using a task-based programming model, MAGMA and PLASMA were easily ported to StarPU by simply using starpu_insert_task. #+end_comment

Exercise

Take the vector example again, and add partitioning support to it, using the matrix-matrix multiplication as an example. Here we will use the starpu_vector_filter_block() filter function. You can see the list of predefined filters provided by StarPU here. Try to run it with various numbers of tasks.
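As a starting point, the partitioning itself can be sketched as follows (a sketch assuming the handle, cl, and factor from the vector example; NB_TASKS is a hypothetical constant):

```c
/* Partition the vector into NB_TASKS contiguous chunks. */
struct starpu_data_filter f =
{
    .filter_func = starpu_vector_filter_block,
    .nchildren = NB_TASKS,
};
starpu_data_partition(handle, &f);

/* Submit one task per chunk; they can run in parallel. */
for (unsigned i = 0; i < NB_TASKS; i++)
    starpu_task_insert(&cl,
                       STARPU_VALUE, &factor, sizeof(factor),
                       STARPU_RW, starpu_data_get_sub_data(handle, 1, i),
                       0);

/* Gather the pieces back before unregistering the vector. */
starpu_data_unpartition(handle, STARPU_MAIN_RAM);
```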

Session Part 2: Optimizations

This is based on StarPU’s documentation optimization chapter.

Data Management

We have explained how StarPU can overlap computation and data transfers thanks to DMAs. This is however only possible when CUDA has control over the application buffers: the application should thus use starpu_malloc() when allocating its buffers, so as to permit asynchronous DMAs from and to them.

Take the vector example again, and fix the allocation, to make it use starpu_malloc().
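The change amounts to replacing the plain allocation with starpu_malloc() / starpu_free(), roughly as follows (a sketch; NX stands for the vector size used by the example):

```c
/* Allocate a pinned, DMA-able buffer instead of using malloc(). */
float *vector;
starpu_malloc((void **)&vector, NX * sizeof(vector[0]));

/* ... register the buffer, submit tasks, unregister ... */

starpu_free(vector);
```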

Task Submission

To let StarPU reorder tasks, submit data transfers in advance, etc., task submission should be asynchronous whenever possible. Ideally, the application should submit the whole graph of tasks, then wait for their termination.
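Schematically, the pattern is to submit everything asynchronously and only synchronize once at the end (a sketch, with details elided):

```c
/* Submit the whole task graph without waiting in between;
 * starpu_task_insert() returns as soon as the task is queued. */
for (unsigned i = 0; i < ntasks; i++)
    starpu_task_insert(&cl, /* ... data handles ... */ 0);

/* Only now wait for all submitted tasks to terminate. */
starpu_task_wait_for_all();
```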

Performance Model Calibration

Performance prediction is essential for proper scheduling decisions; the performance models thus have to be calibrated. This is done automatically by StarPU when a codelet is executed for the first time. Once this is done, the result is saved to a file in $STARPU_HOME for later re-use. The starpu_perfmodel_display tool can be used to check the resulting performance model.

$ starpu_perfmodel_display -l
file: <vector_scal.conan>
file: <mult_perf_model.conan>
file: <starpu_dgemm_gemm.conan>
file: <starpu_sgemm_gemm.conan>
$ starpu_perfmodel_display -s vector_scal
# performance model for cuda0_impl0 (Comb0)
	Regression : #sample = 132
	Linear: y = alpha size ^ beta
		alpha = 7.040874e-01
		beta = 3.326125e-01
	Non-Linear: y = a size ^b + c
		a = 6.207150e-05
		b = 9.503886e-01
		c = 1.887639e+01
# hash		size		flops		mean (us)	stddev (us)		n
a3d3725e	4096           	0.000000e+00   	1.902150e+01   	1.639952e+00   	10
870a30aa	8192           	0.000000e+00   	1.971540e+01   	1.115123e+00   	10
48e988e9	16384          	0.000000e+00   	1.934910e+01   	8.406537e-01   	10
...
09be3ca9	1048576        	0.000000e+00   	5.483990e+01   	7.629412e-01   	10
...
# performance model for cuda1_impl0 (Comb1)
...
09be3ca9	1048576        	0.000000e+00   	5.389290e+01   	8.083156e-01   	10
...
# performance model for cuda2_impl0 (Comb2)
...
09be3ca9	1048576        	0.000000e+00   	5.431150e+01   	4.599005e-01   	10
...
# performance model for cpu0_impl0 (Comb3)
...
a3d3725e	4096           	0.000000e+00   	5.149621e+00   	7.096558e-02   	66
...
09be3ca9	1048576        	0.000000e+00   	1.218595e+03   	4.823102e+00   	66
...

This shows that for the vector_scal kernel with a size of 4096, the average execution time on CPUs was about 5.1µs, with a 0.07µs standard deviation, over 66 samples, while it took about 19µs on CUDA GPUs, with a 1.6µs standard deviation. With a size of 1048576, execution time on CPUs is 1.2ms, while it is only 54µs on a CUDA GPU. It is a good idea to check the variation before doing actual performance measurements. If the kernel has varying performance, it may be a good idea to force StarPU to continue calibrating the performance model, by using export STARPU_CALIBRATE=1.

The performance model can also be drawn by using starpu_perfmodel_plot, which will emit a gnuplot file in the current directory:

$ starpu_perfmodel_plot -s vector_scal       
...
[starpu][main] Gnuplot file <starpu_vector_scal.gp> generated
$ ./starpu_vector_scal.gp
$ gv starpu_vector_scal.eps

If the code of a computation kernel is modified, its performance changes, and the performance model thus has to be recalibrated from scratch. To do so, use export STARPU_CALIBRATE=2.

Task Scheduling Policy

By default, StarPU uses the lws simple greedy scheduler. This is because it provides correct load balance even if the application codelets do not have performance models: workers draw tasks from their local queues, and steal from other workers when idle. This however does not permit prefetching data, since the scheduling decision is taken late.

If the application codelets have performance models, the scheduler should be changed to take advantage of them. StarPU will then really take scheduling decisions in advance according to the performance models, and issue data prefetch requests, to overlap data transfers and computations.

For instance, compare the lws (default) and dmda scheduling policies:

STARPU_BUS_STATS=1 STARPU_WORKER_STATS=1 gemm/sgemm -xy $((256*4)) -nblocks 4

with:

STARPU_BUS_STATS=1 STARPU_WORKER_STATS=1 STARPU_SCHED=dmda gemm/sgemm -xy $((256*4)) -nblocks 4

You can see that most (if not all) of the computation was done on the GPUs, leading to better performance.

Try other schedulers; use STARPU_SCHED=help to get the list.

Also try with various sizes (keeping a 256 tile size, i.e. increasing both occurrences of 4 above) and draw curves.

You can also try the double-precision version, dgemm, and notice that the GPUs achieve comparatively lower performance.

Sessions Part 3: MPI Support

StarPU provides support for MPI communications. It does so in two ways. Either the application specifies MPI transfers by hand, or it lets StarPU infer them from data dependencies.

Manual MPI transfers

Basically, StarPU provides equivalents of the MPI_* functions, which operate on DSM handles instead of void* buffers. The difference is that the source data may reside on a GPU where it was just computed; StarPU will automatically handle copying it back to main memory before submitting it to MPI.
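For instance, a detached (asynchronous) token exchange between two nodes can be sketched as follows (a sketch with illustrative names and tag values, not the actual example code):

```c
/* StarPU fetches the data from wherever it resides (possibly a GPU)
 * before handing it over to MPI; both calls return immediately. */
if (rank == 0)
    starpu_mpi_isend_detached(token_handle, 1 /* dest */, 42 /* tag */,
                              MPI_COMM_WORLD, NULL, NULL);
else
    starpu_mpi_irecv_detached(token_handle, 0 /* source */, 42 /* tag */,
                              MPI_COMM_WORLD, NULL, NULL);
```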

In the mpi/ subdirectory, ring_async_implicit.c shows an example of mixing MPI communications and task submission. It is a classical MPI ring ping-pong, but the token which is being passed from neighbour to neighbour is incremented by a StarPU task at each step.

This is written very naturally, by simply submitting all MPI communication requests and tasks asynchronously in a sequential-looking loop, and eventually waiting for all of them to complete.

#+begin_comment

#!/bin/bash
make ring_async_implicit
mpirun -np 2 $PWD/ring_async_implicit

#+end_comment

starpu_mpi_insert_task

A stencil application shows a basic MPI task model application. The data distribution over MPI nodes is decided by the my_distrib function, and can thus be changed trivially. It also shows how data can be migrated to a new distribution.
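Schematically, every node submits the same sequential-looking task graph with starpu_mpi_task_insert(); StarPU executes each task on the node owning the data it writes to, and generates the required MPI transfers automatically. A sketch, with hypothetical handle and codelet names standing in for those of the stencil example:

```c
/* One task per interior point of the grid; the access modes let
 * StarPU infer both task dependencies and MPI transfers. */
for (unsigned x = 1; x < X - 1; x++)
    for (unsigned y = 1; y < Y - 1; y++)
        starpu_mpi_task_insert(MPI_COMM_WORLD, &stencil_cl,
                               STARPU_RW, data_handles[x][y],
                               STARPU_R,  data_handles[x-1][y],
                               STARPU_R,  data_handles[x+1][y],
                               STARPU_R,  data_handles[x][y-1],
                               STARPU_R,  data_handles[x][y+1],
                               0);
starpu_task_wait_for_all();
```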

#+begin_comment

#!/bin/bash
make stencil5
mpirun -np 2 $PWD/stencil5 -display

#+end_comment

Session Part 4: OpenMP Support

The Klang-Omp OpenMP Compiler

The Klang-Omp OpenMP compiler converts C/C++ source code annotated with OpenMP 4 directives into StarPU-enabled code. Klang-Omp is a source-to-source compiler based on the LLVM/Clang compiler framework.

The following shell sequence shows an example of an OpenMP version of the Cholesky decomposition compiled into StarPU code.

cd
source /gpfslocal/pub/training/runtime_june2016/openmp/environment
cp -r /gpfslocal/pub/training/runtime_june2016/openmp/Cholesky .
cd Cholesky
make
./cholesky_omp4.starpu

Homepage of the Klang-Omp OpenMP compiler: Klang-Omp

More Performance Optimizations

The StarPU documentation performance feedback chapter provides more optimization tips for further reading after this tutorial.

FxT Tracing Support

In addition to online profiling, StarPU provides offline profiling tools, based on recording a trace of events during execution, and analyzing it afterwards.

To use the version of StarPU compiled with FxT support, you need to recompile StarPU after installing FxT.

The latest version of FxT can be built as usual:

$ tar xf fxt-0.3.11.tar.gz
$ cd fxt-0.3.11
$ ./configure
$ make
$ sudo make install
$ sudo ldconfig

StarPU can then be reconfigured with the additional option --with-fxt, and rebuilt.

#+begin_comment To use the version of StarPU compiled with FxT support, you need to reload the module StarPU after loading the module FxT.

module unload runtime/starpu/1.1.4
module load trace/fxt/0.2.13
module load runtime/starpu/1.1.4

The trace file is stored in /tmp by default. Since execution will happen on a cluster node, the file will not be reachable after execution; we need to tell StarPU to store output traces in the home directory, by setting:

$ export STARPU_FXT_PREFIX=$HOME/

Do not forget to add this line to your .bash_profile file. #+end_comment

The trace file is stored in /tmp by default. To tell StarPU to store output traces in the home directory, one can set:

$ export STARPU_FXT_PREFIX=$HOME/

The application should be run again, and this time a prof_file_XX_YY trace file will be generated in your home directory. This can be converted to several formats by using:

$ starpu_fxt_tool -i ~/prof_file_*

This will create a paje.trace file (among other output files), which can be visualized with trace visualizers such as ViTE.

Contact

For any questions regarding StarPU, please contact the StarPU developers mailing list: starpu-devel@inria.fr