StarPU Task Graph Market


This page gathers a series of task graphs that can be given as input to starpu_replay to replay real-world applications.

To get starpu_replay, one needs a version of StarPU configured with --enable-simgrid. One can then run the different task graph cases. See more details below.

Dense Linear Algebra

Cholesky

Cholesky factorization from the StarPU source code, benchmarked on research platforms.

Chameleon

Dense linear algebra from the Chameleon project, benchmarked on research platforms.

How to run this

  • First install SimGrid. On Debian-based systems, you can simply install the libsimgrid-dev and libboost-dev packages.
  • Download the latest 1.3 branch nightly snapshot of StarPU.
  • Compile it with SimGrid support enabled (no need to build everything, src/ and tools/ are enough):
cd $STARPU
./configure --enable-simgrid
make -C src
make -C tools
  • Download one of the examples from the market above, for instance:
wget https://files.inria.fr/starpu/market/cholesky.tgz
tar xf cholesky.tgz
cd cholesky
  • Check its README file for execution examples, and try them, for instance:
STARPU_SCHED=dmdas STARPU_HOSTNAME=mirage STARPU_PERF_MODEL_DIR=$PWD/sampling $STARPU/tools/starpu_replay 40x40/tasks.rec
  • Which yields:
Read task 14000... done.
Submitted task order 14000... done.
Executed task 11000... done.
9900.77 ms	1976.13 GF/s
  • You can re-run it as many times as desired; since the execution is simulated, the resulting performance will always be the same.

    Other matrix sizes can be set with the different tasks.rec files:

$ STARPU_SCHED=dmdas STARPU_HOSTNAME=mirage STARPU_PERF_MODEL_DIR=$PWD/sampling $STARPU/tools/starpu_replay 10x10/tasks.rec 2> /dev/null
298.476 ms	1121.6 GF/s
$ STARPU_SCHED=dmdas STARPU_HOSTNAME=mirage STARPU_PERF_MODEL_DIR=$PWD/sampling $STARPU/tools/starpu_replay 20x20/tasks.rec 2> /dev/null
1443.13 ms	1751.45 GF/s
$ STARPU_SCHED=dmdas STARPU_HOSTNAME=mirage STARPU_PERF_MODEL_DIR=$PWD/sampling $STARPU/tools/starpu_replay 30x30/tasks.rec 2> /dev/null
4357.02 ms	1915.96 GF/s
$ STARPU_SCHED=dmdas STARPU_HOSTNAME=mirage STARPU_PERF_MODEL_DIR=$PWD/sampling $STARPU/tools/starpu_replay 40x40/tasks.rec 2> /dev/null
9900.77 ms	1976.13 GF/s

Other scheduling algorithms can be set with STARPU_SCHED:

$ STARPU_SCHED=dmdas STARPU_HOSTNAME=mirage STARPU_PERF_MODEL_DIR=$PWD/sampling $STARPU/tools/starpu_replay 40x40/tasks.rec 2> /dev/null
9900.77 ms	1976.13 GF/s
$ STARPU_SCHED=dmdar STARPU_HOSTNAME=mirage STARPU_PERF_MODEL_DIR=$PWD/sampling $STARPU/tools/starpu_replay 40x40/tasks.rec 2> /dev/null
10506.7 ms	1862.16 GF/s
$ STARPU_SCHED=dmda STARPU_HOSTNAME=mirage STARPU_PERF_MODEL_DIR=$PWD/sampling $STARPU/tools/starpu_replay 40x40/tasks.rec 2> /dev/null
10510.7 ms	1861.45 GF/s
$ STARPU_SCHED=lws   STARPU_HOSTNAME=mirage STARPU_PERF_MODEL_DIR=$PWD/sampling $STARPU/tools/starpu_replay 40x40/tasks.rec 2> /dev/null
12403.5 ms	1577.4 GF/s

The scheduling algorithms can be tuned with e.g. STARPU_SCHED_BETA:

$ STARPU_SCHED_BETA=1 STARPU_SCHED=dmdas STARPU_HOSTNAME=mirage STARPU_PERF_MODEL_DIR=$PWD/sampling $STARPU/tools/starpu_replay 40x40/tasks.rec 2> /dev/null
9900.77 ms	1976.13 GF/s
$ STARPU_SCHED_BETA=2 STARPU_SCHED=dmdas STARPU_HOSTNAME=mirage STARPU_PERF_MODEL_DIR=$PWD/sampling $STARPU/tools/starpu_replay 40x40/tasks.rec 2> /dev/null
9895.55 ms	1977.17 GF/s
$ STARPU_SCHED_BETA=10 STARPU_SCHED=dmdas STARPU_HOSTNAME=mirage STARPU_PERF_MODEL_DIR=$PWD/sampling $STARPU/tools/starpu_replay 40x40/tasks.rec 2> /dev/null
10137.7 ms	1929.94 GF/s
$ STARPU_SCHED_BETA=100 STARPU_SCHED=dmdas STARPU_HOSTNAME=mirage STARPU_PERF_MODEL_DIR=$PWD/sampling $STARPU/tools/starpu_replay 40x40/tasks.rec 2> /dev/null
13660.1 ms	1432.29 GF/s

The simulation itself is sequential, but you can run several of them in parallel:

( for size in 10 20 30 40 50 60 ; do
STARPU_SCHED=dmdas STARPU_HOSTNAME=mirage STARPU_PERF_MODEL_DIR=$PWD/sampling $STARPU/tools/starpu_replay ${size}x${size}/tasks.rec 2> /dev/null | sed -e "s/^/$size /" &
done ) | sort

How to generate static scheduling

The examples above were using the StarPU dynamic schedulers. One can inject a static schedule by adding a sched.rec file into the play.

The tasks.rec file follows the recutils format: paragraphs are separated by an empty line. Each paragraph represents a task to be executed, with a lot of information, some of which comes from the native execution that was performed when recording the trace (an illustrative paragraph is sketched after the list):

  • The Model field identifies the performance model to be used (see more below).
  • The JobId field uniquely identifies the task.
  • The SubmitOrder field also uniquely identifies the task, but according to the order of task submission, which is stable across executions.
  • The DependsOn field provides the list of the identifiers of the tasks that this task depends on.
  • The Priority field provides a priority as set by the application (higher is more urgent).
  • The WorkerId field provides the worker on which the task was executed when the trace was recorded.
  • The MemoryNode field provides the corresponding memory node on which the task was executed.
  • The SubmitTime field provides the time when the task was submitted by the application. The scheduler usually does not care about this.
  • The StartTime and EndTime fields provide the times when the task was started and finished.
  • The GFlop field provides the number of billions of floating-point operations performed by the task.
  • The Parameters field provides a description of the task parameters.
  • The Handles field provides the pointers of the task parameters. These can be used to relate data input and output of tasks.
  • The Modes field provides the access mode of the task parameters: (R)ead-only, (R)ead-and-(W)rite, or (W)rite-only.
  • The Sizes field provides the size of the task parameters, in bytes.
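
As an illustration, a task paragraph could look like the following sketch. All field values are made up for the example rather than taken from an actual trace, and potrf is just a hypothetical Cholesky kernel model name:

Model: potrf
JobId: 42
SubmitOrder: 12
DependsOn: 41
Priority: 1
WorkerId: 3
MemoryNode: 1
SubmitTime: 1234.56
StartTime: 2345.67
EndTime: 2370.89
GFlop: 0.38
Parameters: M(960x960)
Handles: 0x5555a0000000
Modes: RW
Sizes: 7372800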

The performance of tasks on the different execution units can be obtained by running starpu_perfmodel_recdump:

$ STARPU_HOSTNAME=mirage STARPU_PERF_MODEL_DIR=$PWD/sampling starpu_perfmodel_recdump

which first emits, in a %rec: timing section, a series of paragraphs, one per set of measurements made for the same kind of task on the same data size. Each paragraph contains the following fields (an illustrative paragraph is sketched after the list):

  • The Name field which is the name of the performance model, as referenced by the Model field in a task paragraph.
  • The Architecture field describes the architecture on which the set of measurements was made.
  • The Footprint field describes the data description footprint, as referenced by the Footprint field in a task paragraph. It is roughly a summary of the task parameters' sizes.
  • The Size field provides the total size of the task parameters, in bytes.
  • The Flops field provides the number of floating-point operations that were performed by the task.
  • The Mean field provides the average of the measurements in the set.
  • The Stddev field provides the standard deviation of the measurements in the set.
  • The Samples field provides the number of measurements that were made.
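
For instance, a measurement paragraph for the hypothetical potrf model above could look like the following sketch (illustrative values only, with the architecture name assumed to denote the first CUDA device):

%rec: timing

Name: potrf
Architecture: cuda0
Footprint: 0x12345678
Size: 7372800
Flops: 380000000
Mean: 25.4
Stddev: 0.7
Samples: 42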

Then the %rec: worker_count section describes the target platform, with one paragraph per kind of execution unit (a sketch follows the list):

  • The Architecture field provides the name of the type of execution unit, as referenced in the Architecture field of the paragraphs mentioned above.
  • The NbWorkers field provides the number of workers of this type.
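
For a hypothetical machine with 8 CPU cores and 2 GPUs, this section could look like the following sketch (the architecture names are assumptions for the example):

%rec: worker_count

Architecture: cpu
NbWorkers: 8

Architecture: cuda0
NbWorkers: 1

Architecture: cuda1
NbWorkers: 1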

Then the %rec: memory_workers section describes the memory layout of the target platform, with one paragraph per memory node (a sketch follows below):

  • The MemoryNode field provides the memory node number.
  • The Name field provides a user-friendly name for the memory node.
  • The Size field provides the amount of available space in the memory node (-1 if it is considered unbounded).
  • The Workers field provides the list of worker IDs using the memory node.

Worker IDs are numbered starting from 0, according to the order of the paragraphs in the %rec: worker_count section.
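
Continuing the hypothetical machine above, with the CPU workers numbered 0 to 7 and the two GPU workers numbered 8 and 9, the memory layout could thus be sketched as (the names and sizes are illustrative):

%rec: memory_workers

MemoryNode: 0
Name: RAM
Size: -1
Workers: 0 1 2 3 4 5 6 7

MemoryNode: 1
Name: CUDA0
Size: 1073741824
Workers: 8

MemoryNode: 2
Name: CUDA1
Size: 1073741824
Workers: 9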

A static schedule can then be expressed by producing a sched.rec file containing one paragraph per task. Each of them must contain a SubmitOrder field containing the submission identifier (as referenced in the SubmitOrder field of tasks.rec). The JobId is not used because StarPU may generate internal tasks, which change job ids. The SubmitOrder, on the contrary, only depends on the application submission loop and is thus completely stable, making it even possible to inject the static schedule into a native execution with the real application.

The paragraph can then optionally contain several kinds of scheduling directives, either to force task placement, or to guide the StarPU dynamic scheduler (a sketch follows the list):

  • Priority will override the application-provided priority, and possibly be taken into account by a StarPU dynamic scheduler.
  • SpecificWorker specifies the worker ID on which this task will be executed.
  • Workers provides a list of workers that the task will be allowed to execute on. This makes it possible to restrict the execution location without necessarily deciding it completely, e.g. to specify a given memory node or worker type.
  • DependsOn provides a list of submission IDs of tasks that this task should be made to depend on, in addition to the dependencies set in tasks.rec. This makes it possible to inject artificial dependencies into the task graph.
  • Workerorder forces a specific ordering of tasks on a given worker (whose ID must also be set with SpecificWorker). The tasks will be executed in the provided contiguous order, i.e. the worker will wait for the task with Workerorder 1 to be submitted, then execute it, then wait for the task with Workerorder 2 to be submitted, then execute it, etc. For a given worker, the Workerorder fields of the tasks to be executed on it thus have to be strictly contiguous, starting from 1.
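
For instance, a sketch of a sched.rec paragraph that merely guides the dynamic scheduler rather than fully deciding the placement could be (the IDs are illustrative, assuming the GPU workers of the hypothetical machine above are numbered 8 and 9):

SubmitOrder: 12
Priority: 5
Workers: 8 9
DependsOn: 3 7

This restricts the task with submission ID 12 to the two GPU workers and adds artificial dependencies on tasks 3 and 7, while leaving the final choice of worker and ordering to the dynamic scheduler.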

A completely static schedule can instead be set by specifying, for each task, both the SpecificWorker and the Workerorder fields, i.e. on which worker each task shall run and its ordering on that worker. For instance:

SubmitOrder: 0
SpecificWorker: 0
Workerorder: 2

SubmitOrder: 1
SpecificWorker: 1
Workerorder: 0

SubmitOrder: 2
SpecificWorker: 0
Workerorder: 1

will force tasks 0 and 2 to be executed on worker 0 while task 1 will be executed on worker 1, and task 2 will be executed before task 0.

When the SpecificWorker field is set for a task, or its Workers field corresponds to only one memory node, StarPU will automatically prefetch the data during execution. One can however also set prefetches by hand in sched.rec by using a paragraph containing the following fields (a sketch follows the list):

  • A Prefetch field which specifies the submission ID of the task for which data should be prefetched.
  • A MemoryNode field which specifies the memory node on which data should be prefetched.
  • A Parameters field which specifies the indexes of the task parameters that should be prefetched (currently only one at a time is supported).
  • An optional DependsOn field to make this prefetch wait for the tasks whose submission IDs are provided.
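
For instance, the following sketch (with illustrative IDs) would request prefetching the first parameter of the task with submission ID 12 to memory node 1, after tasks 3 and 7:

Prefetch: 12
MemoryNode: 1
Parameters: 0
DependsOn: 3 7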

This makes it possible, for instance, to avoid specifying precise task scheduling hints while still providing data prefetch hints, which will likely guide the scheduler toward a given data placement.
