Portability is obtained by the means of a unified abstraction of the machine. StarPU offers a unified offloadable task abstraction named codelet. Rather than rewriting the entire code, programmers can encapsulate existing functions within codelets. In case a codelet can run on heterogeneous architectures, it is possible to specify one function for each architecture (e.g. one function for CUDA and one function for CPUs). StarPU takes care of scheduling and executing those codelets as efficiently as possible over the entire machine, include multiple GPUs, even of different brands and interface at the same time (e.g. CUDA + OpenCL). One can even specify several functions for each architecture (new in v1.0) as well as parallel implementations (e.g. in OpenMP), and StarPU will automatically determine which version is best for each input size (new in v0.9). StarPU can execute them concurrently, e.g. one per socket, provided that the task implementations support it (which is the case for MKL, but unfortunately most often not for OpenMP).


The StarPU programming interface is very generic. For instance, various data structures are supported mainline (vectors, dense matrices, CSR/BCSR/COO sparse matrices, …), but application-specific data structures can also be supported, provided that the application describes how data is to be transferred (e.g. a series of contiguous blocks). That was for instance used for hierarchically-compressed matrices (h-matrices).

Data Transfers

To relieve programmers from the burden of explicit data transfers, a high-level data management library enforces memory coherency over the machine: before a codelet starts (e.g. on an accelerator), all its data are automatically made available on the compute resource. Data are also kept on e.g. GPUs as long as they are needed for further tasks. When a device runs out of memory, StarPU uses an LRU strategy to evict unused data. StarPU also takes care of automatically prefetching data, which thus permits to overlap data transfers with computations (including GPU-GPU direct transfers) to achieve the most of the architecture. Support is included for for matrices, vectors, n-dimension tensors, compressed matrices (CSR, BCSR, COO), and the application can define its own data interface for arbitrary data representation (that can even be e.g. a C++ or Python object).


StarPU behaves completely asynchronously, provided that the application-provided kernels are properly using e.g. the non-default kernel stream provided by StarPU. Data transfers and computation kernels are submitted to GPUs in a pipeline and completion is polled to keep the pipeline flowing and never let GPUs idle.


Dependencies between tasks can be given either of several ways, to provide the programmer with the best flexibility:

These dependencies are computed in a completely decentralized way, and can be introduced completely dynamically as tasks get submitted by the application while tasks previously submitted are being executed.

StarPU also supports an OpenMP-like reduction access mode (new in v0.9).

It also supports a commute access mode to allow data access commutativity (new in v1.2).

It also supports transparent dependencies tracking between hierarchical sub-pieces of data through asynchronous partitioning, allowing seamless concurrent read access to different partitioning levels (new in v1.3).

Heterogeneous Scheduling

StarPU obtains portable performances by efficiently (and easily) using all computing resources at the same time. StarPU also takes advantage of the heterogeneous nature of a machine, for instance by using scheduling strategies based on auto-tuned performance models. These determine the relative performance achieved by the different processing units for the various kinds of task, and thus permits to automatically let processing units execute the tasks they are the best for. Various strategies and variants are available. Some of them are centralized, but most of them are completely distributed to properly scale with large numbers of cores. /dmdas/ (a data-locality-aware MCT strategy, thus similar to heft but starts executing tasks before the whole task graph is submitted, thus allowing dynamic task submission and a decentralized scheduler, as well as an energy optimizing extension), /eager/ (dumb centralized queue), /lws/ (distributed locality-aware work-stealing), … The overhead per task is typically around the order of magnitude of a microsecond. Tasks should thus be a few orders of magnitude bigger, such as 100 microseconds or 1 millisecond, to make the overhead negligible. The application can also provide its own scheduler, to better fit its own needs.


To deal with clusters, StarPU can nicely integrate with MPI, through explicit or implicit support, according to the application’s preference.

Out of Core

When memory is not big enough for the working set, one may have to resort to using disks. StarPU makes this seamless thanks to its out of core support (new in v1.2). StarPU will automatically evict data from the main memory in advance, and prefetch back required data before it is needed for tasks.

Fortran Interface

StarPU comes with native Fortran bindings and examples.

OpenMP 4 -compatible Interface

K’Star provides an OpenMP 4 -compatible interface on top of StarPU. This allows to just rebuild OpenMP applications with the K’Star source-to-source compiler, then build it with the usual compiler, and the result will use the StarPU runtime.

K’Star also provides some extensions to the OpenMP 4 standard, to let the StarPU runtime perform online optimizations.

OpenCL-compatible Interface

StarPU provides an OpenCL-compatible interface, SOCL which allows to simply run OpenCL applications on top of StarPU (new in v1.0).

Simulation Support

StarPU can very accurately simulate an application execution and measure the resulting performance thanks to using the SimGrid simulator (new in v1.1). This allows to quickly experiment with various scheduling heuristics, various application algorithms, and even various platforms (available GPUs and CPUs, available bandwidth)!

All in All

All that means that the following sequential source code of a tiled version of the classical Cholesky factorization algorithm using BLAS is also (an almost) valid StarPU code, possibly running on all the CPUs and GPUs, and given a data distribution over MPI nodes, it is even a distributed version!

for (k = 0; k < tiles; k++) {
  for (m = k+1; m < tiles; m++)
    trsm(A[k,k], A[m,k])
  for (m = k+1; m < tiles; m++)
    syrk(A[m,k], A[m, m])
  for (m = k+1, m < tiles; m++)
    for (n = k+1, n < m; n++)
      gemm(A[m,k], A[n,k], A[m,n])

The complete example can be found inside the StarPU source, examples/cholesky/cholesky_implicit.c for single-node, and mpi/examples/matrix_decomposition/mpi_cholesky_codelets.c for distributed execution.

And Even More

StarPU provides a Python interface (new in v1.4), a Java interface (new in v1.4), an Eclipse plugin (new in v1.4)

Technical Details

Supported Architectures

Arbitrary mixtures of the following architectures are supported (except CUDA+HIP, just because ROCm refuses to support it):

Supported Operating Systems


StarPU is checked every night with

Performance Analysis Tools


In order to understand the performance obtained by StarPU, it is helpful to visualize the actual behavior of the applications running on complex heterogeneous multicore architectures. StarPU therefore makes it possible to generate Pajé traces that can be visualized thanks to the ViTE (Visual Trace Explorer) open source tool.

Example: LU decomposition on 3 CPU cores and a GPU using a very simple greedy scheduling strategy. The green (resp. red) sections indicate when the corresponding processing unit is busy (resp. idle). The number of ready tasks is displayed in the curve on top: it appears that with this scheduling policy, the algorithm suffers a certain lack of parallelism. Measured speed: 175.32 GFlop/s

This second trace depicts the behavior of the same application using a scheduling strategy (dmda) trying to minimize load imbalance thanks to auto-tuned performance models and to keep data locality as high as possible. In this example, the Pajé trace clearly shows that this scheduling strategy outperforms the previous one in terms of processor usage. Measured speed: 239.60 GFlop/s


Temanejo can be used to debug the task graph, as shown below (new in v1.1).


StarVZ (paper) can be used to investigate performance, as shown below