The Task-Aware MPI or TAMPI library extends the functionality of standard MPI libraries by providing new mechanisms for improving the interoperability between parallel task-based programming models, such as OpenMP and OmpSs-2, and MPI communications. This library allows the safe and efficient execution of MPI operations from concurrent tasks and guarantees the transparent management and progress of these communications.
The TAMPI library is not an MPI implementation. Instead, TAMPI is an independent library that works over any standard MPI library supporting the MPI_THREAD_MULTIPLE threading level. At build time, the TAMPI library is linked against the specified MPI library.
Following only the MPI Standard, programmers must pay close attention to avoid deadlocks that may occur in hybrid applications (e.g., MPI+OpenMP) where MPI calls take place inside tasks. These deadlocks arise from the out-of-order execution of tasks, which alters the execution order of the enclosed MPI calls. The TAMPI library ensures a deadlock-free execution of such hybrid applications by implementing a cooperation mechanism between the MPI library and the parallel task-based runtime system.
TAMPI provides two main mechanisms: the blocking mode and the non-blocking mode. The blocking mode targets the efficient and safe execution of blocking MPI operations (e.g., MPI_Recv) from inside tasks, while the non-blocking mode focuses on the efficient execution of non-blocking or immediate MPI operations (e.g., MPI_Irecv), also from inside tasks.
On the one hand, TAMPI is currently compatible with two task-based programming model implementations: a derivative version of LLVM/OpenMP and OmpSs-2. However, the derivative OpenMP does not support the full set of features provided by TAMPI: OpenMP programs can only make use of the non-blocking mode, whereas OmpSs-2 programs can leverage both blocking and non-blocking modes.
On the other hand, TAMPI is compatible with mainstream MPI implementations that support the MPI_THREAD_MULTIPLE threading level, which is the minimum requirement to provide its task-aware features. The following sections describe in detail the blocking (OmpSs-2) and non-blocking (OpenMP & OmpSs-2) modes of TAMPI.
The current library implements significant optimizations using a delegation technique. All the communications are delegated to a polling task, so the MPI interface is mostly accessed by this task. By delegating communications, we avoid threading contention at the MPI layer and obtain the MPI performance of single-threaded scenarios. Furthermore, another polling task handles the post-processing of tickets and calls the tasking runtime system. We call this optimized library TAMPI-OPT.
Some features have been dropped and others are not yet supported in TAMPI-OPT. The following changes apply:
- TAMPI-OPT does not support request-based MPI operations. The following functions are no longer implemented: MPI_Wait, MPI_Waitall, TAMPI_Iwait, TAMPI_Iwaitall. Please use blocking operations (e.g., MPI_Recv, MPI_Send) or non-blocking TAMPI operations (e.g., TAMPI_Isend, TAMPI_Irecv). The latter do not provide a request and are the recommended option for performance. Check these variants in src/include/TAMPI_Wrappers.h.
- All point-to-point and collective operations are supported, except MPI_Sendrecv and MPI_Sendrecv_replace.
- Fortran applications are not supported.
- The documentation in the following sections may be outdated.
The blocking mode of TAMPI targets the safe and efficient execution of blocking MPI operations (e.g., MPI_Recv) from inside tasks. This mode virtualizes the execution resources (e.g., hardware threads) of the underlying system when tasks call blocking MPI functions. When a task calls a blocking operation that cannot complete immediately, the task is paused, and the underlying execution resource is leveraged to execute other ready tasks until the MPI operation completes. This means that the execution resource is not blocked inside MPI. Once the operation completes, the paused task is resumed, so it will eventually continue its execution and return from the blocking MPI call.
All this is done transparently to the user, meaning that all blocking MPI functions maintain the blocking semantics described in the MPI Standard. Also, this TAMPI mode ensures that the user application can make progress even when multiple communication tasks are executing blocking MPI operations.
This virtualization prevents applications from blocking all execution resources inside MPI (waiting for the completion of some operations), which could result in a deadlock due to the lack of progress. Thus, programmers are allowed to instantiate multiple communication tasks (that call blocking MPI functions) without needing to serialize them with dependencies, which would be necessary if this TAMPI mode were not enabled. In this way, communication tasks can run in parallel and their execution can be reordered by the task scheduler.
This mode provides support for the following set of blocking MPI operations:
- Blocking primitives: MPI_Recv, MPI_Send, MPI_Bsend, MPI_Rsend and MPI_Ssend.
- Blocking collectives: MPI_Gather, MPI_Scatter, MPI_Barrier, MPI_Bcast, MPI_Scatterv, etc.
- Waiters of a complete set of requests: MPI_Wait and MPI_Waitall.
- MPI_Waitany and MPI_Waitsome are not supported yet; the standard behavior is applied.
As stated previously, this mode is only supported by OmpSs-2.
This library provides a header named TAMPI.h (or TAMPIf.h in Fortran). Apart from other declarations and definitions, this header defines a new MPI level of thread support named MPI_TASK_MULTIPLE, which is monotonically greater than the standard MPI_THREAD_MULTIPLE. To activate this mode from an application, users must:
- Include the TAMPI.h header in C or TAMPIf.h in Fortran.
- Initialize MPI with MPI_Init_thread requesting the new MPI_TASK_MULTIPLE threading level.
The blocking TAMPI mode is considered activated once MPI_Init_thread successfully returns and the provided threading level is MPI_TASK_MULTIPLE. A valid and safe usage of TAMPI's blocking mode is shown in the following OmpSs-2 + MPI example:
#include <mpi.h>
#include <TAMPI.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_TASK_MULTIPLE, &provided);
    if (provided != MPI_TASK_MULTIPLE) {
        fprintf(stderr, "Error: MPI_TASK_MULTIPLE not supported!");
        return 1;
    }

    int *data = (int *) malloc(N * sizeof(int));
    // ...

    if (rank == 0) {
        for (int n = 0; n < N; ++n) {
            #pragma oss task in(data[n]) // T1
            {
                MPI_Ssend(&data[n], 1, MPI_INT, 1, n, MPI_COMM_WORLD);
                // Data buffer could already be reused
            }
        }
    } else if (rank == 1) {
        for (int n = 0; n < N; ++n) {
            #pragma oss task out(data[n]) // T2
            {
                MPI_Status status;
                MPI_Recv(&data[n], 1, MPI_INT, 0, n, MPI_COMM_WORLD, &status);
                check_status(&status);
                fprintf(stdout, "data[%d] = %d\n", n, data[n]);
            }
        }
    }
    #pragma oss taskwait

    //...
}
In this example, the first MPI rank sends the data buffer of integers to the second rank in messages of a single integer each, while the second rank receives the integers and prints them. On the one hand, the sender rank creates a task (T1) for sending each single-integer MPI message with an input dependency on the corresponding position of the data buffer (MPI_Ssend reads from data). Note that these tasks can run in parallel. On the other hand, the receiver rank creates a task (T2) for receiving each integer. Similarly, each task declares an output dependency on the corresponding position of the data buffer (MPI_Recv writes on data). After receiving the integer, each task checks the status of the MPI operation and finally prints the received integer. These tasks can also run in parallel.
This program would be incorrect when using the standard MPI_THREAD_MULTIPLE threading level, since it could result in a deadlock depending on the task scheduling policy. Because of the out-of-order execution of tasks in each rank, all available execution resources could end up blocked inside MPI, hanging the application due to the lack of progress. However, with TAMPI's MPI_TASK_MULTIPLE threading level, execution resources are prevented from blocking inside blocking MPI calls and can execute other ready tasks while communications take place, guaranteeing application progress.
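For illustration, without TAMPI such a program would have to serialize its communication tasks so that at most one of them blocks inside MPI at a time. The following sketch (an illustration only, reusing the receiver loop of the example above plus an artificial sentinel variable) shows that workaround, which restores correctness at the cost of the parallelism that TAMPI preserves:

int serial;  // artificial sentinel used only to order the communication tasks
for (int n = 0; n < N; ++n) {
    #pragma oss task out(data[n]) inout(serial) // tasks now execute one at a time
    {
        MPI_Recv(&data[n], 1, MPI_INT, 0, n, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}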
See the articles listed in the References section for more information.
The non-blocking mode of TAMPI focuses on the execution of non-blocking or immediate MPI operations from inside tasks. Like the blocking mode, its objective is to allow the safe and efficient execution of multiple communication tasks in parallel, but without pausing those tasks.
The idea is to allow tasks to bind their completion to the finalization of one or more MPI requests. Thus, the completion of a task is delayed until (1) it finishes the execution of its body code and (2) all MPI requests that it bound during its execution complete. Notice that the completion of a task usually implies the release of its dependencies, the freeing of its data structures, etc.
For that reason, TAMPI defines two asynchronous and non-blocking functions named TAMPI_Iwait and TAMPI_Iwaitall, which have the same parameters as their standard synchronous counterparts MPI_Wait and MPI_Waitall, respectively. They bind the completion of the calling task to the finalization of the MPI requests passed as parameters, and they return "immediately" without blocking the caller. The completion of the calling task will take place once it finishes its execution and all bound MPI requests complete.
Since they are non-blocking and asynchronous, a task that calls TAMPI_Iwait or TAMPI_Iwaitall on some requests cannot assume that the corresponding operations have already finished. For this reason, the communication buffers related to those requests should not be consumed or reused inside that task. The proper way is to annotate the communication tasks (the ones calling TAMPI_Iwait or TAMPI_Iwaitall) with dependencies on the corresponding communication buffers, and then to annotate also the tasks that will reuse or consume those buffers. In this way, the latter become ready once the data buffers are safe to access (i.e., once the communications have completed). Defining the correct task dependencies is essential to guarantee a correct execution order.
Calling any of these two functions from outside a task results in undefined behavior. Note that the requests passed to these functions could be generated by calling non-blocking MPI operations (e.g., MPI_Irecv) from the same tasks that call TAMPI_Iwait or TAMPI_Iwaitall, from calls of other previously executed tasks, or even from the main function. Also, notice that all requests with value MPI_REQUEST_NULL will be ignored.
As stated in the introduction, this non-blocking mode is supported by both a derivative version of LLVM/OpenMP and OmpSs-2.
To activate this mode from applications, users must:
- Include the TAMPI.h header in C or TAMPIf.h in Fortran.
- Initialize MPI with MPI_Init_thread requesting at least the standard MPI_THREAD_MULTIPLE threading level.
The non-blocking mode of TAMPI is considered activated once MPI_Init_thread returns successfully and the provided threading level is at least MPI_THREAD_MULTIPLE. If MPI is not initialized in that way, any call to TAMPI_Iwait or TAMPI_Iwaitall will be ignored.

Notice that this mode is orthogonal to the blocking one presented in the previous section. An OmpSs-2 program can activate both modes by initializing MPI with MPI_TASK_MULTIPLE, so that both mechanisms operate at the same time, as sketched below. This does not apply to OpenMP programs since the aforementioned derivative implementation of OpenMP does not support the blocking mode.
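As a minimal sketch (not one of the original examples), an OmpSs-2 program could check which TAMPI modes ended up active after requesting MPI_TASK_MULTIPLE:

int provided;
MPI_Init_thread(&argc, &argv, MPI_TASK_MULTIPLE, &provided);
if (provided == MPI_TASK_MULTIPLE) {
    // Both the blocking and the non-blocking TAMPI modes are active
} else if (provided >= MPI_THREAD_MULTIPLE) {
    // Only the non-blocking mode (TAMPI_Iwait/TAMPI_Iwaitall) is active
}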
The following OpenMP + MPI example shows a valid and safe usage of this mode:
#include <mpi.h>
#include <TAMPI.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided != MPI_THREAD_MULTIPLE) {
        fprintf(stderr, "Error: MPI_THREAD_MULTIPLE not supported!");
        return 1;
    }

    int *data = (int *) malloc(N * sizeof(int));
    //...

    // Must be alive during the execution of the communication tasks
    MPI_Status statuses[N];

    #pragma omp parallel
    #pragma omp single
    {
        if (rank == 0) {
            for (int n = 0; n < N; ++n) {
                #pragma omp task depend(in: data[n]) // T1
                {
                    MPI_Request request;
                    MPI_Issend(&data[n], 1, MPI_INT, 1, n, MPI_COMM_WORLD, &request);
                    TAMPI_Iwait(&request, MPI_STATUS_IGNORE);
                    // Data buffer cannot be reused yet!
                    // Other unrelated computation...
                }
            }
        } else if (rank == 1) {
            for (int n = 0; n < N; ++n) {
                #pragma omp task depend(out: data[n], statuses[n]) // T2
                {
                    MPI_Request request;
                    MPI_Irecv(&data[n], 1, MPI_INT, 0, n, MPI_COMM_WORLD, &request);
                    TAMPI_Iwaitall(1, &request, &statuses[n]);
                    // Data buffer and status cannot be accessed yet!
                    // Other unrelated computation...
                }
                #pragma omp task depend(in: data[n], statuses[n]) // T3
                {
                    check_status(&statuses[n]);
                    fprintf(stdout, "data[%d] = %d\n", n, data[n]);
                }
            }
        }
        #pragma omp taskwait
    }
    //...
}
The above code does the same as the example presented in the previous section but exclusively using the non-blocking mode. In this case, the sender rank creates tasks (T1) that use the immediate version of the MPI_Ssend operation, they bind their release of dependencies to the completion of the resulting request with TAMPI_Iwait, and they finish their execution "immediately".
The receiver rank instantiates tasks (T2) that use the immediate version of MPI_Recv and bind the resulting request to themselves. In this case, they use the TAMPI_Iwaitall function to register the request and to indicate where the status should be saved. Note that the status location must remain alive after these tasks finish their execution, so it cannot be stored in the task's stack. Each task must declare an output dependency on the status position, in addition to the dependency on the data buffer. Since TAMPI_Iwait/TAMPI_Iwaitall are non-blocking and asynchronous, the buffer and status cannot be consumed inside those tasks.
In order to consume the received data or the resulting status, the receiver can instantiate a task (T3) with the corresponding dependencies. Once the request registered by a task T2 completes, the corresponding task T3 becomes ready, and it is able to print the received integer and check the status of the operation. Alternatively, the T3 tasks could be removed and the data or statuses could be consumed by the main function after the taskwait.
See the articles listed in the References section for more information.
IMPORTANT: These wrappers are no longer supported in TAMPI-OPT.
Typically, application developers try to adapt their code to different parallel programming models and platforms, so that an application can be compiled enabling a specific programming model or even a combination of various programming models, e.g., hybrid programming. However, developing and maintaining such type of applications can be difficult and tedious. To facilitate the task of developing this kind of multi-platform applications, TAMPI provides a set of wrapper functions and macros that behave differently depending on whether the non-blocking mode of TAMPI is enabled or not.
The idea behind those wrapper functions is to allow users to use the same code for pure MPI and hybrid configurations. In this way, the user does not need to duplicate the code or to use conditional compilation directives in order to keep both variants correct. For instance, when the application is compiled only with MPI (i.e., the non-blocking mode of TAMPI cannot be enabled), the wrappers behave differently than when compiling with MPI and OpenMP and enabling the non-blocking mode of TAMPI. In the first case, the OpenMP pragmas are ignored and the standard MPI behavior is enforced, whereas in the second case, the OpenMP pragmas are honored and the non-blocking mechanisms of TAMPI are used.
The following table shows in the first column the name of each wrapper function. We provide a wrapper function for each standard non-blocking communication operation (e.g., MPI_Issend), covering both point-to-point primitives and collectives. The wrapper name of a standard non-blocking function is composed of the original MPI name with the prefix TA added. For instance, the corresponding wrapper function for MPI_Issend is TAMPI_Issend.

The second column shows the actual behavior of a wrapper function when the non-blocking TAMPI mode is disabled (e.g., in pure MPI mode). The third column shows the behavior when the non-blocking mode of TAMPI is enabled, which requires OpenMP or OmpSs-2 to be enabled as well.
Regardless of whether TAMPI is enabled or not, the wrapper functions call the corresponding standard non-blocking MPI function. Additionally, if the non-blocking mode is active, that call is followed by a call to TAMPI_Iwait passing the request generated by the previous non-blocking function. In this way, the calling task binds its completion to the finalization of that MPI request, as previously explained in this section.
| Function | TAMPI Disabled | TAMPI Enabled |
|---|---|---|
| TAMPI_Isend | MPI_Isend | MPI_Isend + TAMPI_Iwait |
| TAMPI_Irecv | MPI_Irecv | MPI_Irecv + TAMPI_Iwait |
| TAMPI_Ibcast | MPI_Ibcast | MPI_Ibcast + TAMPI_Iwait |
| TAMPI_Ibarrier | MPI_Ibarrier | MPI_Ibarrier + TAMPI_Iwait |
| TAMPI_I... | MPI_I... | MPI_I... + TAMPI_Iwait |
The parameters of the wrapper functions are the same as their corresponding standard non-blocking MPI functions. The exception is TAMPI_Irecv because it requires an additional status parameter, which is not required in the standard MPI_Irecv. The following snippet shows the C prototype of both TAMPI_Irecv and TAMPI_Isend.
int TAMPI_Irecv(void *buf, int count, MPI_Datatype datatype, int src,
int tag, MPI_Comm comm, MPI_Request *request,
MPI_Status *status);
int TAMPI_Isend(const void *buf, int count, MPI_Datatype datatype, int dst,
int tag, MPI_Comm comm, MPI_Request *request);
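For instance (an illustrative sketch with hypothetical shared arrays requests and statuses, following the same pattern as the Fortran example below), a receive task written with the wrapper could look like this:

#pragma omp task depend(out: data[n], statuses[n])
{
    // Note the extra status argument compared to the standard MPI_Irecv
    TAMPI_Irecv(&data[n], 1, MPI_INT, src, n, MPI_COMM_WORLD,
                &requests[n], &statuses[n]);
}

When TAMPI is disabled, the request stored in requests[n] must later be completed (e.g., with TAMPI_Waitall, described below); when TAMPI is enabled, the task itself binds its completion to the request and no explicit wait is needed.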
With those wrappers, we cover the conventional phase of MPI applications where they start the communications with multiple calls to non-blocking MPI functions (e.g., MPI_Isend and MPI_Irecv). Usually, after this phase, MPI applications wait for the completion of all those issued operations by calling MPI_Wait or MPI_Waitall from the main thread. However, this technique is not optimal in hybrid applications that want to parallelize both computation and communication phases, since it considerably reduces the parallelism in each communication phase.
For this reason, TAMPI provides two special wrapper functions that are essential for facilitating the development of multi-platform applications. These two wrappers are TAMPI_Wait and TAMPI_Waitall, and their behavior is shown in the following table.
| Function | TAMPI Disabled | TAMPI Enabled |
|---|---|---|
| TAMPI_Wait | MPI_Wait | - |
| TAMPI_Waitall | MPI_Waitall | - |
Like the previously presented wrappers, these behave differently depending on whether the non-blocking mode of TAMPI is enabled or not. If TAMPI is not active, the wrappers simply call MPI_Wait or MPI_Waitall, respectively. If TAMPI is active, they do nothing. With all these wrappers, multi-platform applications can keep a single hybrid MPI + OpenMP + TAMPI code that also works correctly when compiled only with MPI.
The following MPI + OpenMP example in Fortran tries to demonstrate the idea behind all those wrapper functions:
#include "TAMPIf.h"
! ...
nreqs=0
tag=10
do proc=1,nprocs
  if (sendlen(proc) > 0) then
    nreqs=nreqs+1
    len=sendlen(proc)

    !$OMP TASK DEFAULT(shared) FIRSTPRIVATE(proc,tag,len,nreqs) &
    !$OMP& PRIVATE(err) DEPEND(IN: senddata(:,proc))
    call TAMPI_Isend(senddata(:,proc),len,MPI_REAL8,proc-1,tag,MPI_COMM_WORLD, &
                     requests(nreqs),err)
    !$OMP END TASK
  end if

  if (recvlen(proc) > 0) then
    nreqs=nreqs+1
    len=recvlen(proc)

    !$OMP TASK DEFAULT(shared) FIRSTPRIVATE(proc,tag,len,nreqs) &
    !$OMP& PRIVATE(err) DEPEND(OUT: recvdata(:,proc))
    call TAMPI_Irecv(recvdata(:,proc),len,MPI_REAL8,proc-1,tag,MPI_COMM_WORLD, &
                     requests(nreqs),statuses(nreqs),err)
    !$OMP END TASK
  end if
end do

call TAMPI_Waitall(nreqs,requests(:),statuses(:),err)

! Computation tasks consuming or reusing communications buffers
! ...
The previous example sends/receives a message, if needed, to/from the rest of MPI processes. Notice that this code would be correct for both a pure MPI execution (i.e., ignoring the OpenMP directives) and an execution with MPI+OpenMP+TAMPI. In the first case, the TAMPI_Isend, TAMPI_Irecv, and TAMPI_Waitall calls will become calls to MPI_Isend, MPI_Irecv and MPI_Waitall, respectively.
However, if the non-blocking TAMPI mode is active, which requires OpenMP or OmpSs-2 to be enabled as well, TAMPI_Isend and TAMPI_Irecv are executed from inside tasks. The TAMPI_Isend call becomes an MPI_Isend call followed by a TAMPI_Iwait call passing the generated request as a parameter. In this way, the calling task binds its completion to the finalization of that MPI request. The same happens with TAMPI_Irecv. Finally, TAMPI_Waitall does nothing, since each task has already bound its own MPI request.
It is important to mention that both TAMPI_Irecv and TAMPI_Waitall take a status location as a parameter. These statuses should be consistent in both calls: the status location passed to a TAMPI_Irecv should be the same as the status location that corresponds to that request in the TAMPI_Waitall call. It is also correct to specify that the status be ignored in both calls.
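As a small C sketch of this consistency rule (the array names are illustrative), the same status element accompanies the request in both calls:

// Inside a communication task:
TAMPI_Irecv(&data[i], 1, MPI_INT, src, i, MPI_COMM_WORLD, &requests[i], &statuses[i]);
// ... later, outside the communication tasks:
TAMPI_Waitall(nreqs, requests, statuses);  // statuses[i] corresponds to requests[i]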
More information about this work can be found in the articles listed below. Citations to the TAMPI library should reference these articles:
- Sala, K., Teruel, X., Perez, J. M., Peña, A. J., Beltran, V., & Labarta, J. (2019). Integrating blocking and non-blocking MPI primitives with task-based programming models. Parallel Computing, 85, 153-166. [Article link]
- Sala, K., Bellón, J., Farré, P., Teruel, X., Perez, J. M., Peña, A. J., Holmes, D., Beltran, V., & Labarta, J. (2018, September). Improving the interoperability between MPI and task-based programming models. In Proceedings of the 25th European MPI Users' Group Meeting (p. 6). ACM. [Article link]
Furthermore, several works have demonstrated the performance and programmability benefits of leveraging the TAMPI library in hybrid applications:
- Ciesko, J., Martínez-Ferrer, P. J., Veigas, R. P., Teruel, X., & Beltran, V. (2020). HDOT—An approach towards productive programming of hybrid applications. Journal of Parallel and Distributed Computing, 137, 104-118. [Article link]
- Sala, K., Rico, A., & Beltran, V. (2020, September). Towards Data-Flow Parallelization for Adaptive Mesh Refinement Applications. In 2020 IEEE International Conference on Cluster Computing (CLUSTER) (pp. 314-325). IEEE. [Article link]
- Maroñas, M., Teruel, X., Bull, J. M., Ayguadé, E., & Beltran, V. (2020, September). Evaluating Worksharing Tasks on Distributed Environments. In 2020 IEEE International Conference on Cluster Computing (CLUSTER) (pp. 69-80). IEEE. [Article link]
This work was financially supported by the PRACE and INTERTWinE projects funded in part by the European Union’s Horizon 2020 Research programme (2014-2020) under grant agreements 823767 and 671602, respectively. This research has also received funding through The European PILOT project from the European Union's Horizon 2020/EuroHPC research and innovation programme under grant agreement 101034126. Project PCI2021-122090-2A was funded by MCIN/AEI /10.13039/501100011033 and European Union NextGenerationEU/PRTR. The Departament de Recerca i Universitats de la Generalitat de Catalunya also funded the Programming Models research group at BSC-UPC under grant agreement 2021 SGR01007. This research received funding also from Huawei SoW1.
This section describes the software requirements of TAMPI, the building and installation process and how the library can be used from user applications.
The Task-Aware MPI library requires the installation of the following tools and libraries:
- Automake, autoconf, libtool, make and a C++ compiler supporting C++17.
- An MPI library supporting the MPI_THREAD_MULTIPLE threading level.
- Boost library version 1.59 or greater.
- ovni instrumentation library version 1.5.0 or greater (optional).
- One of the following parallel task-based programming models (required when compiling a user application): OmpSs-2 or the derivative version of LLVM/OpenMP mentioned in the introduction.
TAMPI uses the standard GNU automake and libtool toolchain. When cloning from a repository, the building environment must be prepared through the following command:
$ autoreconf -fiv
When the code is distributed as a tarball, this command is usually not needed.
Then execute the following commands:
$ ./configure --prefix=$INSTALLATION_PREFIX --with-boost=$BOOST_HOME ..other options..
$ make
$ make install
where $INSTALLATION_PREFIX is the directory into which to install TAMPI, and $BOOST_HOME is the prefix of the Boost installation. An MPI installation with multi-threading support must be available when configuring the library. The MPI frontend compiler for C++ (usually mpicxx) must be provided when configuring, either by adding the binary's path to the PATH environment variable (i.e., executing export PATH=/path/to/mpi/bin:$PATH) or by setting the MPICXX environment variable.
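For example, assuming the MPI C++ compiler is installed in a non-standard location (the path below is illustrative), the configuration could be invoked as:

$ export MPICXX=/path/to/mpi/bin/mpicxx
$ ./configure --prefix=$INSTALLATION_PREFIX --with-boost=$BOOST_HOME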
Other optional configuration flags are:
- --with-ovni: Builds the instrumentation with ovni and allows enabling ovni tracing at run-time through the TAMPI_INSTRUMENT environment variable.
- --enable-debug: Adds compiler debug flags and enables additional internal debugging mechanisms. Note that this flag can downgrade the overall performance. This option replaces the original --enable-debug-mode option, which is now deprecated. Debug flags are disabled by default.
- --enable-asan: Adds compiler and linker flags to enable address sanitizer support. Enabling debug is also recommended in this case. Address sanitizer support is disabled by default.
- --disable-blocking-mode: Disables the blocking mode of TAMPI. The MPI_TASK_MULTIPLE threading level is never provided, so calls to blocking MPI procedures are directly forwarded to the same blocking procedures of the underlying MPI library. Also, by disabling this mode, TAMPI does not require the runtime system to provide the block/unblock API. The blocking mode is enabled by default.
- --disable-nonblocking-mode: Disables the non-blocking mode of TAMPI. Any call to the TAMPI_Iwait or TAMPI_Iwaitall procedures will be ignored. The non-blocking mode is enabled by default.
Once TAMPI is built and installed, e.g., in $TAMPI_HOME, the installation folder contains the libraries in $TAMPI_HOME/lib (or $TAMPI_HOME/lib64) and the headers in $TAMPI_HOME/include. There are three different libraries, each with a static and a dynamic variant:

- libtampi: The complete TAMPI library supporting both C/C++ and Fortran.
- libtampi-c: The TAMPI library for C/C++.
- libtampi-fortran: The TAMPI library for Fortran.
There are two header files that can be included from user applications and that declare all needed functions and constants: TAMPI.h is the header for C/C++ applications and TAMPIf.h is the header for Fortran programs.
The TAMPI library has some run-time options that can be set through environment variables. These are:
- TAMPI_POLLING_PERIOD (default 100 us): The TAMPI library periodically checks the in-flight MPI requests by running a transparent task, called the polling task. It is scheduled every few microseconds to check the pending MPI requests generated by the TAMPI operations. The polling period of this task can be static or dynamic. A static period means that the task always executes after the same amount of time, whereas a dynamic period may fluctuate between a minimum and a maximum value depending on the workload. The task consumes a CPU while running and yields it to the tasking runtime system while not running. The period may be decreased by the user in communication-intensive applications or increased in applications with low communication weights.

  The envar must follow the format TAMPI_POLLING_PERIOD=<minperiod>[:<maxperiod>[:<policy>]]. The minperiod and maxperiod values are integers specifying the minimum and maximum periods in microseconds. When setting the envar, the user must specify at least the minimum period; in that case, the polling is static. To enable dynamic polling, the maximum period must also be defined. The polling period then fluctuates within the range of minperiod and maxperiod.

  The policy parameter chooses a specific policy for the dynamic polling. Notice that this policy does not apply if the polling is static or if minperiod and maxperiod have the same value. Currently, the only accepted policy is slowstart, which applies a Slow Start-based algorithm that aggressively lowers the polling period and guarantees lower communication latency. This is the default policy when enabling dynamic polling.

  The default polling is static at 100 microseconds (equivalent to TAMPI_POLLING_PERIOD=100). Setting the envar to 0 means that the task should always be running. Example settings are shown after this list.
TAMPI_INSTRUMENT
(defaultnone
): The TAMPI library leverages ovni for instrumenting and generating Paraver traces. For builds with the capability of extracting Paraver traces, the TAMPI library should be configured passing a valid ovni installation through the--with-ovni
. Then, at run-time, define theTAMPI_INSTRUMENT=ovni
environment variable to generate an ovni trace. After the execution, the ovni trace can be converted to a Paraver trace with theovniemu
tool. You can find more information regarding ovni tracing at https://github.com/bsc-pm/ovni.
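For instance, some possible settings of these environment variables are (the concrete values are illustrative):

TAMPI_POLLING_PERIOD=100                  # static polling every 100 us (the default)
TAMPI_POLLING_PERIOD=50:1000              # dynamic polling between 50 us and 1000 us
TAMPI_POLLING_PERIOD=50:1000:slowstart    # dynamic polling with the slowstart policy
TAMPI_INSTRUMENT=ovni                     # generate an ovni trace (requires --with-ovni)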
IMPORTANT: The TAMPI_POLLING_FREQUENCY environment variable has been removed and is no longer considered. Please use the TAMPI_POLLING_PERIOD envar instead.
User applications should be linked against the MPI library (e.g., using the mpicc or mpicxx compiler), the parallel task-based runtime system and the TAMPI library. A hybrid OpenMP + MPI application in C++ named app.cpp could be compiled and linked using the following command:
$ mpicxx -cxx=clang++ -fopenmp=libompv -I${TAMPI_HOME}/include app.cpp -o app.bin -ltampi -L${TAMPI_HOME}/lib
Similarly, a hybrid OmpSs-2 + MPI application in C++ named app.cpp could be compiled and linked with the LLVM/Clang compiler with OmpSs-2 support using the following command:
$ mpicxx -cxx=clang++ -fompss-2 -I${TAMPI_HOME}/include app.cpp -o app.bin -ltampi -L${TAMPI_HOME}/lib
Or using the Mercurium legacy compiler:
$ mpicxx -cxx=mcxx --ompss-2 -I${TAMPI_HOME}/include app.cpp -o app.bin -ltampi -L${TAMPI_HOME}/lib
Please note that the options passed to mpicc or mpicxx may differ between MPI implementations. The -cxx option indicates that mpicxx should use the compiler passed as a parameter to compile and link the source files. This can also be indicated with the environment variables MPICH_CC or MPICH_CXX for MPICH and MVAPICH, I_MPI_CC or I_MPI_CXX for Intel MPI, and OMPI_CC or OMPI_CXX for OpenMPI. For instance, the OmpSs-2 application could be compiled and linked using MPICH with the following command:
$ MPICH_CXX=clang++ mpicxx -fompss-2 -I${TAMPI_HOME}/include app.cpp -o app.bin -ltampi -L${TAMPI_HOME}/lib
Finally, both OpenMP and OmpSs-2 applications can be launched as any traditional hybrid program.
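For instance, assuming the MPI installation provides a standard launcher such as mpirun (an assumption; the launcher and its options depend on the MPI implementation and the job scheduler), the resulting binary could be executed with:

$ mpirun -np 2 ./app.bin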
IMPORTANT: Fortran applications are not supported in TAMPI-OPT for the moment.
The TAMPIf.h header for Fortran defines some preprocessor macros. Therefore, to correctly use TAMPI in Fortran programs, users must include the header with #include "TAMPIf.h" in the first lines of the program, as shown in the following example. Additionally, the program must be preprocessed when compiling the code. The preprocessing is enabled by default when the source file has the extension .F90, .F95, etc., by passing the -cpp option to the gfortran compiler, or by passing the --pp option to the Mercurium compiler.
#include "TAMPIf.h"
module test_mod
! ...
end module test_mod
program test_prog
! ...
end program test_prog
Finally, due to some technical limitations in the Fortran language, the wrapper functions of the non-blocking TAMPI mode cannot be written in all combinations of lowercase and uppercase letters. For instance, the only accepted ways to write in a Fortran program a call to the TAMPI_Issend procedure are:
TAMPI_Issend
TAMPI_issend
tampi_issend
This limitation is applied to all other wrappers presented in the Wrapper Functions section, including both TAMPI_Wait and TAMPI_Waitall.
The Task-Aware MPI library relies on the ALPI interface to communicate with the underlying tasking runtime system. This interface is used internally by TAMPI to spawn internal tasks, to block user tasks, or to add external events to them. These low-level functionalities provide support for the TAMPI blocking and non-blocking modes.
The required interface is ALPI 1.0 (or any compatible version) and it is included in the ALPI.hpp header. Any tasking runtime system can support the TAMPI library by implementing this interface version.
This section answers some of the most common questions:
Q1: Why does linking my application to the TAMPI library fail with undefined symbols?
Some MPI libraries, such as OpenMPI, do not provide the MPI symbols for Fortran by default. If you are getting undefined symbol errors and your application is C/C++, you can link against libtampi-c instead of libtampi.
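For instance, an illustrative link line for a C/C++ application simply replaces the library name used in the commands shown in the previous sections:

$ mpicxx -cxx=clang++ -fopenmp=libompv -I${TAMPI_HOME}/include app.cpp -o app.bin -ltampi-c -L${TAMPI_HOME}/lib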
Q2: Why does my application not perform as expected when using TAMPI?
One of the first aspects to check when your application does not perform as expected is the polling frequency at which the TAMPI background services are working. By default, the TAMPI services check the internal MPI requests every 100 us (see the TAMPI_POLLING_PERIOD envar). This period should be enough for most applications, including communication-intensive ones. However, your application may need a higher polling frequency (i.e., reducing the envar's value) or a lower frequency (i.e., increasing the envar's value). See Run-time Options for more information regarding the TAMPI_POLLING_PERIOD envar.
Q3: Why does my application hang or report an error when using TAMPI for point-to-point communication?
Most MPI or hybrid MPI+OpenMP applications do not use specific MPI tags when sending and receiving messages. That is quite common because these applications usually issue MPI operations from a single thread and always in the same order on both the sender and receiver sides. They rely on the message ordering guarantees of MPI, so they use an arbitrary MPI tag for multiple or all messages.
That becomes a problem when issuing MPI operations from different concurrent threads or tasks. The order of sending and receiving messages on the sender and receiver sides may differ, and thus, messages can arrive and match in any order. To avoid this issue, you should use distinct tags for the different messages that you send and receive. For instance, if you are exchanging multiple blocks of data and you want to send a message per block (encapsulated in a separate task), you could use the block id as the MPI tag of the corresponding message.
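As a brief sketch mirroring the earlier examples (the surrounding variables are assumed), using the block index as the tag makes each concurrent message match unambiguously:

for (int n = 0; n < N; ++n) {
    #pragma oss task out(data[n])
    {
        // The tag 'n' identifies the block, so out-of-order matching cannot mix messages
        MPI_Recv(&data[n], 1, MPI_INT, 0, n, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}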
Q4: Why does my application hang or report an error when using TAMPI for collective communication?
This issue is quite related to the previous one. The MPI standard does not allow identifying collective operations with MPI tags. Instead, it requires the user to guarantee a specific order when issuing multiple collectives of the same type through the same communicator in all involved processes. That limits the parallelism that can be achieved when executing multiple collective operations of the same type. If you want to run multiple collectives of the same type in parallel (e.g., from different concurrent tasks), we recommend using separate MPI communicators. Notice that having many MPI communicators could harm the application's performance depending on the MPI implementation.
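A minimal sketch of that recommendation (the variables NTASKS, buffers and count are illustrative assumptions) duplicates the communicator so that concurrent collectives of the same type never match across tasks:

MPI_Comm comms[NTASKS];
for (int i = 0; i < NTASKS; ++i)
    MPI_Comm_dup(MPI_COMM_WORLD, &comms[i]); // one communicator per concurrent collective

for (int i = 0; i < NTASKS; ++i) {
    #pragma oss task firstprivate(i)
    {
        // Collectives issued on distinct communicators can safely run in parallel
        MPI_Bcast(buffers[i], count, MPI_INT, 0, comms[i]);
    }
}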
Q5: Does TAMPI support the MPI Session model introduced in MPI 4.0?
The TAMPI library does not support the MPI Session model for the moment. Right now, TAMPI only provides support for the MPI World model.
Q6: Why does my application report an error regarding an incompatibility with the ALPI tasking interface?
From Task-Aware MPI 3.0 and later, the library uses a new generic tasking interface to communicate with the tasking runtime system in the background. Previous versions of Task-Aware MPI used the API of the Nanos6 runtime system, but that API is not supported anymore. Make sure your OmpSs-2 or LLVM/OpenMP meets the minimum requirements from the Software Requirements section. More information about the ALPI interface is shown in the ALPI Tasking Interface section.
Q7: Is TAMPI an MPI implementation?
No, the TAMPI library is not an MPI implementation. Instead, TAMPI is an independent library that works over any standard MPI library. The only requirement is that the underlying MPI implementation has to support the standard MPI_THREAD_MULTIPLE threading level.