v24.11.00
This is a closed-source release, governed by the following EULA: https://docs.nvidia.com/legate/24.11/eula.pdf.
Linux x86 and ARM conda packages with multi-node support (based on UCX or GASNet) are available at https://anaconda.org/legate/legate (GASNet-based packages are under the `gex` label).
Documentation for this release can be found at https://docs.nvidia.com/legate/24.11/.
New features
- Provide an MPI wrapper that users can compile against their local MPI installation and integrate with an existing build of Legate. This is useful when the MPI installation at run time differs from the one Legate was compiled against.
- Add support for using GASNet as the networking backend, useful on platforms not currently supported by UCX, e.g. Slingshot-11. Provide scripts so users can compile GASNet on their local machine and integrate it with an existing build of Legate.
- Automatic machine configuration: Legate now detects the available hardware resources at startup, and no longer needs to be told, for example, how much memory to allocate.
- Print more information on what data is taking up memory when Legate encounters an out-of-memory error.
- Support scalar parameters, default arguments, and reduction privileges in Python tasks (see the sketch after this list).
- Add support for a `concurrent_task_barrier`, useful for preventing NCCL deadlocks.
- Allow tasks to specify that CUDA context synchronization at task exit can be skipped, reducing latency.
- Experimental support for distributed HDF5 and Zarr I/O.
- Experimental support for single-CPU/GPU fast-path task execution (skipping the tasking runtime dependency analysis).
- Experimental implementation of a "bloated" instance prefetching API, which instructs the runtime to create instances encompassing multiple slices of a store ahead of time, potentially reducing intermediate memory usage.
- full changelog
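
As an illustrative sketch of the new Python-task capabilities, the task below takes a scalar parameter with a default argument. The decorator and annotation names (`task`, `InputStore`, `OutputStore` from `legate.core.task`) and the `get_inline_allocation()` accessor are assumptions here, not confirmed API; consult the 24.11 documentation for the exact signatures.

```python
# Illustrative sketch only: the names below (task, InputStore,
# OutputStore, get_inline_allocation) are assumptions about the
# legate.core.task API; check the 24.11 docs for the real interface.
import numpy as np
from legate.core.task import task, InputStore, OutputStore

@task
def scale(out: OutputStore, inp: InputStore, factor: float = 2.0) -> None:
    # `factor` is a scalar parameter with a default argument, both
    # newly supported for Python tasks in this release.
    src = np.asarray(inp.get_inline_allocation())  # assumes a CPU-visible allocation
    dst = np.asarray(out.get_inline_allocation())
    dst[:] = src * factor
```
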
Known issues
The GPUDirectStorage backend of the HDF5 I/O module (off by default; enabled with `LEGATE_IO_USE_VFD_GDS=1`) is not currently working: enabling it will result in a crash. We are working on a fix.
Legate's auto-configuration heuristic attempts to split CPU cores and system memory evenly across all instantiated OpenMP processors, without accounting for the actual core count and memory limit of each NUMA domain. When the number of OpenMP groups does not evenly divide the number of NUMA domains, this bug can produce unsatisfiable core and memory allocations, resulting in error messages such as:
```
not enough cores in NUMA domain 0 (72 < 284)
reservation ('OMP0 proc 1d00000000000005 (worker 8)') cannot be satisfied
insufficient memory in NUMA node 4 (102533955584 > 102005473280 bytes) - skipping allocation
```
These issues should only affect performance if you are actually running computations on the OpenMP processors (rather than using the GPUs for computation). You can always adjust the automatically derived configuration values through `LEGATE_CONFIG`; see https://docs.nvidia.com/legate/latest/usage.html#resource-allocation, and the sketch below.
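
For instance, the OpenMP-related failures above can be sidestepped by pinning the configuration explicitly. This is a minimal sketch: the flag names (`--gpus`, `--omps`, `--sysmem`) follow the resource-allocation documentation linked above, and setting the variable from Python is an assumption that only works before the runtime has started; verify both against your installation.

```python
# Sketch: pin the machine configuration instead of relying on
# auto-detection. LEGATE_CONFIG is read when the Legate runtime starts,
# so it is most commonly exported in the shell, e.g.:
#   LEGATE_CONFIG="--gpus 8 --omps 0 --sysmem 4000" legate my_script.py
# Setting it from Python (below) is assumed to work only if done before
# the runtime starts. Verify the flag names against your installation.
import os

os.environ["LEGATE_CONFIG"] = "--gpus 8 --omps 0 --sysmem 4000"

import legate.core  # noqa: E402 -- runtime picks up LEGATE_CONFIG at startup
```
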