Releases: StanfordLegion/legion
Version 22.03.0 (March 27, 2022)
- Build
- Minimum supported cmake version is now 3.7. (Some optional features continue to require even newer versions.)
- Realm
- Numerous bug fixes in the gasnetex network layer
- CUDA and HIP support allow direct specification of which GPUs to use via the -ll:gpu_ids command-line option
- Added support for copy paths using CUDA IPC between GPUs on the same physical node
- For applications using CUDA without the runtime API hijack AND only submitting work to the default CUDA stream, -cuda:legacysync 1 reduces the overhead of detecting the completion of device-side work launched by a task
- Realm reduction copies may now indicate exclusive access to the destination instance, improving performance by allowing simple load/store instead of atomic operations
- Custom reduction operations (including Legion's built-in ones) can provide HIP implementations, permitting in-place reductions in HIP device memory
- Regent
- Support for custom serialization of types in task parameters and results
- New experimental timing library under std/timing
Version 21.12.0 (December 31, 2021)
- Realm
- Performance improvements for multi-dimensional copies, especially inter-process transfers
- Support for loading CUDA driver (if present) at runtime instead of link time, allowing same binary to be used on systems with and without CUDA-capable GPUs (enabled with -DLegion_CUDA_DYNAMIC_LOAD=ON in cmake build)
- A separate Memory is now created per process for external (system) memory instances. This memory has no capacity for creating instances and can confuse applications or Legion mappers that assume exactly one Memory of kind SYSTEM_MEM exists. Old behavior can be obtained with -ll:ext_sysmem 0, but this can fail for configurations that register system memory with the network and/or GPUs
- The MemoryQuery now supports a has_capacity predicate to restrict results to just memories with sufficient total (not current!) capacity to allocate an instance of a specified size
- Build
- Cmake allows control of max nodes (-DLegion_MAX_NUM_NODES=...) and max processors/node (-DLegion_MAX_NUM_PROCS=...) supported by Legion build
- Added dependency tracking to make-based builds
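The has_capacity note in the Realm list above distinguishes total capacity from current free space. As a rough illustration only (plain Python, not Realm's C++ API; the Memory record, its fields, and the helper function are hypothetical names made up for this sketch), a filter on total capacity drops a zero-capacity external memory while keeping a memory that happens to be full right now:

```python
from dataclasses import dataclass


@dataclass
class Memory:
    """Hypothetical stand-in for a memory; not the real Realm::Memory."""
    name: str
    kind: str
    capacity: int  # total bytes the memory can ever hold
    used: int = 0  # bytes currently allocated


def has_capacity(memories, min_bytes):
    """Keep memories whose TOTAL capacity could fit an instance of min_bytes.

    Current usage is deliberately ignored, mirroring the "total (not
    current!)" wording in the release note.
    """
    return [m for m in memories if m.capacity >= min_bytes]


memories = [
    Memory("sysmem", "SYSTEM_MEM", capacity=1 << 30, used=1 << 30),  # full right now
    Memory("ext_sysmem", "SYSTEM_MEM", capacity=0),  # external instances only
    Memory("fbmem", "GPU_FB_MEM", capacity=1 << 28),
]

eligible = has_capacity(memories, 1 << 29)
print([m.name for m in eligible])
```

The toy list also contains two memories of kind SYSTEM_MEM, which is the situation that can surprise code assuming exactly one.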
Version 21.09.0 (September 28, 2021)
- Realm
- Numerous bug fixes in the gasnetex network layer
- Support for HIP memory type registration with GASNet (with GASNet version 2021.9.0+)
- Arguments to spawned tasks may now be arbitrarily large (network-specific limits have been eliminated)
- Regent
- Improved support for dynamic checks on index launches with potential interference between different region arguments
- Extensive fixes for separate compilation. This mode has now been verified to work with large-scale applications
- Removed long-obsolete support for __demand(__external)
- Pygion
- Add support for layout constraints
Version 21.06.0 (June 24, 2021)
- Build
- Version information is now compiled into Realm and Legion. This takes the form of a string (e.g. "legion-21.06.0") rather than anything that can be compared (i.e. no semantic versioning here). Compile-time defines REALM_VERSION and LEGION_VERSION are available as well as run-time calls Realm::Runtime::get_library_version and Legion::Runtime::get_library_version.
- Regent
- Support for dynamic checks on projection functors, enabling a much larger class of loops to be supported as index launches
- Support for local tasks (i.e., without going through the runtime) via __demand(__local)
- Realm
- Windows (MSVC) builds are now tested in CI and therefore more likely to work
- Realm runtime can now be shut down and reinitialized in the same process. (Exception: GASNet-based network layers do not support this.)
- Registration of host memory with CUDA driver is skipped for host memories larger than 1GB by default due to CUDA driver overhead. This threshold can be increased (or decreased) with -cuda:hostreg
- Tools
- New Rust implementation of Legion Prof is 5-15x faster than the original (even with PyPy). For more details, see: https://legion.stanford.edu/profiling/#rust-legion-prof
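The Build note in this release says the version is a plain string rather than something comparable, so a script that wants an ordering has to parse it itself. A minimal sketch, assuming every version string starts with the "legion-YY.MM.P" pattern of the one example given in the notes (real strings may differ; parse_version is a hypothetical helper, not part of Legion):

```python
import re

# Assumed pattern: "legion-YY.MM.P", optionally followed by extra text such as
# a build suffix. Only the "legion-21.06.0" example comes from the notes.
_VERSION_RE = re.compile(r"^legion-(\d+)\.(\d+)\.(\d+)")


def parse_version(text):
    """Return (year, month, patch) as ints, or None if the pattern is absent."""
    match = _VERSION_RE.match(text)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


print(parse_version("legion-21.06.0"))
print(parse_version("legion-21.06.0") < parse_version("legion-21.12.0"))
print(parse_version("custom-build"))
```

Returning None for unrecognized strings keeps the caller from silently comparing against a guessed value.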
Version 21.03.0 (March 30, 2021)
- Build
- Cmake can build an embedded copy of GASNet as part of the Legion build with -DLegion_EMBED_GASNet=ON
- Regent
- Contains three breaking changes to the Regent calling convention: (1) Reductions are now aggregated into region requirements and sorted by the index of the first field in the field space among the set of fields for each reduction. (2) Task arguments may be passed through either args or local_args for index launched tasks. (Previously Regent only used local_args.) (3) Region values passed via args to an index-launched task may be bogus. Instead the region requirement should be used to obtain the original region.
- Support for constant time index launches. These are enabled automatically, but can be forced on or off with __demand or __forbid with __constant_time_launches. This should improve scalability at extreme node counts.
- Support for rescape and remit to generate metaprogrammed code more easily.
- Experimental support for separate compilation via -fseparate 1 allows Regent programs to be compiled in parts (potentially in parallel). Note that separate compilation currently cannot be used with Bishop and requires one of either parallel or incremental compilation if regentlib.start is used (does not apply to regentlib.saveobj or regentlib.save_tasks).
- Legion
- In the control replication branch users will find a new implementation of Legion's physical analysis that uses heuristics to select which sub-trees should be used for performing the analysis. Disjoint and complete partitions are especially helpful in aiding the runtime.
- There is a new implementation of the index space math inside of the runtime that now soundly and precisely detects congruences between index space math operations. This fixes a long-running class of bugs that would cause memory explosions in the physical analysis.
- In the control replication branch users can now map future values into memories the same as they do with regions. This means that future payloads can be placed directly on devices like GPUs. Similarly, the runtime now accepts future data from tasks that also reside in any memory in the machine including device memories.
- Both the master and control replication branches have support for index space attach operations.
- Expensive transitive reductions on traces are now computed in the background allowing trace replays to begin replaying immediately with only partial optimizations.
- Realm
- Custom reduction operations (including Legion's built-in ones) can provide CUDA implementations, permitting in-place reductions in CUDA device memory
- Support for CUDA managed memory (via -ll:msize) that is coherent for both host and device access. Includes support for __managed__ variables (only single-GPU if using CUDA runtime hijack mode)
- Event::wait may be called outside of Realm tasks, having the same thread-blocking behavior as Event::external_wait
- Experimental support for AMD HIP. Note that testing coverage is incomplete, and breakages may occur in between releases. For more details, see: #1028
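For readers unfamiliar with the term used in the Legion trace note above: a transitive reduction removes dependence edges that are already implied by a longer path. The following is a toy sketch on a small dependence graph, not Legion's implementation; the function name and the edge-list representation are assumptions made for illustration, and the input is assumed to be acyclic.

```python
def transitive_reduction(edges):
    """Drop every edge u -> v of a DAG for which another path from u to v exists."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, set()).add(dst)
        graph.setdefault(dst, set())

    def reachable(start, goal, skip_edge):
        # Depth-first search from start that never follows skip_edge.
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            for nxt in graph[node]:
                if (node, nxt) == skip_edge or nxt in seen:
                    continue
                if nxt == goal:
                    return True
                seen.add(nxt)
                stack.append(nxt)
        return False

    return sorted(e for e in set(edges) if not reachable(e[0], e[1], e))


deps = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("a", "d")]
print(transitive_reduction(deps))
```

Here the edges a -> c and a -> d are dropped because the chain a -> b -> c -> d already orders those operations. Each edge triggers its own search, which hints at why doing this work on large traces is expensive enough to move to the background.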
Version 20.12.0 (December 28, 2020)
- Build
- Legion and Realm now require a compiler with (at least) c++11 support
- Python scripts (e.g. legion_prof and legion_spy) require Python 3.5
- Realm
- Improved performance of inter-node instance copies when data is not contiguous in source and/or destination
- Improved responsiveness of utility processors by not using them for background work by default
- Experimental support for building on Windows with MSVC
- Improved performance (and correctness) when running CUDA tasks without the runtime hijack enabled
- Added gasnetex network layer that uses GASNet-EX's native API (instead of the legacy GASNet-1 API support). Requires GASNet version 2020.11.0 or newer. For more details, see: #986
- Legion
- The mapping interface no longer requires the runtime to return valid instances for empty regions (e.g. regions with no points in their index space)
- Tools
- Legion Spy now has support for arbitrary number of dimensions
- Examples
- examples/nccl gives a simple example of using NCCL with Legion
Version 20.06.0 (June 29, 2020)
- Regent
- Support for std/format module for type-safe formatted printing
- Support for documentation with LDoc
- Support for __future operator to import a C API future
- Legion
- Support for inlining tasks into leaf contexts
- Support for global registration callbacks inside of tasks
- Added semantic tags for source file and line location
- Support for multi-region accessors for region requirements with co-location constraints
- Changes to semantics of deletion for index spaces, field spaces, and logical regions. For details, see: #812
- Support for creating field spaces with initial fields
- Realm
- Subgraphs can be used to capture a template of Realm operations that will be executed repeatedly. Subgraph definitions include support for "interpolating" values into individual operations' arguments on each instantiation of the subgraph template
- create_weighted_subspaces supports size_t weights for precise control over the size of each subspace
- Added support for omp critical constructs and dynamic loop schedules in OpenMP tasks
- Added support for cudaStreamLegacy and cudaStreamPerThread in CUDA tasks
- Realm logs now include a timestamp (relative to runtime init) by default. This behavior can be disabled with -logtime 0
- Performance improvements for copies/fills of 3D instances spaces in GPU device memory
- Added ability to compute a set of "covering rectangles" for sparse index spaces, allowing more compact representation in memory
- Added MultiAffineAccessor for accessing compact instances
- Added ability to delete a ProcessorGroup
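To see why integer weights give precise control in the create_weighted_subspaces note above, consider splitting a fixed number of points in proportion to a list of weights. The sketch below is plain Python, not Realm's API; weighted_split is a hypothetical helper and its rule for handing out leftover points is an assumption, not necessarily what Realm does.

```python
def weighted_split(total, weights):
    """Split `total` points into len(weights) chunk sizes proportional to weights.

    Uses integer arithmetic only; leftover points go to the chunks with the
    largest remainders (an assumed tie-breaking rule, not necessarily Realm's).
    """
    weight_sum = sum(weights)
    if weight_sum <= 0:
        raise ValueError("weights must sum to a positive value")
    sizes = [total * w // weight_sum for w in weights]
    remainders = [total * w % weight_sum for w in weights]
    leftover = total - sum(sizes)
    # Hand out the remaining points, largest remainder first (stable on ties).
    order = sorted(range(len(weights)), key=lambda i: -remainders[i])
    for i in order[:leftover]:
        sizes[i] += 1
    return sizes


print(weighted_split(10, [3, 7]))
print(weighted_split(10, [1, 1, 1]))
```

When the weights add up to the number of points, as in the first call, every chunk receives exactly its weight and no rounding is involved.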
Version 20.03.0 (March 31, 2020)
- Regent
- Behavior change: __fields and __physical now both require explicit field names, i.e., __fields(r.{x, y}) rather than __fields(r). This makes the behavior unambiguous and helps to avoid bugs
- Added complete and incomplete keywords that can be used to mark partitions as such
- Added support for setting mapper ID and tag via t:set_mapper_id() and t:set_mapping_tag_id()
- Initial support for predicated execution of if and while statements
- Fixed several bugs and memory leaks, and improved compile times
- Legion
- Introduction of Fortran bindings for Legion
- Support for creating deferred index spaces from future values
- Support for construction of partitions from a map of domains or from a future map
- Support for reducing a future map to a single future asynchronously
- Realm
- Support for Kokkos parallel launch constructs in Realm (and therefore Legion) tasks. Currently supported Kokkos execution spaces are: Serial, OpenMP, CUDA. Application data remains in logical regions, but accessors can be converted to Kokkos (unmanaged) Views if needed. See the kokkos_interop example
- Introduction of experimental MPI-based network layer, enabled with REALM_NETWORKS=mpi (make) or -DRealm_NETWORKS=mpi (cmake). Use REALM_NETWORKS=gasnet1 (or USE_GASNET=1, which still works) for the GASNet-based network layer (which works with GASNet-1 or GASNet-EX)
- CUDA Runtime API interposer (a.k.a. "hijack") can now be disabled with USE_CUDART_HIJACK=0 (make) or -DLegion_HIJACK_CUDART=OFF (cmake). This can reduce effectiveness of task-parallelism for CUDA tasks, so use only if needed
- More control over GPU selection via: -cuda:skipgpus N which leaves the first N GPUs available for other uses, -cuda:skipbusy which skips over busy GPUs, and -cuda:minavailmem M which skips GPUs with less than M device memory available
- Reduction in memory usage of Realm internal data structures
- Tools
- There is now a generic launcher script for running Python code with Legion that will execute an arbitrary Python program in the top-level task of a Legion program. This script mirrors the interface to CPython as closely as possible.
- Legion Spy now supports verification and rendering of indirection copies
- Legion Prof supports Instance layout constraints related to dimension ordering and field alignment
- Legion Prof contains a menu option for viewing ready state of operations
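The three GPU selection options in the Realm list of this release each describe a filter over the visible devices. A rough sketch of how such filters could combine, written in plain Python with made-up data: the Gpu record, select_gpus, and the order in which the filters are applied are all assumptions for illustration, and the notes do not state the unit of M.

```python
from dataclasses import dataclass


@dataclass
class Gpu:
    """Hypothetical description of one device; not a Realm or CUDA type."""
    index: int
    busy: bool
    avail_mem: int  # free device memory, in whatever unit the threshold uses


def select_gpus(gpus, skip_gpus=0, skip_busy=False, min_avail_mem=0):
    """Apply the three filters in the spirit of the -cuda:* options above."""
    chosen = []
    for gpu in gpus[skip_gpus:]:  # leave the first N devices for other uses
        if skip_busy and gpu.busy:
            continue
        if gpu.avail_mem < min_avail_mem:
            continue
        chosen.append(gpu.index)
    return chosen


gpus = [Gpu(0, False, 16), Gpu(1, True, 16), Gpu(2, False, 4), Gpu(3, False, 16)]
print(select_gpus(gpus, skip_gpus=1, skip_busy=True, min_avail_mem=8))
```

In this toy setup only device 3 survives: device 0 is left for other uses, device 1 is busy, and device 2 has too little free memory.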
Version 19.12.0 (December 31, 2019)
- Build
- Both builds (Make and CMake) now generate legion_defines.h and realm_defines.h. By default these headers are generated in the source directory (Make) or build directory (CMake). This means that languages such as Regent and Python no longer require MAX_DIM to be specified explicitly
- Regent
- Support for CUDA 10
- Support for field polymorphic tasks
- Substantially improved the generality of the index launch optimization. Task arguments of the form p[i+k] may now be used, where k is a variable defined outside of the loop
- Add flag -foverride-demand-index-launch which can be used to force loops to be index launched in cases where the compiler cannot prove the disjointness of read-write region arguments
- Added reductions for complex64
- The scripts install.py and setup_env.py now use CMake to build Terra by default, which should improve portability on most machines
- The behavior of -fcuda 1 has changed: this flag will now issue an error if CUDA cannot be enabled (e.g. because the build does not support CUDA, or because the machine has no GPUs). Omitting this flag will now enable CUDA if it is available (and will not error if it is not available). The behavior of -fopenmp 1 has changed similarly.
- The behavior of __demand(__cuda) has changed. This will now issue an error if a loop is not eligible for the CUDA transformation, regardless of whether CUDA is actually available on the current machine or not. The behavior of __demand(__openmp) has changed similarly.
- The annotation __allow(__cuda) is now permitted, and permits (but does not require) tasks to be optimized with CUDA.
- Experimental support for 2D kernel launch in the CUDA code generation
- Python
- Add support for copies
- Copies and fills now support multiple fields
- Tasks (including index launches) now support setting the mapper
ID and tag
- Legion
- A major overhaul of the Legion physical analysis to use an approach based on bounding volume hierarchies. The change is not visible to users, but will likely impact performance. Most programs will get faster; programs that create many partitions frequently on the fly may get slower. The latter case will be fixed in an upcoming release.
- Added support for indirect copy operations such as gather and scatter onto existing copy launchers
- Realm
- Event::subscribe allows polling via Event::has_triggered to (eventually) succeed
- Addition of CompletionQueue objects that allow multiple unordered Event triggers to be efficiently handled by a single consumer
- Support for omp_get_level, omp_in_parallel, and omp_set_num_threads in tasks running on OpenMP processors
- Support for unstructured scatter and/or gather in copies. (Handling structured cases as well as fills/reductions remains a work in progress.)
- Removed all calls to Event::wait from inside other Realm API calls. Applications now must make sure that index spaces and instance metadata are valid before use. For details, see: #465
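Both the Legion and Realm lists in this release mention gather and scatter copies. As a reminder of what the two terms mean, here is a list-based sketch in plain Python; it is not the Legion or Realm copy API, and both function names are chosen for this example only.

```python
def gather(src, indices):
    """dst[i] = src[indices[i]]: read through an indirection."""
    return [src[j] for j in indices]


def scatter(src, indices, dst_size, fill=None):
    """dst[indices[i]] = src[i]: write through an indirection.

    Later writes win if two source elements target the same slot.
    """
    dst = [fill] * dst_size
    for value, j in zip(src, indices):
        dst[j] = value
    return dst


data = [10, 20, 30, 40]
print(gather(data, [3, 0, 0, 2]))
print(scatter(data, [2, 0, 3, 1], dst_size=4))
```

A gather may read the same source element more than once, whereas a scatter with repeated targets needs a rule for which write wins; this sketch simply keeps the last one.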
Version 19.09.0 (September 9, 2019)
- Regent
- __demand(__index_launch) has been added as an alternative to __demand(__parallel) on for loops that avoids confusion with the auto-parallelizer. __demand(__parallel) on for loops is deprecated and now issues a warning; in a future release this warning will be upgraded to an error. For details, see: #520
- Multi-field expansion is deprecated and now issues an error. The error can be temporarily downgraded to a warning, but it is advised that users migrate codes away from this syntax as it will become a hard error in a future release. For details, see: #501
- Legion
- Support for a built-in collection of reduction operators including sum, product, max, and min over a variety of types for CPUs and GPUs
- Realm
- assorted bug, performance, and memory leak fixes
- fills to attached HDF5 instances are orders of magnitude faster
- support for reusing HDF5 file handles with -hdf5:openfiles option
- control which rank opens an HDF5 file with a rank=nnn: filename prefix
- Build System
- Makefile-based flow attempts to detect CUDA location and GASNet conduit if they are not specified
- Makefile-based flow defaults to building CUDA fat binaries, but can still be overridden with the GPU_ARCH setting, which now accepts SM arch numbers (e.g. "70") as well as names (e.g. "volta")
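The built-in reduction operators listed under Legion in this release (sum, product, max, min) can each be thought of as a fold function paired with an identity value. The sketch below shows that idea in plain Python; the REDUCTIONS table and the fold helper are illustrative names, not Legion's C++ types.

```python
import math
import operator
from functools import reduce

# Each operator is a (fold function, identity) pair. The identity is the value
# that leaves any operand unchanged, so it is a safe initial value for a fold.
REDUCTIONS = {
    "sum": (operator.add, 0),
    "product": (operator.mul, 1),
    "max": (max, -math.inf),
    "min": (min, math.inf),
}


def fold(name, values):
    """Reduce `values` with the named operator, starting from its identity."""
    func, identity = REDUCTIONS[name]
    return reduce(func, values, identity)


print(fold("sum", [1, 2, 3, 4]))
print(fold("product", [1, 2, 3, 4]))
print(fold("max", []))
```

Starting from the identity means an empty input still yields a well-defined result, which is what allows partial results from different processors to be combined in any order.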