Legion: Control Replication Quickstart
Control replication is a transformation that makes implicitly parallel execution in Legion dramatically more scalable by replicating the execution of the "control" part of a task `T`, so that the subtasks of `T` can be analyzed and launched from multiple processes.

Consider the non-control-replicated scenario, where `T` runs in a single process and launches all of its subtasks from there: as the machine grows, the one process executing `T` becomes a sequential bottleneck. When we execute the same program with control replication, a separate copy of `T`, called a shard, runs in each process, and each shard analyzes and launches only the subtasks assigned to it.

Note that since `T` is executed multiple times, once in each shard, its execution must be replicable: every shard must perform the same sequence of Legion API calls with the same arguments. Practices that can violate this property include:
- Using data structures, such as hash tables, that do not guarantee iteration order (depending on how exactly the data structure is used). For example, launching a subtask only if an element is present in a hash table is fine, but iterating over a hash table and launching a task for each element is not, since different copies of `T` on different processes might iterate over the elements in a different order (again, unless the hash table is deterministic). Similar constraints apply to STL data structures such as `std::set` and `std::map` if they are keyed on types like pointers. In languages like Python, data structures like `set` that are keyed on hashed values can lead to violations of replicability, whereas data structures like `dict` are less likely to cause problems since their iteration order is determined by insertion order (in Python 3).
- Using random number generators in a way that impacts the control flow or the arguments passed to Legion API calls (unless the random number generator and its seeds are deterministic and the user has verified they produce the same values in every shard).
- Polling on whether futures or future maps are finished and using that information to impact control flow or determine what tasks or other operations to have Legion launch.
- Performing I/O without using explicit Legion attach/detach operations or the `Runtime::print_once` or `Runtime::log_once` runtime functions.
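To make the first guideline concrete, here is a minimal sketch, assuming a task body where `ctx` and `runtime` are in scope; the `WORKER_TASK_ID` and the `work` container are illustrative, not taken from the Legion tree:

```cpp
#include <map>
#include <unordered_map>
#include "legion.h"
using namespace Legion;

enum { WORKER_TASK_ID = 2 }; // hypothetical subtask ID

void launch_work(Context ctx, Runtime *runtime,
                 const std::unordered_map<int, int> &work) {
  // BAD: std::unordered_map has no specified iteration order, so two
  // shards could issue these launches in different orders:
  //   for (const auto &kv : work) { ...execute_task(...)... }

  // OK: copy into a std::map, which iterates its int keys in sorted
  // order, so every shard issues the exact same sequence of launches.
  std::map<int, int> ordered(work.begin(), work.end());
  for (const auto &kv : ordered) {
    TaskLauncher launcher(WORKER_TASK_ID,
                          TaskArgument(&kv.second, sizeof(kv.second)));
    runtime->execute_task(ctx, launcher);
  }
}
```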
There are exceptions to all of these guidelines that can be done safely, but please be careful. You can run your program with `-lg:safe_ctrlrepl 1` to ask the runtime to perform (expensive) checks that will detect violations of this property.
To control replicate a task in a C++ Legion program:
- Choose the task `T` that you want to replicate. This will most often be the top-level task.
- Verify that at least one variant of `T` is replicable (see the discussion above).
- When registering the task variant for `T`, declare it as replicable in the `TaskVariantRegistrar` using `TaskVariantRegistrar::set_replicable()`, as in the sketch below.
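A minimal registration sketch follows, assuming a top-level task named `top_level_task`; the task ID and names are illustrative rather than taken from the Legion tree:

```cpp
#include "legion.h"
using namespace Legion;

enum TaskIDs { TOP_LEVEL_TASK_ID = 1 };

void top_level_task(const Task *task,
                    const std::vector<PhysicalRegion> &regions,
                    Context ctx, Runtime *runtime) {
  // ... launch subtasks here ...
}

int main(int argc, char **argv) {
  Runtime::set_top_level_task_id(TOP_LEVEL_TASK_ID);
  {
    TaskVariantRegistrar registrar(TOP_LEVEL_TASK_ID, "top_level");
    registrar.add_constraint(ProcessorConstraint(Processor::LOC_PROC));
    registrar.set_replicable(); // declare this variant safe to replicate
    Runtime::preregister_task_variant<top_level_task>(registrar, "top_level");
  }
  // Run with -lg:safe_ctrlrepl 1 to have the runtime check that the
  // replicated execution really is replicable (expensive; debug only).
  return Runtime::start(argc, argv);
}
```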
The above steps are sufficient if you are control replicating the top-level task of your program and relying on the default mapper, since the default mapper will automatically replicate one copy of the top-level task on each node of a multi-node run, provided it can find a replicable task variant. The following steps are recommended for all other use cases:
- Create and register a custom mapper.
- Use your custom mapper when launching `T`, and when launching any subtask within `T`.
- Set `replicate=true` in `select_task_options` for `T` in your mapper.
- Implement `replicate_task` in your mapper, which decides how to instantiate the shards. Different ranks/nodes can be assigned different numbers of shards, and not every rank/node must be assigned a shard. The most common configuration is 1 shard per rank and 1 rank per node.
- Create and register one or more `ShardingFunctor` objects. This code will select, for each sub-task, which shard will execute it.
- Implement `select_sharding_functor` in your mapper, to pick which sharding functor will be used for each task/copy/etc. launch.
Now if you perform an index launch within `T`, each shard will first apply the sharding functor to select the points it is responsible for, and then it will get a local `slice_task` callback to (locally) distribute those point tasks. A minimal sketch of these mapper pieces follows.
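This sketch reuses the `TOP_LEVEL_TASK_ID` from the registration example above; the round-robin functor and its sharding ID are illustrative, and production mappers usually start from the `DefaultMapper`, which already implements `replicate_task` by placing one shard per node:

```cpp
#include "legion.h"
#include "mappers/default_mapper.h"
using namespace Legion;
using namespace Legion::Mapping;

// Illustrative sharding functor: assign each point of an index launch
// to a shard round-robin by its first coordinate. Every shard evaluates
// this same pure function, so all shards agree on the assignment.
class RoundRobinShardingFunctor : public ShardingFunctor {
public:
  ShardID shard(const DomainPoint &point,
                const Domain &full_space,
                const size_t total_shards) override {
    return point[0] % total_shards;
  }
};

static const ShardingID ROUND_ROBIN_ID = 1; // illustrative functor ID

class ReplicationMapper : public DefaultMapper {
public:
  using DefaultMapper::DefaultMapper; // inherit constructors

  void select_task_options(const MapperContext ctx, const Task &task,
                           TaskOptions &output) override {
    DefaultMapper::select_task_options(ctx, task, output);
    if (task.task_id == TOP_LEVEL_TASK_ID)
      output.replicate = true; // ask Legion to control replicate T
  }

  void select_sharding_functor(const MapperContext ctx, const Task &task,
                               const SelectShardingFunctorInput &input,
                               SelectShardingFunctorOutput &output) override {
    output.chosen_functor = ROUND_ROBIN_ID;
  }
  // DefaultMapper already provides replicate_task (one shard per node);
  // override it only if you need a different shard layout.
};

// Registration, e.g. in main() before Runtime::start:
//   Runtime::preregister_sharding_functor(ROUND_ROBIN_ID,
//                                         new RoundRobinShardingFunctor());
```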
Regent automatically checks every task to see if it is replicable, and marks the task as such. If you wish to ensure that a task is replicable, you may add the annotation `__demand(__replicable)` to the top of the task. Note that this does not change the compiler's analysis in any way, but it raises an error if the marked task is not in fact replicable. It is recommended to mark your top-level task with this annotation so that you can catch any potential replicability failures at compile time.
Otherwise, the mapping advice above generally applies to Regent programs as well.
- If you use `printf` for console printing inside `T`, then the output will be duplicated in every shard. Use the `LEGION_PRINT_ONCE` macro instead, or explicitly call the `Runtime::print_once` or `Runtime::log_once` functions (see the sketch after this list).
- If you are using Legion in a programming language with a garbage collector, you can help ensure your tasks are replicable by making sure that any operations launched from the garbage collector (e.g. detach or deletion operations) pass `true` for the `unordered` parameter. In such cases you do not need to guarantee that those operations are performed in the same order across the shards: Legion will match up such operations once it has seen one in every shard and automatically insert them in a deterministic way into the stream of sub-tasks/operations launched. You are still responsible for making sure every shard eventually issues each such deletion or detach operation, but the ordering does not matter.
- In addition to control replication, Legion also supports normal (non-control) replication of tasks. A task that is replicated (but not control-replicated) simply runs multiple times, once per replica. You can replicate leaf tasks to run the same computation in different places to produce the same data. The same mapper paths are used for replicating leaf tasks. If the selected task variant is a leaf variant, then Legion will automatically know to perform normal replication rather than control replication of the task.
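As a sketch of the printing and `unordered` points above, assuming a task body where `ctx` and `runtime` are in scope and `pr` is a previously attached `PhysicalRegion` (the function and variable names are illustrative):

```cpp
#include <cstdio>
#include "legion.h"
using namespace Legion;

void report_and_cleanup(Context ctx, Runtime *runtime, PhysicalRegion &pr) {
  // printf here would emit one line per shard; these emit exactly one:
  LEGION_PRINT_ONCE(runtime, ctx, stdout, "step complete\n");
  runtime->print_once(ctx, stdout, "step complete\n");

  // A detach issued from a non-deterministic point (e.g. a GC
  // finalizer) should pass unordered=true so Legion can match it up
  // across shards instead of requiring an identical issue order.
  runtime->detach_external_resource(ctx, pr, true /*flush*/,
                                    true /*unordered*/);
}
```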