Distributed experiments:

When correctly configured, two 5Genesis platforms can run a distributed experiment, in which both platforms execute tasks in a coordinated manner and exchange information with each other. To use this functionality, the following conditions must be met:

  • A test case exists on each platform that defines that side's set of actions (including any necessary coordination).
  • The East/West interface of the ELCM is enabled on both sides, there is connectivity between the two instances, and connection details for the remote side's ELCM are defined.
  • The remote platforms are registered in the Dispatcher of both sides (see the Dispatcher documentation).

Optionally, in order to ease the creation of a valid experiment descriptor:

  • The East/West interface of the Portal is enabled on both sides, there is connectivity between the two instances, and connection details for the remote side's Portal are defined.

Defining a distributed experiment

The creation of a distributed experiment is a collaborative activity between the two platforms involved. Each platform is responsible for defining its own set of actions, since only its administrators know how their equipment is used, but the administrators of both platforms must agree on any coordination and information exchange required to execute the test case successfully.

The actual definition of the test case is very similar to that of a normal (non-distributed) experiment, but with the following differences:

  • The test case definition (YAML) must include an additional key: Distributed: True
  • A distributed experiment cannot be Custom (i.e. cannot define Parameters)
  • Additional task types are available (for coordination and information exchange)
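
As a rough sketch, a test case marked as distributed might look like the following. Only the Distributed: True key comes from the list above. The test case name, the Order/Task/Config layout of the task list and the Run.Message task are assumptions based on the usual ELCM test case format, and should be checked against the test case documentation.

```yaml
# Hypothetical test case for one side of a distributed experiment.
# The test case name and the task list layout are assumptions;
# only the Distributed key is taken from this document.
Distributed Example Main:
  - Order: 1
    Task: Run.Message          # assumed task, used only as a placeholder action
    Config:
      Severity: INFO
      Message: "Starting the Main side of the experiment"
Distributed: True              # marks the test case as distributed
```

No Parameters are defined here, since a distributed experiment cannot be Custom.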

The general workflow during a distributed experiment is as follows:

  • The Dispatcher of one of the platforms (the Main platform) receives a distributed experiment execution request, either from the Portal or through the Open APIs.
  • The Dispatcher performs the initial coordination, contacting the ELCM of its own platform and the Dispatcher of the remote platform (the Secondary platform).
  • Once the initial coordination is completed, the ELCMs on both sides communicate directly for the rest of the experiment execution.
  • Each side executes its tasks as normal until it reaches a point where it must coordinate with the other:
    • If one of the platforms must wait until the remote side has performed some actions:
      • The waiting platform can use the Remote.WaitForMilestone task.
      • The other platform can indicate that the actions have been performed using the Run.AddMilestone task.
    • If one of the platforms requires certain information from the remote side:
      • The querying platform can use the Remote.GetValue task.
      • The other platform can set the value requested using any of the Run.Publish, Run.PublishFromFile and Run.PublishFromPreviousTaskLog tasks.
  • Once both platforms have executed all their tasks, the Main platform requests all the generated files and results from the Secondary platform, so that they are saved along with the ones generated by the Main platform and made available to the experimenter.

Distributed-specific tasks

Remote.WaitForMilestone

Halts the execution of additional tasks until the remote side specifies that a certain milestone has been reached (using the Run.AddMilestone task). Configuration values:

  • Milestone: Name of the milestone to wait for.
  • Timeout: Custom timeout for this particular request. If not specified, the value configured in the East/West section of the configuration is used.

Init, PreRun, Run, PostRun, Finished, Cancelled and Errored are valid milestone names that are automatically added (if/when reached) in all experiment executions.
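
The waiting task and Run.AddMilestone work as a pair, one on each platform. A minimal sketch, assuming the Order/Task/Config task layout of ELCM test cases and that Run.AddMilestone takes the milestone name in a Milestone key (neither is specified in this section). The milestone name ServerReady, the timeout value and its unit are hypothetical:

```yaml
# Fragment of the test case on the side that performs the actions.
- Order: 1
  Task: Run.AddMilestone
  Config:
    Milestone: ServerReady     # hypothetical name; the key name is an assumption
---
# Fragment of the test case on the side that must wait.
- Order: 1
  Task: Remote.WaitForMilestone
  Config:
    Milestone: ServerReady     # must match the name added by the remote side
    Timeout: 60                # optional; the unit is assumed to be seconds
```

Waiting for one of the automatically added names (for example Finished) would not require any Run.AddMilestone task on the remote side.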

Remote.GetValue

Halts the execution of additional tasks until a certain value can be obtained from the remote side (using any of the Run.Publish, Run.PublishFromFile and Run.PublishFromPreviousTaskLog tasks). When received, the value will be published internally and available for variable expansion. Configuration values:

  • Value: Name of the value to request.
  • PublishName: Name to use when publishing the value. If not specified, the name given in Value is used.
  • Timeout: Custom timeout for this particular request. If not specified, the value configured in the East/West section of the configuration is used.
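
The value exchange can be sketched in the same way. The Value and PublishName keys are the ones described above. The Order/Task/Config layout, the name/value form of the Run.Publish configuration, the Run.Message task and the @[...] expansion syntax are assumptions that should be verified against the ELCM task documentation. The names and the address are made up for illustration:

```yaml
# Fragment of the test case on the side that owns the information.
- Order: 1
  Task: Run.Publish
  Config:
    ServerIp: "192.0.2.10"     # hypothetical name/value pair; format assumed
---
# Fragment of the test case on the side that needs the information.
- Order: 1
  Task: Remote.GetValue
  Config:
    Value: ServerIp            # name under which the remote side published it
    PublishName: RemoteServerIp  # local name; defaults to the Value name
- Order: 2
  Task: Run.Message
  Config:
    Severity: INFO
    Message: "Remote server is at @[RemoteServerIp]"  # expansion syntax assumed
```

Giving the value a different local name through PublishName avoids overwriting a value that the local side may already have published under the same name.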