New feature: live monitoring of CalcJobs
#5621
Unanswered
sphuber
asked this question in
Show and tell
Various users have expressed the desire for AiiDA to automatically "monitor" the progress of a `CalcJob` and potentially kill it under certain conditions. See for example this issue by @espenfl. A concrete example would be to monitor a `PwCalculation` or `VaspCalculation` and look at the self-consistent convergence: if it seems to stagnate, one might prefer to kill the job rather than waste valuable compute resources by letting it run out the allotted wallclock time.
@ramirezfranciscof and I have been looking at implementing this feature and have come up with two working proofs of concept. We want to briefly present them here, giving a quick overview of the technical implementation and the design choices.
It would be good to get feedback from you on whether this satisfies any use cases you may have, or whether another design or other options would be required.
We will present this very quickly (max. 10 minutes) during the AiiDA team meeting of September 7. This will serve just to briefly showcase the problem and the solution, and to see whether other people are interested in discussing this in a separate meeting. We will not be discussing details during the AiiDA meeting itself.
Current implementations:
As mentioned, there are currently two implementations:
Stand-alone
@ramirezfranciscof implemented a version that is compatible with the current version of `aiida-core`. It is provided by means of a stand-alone plugin package, `aiida-calcmonitor`. This was implemented to meet one of the requirements for an upcoming deliverable of a project he is working on, so many of the constraints he was working under were imposed by practical time limitations rather than being actual design choices. Although a bit clunky, it still showcases some of the important desired properties (such as the pluginability through the Data node sub-typing) and remains useful as a reference for the expected behaviour of the use case.
The basic gist is that when a `CalcJob` that should be monitored is launched, a separate `CalcJob` instance, called `CalcjobMonitor`, is launched in parallel to perform the monitoring. Monitors are Python classes that can be registered using entry points and attached to a `CalcJob` through such `CalcjobMonitor` jobs. The desired monitors are inputs to the `CalcjobMonitor`. An example can be found in the repository.
Integrated
An alternative solution is to integrate the functionality directly into the engine of AiiDA. However, this requires changes to `aiida-core`. A working proof-of-concept can be found in this branch. The concept is explained in detail in this section of the documentation.
To give a quick idea of how it would work:
Define a monitor function:
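The original snippet is not reproduced here, so the following is only a minimal sketch of what such a monitor might look like, based on the description in this post: a function that receives the node and a transport, and returns a string if the job should be killed or `None` otherwise. The file name `aiida.out`, the `ERROR` marker and the `FakeTransport` stand-in (used so that the sketch runs without an AiiDA installation) are assumptions made for illustration. `getfile(remotepath, localpath)` mirrors the method of the same name on AiiDA transports.

```python
import os
import tempfile


class FakeTransport:
    """Hypothetical stand-in for an AiiDA transport, backed by an in-memory dict."""

    def __init__(self, files):
        self._files = files  # maps remote file name -> file content

    def getfile(self, remotepath, localpath):
        """Copy the content of a 'remote' file to a local path."""
        with open(localpath, 'w', encoding='utf-8') as handle:
            handle.write(self._files[remotepath])


def monitor(node, transport):
    """Return a message if the job should be killed, or None to let it continue."""
    with tempfile.TemporaryDirectory() as dirpath:
        # Fetch the (assumed) output file from the remote working directory.
        localpath = os.path.join(dirpath, 'aiida.out')
        transport.getfile('aiida.out', localpath)
        with open(localpath, encoding='utf-8') as handle:
            content = handle.read()

    if 'ERROR' in content:
        return 'Detected the string `ERROR` in the output file.'
    return None


# The node is unused in this sketch, so None is passed in its place.
healthy = monitor(None, FakeTransport({'aiida.out': 'iteration 1 converging\n'}))
failing = monitor(None, FakeTransport({'aiida.out': 'iteration 2 ERROR\n'}))
```

With a real transport, the same function body would fetch the file from the job's working directory on the remote computer.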
Register it with an entry point in the `aiida.calculations.monitors` group:
Launch a `CalcJob` and add any number of monitors through the `monitors` input:
When this job is started and the engine enters the scheduler update phase, in which it polls the scheduler at intervals to check whether the job has finished, it will also call all monitors one by one at each scheduler check. It passes the `CalcJobNode` for reference, as well as an instance of the `Transport` that gives access to the working directory of the job on the remote computer. This can be used to retrieve and inspect output files. If a monitor returns a string, the engine will issue a kill command for the job. Note that the transport is retrieved by the engine from the pool, so all connection throttling mechanisms are respected.
Comments
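For reference, the monitoring cycle described before this section, together with the `options` pass-through discussed in this section, can be pictured with a small stand-alone simulation. This is not the engine code: the `REGISTRY` dict stands in for entry point loading, a plain dict stands in for the transport, and `stagnation_monitor`, `demo.stagnation` and `threshold` are hypothetical names chosen for this sketch.

```python
def stagnation_monitor(node, transport, threshold=1e-3):
    """Hypothetical monitor: ask for a kill when the last two residuals barely differ."""
    residuals = transport['residuals']
    if len(residuals) >= 2 and abs(residuals[-1] - residuals[-2]) < threshold:
        return f'convergence stagnated at residual {residuals[-1]}'
    return None


# Stand-in for looking up entry points in the `aiida.calculations.monitors` group.
REGISTRY = {'demo.stagnation': stagnation_monitor}


def run_monitors(node, transport, monitors):
    """Call each monitor in turn and return the first kill message, or None."""
    for name, spec in monitors.items():
        function = REGISTRY[spec['entry_point']]
        options = spec.get('options', {})
        message = function(node, transport, **options)
        if isinstance(message, str):
            return f'{name}: {message}'
    return None


def poll(snapshots, monitors, node=None):
    """Simulate the scheduler checks, with one working-directory snapshot per check."""
    for check, workdir in enumerate(snapshots, start=1):
        message = run_monitors(node, workdir, monitors)
        if message is not None:
            # At this point the engine would issue the kill command.
            return check, message
    return None, None


snapshots = [
    {'residuals': [1.0]},
    {'residuals': [1.0, 0.5]},
    {'residuals': [1.0, 0.5, 0.4999]},
]
monitors = {
    'stagnation': {'entry_point': 'demo.stagnation', 'options': {'threshold': 1e-3}},
}
killed_at, reason = poll(snapshots, monitors)
```

In this run the monitor stays silent for the first two checks and requests a kill at the third, once the residual has stopped changing by more than the threshold.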
I chose to use a `Dict` as the type for the `monitors` input namespace, even though it currently only contains the `entry_point` key. One could imagine that monitors could allow certain options to tune their logic. As long as these options are JSON serializable, we could add an `options` key to each `Dict`, which could then be passed to the monitor as `monitor(node, transport, **options)`. This would allow each monitor to define `kwargs` to provide some customizability. Update: this is now actually implemented.
Current limitations:
`bash` can be run on the remote by default, a user-defined monitor has to be bash (which is probably not that user-friendly) and the script will be scheduler dependent. There might be solutions to this, but unless this is a critical use case, it is probably not worth looking at this at the moment.