New feature: live monitoring of `CalcJobs` #5621

sphuber · 2022-08-26T11:43:25Z

Various users have expressed the desire for AiiDA to automatically "monitor" the progress of a CalcJob and potentially kill it under certain conditions. See for example this issue by @espenfl. A concrete example would be to monitor a PwCalculation or VaspCalculation and look at the self-consistent convergence: if it seems to stagnate, one might prefer to kill the job rather than waste valuable compute resources by letting it run until the allotted wallclock time is exhausted.

@ramirezfranciscof and I have been looking at implementing this feature and have come up with two working proofs of concept. We want to briefly present them here, giving a quick overview of the technical implementation and the design choices.

It would be good to get your feedback on whether this satisfies any use cases you may have, or whether another design or other options would be required.

We will present this very quickly (max. 10 minutes) during the AiiDA team meeting of September 7. This will serve just to very briefly showcase the problem and solution and see whether other people are interested in discussing this in a separate meeting. We will not be discussing details during the AiiDA meeting itself.

Current implementations:

As mentioned, there are currently two implementations:

Stand-alone (@ramirezfranciscof)
Integrated (@sphuber)

Stand-alone

@ramirezfranciscof implemented a version that is compatible with the current version of aiida-core. It is provided as a stand-alone plugin package, aiida-calcmonitor. This was implemented to satisfy one of the requirements for an upcoming deliverable of a project he is working on, so many of the constraints he was working under were imposed by practical time limitations rather than by actual design choices. Although a bit clunky, it still showcases some of the important desired properties (such as the pluginability through the Data node sub-typing) and remains useful as a reference for the expected behaviour of the use case.

The basic gist is that when a CalcJob that should be monitored is launched, a separate CalcJob instance, called CalcjobMonitor, is launched in parallel to perform the monitoring. Monitors are Python classes that can be registered using entry points and attached to a CalcJob through such CalcjobMonitor jobs. The desired monitors are inputs to the CalcjobMonitor. An example can be found in the repository:

import time

from aiida import orm
from aiida.engine import submit
from aiida.plugins import CalculationFactory, DataFactory

# SUBMIT TEST
ToymodelCalcjob = CalculationFactory('calcmonitor.toymodel_calcjob')
toymodel_builder = ToymodelCalcjob.get_builder()
toymodel_builder.runtime_seconds = orm.Int(300)
toymodel_builder.code = orm.load_code('toymodel@localhost')
calcjob_node = submit(toymodel_builder)

# Wait until the remote working directory exists so the monitor can be pointed at it
while 'remote_folder' not in calcjob_node.outputs:
    time.sleep(0.1)
    theoutputs = list(calcjob_node.outputs)
    print(theoutputs)

# MONITOR TEST
ToymodelMonitor = DataFactory('calcmonitor.monitor.toymodel')
monitor_protocol = ToymodelMonitor({
    'sources': {
        'toy_output': {'filepath': 'tester.out', 'refresh_rate': 10},
        'extra_file': {'filepath': 'inexistent.txt', 'refresh_rate': 10},
    },
    'options': {},
    'retrieve': ['tester.out'],
})
MonitorCalcjob = CalculationFactory('calcmonitor.calcjob_monitor')
monitor_builder = MonitorCalcjob.get_builder()
monitor_builder.code = orm.load_code('monitor@localhost-aiida')
monitor_builder.monitor_protocols = {'monitor1': monitor_protocol}
monitor_builder.monitor_folder = calcjob_node.outputs.remote_folder
calcjob_node = submit(monitor_builder)

Integrated

An alternative solution is to integrate the functionality directly into the engine of AiiDA. However, this requires changes to aiida-core. A working proof-of-concept can be found in this branch. The concept is explained in detail in this section of the documentation.

To give a quick idea of how it would work:

Define a monitor function:

from __future__ import annotations

import tempfile

from aiida.orm import CalcJobNode
from aiida.transports import Transport


def monitor(node: CalcJobNode, transport: Transport, **kwargs) -> str | None:
    """Retrieve and inspect files in working directory of job to determine whether the job should be killed.

    :param node: The node representing the calculation job.
    :param transport: The transport that can be used to retrieve files from remote working directory.
    :returns: A string if the job should be killed, `None` otherwise.
    """
    # Copy the remote file into a local temporary file and read its content
    with tempfile.NamedTemporaryFile('w+') as handle:
        transport.getfile('some-file.txt', handle.name)
        handle.seek(0)
        output = handle.read()

    if 'problem' in output:
        return "The calculation has encountered a problem so we're aborting."

    return None

Register it with an entry point in the aiida.calculations.monitors group:

[project.entry-points."aiida.calculations.monitors"]
"core.always_kill" = "aiida.calculations.monitors.base:always_kill"

Launch a CalcJob and add any number of monitors through the monitors input:

from aiida.engine import run
from aiida.orm import Dict, Int, load_code

builder = load_code('add@localhost').get_builder()
builder.x = Int(1)
builder.y = Int(2)
builder.monitors = {'always_kill': Dict({'entry_point': 'core.always_kill'})}
_, node = run.get_node(builder)
assert node.is_killed

Once this job has started, the engine enters the scheduler update phase, in which it polls the scheduler at intervals to check whether the job has finished. At each scheduler check it will also call all monitors one by one. It will pass the CalcJobNode for reference as well as an instance of the Transport that gives access to the working directory of the job on the remote computer. This can be used to retrieve and inspect output files. If a monitor returns a string, the engine will issue a kill command for the job. Note that the transport will be retrieved by the engine from the pool, so all connection throttling mechanisms will be respected.
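The call order described above can be sketched in plain Python. This is only an illustration and not the engine's code: `call_monitors`, `poll_job`, the two toy monitors and the dictionary used in place of a transport are all hypothetical names introduced here, and the real engine works asynchronously and requests the transport from the pool.

```python
from typing import Callable, Dict, Optional

# Hypothetical stand-in for a monitor: takes (node, transport) and returns a
# message if the job should be killed, or None to let it continue.
Monitor = Callable[[object, object], Optional[str]]


def call_monitors(node, transport, monitors: Dict[str, Monitor]) -> Optional[str]:
    """Call each monitor in turn and return the first kill message, if any."""
    for name, monitor in monitors.items():
        message = monitor(node, transport)
        if message is not None:
            return f'{name}: {message}'
    return None


def poll_job(node, transport, monitors, job_is_done, max_polls=10):
    """Mimic the scheduler update phase: one pass over the monitors per scheduler check."""
    for poll in range(1, max_polls + 1):
        if job_is_done(poll):
            return 'finished', poll
        message = call_monitors(node, transport, monitors)
        if message is not None:
            # At this point the real engine would issue the kill command.
            return f'killed ({message})', poll
    return 'still running', max_polls


def never_kill(node, transport):
    return None


def kill_on_third_check(node, transport):
    # Toy condition: the fake transport only counts how often it was consulted.
    transport['checks'] += 1
    return 'stagnating' if transport['checks'] >= 3 else None


state = poll_job(
    node=None,
    transport={'checks': 0},
    monitors={'quiet': never_kill, 'watchdog': kill_on_third_check},
    job_is_done=lambda poll: poll >= 5,
)
print(state)
```

In this sketch the first monitor that returns a string ends the loop, so later monitors are not consulted for that check; whether the actual implementation behaves the same way is an assumption.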

Comments

Currently the monitoring interval is tied to the interval of the scheduler update. Although it is probably technically possible to decouple this and give the monitoring its own interval, this will require some more work on the engine. It is also not quite clear whether this would always be beneficial, since the monitoring, just like the updating, will be restricted by the interval at which the transport becomes available through the throttling mechanism. Even if the monitoring were to be more frequent than the scheduler updates, in practice they might often follow the same rhythm.
I chose to use a Dict as the type for the monitors input namespace, even though it currently only contains the entry_point key. One could imagine that monitors could allow certain options to tune their logic. As long as these options would be JSON serializable, we could add an options key to each Dict which could then be passed to the monitor as monitor(node, transport, **options). This would allow each monitor to define kwargs to provide some customizability. This is now actually implemented.
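As a rough sketch of how such options could reach a monitor, the example below passes an `options` dictionary as keyword arguments to a monitor that checks for stagnating convergence. Everything here is hypothetical: `FakeTransport` only imitates the `getfile` call used in the monitor example above, the `hypothetical.stagnation` entry point name does not exist, and the file format (one error estimate per line) is invented for illustration.

```python
import os
import tempfile
from typing import Optional


class FakeTransport:
    """Hypothetical stand-in for a transport that serves files from a dict."""

    def __init__(self, files: dict):
        self.files = files

    def getfile(self, remotepath: str, localpath: str) -> None:
        with open(localpath, 'w') as handle:
            handle.write(self.files[remotepath])


def stagnation_monitor(node, transport, filename='errors.txt', window=3) -> Optional[str]:
    """Return a message if the error estimate did not improve over the last `window` iterations."""
    with tempfile.TemporaryDirectory() as dirpath:
        localpath = os.path.join(dirpath, 'copy.txt')
        transport.getfile(filename, localpath)
        with open(localpath) as handle:
            errors = [float(line) for line in handle if line.strip()]

    if len(errors) <= window:
        return None  # Not enough iterations yet to judge

    best_before = min(errors[:-window])
    if min(errors[-window:]) >= best_before:
        return f'No improvement in the last {window} iterations.'
    return None


# The `monitors` input written as plain dicts; in AiiDA each value would be a `Dict` node.
monitors = {
    'stagnation': {
        'entry_point': 'hypothetical.stagnation',
        'options': {'filename': 'errors.txt', 'window': 3},
    }
}
options = monitors['stagnation']['options']

stalled = FakeTransport({'errors.txt': '1.0\n0.5\n0.2\n0.2\n0.21\n0.2\n'})
converging = FakeTransport({'errors.txt': '1.0\n0.5\n0.2\n0.1\n0.05\n0.01\n'})

result_stalled = stagnation_monitor(None, stalled, **options)
result_converging = stagnation_monitor(None, converging, **options)
print(result_stalled, result_converging)
```

Because the options travel as keyword arguments, the same monitor can be reused with a different `window` or `filename` per job without registering a new entry point.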

Current limitations:

For both solutions, the monitoring is only active while AiiDA is running. This means that if an active job is being monitored, it won't be stopped if AiiDA is shut down. Technically it would be possible to come up with a decoupled design, but it would become a lot more complicated. It would require running a background process on the remote computer where the monitored job is running that can autonomously perform the intermittent checking of the job and kill it when needed. If we assume that nothing but bash can be run on the remote by default, a user-defined monitor has to be written in bash (which is probably not that user-friendly) and the script will be scheduler dependent. There might be solutions to this, but unless this is a critical use case, it is probably not worth looking at this at the moment.
The use of the entry point system for the monitors serves two purposes: it guarantees that the monitored job can be sent to the daemon and that the daemon workers can properly load and execute the monitor functions, and it also helps with reproducibility. Allowing users to define a monitor on the fly, for example in a shell or notebook, would increase the user-friendliness, but it would also increase the complexity of the implementation. I think it should be possible to store the function somehow in the database (probably as a pickle in the repository), but I am not sure yet how difficult it would be to have the engine automatically serialize the input and have the daemon worker deserialize it and execute it. If people think this is useful, I can investigate this direction to see how to implement it and whether there are any limitations or problems.
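One difficulty with the pickle idea can be shown with the standard library alone. The sketch below is not part of either proof of concept; it only illustrates that the standard `pickle` module stores a function as a reference to its module and name rather than its code, so a function defined on the fly is either not picklable or only loadable where that name can be imported.

```python
import json
import pickle

# A function that is importable by name is pickled as a reference
# (module + qualified name), not as its source or bytecode.
payload = pickle.dumps(json.dumps)
restored = pickle.loads(payload)
by_reference = restored is json.dumps

# A function without an importable name, such as a lambda typed into a shell,
# cannot be pickled by the standard library at all.
try:
    pickle.dumps(lambda node, transport: None)
    lambda_pickled = True
except (pickle.PicklingError, AttributeError):
    lambda_pickled = False

print(by_reference, lambda_pickled)
```

A daemon worker would therefore need either an importable definition of the monitor, which is what the entry point provides, or a different serialization approach that stores the function body itself.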
