Introduce Time Limits #210

Open

pbelmann opened this issue Jul 20, 2022 · 0 comments

Assumption

We do not have the information whether a process failed because of a timeout, a failure in the execution environment, or
because not enough resources were assigned. Evaluating the exit code of tools does not always work; even in the case
of Megahit and Metaspades it is just a guess.

Issue definition

Nextflow time limits should help us regarding three specific issues:

  1. If a process fails because not enough resources were assigned to it,
    the process should be retried with more resources. Increasing the time limit is not necessary, since the process then has access to more resources, and more resources (cores/RAM) should lead to faster execution.

@nkleinbo posted a possible solution, which looks like the following example (shown here without the time limit):

withLabel:process_high {
    cpus   = { check_max( 12    * task.attempt, 'cpus'    ) }
    memory = { check_max( 72.GB * task.attempt, 'memory'  ) }
    //time   = { check_max( 16.h  * task.attempt, 'time'    ) }
}

We already apply a similar pattern for Megahit, where we evaluate the exit code and, based on it, either increase the memory or simply retry the process.
However, for most processes we have a default setting in our resources list that should allow all processes (with the exception of Megahit) to run without out-of-memory errors. If the user modifies the flavour definitions, the processes can still fail.
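A minimal sketch of such an exit-code-based strategy in a Nextflow config; the exit codes (137 for kills, e.g. OOM, and 250 as a guessed out-of-memory indicator) and the process name are illustrative assumptions, not our actual settings:

    withName: megahit {
        // Treat 137 and 250 as retryable out-of-memory conditions (a guess,
        // as noted above); everything else terminates the run.
        errorStrategy = { task.exitStatus in [137, 250] ? 'retry' : 'terminate' }
        memory        = { 72.GB * task.attempt }
        maxRetries    = 3
    }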

  2. There are processes that have enough resources but could run forever because of the runtime of their algorithm. One example is
    SMETANA, which without restrictions could run for months without finishing. Another example is SCAPP (see the explanation here: SCAPP running for a long time and then, segfault Shamir-Lab/SCAPP#19).

  3. A process could run quite long because of errors in the execution environment (example: fastp stuck #199).

Proposed Solution

Regarding 1): We could adopt nf-core's solution, which would use our resource definitions. However, in my opinion it should be possible to disable this mode if our default resource definitions are not modified. The reason is that if there is an error in the execution environment (e.g. disk errors on a node), all jobs of that node would otherwise be rescheduled to a larger VM.
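A sketch of how such a switch could look, assuming a `params.scaleResourcesOnRetry` flag (the flag name is hypothetical) on top of the nf-core `check_max` pattern:

    params.scaleResourcesOnRetry = true

    process {
        withLabel:process_high {
            // Only multiply by the attempt number when scaling is enabled.
            cpus   = { check_max( 12    * (params.scaleResourcesOnRetry ? task.attempt : 1), 'cpus'   ) }
            memory = { check_max( 72.GB * (params.scaleResourcesOnRetry ? task.attempt : 1), 'memory' ) }
        }
    }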

Regarding 2): For these jobs a dedicated time limit should be specified. The ideal solution would be to get the reason why the process
failed (e.g. timeout). If it was a timeout, the process should not be retried; otherwise it would be ok to retry with more resources.
Since this is not possible, we can use the proposed solution in 1) with a fixed time limit.

Regarding 3): In this case a retry on a timeout without modifying the resource limits would be enough.
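For 2) and 3) this could be expressed as fixed per-process limits. The concrete values follow the examples in this issue, but the process names and error strategies are assumptions:

    withName: smetana {
        time = 5.d               // hard cap for an algorithm that could otherwise run for months
        errorStrategy = 'finish' // do not retry on a timeout
    }
    withName: fastp {
        time = 1.d               // should never be reached on a normal run
        errorStrategy = 'retry'  // retry with unchanged resources on environment errors
        maxRetries = 2
    }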

Todo

  • 1. All processes should be able to request more resources in case of an error. This mode should be an option that can be disabled in the config file because of the aforementioned concerns.

  • 2. Specific processes such as SCAPP and SMETANA must use a time limit, which should be set to multiple days. This time limit must be scaled with the assigned resources. Example: if SCAPP should have a time limit of 5 days when 14 cores are assigned, then
    the time limit should be doubled if only 7 cores are assigned by the user.

  • 3. All processes must have a time limit defined that will never be reached on a normal execution with large datasets without errors in the execution environment (Example Time Limit for Fastp: 1 day). The time limit should also be increased if the assigned resources are decreased (See todo point 2).
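The inverse scaling from todo points 2 and 3 could be sketched like this, assuming a hypothetical `params.scappCores` parameter (14 cores / 5 days are the figures from point 2):

    withName: scapp {
        cpus = { params.scappCores }
        // Scale the limit inversely with the assigned cores:
        // 14 cores -> 5 days, 7 cores -> 10 days.
        time = { 5.d * 14 / params.scappCores }
    }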
