Introduce Time Limits #210

Open

pbelmann opened this issue Jul 20, 2022 · 0 comments

Assumption

We do not have the information whether a process failed because of a timeout, a failure in the execution environment, or
because not enough resources were assigned. Evaluating the exit code of tools does not always work; even in the case
of Megahit and Metaspades it is just a guess.

Issue definition

Nextflow time limits should help us regarding three specific issues:

  1. If a process fails because not enough resources were assigned to it,
    the process should be retried with more resources. Increasing the time limit is not necessary, since the process then has access to more resources, and more resources (cores/RAM) should lead to faster execution.

@nkleinbo posted a possible solution, which looks like the following example (shown here without the time limit):

withLabel:process_high {
    cpus   = { check_max( 12    * task.attempt, 'cpus'    ) }
    memory = { check_max( 72.GB * task.attempt, 'memory'  ) }
    //time   = { check_max( 16.h  * task.attempt, 'time'    ) }
}

We already apply a similar pattern for Megahit, where we evaluate the exit code and, based on it, either increase the memory or simply retry the process.
However, for most processes we have a default setting in our resources list that should allow all processes (with the exception of Megahit) to run without out-of-memory errors. If the user modifies the flavour definitions, the processes can still fail.
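A minimal sketch of such an exit-code-based strategy in a Nextflow config; the exit codes (137 for kills, e.g. OOM, and 250 as a guessed out-of-memory indicator) and the process name are illustrative assumptions, not our actual settings:

    withName: megahit {
        // Treat 137 and 250 as retryable out-of-memory conditions (a guess,
        // as noted above); everything else terminates the run.
        errorStrategy = { task.exitStatus in [137, 250] ? 'retry' : 'terminate' }
        memory        = { 72.GB * task.attempt }
        maxRetries    = 3
    }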

  2. There are processes that have enough resources but could run forever because of the runtime of their algorithm. One example is
    SMETANA, which without restrictions could run for months without finishing. Another example is SCAPP (see the explanation here: SCAPP running for a long time and then, segfault Shamir-Lab/SCAPP#19).

  3. A process could run quite long because of errors in the execution environment (example: fastp stuck #199).

Proposed Solution

Regarding 1): We could adopt nf-core's solution, which would use our resource definitions. However, in my opinion it should be possible to disable this mode if our default resource definitions are not modified. The reason is that if there is an error in the execution environment (e.g. disk errors on a node), all jobs of that node would otherwise be rescheduled to a larger VM.
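A sketch of how such a switch could look, assuming a `params.scaleResourcesOnRetry` flag (the flag name is hypothetical) on top of the nf-core `check_max` pattern:

    params.scaleResourcesOnRetry = true

    process {
        withLabel:process_high {
            // Only multiply by the attempt number when scaling is enabled.
            cpus   = { check_max( 12    * (params.scaleResourcesOnRetry ? task.attempt : 1), 'cpus'   ) }
            memory = { check_max( 72.GB * (params.scaleResourcesOnRetry ? task.attempt : 1), 'memory' ) }
        }
    }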

Regarding 2): For these jobs a dedicated time limit should be specified. The ideal solution would be to get the reason why the process
failed (e.g. timeout). If it was a timeout, the process should not be retried; otherwise it would be ok to retry with more resources.
Since this is not possible, we can use the proposed solution in 1) with a fixed time limit.

Regarding 3): In this case a retry on a timeout without modifying the resource limits would be enough.
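For 2) and 3) this could be expressed as fixed per-process limits. The concrete values follow the examples in this issue, but the process names and error strategies are assumptions:

    withName: smetana {
        time = 5.d               // hard cap for an algorithm that could otherwise run for months
        errorStrategy = 'finish' // do not retry on a timeout
    }
    withName: fastp {
        time = 1.d               // should never be reached on a normal run
        errorStrategy = 'retry'  // retry with unchanged resources on environment errors
        maxRetries = 2
    }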

Todo

  • 1. All processes should be able to request more resources in case of an error. This mode should be an option that can be disabled in the config file because of the aforementioned concerns.

  • 2. Specific processes such as SCAPP and SMETANA must use a time limit, which should be set to multiple days. This time limit must be scaled with the assigned resources. Example: if SCAPP should have a time limit of 5 days when 14 cores are assigned, then
    the time limit should be doubled if only 7 cores are assigned by the user.

  • 3. All processes must have a time limit defined that will never be reached on a normal execution with large datasets without errors in the execution environment (Example Time Limit for Fastp: 1 day). The time limit should also be increased if the assigned resources are decreased (See todo point 2).
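The inverse scaling from todo points 2 and 3 could be sketched like this, assuming a hypothetical `params.scappCores` parameter (14 cores / 5 days are the figures from point 2):

    withName: scapp {
        cpus = { params.scappCores }
        // Scale the limit inversely with the assigned cores:
        // 14 cores -> 5 days, 7 cores -> 10 days.
        time = { 5.d * 14 / params.scappCores }
    }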
