fix: #41 cgroup confinement #18

Closed
wants to merge 16 commits

Conversation

cmeesters
Member

see snakemake/snakemake-executor-plugin-slurm#41

Will require adapting the snakemake-executor-plugin-slurm, too.

@cmeesters changed the title from Fix/#41 cgroup confinement to fix/#41 cgroup confinement Feb 29, 2024
@cmeesters changed the title from fix/#41 cgroup confinement to fix: #41 cgroup confinement Feb 29, 2024
# the job can utilize the c-group's resources.
# Note, if a job asks for more threads than cpus_per_task, we need to
# limit the number of cpus to the number of threads.
cpus = min(job.resources.get("cpus_per_task", 1), job.threads)
Contributor

shouldn't this be max instead of min, according to your description?

Member Author
@cmeesters cmeesters Mar 12, 2024

Edit: @johanneskoester, actually no. Scenarios:

  • the cpus_per_task resource is set, so it is >= 1; if job.threads is bigger, the job cannot utilize the extra threads, hence the minimum must be taken, or else the job is throttled within the c-group
  • cpus_per_task is unset, so implicitly 1. The minimum of 1 must be taken (otherwise job.threads would mean oversubscription, which does not work either).
  • ideally, cpus_per_task and job.threads are not in conflict, because the slurm executor supersedes job.threads with cpus_per_task taken from the resources section. In essence it should not matter; the min() call was intended as a precaution against the aforementioned scenarios (see the sketch below this list).
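A minimal sketch of that precaution, with hypothetical values for job.threads and the cpus_per_task resource:

# Scenario 1: cpus_per_task is set and smaller than the requested threads;
# the c-group confines the job, so only cpus_per_task CPUs are usable.
resources = {"cpus_per_task": 2}
threads = 8
cpus = min(resources.get("cpus_per_task", 1), threads)  # -> 2

# Scenario 2: cpus_per_task is unset, hence implicitly 1; asking for more
# would mean oversubscription within the allocation.
resources = {}
cpus = min(resources.get("cpus_per_task", 1), threads)  # -> 1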

Member Author

I hope my edit makes sense ...

Contributor

Doesn't this mean that if cpus_per_task is not set, then cpus will always be 1? Sorry if this is a stupid question and you have answered it before, but that seems wrong to me.

Member Author
@cmeesters cmeesters Apr 2, 2024

That is correct. However, threads will set cpus_per_task if cpus_per_task is not set; see lines 128 to 138 of the submit plugin.

In fact, this whole check is redundant. Should we rather erase it altogether and leave a note referring to the submit plugin, I wonder?
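For illustration only, a self-contained sketch of the kind of fallback meant here (a hypothetical helper, not a verbatim quote of lines 128 to 138 of the submit plugin): the threads value fills in whenever the cpus_per_task resource is absent.

def effective_cpus_per_task(resources: dict, threads: int) -> int:
    # Prefer an explicit cpus_per_task resource; otherwise fall back to threads.
    cpus_per_task = resources.get("cpus_per_task", threads)
    # Always request at least one CPU.
    return max(1, cpus_per_task)

# e.g. effective_cpus_per_task({}, 8) == 8
#      effective_cpus_per_task({"cpus_per_task": 4}, 8) == 4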

Contributor

Makes sense to me. Why keep code here if it is not needed, given that cpus_per_task is always configured correctly by the slurm plugin?

@cmeesters
Member Author

Not sure why the test breaks now, will look into it ...


if "mpi" in job.resources.keys():

if job.is_group():
Contributor

I deactivated this because it neglected the fact that resources within a group will mostly be unknown beforehand; furthermore, the topology does not really predict diverging runtimes of jobs properly, so it is unnecessarily inflexible to group them by topology levels. However, I have not thought this through entirely and I might be wrong.

Member Author

The thing is: right now, group jobs do not work correctly. As far as I can see, there are two places which need editing:

  • the submit executor, so that the CPU resources are threads (or cpus_per_task) times the number of jobs, confined to a single node unless we allow oversubscription (see the sketch below this list)
  • the jobstep executor, to handle one group job after the other (non-blocking, of course)
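A sketch of the submit-side calculation proposed in the first bullet, using a hypothetical helper and parameters; none of this is existing plugin code:

def group_cpu_request(per_job_cpus: int, n_jobs: int, cpus_per_node: int,
                      allow_oversubscription: bool = False) -> int:
    # CPUs for a group job: per-job CPUs (threads or cpus_per_task) times the
    # number of jobs in the group ...
    total = per_job_cpus * n_jobs
    if allow_oversubscription:
        return total
    # ... confined to a single node unless oversubscription is allowed.
    return min(total, cpus_per_node)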

Contributor

OK, let's discuss this in a VC :-)

# SLURM's limits: it is not able to limit the memory if we divide the
# job per CPU by itself.

level_mem = level_job.resources.get("mem_mb")
Contributor

Suggested change:
- level_mem = level_job.resources.get("mem_mb")
+ level_mem = level_job.resources.get("mem_mb", 0)

Contributor

In addition, first check if mem_per_cpu is set.
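A sketch of that order of checks, assuming the mem_per_cpu and mem_mb resource keys and a hypothetical helper name:

def level_memory(resources: dict, cpus: int) -> int:
    # Prefer a per-CPU memory specification when it is given ...
    mem_per_cpu = resources.get("mem_per_cpu")
    if mem_per_cpu:
        return mem_per_cpu * cpus
    # ... otherwise fall back to the total job memory, defaulting to 0.
    return resources.get("mem_mb", 0)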

@johanneskoester
Contributor

This can likely be closed in favor of #23.

@cmeesters cmeesters closed this Apr 11, 2024
@cmeesters cmeesters deleted the fix/#41_cgroup_confinement branch April 11, 2024 15:28