fix: #41 cgroup confinement #18

Closed
wants to merge 16 commits

Conversation

cmeesters
Member

see snakemake/snakemake-executor-plugin-slurm#41

Will require adapting the snakemake-executor-plugin-slurm, too.

@cmeesters changed the title from Fix/#41 cgroup confinement to fix/#41 cgroup confinement Feb 29, 2024
@cmeesters changed the title from fix/#41 cgroup confinement to fix: #41 cgroup confinement Feb 29, 2024
# the job can utilize the c-group's resources.
# Note, if a job asks for more threads than cpus_per_task, we need to
# limit the number of cpus to the number of threads.
cpus = min(job.resources.get("cpus_per_task", 1), job.threads)
Contributor

shouldn't this be max instead of min, according to your description?

Member Author
@cmeesters cmeesters Mar 12, 2024

Edit: @johanneskoester, actually no. Scenarios:

  • the cpus_per_task resource is set, so it is >= 1; if job.threads is bigger, the job cannot utilize the extra threads, hence the minimum must be taken, or else the job is throttled within the c-group
  • cpus_per_task is unset, so implicitly 1. The minimum of 1 must be taken (otherwise job.threads would mean oversubscription, which does not work either).
  • ideally, cpus_per_task and job.threads are not in conflict, because the slurm executor supersedes job.threads with cpus_per_task taken from the resources section. In essence it should not matter; the min() call was intended as a precaution against the aforementioned scenarios (see the sketch below this list).
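A minimal sketch of that precaution, with hypothetical values for job.threads and the cpus_per_task resource:

# Scenario 1: cpus_per_task is set and smaller than the requested threads;
# the c-group confines the job, so only cpus_per_task CPUs are usable.
resources = {"cpus_per_task": 2}
threads = 8
cpus = min(resources.get("cpus_per_task", 1), threads)  # -> 2

# Scenario 2: cpus_per_task is unset, hence implicitly 1; asking for more
# would mean oversubscription within the allocation.
resources = {}
cpus = min(resources.get("cpus_per_task", 1), threads)  # -> 1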

Member Author

I hope my edit makes sense ...

Contributor

Doesn't this mean that if cpus_per_task is not set, then cpus will always be 1? Sorry if this is a stupid question and you have answered it before, but that seems wrong to me.

Member Author
@cmeesters cmeesters Apr 2, 2024

That is correct. However, threads will set cpus_per_task if cpus_per_task is not set; see lines 128 to 138 of the submit plugin.

In fact, this whole check is redundant. Should we rather erase it altogether and leave a note referring to the submit plugin, I wonder?
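For illustration only, a self-contained sketch of the kind of fallback meant here (a hypothetical helper, not a verbatim quote of lines 128 to 138 of the submit plugin): the threads value fills in whenever the cpus_per_task resource is absent.

def effective_cpus_per_task(resources: dict, threads: int) -> int:
    # Prefer an explicit cpus_per_task resource; otherwise fall back to threads.
    cpus_per_task = resources.get("cpus_per_task", threads)
    # Always request at least one CPU.
    return max(1, cpus_per_task)

# e.g. effective_cpus_per_task({}, 8) == 8
#      effective_cpus_per_task({"cpus_per_task": 4}, 8) == 4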

Contributor

Makes sense to me. Why keep code here if it is not needed, given that cpus_per_task is always configured correctly by the slurm plugin?

@cmeesters
Member Author

Not sure why the test breaks now, will look into it ...


if "mpi" in job.resources.keys():

if job.is_group():
Contributor

I deactivated this because it neglected the fact that resources within a group will mostly be unknown beforehand; furthermore, the topology does not really predict diverging runtimes of jobs properly, so it is unnecessarily inflexible to group them by topology levels. However, I have not thought this through entirely and I might be wrong.

Member Author

The thing is: right now, group jobs do not work correctly. As far as I can see, there are two places which need editing:

  • the submit executor, so that the CPU resources are threads (or cpus_per_task) times the number of jobs, confined to a single node unless we allow oversubscription (see the sketch below this list)
  • the jobstep executor, to handle one group job after the other (non-blocking, of course)
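A sketch of the submit-side calculation proposed in the first bullet, using a hypothetical helper and parameters; none of this is existing plugin code:

def group_cpu_request(per_job_cpus: int, n_jobs: int, cpus_per_node: int,
                      allow_oversubscription: bool = False) -> int:
    # CPUs for a group job: per-job CPUs (threads or cpus_per_task) times the
    # number of jobs in the group ...
    total = per_job_cpus * n_jobs
    if allow_oversubscription:
        return total
    # ... confined to a single node unless oversubscription is allowed.
    return min(total, cpus_per_node)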

Contributor

OK, let's discuss this in a VC :-)

# SLURM's limits: it is not able to limit the memory if we divide the
# job per CPU by itself.

level_mem = level_job.resources.get("mem_mb")
Contributor

Suggested change:
- level_mem = level_job.resources.get("mem_mb")
+ level_mem = level_job.resources.get("mem_mb", 0)

Contributor

In addition, first check if mem_per_cpu is set.
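A sketch of that order of checks, assuming the mem_per_cpu and mem_mb resource keys and a hypothetical helper name:

def level_memory(resources: dict, cpus: int) -> int:
    # Prefer a per-CPU memory specification when it is given ...
    mem_per_cpu = resources.get("mem_per_cpu")
    if mem_per_cpu:
        return mem_per_cpu * cpus
    # ... otherwise fall back to the total job memory, defaulting to 0.
    return resources.get("mem_mb", 0)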

@johanneskoester
Contributor

This can likely be closed in favor of #23.

@cmeesters cmeesters closed this Apr 11, 2024
@cmeesters cmeesters deleted the fix/#41_cgroup_confinement branch April 11, 2024 15:28