
[guidance request] #351

Open
gongwei-130 opened this issue Nov 2, 2024 · 5 comments

Comments

@gongwei-130

Details

  • Slurm Version: 21.08.5
  • Python Version: 3.9.20
  • Cython Version: 3.0.11
  • PySlurm Branch: v21.08.4
  • Linux Distribution: Ubuntu 22.04.3 LTS

Issue

I am using pyslurm to build a service through which users submit Slurm jobs. However, it looks like Slurm treats all jobs as if they were submitted by the user who started the service. Is there any way to let the service use pyslurm to submit jobs on behalf of different users?

@tazend
Member

tazend commented Nov 6, 2024

Hi @gongwei-130,

you can do something like this:

import pyslurm

desc = pyslurm.JobSubmitDescription(
    user_id=<user-name or uid>,
    group_id=<group-name or gid>,
    # ... other job args ...
)
job_id = desc.submit()

It should be noted though that submitting a Job under a different User/Group ID requires root permissions.
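
For completeness, a minimal end-to-end sketch (the user name, group, script and path below are just placeholders; it assumes the service process itself runs as root):

import pyslurm

desc = pyslurm.JobSubmitDescription(
    name="service_submitted_job",       # placeholder job name
    script="#!/bin/bash\nhostname\n",   # trivial batch script for illustration
    user_id="alice",                    # the user to submit on behalf of
    group_id="alice",                   # and their group
    working_directory="/home/alice",    # placeholder path
)

job_id = desc.submit()
print(f"Submitted job {job_id} on behalf of alice")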

@gongwei-130
Author

@tazend sorry, I forgot to get back to you - it works fine.

An additional question: how do I use "job_desc.environment"? I have the following setting, which introduces an error; the error goes away if I remove the environment setting.

job_desc.environment = {
    "MY_VAR": "Hello",
    "MY_OTHER_VAR": "World",
}

error:
srun: error: mk-xii-01: task 0: Exited with exit code 2
slurmstepd-mk-xii-01: error: execve(): bash: No such file or directory

job_desc = pyslurm.JobSubmitDescription()
job_desc.name = 'wei_test'
job_desc.ntasks_per_node = 1
job_desc.cpus_per_task = 175
job_desc.memory_per_node = "50G"

job_desc.working_directory = '/home/weigong'
job_desc.standard_error = "/home/weigong/output/job_%j_node_%N_task_%t.err"
job_desc.standard_output = "/home/weigong/output/job_%j_node_%N_task_%t.out"
    
job_desc.nodes = 2
job_desc.gres_per_node = "gpu:8"

job_desc.partitions = 'onehour'

job_desc.environment = {
    "MY_VAR": "Hello",
    "MY_OTHER_VAR": "World",
}

@tazend
Member

tazend commented Nov 14, 2024

Hi,

Hmm, interesting - I'm not sure why that happens. I will need to try to reproduce/debug this on my side to see what's going on.
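
One thing you could try in the meantime (just a guess, untested): if setting environment replaces the job's whole inherited environment, then PATH would no longer be set when slurmstepd tries to execve() bash, which could explain the "No such file or directory" error. Merging the custom variables into the current environment, instead of passing only the two custom ones, would rule that out:

import os
import pyslurm

job_desc = pyslurm.JobSubmitDescription()
# ... other job args as in your snippet ...

# Start from the submitting process' environment so PATH (and friends)
# survive, then layer the custom variables on top. Untested sketch.
env = dict(os.environ)
env.update({
    "MY_VAR": "Hello",
    "MY_OTHER_VAR": "World",
})
job_desc.environment = env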

@gongwei-130
Author

@tazend more questions:

  1. How do I submit multi-node jobs?
    For multi-node jobs, I tried to submit a script and execute it on all nodes, as follows. But I found that if I don't put "srun" before the "bash test.sh", it only runs on compute node 0. Is requiring srun expected? I didn't see it mentioned in any document or example.

  2. If running with a container, how do I kill the container when the job is cancelled? The second script I used tries to catch SIGTERM and stop the container, but it cannot catch the signal. Any advice? Thanks.

def _build_start_script():
    start_command = "#! /bin/bash\n"
    start_command += "srun bash test.sh"
    return start_command

job_desc.script = _build_start_script()
job_id = job_desc.submit()

The second script, which starts the container and tries to catch SIGTERM:

container_name="${SLURM_JOB_ID}_container"

# Function to clean up Docker container on exit
cleanup() {
    echo "Stopping and removing container..."
    sudo docker stop "$container_name" > /dev/null 2>&1
    sudo docker rm "$container_name" > /dev/null 2>&1
    echo "Container stopped and removed."
}

trap cleanup SIGTERM
trap cleanup SIGKILL

sudo docker run -d --rm --tty --name $container_name --gpus all --ipc host bash_script.sh
sudo docker logs -f $container_name

wait

@tazend
Member

tazend commented Nov 26, 2024

Hi,

srun makes sure that whatever number of ntasks you have requested gets spawned on all the compute nodes in the allocation assigned to your job. If you want to launch processes across multiple nodes, then yes, srun is required. But this is nothing special to pyslurm - it is the standard way in Slurm to launch multi-node jobs, usually backed by something like MPI.
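
For example, a sketch along these lines (node count, task count and the script contents are placeholders) launches one task on each node of the allocation:

import pyslurm

# Hypothetical multi-node example: 2 nodes, 1 task per node.
script = "\n".join([
    "#!/bin/bash",
    # srun spawns the requested tasks on every node in the allocation;
    # without it, the batch script itself only runs on the first node.
    "srun bash test.sh",
])

job_desc = pyslurm.JobSubmitDescription(
    name="multi_node_example",   # placeholder name
    nodes=2,
    ntasks_per_node=1,
    script=script,
)
job_id = job_desc.submit()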

To your second question: I'm not sure it will work properly the way you have set up the script here. Note that SIGKILL cannot be caught, for example. The SIGTERM trap might also be ignored due to the way Slurm handles signals, though I'm not sure (you could experiment with something like scancel --full).

Instead of docker you could also have a look at an alternative called apptainer (singularity), which can also execute containers (including docker containers out of the box). It doesn't require any daemon or sudo setup like docker does, and is (in my opinion) better suited for integration with Slurm (and HPC in general):

https://apptainer.org/
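
As a rough illustration (assuming apptainer is installed on the compute nodes; the image and the script name are placeholders), the job script could then run the container in the foreground, without a daemon or sudo:

import pyslurm

# Sketch only: docker://... images are pulled and converted by apptainer on first use.
script = "\n".join([
    "#!/bin/bash",
    # apptainer runs as a child of the job step, so it receives the job's
    # signals and is torn down when the job is cancelled.
    "srun apptainer exec --nv docker://ubuntu:22.04 bash bash_script.sh",
])

job_desc = pyslurm.JobSubmitDescription(
    name="apptainer_example",    # placeholder name
    nodes=1,
    ntasks_per_node=1,
    gres_per_node="gpu:8",       # matches your earlier example; adjust as needed
    script=script,
)
job_id = job_desc.submit()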
