
[guidance request] #351

Open
gongwei-130 opened this issue Nov 2, 2024 · 5 comments

Comments

@gongwei-130

Details

  • Slurm Version: 21.08.5
  • Python Version: 3.9.20
  • Cython Version: 3.0.11
  • PySlurm Branch: v21.08.4
  • Linux Distribution: Ubuntu 22.04.3 LTS

Issue

I am using pyslurm to build a service through which users submit Slurm jobs. However, it looks like Slurm treats all jobs as if they were submitted by the user who started the service. Is there any way to let the service use pyslurm to submit jobs on behalf of different users?

@tazend
Member

tazend commented Nov 6, 2024

Hi @gongwei-130,

you can do something like this:

import pyslurm

desc = pyslurm.JobSubmitDescription(
    user_id=<user-name or uid>,
    group_id=<group-name or gid>,
    # ... other job args ...
)
job_id = desc.submit()

It should be noted though that submitting a Job under a different User/Group ID requires root permissions.
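
For completeness, a minimal end-to-end sketch (the user name, group, script and path below are just placeholders; it assumes the service process itself runs as root):

import pyslurm

desc = pyslurm.JobSubmitDescription(
    name="service_submitted_job",       # placeholder job name
    script="#!/bin/bash\nhostname\n",   # trivial batch script for illustration
    user_id="alice",                    # the user to submit on behalf of
    group_id="alice",                   # and their group
    working_directory="/home/alice",    # placeholder path
)

job_id = desc.submit()
print(f"Submitted job {job_id} on behalf of alice")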

@gongwei-130
Author

@tazend sorry, I forgot to get back to you - it works fine.

An additional question: how do I use "job_desc.environment"? I have the following setting, which introduces an error; the error goes away if I remove the environment setting.

job_desc.environment = {
    "MY_VAR": "Hello",
    "MY_OTHER_VAR": "World",
}

error:
srun: error: mk-xii-01: task 0: Exited with exit code 2
slurmstepd-mk-xii-01: error: execve(): bash: No such file or directory

job_desc = pyslurm.JobSubmitDescription()
job_desc.name = 'wei_test'
job_desc.ntasks_per_node = 1
job_desc.cpus_per_task = 175
job_desc.memory_per_node = "50G"

job_desc.working_directory = '/home/weigong'
job_desc.standard_error = "/home/weigong/output/job_%j_node_%N_task_%t.err"
job_desc.standard_output = "/home/weigong/output/job_%j_node_%N_task_%t.out"
    
job_desc.nodes = 2
job_desc.gres_per_node = "gpu:8"

job_desc.partitions = 'onehour'

job_desc.environment = {
    "MY_VAR": "Hello",
    "MY_OTHER_VAR": "World",
}

@tazend
Member

tazend commented Nov 14, 2024

Hi,

Hmm, interesting - I'm not sure why that happens. I will need to try to reproduce/debug this on my side to see what's going on.
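
One thing you could try in the meantime (just a guess, untested): if setting environment replaces the job's whole inherited environment, then PATH would no longer be set when slurmstepd tries to execve() bash, which could explain the "No such file or directory" error. Merging the custom variables into the current environment, instead of passing only the two custom ones, would rule that out:

import os
import pyslurm

job_desc = pyslurm.JobSubmitDescription()
# ... other job args as in your snippet ...

# Start from the submitting process' environment so PATH (and friends)
# survive, then layer the custom variables on top. Untested sketch.
env = dict(os.environ)
env.update({
    "MY_VAR": "Hello",
    "MY_OTHER_VAR": "World",
})
job_desc.environment = env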

@gongwei-130
Author

@tazend more questions:

  1. How do I submit multi-node jobs?
    For multi-node jobs, I tried to submit a script and execute it on all nodes, as follows. But I found that if I don't put "srun" before the "bash test.sh", it only runs on compute node 0. Is requiring srun expected? I didn't see it mentioned in any document or example.

  2. If running with a container, how do I kill the container when the job is cancelled? The second script I used tries to catch SIGTERM and stop the container, but it cannot catch the signal. Any advice? Thanks.

def _build_start_script():
    start_command = "#! /bin/bash\n"
    start_command += "srun bash test.sh"
    return start_command

job_desc.script = _build_start_script()
job_id = job_desc.submit()

The second script, which starts the container and tries to catch SIGTERM:

container_name="${SLURM_JOB_ID}_container"

# Function to clean up Docker container on exit
cleanup() {
    echo "Stopping and removing container..."
    sudo docker stop "$container_name" > /dev/null 2>&1
    sudo docker rm "$container_name" > /dev/null 2>&1
    echo "Container stopped and removed."
}

trap cleanup SIGTERM
trap cleanup SIGKILL

sudo docker run -d --rm --tty --name $container_name --gpus all --ipc host bash_script.sh
sudo docker logs -f $container_name

wait

@tazend
Member

tazend commented Nov 26, 2024

Hi,

srun makes sure that whatever number of ntasks you have requested gets spawned on all the compute nodes in the allocation assigned to your job. If you want to launch processes across multiple nodes, then yes, srun is required. But this is nothing special to pyslurm - it is the standard way in Slurm to launch multi-node jobs, usually backed by something like MPI.
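
For example, a sketch along these lines (node count, task count and the script contents are placeholders) launches one task on each node of the allocation:

import pyslurm

# Hypothetical multi-node example: 2 nodes, 1 task per node.
script = "\n".join([
    "#!/bin/bash",
    # srun spawns the requested tasks on every node in the allocation;
    # without it, the batch script itself only runs on the first node.
    "srun bash test.sh",
])

job_desc = pyslurm.JobSubmitDescription(
    name="multi_node_example",   # placeholder name
    nodes=2,
    ntasks_per_node=1,
    script=script,
)
job_id = job_desc.submit()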

To your second question: I'm not sure it will work properly the way you have set up the script here. Note that SIGKILL cannot be caught, for example. The SIGTERM trap might also be ignored due to the way Slurm handles signals, though I'm not sure (you could experiment with something like scancel --full).

Instead of docker you could also have a look at an alternative called apptainer (singularity), which can also execute containers (including docker containers out of the box). It doesn't require any daemon or sudo setup like docker does, and is (in my opinion) better suited for integration with Slurm (and HPC in general):

https://apptainer.org/
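
As a rough illustration (assuming apptainer is installed on the compute nodes; the image and the script name are placeholders), the job script could then run the container in the foreground, without a daemon or sudo:

import pyslurm

# Sketch only: docker://... images are pulled and converted by apptainer on first use.
script = "\n".join([
    "#!/bin/bash",
    # apptainer runs as a child of the job step, so it receives the job's
    # signals and is torn down when the job is cancelled.
    "srun apptainer exec --nv docker://ubuntu:22.04 bash bash_script.sh",
])

job_desc = pyslurm.JobSubmitDescription(
    name="apptainer_example",    # placeholder name
    nodes=1,
    ntasks_per_node=1,
    gres_per_node="gpu:8",       # matches your earlier example; adjust as needed
    script=script,
)
job_id = job_desc.submit()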
