AMD GPU subset selection does not work #21454
Comments
@giuseppe PTAL
I wonder if the different behavior is because we are not configuring the devices cgroup, since an unprivileged user cannot use it. Do you see different output if you run Podman as the root user? If Podman still behaves the same when running as root, please share the output of the following command, both for Docker and Podman:
Hi @giuseppe, thanks for your fast answer. You are right: as root, it mounts just one GPU. Is it possible to get that behavior with unprivileged users? If it is not possible, is there a reason why? I would be happy to help add it (though I would appreciate some guidance). Many thanks.
Hi, no, that is not possible because the kernel doesn't allow it. On cgroup v2, the devices cgroup requires eBPF, which is not usable from a user namespace. On cgroup v1 there is a similar problem, since delegation is not safe, and we do not use cgroups at all with unprivileged users.
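To see which of the two cases above applies on a given node, a minimal sketch like the following may help. It assumes the cgroup filesystem is mounted at `/sys/fs/cgroup` and relies on the fact that the root of a cgroup v2 (unified) hierarchy exposes a `cgroup.controllers` file. The function names `cgroup_version` and `device_cgroup_note` are hypothetical helpers for illustration, not part of Podman.

```python
import os
from pathlib import Path


def cgroup_version(cgroup_root: str = "/sys/fs/cgroup") -> int:
    """Return 2 if a unified (v2) hierarchy is mounted at cgroup_root, else 1.

    Assumption: the default mount point /sys/fs/cgroup is used.
    """
    # The root of a cgroup v2 hierarchy contains a cgroup.controllers file.
    return 2 if (Path(cgroup_root) / "cgroup.controllers").is_file() else 1


def device_cgroup_note(cgroup_root: str = "/sys/fs/cgroup", euid=None) -> str:
    """Summarize, per the explanation in this thread, whether device
    filtering through the devices cgroup can be expected."""
    euid = os.geteuid() if euid is None else euid
    version = cgroup_version(cgroup_root)
    if euid != 0:
        # Rootless: the devices cgroup is not configured (see comment above).
        return f"cgroup v{version}, rootless: devices cgroup not configured"
    return f"cgroup v{version}, root: devices cgroup can be configured"


if __name__ == "__main__":
    print(device_cgroup_note())
```

Running it once as the unprivileged user and once as root gives a quick summary of which situation a node is in; it does not inspect any container.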
My tests are on Slurm + rootless Podman + AMD GPUs, and as far as I understand, rootless Podman does not fit Slurm jobs that are not node-exclusive (i.e. jobs that select only some of the node's GPUs).
Issue Description
Podman AMD GPU subset selection should work by selecting devices in /dev/dri/IDs, but it always takes every GPU in the node.
Steps to reproduce the issue
Describe the results you received
rocm-smi reports every GPU in the node instead of the selected subset.
Describe the results you expected
It should use only the GPUs mounted in the devices list, as Docker does.
In the following image we see the output from both Podman and Docker when mounting one device into the container (/dev/dri/renderD128).
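As a complement to the rocm-smi output, the device nodes actually present in the container can be listed directly. The following is a minimal sketch, assuming render nodes follow the `renderD<N>` naming seen above and live under `/dev/dri`; `visible_render_nodes` is a hypothetical helper name. It only shows which nodes exist in that directory and says nothing about what rocm-smi itself enumerates.

```python
from pathlib import Path


def visible_render_nodes(dri_dir: str = "/dev/dri") -> list:
    """Return the sorted names of render nodes (renderD*) found in dri_dir.

    Assumption: dri_dir is the DRI device directory, /dev/dri by default.
    Returns an empty list if the directory does not exist.
    """
    directory = Path(dri_dir)
    if not directory.is_dir():
        return []
    # Keep only render nodes; entries such as card0 are ignored.
    return sorted(
        entry.name
        for entry in directory.iterdir()
        if entry.name.startswith("renderD")
    )


if __name__ == "__main__":
    print(visible_render_nodes())
```

Running this inside both the Podman and the Docker container makes it possible to compare the mounted device nodes with the GPUs that rocm-smi reports.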
podman info output
Podman in a container
No
Privileged Or Rootless
Rootless
Upstream Latest Release
No
Additional environment details
No response
Additional information