bug: jail has problems when machines have drivers of different versions #184

Open
CrackedPoly opened this issue Nov 16, 2024 · 4 comments

Comments

@CrackedPoly
Contributor

The jail only symlinks libnvidia-ml.so.1 to one specific driver version, and that target file may be empty on other machines. For example, I have two machines with drivers 535.104.05 and 550.90.12, and a jailed job runs into an error.

# --------------------------- Jobs in jail environment
root@login-0:~# srun --gpus-per-task=8 --ntasks=2 ls -lh /usr/lib/x86_64-linux-gnu/|grep libnvidia-ml.so
lrwxrwxrwx   1 root root    25 Nov 16 08:13 libnvidia-ml.so.1 -> libnvidia-ml.so.550.90.12
-rwxr-xr-x.  1 root root  1.8M Jul  3 09:52 libnvidia-ml.so.535.104.05
-rwxr-xr-x   1 root root     0 Nov 16 08:13 libnvidia-ml.so.550.90.12
lrwxrwxrwx   1 root root    25 Nov 16 08:13 libnvidia-ml.so.1 -> libnvidia-ml.so.550.90.12
-rwxr-xr-x   1 root root     0 Nov 16 08:13 libnvidia-ml.so.535.104.05
-rwxr-xr-x.  1 root root  2.0M Sep 24 06:11 libnvidia-ml.so.550.90.12

root@login-0:~# srun --gpus-per-task=8 --ntasks=2 nvidia-smi
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.
srun: error: worker-1: task 0: Exited with exit code 12
Sat Nov 16 08:27:08 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.12              Driver Version: 550.90.12      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|

# --------------------------- In workers
root@worker-1:/# ls -lh /usr/lib/x86_64-linux-gnu/|grep libnvidia-ml
lrwxrwxrwx   1 root root    17 Nov 16 08:13 libnvidia-ml.so -> libnvidia-ml.so.1
lrwxrwxrwx   1 root root    26 Nov 16 08:13 libnvidia-ml.so.1 -> libnvidia-ml.so.535.104.05
-rwxr-xr-x.  1 root root  1.8M Jul  3 09:52 libnvidia-ml.so.535.104.05
root@worker-1:/# ls -lh /mnt/jail/usr/lib/x86_64-linux-gnu/|grep libnvidia-ml
lrwxrwxrwx   1 root root    25 Nov 16 08:13 libnvidia-ml.so.1 -> libnvidia-ml.so.550.90.12
-rwxr-xr-x.  1 root root  1.8M Jul  3 09:52 libnvidia-ml.so.535.104.05
-rwxr-xr-x   1 root root     0 Nov 16 08:13 libnvidia-ml.so.550.90.12

root@worker-0:/# ls -lh /usr/lib/x86_64-linux-gnu/|grep libnvidia-ml
lrwxrwxrwx   1 root root    17 Nov 16 08:13 libnvidia-ml.so -> libnvidia-ml.so.1
lrwxrwxrwx   1 root root    25 Nov 16 08:13 libnvidia-ml.so.1 -> libnvidia-ml.so.550.90.12
-rwxr-xr-x.  1 root root  2.0M Sep 24 06:11 libnvidia-ml.so.550.90.12
root@worker-0:/# ls -lh /mnt/jail/usr/lib/x86_64-linux-gnu/|grep libnvidia-ml
lrwxrwxrwx   1 root root    25 Nov 16 08:13 libnvidia-ml.so.1 -> libnvidia-ml.so.550.90.12
-rwxr-xr-x   1 root root     0 Nov 16 08:13 libnvidia-ml.so.535.104.05
-rwxr-xr-x.  1 root root  2.0M Sep 24 06:11 libnvidia-ml.so.550.90.12
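
A quick way to confirm which nodes end up with a dangling link (a hedged sketch; the paths and srun flags just mirror the output above):

# Hedged check: resolve the jail's NVML symlink on every worker and print the
# target's size. A size of 0 bytes means that node's symlink points at the
# other node's driver version.
srun --gpus-per-task=8 --ntasks=2 bash -c \
  'hostname; readlink -f /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 | xargs stat -c "%n: %s bytes"'
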
@CrackedPoly
Contributor Author

"Although users can install or modify software in the shared environment, it doesn't apply to some low-level packages directly bound to GPUs (CUDA, NVIDIA drivers, NVIDIA container toolkit, enroot, etc.)."

https://github.com/nebius/soperator/blob/main/docs/limitations.md#some-software-cant-be-upgraded-when-the-cluster-is-running

Sorry, now I see ...

@CrackedPoly
Contributor Author

I have dug into this a little further: when the two workers' slurmd containers call nvidia-container-cli to mount their local drivers into the jail, two different versions of the mounted NVIDIA libs end up there, but each worker has only one valid versioned lib file. Later, the last worker (race condition) calls ldconfig to symlink the unversioned lib to its versioned lib, so only that last worker can follow the link to a valid lib file.
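
To illustrate, a rough sketch of the sequence I suspect; the exact commands are my guess from the listings above, not taken from the soperator source:

# Hedged reconstruction of the race (commands are illustrative):
# 1. On each worker, nvidia-container-cli bind-mounts that worker's local
#    driver libs into the shared jail, so the worker sees its own
#    libnvidia-ml.so.<version> as a real file and the other worker's version
#    only as a zero-byte mount placeholder.
# 2. Each worker then refreshes the linker symlinks inside the jail, roughly:
chroot /mnt/jail ldconfig
# 3. The worker that runs this last points libnvidia-ml.so.1 at its own driver
#    version. The symlink is a shared directory entry in the jail, so on every
#    other worker it now targets what is locally just an empty placeholder.
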

@Uburro
Collaborator

Uburro commented Nov 16, 2024

Hi, yes. To prevent application version pins from being overwritten, we plan to use AppArmor. In Q4, we will release an example demonstrating how to achieve this using the Security Profiles Operator.
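
For what it's worth, a minimal sketch of what such a pin might look like as a plain AppArmor profile (my own illustration, not the example Nebius plans to publish; the profile name and paths are made up):

# Hypothetical AppArmor fragment: allow loading the NVML libraries but deny
# rewriting their soname symlink from inside the jail container.
#include <tunables/global>

profile jail-pin-nvml flags=(attach_disconnected) {
  #include <abstractions/base>

  # reading and mmap'ing the driver libraries stays allowed
  /usr/lib/x86_64-linux-gnu/libnvidia-ml.so* rm,

  # a competing ldconfig run may not rewrite or re-link the soname symlink
  deny /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 wl,
}

The profile would still have to be attached to the jail container (which is what the Security Profiles Operator handles), so treat this only as the shape of the rule, not a working setup.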

@CrackedPoly
Contributor Author

> Hi, yes. To prevent application version pins from being overwritten, we plan to use AppArmor. In Q4, we will release an example demonstrating how to achieve this using the Security Profiles Operator.

Good! Thanks!
