Can not checkpoint container created with nvidia-container-runtime (mounted gpu) #4522

gflarity · 2024-11-11T22:02:54Z

Description

Hi,

Support was recently added to criu for checkpointing CUDA applications. I've tested this on plain old processes and it seems to work as advertised.

I wanted to try this with runc containers as well, so I created a container using the nvidia-container-runtime shim, which seems to just modify the config.json, adding a prestart hook that does the heavy lifting. After that I can call runc start and invoke my test CUDA application just fine. However, when I try to take a snapshot, runc checkpoint just hangs, regardless of whether the CUDA application is even running. Looking at dump.log, I can see that criu errored out. Here are the last few lines (the full dump is attached below):

(07.507264) Error (criu/mount.c:1088): mnt: Mount 251 ./proc/driver/nvidia/gpus/0000:00:04.0 (master_id: 5 shared_id: 0) has unreachable sharing. Try --enable-external-masters.
(07.507282) net: Unlock network
(07.507285) Running network-unlock scripts
(07.507287) 	RPC
(07.519624) cuda_plugin: finished cuda_plugin stage 0 err -1
(10.996267) cuda_plugin: resuming devices on pid 404642
(10.996295) cuda_plugin: Restore thread pid 404694 found for real pid 404642
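
Since runc checkpoint hangs instead of reporting the failure, the quickest way to spot the cause is to filter the log for error lines. A minimal sketch, assuming the log lines follow the `(timestamp) Error (...)` layout shown above; the helper name `criu_errors` is made up for illustration:

```shell
# Print only the error lines from a CRIU log file.
# Assumes each line starts with a "(seconds.micros)" timestamp followed by "Error".
criu_errors() {
    grep -E '^\([0-9]+\.[0-9]+\) Error ' "$1"
}
```

With the `--work-path ./workdir/` used in the reproduction steps, the log would presumably be at `./workdir/dump.log`, so the call would be `criu_errors ./workdir/dump.log`.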

I'm happy to keep digging and see if I can find a way to try the equivalent of --enable-external-masters with criu RPC, but I wanted to file this issue in case more experienced people have any pointers. I'm specifically wondering whether the external masters thing is just a rabbit hole. As far as I can see, it's not so easy to just 'try' this flag with RPC. But if it solves the problem, I'm happy to submit a PR.
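
To see which mounts the error is about, it may help to list the slave mounts in the container's mount namespace. The sketch below reads a `mountinfo` file (format per proc(5)) and prints the mount ID, mount point, and propagation tags of every mount that carries a `master:N` tag. The function name `slave_mounts` is made up for illustration, and this only surfaces candidates; it does not show whether --enable-external-masters would resolve the failure.

```shell
# Print "<mount id> <mount point> <propagation tags>" for every mount in the
# given mountinfo file that is a slave of some peer group (has a master:N tag).
slave_mounts() {
    awk '{
        tags = ""
        # Optional fields start at column 7 and end at the "-" separator.
        for (i = 7; i <= NF && $i != "-"; i++)
            tags = tags (tags ? " " : "") $i
        if (tags ~ /master:/) print $1, $5, tags
    }' "$1"
}
```

One way to use it, where `<pid>` stands for the container's init process: `slave_mounts /proc/<pid>/mountinfo | grep nvidia`. A mount that shows `master:N` without a matching `shared:N` elsewhere in the same file is the kind of mount the "unreachable sharing" message refers to.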

Please advise, thanks!

Steps to reproduce the issue

sudo nvidia-container-runtime create test
sudo runc start test
sudo runc checkpoint --image-path ./dump --work-path ./workdir/ --leave-running=false test

I've attached the config.json.
config.json

Describe the results you received and expected

runc checkpoint just hangs, but if you take a look at the dump.log you can see that criu errored out. Dump attached.

dump.log

What version of runc are you using?

runc --version
runc version 1.2.1+dev
commit: v1.2.1-4-g2327ec22
spec: 1.2.0
go: go1.23.3
libseccomp: 2.5.5

criu --version
Version: 4.0

Host OS information

cat /etc/os-release
PRETTY_NAME="Ubuntu 24.04.1 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
VERSION="24.04.1 LTS (Noble Numbat)"
VERSION_CODENAME=noble
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=noble
LOGO=ubuntu-logo

Host kernel information

uname -a
Linux geoff-dev-testing 6.8.0-1015-gcp #17-Ubuntu SMP Mon Sep 2 17:57:02 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
