
Options for CUDA, podman and docker updated with nvidia-container sup… #896

Draft: wants to merge 1 commit into base: main
Conversation

@emrahbillur (Contributor) commented Nov 15, 2024:

Description of changes

The following configuration options are introduced:

  • ghaf.development.cuda enables CUDA support on supported platforms.
  • ghaf.virtualization.podman.daemon adds Podman support with NVIDIA containers and Docker compatibility options.
  • ghaf.virtualization.docker.daemon is updated for Docker support with NVIDIA containers.

Docker and Podman can optionally coexist. This functionality is planned to be removed or moved to run inside VMs later, but for now it is required for NVIDIA containers and ML software.
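The options above can be sketched together in a target configuration. This is a minimal example based on the option paths named in this PR description; the exact attribute names (e.g. whether the CUDA option has an `enable` flag) are assumptions, not confirmed against the module code:

```nix
# Sketch: enabling the new options in a target's flake-module.nix.
# Option paths are taken from the PR description; `development.cuda.enable`
# is an assumed spelling of the CUDA flag.
{
  ghaf = {
    development.cuda.enable = true;      # CUDA support (assumed flag name)
    virtualization = {
      docker.daemon.enable = true;       # Docker with NVIDIA container support
      podman.daemon.enable = true;       # Podman with NVIDIA container support
    };
  };
}
```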

Checklist for things done

  • Summary of the proposed changes in the PR description
  • More detailed description in the commit message(s)
  • Commits are squashed into relevant entities - avoid a lot of minimal dev time commits in the PR
  • Contribution guidelines followed
  • Ghaf documentation updated with the commit - https://tiiuae.github.io/ghaf/
  • PR linked to architecture documentation and requirement(s) (ticket id)
  • Test procedure described (or includes tests). Select one or more:
    • Tested on Lenovo X1 x86_64
    • Tested on Jetson Orin NX or AGX aarch64
    • Tested on Polarfire riscv64
  • Author has run make-checks and it passes
  • All automatic GitHub Actions checks pass - see actions
  • Author has added reviewers and removed PR draft status
  • Change requires full re-installation
  • Change can be updated with nixos-rebuild ... switch

Instructions for Testing

  • List all targets that this applies to:
  • Is this a new feature?
    • Test steps to verify:
      • For Docker:

      • In your Ghaf configuration (preferably the flake-module.nix of your target platform) add ghaf.virtualization.docker.daemon.enable = true;

      • Rebuild your configuration.

      • For x86_64 platforms:

        • Run sudo docker run --rm --device=nvidia.com/gpu=all ubuntu nvidia-smi. The top of the output should look like:
          +-----------------------------------------------------------------------------+
          | NVIDIA-SMI 535.86.10    Driver Version: 535.86.10    CUDA Version: 12.2     |
          |-------------------------------+----------------------+----------------------+
          | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
          The second line shows the CUDA version.
      • NVIDIA Jetson platforms do not have nvidia-smi, so use Python torch instead:

        • Run sudo docker run -it --rm --device=nvidia.com/gpu=all --network host nvcr.io/nvidia/l4t-pytorch:r35.2.1-pth2.0-py3 and, from the container shell, run python3 -c 'import torch; print(torch.cuda.is_available())'. The expected output is True.
      • For Podman:

      • In your Ghaf configuration (preferably the flake-module.nix of your target platform) add ghaf.virtualization.podman.daemon.enable = true;

      • Rebuild your configuration.

      • For x86_64 platforms:

        • Run sudo podman run --rm --device=nvidia.com/gpu=all ubuntu nvidia-smi. The top of the output should look like:
          +-----------------------------------------------------------------------------+
          | NVIDIA-SMI 535.86.10    Driver Version: 535.86.10    CUDA Version: 12.2     |
          |-------------------------------+----------------------+----------------------+
          | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
          The second line shows the CUDA version.
      • NVIDIA Jetson platforms do not have nvidia-smi, so use Python torch instead:

        • Run sudo podman run -it --rm --device=nvidia.com/gpu=all nvcr.io/nvidia/l4t-pytorch:r35.2.1-pth2.0-py3 /bin/bash and, from the container shell, run python3 -c 'import torch; print(torch.cuda.is_available())'. The expected output is True.

Note: You can test Podman with Docker commands, as Podman provides a Docker compatibility option (this will not work when both the Docker and Podman daemons are enabled together).
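For context, the Docker compatibility mentioned in the note maps to upstream NixOS Podman options. This is a hedged sketch of what such a configuration looks like in plain NixOS; whether the ghaf.virtualization.podman.daemon module sets exactly these options is an assumption, not confirmed against this PR:

```nix
# Sketch: upstream NixOS options behind Podman's Docker compatibility
# (assumed to underlie ghaf.virtualization.podman.daemon).
{
  virtualisation.podman = {
    enable = true;
    dockerCompat = true;          # installs a `docker` alias for podman
    dockerSocket.enable = true;   # exposes a Docker-compatible API socket
  };
}
```

In upstream NixOS, dockerCompat conflicts with an enabled Docker daemon (both want to provide the docker command), which is consistent with the note that compatibility mode will not work when both daemons run together.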

  • If it is an improvement, how does it impact existing functionality?
    The Docker daemon was already present in Ghaf; NVIDIA container and CUDA support has now been added to it. Podman is a new feature.


config = mkIf cfg.enable {
# Enabling CUDA on any supported system requires the settings below.
nixpkgs.config.allowUnfree = lib.mkForce true;
A Collaborator commented:

Did we not pass in a high-level nixpkgs instance, or is this a false memory?
In that case nixpkgs.config wouldn't work.
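The concern above reflects general Nix behavior: if the flake instantiates nixpkgs itself and supplies the resulting pkgs to the module system, module-level nixpkgs.config is ignored. A hypothetical sketch of the alternative, with illustrative names not taken from this PR:

```nix
# If pkgs is imported by the flake itself, unfree must be allowed at
# import time; nixpkgs.config.allowUnfree set inside a module has no
# effect on an externally supplied pkgs instance.
pkgs = import nixpkgs {
  system = "x86_64-linux";
  config.allowUnfree = true;  # required for the NVIDIA/CUDA packages
};
```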

A Collaborator replied:

But I think if there is no error, then the answer is no.

…port with cdi fix for docker

Signed-off-by: Emrah Billur <[email protected]>
@emrahbillur (Contributor, Author) commented:

Finally, the Docker NVIDIA container issue is solved by forcing CDI devices. Only a single issue remains: cross-compilation of libnvidia-container, where the compile option -m64 fails with -from-x86_64 builds.

@@ -0,0 +1,24 @@
# Copyright 2022-2024 TII (SSRC) and the Ghaf contributors
# SPDX-License-Identifier: Apache-2.0
{ lib, config, ... }:


nitpick: { config, lib, ... } for consistency with other modules
