
Options for CUDA, podman and docker updated with nvidia-container sup… #896

Draft: wants to merge 1 commit into base: main
Conversation

@emrahbillur (Contributor) commented Nov 15, 2024:

Description of changes

The following configuration options are introduced:

  • ghaf.development.cuda enables CUDA support on supported platforms.
  • ghaf.virtualization.podman.daemon adds Podman support with NVIDIA containers and Docker compatibility options.
  • ghaf.virtualization.docker.daemon is updated for Docker support with NVIDIA containers.

Docker and Podman can optionally coexist. This functionality is planned to be removed or moved to run inside VMs later, but for now it is required for NVIDIA containers and ML software.
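The options above can be sketched together in a target configuration. This is a minimal example based on the option paths named in this PR description; the exact attribute names (e.g. whether the CUDA option has an `enable` flag) are assumptions, not confirmed against the module code:

```nix
# Sketch: enabling the new options in a target's flake-module.nix.
# Option paths are taken from the PR description; `development.cuda.enable`
# is an assumed spelling of the CUDA flag.
{
  ghaf = {
    development.cuda.enable = true;      # CUDA support (assumed flag name)
    virtualization = {
      docker.daemon.enable = true;       # Docker with NVIDIA container support
      podman.daemon.enable = true;       # Podman with NVIDIA container support
    };
  };
}
```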

Checklist for things done

  • Summary of the proposed changes in the PR description
  • More detailed description in the commit message(s)
  • Commits are squashed into relevant entities - avoid a lot of minimal dev time commits in the PR
  • Contribution guidelines followed
  • Ghaf documentation updated with the commit - https://tiiuae.github.io/ghaf/
  • PR linked to architecture documentation and requirement(s) (ticket id)
  • Test procedure described (or includes tests). Select one or more:
    • Tested on Lenovo X1 x86_64
    • Tested on Jetson Orin NX or AGX aarch64
    • Tested on Polarfire riscv64
  • Author has run make-checks and it passes
  • All automatic GitHub Actions checks pass - see actions
  • Author has added reviewers and removed PR draft status
  • Change requires full re-installation
  • Change can be updated with nixos-rebuild ... switch

Instructions for Testing

  • List all targets that this applies to:
  • Is this a new feature?
    • Test steps to verify:
      • For Docker:

      • In your Ghaf configuration (preferably the flake-module.nix of your target platform) add ghaf.virtualization.docker.daemon.enable = true;

      • Rebuild your configuration.

      • For x86_64 platforms:

        • Run sudo docker run --rm --device=nvidia.com/gpu=all ubuntu nvidia-smi. The top of the output should look like:
          +-----------------------------------------------------------------------------+
          | NVIDIA-SMI 535.86.10    Driver Version: 535.86.10    CUDA Version: 12.2     |
          |-------------------------------+----------------------+----------------------+
          | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
          The second line shows the CUDA version.
      • NVIDIA Jetson platforms do not have nvidia-smi, so use Python torch instead:

        • Run sudo docker run -it --rm --device=nvidia.com/gpu=all --network host nvcr.io/nvidia/l4t-pytorch:r35.2.1-pth2.0-py3 and, from the container shell, run python3 -c 'import torch; print(torch.cuda.is_available())'. The expected output is True.
      • For Podman:

      • In your Ghaf configuration (preferably the flake-module.nix of your target platform) add ghaf.virtualization.podman.daemon.enable = true;

      • Rebuild your configuration.

      • For x86_64 platforms:

        • Run sudo podman run --rm --device=nvidia.com/gpu=all ubuntu nvidia-smi. The top of the output should look like:
          +-----------------------------------------------------------------------------+
          | NVIDIA-SMI 535.86.10    Driver Version: 535.86.10    CUDA Version: 12.2     |
          |-------------------------------+----------------------+----------------------+
          | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
          The second line shows the CUDA version.
      • NVIDIA Jetson platforms do not have nvidia-smi, so use Python torch instead:

        • Run sudo podman run -it --rm --device=nvidia.com/gpu=all nvcr.io/nvidia/l4t-pytorch:r35.2.1-pth2.0-py3 /bin/bash and, from the container shell, run python3 -c 'import torch; print(torch.cuda.is_available())'. The expected output is True.

Note: You can test Podman with Docker commands, as Podman provides a Docker compatibility option (this will not work when both the Docker and Podman daemons are enabled together).
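For context, the Docker compatibility mentioned in the note maps to upstream NixOS Podman options. This is a hedged sketch of what such a configuration looks like in plain NixOS; whether the ghaf.virtualization.podman.daemon module sets exactly these options is an assumption, not confirmed against this PR:

```nix
# Sketch: upstream NixOS options behind Podman's Docker compatibility
# (assumed to underlie ghaf.virtualization.podman.daemon).
{
  virtualisation.podman = {
    enable = true;
    dockerCompat = true;          # installs a `docker` alias for podman
    dockerSocket.enable = true;   # exposes a Docker-compatible API socket
  };
}
```

In upstream NixOS, dockerCompat conflicts with an enabled Docker daemon (both want to provide the docker command), which is consistent with the note that compatibility mode will not work when both daemons run together.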

  • If it is an improvement, how does it impact existing functionality?
    The Docker daemon was already present in Ghaf; NVIDIA container and CUDA support has now been added to it. Podman is a new feature.


config = mkIf cfg.enable {
# Enabling CUDA on any supported system requires the settings below.
nixpkgs.config.allowUnfree = lib.mkForce true;
A Collaborator commented:

Did we not pass in a high-level nixpkgs instance, or is this a false memory?
In that case nixpkgs.config wouldn't work.
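The concern above reflects general Nix behavior: if the flake instantiates nixpkgs itself and supplies the resulting pkgs to the module system, module-level nixpkgs.config is ignored. A hypothetical sketch of the alternative, with illustrative names not taken from this PR:

```nix
# If pkgs is imported by the flake itself, unfree must be allowed at
# import time; nixpkgs.config.allowUnfree set inside a module has no
# effect on an externally supplied pkgs instance.
pkgs = import nixpkgs {
  system = "x86_64-linux";
  config.allowUnfree = true;  # required for the NVIDIA/CUDA packages
};
```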

A Collaborator replied:

But I think if there is no error, then the answer is no.

…port with cdi fix for docker

Signed-off-by: Emrah Billur <[email protected]>
@emrahbillur (Contributor, Author) commented:

Finally, the Docker NVIDIA container issue is solved by forcing CDI devices. Only a single issue remains: cross-compilation of libnvidia-container, where the compile option -m64 fails with -from-x86_64 builds.

@@ -0,0 +1,24 @@
# Copyright 2022-2024 TII (SSRC) and the Ghaf contributors
# SPDX-License-Identifier: Apache-2.0
{ lib, config, ... }:


nitpick: { config, lib, ... } for consistency with other modules
