Add GPU Passthrough role #28

Open
wants to merge 10 commits into
base: main
34 changes: 34 additions & 0 deletions roles/gpu_passthrough/README.md
@@ -0,0 +1,34 @@
# stackhpc.linux.gpu_passthrough

## Example playbook

```
---
- name: Enable GPU Passthrough
  hosts: gpu_passthrough
  tasks:
    - import_role:
        name: stackhpc.linux.gpu_passthrough
  handlers:
    - name: reboot
      fail:
        msg: "Please reboot your hypervisor and re-run your host configure to continue"
      become: true

```

Or if you want the machine to reboot automatically:

```
---
- name: Enable GPU Passthrough
  hosts: gpu_passthrough
  tasks:
    - import_role:
        name: stackhpc.linux.gpu_passthrough
  handlers:
    - name: reboot
      reboot:
      become: true

```
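
The `stackhpc.linux.iommu` role included by this role only adds its vfio configuration when `iommu_vfio_pci_ids` is defined (see `roles/iommu/tasks/main.yml` in this change). A minimal sketch of setting it at play level follows. It assumes the value is a comma-separated string of vendor:device IDs, because the role appends it directly to `vfio-pci.ids=`. The ID shown is only a placeholder.

```yaml
---
- name: Enable GPU Passthrough
  hosts: gpu_passthrough
  vars:
    # Placeholder vendor:device ID (assumption); replace it with the IDs
    # that `lspci -nn` reports for the GPUs on your hypervisor.
    iommu_vfio_pci_ids: "10de:1db4"
  tasks:
    - import_role:
        name: stackhpc.linux.gpu_passthrough
  handlers:
    - name: reboot
      reboot:
      become: true
```

Setting the variable in inventory `group_vars` for the `gpu_passthrough` group should work equally well; the play-level `vars` here just keep the example self-contained.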
1 change: 1 addition & 0 deletions roles/gpu_passthrough/defaults/main.yml
@@ -0,0 +1 @@
---
20 changes: 20 additions & 0 deletions roles/gpu_passthrough/handlers/main.yml
@@ -0,0 +1,20 @@
---
- name: Regenerate initramfs (RedHat)
Review comment (Collaborator):

This seemed to run after the reboot handler

  listen: Regenerate initramfs
  ansible.builtin.shell: |-
    #!/bin/bash
    set -eux
    dracut -v -f /boot/initramfs-$(uname -r).img $(uname -r)
  become: true
  changed_when: true
  when: ansible_facts.os_family == 'RedHat'

- name: Regenerate initramfs (Debian)
  listen: Regenerate initramfs
  ansible.builtin.shell: |-
    #!/bin/bash
    set -eux
    update-initramfs -u -k $(uname -r)
  become: true
  changed_when: true
  when: ansible_facts.os_family == 'Debian'
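
On the ordering concern raised in the review comment on this file: Ansible runs handlers in the order they are defined, not in the order they are listed under `notify`, so a play-level `reboot` handler may fire before these role handlers. One way to sidestep handler ordering, sketched here as an alternative and not as what this change does, is to regenerate the initramfs in an ordinary task that runs as soon as the config file changes. The example reuses the `Blacklist nouveau` task from `tasks/main.yml`; the registered variable name `gpu_passthrough_nouveau_blacklist` is hypothetical.

```yaml
- name: Blacklist nouveau
  ansible.builtin.blockinfile:
    path: /etc/modprobe.d/blacklist-nouveau.conf
    block: |
      blacklist nouveau
      options nouveau modeset=0
    mode: "0664"
    owner: root
    group: root
    create: true
  become: true
  register: gpu_passthrough_nouveau_blacklist  # hypothetical variable name
  notify: reboot

# An ordinary task runs inline, so it finishes before the reboot handler
# notified above, which only runs when handlers are flushed.
- name: Regenerate initramfs (RedHat)
  ansible.builtin.shell: |-
    set -eux
    dracut -v -f /boot/initramfs-$(uname -r).img $(uname -r)
  become: true
  changed_when: true
  when:
    - gpu_passthrough_nouveau_blacklist is changed
    - ansible_facts.os_family == 'RedHat'
```

The trade-off is that each task that should trigger a rebuild needs its own `register` and condition, whereas the handler approach deduplicates repeated notifications automatically.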
45 changes: 45 additions & 0 deletions roles/gpu_passthrough/tasks/main.yml
@@ -0,0 +1,45 @@
---
- name: Blacklist nouveau
  ansible.builtin.blockinfile:
    path: /etc/modprobe.d/blacklist-nouveau.conf
    block: |
      blacklist nouveau
      options nouveau modeset=0
    mode: "0664"
    owner: root
    group: root
    create: true
  become: true
  notify:
    - Regenerate initramfs
    - reboot # no-qa

- name: Ignore unsupported model specific registers
  # Occasionally, applications running in the VM may crash unexpectedly,
  # whereas they would run normally on a physical machine. If, while
  # running dmesg -wH, you encounter an error mentioning MSR, the reason
  # for those crashes is that KVM injects a General protection fault (GPF)
  # when the guest tries to access unsupported Model-specific registers
  # (MSRs) - this often results in guest applications/OS crashing. A
  # number of those issues can be solved by passing the ignore_msrs=1
  # option to the KVM module, which will ignore unimplemented MSRs.
  # source: https://wiki.archlinux.org/index.php/QEMU
  ansible.builtin.blockinfile:
    path: /etc/modprobe.d/kvm.conf
    block: |
      options kvm ignore_msrs=Y
      # This option is not available in centos 7 as the kernel is too old,
      # but it can help with dmesg spam in newer kernels (centos8?). Sample
      # dmesg log message:
      # [ +0.000002] kvm [8348]: vcpu0, guest rIP: 0xffffffffb0a767fa ignored rdmsr: 0x619
      # options kvm report_ignored_msrs=N
Review comment (Member):

Maybe this should be uncommented by default.

    mode: "0664"
    owner: root
    group: root
    create: true
  become: true
  notify: reboot # no-qa
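
If the suggestion in the review comment were adopted, the task might look like the sketch below, with `report_ignored_msrs=N` enabled instead of commented out. This assumes every target kernel supports the option; the comment in the change itself notes that the CentOS 7 kernel does not, so this is an illustration, not a drop-in replacement.

```yaml
- name: Ignore unsupported model specific registers
  ansible.builtin.blockinfile:
    path: /etc/modprobe.d/kvm.conf
    block: |
      options kvm ignore_msrs=Y
      # Suppresses the "ignored rdmsr" messages in dmesg.
      # Assumption: the running kernel is new enough to know this option.
      options kvm report_ignored_msrs=N
    mode: "0664"
    owner: root
    group: root
    create: true
  become: true
  notify: reboot
```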

- name: Add IOMMU config to kernel command line
  ansible.builtin.include_role:
    name: stackhpc.linux.iommu
16 changes: 16 additions & 0 deletions roles/iommu/README.md
@@ -20,3 +20,19 @@
become: true

```

Or if you want the node to reboot automatically:

```
---
- name: Enable IOMMU
  hosts: iommu
  tasks:
    - import_role:
        name: stackhpc.linux.iommu
  handlers:
    - name: reboot
      reboot:
      become: true

```
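
After the reboot, it can be useful to confirm that the kernel arguments actually took effect. The play below is a minimal sketch of such a check and is not part of the role. It looks for `iommu=pt`, which the role's `Set iommu=pt` task adds; the play name and the variable `iommu_cmdline` are hypothetical.

```yaml
---
- name: Verify IOMMU kernel arguments
  hosts: iommu
  gather_facts: false
  tasks:
    - name: Read the running kernel command line
      ansible.builtin.command: cat /proc/cmdline
      register: iommu_cmdline  # hypothetical variable name
      changed_when: false

    - name: Check that iommu=pt is present
      ansible.builtin.assert:
        that:
          # Split on whitespace so only a whole argument matches.
          - "'iommu=pt' in iommu_cmdline.stdout.split()"
        fail_msg: "iommu=pt is missing from /proc/cmdline; has the host been rebooted?"
```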
20 changes: 20 additions & 0 deletions roles/iommu/handlers/main.yml
@@ -0,0 +1,20 @@
---
- name: Regenerate initramfs (RedHat)
  listen: Regenerate initramfs
  ansible.builtin.shell: |-
    #!/bin/bash
    set -eux
    dracut -v -f /boot/initramfs-$(uname -r).img $(uname -r)
  become: true
  changed_when: true
  when: ansible_facts.os_family == 'RedHat'

- name: Regenerate initramfs (Debian)
  listen: Regenerate initramfs
  ansible.builtin.shell: |-
    #!/bin/bash
    set -eux
    update-initramfs -u -k $(uname -r)
  become: true
  changed_when: true
  when: ansible_facts.os_family == 'Debian'
45 changes: 43 additions & 2 deletions roles/iommu/tasks/main.yml
@@ -1,14 +1,55 @@
---
- name: Template dracut config for vfio
  ansible.builtin.blockinfile:
    path: /etc/dracut.conf.d/gpu-vfio.conf
    block: |
      add_drivers+="vfio vfio_iommu_type1 vfio_pci vfio_virqfd"
Review comment (Member):

I saw warnings about this configuration:

2024-08-27T12:52:11+0000 INFO /etc/dracut.conf.d/gpu-vfio.conf:add_drivers+="vfio vfio_iommu_type1 vfio_pci vfio_virqfd"

dracut: WARNING: <key>+=" <values> ": <values> should have surrounding white spaces!
dracut: WARNING: This will lead to unwanted side effects! Please fix the configuration file.

dracut-install: Failed to find module 'vfio_virqfd'
dracut: FAILED:  /usr/lib/dracut/dracut-install -D /var/tmp/dracut.kNu52l/initramfs --kerneldir /lib/modules/5.14.0-362.24.1.el9_3.0.1.x86_64/ -m vfio vfio_iommu_type1 vfio_pci vfio_virqfd
/etc/dracut.conf.d/gpu-vfio.conf:add_drivers+="vfio vfio_iommu_type1 vfio_pci vfio_virqfd"

dracut: WARNING: <key>+=" <values> ": <values> should have surrounding white spaces!
dracut: WARNING: This will lead to unwanted side effects! Please fix the configuration file.

dracut-install: Failed to find module 'vfio_virqfd'
dracut: FAILED:  /usr/lib/dracut/dracut-install -D /var/tmp/dracut.gebJKd/initramfs --kerneldir /lib/modules/5.14.0-362.24.1.el9_3.0.1.x86_64/ -m vfio vfio_iommu_type1 vfio_pci vfio_virqfd

Please fix the syntax.

    owner: root
    group: root
    mode: "0660"
    create: true
  become: true
  when:
    - iommu_vfio_pci_ids is defined
    - ansible_facts.os_family == 'Debian'
  notify:
    - Regenerate initramfs
    - reboot
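
The dracut warnings quoted in the review comment point at two separate problems: the value of `add_drivers+=` lacks the surrounding spaces dracut expects, and the `vfio_virqfd` module could not be found on the reviewer's kernel (5.14.0-362.24.1.el9_3.0.1). A possible fix is sketched below. It assumes `vfio_virqfd` can simply be dropped on kernels that do not ship it as a separate module, and it leaves out the OS family condition to keep the example short.

```yaml
- name: Template dracut config for vfio
  ansible.builtin.blockinfile:
    path: /etc/dracut.conf.d/gpu-vfio.conf
    block: |
      # dracut wants a space after the opening quote and before the closing one.
      # vfio_virqfd is omitted on the assumption that the kernel lacks the module.
      add_drivers+=" vfio vfio_iommu_type1 vfio_pci "
    owner: root
    group: root
    mode: "0660"
    create: true
  become: true
  when: iommu_vfio_pci_ids is defined
  notify:
    - Regenerate initramfs
    - reboot
```

If older kernels that still ship `vfio_virqfd` must be supported too, the module list would need to be made conditional; that is left out of this sketch.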

- name: Add vfio to modules-load.d
  ansible.builtin.blockinfile:
    path: /etc/modules-load.d/vfio.conf
    block: |
      vfio
      vfio_iommu_type1
      vfio_pci
      vfio_virqfd
    owner: root
    group: root
    mode: "0664"
    create: true
  become: true
  when: iommu_vfio_pci_ids is defined
  notify: reboot

- name: Add iommu to kernel command line (Intel)
  ansible.builtin.include_role:
    name: stackhpc.linux.grubcmdline
  vars:
-    kernel_cmdline: # noqa: var-naming[no-role-prefix]
-      - intel_iommu=on
+    kernel_cmdline: "{{ ['intel_iommu=on'] }}" # noqa: var-naming[no-role-prefix]
    kernel_cmdline_remove: # noqa: var-naming[no-role-prefix]
      - ^intel_iommu=
  when: ansible_facts.processor | select('search','Intel')

- name: Add vfio pci ids to kernel command line
  ansible.builtin.include_role:
    name: stackhpc.linux.grubcmdline
  vars:
    kernel_cmdline: "{{ ['vfio-pci.ids=' + iommu_vfio_pci_ids] }}" # noqa: var-naming[no-role-prefix]
    kernel_cmdline_remove: # noqa: var-naming[no-role-prefix]
      - ^vfio-pci\.ids=
  when: iommu_vfio_pci_ids is defined

jovial marked this conversation as resolved.
- name: Set iommu=pt
  ansible.builtin.include_role:
    name: stackhpc.linux.grubcmdline