NVMe unsupported #4
Comments
OK, I know what's needed now. I installed only one NVMe card, but GDS needs at least RAID 0, which needs at least two NVMe cards. |
Hi @dearsxx0918, my output also shows NVMe : Unsupported. After that, I saw your post and configured RAID 0, but it still shows Unsupported. Do I need to reinstall the Mellanox OFED driver and CUDA? Thank you |
Even with RAID 0 configuration, I have been seeing only |
I still don't know what's going on there. |
I have the same problem. I have read a lot of documents, but I still have no idea how to resolve it. |
Any update? |
Maybe I can help, I'm using only 1 out of 2 NVMe cards for GDS: Have you mounted the NVMe in data=ordered mode? https://docs.nvidia.com/gpudirect-storage/troubleshooting-guide/index.html#mount-local-fs |
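For anyone who wants to check the mount mode quickly, here is a minimal sketch. It assumes a /proc/mounts-style listing; mount_data_mode is a hypothetical helper name, not part of any GDS tool. On some kernels the default ext4 mode may not be listed in the options at all, so "unlisted" does not by itself mean a different mode is in use.

```shell
# Print the journaling mode listed for a mount point, "unlisted" when the
# options column has no data= entry, or "not-mounted" when there is no match.
# $1: mount point, $2: mounts-format file (defaults to /proc/mounts).
mount_data_mode() {
    awk -v mp="$1" '
        $2 == mp {
            mode = "unlisted"
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] ~ /^data=/) mode = substr(opts[i], 6)
            found = 1
        }
        END { if (found) print mode; else print "not-mounted" }
    ' "${2:-/proc/mounts}"
}

# Example: mount_data_mode /home
```

If the mode is unlisted, the filesystem's own options file (for ext4, under /proc/fs/ext4/) may show the effective value.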
Yes, it is in data=ordered mode. I only have one NVMe device, where the system and everything else is installed. I did everything the guide asks and even got a "Supported" flag, yet it still doesn't work. My conclusion for now is that I need a separate NVMe device in RAID 0, or else it won't work, right? |
Maybe try installing OFED with NVMe options? That helps in my case. |
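A rough sketch of what that could look like, under two assumptions: that MLNX_OFED provides the ofed_info tool on PATH, and that the installer flags in the comment match your release. The flags are written as I recall them from the GDS installation guide, so please verify them before running anything. ofed_version is a hypothetical helper name.

```shell
# Print the installed MLNX_OFED version, or a notice when the ofed_info
# tool is not on PATH (which usually means OFED is not installed).
ofed_version() {
    if command -v ofed_info >/dev/null 2>&1; then
        ofed_info -s
    else
        echo "ofed_info not found"
    fi
}

ofed_version

# Reinstalling with NVMe and GDS support (flags are an assumption recalled
# from the GDS installation guide; check them against your MLNX_OFED release):
#   sudo ./mlnxofedinstall --with-nvmf --with-nfsrdma --enable-gds --add-kernel-support
```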
Would my environment be helpful to you?
[1] OS Version
[2] CUDA & GDS Package(deb files):
[3] check1 : Loaded Kernel modules
[4] check2 : IOMMU is disabled
[5] check3 : PCIe Topology => GPU and NVMe devices on the same PXL Switch
[6] check4 : ACS Control is disabled => You get "ACSViol-"
[7] check5: Whether the NVMe drive is formatted as ext4 or xfs, and LVM isn't used? |
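For the ACS item, here is a minimal sketch of one way to look at it, assuming the verbose lspci output format. As far as I understand that output, the ACS control bits are reported on ACSCtl lines (a "+" suffix means the bit is enabled), while ACSViol is an error-status bit, so it may be worth looking at both. acs_enabled_lines is a hypothetical helper name.

```shell
# Read `lspci -vvv` text on stdin and print every ACSCtl line that has at
# least one control bit enabled (a "+" suffix). No output means no enabled
# ACS control bits were found in the input.
acs_enabled_lines() {
    grep 'ACSCtl:' | grep -F '+' || true
}

# Example: sudo lspci -vvv | acs_enabled_lines
```

Empty output can also mean that lspci was run without root and did not show the capability at all, so it is safer to run it with sudo.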
Hello all, thank you for your comments, and especially for the checklist. Here are my current results:
[1] OS Version
➜ ~ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 22.04.2 LTS
Release: 22.04
Codename: jammy
➜ ~ uname -r
5.19.0-38-generic
I couldn't find my OFED version, but I'm sure it's installed. If there is a command for it, please tell me.
[2] CUDA & GDS Package(deb files):
➜ ~ dpkg -l | grep cuda-tools
ii cuda-tools-12-1 12.1.1-1 amd64 CUDA Tools meta-package
➜ ~ dpkg -l | grep gds
ii gds-tools-12-1 1.6.1.9-1 amd64 Tools for GPU Direct Storage
ii nvidia-gds 12.1.1-1 amd64 GPU Direct Storage meta-package
ii nvidia-gds-12-1 12.1.1-1 amd64 GPU Direct Storage 12.1 meta-package
➜ ~ dpkg -l | grep nvidia-fs
ii nvidia-fs 2.15.3-1 amd64 NVIDIA filesystem for GPUDirect Storage
ii nvidia-fs-dkms 2.15.3-1 amd64 NVIDIA filesystem DKMS package
[3] check1 : Loaded Kernel modules
➜ ~ lsmod | grep nvidia_fs
nvidia_fs 262144 0
➜ ~ lsmod | grep nvme_core
nvme_core 147456 7 nvme,nvme_fabrics
mlx_compat 20480 14 rdma_cm,ib_ipoib,mlxdevm,nvme,iw_cm,nvme_core,nvme_fabrics,ib_umad,ib_core,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm,mlx5_core
[4] check2 : IOMMU is disabled
➜ ~ sudo dmesg | grep -i iommu
[ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.19.0-38-generic root=UUID=a8c08d82-da23-4ec4-be78-3fa59ddedb73 ro intel_iommu=off quiet splash split_lock_detect=off vt.handoff=7
[ 0.082366] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.19.0-38-generic root=UUID=a8c08d82-da23-4ec4-be78-3fa59ddedb73 ro intel_iommu=off quiet splash split_lock_detect=off vt.handoff=7
[ 0.082389] DMAR: IOMMU disabled
[ 0.175791] DMAR-IR: IOAPIC id 2 under DRHD base 0xfed91000 IOMMU 0
[ 0.429342] iommu: Default domain type: Translated
[ 0.429342] iommu: DMA domain TLB invalidation policy: lazy mode
[5] check3 : PCIe Topology => GPU and NVMe devices on the same PXL Switch
lspci -tv | grep Samsung -A 3 | grep -v Intel
+-1b.4-[04]----00.0 Samsung Electronics Co Ltd Device a80c
+-1c.0-[05]--
+-1d.0-[07]--
It seems my NVMe is not in the best slot for GPU communication. Is this a requirement? I will try to fix it soon.
[6] check4 : ACS Control is disabled => You get "ACSViol-"
OK, so this command has an empty result. I'm not sure what to do in this case.
[7] check5: Whether the NVMe drive is formatted as ext4 or xfs, and LVM isn't used?
➜ ~ lsblk -f | grep nvme
nvme0n1
├─nvme0n1p1 vfat FAT32 BA30-5B86 86.5M 7% /boot/efi
├─nvme0n1p2 ext4 1.0 a8c08d82-da23-4ec4-be78-3fa59ddedb73 57.2G 32% /var/snap/firefox/common/host-hunspell
├─nvme0n1p3 swap 1 8d915ce4-8a76-414d-8416-15a8c82cf34f [SWAP]
└─nvme0n1p4 ext4 1.0 645675cd-945d-4bd8-a1a1-8e7568294cdf 911.7G 41% /home
My intention is to use nvme0n1p4. I hope this configuration is not the problem. Thank you very much for the help. |
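As a cross-check for the IOMMU item, here is a small sketch that looks only at the kernel command line, assuming the intel_iommu=off / amd_iommu=off parameters are how the IOMMU was turned off. iommu_cmdline_state is a hypothetical helper name. For the command line in the dmesg output above it would report "disabled", which agrees with the "DMAR: IOMMU disabled" line.

```shell
# Report whether a kernel command line turns the IOMMU off.
# $1: command line text (defaults to the contents of /proc/cmdline).
iommu_cmdline_state() {
    cmdline="${1:-$(cat /proc/cmdline 2>/dev/null)}"
    case " $cmdline " in
        *" intel_iommu=off "*|*" amd_iommu=off "*) echo "disabled" ;;
        *) echo "not disabled on command line" ;;
    esac
}

# Example: iommu_cmdline_state
```

This only inspects boot parameters; a firmware (BIOS) setting could also affect the IOMMU, so the dmesg output remains the more direct evidence.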
Hi, @Pedrexus
[1] Can I see your environment?
[2] How to disable ACS (Access Control Service)
EX: Supermicro
[3] Check using
(1) Mount NVMe
(2) Create test file for gdsio_verify
(3) Check gdsio_verify on GPUID0
(4) Create directory for gdsio
(5) Write test on CPU only mode
-x (xfer_type) option:
(6) Write test on GDS of GPUID 0
➜ Display error when GDS does not work
(7) Write test on CPU -> GPUID 0
(Ex) Read test on GDS of GPUID 0
Note: |
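The commands for these steps did not survive in this copy of the thread, so the following is only a sketch of how the write tests might be invoked, not the exact commands from the comment above. It assumes gdsio lives in /usr/local/cuda/gds/tools and that -x takes 0 for the GDS path, 1 for CPU only, and 2 for CPU -> GPU. The helper name gdsio_write_cmd and the worker count, file size, and I/O size are illustrative choices.

```shell
# Build a gdsio write-test command line for one transfer mode.
# $1: mode name (gds | cpu | cpu_gpu), $2: target directory, $3: GPU index.
gdsio_write_cmd() {
    case "$1" in
        gds)     xfer=0 ;;   # storage <-> GPU directly (GDS path)
        cpu)     xfer=1 ;;   # storage <-> CPU only
        cpu_gpu) xfer=2 ;;   # storage <-> CPU bounce buffer <-> GPU
        *) echo "unknown mode: $1" >&2; return 1 ;;
    esac
    # -D directory, -d GPU index, -w worker threads, -s file size,
    # -i I/O size, -x transfer type, -I 1 = write
    echo "/usr/local/cuda/gds/tools/gdsio -D $2 -d $3 -w 4 -s 1G -i 1M -x $xfer -I 1"
}

# Example (hypothetical directory): gdsio_write_cmd gds /mnt/nvme/gds_dir 0
```

The function only prints the command, so it can be reviewed before being run by hand.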
@wakaba-best Before Changes
After Changes to cufile.json
|
@wakaba-best
|
How can I change the direct I/O size (KB) to 512M? |
@wakaba-best, please check the lines below from cufile.log for the failing command above with a 500M I/O size
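A hedged pointer for this question: as far as I know, the relevant setting is the max_direct_io_size_kb key in the properties section of cufile.json (by default /etc/cufile.json). The sketch below only reads the current value and computes 512 MiB in KB. Whether cuFile accepts a value that large is not confirmed anywhere in this thread, so treat it as something to verify against the documentation. cufile_value is a hypothetical helper name.

```shell
# Print the numeric value of a key from a cufile.json-style file.
# sed is used instead of a JSON parser because the file may contain comments.
# $1: key name, $2: file (defaults to /etc/cufile.json).
cufile_value() {
    sed -n "s/.*\"$1\"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p" "${2:-/etc/cufile.json}"
}

# 512 MiB expressed in KB, the unit the key name suggests
echo $((512 * 1024))

# Example: cufile_value max_direct_io_size_kb
```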
|
@wakaba-best, is there anything on the above that you can share? |
Recently I have been struggling to get NVMe supported. Could you please show me your installation process? Thanks. |
@Murphy-AI Can you tell me the steps you use to install GDS? I am following the documentation, but it is not working. |
Hi,
I set up GDS on my machine, but I get the output below:
NVMe : Unsupported
NVMeOF : Unsupported
SCSI : Unsupported
ScaleFlux CSD : Unsupported
NVMesh : Unsupported
DDN EXAScaler : Unsupported
IBM Spectrum Scale : Unsupported
NFS : Unsupported
WekaFS : Unsupported
Userspace RDMA : Unsupported
I can't use GDS with NVMe now. Can you give me some advice?
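To make this kind of output easier to compare between machines, here is a minimal sketch that filters it by status. It assumes the "name : status" layout shown above and that the text comes from the gdscheck tool (gdscheck.py in some releases) run with -p. drivers_with_status is a hypothetical helper name.

```shell
# Read gdscheck-style "name : status" lines on stdin and print the names of
# entries whose status equals $1 ("Supported" or "Unsupported").
drivers_with_status() {
    awk -F ':' -v want="$1" '
        NF == 2 {
            name = $1; status = $2
            gsub(/^[ \t]+|[ \t\r]+$/, "", name)     # trim the name
            gsub(/^[ \t]+|[ \t\r]+$/, "", status)   # trim the status
            if (status == want) print name
        }'
}

# Example (assumed install path):
#   /usr/local/cuda/gds/tools/gdscheck -p | drivers_with_status Unsupported
```

For the output in this post, every listed entry would be printed for "Unsupported" and nothing for "Supported".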