NVMe unsupported #4

Open
dearsxx0918 opened this issue Dec 1, 2021 · 19 comments


@dearsxx0918

Hi,
I set up GDS on my machine, but I get the output below:
NVMe : Unsupported
NVMeOF : Unsupported
SCSI : Unsupported
ScaleFlux CSD : Unsupported
NVMesh : Unsupported
DDN EXAScaler : Unsupported
IBM Spectrum Scale : Unsupported
NFS : Unsupported
WekaFS : Unsupported
Userspace RDMA : Unsupported

I can't use GDS with NVMe right now. Can you give me some advice?

@dearsxx0918
Author

OK, I know what was needed now: I installed only one NVMe card, but my GDS setup needs at least RAID 0, which requires at least two NVMe cards (see the sketch below).
Another question: does the P100 support GDS? The GDS library suggests that the P100 is supported, and I want to confirm this.
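
A minimal sketch of that RAID 0 setup, assuming /dev/nvme0n1 and /dev/nvme1n1 are spare drives (their contents will be destroyed) and /mnt/gds is the intended mount point:

$ # assemble a two-member RAID 0 md array
$ sudo mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1
$ # format it as ext4 and mount it in the journaling mode GDS expects
$ sudo mkfs.ext4 /dev/md0
$ sudo mkdir -p /mnt/gds
$ sudo mount -o data=ordered /dev/md0 /mnt/gds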

@tunglambk

Hi @dearsxx0918
How did you resolve the issue?
I have two NVMe cards, but I didn't configure RAID 0 before installing the Mellanox OFED driver and CUDA.
This is my result when checking GDS:

NVMe : Unsupported
NVMeOF : Unsupported
SCSI : Unsupported

After that, I saw your post and configured RAID 0, but it still shows Unsupported. Do I need to reinstall the Mellanox OFED driver and CUDA?

Thank you

@9prady9

9prady9 commented Apr 5, 2022

Even with a RAID 0 configuration, I see only Unsupported in the gdscheck driver configuration output. I have tried all the troubleshooting steps in the NVIDIA documentation.

@dearsxx0918
Author

I still don't know what's going on.

@zhouaoe

zhouaoe commented Dec 29, 2022

I've hit the same problem. I have read a lot of documents but still have no idea how to resolve it.

@Pedrexus

Pedrexus commented Apr 4, 2023

Any update?

@UTKRISHTPATESARIA

Maybe I can help; I'm using only one of my two NVMe cards for GDS.

Have you mounted the NVMe in data=ordered mode?

https://docs.nvidia.com/gpudirect-storage/troubleshooting-guide/index.html#mount-local-fs
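
A minimal sketch of that mount, assuming /dev/nvme0n1p1 is the GDS data partition and /mnt/nvme is its mount point:

$ sudo mount -o data=ordered /dev/nvme0n1p1 /mnt/nvme
$ # confirm the journal mode is active
$ mount | grep data=ordered
/dev/nvme0n1p1 on /mnt/nvme type ext4 (rw,relatime,data=ordered)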

@Pedrexus

Pedrexus commented Apr 26, 2023

Yes, it is in data=ordered mode.

I only have one NVMe device, where the system and everything else is installed. I did everything the guide asks and even got a "Supported" flag, yet it still doesn't work.

My conclusion for now is that I need a separate NVMe device in RAID 0, or else it won't work. Is that right?

@ExtremeViscent

Maybe try installing OFED with the NVMe options? That helped in my case.
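
For reference, a sketch of what the NVMe options can look like with the MLNX_OFED installer; the exact flags differ between OFED releases, so check ./mlnxofedinstall --help first:

$ # reinstall MLNX_OFED with NVMe-oF support, rebuilt for the running kernel
$ sudo ./mlnxofedinstall --with-nvmf --add-kernel-support
$ # rebuild and reload nvidia-fs so it picks up the new OFED modules
$ sudo apt install --reinstall nvidia-fs-dkms
$ sudo modprobe nvidia_fs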

@wakaba-best

wakaba-best commented May 1, 2023

Perhaps my environment can be helpful to you:

$ gdscheck -p
 GDS release version: 1.0.1.3
 nvidia_fs version:  2.7 libcufile version: 2.4
 ============
 ENVIRONMENT:
 ============
 =====================
 DRIVER CONFIGURATION:
 =====================
 NVMe               : Supported
 NVMeOF             : Supported
 SCSI               : Unsupported
 ScaleFlux CSD      : Unsupported
 NVMesh             : Unsupported
 DDN EXAScaler      : Unsupported
 IBM Spectrum Scale : Unsupported
 NFS                : Unsupported
 WekaFS             : Unsupported
 Userspace RDMA     : Unsupported
 --Mellanox PeerDirect : Enabled
 --rdma library        : Not Loaded (libcufile_rdma.so)
 --rdma devices        : Not configured
 --rdma_device_status  : Up: 0 Down: 0
 =====================
 CUFILE CONFIGURATION:
 =====================
 properties.use_compat_mode : true
 properties.gds_rdma_write_support : true
 properties.use_poll_mode : false
 properties.poll_mode_max_size_kb : 4
 properties.max_batch_io_timeout_msecs : 5
 properties.max_direct_io_size_kb : 16384
 properties.max_device_cache_size_kb : 131072
 properties.max_device_pinned_mem_size_kb : 33554432
 properties.posix_pool_slab_size_kb : 4 1024 16384
 properties.posix_pool_slab_count : 128 64 32
 properties.rdma_peer_affinity_policy : RoundRobin
 properties.rdma_dynamic_routing : 0
 fs.generic.posix_unaligned_writes : false
 fs.lustre.posix_gds_min_kb: 0
 fs.weka.rdma_write_support: false
 profile.nvtx : false
 profile.cufile_stats : 0
 miscellaneous.api_check_aggressive : false
 =========
 GPU INFO:
 =========
 GPU index 0 Tesla V100-PCIE-32GB bar:1 bar size (MiB):32768 supports GDS
 GPU index 1 Tesla V100-PCIE-32GB bar:1 bar size (MiB):32768 supports GDS
 GPU index 2 Tesla V100-PCIE-32GB bar:1 bar size (MiB):32768 supports GDS
 GPU index 3 Tesla V100-PCIE-32GB bar:1 bar size (MiB):32768 supports GDS
 ==============
 PLATFORM INFO:
 ==============
 IOMMU: disabled
 Platform verification succeeded

[1] OS Version

  • OS : Ubuntu Server 20.04.5 LTS
  • Kernel : 5.4.0-144-generic
  • OFED : MLNX_OFED_LINUX-5.8-2.0.3.0

[2] CUDA & GDS Package(deb files):

$ dpkg -l | grep cuda-tools
ii  cuda-tools-11-4                       11.4.1-1                                amd64        CUDA Tools meta-package
$ dpkg -l | grep gds
ii  gds-tools-11-4                        1.0.1.3-1                               amd64        Tools for GPU Direct Storage
$ dpkg -l | grep nvidia-fs
ii  nvidia-fs                             2.7.50-1                                amd64        NVIDIA filesystem for GPUDirect Storage
ii  nvidia-fs-dkms                        2.7.50-1                                amd64        NVIDIA filesystem DKMS package

[3] check1 : Loaded Kernel modules

$ lsmod | grep nvidia_fs
nvidia_fs             245760  0
ib_core               348160  10 rdma_cm,ib_ipoib,nvme_rdma,iw_cm,nvidia_fs,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
$ lsmod | grep nvme_core
nvme_core             110592  3 nvme,nvme_rdma,nvme_fabrics
mlx_compat             65536  16 rdma_cm,ib_ipoib,mlxdevm,nvme,nvme_rdma,iw_cm,nvme_core,auxiliary,nvme_fabrics,ib_umad,ib_core,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm,mlx5_core

[4] check2 : IOMMU is disabled

$ dmesg | grep -i iommu
[    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.4.0-144-generic root=UUID=318c92d2-8567-4d37-acba-4050de3146d9 ro intel_iommu=off
[    1.355073] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.4.0-144-generic root=UUID=318c92d2-8567-4d37-acba-4050de3146d9 ro intel_iommu=off
[    1.355163] DMAR: IOMMU disabled
[    2.328217] DMAR-IR: IOAPIC id 12 under DRHD base  0xc5ffc000 IOMMU 6
[    2.328219] DMAR-IR: IOAPIC id 11 under DRHD base  0xb87fc000 IOMMU 5
[    2.328221] DMAR-IR: IOAPIC id 10 under DRHD base  0xaaffc000 IOMMU 4
[    2.328223] DMAR-IR: IOAPIC id 18 under DRHD base  0xfbffc000 IOMMU 3
[    2.328225] DMAR-IR: IOAPIC id 17 under DRHD base  0xee7fc000 IOMMU 2
[    2.328227] DMAR-IR: IOAPIC id 16 under DRHD base  0xe0ffc000 IOMMU 1
[    2.328229] DMAR-IR: IOAPIC id 15 under DRHD base  0xd37fc000 IOMMU 0
[    2.328232] DMAR-IR: IOAPIC id 8 under DRHD base  0x9d7fc000 IOMMU 7
[    2.328234] DMAR-IR: IOAPIC id 9 under DRHD base  0x9d7fc000 IOMMU 7
[    3.468127] iommu: Default domain type: Translated
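
If the IOMMU had not been disabled as shown here, the usual fix is a kernel boot parameter; a sketch assuming GRUB on an Intel platform (use amd_iommu=off on AMD):

$ # in /etc/default/grub, append the option to the kernel command line:
$ #   GRUB_CMDLINE_LINUX_DEFAULT="... intel_iommu=off"
$ sudo update-grub
$ sudo reboot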

[5] check3 : PCIe Topology => GPU and NVMe devices under the same PCIe switch

$ lspci -tv | grep NVMe -A 3 | grep -v Intel
 +-[0000:3a]-+-00.0-[3b-41]----00.0-[3c-41]--+-04.0-[3d]----00.0  Toshiba Corporation NVMe SSD Controller Cx5
 |           |                               +-08.0-[3e]--
 |           |                               +-0c.0-[3f]----00.0  NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB]
 |           |                               +-10.0-[40]----00.0  NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB]
 |           |                               \-14.0-[41]----00.0  Toshiba Corporation NVMe SSD Controller Cx5
--
 |           |                               +-08.0-[1b]----00.0  Toshiba Corporation NVMe SSD Controller Cx5
 |           |                               +-0c.0-[1c]----00.0  NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB]
 |           |                               +-10.0-[1d]----00.0  NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB]
 |           |                               \-14.0-[1e]----00.0  Toshiba Corporation NVMe SSD Controller Cx5

[6] check4 : ACS control is disabled => you should see "ACSViol-"

$ sudo lspci -s 1D:00.0 -vvvv | grep -i acs
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP+ SDES- TLP+ FCP+ CmpltTO+ CmpltAbrt+ UnxCmplt+ RxOF+ MalfTLP+ ECRC- UnsupReq+ ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-

[7] check5 : The NVMe drive is formatted as ext4 or xfs, and LVM is not used
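
A quick way to run this last check, assuming the drive is /dev/nvme0n1; lsblk shows the filesystem type, and an empty device-mapper listing confirms LVM is not in the IO path:

$ lsblk -f /dev/nvme0n1
$ sudo dmsetup ls
No devices found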

@Pedrexus

Pedrexus commented May 2, 2023

Hello all,

thank you for your comments and especially for the checklist. Here are my current results:

[1] OS Version

~ lsb_release -a              
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 22.04.2 LTS
Release:	22.04
Codename:	jammy

➜  ~ uname -r
5.19.0-38-generic

I couldn't find my OFED version, but I'm sure it's installed. If there is a command for it, please tell me.
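
For what it's worth, MLNX_OFED installs a query tool for exactly this; assuming a standard installation, it prints the release string, e.g.:

$ ofed_info -s
MLNX_OFED_LINUX-5.8-2.0.3.0: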

[2] CUDA & GDS Package(deb files):

~ dpkg -l | grep cuda-tools
ii  cuda-tools-12-1                            12.1.1-1                                 amd64        CUDA Tools meta-package
➜  ~ dpkg -l | grep gds
ii  gds-tools-12-1                             1.6.1.9-1                                amd64        Tools for GPU Direct Storage
ii  nvidia-gds                                 12.1.1-1                                 amd64        GPU Direct Storage meta-package
ii  nvidia-gds-12-1                            12.1.1-1                                 amd64        GPU Direct Storage 12.1 meta-package
➜  ~ dpkg -l | grep nvidia-fs
ii  nvidia-fs                                  2.15.3-1                                 amd64        NVIDIA filesystem for GPUDirect Storage
ii  nvidia-fs-dkms                             2.15.3-1                                 amd64        NVIDIA filesystem DKMS package

[3] check1 : Loaded Kernel modules

~ lsmod | grep nvidia_fs
nvidia_fs             262144  0
➜  ~ lsmod | grep nvme_core
nvme_core             147456  7 nvme,nvme_fabrics
mlx_compat             20480  14 rdma_cm,ib_ipoib,mlxdevm,nvme,iw_cm,nvme_core,nvme_fabrics,ib_umad,ib_core,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm,mlx5_core

[4] check2 : IOMMU is disabled

~ sudo dmesg | grep -i iommu
[    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.19.0-38-generic root=UUID=a8c08d82-da23-4ec4-be78-3fa59ddedb73 ro intel_iommu=off quiet splash split_lock_detect=off vt.handoff=7
[    0.082366] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.19.0-38-generic root=UUID=a8c08d82-da23-4ec4-be78-3fa59ddedb73 ro intel_iommu=off quiet splash split_lock_detect=off vt.handoff=7
[    0.082389] DMAR: IOMMU disabled
[    0.175791] DMAR-IR: IOAPIC id 2 under DRHD base  0xfed91000 IOMMU 0
[    0.429342] iommu: Default domain type: Translated 
[    0.429342] iommu: DMA domain TLB invalidation policy: lazy mode

[5] check3 : PCIe Topology => GPU and NVMe devices under the same PCIe switch

lspci -tv | grep Samsung -A 3 | grep -v Intel
           +-1b.4-[04]----00.0  Samsung Electronics Co Ltd Device a80c
           +-1c.0-[05]--
           +-1d.0-[07]--

It seems my NVMe is not in the best slot for GPU communication. Is this a requirement? I will try to fix it soon.

[6] check4 : ACS control is disabled => you should see "ACSViol-"

OK, so this command has an empty result. I'm not sure what to do in this case.
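
Since filtering a single device came up empty, a broader scan may help; a sketch that prints the ACS control bits for every device exposing them (for GDS they should all read '-'):

$ for dev in $(lspci | awk '{print $1}'); do
>   out=$(sudo lspci -s "$dev" -vvv | grep ACSCtl)
>   [ -n "$out" ] && echo "$dev: $out"
> done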

[7] check5 : The NVMe drive is formatted as ext4 or xfs, and LVM is not used

~ lsblk -f | grep nvme
nvme0n1                                                                              
├─nvme0n1p1 vfat     FAT32       BA30-5B86                              86.5M     7% /boot/efi
├─nvme0n1p2 ext4     1.0         a8c08d82-da23-4ec4-be78-3fa59ddedb73   57.2G    32% /var/snap/firefox/common/host-hunspell
├─nvme0n1p3 swap     1           8d915ce4-8a76-414d-8416-15a8c82cf34f                [SWAP]
└─nvme0n1p4 ext4     1.0         645675cd-945d-4bd8-a1a1-8e7568294cdf  911.7G    41% /home

My intention is to use nvme0n1p4. I hope this configuration is not the problem.

Thank you very much for the help.

@wakaba-best

Hi, @Pedrexus

[1] Can I see your environment?

$ gdscheck -p

[2] How to disable ACS (Access Control Services)
You should check the BIOS settings.
However, each manufacturer has its own way of setting up the system.

Example: Supermicro
https://www.supermicro.com/support/faqs/faq.cfm?faq=22226
https://www.supermicro.com/support/faqs/faq.cfm?faq=31883
https://www.supermicro.com/support/faqs/faq.cfm?faq=20732

[3] Check using gdsio_verify and gdsio commands
https://docs.nvidia.com/gpudirect-storage/troubleshooting-guide/index.html#gds-data-verif-tests
https://docs.nvidia.com/gpudirect-storage/configuration-guide/index.html#gdsio

(1) Mount NVMe

~$ sudo mkdir -p /mnt/gds/ext4
~$ sudo mount -o data=ordered /dev/nvme0n1 /mnt/gds/ext4
~$ mount | grep ext4
/dev/sda2 on / type ext4 (rw,relatime)
/dev/nvme0n1 on /mnt/gds/ext4 type ext4 (rw,relatime,data=ordered)

(2) Create test file for gdsio_verify

~$ sudo dd if=/dev/random of=/mnt/gds/ext4/test-fs-1G bs=1024k count=1000

(3) Check gdsio_verify on GPUID0

~$ sudo /usr/local/cuda/gds/tools/gdsio_verify -d 0 -f /mnt/gds/ext4/test-fs-1G -n 1 -s 1G
gpu index :0,file :/mnt/gds/ext4/test-fs-1G, gpu buffer alignment :0, gpu buffer offset :0, gpu devptr offset :0, file offset :0, io_requested :1073741824, io_chunk_size :1073741824, bufregister :true, sync :1, nr ios :1,
fsync :0,
Data Verification Success

(4) Create directory for gdsio

~$ sudo mkdir /mnt/gds/ext4/gds_dir

(5) Write test on CPU only mode

~$ sudo /usr/local/cuda/gds/tools/gdsio -x 1 -D /mnt/gds/ext4/gds_dir -w 8 -s 1G -i 1M -I 1
IoType: WRITE XferType: CPUONLY Threads: 8 DataSetSize: 8381440/8388608(KiB) IOSize: 1024(KiB) Throughput: 2.460552 GiB/sec, Avg_Latency: 3174.274946 usecs ops: 8185 total_time 3.248525 secs

-x (xfer_type) option:
0 - Storage->GPU (GDS)
1 - Storage->CPU
2 - Storage->CPU->GPU
3 - Storage->CPU->GPU_ASYNC
4 - Storage->PAGE_CACHE->CPU->GPU
5 - Storage->GPU_ASYNC

(6) Write test on GDS of GPUID 0

~$ sudo /usr/local/cuda/gds/tools/gdsio -x 0 -d 0 -D /mnt/gds/ext4/gds_dir -w 8 -s 1G -i 1M -I 1
IoType: WRITE XferType: GPUD Threads: 8 DataSetSize: 8381440/8388608(KiB) IOSize: 1024(KiB) Throughput: 2.441271 GiB/sec, Avg_Latency: 3199.571372 usecs ops: 8185 total_time 3.274181 secs

➜ This step displays an error when GDS is not working.

(7) Write test on CPU -> GPUID 0

~$ sudo /usr/local/cuda/gds/tools/gdsio -x 2 -d 0 -D /mnt/gds/ext4/gds_dir -w 8 -s 1G -i 1M -I 1
IoType: WRITE XferType: CPU_GPU Threads: 8 DataSetSize: 8381440/8388608(KiB) IOSize: 1024(KiB) Throughput: 2.474772 GiB/sec, Avg_Latency: 3156.123118 usecs ops: 8185 total_time 3.229859 secs

(Ex) Read test on GDS of GPUID 0

$ sudo /usr/local/cuda/gds/tools/gdsio -x 0 -d 0 -D /mnt/gds/ext4/gds_dir -w 8 -s 1G -i 1M -I 0
IoType: READ XferType: GPUD Threads: 8 DataSetSize: 7141376/8388608(KiB) IOSize: 1024(KiB) Throughput: 3.163243 GiB/sec, Avg_Latency: 2517.087344 usecs ops: 6974 total_time 2.153027 secs

Note:
A read test (-I 0) with the verify option (-V) should be used with files written (-I 1) with the -V option.
A random read test (-I 2) with the verify option (-V) should be used with files written by a random write test (-I 3) with the -V option, using the same random seed (-k), number of threads (-w), offset (-o), and data size (-s).
A write test (-I 1/3) with the verify option (-V) performs writes followed by reads.
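
Putting that note to work, a minimal write-then-read verification pass on GPU 0, reusing the directory from the steps above:

$ # write files with embedded verification data (-I 1 -V)
$ sudo /usr/local/cuda/gds/tools/gdsio -x 0 -d 0 -D /mnt/gds/ext4/gds_dir -w 4 -s 1G -i 1M -I 1 -V
$ # read the same files back and verify their contents (-I 0 -V)
$ sudo /usr/local/cuda/gds/tools/gdsio -x 0 -d 0 -D /mnt/gds/ext4/gds_dir -w 4 -s 1G -i 1M -I 0 -V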

@karanveersingh5623

karanveersingh5623 commented Oct 11, 2023

@wakaba-best
I am trying to run the command below, and it's failing. I want to reach around 6-13 GB/s read for NVMe devices on a Lustre filesystem. Please let me know where to make the changes.

Before Changes

[root@node002 ~]# /usr/local/cuda-11.7/gds/tools/gdscheck.py -p
 GDS release version: 1.3.1.18
 nvidia_fs version:  2.17 libcufile version: 2.12
 Platform: x86_64
 ============
 ENVIRONMENT:
 ============
 =====================
 DRIVER CONFIGURATION:
 =====================
 NVMe               : Unsupported
 NVMeOF             : Unsupported
 SCSI               : Unsupported
 ScaleFlux CSD      : Unsupported
 NVMesh             : Unsupported
 DDN EXAScaler      : Supported
 IBM Spectrum Scale : Unsupported
 NFS                : Unsupported
 BeeGFS             : Unsupported
 WekaFS             : Unsupported
 Userspace RDMA     : Unsupported
 --Mellanox PeerDirect : Disabled
 --rdma library        : Not Loaded (libcufile_rdma.so)
 --rdma devices        : Configured
 --rdma_device_status  : Up: 0 Down: 1
 =====================
 CUFILE CONFIGURATION:
 =====================
 properties.use_compat_mode : true
 properties.force_compat_mode : false
 properties.gds_rdma_write_support : true
 properties.use_poll_mode : false
 properties.poll_mode_max_size_kb : 4
 properties.max_batch_io_size : 128
 properties.max_batch_io_timeout_msecs : 5
 properties.max_direct_io_size_kb : 16384
 properties.max_device_cache_size_kb : 512000
 properties.max_device_pinned_mem_size_kb : 33554432
 properties.posix_pool_slab_size_kb : 4 1024 16384
 properties.posix_pool_slab_count : 128 64 32
 properties.rdma_peer_affinity_policy : RoundRobin
 properties.rdma_dynamic_routing : 1
 properties.rdma_dynamic_routing_order : GPU_MEM_NVLINKS GPU_MEM SYS_MEM P2P
 fs.generic.posix_unaligned_writes : false
 fs.lustre.posix_gds_min_kb: 0
 fs.lustre.rdma_dev_addr_list : 192.168.61.92
 fs.beegfs.posix_gds_min_kb: 0
 fs.weka.rdma_write_support: false
 profile.nvtx : false
 profile.cufile_stats : 3
 miscellaneous.api_check_aggressive : false
 =========
 GPU INFO:
 =========
 GPU index 0 NVIDIA A100 80GB PCIe bar:1 bar size (MiB):131072 supports GDS, IOMMU State: Disabled
 GPU index 1 NVIDIA A100 80GB PCIe bar:1 bar size (MiB):131072 supports GDS, IOMMU State: Disabled
 GPU index 2 NVIDIA A100 80GB PCIe bar:1 bar size (MiB):131072 supports GDS, IOMMU State: Disabled
 GPU index 3 NVIDIA A100 80GB PCIe bar:1 bar size (MiB):131072 supports GDS, IOMMU State: Disabled
 ==============
 PLATFORM INFO:
 ==============
 IOMMU: disabled
 Platform verification succeeded

After Changes to cufile.json

"profile": {
                            // nvtx profiling on/off
                            "nvtx": false,
                            // cufile stats level(0-3)
                            "cufile_stats": 3,
                            **"io_batchsize": 512**
            },

            "properties": {
                            // max IO chunk size (parameter should be multiples of 64K) used by cuFileRead/Write internally per IO request
                            **"max_direct_io_size_kb" : 524288,**
                            // device memory size (parameter should be 4K aligned) for reserving bounce buffers for the entire GPU
                            **"max_device_cache_size_kb" : 512000,**
                            // limit on maximum device memory size (parameter should be 4K aligned) that can be pinned for a given process
                            "max_device_pinned_mem_size_kb" : 33554432,
                            // true or false (true will enable asynchronous io submission to nvidia-fs driver)
                            // Note : currently the overall IO will still be synchronous
                            "use_poll_mode" : false,
                            // maximum IO request size (parameter should be 4K aligned) within or equal to which library will use polling for IO completion
                            "poll_mode_max_size_kb": 4,
                            // allow compat mode, this will enable use of cuFile posix read/writes
                            "allow_compat_mode": true,
                            // enable GDS write support for RDMA based storage
                            "gds_rdma_write_support": true,
                            // GDS batch size
                            **"io_batchsize": 512,**
                            // enable io priority w.r.t compute streams
                            // valid options are "default", "low", "med", "high"
                            "io_priority": "default",
[root@node002 ~]# /usr/local/cuda-11.7/gds/tools/gdscheck.py -p
 invalid directIO size (KB) specified: 512 min: 1 max: 256
 error reading config properties.io_batchsize
 failed to load config: /etc/cufile.json Invalid argument
 cuFile configuration load error
 cuFile initialization failed
 Platform verification error :
Invalid argument
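
Going by that error text, this build caps properties.io_batchsize at 256, so putting the value back in range should let the config load; a sketch of the relevant cufile.json fragment (the duplicate "io_batchsize" under "profile" is likely best removed as well):

            "properties": {
                            // GDS batch size; this build reports min: 1, max: 256
                            "io_batchsize": 256,
                            ...
            },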

@karanveersingh5623

@wakaba-best
I am trying to run the command below and am getting some errors related to cuFile buffer registration:

[root@node002 ~]# for i in 0 1 2 3; do if [ $i -eq 0 ] || [ $i -eq 1 ]; then /usr/local/cuda-11.7/gds/tools/gdsio -D /mnt/lustre/gds -d $i -n 0 -w 128 -s 1G -i 500M -x 0 -I 3 & else /usr/local/cuda-11.7/gds/tools/gdsio -D /mnt/lustre/gds -d $i -n 1 -w 128 -s 1G -i 500M -x 0 -I 3 & fi; done
[1] 1442728
[2] 1442729
[3] 1442730
[4] 1442731
[root@node002 ~]#
[root@node002 ~]#
[root@node002 ~]#
[root@node002 ~]# cuFile buffer register failed :internal error
cuFile buffer register failed :internal error
cuFile buffer register failed :internal error
cuFile buffer register failed :internal error
cuFile buffer register failed :internal error
cuFile buffer register failed :internal error
cuFile buffer register failed :internal error
cuFile buffer register failed :internal error

@karanveersingh5623

How do I change the directIO size (KB) to 512M?
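
If the goal is a 512 MiB direct IO chunk, the key is properties.max_direct_io_size_kb, which takes KiB (512 MiB = 524288), and the config above already sets it; the loader error instead points at properties.io_batchsize being out of range. A sketch of the fragment, with no guarantee this build accepts a value that large:

            "properties": {
                            // max IO chunk size in KiB (multiple of 64K); 524288 KiB = 512 MiB
                            "max_direct_io_size_kb": 524288,
                            ...
            },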

@karanveersingh5623

@wakaba-best, please check the cufile.log lines below for the failing command above with the 500M IO size:

11-10-2023 15:49:50:24 [pid=1502621 tid=1502621] NOTICE  cufio-rdma:175 nvidia_peermem.ko is not loaded. Disabling UserSpace RDMA access.
 11-10-2023 15:49:50:31 [pid=1502619 tid=1502619] NOTICE  cufio-rdma:175 nvidia_peermem.ko is not loaded. Disabling UserSpace RDMA access.
 11-10-2023 15:49:50:34 [pid=1502621 tid=1502621] ERROR  cufio-dr:229 No matching pair for network device to closest GPU found in the platform
 11-10-2023 15:49:50:36 [pid=1502618 tid=1502618] NOTICE  cufio-rdma:175 nvidia_peermem.ko is not loaded. Disabling UserSpace RDMA access.
 11-10-2023 15:49:50:38 [pid=1502620 tid=1502620] NOTICE  cufio-rdma:175 nvidia_peermem.ko is not loaded. Disabling UserSpace RDMA access.
 11-10-2023 15:49:50:40 [pid=1502619 tid=1502619] ERROR  cufio-dr:229 No matching pair for network device to closest GPU found in the platform
 11-10-2023 15:49:50:44 [pid=1502618 tid=1502618] ERROR  cufio-dr:229 No matching pair for network device to closest GPU found in the platform
 11-10-2023 15:49:50:46 [pid=1502620 tid=1502620] ERROR  cufio-dr:229 No matching pair for network device to closest GPU found in the platform
 11-10-2023 15:49:51:316 [pid=1502618 tid=1503329] ERROR  0:1072 Inc-bar-usage failed: size 524288000 remaining bytes 281018368
 11-10-2023 15:49:51:317 [pid=1502618 tid=1503329] ERROR  0:410 update bar usage failed
 11-10-2023 15:49:51:317 [pid=1502618 tid=1503329] ERROR  cufio-obj:101 error allocating nvfs handle, size: 524288000
 11-10-2023 15:49:51:317 [pid=1502618 tid=1503329] ERROR  cufio:1185 cuFileBufRegister error, object allocation failed
 11-10-2023 15:49:51:317 [pid=1502618 tid=1503329] ERROR  cufio:1236 cuFileBufRegister error internal error
 11-10-2023 15:49:51:317 [pid=1502618 tid=1503320] ERROR  0:1072 Inc-bar-usage failed: size 524288000 remaining bytes 281018368
 11-10-2023 15:49:51:317 [pid=1502618 tid=1503320] ERROR  0:410 update bar usage failed
 11-10-2023 15:49:51:317 [pid=1502618 tid=1503320] ERROR  cufio-obj:101 error allocating nvfs handle, size: 524288000
 11-10-2023 15:49:51:317 [pid=1502618 tid=1503320] ERROR  cufio:1185 cuFileBufRegister error, object allocation failed
 11-10-2023 15:49:51:317 [pid=1502618 tid=1503320] ERROR  cufio:1236 cuFileBufRegister error internal error
 11-10-2023 15:49:51:318 [pid=1502618 tid=1502854] ERROR  0:1072 Inc-bar-usage failed: size 524288000 remaining bytes 281018368
 11-10-2023 15:49:51:318 [pid=1502618 tid=1502854] ERROR  0:410 update bar usage failed
 11-10-2023 15:49:51:318 [pid=1502618 tid=1502854] ERROR  cufio-obj:101 error allocating nvfs handle, size: 524288000
 11-10-2023 15:49:51:318 [pid=1502618 tid=1502854] ERROR  cufio:1185 cuFileBufRegister error, object allocation failed
 11-10-2023 15:49:51:318 [pid=1502618 tid=1502854] ERROR  cufio:1236 cuFileBufRegister error internal error
 11-10-2023 15:49:51:323 [pid=1502618 tid=1502856] ERROR  0:1072 Inc-bar-usage failed: size 524288000 remaining bytes 281018368
 11-10-2023 15:49:51:323 [pid=1502618 tid=1502856] ERROR  0:410 update bar usage failed
 11-10-2023 15:49:51:323 [pid=1502618 tid=1502856] ERROR  cufio-obj:101 error allocating nvfs handle, size: 524288000
 11-10-2023 15:49:51:323 [pid=1502618 tid=1502856] ERROR  cufio:1185 cuFileBufRegister error, object allocation failed
 11-10-2023 15:49:51:323 [pid=1502618 tid=1502856] ERROR  cufio:1236 cuFileBufRegister error internal error
 11-10-2023 15:49:51:325 [pid=1502618 tid=1503338] ERROR  0:1072 Inc-bar-usage failed: size 524288000 remaining bytes 281018368
 11-10-2023 15:49:51:325 [pid=1502618 tid=1503338] ERROR  0:410 update bar usage failed
 11-10-2023 15:49:51:325 [pid=1502618 tid=1503338] ERROR  cufio-obj:101 error allocating nvfs handle, size: 524288000
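
Reading this log, each cuFileBufRegister call tries to pin a 500 MiB buffer (size 524288000) while only about 268 MiB of GPU BAR1 space remains, so 128 workers per GPU cannot all register their buffers. As a sketch to get the run going, rather than a tuning recommendation, shrink the per-IO size (and optionally the worker count) so the pinned buffers fit:

$ /usr/local/cuda-11.7/gds/tools/gdsio -D /mnt/lustre/gds -d 0 -w 32 -s 1G -i 16M -x 0 -I 3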

@karanveersingh5623

@wakaba-best, is there anything on the above that you can share?

@Murphy-AI

(Quotes @wakaba-best's full environment checklist from above.)

Recently I have been struggling to get NVMe supported. Could you please show me your installation process? Thanks!

@Sabiha1225

@Murphy-AI Can you tell me the steps for how you installed GDS? I am following the documentation, but it is not working.
