Testing ib_write_bw with --use_cuda and --use_cuda_dmabuf getting error "Couldn't allocate MR with error=93" #280

hassanbabaie · 2024-08-24T19:54:19Z

Hi, I'm testing RDMA over RoCEv2, and we're using dma-buf instead of nv-peer-mem. Memory registration is failing, and I'm unsure why or how to fix it.

I set up the test on Ubuntu Kubernetes Pods (based on the Nvidia NGC image) and installed:

# Also installed OFED drivers
apt-get update
apt install pciutils -y
apt install libpci-dev -y
git clone https://github.com/linux-rdma/perftest.git
cd perftest
./autogen.sh && ./configure CUDA_H_PATH=/usr/local/cuda/include/cuda.h && make -j
make install

Then I ran the following commands between them:

MLX5_SCATTER_TO_CQE=0  ./ib_write_bw -d mlx5_5 --use_cuda=3 -a -x 7 --report_gbits -F --use_cuda_dmabuf
MLX5_SCATTER_TO_CQE=0 ./ib_write_bw -d mlx5_1 --use_cuda=1 -a -x 7 --ipv6 172.16.126.66 --report_gbits -F --use_cuda_dmabuf

The output I got was:

/perftest# MLX5_SCATTER_TO_CQE=0 ./ib_write_bw -d mlx5_1 --use_cuda=1 -a -x 7 --ipv6 172.16.126.66 --report_gbits -F --use_cuda_dmabuf
Perftest doesn't supports CUDA tests with inline messages: inline size set to 0
initializing CUDA
Listing all CUDA devices in system:
CUDA device 0: PCIe address is 1B:00
CUDA device 1: PCIe address is 29:00
CUDA device 2: PCIe address is 45:00
CUDA device 3: PCIe address is 4E:00

Picking device No. 1
[pid = 13194, dev = 1] device name = [NVIDIA H100 80GB HBM3]
creating CUDA Ctx
making it the current CUDA Ctx
CUDA device integrated: 0
cuMemAlloc() of a 16777216 bytes GPU buffer
allocated GPU buffer address at 00007f4d26800000 pointer=0x7f4d26800000
using DMA-BUF for GPU buffer address at 0x7f4d26800000 aligned at 0x7f4d26800000 with aligned size 16777216
Calling ibv_reg_dmabuf_mr(offset=0, size=16777216, addr=0x7f4d26800000, fd=43) for QP #0
Couldn't allocate MR with error=93
failed to create mr
Failed to create MR
 Couldn't create IB resources
destroying current CUDA Ctx
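
For reference, the `error=93` in the output above looks like a raw errno value (an assumption based on the message format; the tool does not print the symbolic name). A minimal sketch for decoding it with the Python standard library, where `describe_errno` is a hypothetical helper name:

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic_name, message) for an errno value on this platform."""
    # errno.errorcode maps numeric values to names such as "EPROTONOSUPPORT".
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


# 93 is the value printed by ib_write_bw in the log above.
# Errno numbering is platform specific; on Linux, 93 is EPROTONOSUPPORT.
name, message = describe_errno(93)
print(f"error=93 -> {name}: {message}")
```

On Linux this reports EPROTONOSUPPORT ("Protocol not supported"), which would suggest that some layer of the stack is rejecting the dma-buf registration as unsupported rather than failing on permissions or memory. That reading is an inference from the errno alone, not something the tool states.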

Cuda info:

/perftest# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Jul_11_02:20:44_PDT_2023
Cuda compilation tools, release 12.2, V12.2.128
Build cuda_12.2.r12.2/compiler.33053471_0

Note that the non-CUDA run (the same test without the CUDA options) works fine.

Any thoughts or ideas would be appreciated.

sshaulnv · 2024-08-26T10:27:45Z

Can you please check whether the GPU driver is installed as the open kernel module?
Run: modinfo nvidia
and check the 'license' field (it should be 'Dual MIT/GPL').

hassanbabaie · 2024-08-26T17:54:37Z

@sshaulnv yes, it is the open driver.

However, we are retesting, and I will post an update here very soon along with the output of modinfo nvidia.

hassanbabaie · 2024-09-05T21:47:13Z

Hi @sshaulnv, yes. The output on the host is the following:

modinfo nvidia
filename:       /lib/modules/5.14.0-284.30.1.el9_2.x86_64/extra/nvidia.ko.xz
firmware:       nvidia/535.183.06/gsp_tu10x.bin
firmware:       nvidia/535.183.06/gsp_ga10x.bin
import_ns:      DMA_BUF
alias:          char-major-195-*
version:        535.183.06
supported:      external
license:        Dual MIT/GPL
rhelversion:    9.2
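
The license check requested earlier in the thread can be scripted against this output. The sketch below parses `modinfo`-style text and looks at the `license` and `import_ns` fields; the function names are hypothetical, and the sample text is copied from the output above. An `import_ns: DMA_BUF` entry only suggests that the module uses the kernel's dma-buf symbols; it does not by itself prove that dma-buf export works end to end.

```python
SAMPLE = """\
filename:       /lib/modules/5.14.0-284.30.1.el9_2.x86_64/extra/nvidia.ko.xz
firmware:       nvidia/535.183.06/gsp_tu10x.bin
firmware:       nvidia/535.183.06/gsp_ga10x.bin
import_ns:      DMA_BUF
alias:          char-major-195-*
version:        535.183.06
supported:      external
license:        Dual MIT/GPL
rhelversion:    9.2
"""


def parse_modinfo(text):
    """Parse 'key: value' lines into a dict of lists (keys can repeat)."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines without a 'key:' prefix
        fields.setdefault(key.strip(), []).append(value.strip())
    return fields


def looks_like_open_driver(fields):
    """True if the license matches the value sshaulnv said to expect."""
    return "Dual MIT/GPL" in fields.get("license", [])


def imports_dma_buf(fields):
    """True if the module declares the DMA_BUF symbol namespace."""
    return "DMA_BUF" in fields.get("import_ns", [])


fields = parse_modinfo(SAMPLE)
print("version:", fields["version"][0])
print("open driver license:", looks_like_open_driver(fields))
print("imports DMA_BUF namespace:", imports_dma_buf(fields))
```

To run it against a live host, the text returned by `modinfo nvidia` can be passed to `parse_modinfo` in place of `SAMPLE`.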

We disabled SELinux on the host, but no luck.

Is there any way to get a more verbose error?

Everything I have checked says that this should work.
