Testing ib_write_bw with --use_cuda and --use_cuda_dmabuf getting error "Couldn't allocate MR with error=93" #280

Open
hassanbabaie opened this issue Aug 24, 2024 · 3 comments

Comments

hassanbabaie commented Aug 24, 2024

Hi, I'm testing RDMA over RoCEv2 using dma-buf instead of nv-peer-mem. The test fails, and I'm unsure why or how to fix it.

I set up the test on Ubuntu Kubernetes pods (based on the NVIDIA NGC image) and installed:

# Also installed OFED drivers
apt-get update
apt install pciutils -y
apt install libpci-dev -y
git clone https://github.com/linux-rdma/perftest.git
cd perftest
./autogen.sh && ./configure CUDA_H_PATH=/usr/local/cuda/include/cuda.h && make -j
make install

Then I ran the following commands, the first on the server pod and the second on the client pod:

MLX5_SCATTER_TO_CQE=0 ./ib_write_bw -d mlx5_5 --use_cuda=3 -a -x 7 --report_gbits -F --use_cuda_dmabuf
MLX5_SCATTER_TO_CQE=0 ./ib_write_bw -d mlx5_1 --use_cuda=1 -a -x 7 --ipv6 172.16.126.66 --report_gbits -F --use_cuda_dmabuf

The output I got was:

/perftest# MLX5_SCATTER_TO_CQE=0 ./ib_write_bw -d mlx5_1 --use_cuda=1 -a -x 7 --ipv6 172.16.126.66 --report_gbits -F --use_cuda_dmabuf
Perftest doesn't supports CUDA tests with inline messages: inline size set to 0
initializing CUDA
Listing all CUDA devices in system:
CUDA device 0: PCIe address is 1B:00
CUDA device 1: PCIe address is 29:00
CUDA device 2: PCIe address is 45:00
CUDA device 3: PCIe address is 4E:00

Picking device No. 1
[pid = 13194, dev = 1] device name = [NVIDIA H100 80GB HBM3]
creating CUDA Ctx
making it the current CUDA Ctx
CUDA device integrated: 0
cuMemAlloc() of a 16777216 bytes GPU buffer
allocated GPU buffer address at 00007f4d26800000 pointer=0x7f4d26800000
using DMA-BUF for GPU buffer address at 0x7f4d26800000 aligned at 0x7f4d26800000 with aligned size 16777216
Calling ibv_reg_dmabuf_mr(offset=0, size=16777216, addr=0x7f4d26800000, fd=43) for QP #0
Couldn't allocate MR with error=93
failed to create mr
Failed to create MR
 Couldn't create IB resources
destroying current CUDA Ctx
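
As a reading aid for the log above, here is a minimal sketch for decoding the number in "Couldn't allocate MR with error=93". It assumes that perftest prints the raw Linux errno at that point, which this thread does not confirm. The helper names errno_name and mr_error_code are hypothetical, and the lookup table is a small hand-written subset of the Linux asm-generic errno values, not an exhaustive list.

```shell
#!/bin/sh
# Hypothetical helper: map a few Linux errno numbers (asm-generic values)
# to their names and standard descriptions. Illustrative subset only.
errno_name() {
  case "$1" in
    12) echo "ENOMEM (Cannot allocate memory)" ;;
    14) echo "EFAULT (Bad address)" ;;
    22) echo "EINVAL (Invalid argument)" ;;
    93) echo "EPROTONOSUPPORT (Protocol not supported)" ;;
    95) echo "EOPNOTSUPP (Operation not supported)" ;;
    *)  echo "unknown errno $1" ;;
  esac
}

# Hypothetical helper: read perftest output on stdin and print the number
# from the first "Couldn't allocate MR with error=N" line, if any.
mr_error_code() {
  sed -n 's/.*allocate MR with error=\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Example (not executed here): decode the error from a saved log file.
#   errno_name "$(mr_error_code < perftest.log)"
```

Under that assumption, error=93 corresponds to EPROTONOSUPPORT on Linux. This only names the error; it does not by itself say which layer (kernel, RDMA driver, or GPU driver) rejected the dma-buf registration.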

CUDA info:

/perftest# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Jul_11_02:20:44_PDT_2023
Cuda compilation tools, release 12.2, V12.2.128
Build cuda_12.2.r12.2/compiler.33053471_0

Note that the same test without the CUDA options works fine.

Any thoughts or ideas would be appreciated.

@sshaulnv
Contributor

Can you please check whether the GPU driver is installed as the open kernel module?
Run: modinfo nvidia
and check the 'license' field (it should be 'Dual MIT/GPL').
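
The check suggested above can be scripted. The following is a minimal sketch that reads modinfo output on stdin and tests the license field. The function names nvidia_license and is_open_kernel_module are hypothetical, and the expected value 'Dual MIT/GPL' is taken from this comment rather than from driver documentation.

```shell
#!/bin/sh
# Hypothetical helper: print the value of the first "license:" line
# from `modinfo` output supplied on stdin.
nvidia_license() {
  sed -n 's/^license:[[:space:]]*//p' | head -n 1
}

# Hypothetical helper: succeed (exit status 0) only if the license field
# matches the value this thread associates with the open kernel module.
is_open_kernel_module() {
  [ "$(nvidia_license)" = "Dual MIT/GPL" ]
}

# Example (not executed here; needs the nvidia module on the host):
#   modinfo nvidia | is_open_kernel_module && echo "open kernel module"
```

For a quick manual check, modinfo -F license nvidia prints only the license field.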

@hassanbabaie
Author

@sshaulnv yes, it is the open driver.

However, we are retesting, and I will post an update here very soon, along with the output of modinfo nvidia.

@hassanbabaie
Author

hassanbabaie commented Sep 5, 2024

Hi @sshaulnv, yes, the output on the host is the following:

modinfo nvidia
filename:       /lib/modules/5.14.0-284.30.1.el9_2.x86_64/extra/nvidia.ko.xz
firmware:       nvidia/535.183.06/gsp_tu10x.bin
firmware:       nvidia/535.183.06/gsp_ga10x.bin
import_ns:      DMA_BUF
alias:          char-major-195-*
version:        535.183.06
supported:      external
license:        Dual MIT/GPL
rhelversion:    9.2

We disabled SELinux on the host, but no luck.

Is there any way to get a more verbose error?

Everything I have checked says that this should work.
