couldn't allocate MR while test GDR with cuda. #269

derekwin · 2024-06-22T05:39:41Z

system info

Ubuntu 22.04
kernel: 6.5.0-28-generic

nvidia driver and cuda version:

Driver Version: 555.42.02
CUDA Version: 12.5

I installed the RDMA OFED driver before installing the CUDA driver and CUDA toolkit.

peermem module status:

nvidia_peermem         16384  0
nvidia_uvm           4943872  0
nvidia_drm            122880  0
nvidia_modeset       1368064  1 nvidia_drm
nvidia              54566912  3 nvidia_uvm,nvidia_peermem,nvidia_modeset
video                  73728  1 nvidia_modeset
ib_core               557056  9 rdma_cm,ib_ipoib,nvidia_peermem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
drm_kms_helper        274432  4 ast,nvidia_drm
drm                   765952  6 drm_kms_helper,ast,drm_shmem_helper,nvidia,nvidia_drm

Error occurred:
./ib_send_bw --use_cuda=0

Perftest doesn't supports CUDA tests with inline messages: inline size set to 0

************************************
* Waiting for client to connect... *
************************************
initializing CUDA
Listing all CUDA devices in system:
CUDA device 0: PCIe address is 1B:00
CUDA device 1: PCIe address is 3E:00
CUDA device 2: PCIe address is 89:00
CUDA device 3: PCIe address is B2:00

Picking device No. 0
[pid = 3164333, dev = 0] device name = [NVIDIA GeForce RTX 4090]
creating CUDA Ctx
making it the current CUDA Ctx
CUDA device integrated: 0
cuMemAlloc() of a 131072 bytes GPU buffer
allocated GPU buffer address at 00007c43eac00000 pointer=0x7c43eac00000
Couldn't allocate MR
failed to create mr
Failed to create MR
 Couldn't create IB resources
destroying current CUDA Ctx

./ib_send_bw --use_cuda=0 192.168.2.244

Perftest doesn't supports CUDA tests with inline messages: inline size set to 0
initializing CUDA
Listing all CUDA devices in system:
CUDA device 0: PCIe address is 1B:00
CUDA device 1: PCIe address is 3E:00
CUDA device 2: PCIe address is 89:00
CUDA device 3: PCIe address is B2:00

Picking device No. 0
[pid = 3164350, dev = 0] device name = [NVIDIA GeForce RTX 4090]
creating CUDA Ctx
making it the current CUDA Ctx
CUDA device integrated: 0
cuMemAlloc() of a 131072 bytes GPU buffer
allocated GPU buffer address at 00007847fac00000 pointer=0x7847fac00000
Couldn't allocate MR
failed to create mr
Failed to create MR
 Couldn't create IB resources
destroying current CUDA Ctx

derekwin · 2024-06-23T04:57:10Z

Sorry, I didn't notice this suggestion:
8. If GPUDirect is not working, (e.g. you see "Couldn't allocate MR" error message), consider disabling Scatter to CQE feature. Set the environmental variable MLX5_SCATTER_TO_CQE=0. E.g.:
MLX5_SCATTER_TO_CQE=0 ./ib_write_bw -d ib_dev --use_cuda= -a

derekwin · 2024-06-23T05:04:02Z

After setting MLX5_SCATTER_TO_CQE=0, the problem still exists.

Jye-525 · 2024-07-10T01:38:11Z

@derekwin Did you solve this problem? I encountered the same error now.

derekwin · 2024-07-10T02:58:19Z

> @derekwin Did you solve this problem? I encountered the same error now.

I still have this error. :(

Jye-525 · 2024-07-10T03:04:09Z

@derekwin I just solved it on my side. I was testing across 2 nodes but had loaded nvidia-peermem on only one of them. Loading nvidia-peermem on all nodes solved it. Here is the command:
sudo modprobe nvidia-peermem

You can use lsmod|grep nvidia_peermem to check if it is loaded. Hope it works for you too.
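
The check above can be wrapped in a small helper and run on every node before starting the test. This is only a sketch, assuming a POSIX shell and the standard Linux /proc/modules listing. The function name module_loaded and its optional second argument (an alternative listing file) are illustrative and not part of perftest or the NVIDIA driver.

```shell
#!/bin/sh
# Sketch: succeed (exit status 0) if a kernel module is listed as loaded.
# module_loaded and its optional file argument are hypothetical helpers.
module_loaded() {
    module="$1"
    modules_file="${2:-/proc/modules}"
    # The kernel lists module names with underscores, so nvidia-peermem
    # appears as nvidia_peermem. Normalize before comparing.
    name=$(printf '%s\n' "$module" | sed 's/-/_/g')
    # Match only the first field (the module name), not the dependency
    # lists that appear later on each line.
    awk -v m="$name" '$1 == m { found = 1 } END { exit !found }' "$modules_file"
}

# Example use, to be run on each node taking part in the test:
#   module_loaded nvidia-peermem || echo "nvidia-peermem is not loaded; try: sudo modprobe nvidia-peermem"
```

Matching on the first field avoids a false positive from lines where a module is only named as a dependency of another one, which a plain grep over the lsmod output would also report.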

YuMJie · 2024-07-18T14:37:00Z

If only one of the two machines supports GDR and the other has a consumer-grade graphics card, how should GDR bandwidth be tested?

derekwin closed this as completed Jun 23, 2024

derekwin reopened this Jun 23, 2024
