
GPUDirect RDMA Performance issue #9287

Closed

kzmymmt opened this issue Aug 14, 2023 · 8 comments

@kzmymmt
kzmymmt commented Aug 14, 2023

Describe the bug

I measured GPUDirect RDMA bandwidth with the OSU Micro-Benchmarks (osu_bw D D).
The bandwidth is significantly lower with UCX 1.15.0rc3 than with 1.13.1.

# OSU MPI-CUDA Bandwidth Test v7.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size      Bandwidth (MB/s)
# Datatype: MPI_CHAR.
1                       1.60
2                       3.21
4                       6.34
8                      10.81
16                     21.42
32                     42.94
64                     83.96
128                   166.28
256                   329.42
512                   629.98
1024                 1146.88
2048                 2149.59
4096                 3761.13
8192                 6421.73
16384                1484.76
32768                1868.04
65536                2345.00
131072               2879.16
262144               3526.87
524288               4084.24
1048576              4031.50
2097152              3895.92
4194304              3840.19

With UCX 1.13.1, the same test reached 24756.14 MB/s at size 4194304.
That was close to the line rate of the NIC (InfiniBand NDR200, 200 Gb/s) and was fine.
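For reference, a 200 Gb/s link corresponds to roughly 200 / 8 = 25,000 MB/s, so 24,756 MB/s is about 99% of line rate, while the 1.15.0rc3 peak above is only around 16% of it.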

Which logs should I collect to identify the problem?

Steps to Reproduce

  • Command line
mpirun -host host1,host2 -np 2 -npernode 1 ./osu_bw D D
  • UCX version used (from github branch XX or release YY) + UCX configure flags (can be checked by ucx_info -v)
# Library version: 1.15.0
# Library path: /home/user/privateinstall/ucx-1.15.0-rc3/nvhpc22.11-cuda11.8/lib/libucs.so.0
# API headers version: 1.15.0
# Git branch '', revision 
# Configured with: --prefix=/home/user/privateinstall/ucx-1.15.0-rc3/nvhpc22.11-cuda11.8 --with-gdrcopy=/usr/local --with-cuda=/system/apps/ubuntu/20.04-202304/nvhpc/22.11/Linux_x86_64/22.11/cuda --disable-optimizations --enable-logging --disable-debug --disable-assertions --enable-mt --disable-params-check --without-java --enable-cma --with-verbs --without-knem --with-rdmacm --without-rocm --without-xpmem --without-ugni
$ ucx_info -d | grep -e cuda -e gdr
# Memory domain: cuda_cpy
#     Component: cuda_cpy
#         memory types: host (reg), cuda (access,alloc,reg,cache,detect), cuda-managed (access,alloc,reg,cache,detect)
#      Transport: cuda_copy
#         Device: cuda
# Memory domain: cuda_ipc
#     Component: cuda_ipc
#         memory types: cuda (access,reg,cache)
#      Transport: cuda_ipc
#         Device: cuda
# Memory domain: gdr_copy
#     Component: gdr_copy
#         memory types: cuda (access,reg,cache)
#      Transport: gdr_copy
#         Device: cuda
#         memory types: host (access,reg,cache), cuda (reg,cache)

Setup and versions

  • OS version (e.g Linux distro) + CPU architecture (x86_64/aarch64/ppc64le/...)
    • Ubuntu 20.04.6 LTS, 5.4.0-144-generic
    • x86_64
  • For RDMA/IB/RoCE related issues:
    • Driver version:
      • MLNX_OFED_LINUX-23.04-0.5.3.3:
    • HW information from ibstat
CA 'mlx5_0'
        CA type: MT4129
        Number of ports: 1
        Firmware version: 28.37.1014
        Hardware version: 0
        Node GUID: 0xe8ebd303005197a6
        System image GUID: 0xe8ebd303005197a6
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 200
                Base lid: 44
                LMC: 0
                SM lid: 47
                Capability mask: 0xa651e848
                Port GUID: 0xe8ebd303005197a6
                Link layer: InfiniBand
  • For GPU related issues:
    • GPU type: NVIDIA H100 PCIe
    • Cuda:
      • Drivers version
      • Check if peer-direct is loaded:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 PCIe               On  | 00000000:16:00.0 Off |                    0 |
| N/A   26C    P0              46W / 350W |      4MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

nv_peer_mem            16384  0
nvidia              56430592  4 nvidia_uvm,nv_peer_mem,gdrdrv,nvidia_modeset
ib_core               348160  11 beegfs,rdma_cm,ib_ipoib,ko2iblnd,nv_peer_mem,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm

gdrdrv                 24576  0
nvidia              56430592  4 nvidia_uvm,nv_peer_mem,gdrdrv,nvidia_modeset

Additional information (depending on the issue)

  • OpenMPI version
  • Output of ucx_info -d to show transports and devices recognized by UCX
  • Configure result - config.log
  • Log file - configure UCX with "--enable-logging" - and run with "UCX_LOG_LEVEL=data"
@rakhmets
Collaborator

rakhmets commented Aug 16, 2023

Hi,

Based on the osu_bw output, I think UCX didn't switch from the eager to the rendezvous protocol.

Could you please do the following to find out the cause of the issue:

  • Re-build UCX, configuring with --enable-logging.
  • Collect a log with UCX_LOG_LEVEL=data and attach it to the issue.

Your command line should be changed to something like this:

mpirun -host host1,host2 -np 2 -npernode 1 -x UCX_LOG_LEVEL=data -x UCX_LOG_FILE=%h_%p.log ./osu_bw -i 1 -x 0 -m 4194304:4194304 D D

It should be enough to collect logs for one iteration only and for only one message size. So please use these osu_bw parameters: -i 1 -x 0 -m 4194304:4194304.
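Optionally, as a lighter-weight first check (just a sketch, not a substitute for the data-level logs above), running with UCX_LOG_LEVEL=info already prints which transports each endpoint configuration uses:

mpirun -host host1,host2 -np 2 -npernode 1 -x UCX_LOG_LEVEL=info ./osu_bw -i 1 -x 0 -m 4194304:4194304 D D

Look for INFO lines of the form inter-node cfg#N tag(...); they list the transports selected for inter-node traffic.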

@kzmymmt
Author

kzmymmt commented Aug 21, 2023

Thank you for the guidance.
I have attached the logs collected with UCX_LOG_LEVEL=data:
bnode002_1896907.log
bnode003_1235984.log

@rakhmets
Collaborator

rakhmets commented Aug 22, 2023

Thanks for the log files.
According to the logs, TCP was selected for inter-node communication instead of IB:

[1692367680.473056] [bnode003:1235984:0]          select.c:636  UCX  TRACE   ep 0x14bcf80580c0: selected for high-bw remote memory access: tcp/ibp106s0 md[1] -> '<no debug data>' address[1],md[1],rsc[255] score 5930.76
...
[1692367680.473084] [bnode003:1235984:0]          select.c:636  UCX  TRACE   ep 0x14bcf80580c0: selected for high-bw remote memory access: tcp/eno1 md[1] -> '<no debug data>' address[2],md[1],rsc[255] score 3626.66
...
[1692367680.489738] [bnode003:1235984:0]      ucp_worker.c:1855 UCX  INFO    0x2d65420 inter-node cfg#3 tag(rc_mlx5/mlx5_0:1 tcp/ibp106s0 tcp/eno1)

There are a couple of PRs related to this issue; however, those changes are already in the release branch.
I'll come back as soon as I understand what else is missing in the release branch for filtering out TCP when IB is available.

It would also help if you could confirm whether the issue reproduces with the master branch.
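As a possible interim workaround until the release branch is sorted out (untested on this setup, so treat it only as a suggestion), TCP can be excluded from the transport list, since UCX_TLS accepts a ^-prefixed exclusion list:

mpirun -host host1,host2 -np 2 -npernode 1 -x UCX_TLS=^tcp ./osu_bw D D

With ^tcp, every available transport except TCP remains eligible, so rc_mlx5 and the CUDA transports should still be selected.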

@ivankochin
Contributor

@kzmymmt please share the output of the following commands:

  1. nvidia-smi topo -m
  2. lscpu

@kzmymmt
Author

kzmymmt commented Aug 27, 2023

I will check the master branch later.

@ivankochin
The results of the two commands are below.

$ nvidia-smi topo -m
        GPU0    NIC0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      SYS     0-47            N/A             N/A
NIC0    SYS      X                       

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0

$ lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 57 bits virtual
CPU(s):                          48
On-line CPU(s) list:             0-47
Thread(s) per core:              1
Core(s) per socket:              48
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           143
Model name:                      Intel(R) Xeon(R) Platinum 8468
Stepping:                        8
Frequency boost:                 enabled
CPU MHz:                         2850.813
CPU max MHz:                     2101.0000
CPU min MHz:                     800.0000
BogoMIPS:                        4200.00
Virtualization:                  VT-x
L1d cache:                       2.3 MiB
L1i cache:                       1.5 MiB
L2 cache:                        96 MiB
L3 cache:                        105 MiB
NUMA node0 CPU(s):               0-47
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Not affected
Vulnerability Retbleed:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts 
                                 acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art ar
                                 ch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_f
                                 req pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdc
                                 m pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c r
                                 drand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 invpcid_single 
                                 cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid 
                                 ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512
                                 dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx5
                                 12vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_loca
                                 l avx512_bf16 wbnoinvd dtherm ida arat pln pts avx512vbmi umip pku ospke waitpkg avx
                                 512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq rdpid 
                                 cldemote movdiri movdir64b md_clear pconfig flush_l1d arch_capabilities

@kzmymmt
Author

kzmymmt commented Aug 28, 2023

@rakhmets
I tried the master branch at commit 5f00157 and got good results.
Is there anything else I should try?

# OSU MPI-CUDA Bandwidth Test v7.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size      Bandwidth (MB/s)
# Datatype: MPI_CHAR.
1                       2.55
2                       5.11
4                      10.29
8                      20.49
16                     41.14
32                     81.88
64                    145.21
128                   289.50
256                   551.32
512                  1014.81
1024                 1526.49
2048                 3085.97
4096                 6242.26
8192                 2133.20
16384                4252.78
32768                8684.41
65536               16772.52
131072              22168.71
262144              23288.10
524288              23849.57
1048576             24261.52
2097152             24526.07
4194304             22769.60

@shamisp
Contributor

shamisp commented Aug 29, 2023

Can it be closed?

@kzmymmt
Author

kzmymmt commented Aug 31, 2023

I'll keep checking this in future releases as well.
Thank you very much.
