Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ucx_perftest crash in uct_ib_mlx5_completion_with_err #7863

Closed
cgorac opened this issue Jan 21, 2022 · 16 comments
Closed

ucx_perftest crash in uct_ib_mlx5_completion_with_err #7863

cgorac opened this issue Jan 21, 2022 · 16 comments
Labels
Milestone

Comments

@cgorac
Copy link

cgorac commented Jan 21, 2022

I have MLNX_OFED installed on a couple of RH 7.9 machines, with ConnectX-5 (MT27800 Family) adapters. Machines also have 4 V100 GPUs, and have gdrcopy and nvidia_peermem (tried with nv_peer_memory instead, with the same outcome) drivers loaded. The UCX version is 1.12.0, I tried with both the version that is pre-built and delivered along with MLNX_OFED, and one that I've built from source (using the same flags as MLNX_OFED one is reporting through ucx_info -b). In both cases, when I try to run the ucx_perftest to measure the GPUDirect RDMA bandwidth, by running:

ucx_perftest -t tag_bw -m cuda -s 16536 -n 1000 -p 9999

on machine node1 and then:

ucx_perftest node1 -t tag_bw -m cuda -s 16536 -n 1000 -p 9999

on machine node2, a crash in ucx_perftest would occur, with following printed out:

[node2:40754:0:40754] ib_mlx5_log.c:168  Local protection on mlx5_1:1/IB (synd 0x4 vend 0x51 hw_synd 0/2)
[node2:40754:0:40754] ib_mlx5_log.c:168  RC QP 0x31a wqe[2]: SEND s-- [inl len 26] [va 0x7f22bfa00000 len 8230 lkey 0x42323] [rqpn 0x3c2 dlid=26 sl=0 port=1 src_path_bits=0]
==== backtrace (tid:  40754) ====
 0 0x0000000000025183 uct_ib_mlx5_completion_with_err()  /tmp/ucx-1.12.0/src/uct/ib/mlx5/ib_mlx5_log.c:162
 1 0x0000000000039b87 uct_ib_mlx5_poll_cq()  /tmp/ucx-1.12.0/src/uct/ib/mlx5/ib_mlx5.inl:91
 2 0x0000000000039b87 uct_rc_mlx5_iface_progress()  /tmp/ucx-1.12.0/src/uct/ib/rc/accel/rc_mlx5_iface.c:173
 3 0x0000000000039b87 uct_rc_mlx5_iface_progress_cyclic()  /tmp/ucx-1.12.0/src/uct/ib/rc/accel/rc_mlx5_iface.c:178
 4 0x000000000003b4da ucs_callbackq_dispatch()  /tmp/ucx-1.12.0/src/ucs/datastruct/callbackq.h:211
 5 0x000000000003b4da uct_worker_progress()  /tmp/ucx-1.12.0/src/uct/api/uct.h:2589
 6 0x000000000003b4da ucp_worker_progress()  /tmp/ucx-1.12.0/src/ucp/core/ucp_worker.c:2629
 7 0x00000000004608aa ucp_perf_test_runner<(ucx_perf_cmd_t)7, (ucx_perf_test_type_t)2, 0u>::progress()  /tmp/ucx-1.12.0/src/tools/perf/lib/ucp_tests.cc:219
 8 0x00000000004608aa ucp_perf_test_runner<(ucx_perf_cmd_t)7, (ucx_perf_test_type_t)2, 0u>::send()  /tmp/ucx-1.12.0/src/tools/perf/lib/ucp_tests.cc:388
 9 0x00000000004608aa ucp_perf_test_runner<(ucx_perf_cmd_t)7, (ucx_perf_test_type_t)2, 0u>::run_stream_uni()  /tmp/ucx-1.12.0/src/tools/perf/lib/ucp_tests.cc:763
10 0x0000000000455248 ucp_perf_test_runner<(ucx_perf_cmd_t)7, (ucx_perf_test_type_t)2, 0u>::run()  /tmp/ucx-1.12.0/src/tools/perf/lib/ucp_tests.cc:790
11 0x0000000000455248 ucp_perf_test_dispatch()  /tmp/ucx-1.12.0/src/tools/perf/lib/ucp_tests.cc:935
12 0x0000000000408860 ucx_perf_run()  /tmp/ucx-1.12.0/src/tools/perf/lib/libperf.c:1695
13 0x000000000040581d run_test_recurs()  /tmp/ucx-1.12.0/src/tools/perf/perftest_run.c:273
14 0x0000000000405eda run_test()  /tmp/ucx-1.12.0/src/tools/perf/perftest_run.c:332
15 0x00000000004046cc main()  /tmp/ucx-1.12.0/src/tools/perf/perftest.c:975
16 0x00000000004046cc main()  /tmp/ucx-1.12.0/src/tools/perf/perftest.c:984
17 0x0000000000022555 __libc_start_main()  ???:0
18 0x0000000000404c9b _start()  ???:0
=================================
(END)

Because "protection" is mentioned in the output above, I've tried with running both ucx_perftest instances as root user, but the same happens. I've also tried with changing various UCX related environment variables, like setting UCX_NET_DEVICES, or using UCX_MEMTYPE_CACHE=0, but the outcome is always the same. Of course, if I set UCX_IB_GPU_DIRECT_RDMA=no then it works (also it works if I use -m host instead of -m cuda in ucx_perftest command line). Tried too with UCX 1.11.2 built from source, the same happens.

Tried also every RDMA test listed for example here, and everything works fine. So my question is - any hint what else to try to fix using GPUDirect RDMA from UCX on these machines?

Here is some additional info about my setup:
$ uname -ior
3.10.0-1160.49.1.el7.x86_64 x86_64 GNU/Linux

$ rpm -q rdma-core
rdma-core-55mlnx37-1.55103.x86_64

$ rpm -q libibverbs
libibverbs-55mlnx37-1.55103.x86_64

$ ofed_info -s
MLNX_OFED_LINUX-5.5-1.0.3.2:

$ ibstat
CA 'mlx5_0'
        CA type: MT4119
        Number of ports: 1
        Firmware version: 16.32.1010
        Hardware version: 0
        Node GUID: 0x98039b030033f86a
        System image GUID: 0x98039b030033f86a
        Port 1:
                State: Down
                Physical state: Disabled
                Rate: 10
                Base lid: 65535
                LMC: 0
                SM lid: 0
                Capability mask: 0x2651e848
                Port GUID: 0x98039b030033f86a
                Link layer: InfiniBand
CA 'mlx5_1'
        CA type: MT4119
        Number of ports: 1
        Firmware version: 16.32.1010
        Hardware version: 0
        Node GUID: 0x98039b030033f86b
        System image GUID: 0x98039b030033f86a
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 56
                Base lid: 26
                LMC: 0
                SM lid: 1
                Capability mask: 0x2651e848
                Port GUID: 0x98039b030033f86b
                Link layer: InfiniBand

$ lspci -vvv | grep NVIDIA
61:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
        Subsystem: NVIDIA Corporation Device 1249
62:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
        Subsystem: NVIDIA Corporation Device 1249
89:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
        Subsystem: NVIDIA Corporation Device 1249
8a:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
        Subsystem: NVIDIA Corporation Device 1249

$ nvidia-smi | sed -n 3p
| NVIDIA-SMI 495.29.05    Driver Version: 495.29.05    CUDA Version: 11.5     |

$ lsmod | grep nvidia_peermem
nvidia_peermem         13163  0 
ib_core               363078  9 rdma_cm,ib_cm,iw_cm,mlx5_ib,ib_umad,nvidia_peermem,ib_uverbs,rdma_ucm,ib_ipoib
nvidia              36856397  55 gdrdrv,nvidia_modeset,nvidia_peermem,nvidia_uvm

$ lsmod | grep gdr
gdrdrv                 18183  0 
nvidia              36856397  55 gdrdrv,nvidia_modeset,nvidia_peermem,nvidia_uvm

$ ucx_info -d
#
# Memory domain: posix
#     Component: posix
#             allocate: <= 197331396K
#           remote key: 24 bytes
#           rkey_ptr is supported
#
#      Transport: posix
#         Device: memory
#           Type: intra-node
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 12179.00 MB/sec
#              latency: 80 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 100
#             am_bcopy: <= 8256
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 8 bytes
#       error handling: ep_check
#
#
# Memory domain: sysv
#     Component: sysv
#             allocate: unlimited
#           remote key: 12 bytes
#           rkey_ptr is supported
#
#      Transport: sysv
#         Device: memory
#           Type: intra-node
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 12179.00 MB/sec
#              latency: 80 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 100
#             am_bcopy: <= 8256
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 8 bytes
#       error handling: ep_check
#
#
# Memory domain: self
#     Component: self
#             register: unlimited, cost: 0 nsec
#           remote key: 0 bytes
#
#      Transport: self
#         Device: memory0
#           Type: loopback
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 6911.00 MB/sec
#              latency: 0 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 8K
#             am_bcopy: <= 8K
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 0 bytes
#        iface address: 8 bytes
#       error handling: ep_check
#
#
# Memory domain: tcp
#     Component: tcp
#             register: unlimited, cost: 0 nsec
#           remote key: 0 bytes
#
#      Transport: tcp
#         Device: lo
#           Type: network
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 11.91/ppn + 0.00 MB/sec
#              latency: 10960 nsec
#             overhead: 50000 nsec
#            put_zcopy: <= 18446744073709551590, up to 6 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 0
#             am_short: <= 8K
#             am_bcopy: <= 8K
#             am_zcopy: <= 64K, up to 6 iov
#   am_opt_zcopy_align: <= 1
#         am_align_mtu: <= 0
#            am header: <= 8037
#           connection: to ep, to iface
#      device priority: 1
#     device num paths: 1
#              max eps: 256
#       device address: 18 bytes
#        iface address: 2 bytes
#           ep address: 10 bytes
#       error handling: peer failure, ep_check, keepalive
#
#      Transport: tcp
#         Device: ib1
#           Type: network
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 6239.81/ppn + 0.00 MB/sec
#              latency: 5210 nsec
#             overhead: 50000 nsec
#            put_zcopy: <= 18446744073709551590, up to 6 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 0
#             am_short: <= 8K
#             am_bcopy: <= 8K
#             am_zcopy: <= 64K, up to 6 iov
#   am_opt_zcopy_align: <= 1
#         am_align_mtu: <= 0
#            am header: <= 8037
#           connection: to ep, to iface
#      device priority: 1
#     device num paths: 1
#              max eps: 256
#       device address: 6 bytes
#        iface address: 2 bytes
#           ep address: 10 bytes
#       error handling: peer failure, ep_check, keepalive
#
#      Transport: tcp
#         Device: enp1s0f0
#           Type: network
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 1131.64/ppn + 0.00 MB/sec
#              latency: 5258 nsec
#             overhead: 50000 nsec
#            put_zcopy: <= 18446744073709551590, up to 6 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 0
#             am_short: <= 8K
#             am_bcopy: <= 8K
#             am_zcopy: <= 64K, up to 6 iov
#   am_opt_zcopy_align: <= 1
#         am_align_mtu: <= 0
#            am header: <= 8037
#           connection: to ep, to iface
#      device priority: 0
#     device num paths: 1
#              max eps: 256
#       device address: 6 bytes
#        iface address: 2 bytes
#           ep address: 10 bytes
#       error handling: peer failure, ep_check, keepalive
#
#
# Connection manager: tcp
#      max_conn_priv: 2064 bytes
#
# Memory domain: cuda_cpy
#     Component: cuda_cpy
#             allocate: unlimited
#             register: unlimited, cost: 0 nsec
#
#      Transport: cuda_copy
#         Device: cuda
#           Type: accelerator
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 10000.00/ppn + 0.00 MB/sec
#              latency: 8000 nsec
#             overhead: 0 nsec
#            put_short: <= 4294967295
#            put_zcopy: unlimited, up to 1 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 1
#            get_short: <= 4294967295
#            get_zcopy: unlimited, up to 1 iov
#  get_opt_zcopy_align: <= 1
#        get_align_mtu: <= 1
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 0 bytes
#        iface address: 8 bytes
#       error handling: none
#
#
# Memory domain: cuda_ipc
#     Component: cuda_ipc
#             register: unlimited, cost: 0 nsec
#           remote key: 112 bytes
#
#      Transport: cuda_ipc
#         Device: cuda
#           Type: intra-node
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 250000.00/ppn + 0.00 MB/sec
#              latency: 1 nsec
#             overhead: 0 nsec
#            put_zcopy: unlimited, up to 1 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 1
#            get_zcopy: unlimited, up to 1 iov
#  get_opt_zcopy_align: <= 1
#        get_align_mtu: <= 1
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 4 bytes
#       error handling: peer failure, ep_check
#
#
# Memory domain: gdr_copy
#     Component: gdr_copy
#             register: unlimited, cost: 0 nsec
#           remote key: 24 bytes
#
#      Transport: gdr_copy
#         Device: cuda
#           Type: accelerator
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 6911.00/ppn + 0.00 MB/sec
#              latency: 1000 nsec
#             overhead: 0 nsec
#            put_short: <= 4294967295
#            get_short: <= 4294967295
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 0 bytes
#        iface address: 8 bytes
#       error handling: none
#
#
# Memory domain: mlx5_0
#     Component: ib
#             register: unlimited, cost: 180 nsec
#           remote key: 8 bytes
#           local memory handle is required for zcopy
#   < no supported devices found >
#
# Memory domain: mlx5_1
#     Component: ib
#             register: unlimited, cost: 180 nsec
#           remote key: 8 bytes
#           local memory handle is required for zcopy
#
#      Transport: rc_verbs
#         Device: mlx5_1:1
#           Type: network
#  System device: mlx5_1 (4)
#
#      capabilities:
#            bandwidth: 6433.22/ppn + 0.00 MB/sec
#              latency: 700 + 1.000 * N nsec
#             overhead: 75 nsec
#            put_short: <= 124
#            put_bcopy: <= 8256
#            put_zcopy: <= 1G, up to 5 iov
#  put_opt_zcopy_align: <= 512
#        put_align_mtu: <= 4K
#            get_bcopy: <= 8256
#            get_zcopy: 65..1G, up to 5 iov
#  get_opt_zcopy_align: <= 512
#        get_align_mtu: <= 4K
#             am_short: <= 123
#             am_bcopy: <= 8255
#             am_zcopy: <= 8255, up to 4 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 127
#               domain: device
#           atomic_add: 64 bit
#          atomic_fadd: 64 bit
#         atomic_cswap: 64 bit
#           connection: to ep
#      device priority: 38
#     device num paths: 1
#              max eps: 256
#       device address: 3 bytes
#           ep address: 5 bytes
#       error handling: peer failure, ep_check
#
#
#      Transport: rc_mlx5
#         Device: mlx5_1:1
#           Type: network
#  System device: mlx5_1 (4)
#
#      capabilities:
#            bandwidth: 6433.22/ppn + 0.00 MB/sec
#              latency: 700 + 1.000 * N nsec
#             overhead: 40 nsec
#            put_short: <= 2K
#            put_bcopy: <= 8256
#            put_zcopy: <= 1G, up to 14 iov
#  put_opt_zcopy_align: <= 512
#        put_align_mtu: <= 4K
#            get_bcopy: <= 8256
#            get_zcopy: 65..1G, up to 14 iov
#  get_opt_zcopy_align: <= 512
#        get_align_mtu: <= 4K
#             am_short: <= 2046
#             am_bcopy: <= 8254
#             am_zcopy: <= 8254, up to 3 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 186
#               domain: device
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to ep
#      device priority: 38
#     device num paths: 1
#              max eps: 256
#       device address: 3 bytes
#           ep address: 7 bytes
#       error handling: buffer (zcopy), remote access, peer failure, ep_check
#
#
#      Transport: dc_mlx5
#         Device: mlx5_1:1
#           Type: network
#  System device: mlx5_1 (4)
#
#      capabilities:
#            bandwidth: 6433.22/ppn + 0.00 MB/sec
#              latency: 760 nsec
#             overhead: 40 nsec
#            put_short: <= 2K
#            put_bcopy: <= 8256
#            put_zcopy: <= 1G, up to 11 iov
#  put_opt_zcopy_align: <= 512
#        put_align_mtu: <= 4K
#            get_bcopy: <= 8256
#            get_zcopy: 65..1G, up to 11 iov
#  get_opt_zcopy_align: <= 512
#        get_align_mtu: <= 4K
#             am_short: <= 2046
#             am_bcopy: <= 8254
#             am_zcopy: <= 8254, up to 3 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 138
#               domain: device
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 38
#     device num paths: 1
#              max eps: inf
#       device address: 3 bytes
#        iface address: 5 bytes
#       error handling: buffer (zcopy), remote access, peer failure, ep_check
#
#
#      Transport: ud_verbs
#         Device: mlx5_1:1
#           Type: network
#  System device: mlx5_1 (4)
#
#      capabilities:
#            bandwidth: 6433.22/ppn + 0.00 MB/sec
#              latency: 730 nsec
#             overhead: 105 nsec
#             am_short: <= 116
#             am_bcopy: <= 4088
#             am_zcopy: <= 4088, up to 5 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 3952
#           connection: to ep, to iface
#      device priority: 38
#     device num paths: 1
#              max eps: inf
#       device address: 3 bytes
#        iface address: 3 bytes
#           ep address: 6 bytes
#       error handling: peer failure, ep_check
#
#
#      Transport: ud_mlx5
#         Device: mlx5_1:1
#           Type: network
#  System device: mlx5_1 (4)
#
#      capabilities:
#            bandwidth: 6433.22/ppn + 0.00 MB/sec
#              latency: 730 nsec
#             overhead: 80 nsec
#             am_short: <= 180
#             am_bcopy: <= 4088
#             am_zcopy: <= 4088, up to 3 iov
#   am_opt_zcopy_align: <= 512
#         am_align_mtu: <= 4K
#            am header: <= 132
#           connection: to ep, to iface
#      device priority: 38
#     device num paths: 1
#              max eps: inf
#       device address: 3 bytes
#        iface address: 3 bytes
#           ep address: 6 bytes
#       error handling: peer failure, ep_check
#
#
# Connection manager: rdmacm
#      max_conn_priv: 54 bytes
#
# Memory domain: cma
#     Component: cma
#             register: unlimited, cost: 9 nsec
#
#      Transport: cma
#         Device: memory
#           Type: intra-node
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 11145.00 MB/sec
#              latency: 80 nsec
#             overhead: 2000 nsec
#            put_zcopy: unlimited, up to 16 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 1
#            get_zcopy: unlimited, up to 16 iov
#  get_opt_zcopy_align: <= 1
#        get_align_mtu: <= 1
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 4 bytes
#       error handling: peer failure, ep_check
#
#
# Memory domain: knem
#     Component: knem
#             register: unlimited, cost: 18446744073709551616000000000 nsec
#           remote key: 16 bytes
#
#      Transport: knem
#         Device: memory
#           Type: intra-node
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 13862.00/ppn + 0.00 MB/sec
#              latency: 80 nsec
#             overhead: 2000 nsec
#            put_zcopy: unlimited, up to 16 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 1
#            get_zcopy: unlimited, up to 16 iov
#  get_opt_zcopy_align: <= 1
#        get_align_mtu: <= 1
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 0 bytes
#       error handling: none
#
#
# Memory domain: xpmem
#     Component: xpmem
#             register: unlimited, cost: 60 nsec
#           remote key: 24 bytes
#           rkey_ptr is supported
#
#      Transport: xpmem
#         Device: memory
#           Type: intra-node
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 12179.00 MB/sec
#              latency: 80 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 100
#             am_bcopy: <= 8256
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 16 bytes
#       error handling: none
#
Finally, here is the output of crashing `ucx_perftest` run, but this time with `UCX_LOG_LEVEL=debug`:
[1642768357.761365] [node2:42970:0]           debug.c:1211 UCX  DEBUG using signal stack 0x7f8de578d000 size 141824
[1642768357.762515] [node2:42970:0]            init.c:116  UCX  DEBUG /opt/tools/ucx/1.12.0/lib/libucs.so.0 loaded at 0x7f8de4d31000
[1642768357.762541] [node2:42970:0]            init.c:117  UCX  DEBUG cmd line: /opt/tools/ucx/1.12.0/bin/ucx_perftest tre -t tag_bw -m cuda -s 16536 -n 1000 -p 9999 
[1642768357.762554] [node2:42970:0]          module.c:69   UCX  DEBUG ucs library path: /opt/tools/ucx/1.12.0/lib/libucs.so.0
[1642768357.762559] [node2:42970:0]          module.c:273  UCX  DEBUG loading modules for ucs
[1642768357.762592] [node2:42970:0]          module.c:273  UCX  DEBUG loading modules for ucx_perftest
[1642768357.765504] [node2:42970:0]        perftest.c:900  UCX  WARN  CPU affinity is not set (bound to 20 cpus). Performance may be impacted.
+--------------+--------------+------------------------------+---------------------+-----------------------+
|              |              |       overhead (usec)        |   bandwidth (MB/s)  |  message rate (msg/s) |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
|    Stage     | # iterations | 50.0%ile | average | overall |  average |  overall |  average  |  overall  |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
[1642768357.768808] [node2:42970:0]         libperf.c:215  UCX  DEBUG set allocator by send mem type cuda
[1642768357.769567] [node2:42970:0]            time.c:22   UCX  DEBUG measured arch clock speed: 2200000000.00 Hz
[1642768357.918690] [node2:42970:0]     ucp_context.c:1779 UCX  INFO  UCP version is 1.12 (release 0)
[1642768357.918703] [node2:42970:0]     ucp_context.c:1567 UCX  DEBUG estimated number of endpoints is 1
[1642768357.918705] [node2:42970:0]     ucp_context.c:1574 UCX  DEBUG estimated number of endpoints per node is 1
[1642768357.918715] [node2:42970:0]     ucp_context.c:1581 UCX  DEBUG estimated bcopy bandwidth is 6081740800.000000
[1642768357.918730] [node2:42970:0]     ucp_context.c:1647 UCX  DEBUG allocation method[0] is md 'sysv'
[1642768357.918732] [node2:42970:0]     ucp_context.c:1647 UCX  DEBUG allocation method[1] is md 'posix'
[1642768357.918739] [node2:42970:0]     ucp_context.c:1659 UCX  DEBUG allocation method[2] is 'huge'
[1642768357.918741] [node2:42970:0]     ucp_context.c:1659 UCX  DEBUG allocation method[3] is 'thp'
[1642768357.918743] [node2:42970:0]     ucp_context.c:1647 UCX  DEBUG allocation method[4] is md '*'
[1642768357.918745] [node2:42970:0]     ucp_context.c:1659 UCX  DEBUG allocation method[5] is 'mmap'
[1642768357.918746] [node2:42970:0]     ucp_context.c:1659 UCX  DEBUG allocation method[6] is 'heap'
[1642768357.918778] [node2:42970:0]          module.c:273  UCX  DEBUG loading modules for uct
[1642768357.919639] [node2:42970:0]          module.c:273  UCX  DEBUG loading modules for uct_cuda
[1642768357.920317] [node2:42970:0]          module.c:162  UCX  DEBUG ignoring 'ucs_module_global_init' (0x7f8dd3d31530) from libuct_cuda.so.0 (0x7f8dd3d2b000), expected in libuct_cuda_gdrcopy.so.0 (7f8dd3b25000)
[1642768357.924016] [node2:42970:0]            topo.c:133  UCX  DEBUG bus id 0x610000 doesn't exist. sys_dev = 0
[1642768357.924025] [node2:42970:0]            topo.c:133  UCX  DEBUG bus id 0x620000 doesn't exist. sys_dev = 1
[1642768357.924028] [node2:42970:0]            topo.c:133  UCX  DEBUG bus id 0x890000 doesn't exist. sys_dev = 2
[1642768357.924032] [node2:42970:0]            topo.c:133  UCX  DEBUG bus id 0x8a0000 doesn't exist. sys_dev = 3
[1642768357.924036] [node2:42970:0]            topo.c:124  UCX  DEBUG bus id 0x610000 exists. sys_dev = 0
[1642768357.924038] [node2:42970:0]            topo.c:124  UCX  DEBUG bus id 0x620000 exists. sys_dev = 1
[1642768357.924040] [node2:42970:0]            topo.c:124  UCX  DEBUG bus id 0x890000 exists. sys_dev = 2
[1642768357.924041] [node2:42970:0]            topo.c:124  UCX  DEBUG bus id 0x8a0000 exists. sys_dev = 3
[1642768357.924398] [node2:42970:0]            topo.c:124  UCX  DEBUG bus id 0x610000 exists. sys_dev = 0
[1642768357.924401] [node2:42970:0]            topo.c:124  UCX  DEBUG bus id 0x620000 exists. sys_dev = 1
[1642768357.924402] [node2:42970:0]            topo.c:124  UCX  DEBUG bus id 0x890000 exists. sys_dev = 2
[1642768357.924404] [node2:42970:0]            topo.c:124  UCX  DEBUG bus id 0x8a0000 exists. sys_dev = 3
[1642768357.924414] [node2:42970:0]          module.c:273  UCX  DEBUG loading modules for uct_ib
[1642768357.932457] [node2:42970:0]            topo.c:124  UCX  DEBUG bus id 0x610000 exists. sys_dev = 0
[1642768357.932462] [node2:42970:0]            topo.c:124  UCX  DEBUG bus id 0x620000 exists. sys_dev = 1
[1642768357.932465] [node2:42970:0]            topo.c:124  UCX  DEBUG bus id 0x890000 exists. sys_dev = 2
[1642768357.932466] [node2:42970:0]            topo.c:124  UCX  DEBUG bus id 0x8a0000 exists. sys_dev = 3
[1642768357.932512] [node2:42970:0]            topo.c:124  UCX  DEBUG bus id 0x620000 exists. sys_dev = 1
[1642768357.932518] [node2:42970:0]            topo.c:124  UCX  DEBUG bus id 0x610000 exists. sys_dev = 0
[1642768357.932520] [node2:42970:0]            topo.c:124  UCX  DEBUG bus id 0x620000 exists. sys_dev = 1
[1642768357.932521] [node2:42970:0]            topo.c:124  UCX  DEBUG bus id 0x890000 exists. sys_dev = 2
[1642768357.932523] [node2:42970:0]            topo.c:124  UCX  DEBUG bus id 0x8a0000 exists. sys_dev = 3
[1642768357.932574] [node2:42970:0]            topo.c:124  UCX  DEBUG bus id 0x620000 exists. sys_dev = 1
[1642768357.932589] [node2:42970:0]            topo.c:124  UCX  DEBUG bus id 0x610000 exists. sys_dev = 0
[1642768357.932591] [node2:42970:0]            topo.c:124  UCX  DEBUG bus id 0x620000 exists. sys_dev = 1
[1642768357.932592] [node2:42970:0]            topo.c:124  UCX  DEBUG bus id 0x890000 exists. sys_dev = 2
[1642768357.932594] [node2:42970:0]            topo.c:124  UCX  DEBUG bus id 0x8a0000 exists. sys_dev = 3
[1642768357.932633] [node2:42970:0]           mpool.c:100  UCX  DEBUG mpool rcache_mp: align 8, maxelems 4294967295, elemsize 144
[1642768357.935976] [node2:42970:0]           async.c:231  UCX  DEBUG added async handler 0x15e5750 [id=21 ref 1] ucs_rcache_invalidate_handler() to hash
[1642768357.936062] [node2:42970:0]           async.c:509  UCX  DEBUG listening to async event fd 21 events 0x1 mode thread_spinlock
[1642768357.936086] [node2:42970:0]          module.c:273  UCX  DEBUG loading modules for ucm
[1642768357.936828] [node2:42970:0]            topo.c:124  UCX  DEBUG bus id 0x620000 exists. sys_dev = 1
[1642768357.945342] [node2:42970:0]       ib_device.c:554  UCX  DEBUG PF: mlx5_0 vendor_id: 0x15b3 device_id: 4119
[1642768357.945684] [node2:42970:0]    ib_mlx5dv_md.c:491  UCX  DEBUG mlx5_0: disable ODP because it's not supported for DevX QP
[1642768357.948524] [node2:42970:0]           async.c:231  UCX  DEBUG added async handler 0x15ef860 [id=27 ref 1] uct_ib_async_event_handler() to hash
[1642768357.948541] [node2:42970:0]           async.c:509  UCX  DEBUG listening to async event fd 27 events 0x1 mode thread_spinlock
[1642768357.948552] [node2:42970:0]       ib_device.c:668  UCX  DEBUG initialized device 'mlx5_0' (InfiniBand channel adapter) with 1 ports
[1642768357.948825] [node2:42970:0]           ib_md.c:1673 UCX  DEBUG mlx5_0: cuda GPUDirect RDMA is enabled
[1642768357.948834] [node2:42970:0]           ib_md.c:1673 UCX  DEBUG mlx5_0: rocm GPUDirect RDMA is disabled
[1642768357.948849] [node2:42970:0]           mpool.c:100  UCX  DEBUG mpool rcache_mp: align 8, maxelems 4294967295, elemsize 144
[1642768357.949002] [node2:42970:0]           ib_md.c:1330 UCX  DEBUG mlx5_0: using registration cache
[1642768357.949110] [node2:42970:0]           mpool.c:100  UCX  DEBUG mpool devx dbrec: align 64, maxelems 4294967295, elemsize 40
[1642768357.949447] [node2:42970:0]           ib_md.c:1621 UCX  DEBUG mlx5_0: md open by 'uct_ib_mlx5_devx_md_ops' is successful
[1642768357.952000] [node2:42970:0]       ib_device.c:1185 UCX  DEBUG no compatible IB ports found for flags 0x0
[1642768357.952007] [node2:42970:0]          uct_md.c:113  UCX  DEBUG failed to query rc_verbs resources: No such device
[1642768357.952012] [node2:42970:0]       ib_device.c:1185 UCX  DEBUG no compatible IB ports found for flags 0x4
[1642768357.952014] [node2:42970:0]          uct_md.c:113  UCX  DEBUG failed to query rc_mlx5 resources: No such device
[1642768357.952018] [node2:42970:0]       ib_device.c:1185 UCX  DEBUG no compatible IB ports found for flags 0xc4
[1642768357.952020] [node2:42970:0]          uct_md.c:113  UCX  DEBUG failed to query dc_mlx5 resources: No such device
[1642768357.952024] [node2:42970:0]       ib_device.c:1185 UCX  DEBUG no compatible IB ports found for flags 0x0
[1642768357.952026] [node2:42970:0]          uct_md.c:113  UCX  DEBUG failed to query ud_verbs resources: No such device
[1642768357.952030] [node2:42970:0]       ib_device.c:1185 UCX  DEBUG no compatible IB ports found for flags 0x4
[1642768357.952032] [node2:42970:0]          uct_md.c:113  UCX  DEBUG failed to query ud_mlx5 resources: No such device
[1642768357.952034] [node2:42970:0]     ucp_context.c:891  UCX  DEBUG No tl resources found for md mlx5_0
[1642768357.952037] [node2:42970:0]     ucp_context.c:1305 UCX  DEBUG closing md mlx5_0 because it has no selected transport resources
[1642768357.952202] [node2:42970:0]           mpool.c:154  UCX  DEBUG mpool devx dbrec destroyed
[1642768357.952221] [node2:42970:0]           mpool.c:154  UCX  DEBUG mpool rcache_mp destroyed
[1642768357.952329] [node2:42970:0]       ib_device.c:686  UCX  DEBUG destroying ib device mlx5_0
[1642768357.952335] [node2:42970:0]           async.c:156  UCX  DEBUG removed async handler 0x15ef860 [id=27 ref 1] uct_ib_async_event_handler() from hash
[1642768357.952338] [node2:42970:0]           async.c:562  UCX  DEBUG removing async handler 0x15ef860 [id=27 ref 1] uct_ib_async_event_handler()
[1642768357.952344] [node2:42970:0]           async.c:171  UCX  DEBUG release async handler 0x15ef860 [id=27 ref 0] uct_ib_async_event_handler()
[1642768357.956792] [node2:42970:0]       ib_device.c:554  UCX  DEBUG PF: mlx5_1 vendor_id: 0x15b3 device_id: 4119
[1642768357.957064] [node2:42970:0]    ib_mlx5dv_md.c:491  UCX  DEBUG mlx5_1: disable ODP because it's not supported for DevX QP
[1642768357.957276] [node2:42970:0]           async.c:231  UCX  DEBUG added async handler 0x15e80f0 [id=27 ref 1] uct_ib_async_event_handler() to hash
[1642768357.957287] [node2:42970:0]           async.c:509  UCX  DEBUG listening to async event fd 27 events 0x1 mode thread_spinlock
[1642768357.957290] [node2:42970:0]       ib_device.c:668  UCX  DEBUG initialized device 'mlx5_1' (InfiniBand channel adapter) with 1 ports
[1642768357.957430] [node2:42970:0]           ib_md.c:1673 UCX  DEBUG mlx5_1: cuda GPUDirect RDMA is enabled
[1642768357.957436] [node2:42970:0]           ib_md.c:1673 UCX  DEBUG mlx5_1: rocm GPUDirect RDMA is disabled
[1642768357.957442] [node2:42970:0]           mpool.c:100  UCX  DEBUG mpool rcache_mp: align 8, maxelems 4294967295, elemsize 144
[1642768357.957534] [node2:42970:0]           ib_md.c:1330 UCX  DEBUG mlx5_1: using registration cache
[1642768357.957613] [node2:42970:0]           mpool.c:100  UCX  DEBUG mpool devx dbrec: align 64, maxelems 4294967295, elemsize 40
[1642768357.957826] [node2:42970:0]           ib_md.c:1621 UCX  DEBUG mlx5_1: md open by 'uct_ib_mlx5_devx_md_ops' is successful
[1642768357.959161] [node2:42970:0]            topo.c:133  UCX  DEBUG bus id 0x600001 doesn't exist. sys_dev = 4
[1642768357.959167] [node2:42970:0]       ib_device.c:1140 UCX  DEBUG mlx5_1 bus id 0:96:0.1 sys_dev 4
[1642768357.959276] [node2:42970:0]            topo.c:124  UCX  DEBUG bus id 0x600001 exists. sys_dev = 4
[1642768357.959279] [node2:42970:0]       ib_device.c:1140 UCX  DEBUG mlx5_1 bus id 0:96:0.1 sys_dev 4
[1642768357.959385] [node2:42970:0]            topo.c:124  UCX  DEBUG bus id 0x600001 exists. sys_dev = 4
[1642768357.959388] [node2:42970:0]       ib_device.c:1140 UCX  DEBUG mlx5_1 bus id 0:96:0.1 sys_dev 4
[1642768357.959493] [node2:42970:0]            topo.c:124  UCX  DEBUG bus id 0x600001 exists. sys_dev = 4
[1642768357.959496] [node2:42970:0]       ib_device.c:1140 UCX  DEBUG mlx5_1 bus id 0:96:0.1 sys_dev 4
[1642768357.959601] [node2:42970:0]            topo.c:124  UCX  DEBUG bus id 0x600001 exists. sys_dev = 4
[1642768357.959603] [node2:42970:0]       ib_device.c:1140 UCX  DEBUG mlx5_1 bus id 0:96:0.1 sys_dev 4
[1642768357.959701] [node2:42970:0]           mpool.c:100  UCX  DEBUG mpool rcache_mp: align 8, maxelems 4294967295, elemsize 144
[1642768357.959862] [node2:42970:0]     ucp_context.c:1819 UCX  DEBUG created ucp context 0x15de090 0x15de090 [10 mds 16 tls] features 0x1 tl bitmap 0xffff 0x0
[1642768357.959887] [node2:42970:0]          ucp_mm.c:317  UCX  DEBUG allocating user memory at (nil) length 16536 of cuda type
[1642768357.959905] [node2:42970:0]         uct_mem.c:293  UCX  DEBUG   could not allocate memory with any of the provided methods
[1642768357.959907] [node2:42970:0]         uct_mem.c:293  UCX  DEBUG   could not allocate memory with any of the provided methods
[1642768358.084787] [node2:42970:0]          ucp_mm.c:247  UCX  DEBUG allocated memory at 0x7f8db3a00000 with method md, now registering it
[1642768358.085797] [node2:42970:0]            topo.c:124  UCX  DEBUG bus id 0x620000 exists. sys_dev = 1
[1642768358.086750] [node2:42970:0]           mpool.c:237  UCX  DEBUG mpool rcache_mp: allocated chunk 0x7f8de5613008 of 151544 bytes with 1052 elements
[1642768358.088961] [node2:42970:0]          ucp_mm.c:343  UCX  DEBUG allocated buffer 0x7f8db3a00000 length 16536 type cuda memh 0x15f0c10 md_map 0xf0
[1642768358.088976] [node2:42970:0]          ucp_mm.c:317  UCX  DEBUG allocating user memory at (nil) length 16536 of cuda type
[1642768358.089001] [node2:42970:0]         uct_mem.c:293  UCX  DEBUG   could not allocate memory with any of the provided methods
[1642768358.089003] [node2:42970:0]         uct_mem.c:293  UCX  DEBUG   could not allocate memory with any of the provided methods
[1642768358.089036] [node2:42970:0]          ucp_mm.c:247  UCX  DEBUG allocated memory at 0x7f8db3a04200 with method md, now registering it
[1642768358.089041] [node2:42970:0]            topo.c:124  UCX  DEBUG bus id 0x620000 exists. sys_dev = 1
[1642768358.090672] [node2:42970:0]          ucp_mm.c:343  UCX  DEBUG allocated buffer 0x7f8db3a04200 length 16536 type cuda memh 0x1e91630 md_map 0xf0
[1642768358.091733] [node2:42970:0]        mm_posix.c:531  UCX  DEBUG   allocated posix shared memory at 0x7f8de5610000 length 12288
[1642768358.092399] [node2:42970:0]           mpool.c:100  UCX  DEBUG mpool mm_recv_desc: align 64, maxelems 4294967295, elemsize 8368
[1642768358.094726] [node2:42970:0]        mm_posix.c:326  UCX  DEBUG   shared memory mmap(addr=(nil), length=6291456, flags= HUGETLB, fd=45) failed: Invalid argument
[1642768358.094736] [node2:42970:0]        mm_posix.c:531  UCX  DEBUG   allocated posix shared memory at 0x7f8dd0f34000 length 4296704
[1642768358.094745] [node2:42970:0]           mpool.c:237  UCX  DEBUG mpool mm_recv_desc: allocated chunk 0x7f8dd0f34018 of 4296680 bytes with 512 elements
[1642768358.095669] [node2:42970:0]        mm_iface.c:674  UCX  DEBUG created mm iface 0x1eb9730 FIFO id 0xc000000ac000a7da va 0x7f8de5610000 size 12288 (128 x 64 elems)
[1642768358.095719] [node2:42970:0]      ucp_worker.c:1202 UCX  DEBUG created interface[0]=0x1eb9730 using posix/memory on worker 0x1e96320
[1642768358.095751] [node2:42970:0]         mm_sysv.c:94   UCX  DEBUG   mm failed to allocate 8447 bytes with hugetlb
[1642768358.095814] [node2:42970:0]           mpool.c:100  UCX  DEBUG mpool mm_recv_desc: align 64, maxelems 4294967295, elemsize 8368
[1642768358.096724] [node2:42970:0]         mm_sysv.c:94   UCX  DEBUG   mm failed to allocate 4292720 bytes with hugetlb
[1642768358.096744] [node2:42970:0]           mpool.c:237  UCX  DEBUG mpool mm_recv_desc: allocated chunk 0x7f8dd0b1b018 of 4296680 bytes with 512 elements
[1642768358.098540] [node2:42970:0]        mm_iface.c:674  UCX  DEBUG created mm iface 0x1eba3f0 FIFO id 0x130009 va 0x7f8de560d000 size 12288 (128 x 64 elems)
[1642768358.098553] [node2:42970:0]      ucp_worker.c:1202 UCX  DEBUG created interface[1]=0x1eba3f0 using sysv/memory on worker 0x1e96320
[1642768358.098572] [node2:42970:0]           mpool.c:100  UCX  DEBUG mpool self_msg_desc: align 64, maxelems 4294967295, elemsize 8200
[1642768358.098575] [node2:42970:0]            self.c:222  UCX  DEBUG created self iface id 0x3e84df64a047a1a5 send_size 8192
[1642768358.098580] [node2:42970:0]      ucp_worker.c:1202 UCX  DEBUG created interface[2]=0x1ebace0 using self/memory0 on worker 0x1e96320
[1642768358.098605] [node2:42970:0]       tcp_iface.c:587  UCX  DEBUG using TCP port range: 0-0
[1642768358.098607] [node2:42970:0]           mpool.c:100  UCX  DEBUG mpool uct_tcp_iface_tx_buf_mp: align 64, maxelems 4294967295, elemsize 8205
[1642768358.098610] [node2:42970:0]           mpool.c:100  UCX  DEBUG mpool uct_tcp_iface_rx_buf_mp: align 64, maxelems 4294967295, elemsize 131090
[1642768358.102443] [node2:42970:0]           async.c:231  UCX  DEBUG added async handler 0x1eb95a0 [id=48 ref 1] uct_tcp_iface_connect_handler() to hash
[1642768358.102465] [node2:42970:0]           async.c:509  UCX  DEBUG listening to async event fd 48 events 0x5 mode thread_spinlock
[1642768358.102479] [node2:42970:0]       tcp_iface.c:537  UCX  DEBUG tcp_iface 0x1ebb7f0: listening for connections (fd=48) on 127.0.0.1:43378
[1642768358.102521] [node2:42970:0]            sock.c:90   UCX  DEBUG ioctl(req=35142, ifr_name=lo) failed: Operation not supported
[1642768358.102536] [node2:42970:0]         tcp_net.c:61   UCX  DEBUG speed of lo is UNKNOWN, assuming 100 Mbps
[1642768358.102660] [node2:42970:0]      ucp_worker.c:1202 UCX  DEBUG created interface[3]=0x1ebb7f0 using tcp/lo on worker 0x1e96320
[1642768358.102682] [node2:42970:0]       tcp_iface.c:587  UCX  DEBUG using TCP port range: 0-0
[1642768358.102686] [node2:42970:0]           mpool.c:100  UCX  DEBUG mpool uct_tcp_iface_tx_buf_mp: align 64, maxelems 4294967295, elemsize 8205
[1642768358.102689] [node2:42970:0]           mpool.c:100  UCX  DEBUG mpool uct_tcp_iface_rx_buf_mp: align 64, maxelems 4294967295, elemsize 131090
[1642768358.103277] [node2:42970:0]           async.c:231  UCX  DEBUG added async handler 0x1ebcaf0 [id=50 ref 1] uct_tcp_iface_connect_handler() to hash
[1642768358.103289] [node2:42970:0]           async.c:509  UCX  DEBUG listening to async event fd 50 events 0x5 mode thread_spinlock
[1642768358.103294] [node2:42970:0]       tcp_iface.c:537  UCX  DEBUG tcp_iface 0x1ebc280: listening for connections (fd=50) on 10.42.3.65:51151
[1642768358.103647] [node2:42970:0]      ucp_worker.c:1202 UCX  DEBUG created interface[4]=0x1ebc280 using tcp/ib1 on worker 0x1e96320
[1642768358.103676] [node2:42970:0]       tcp_iface.c:587  UCX  DEBUG using TCP port range: 0-0
[1642768358.103680] [node2:42970:0]           mpool.c:100  UCX  DEBUG mpool uct_tcp_iface_tx_buf_mp: align 64, maxelems 4294967295, elemsize 8205
[1642768358.103683] [node2:42970:0]           mpool.c:100  UCX  DEBUG mpool uct_tcp_iface_rx_buf_mp: align 64, maxelems 4294967295, elemsize 131090
[1642768358.104306] [node2:42970:0]           async.c:231  UCX  DEBUG added async handler 0x1ebd7f0 [id=52 ref 1] uct_tcp_iface_connect_handler() to hash
[1642768358.104319] [node2:42970:0]           async.c:509  UCX  DEBUG listening to async event fd 52 events 0x5 mode thread_spinlock
[1642768358.104324] [node2:42970:0]       tcp_iface.c:537  UCX  DEBUG tcp_iface 0x1ebd040: listening for connections (fd=52) on 10.42.1.204:56380
[1642768358.104435] [node2:42970:0]      ucp_worker.c:1202 UCX  DEBUG created interface[5]=0x1ebd040 using tcp/enp1s0f0 on worker 0x1e96320
[1642768358.104462] [node2:42970:0]           mpool.c:100  UCX  DEBUG mpool CUDA EVENT objects: align 64, maxelems 4294967295, elemsize 32
[1642768358.104472] [node2:42970:0]      ucp_worker.c:1202 UCX  DEBUG created interface[6]=0x1ebdc00 using cuda_copy/cuda on worker 0x1e96320
[1642768358.104484] [node2:42970:0]           mpool.c:100  UCX  DEBUG mpool CUDA_IPC EVENT objects: align 64, maxelems 4294967295, elemsize 72
[1642768358.116653] [node2:42970:0]      ucp_worker.c:1202 UCX  DEBUG created interface[7]=0x1ebe850 using cuda_ipc/cuda on worker 0x1e96320
[1642768358.116683] [node2:42970:0]      ucp_worker.c:1202 UCX  DEBUG created interface[8]=0x1ebf4f0 using gdr_copy/cuda on worker 0x1e96320
[1642768358.117039] [node2:42970:0]        ib_iface.c:866  UCX  DEBUG using pkey[0] 0xffff on mlx5_1:1/IB
[1642768358.118132] [node2:42970:0]        ib_iface.c:1473 UCX  DEBUG created uct_ib_iface_t headroom_ofs 12 payload_ofs 92 hdr_ofs 91 data_sz 8256
[1642768358.118173] [node2:42970:0]           mpool.c:100  UCX  DEBUG mpool rc_recv_desc: align 64, maxelems 4294967295, elemsize 8356
[1642768358.118176] [node2:42970:0]           mpool.c:100  UCX  DEBUG mpool rc_send_desc: align 64, maxelems 4294967295, elemsize 8320
[1642768358.118205] [node2:42970:0]           mpool.c:100  UCX  DEBUG mpool send-ops-mpool: align 64, maxelems 4294967295, elemsize 48
[1642768358.119051] [node2:42970:0]           mpool.c:100  UCX  DEBUG mpool pending-ops: align 1, maxelems 4294967295, elemsize 64
[1642768358.119056] [node2:42970:0]           mpool.c:100  UCX  DEBUG mpool rc_verbs_short_desc: align 64, maxelems 4294967295, elemsize 192
[1642768358.119664] [node2:42970:0]        ib_iface.c:1008 UCX  DEBUG iface=0x1ec0280: created RC QP 0x348 on mlx5_1:1 TX wr:409 sge:5 inl:124 resp:64 RX wr:0 sge:0 resp:64
[1642768358.120806] [node2:42970:0]      ucp_worker.c:1202 UCX  DEBUG created interface[9]=0x1ec0280 using rc_verbs/mlx5_1:1 on worker 0x1e96320
[1642768358.121009] [node2:42970:0]        ib_iface.c:866  UCX  DEBUG using pkey[0] 0xffff on mlx5_1:1/IB
[1642768358.121052] [node2:42970:0]       ib_device.c:1409 UCX  DEBUG max IB CQE size is 128
[1642768358.122993] [node2:42970:0]        ib_iface.c:1473 UCX  DEBUG created uct_ib_iface_t headroom_ofs 12 payload_ofs 92 hdr_ofs 90 data_sz 8256
[1642768358.123032] [node2:42970:0]           mpool.c:100  UCX  DEBUG mpool rc_recv_desc: align 64, maxelems 4294967295, elemsize 8356
[1642768358.123040] [node2:42970:0]           mpool.c:100  UCX  DEBUG mpool rc_send_desc: align 64, maxelems 4294967295, elemsize 8320
[1642768358.123370] [node2:42970:0]           mpool.c:100  UCX  DEBUG mpool send-ops-mpool: align 64, maxelems 4294967295, elemsize 48
[1642768358.123866] [node2:42970:0]           mpool.c:237  UCX  DEBUG mpool devx dbrec: allocated chunk 0x21e3010 of 8176 bytes with 127 elements
[1642768358.124122] [node2:42970:0]           mpool.c:100  UCX  DEBUG mpool pending-ops: align 1, maxelems 4294967295, elemsize 64
[1642768358.124280] [node2:42970:0]         ib_mlx5.c:889  UCX  DEBUG SL=0 (AR support - no) was selected on mlx5_1:1, SLs with AR support = { <none> }, SLs without AR support = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 }
[1642768358.125021] [node2:42970:0]           mpool.c:100  UCX  DEBUG mpool mlx5_dm_desc: align 64, maxelems 1, elemsize 64
[1642768358.125029] [node2:42970:0]           mpool.c:100  UCX  DEBUG mpool rc_mlx5_atomic_desc: align 64, maxelems 4294967295, elemsize 72
[1642768358.128550] [node2:42970:0]           async.c:231  UCX  DEBUG added async handler 0x1fb4860 [id=55 ref 1] uct_rc_mlx5_devx_iface_event_handler() to hash
[1642768358.128567] [node2:42970:0]           async.c:509  UCX  DEBUG listening to async event fd 55 events 0x1 mode thread_spinlock
[1642768358.128754] [node2:42970:0]      ucp_worker.c:1202 UCX  DEBUG created interface[10]=0x200f040 using rc_mlx5/mlx5_1:1 on worker 0x1e96320
[1642768358.128972] [node2:42970:0]        ib_iface.c:866  UCX  DEBUG using pkey[0] 0xffff on mlx5_1:1/IB
[1642768358.129819] [node2:42970:0]        ib_iface.c:1473 UCX  DEBUG created uct_ib_iface_t headroom_ofs 12 payload_ofs 92 hdr_ofs 90 data_sz 8256
[1642768358.129846] [node2:42970:0]           mpool.c:100  UCX  DEBUG mpool rc_recv_desc: align 64, maxelems 4294967295, elemsize 8356
[1642768358.129849] [node2:42970:0]           mpool.c:100  UCX  DEBUG mpool rc_send_desc: align 64, maxelems 4294967295, elemsize 8320
[1642768358.129881] [node2:42970:0]           mpool.c:100  UCX  DEBUG mpool send-ops-mpool: align 64, maxelems 4294967295, elemsize 48
[1642768358.130280] [node2:42970:0]           mpool.c:100  UCX  DEBUG mpool pending-ops: align 1, maxelems 4294967295, elemsize 112
[1642768358.130409] [node2:42970:0]         ib_mlx5.c:889  UCX  DEBUG SL=0 (AR support - no) was selected on mlx5_1:1, SLs with AR support = { <none> }, SLs without AR support = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 }
[1642768358.130472] [node2:42970:0]           mpool.c:100  UCX  DEBUG mpool rc_mlx5_atomic_desc: align 64, maxelems 4294967295, elemsize 72
[1642768358.130487] [node2:42970:0]           async.c:231  UCX  DEBUG added async handler 0x1f4bd10 [id=57 ref 1] uct_rc_mlx5_devx_iface_event_handler() to hash
[1642768358.130497] [node2:42970:0]           async.c:509  UCX  DEBUG listening to async event fd 57 events 0x1 mode thread_spinlock
[1642768358.131282] [node2:42970:0]         dc_mlx5.c:836  UCX  DEBUG creating dci pool 0 with 8 QPs
[1642768358.141226] [node2:42970:0]         dc_mlx5.c:1387 UCX  DEBUG dc iface 0x21e5050: using 'dcs_quota' policy with 8 dcis and 4608 cqes, dct 0x1547
[1642768358.151836] [node2:42970:0]           mpool.c:237  UCX  DEBUG mpool rc_recv_desc: allocated chunk 0x7f8da1800018 of 39845864 bytes with 4752 elements
[1642768358.152157] [node2:42970:0]         dc_mlx5.c:1403 UCX  DEBUG created dc iface 0x21e5050
[1642768358.152393] [node2:42970:0]      ucp_worker.c:1202 UCX  DEBUG created interface[11]=0x21e5050 using dc_mlx5/mlx5_1:1 on worker 0x1e96320
[1642768358.152741] [node2:42970:0]        ib_iface.c:866  UCX  DEBUG using pkey[0] 0xffff on mlx5_1:1/IB
[1642768358.154480] [node2:42970:0]        ib_iface.c:1473 UCX  DEBUG created uct_ib_iface_t headroom_ofs 12 payload_ofs 92 hdr_ofs 44 data_sz 4096
[1642768358.155634] [node2:42970:0]        ib_iface.c:1008 UCX  DEBUG iface=0x23a0220: created UD QP 0x1550 on mlx5_1:1 TX wr:341 sge:6 inl:124 resp:0 RX wr:4096 sge:1 resp:0
[1642768358.156873] [node2:42970:0]           mpool.c:100  UCX  DEBUG mpool ud_recv_skb: align 64, maxelems 4294967295, elemsize 4196
[1642768358.157265] [node2:42970:0]           mpool.c:237  UCX  DEBUG mpool ud_recv_skb: allocated chunk 0x7f8dd0a96018 of 544744 bytes with 128 elements
[1642768358.157276] [node2:42970:0]           mpool.c:100  UCX  DEBUG mpool ud_tx_skb: align 64, maxelems 4294967295, elemsize 4168
[1642768358.157345] [node2:42970:0]        ud_iface.c:421  UCX  DEBUG iface 0x23a0220: adding gid fe80::9803:9b03:33:f763 to hash on device mlx5_1 port 1 index 0)
[1642768358.157380] [node2:42970:0]        ud_iface.c:421  UCX  DEBUG iface 0x23a0220: adding gid fe80:: to hash on device mlx5_1 port 1 index 1)
[1642768358.157407] [node2:42970:0]        ud_iface.c:421  UCX  DEBUG iface 0x23a0220: adding gid fe80:: to hash on device mlx5_1 port 1 index 2)
[1642768358.157433] [node2:42970:0]        ud_iface.c:421  UCX  DEBUG iface 0x23a0220: adding gid fe80:: to hash on device mlx5_1 port 1 index 3)
[1642768358.157459] [node2:42970:0]        ud_iface.c:421  UCX  DEBUG iface 0x23a0220: adding gid fe80:: to hash on device mlx5_1 port 1 index 4)
[1642768358.157485] [node2:42970:0]        ud_iface.c:421  UCX  DEBUG iface 0x23a0220: adding gid fe80:: to hash on device mlx5_1 port 1 index 5)
[1642768358.157510] [node2:42970:0]        ud_iface.c:421  UCX  DEBUG iface 0x23a0220: adding gid fe80:: to hash on device mlx5_1 port 1 index 6)
[1642768358.157536] [node2:42970:0]        ud_iface.c:421  UCX  DEBUG iface 0x23a0220: adding gid fe80:: to hash on device mlx5_1 port 1 index 7)
[1642768358.157850] [node2:42970:0]     timer_wheel.c:41   UCX  DEBUG high res timer created log=23 resolution=3813.003636 usec wanted: 2500.000000 usec
[1642768358.161475] [node2:42970:0]           async.c:231  UCX  DEBUG added async handler 0x23f4590 [id=58 ref 1] uct_ud_iface_async_handler() to hash
[1642768358.161496] [node2:42970:0]           async.c:509  UCX  DEBUG listening to async event fd 58 events 0x5 mode thread_spinlock
[1642768358.161679] [node2:42970:0]      ucp_worker.c:1202 UCX  DEBUG created interface[12]=0x23a0220 using ud_verbs/mlx5_1:1 on worker 0x1e96320
[1642768358.161900] [node2:42970:0]        ib_iface.c:866  UCX  DEBUG using pkey[0] 0xffff on mlx5_1:1/IB
[1642768358.162976] [node2:42970:0]        ib_iface.c:1473 UCX  DEBUG created uct_ib_iface_t headroom_ofs 12 payload_ofs 92 hdr_ofs 44 data_sz 4096
[1642768358.163820] [node2:42970:0]        ib_iface.c:1008 UCX  DEBUG iface=0x24df050: created UD QP 0x1551 on mlx5_1:1 TX wr:341 sge:6 inl:124 resp:0 RX wr:4096 sge:1 resp:0
[1642768358.163826] [node2:42970:0]         ib_mlx5.c:568  UCX  DEBUG tx wq 65536 bytes [bb=64, nwqe=1024] mmio_mode bf_post
[1642768358.164453] [node2:42970:0]           mpool.c:100  UCX  DEBUG mpool ud_recv_skb: align 64, maxelems 4294967295, elemsize 4196
[1642768358.164829] [node2:42970:0]           mpool.c:237  UCX  DEBUG mpool ud_recv_skb: allocated chunk 0x7f8dd0a11018 of 544744 bytes with 128 elements
[1642768358.164839] [node2:42970:0]           mpool.c:100  UCX  DEBUG mpool ud_tx_skb: align 64, maxelems 4294967295, elemsize 4168
[1642768358.164889] [node2:42970:0]        ud_iface.c:421  UCX  DEBUG iface 0x24df050: adding gid fe80::9803:9b03:33:f763 to hash on device mlx5_1 port 1 index 0)
[1642768358.164927] [node2:42970:0]        ud_iface.c:421  UCX  DEBUG iface 0x24df050: adding gid fe80:: to hash on device mlx5_1 port 1 index 1)
[1642768358.164953] [node2:42970:0]        ud_iface.c:421  UCX  DEBUG iface 0x24df050: adding gid fe80:: to hash on device mlx5_1 port 1 index 2)
[1642768358.164979] [node2:42970:0]        ud_iface.c:421  UCX  DEBUG iface 0x24df050: adding gid fe80:: to hash on device mlx5_1 port 1 index 3)
[1642768358.165004] [node2:42970:0]        ud_iface.c:421  UCX  DEBUG iface 0x24df050: adding gid fe80:: to hash on device mlx5_1 port 1 index 4)
[1642768358.165029] [node2:42970:0]        ud_iface.c:421  UCX  DEBUG iface 0x24df050: adding gid fe80:: to hash on device mlx5_1 port 1 index 5)
[1642768358.165053] [node2:42970:0]        ud_iface.c:421  UCX  DEBUG iface 0x24df050: adding gid fe80:: to hash on device mlx5_1 port 1 index 6)
[1642768358.165078] [node2:42970:0]        ud_iface.c:421  UCX  DEBUG iface 0x24df050: adding gid fe80:: to hash on device mlx5_1 port 1 index 7)
[1642768358.165519] [node2:42970:0]         ib_mlx5.c:889  UCX  DEBUG SL=0 (AR support - no) was selected on mlx5_1:1, SLs with AR support = { <none> }, SLs without AR support = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 }
[1642768358.165656] [node2:42970:0]     timer_wheel.c:41   UCX  DEBUG high res timer created log=23 resolution=3813.003636 usec wanted: 2500.000000 usec
[1642768358.165666] [node2:42970:0]           async.c:231  UCX  DEBUG added async handler 0x1ebfe00 [id=59 ref 1] uct_ud_iface_async_handler() to hash
[1642768358.165693] [node2:42970:0]           async.c:509  UCX  DEBUG listening to async event fd 59 events 0x5 mode thread_spinlock
[1642768358.165982] [node2:42970:0]      ucp_worker.c:1202 UCX  DEBUG created interface[13]=0x24df050 using ud_mlx5/mlx5_1:1 on worker 0x1e96320
[1642768358.166023] [node2:42970:0]           mpool.c:100  UCX  DEBUG mpool uct_scopy_iface_tx_mp: align 64, maxelems 4294967295, elemsize 736
[1642768358.166035] [node2:42970:0]      ucp_worker.c:1202 UCX  DEBUG created interface[14]=0x25ba200 using cma/memory on worker 0x1e96320
[1642768358.166055] [node2:42970:0]           mpool.c:100  UCX  DEBUG mpool uct_scopy_iface_tx_mp: align 64, maxelems 4294967295, elemsize 736
[1642768358.166062] [node2:42970:0]      ucp_worker.c:1202 UCX  DEBUG created interface[15]=0x25ba9b0 using knem/memory on worker 0x1e96320
[1642768358.166066] [node2:42970:0]      ucp_worker.c:1008 UCX  DEBUG selected scalable tl bitmap: 0xffff 0x0 (16 tls)
[1642768358.170755] [node2:42970:0]           async.c:231  UCX  DEBUG added async handler 0x23f4ba0 [id=44 ref 1] ucp_worker_iface_async_fd_event() to hash
[1642768358.170771] [node2:42970:0]           async.c:509  UCX  DEBUG listening to async event fd 44 events 0x0 mode thread_spinlock
[1642768358.170810] [node2:42970:0]           async.c:231  UCX  DEBUG added async handler 0x25bafc0 [id=46 ref 1] ucp_worker_iface_async_fd_event() to hash
[1642768358.170817] [node2:42970:0]           async.c:509  UCX  DEBUG listening to async event fd 46 events 0x0 mode thread_spinlock
[1642768358.170833] [node2:42970:0]           async.c:231  UCX  DEBUG added async handler 0x23f4860 [id=47 ref 1] ucp_worker_iface_async_fd_event() to hash
[1642768358.170843] [node2:42970:0]           async.c:509  UCX  DEBUG listening to async event fd 47 events 0x0 mode thread_spinlock
[1642768358.170895] [node2:42970:0]            sock.c:90   UCX  DEBUG ioctl(req=35142, ifr_name=lo) failed: Operation not supported
[1642768358.170908] [node2:42970:0]         tcp_net.c:61   UCX  DEBUG speed of lo is UNKNOWN, assuming 100 Mbps
[1642768358.171054] [node2:42970:0]            sock.c:90   UCX  DEBUG ioctl(req=35142, ifr_name=lo) failed: Operation not supported
[1642768358.171059] [node2:42970:0]         tcp_net.c:61   UCX  DEBUG speed of lo is UNKNOWN, assuming 100 Mbps
[1642768358.171145] [node2:42970:0]            sock.c:90   UCX  DEBUG ioctl(req=35142, ifr_name=lo) failed: Operation not supported
[1642768358.171151] [node2:42970:0]         tcp_net.c:61   UCX  DEBUG speed of lo is UNKNOWN, assuming 100 Mbps
[1642768358.171233] [node2:42970:0]            sock.c:90   UCX  DEBUG ioctl(req=35142, ifr_name=lo) failed: Operation not supported
[1642768358.171238] [node2:42970:0]         tcp_net.c:61   UCX  DEBUG speed of lo is UNKNOWN, assuming 100 Mbps
[1642768358.171318] [node2:42970:0]            sock.c:90   UCX  DEBUG ioctl(req=35142, ifr_name=lo) failed: Operation not supported
[1642768358.171324] [node2:42970:0]         tcp_net.c:61   UCX  DEBUG speed of lo is UNKNOWN, assuming 100 Mbps
[1642768358.171404] [node2:42970:0]            sock.c:90   UCX  DEBUG ioctl(req=35142, ifr_name=lo) failed: Operation not supported
[1642768358.171409] [node2:42970:0]         tcp_net.c:61   UCX  DEBUG speed of lo is UNKNOWN, assuming 100 Mbps
[1642768358.171488] [node2:42970:0]            sock.c:90   UCX  DEBUG ioctl(req=35142, ifr_name=lo) failed: Operation not supported
[1642768358.171493] [node2:42970:0]         tcp_net.c:61   UCX  DEBUG speed of lo is UNKNOWN, assuming 100 Mbps
[1642768358.171572] [node2:42970:0]            sock.c:90   UCX  DEBUG ioctl(req=35142, ifr_name=lo) failed: Operation not supported
[1642768358.171577] [node2:42970:0]         tcp_net.c:61   UCX  DEBUG speed of lo is UNKNOWN, assuming 100 Mbps
[1642768358.171657] [node2:42970:0]            sock.c:90   UCX  DEBUG ioctl(req=35142, ifr_name=lo) failed: Operation not supported
[1642768358.171662] [node2:42970:0]         tcp_net.c:61   UCX  DEBUG speed of lo is UNKNOWN, assuming 100 Mbps
[1642768358.171742] [node2:42970:0]            sock.c:90   UCX  DEBUG ioctl(req=35142, ifr_name=lo) failed: Operation not supported
[1642768358.171747] [node2:42970:0]         tcp_net.c:61   UCX  DEBUG speed of lo is UNKNOWN, assuming 100 Mbps
[1642768358.171827] [node2:42970:0]            sock.c:90   UCX  DEBUG ioctl(req=35142, ifr_name=lo) failed: Operation not supported
[1642768358.171833] [node2:42970:0]         tcp_net.c:61   UCX  DEBUG speed of lo is UNKNOWN, assuming 100 Mbps
[1642768358.171911] [node2:42970:0]            sock.c:90   UCX  DEBUG ioctl(req=35142, ifr_name=lo) failed: Operation not supported
[1642768358.171923] [node2:42970:0]         tcp_net.c:61   UCX  DEBUG speed of lo is UNKNOWN, assuming 100 Mbps
[1642768358.172003] [node2:42970:0]            sock.c:90   UCX  DEBUG ioctl(req=35142, ifr_name=lo) failed: Operation not supported
[1642768358.172008] [node2:42970:0]         tcp_net.c:61   UCX  DEBUG speed of lo is UNKNOWN, assuming 100 Mbps
[1642768358.172085] [node2:42970:0]           async.c:231  UCX  DEBUG added async handler 0x23f48a0 [id=49 ref 1] ucp_worker_iface_async_fd_event() to hash
[1642768358.172094] [node2:42970:0]           async.c:509  UCX  DEBUG listening to async event fd 49 events 0x0 mode thread_spinlock
[1642768358.175111] [node2:42970:0]           async.c:231  UCX  DEBUG added async handler 0x1fb4570 [id=51 ref 1] ucp_worker_iface_async_fd_event() to hash
[1642768358.175121] [node2:42970:0]           async.c:509  UCX  DEBUG listening to async event fd 51 events 0x0 mode thread_spinlock
[1642768358.176106] [node2:42970:0]           async.c:231  UCX  DEBUG added async handler 0x23f4970 [id=60 ref 1] ucp_worker_iface_async_fd_event() to hash
[1642768358.176114] [node2:42970:0]           async.c:509  UCX  DEBUG listening to async event fd 60 events 0x0 mode thread_spinlock
[1642768358.176118] [node2:42970:0]           async.c:231  UCX  DEBUG added async handler 0x1fb4bb0 [id=53 ref 1] ucp_worker_iface_async_fd_event() to hash
[1642768358.176127] [node2:42970:0]           async.c:509  UCX  DEBUG listening to async event fd 53 events 0x0 mode thread_spinlock
[1642768358.176176] [node2:42970:0]           async.c:231  UCX  DEBUG added async handler 0x1fb4bf0 [id=54 ref 1] ucp_worker_iface_async_fd_event() to hash
[1642768358.176183] [node2:42970:0]           async.c:509  UCX  DEBUG listening to async event fd 54 events 0x0 mode thread_spinlock
[1642768358.176214] [node2:42970:0]           async.c:231  UCX  DEBUG added async handler 0x23d2220 [id=56 ref 1] ucp_worker_iface_async_fd_event() to hash
[1642768358.176221] [node2:42970:0]           async.c:509  UCX  DEBUG listening to async event fd 56 events 0x0 mode thread_spinlock
[1642768358.177599] [node2:42970:0]           async.c:231  UCX  DEBUG added async handler 0x1ebcc30 [id=61 ref 1] uct_rdmacm_cm_event_handler() to hash
[1642768358.177612] [node2:42970:0]           async.c:509  UCX  DEBUG listening to async event fd 61 events 0x1 mode thread_spinlock
[1642768358.177618] [node2:42970:0]       rdmacm_cm.c:949  UCX  DEBUG created rdmacm_cm 0x23d2380 with event_channel 0x23d2c90 (fd=61)
[1642768358.177648] [node2:42970:0]      tcp_sockcm.c:215  UCX  DEBUG created tcp_sockcm 0x2438050
[1642768358.177857] [node2:42970:0]          ucp_ep.c:229  UCX  DEBUG created ep 0x7f8de55cb000 to <no debug data> mem_type_ep:cuda
[1642768358.178020] [node2:42970:0]      ucp_worker.c:1867 UCX  INFO    ep_cfg[0]: tag(cuda_copy/cuda); rma(gdr_copy/cuda); 
[1642768358.178028] [node2:42970:0]          wireup.c:1076 UCX  DEBUG   ep 0x7f8de55cb000: am_lane <none> wireup_msg_lane <none> cm_lane <none> reachable_mds 0x50 ep_check_map 0x0
[1642768358.178036] [node2:42970:0]          wireup.c:1086 UCX  DEBUG   ep 0x7f8de55cb000: lane[0]:  8:gdr_copy/cuda.0 md[6]         -> addr[2].md[6]/gdr_copy/sysdev[1]
[1642768358.178040] [node2:42970:0]          wireup.c:1086 UCX  DEBUG   ep 0x7f8de55cb000: lane[1]:  6:cuda_copy/cuda.0 md[4]        -> addr[0].md[4]/cuda_cpy/sysdev[1] rma_bw#0
[1642768358.178078] [node2:42970:0]          ucp_ep.c:229  UCX  DEBUG created ep 0x7f8de55cb040 to <no debug data> mem_type_ep:cuda-managed
[1642768358.178112] [node2:42970:0]      ucp_worker.c:1867 UCX  INFO    ep_cfg[1]: tag(cuda_copy/cuda); rma(cuda_copy/cuda); 
[1642768358.178115] [node2:42970:0]          wireup.c:1076 UCX  DEBUG   ep 0x7f8de55cb040: am_lane <none> wireup_msg_lane <none> cm_lane <none> reachable_mds 0x10 ep_check_map 0x0
[1642768358.178119] [node2:42970:0]          wireup.c:1086 UCX  DEBUG   ep 0x7f8de55cb040: lane[0]:  6:cuda_copy/cuda.0 md[4]        -> addr[0].md[4]/cuda_cpy/sysdev[1] rma_bw#0
[1642768358.178125] [node2:42970:0]           mpool.c:100  UCX  DEBUG mpool ucp_requests: align 64, maxelems 4294967295, elemsize 296
[1642768358.178128] [node2:42970:0]           mpool.c:100  UCX  DEBUG mpool ucp_rkeys: align 64, maxelems 4294967295, elemsize 104
[1642768358.178130] [node2:42970:0]           mpool.c:100  UCX  DEBUG mpool ucp_reg_bufs: align 64, maxelems 4294967295, elemsize 8216
[1642768358.178148] [node2:42970:0]           mpool.c:100  UCX  DEBUG mpool ucp_am_bufs: align 64, maxelems 4294967295, elemsize 153
[1642768358.178150] [node2:42970:0]           mpool.c:100  UCX  DEBUG mpool ucp_am_bufs: align 64, maxelems 4294967295, elemsize 1113
[1642768358.178152] [node2:42970:0]           mpool.c:100  UCX  DEBUG mpool ucp_am_bufs: align 64, maxelems 4294967295, elemsize 65625
[1642768358.178155] [node2:42970:0]       mpool_set.c:121  UCX  DEBUG mpool_set:ucp_am_bufs, sizes map 0x80000440, largest size 65536, mpools num 3
[1642768358.178183] [node2:42970:0]          parser.c:1916 UCX  INFO  UCX_* env variable: UCX_LOG_LEVEL=debug
[1642768358.178334] [node2:42970:0]          ucp_ep.c:229  UCX  DEBUG created ep 0x7f8de55cb080 to <no debug data> from api call
[1642768358.178411] [node2:42970:0]      ucp_worker.c:1867 UCX  INFO    ep_cfg[2]: tag(rc_mlx5/mlx5_1:1); 
[1642768358.178415] [node2:42970:0]          wireup.c:1076 UCX  DEBUG   ep 0x7f8de55cb080: am_lane 0 wireup_msg_lane 0 cm_lane <none> reachable_mds 0x88 ep_check_map 0x0
[1642768358.178420] [node2:42970:0]          wireup.c:1086 UCX  DEBUG   ep 0x7f8de55cb080: lane[0]: 10:rc_mlx5/mlx5_1:1.0 md[7]      -> addr[10].md[7]/ib/sysdev[4] rma_bw#0 am am_bw#0 wireup
[1642768358.180412] [node2:42970:0]           rc_ep.c:161  UCX  DEBUG   created rc ep 0x2438cc0
[1642768358.180439] [node2:42970:0]       wireup_ep.c:549  UCX  DEBUG   ep 0x7f8de55cb080: wireup_ep 0x2438a40 created next_ep 0x2438cc0 to <no debug data> using rc_mlx5/mlx5_1:1
[1642768358.180523] [node2:42970:0]        ib_iface.c:752  UCX  DEBUG   iface 0x24df050: ah_attr dlid=26 sl=0 port=1 src_path_bits=0
[1642768358.180562] [node2:42970:0]        ib_iface.c:752  UCX  DEBUG   iface 0x24df050: ah_attr dlid=26 sl=0 port=1 src_path_bits=0
[1642768358.180605] [node2:42970:0]           ud_ep.c:378  UCX  DEBUG   created ep ep=0x2452a00 iface=0x24df050 id=0
[1642768358.180629] [node2:42970:0]           ud_ep.c:496  UCX  DEBUG   mlx5_1:1/IB lid 28 qpn 0x1551 epid 0 ep 0x2452a00 connected to IFACE lid 26 fe80::pkey 0xffff  qpn 0x16c9
[1642768358.180639] [node2:42970:0]        ib_iface.c:752  UCX  DEBUG   iface 0x24df050: ah_attr dlid=26 sl=0 port=1 src_path_bits=0
[1642768358.180650] [node2:42970:0]        ib_iface.c:752  UCX  DEBUG   iface 0x24df050: ah_attr dlid=26 sl=0 port=1 src_path_bits=0
[1642768358.182252] [node2:42970:0]           mpool.c:237  UCX  DEBUG   mpool ud_tx_skb: allocated chunk 0x2800018 of 6291432 bytes with 1489 elements
[1642768358.184762] [node2:42970:0]           async.c:231  UCX  DEBUG   added async handler 0x2452c50 [id=1000020 ref 1] uct_ud_iface_timer() to hash
[1642768358.184783] [node2:42970:0]       wireup_ep.c:316  UCX  DEBUG   ep 0x7f8de55cb080: wireup_ep 0x2438a40 created aux_ep 0x2452a00 to <no debug data> using ud_mlx5/mlx5_1:1
[1642768358.194539] [node2:42970:0]           mpool.c:237  UCX  DEBUG   mpool rc_recv_desc: allocated chunk 0x7f8d9f000018 of 39845864 bytes with 4752 elements
[1642768358.194746] [node2:42970:0]          wireup.c:1430 UCX  DEBUG ep 0x7f8de55cb080: send wireup request (flags=0x40)
[1642768358.194799] [node2:42970:a]        ib_iface.c:752  UCX  DEBUG iface 0x24df050: ah_attr dlid=26 sl=0 port=1 src_path_bits=0
[1642768358.194856] [node2:42970:a]           ud_ep.c:765  UCX  DEBUG simultaneuous CREQ ep=0x2452a00(iface=0x24df050 conn_sn=0 ep_id=0, dest_ep_id=0 rx_psn=1)
[1642768358.195045] [node2:42970:0]           mpool.c:237  UCX  DEBUG mpool ucp_requests: allocated chunk 0x261b024 of 41044 bytes with 128 elements
[1642768358.200090] [node2:42970:a]           mpool.c:237  UCX  DEBUG mpool ud_recv_skb: allocated chunk 0x7f8d98200018 of 20971496 bytes with 4964 elements
[1642768358.200309] [node2:42970:a]        ib_iface.c:752  UCX  DEBUG   iface 0x200f040: ah_attr dlid=26 sl=0 port=1 src_path_bits=0
[1642768358.201548] [node2:42970:a]    rc_mlx5_devx.c:454  UCX  DEBUG   connected rc devx qp 0x351 on mlx5_1:1/IB to lid 26(+0) sl 0 remote_qp 0x42d mtu 4096 timer 18x7 rnr 13x7 rd_atom 4
[1642768358.201626] [node2:42970:0]           flush.c:310  UCX  DEBUG flush_worker ep 0x7f8de55cb080
[1642768358.205665] [node2:42970:0]       wireup_ep.c:478  UCX  DEBUG ep 0x7f8de55cb080: destroy wireup ep 0x2438a40
[1642768358.205698] [node2:42970:0]           ud_ep.c:1684 UCX  DEBUG ep 0x2452a00: disconnect
[1642768358.205711] [node2:42970:0]           async.c:156  UCX  DEBUG removed async handler 0x2452c50 [id=1000020 ref 1] uct_ud_iface_timer() from hash
[1642768358.205719] [node2:42970:0]           async.c:562  UCX  DEBUG removing async handler 0x2452c50 [id=1000020 ref 1] uct_ud_iface_timer()
[1642768358.205732] [node2:42970:0]           async.c:171  UCX  DEBUG release async handler 0x2452c50 [id=1000020 ref 0] uct_ud_iface_timer()
[1642768358.211855] [node2:42970:a]       ib_device.c:486  UCX  DEBUG IB Async event on mlx5_1: SRQ-attached QP 0x351 was flushed
[1642768358.220322] [node2:42970:0]     ib_mlx5_log.c:168  UCX  DEBUG Local protection on mlx5_1:1/IB (synd 0x4 vend 0x51 hw_synd 0/2)
[1642768358.220322] [node2:42970:0]     ib_mlx5_log.c:168  UCX  DEBUG RC QP 0x351 wqe[0]: opcode SEND 
[1642768358.220334] [node2:42970:0]      ucp_worker.c:520  UCX  DEBUG worker 0x1e96320: error handler called for UCT EP 0x2438cc0: Input/output error
[1642768358.220344] [node2:42970:0]          ucp_ep.c:1172 UCX  DEBUG ep 0x7f8de55cb080: set_ep_failed status Input/output error on lane[0]=0x2438cc0
[1642768358.220349] [node2:42970:0]          ucp_ep.c:1434 UCX  DEBUG ep 0x7f8de55cb080: discarding lanes
[1642768358.220352] [node2:42970:0]          ucp_ep.c:1444 UCX  DEBUG ep 0x7f8de55cb080: discard uct_ep[0]=0x2438cc0
[1642768358.220357] [node2:42970:0]           mpool.c:237  UCX  DEBUG mpool send-ops-mpool: allocated chunk 0x25eb050 of 16472 bytes with 256 elements
[1642768358.220364] [node2:42970:0]          ucp_ep.c:1226 UCX  DIAG  ep 0x7f8de55cb080: error 'Input/output error' on rc_mlx5/mlx5_1:1 will not be handled since no error callback is installed
[node2:42970:0:42970] ib_mlx5_log.c:168  Local protection on mlx5_1:1/IB (synd 0x4 vend 0x51 hw_synd 0/2)
[node2:42970:0:42970] ib_mlx5_log.c:168  RC QP 0x351 wqe[0]: SEND s-- [inl len 26] [va 0x7f8db3a00000 len 8230 lkey 0x42332] [rqpn 0x42d dlid=26 sl=0 port=1 src_path_bits=0]
==== backtrace (tid:  42970) ====
 0 0x0000000000025183 uct_ib_mlx5_completion_with_err()  /tmp/ucx-1.12.0/src/uct/ib/mlx5/ib_mlx5_log.c:162
 1 0x0000000000039b87 uct_ib_mlx5_poll_cq()  /tmp/ucx-1.12.0/src/uct/ib/mlx5/ib_mlx5.inl:91
 2 0x0000000000039b87 uct_rc_mlx5_iface_progress()  /tmp/ucx-1.12.0/src/uct/ib/rc/accel/rc_mlx5_iface.c:173
 3 0x0000000000039b87 uct_rc_mlx5_iface_progress_cyclic()  /tmp/ucx-1.12.0/src/uct/ib/rc/accel/rc_mlx5_iface.c:178
 4 0x000000000003b4da ucs_callbackq_dispatch()  /tmp/ucx-1.12.0/src/ucs/datastruct/callbackq.h:211
 5 0x000000000003b4da uct_worker_progress()  /tmp/ucx-1.12.0/src/uct/api/uct.h:2589
 6 0x000000000003b4da ucp_worker_progress()  /tmp/ucx-1.12.0/src/ucp/core/ucp_worker.c:2629
 7 0x00000000004608aa ucp_perf_test_runner<(ucx_perf_cmd_t)7, (ucx_perf_test_type_t)2, 0u>::progress()  /tmp/ucx-1.12.0/src/tools/perf/lib/ucp_tests.cc:219
 8 0x00000000004608aa ucp_perf_test_runner<(ucx_perf_cmd_t)7, (ucx_perf_test_type_t)2, 0u>::send()  /tmp/ucx-1.12.0/src/tools/perf/lib/ucp_tests.cc:388
 9 0x00000000004608aa ucp_perf_test_runner<(ucx_perf_cmd_t)7, (ucx_perf_test_type_t)2, 0u>::run_stream_uni()  /tmp/ucx-1.12.0/src/tools/perf/lib/ucp_tests.cc:763
10 0x0000000000455248 ucp_perf_test_runner<(ucx_perf_cmd_t)7, (ucx_perf_test_type_t)2, 0u>::run()  /tmp/ucx-1.12.0/src/tools/perf/lib/ucp_tests.cc:790
11 0x0000000000455248 ucp_perf_test_dispatch()  /tmp/ucx-1.12.0/src/tools/perf/lib/ucp_tests.cc:935
12 0x0000000000408860 ucx_perf_run()  /tmp/ucx-1.12.0/src/tools/perf/lib/libperf.c:1695
13 0x000000000040581d run_test_recurs()  /tmp/ucx-1.12.0/src/tools/perf/perftest_run.c:273
14 0x0000000000405eda run_test()  /tmp/ucx-1.12.0/src/tools/perf/perftest_run.c:332
15 0x00000000004046cc main()  /tmp/ucx-1.12.0/src/tools/perf/perftest.c:975
16 0x00000000004046cc main()  /tmp/ucx-1.12.0/src/tools/perf/perftest.c:984
17 0x0000000000022555 __libc_start_main()  ???:0
18 0x0000000000404c9b _start()  ???:0
=================================
@cgorac cgorac added the Bug label Jan 21, 2022
@yosefe
Copy link
Contributor

yosefe commented Jan 21, 2022

@cgorac seems there is problem with GPU direct on this setup.
Can you pls check if ACS is enabled? see https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html#pci-access-control-services-acs

@cgorac
Copy link
Author

cgorac commented Jan 21, 2022

Indeed that's it. I've used setpci to turn off ACS on PLX PCI bridges that had it on, and now ucx_perftest works without crashing. The bandwidth is not where I'd like to see it, but that's another topic. Many thanks for your help!

@cgorac cgorac closed this as completed Jan 21, 2022
@cgorac
Copy link
Author

cgorac commented Jan 21, 2022

One more question here: I realized later that MPI programs (it's OpenMPI in particular) that are linked with static version of CUDA runtime library still crash, with the same error, even if I turned off ACS on all PLX PCI bridges. If I add UCX_MEMTYPE_CACHE=0, they they work (but performance this way is visibly worse than even if program copy data between GPU and host memory itself, and then pass host buffer pointers only to MPI calls). I understood that by UCX 1.12.0 using UCX_MEMTYPE_CACHE=0 is not needed at all any more, is that correct or are there still some exceptions, like this one?

@cgorac cgorac reopened this Jan 21, 2022
@yosefe
Copy link
Contributor

yosefe commented Jan 21, 2022

@cgorac does it happen only with statically linked programs (so dynamic link of the same program works fine)?
does the program use cuda stream memory allocation (cudaMallocAsync)?

@cgorac
Copy link
Author

cgorac commented Jan 21, 2022

Yes, the program in question crashes with the same error as mentioned above only if linked with static version of CUDA runtime library. It doesn't use cudaMallocAsync().

@yosefe
Copy link
Contributor

yosefe commented Jan 22, 2022

@cgorac thanks for the clarification,

  1. can you pls run the program with export UCX_MEM_LOG_LEVEL=debug and upload the resulting log?
  2. do you know which Cuda API are used to allocate and release memory?

@cgorac
Copy link
Author

cgorac commented Jan 22, 2022

Sure, the log is below (note that I removed function names etc. from the program itself in the stack traces etc., as these are not relevant anyway). The program in question uses just couple of cudaMalloc() calls to allocate memory, actually it pre-allocates most of the memory needed by one of these calls, and then implements own pool allocator.

At the moment of crash, the program issues some MPI_Isend()/MPI_Irecv() calls, passing pointers to the GPU memory from above mentioned pool as arguments to these calls, and then it calls MPI_Waitall(). As mentioned above, the crash won't happen if I run with UCX_MEMTYPE_CACHE=0; maybe using the pool is actually the cause of the problem, because of clashing with how UCX employs memtype cache?

Here is the log:
[1642876601.101673] [node1:126706:0]              sys.c:326  UCX  DEBUG loaded '/opt/tools/ucx/1.12.0/lib/libucm.so.0' at 0x7f2356669000 with NODELETE flag
[1642876601.101709] [node1:126706:0]          install.c:364  UCX  DEBUG mmap: installing bistro hook for mmap = 0x7f2351cb4060 for event 0x1
[1642876601.101726] [node1:126706:0]            reloc.h:71   UCX  DEBUG original mmap() is at 0x7f23539ccf90
[1642876601.101763] [node1:126706:0]          install.c:364  UCX  DEBUG mmap: installing bistro hook for munmap = 0x7f2351cb40d0 for event 0x2
[1642876601.101777] [node1:126706:0]            reloc.h:71   UCX  DEBUG original munmap() is at 0x7f23539cd050
[1642876601.101816] [node1:126706:0]          install.c:364  UCX  DEBUG mmap: installing bistro hook for mremap = 0x7f2351cb4110 for event 0x4
[1642876601.101830] [node1:126706:0]            reloc.h:71   UCX  DEBUG original mremap() is at 0x7f23539d3290
[1642876601.101849] [node1:126706:0]          install.c:364  UCX  DEBUG mmap: installing bistro hook for shmat = 0x7f2351cb4170 for event 0x8
[1642876601.101866] [node1:126706:0]            reloc.h:71   UCX  DEBUG original shmat() is at 0x7f23539d43c0
[1642876601.101887] [node1:126706:0]          install.c:364  UCX  DEBUG mmap: installing bistro hook for shmdt = 0x7f2351cb41b0 for event 0x10
[1642876601.101902] [node1:126706:0]            reloc.h:71   UCX  DEBUG original shmdt() is at 0x7f23539d43f0
[1642876601.101925] [node1:126706:0]          install.c:364  UCX  DEBUG mmap: installing bistro hook for sbrk = 0x7f2351cb41e0 for event 0x20
[1642876601.101942] [node1:126706:0]            reloc.h:71   UCX  DEBUG original sbrk() is at 0x7f23539c93f0
[1642876601.101970] [node1:126706:0]          install.c:364  UCX  DEBUG mmap: installing bistro hook for brk = 0x7f2351cb4210 for event 0x80
[1642876601.101985] [node1:126706:0]            reloc.h:71   UCX  DEBUG original brk() is at 0x7f23539c9380
[1642876601.102000] [node1:126706:0]          install.c:364  UCX  DEBUG mmap: installing bistro hook for madvise = 0x7f2351cb4240 for event 0x40
[1642876601.102015] [node1:126706:0]            reloc.h:71   UCX  DEBUG original madvise() is at 0x7f23539cd110
[1642876601.102039] [node1:126706:0]          install.c:277  UCX  DEBUG testing mmap installed events 0x300ff
[1642876601.102192] [node1:126706:0]          install.c:284  UCX  DEBUG mmap installed events test: got 0x300ff out of 0x300ff
[1642876601.102204] [node1:126706:0]          install.c:425  UCX  INFO mmap installed events = 0x300ff
[1642876601.103077] [node2:11067:0]               sys.c:326  UCX  DEBUG loaded '/opt/tools/ucx/1.12.0/lib/libucm.so.0' at 0x7fbffdeb9000 with NODELETE flag
[1642876601.103111] [node2:11067:0]           install.c:364  UCX  DEBUG mmap: installing bistro hook for mmap = 0x7fbff9501060 for event 0x1
[1642876601.103127] [node2:11067:0]             reloc.h:71   UCX  DEBUG original mmap() is at 0x7fbffb219f90
[1642876601.103163] [node2:11067:0]           install.c:364  UCX  DEBUG mmap: installing bistro hook for munmap = 0x7fbff95010d0 for event 0x2
[1642876601.103176] [node2:11067:0]             reloc.h:71   UCX  DEBUG original munmap() is at 0x7fbffb21a050
[1642876601.103198] [node2:11067:0]           install.c:364  UCX  DEBUG mmap: installing bistro hook for mremap = 0x7fbff9501110 for event 0x4
[1642876601.103210] [node2:11067:0]             reloc.h:71   UCX  DEBUG original mremap() is at 0x7fbffb220290
[1642876601.103230] [node2:11067:0]           install.c:364  UCX  DEBUG mmap: installing bistro hook for shmat = 0x7fbff9501170 for event 0x8
[1642876601.103241] [node2:11067:0]             reloc.h:71   UCX  DEBUG original shmat() is at 0x7fbffb2213c0
[1642876601.103260] [node2:11067:0]           install.c:364  UCX  DEBUG mmap: installing bistro hook for shmdt = 0x7fbff95011b0 for event 0x10
[1642876601.103272] [node2:11067:0]             reloc.h:71   UCX  DEBUG original shmdt() is at 0x7fbffb2213f0
[1642876601.103288] [node2:11067:0]           install.c:364  UCX  DEBUG mmap: installing bistro hook for sbrk = 0x7fbff95011e0 for event 0x20
[1642876601.103300] [node2:11067:0]             reloc.h:71   UCX  DEBUG original sbrk() is at 0x7fbffb2163f0
[1642876601.103320] [node2:11067:0]           install.c:364  UCX  DEBUG mmap: installing bistro hook for brk = 0x7fbff9501210 for event 0x80
[1642876601.103345] [node2:11067:0]             reloc.h:71   UCX  DEBUG original brk() is at 0x7fbffb216380
[1642876601.103360] [node2:11067:0]           install.c:364  UCX  DEBUG mmap: installing bistro hook for madvise = 0x7fbff9501240 for event 0x40
[1642876601.103371] [node2:11067:0]             reloc.h:71   UCX  DEBUG original madvise() is at 0x7fbffb21a110
[1642876601.103392] [node2:11067:0]           install.c:277  UCX  DEBUG testing mmap installed events 0x300ff
[1642876601.103535] [node2:11067:0]           install.c:284  UCX  DEBUG mmap installed events test: got 0x300ff out of 0x300ff
[1642876601.103547] [node2:11067:0]           install.c:425  UCX  INFO mmap installed events = 0x300ff
[1642876602.021397] [node2:11067:0]           install.c:277  UCX  DEBUG testing mmap existing events 0x0
[1642876602.021431] [node2:11067:0]           install.c:284  UCX  DEBUG mmap existing events test: got 0x0 out of 0x0
[1642876602.021444] [node2:11067:0]             event.c:523  UCX  DEBUG mmap hooks are ready
[1642876602.021456] [node2:11067:0]       malloc_hook.c:574  UCX  DEBUG ucs_malloc_is_ready(before test): have 0x0/0x0 events; mmap_mode=2 hook_called=0
[1642876602.021465] [node2:11067:0]             event.c:533  UCX  DEBUG malloc hooks are ready
[1642876602.022116] [node2:11067:0]             reloc.h:71   UCX  DEBUG original cuMemAlloc() is at 0x7fbffb6e4c10
[1642876602.022143] [node2:11067:0]     bistro_x86_64.c:284  UCX  DEBUG 'cuMemAlloc' at 0x7fbffb6e4c10 code length 18/5 prefix length 10
[1642876602.022175] [node2:11067:0]             reloc.h:71   UCX  DEBUG original cuMemAlloc_v2() is at 0x7fbffb72e620
[1642876602.022186] [node2:11067:0]     bistro_x86_64.c:284  UCX  DEBUG 'cuMemAlloc_v2' at 0x7fbffb72e620 code length 18/5 prefix length 10
[1642876602.022208] [node2:11067:0]             reloc.h:71   UCX  DEBUG original cuMemAllocManaged() is at 0x7fbffb70a6d0
[1642876602.022227] [node2:11067:0]     bistro_x86_64.c:284  UCX  DEBUG 'cuMemAllocManaged' at 0x7fbffb70a6d0 code length 18/5 prefix length 10
[1642876602.022248] [node2:11067:0]             reloc.h:71   UCX  DEBUG original cuMemAllocPitch() is at 0x7fbffb6e4be0
[1642876602.022258] [node2:11067:0]     bistro_x86_64.c:284  UCX  DEBUG 'cuMemAllocPitch' at 0x7fbffb6e4be0 code length 18/5 prefix length 10
[1642876602.022274] [node2:11067:0]             reloc.h:71   UCX  DEBUG original cuMemAllocPitch_v2() is at 0x7fbffb70a850
[1642876602.022300] [node2:11067:0]     bistro_x86_64.c:284  UCX  DEBUG 'cuMemAllocPitch_v2' at 0x7fbffb70a850 code length 18/5 prefix length 10
[1642876602.022325] [node2:11067:0]             reloc.h:71   UCX  DEBUG original cuMemFree() is at 0x7fbffb6e4bb0
[1642876602.022347] [node2:11067:0]     bistro_x86_64.c:284  UCX  DEBUG 'cuMemFree' at 0x7fbffb6e4bb0 code length 18/5 prefix length 10
[1642876602.022363] [node2:11067:0]             reloc.h:71   UCX  DEBUG original cuMemFree_v2() is at 0x7fbffb70a820
[1642876602.022373] [node2:11067:0]     bistro_x86_64.c:284  UCX  DEBUG 'cuMemFree_v2' at 0x7fbffb70a820 code length 18/5 prefix length 10
[1642876602.022396] [node2:11067:0]             reloc.h:71   UCX  DEBUG original cuMemFreeHost() is at 0x7fbffb70a790
[1642876602.022406] [node2:11067:0]     bistro_x86_64.c:284  UCX  DEBUG 'cuMemFreeHost' at 0x7fbffb70a790 code length 18/5 prefix length 10
[1642876602.022426] [node2:11067:0]             reloc.h:71   UCX  DEBUG original cuMemFreeHost_v2() is at 0x0
[1642876602.022443] [node2:11067:0]           cudamem.c:256  UCX  INFO cuda memory hooks on driver API: installed 8 bistro and 0 reloc
[1642876602.022456] [node2:11067:0]             event.c:606  UCX  DEBUG added user handler (func=0x7fbff9981f00 arg=0x20ded1b0) for events=0x200000 prio=1000
[1642876602.030339] [node2:11067:0]           install.c:277  UCX  DEBUG testing mmap existing events 0x200ff
[1642876602.030432] [node2:11067:0]           install.c:284  UCX  DEBUG mmap existing events test: got 0x200ff out of 0x200ff
[1642876602.030445] [node2:11067:0]             event.c:523  UCX  DEBUG mmap hooks are ready
[1642876602.030455] [node2:11067:0]       malloc_hook.c:574  UCX  DEBUG ucs_malloc_is_ready(before test): have 0x0/0x20000 events; mmap_mode=2 hook_called=0
[1642876602.030464] [node2:11067:0]       malloc_hook.c:600  UCX  DEBUG testing malloc...
[1642876602.030509] [node2:11067:0]       malloc_hook.c:642  UCX  DEBUG malloc test: have 0x20000 out of 0x20000, malloc/free hooks were not called
[1642876602.030522] [node2:11067:0]       malloc_hook.c:574  UCX  DEBUG ucs_malloc_is_ready(after test): have 0x20000/0x20000 events; mmap_mode=2 hook_called=0
[1642876602.030531] [node2:11067:0]             event.c:533  UCX  DEBUG malloc hooks are ready
[1642876602.030540] [node2:11067:0]             event.c:606  UCX  DEBUG added user handler (func=0x7fbff9981f00 arg=0x1677fea0) for events=0x220000 prio=1000
[1642876602.038975] [node2:11067:0]           install.c:277  UCX  DEBUG testing mmap existing events 0x200ff
[1642876602.039063] [node2:11067:0]           install.c:284  UCX  DEBUG mmap existing events test: got 0x200ff out of 0x200ff
[1642876602.039074] [node2:11067:0]             event.c:523  UCX  DEBUG mmap hooks are ready
[1642876602.039088] [node2:11067:0]       malloc_hook.c:574  UCX  DEBUG ucs_malloc_is_ready(before test): have 0x20000/0x20000 events; mmap_mode=2 hook_called=0
[1642876602.039098] [node2:11067:0]             event.c:533  UCX  DEBUG malloc hooks are ready
[1642876602.039107] [node2:11067:0]             event.c:606  UCX  DEBUG added user handler (func=0x7fbff9981f00 arg=0x20df1df0) for events=0x220000 prio=1000
[1642876602.041673] [node2:11067:0]           install.c:277  UCX  DEBUG testing mmap existing events 0x200ff
[1642876602.041762] [node2:11067:0]           install.c:284  UCX  DEBUG mmap existing events test: got 0x200ff out of 0x200ff
[1642876602.041775] [node2:11067:0]             event.c:523  UCX  DEBUG mmap hooks are ready
[1642876602.041784] [node2:11067:0]       malloc_hook.c:574  UCX  DEBUG ucs_malloc_is_ready(before test): have 0x20000/0x20000 events; mmap_mode=2 hook_called=0
[1642876602.041794] [node2:11067:0]             event.c:533  UCX  DEBUG malloc hooks are ready
[1642876602.041802] [node2:11067:0]             event.c:606  UCX  DEBUG added user handler (func=0x7fbff9981f00 arg=0x20df0770) for events=0x20000 prio=1000
[1642876602.063397] [node1:126706:0]          install.c:277  UCX  DEBUG testing mmap existing events 0x0
[1642876602.063423] [node1:126706:0]          install.c:284  UCX  DEBUG mmap existing events test: got 0x0 out of 0x0
[1642876602.063437] [node1:126706:0]            event.c:523  UCX  DEBUG mmap hooks are ready
[1642876602.063448] [node1:126706:0]      malloc_hook.c:574  UCX  DEBUG ucs_malloc_is_ready(before test): have 0x0/0x0 events; mmap_mode=2 hook_called=0
[1642876602.063458] [node1:126706:0]            event.c:533  UCX  DEBUG malloc hooks are ready
[1642876602.064059] [node1:126706:0]            reloc.h:71   UCX  DEBUG original cuMemAlloc() is at 0x7f2353e97c10
[1642876602.064086] [node1:126706:0]    bistro_x86_64.c:284  UCX  DEBUG 'cuMemAlloc' at 0x7f2353e97c10 code length 18/5 prefix length 10
[1642876602.064111] [node1:126706:0]            reloc.h:71   UCX  DEBUG original cuMemAlloc_v2() is at 0x7f2353ee1620
[1642876602.064125] [node1:126706:0]    bistro_x86_64.c:284  UCX  DEBUG 'cuMemAlloc_v2' at 0x7f2353ee1620 code length 18/5 prefix length 10
[1642876602.064154] [node1:126706:0]            reloc.h:71   UCX  DEBUG original cuMemAllocManaged() is at 0x7f2353ebd6d0
[1642876602.064170] [node1:126706:0]    bistro_x86_64.c:284  UCX  DEBUG 'cuMemAllocManaged' at 0x7f2353ebd6d0 code length 18/5 prefix length 10
[1642876602.064191] [node1:126706:0]            reloc.h:71   UCX  DEBUG original cuMemAllocPitch() is at 0x7f2353e97be0
[1642876602.064203] [node1:126706:0]    bistro_x86_64.c:284  UCX  DEBUG 'cuMemAllocPitch' at 0x7f2353e97be0 code length 18/5 prefix length 10
[1642876602.064222] [node1:126706:0]            reloc.h:71   UCX  DEBUG original cuMemAllocPitch_v2() is at 0x7f2353ebd850
[1642876602.064234] [node1:126706:0]    bistro_x86_64.c:284  UCX  DEBUG 'cuMemAllocPitch_v2' at 0x7f2353ebd850 code length 18/5 prefix length 10
[1642876602.064254] [node1:126706:0]            reloc.h:71   UCX  DEBUG original cuMemFree() is at 0x7f2353e97bb0
[1642876602.064266] [node1:126706:0]    bistro_x86_64.c:284  UCX  DEBUG 'cuMemFree' at 0x7f2353e97bb0 code length 18/5 prefix length 10
[1642876602.064285] [node1:126706:0]            reloc.h:71   UCX  DEBUG original cuMemFree_v2() is at 0x7f2353ebd820
[1642876602.064298] [node1:126706:0]    bistro_x86_64.c:284  UCX  DEBUG 'cuMemFree_v2' at 0x7f2353ebd820 code length 18/5 prefix length 10
[1642876602.064318] [node1:126706:0]            reloc.h:71   UCX  DEBUG original cuMemFreeHost() is at 0x7f2353ebd790
[1642876602.064330] [node1:126706:0]    bistro_x86_64.c:284  UCX  DEBUG 'cuMemFreeHost' at 0x7f2353ebd790 code length 18/5 prefix length 10
[1642876602.064353] [node1:126706:0]            reloc.h:71   UCX  DEBUG original cuMemFreeHost_v2() is at 0x0
[1642876602.064364] [node1:126706:0]          cudamem.c:256  UCX  INFO cuda memory hooks on driver API: installed 8 bistro and 0 reloc
[1642876602.064378] [node1:126706:0]            event.c:606  UCX  DEBUG added user handler (func=0x7f2352134f00 arg=0x1798f630) for events=0x200000 prio=1000
[1642876602.072392] [node1:126706:0]          install.c:277  UCX  DEBUG testing mmap existing events 0x200ff
[1642876602.072483] [node1:126706:0]          install.c:284  UCX  DEBUG mmap existing events test: got 0x200ff out of 0x200ff
[1642876602.072496] [node1:126706:0]            event.c:523  UCX  DEBUG mmap hooks are ready
[1642876602.072509] [node1:126706:0]      malloc_hook.c:574  UCX  DEBUG ucs_malloc_is_ready(before test): have 0x0/0x20000 events; mmap_mode=2 hook_called=0
[1642876602.072530] [node1:126706:0]      malloc_hook.c:600  UCX  DEBUG testing malloc...
[1642876602.072577] [node1:126706:0]      malloc_hook.c:642  UCX  DEBUG malloc test: have 0x20000 out of 0x20000, malloc/free hooks were not called
[1642876602.072589] [node1:126706:0]      malloc_hook.c:574  UCX  DEBUG ucs_malloc_is_ready(after test): have 0x20000/0x20000 events; mmap_mode=2 hook_called=0
[1642876602.072601] [node1:126706:0]            event.c:533  UCX  DEBUG malloc hooks are ready
[1642876602.072612] [node1:126706:0]            event.c:606  UCX  DEBUG added user handler (func=0x7f2352134f00 arg=0x16adebb0) for events=0x220000 prio=1000
[1642876602.081471] [node1:126706:0]          install.c:277  UCX  DEBUG testing mmap existing events 0x200ff
[1642876602.081564] [node1:126706:0]          install.c:284  UCX  DEBUG mmap existing events test: got 0x200ff out of 0x200ff
[1642876602.081578] [node1:126706:0]            event.c:523  UCX  DEBUG mmap hooks are ready
[1642876602.081591] [node1:126706:0]      malloc_hook.c:574  UCX  DEBUG ucs_malloc_is_ready(before test): have 0x20000/0x20000 events; mmap_mode=2 hook_called=0
[1642876602.081603] [node1:126706:0]            event.c:533  UCX  DEBUG malloc hooks are ready
[1642876602.081612] [node1:126706:0]            event.c:606  UCX  DEBUG added user handler (func=0x7f2352134f00 arg=0x16adecb0) for events=0x220000 prio=1000
[1642876602.084088] [node1:126706:0]          install.c:277  UCX  DEBUG testing mmap existing events 0x200ff
[1642876602.084173] [node1:126706:0]          install.c:284  UCX  DEBUG mmap existing events test: got 0x200ff out of 0x200ff
[1642876602.084186] [node1:126706:0]            event.c:523  UCX  DEBUG mmap hooks are ready
[1642876602.084198] [node1:126706:0]      malloc_hook.c:574  UCX  DEBUG ucs_malloc_is_ready(before test): have 0x20000/0x20000 events; mmap_mode=2 hook_called=0
[1642876602.084210] [node1:126706:0]            event.c:533  UCX  DEBUG malloc hooks are ready
[1642876602.084219] [node1:126706:0]            event.c:606  UCX  DEBUG added user handler (func=0x7f2352134f00 arg=0x1a4dca00) for events=0x20000 prio=1000
[1642876602.167665] [node1:126706:0]          install.c:277  UCX  DEBUG testing mmap internal events 0x20000
[1642876602.167836] [node1:126706:0]          install.c:284  UCX  DEBUG mmap internal events test: got 0x20000 out of 0x20000
[1642876602.168817] [node2:11067:0]           install.c:277  UCX  DEBUG testing mmap internal events 0x20000
[1642876602.169038] [node2:11067:0]           install.c:284  UCX  DEBUG mmap internal events test: got 0x20000 out of 0x20000
[1642876602.607228] [node2:11067:0]           install.c:277  UCX  DEBUG testing mmap existing events 0x0
[1642876602.607290] [node2:11067:0]           install.c:284  UCX  DEBUG mmap existing events test: got 0x0 out of 0x0
[1642876602.607317] [node2:11067:0]             event.c:523  UCX  DEBUG mmap hooks are ready
[1642876602.607341] [node2:11067:0]       malloc_hook.c:574  UCX  DEBUG ucs_malloc_is_ready(before test): have 0x20000/0x0 events; mmap_mode=2 hook_called=0
[1642876602.607367] [node2:11067:0]             event.c:533  UCX  DEBUG malloc hooks are ready
[1642876602.609402] [node2:11067:0]             event.c:606  UCX  DEBUG added user handler (func=0x7fbff9980980 arg=0x23810430) for events=0x300000 prio=1000
[1642876602.611212] [node1:126706:0]          install.c:277  UCX  DEBUG testing mmap existing events 0x0
[1642876602.611242] [node1:126706:0]          install.c:284  UCX  DEBUG mmap existing events test: got 0x0 out of 0x0
[1642876602.611252] [node1:126706:0]            event.c:523  UCX  DEBUG mmap hooks are ready
[1642876602.611261] [node1:126706:0]      malloc_hook.c:574  UCX  DEBUG ucs_malloc_is_ready(before test): have 0x20000/0x0 events; mmap_mode=2 hook_called=0
[1642876602.611271] [node1:126706:0]            event.c:533  UCX  DEBUG malloc hooks are ready
[1642876602.613276] [node1:126706:0]            event.c:606  UCX  DEBUG added user handler (func=0x7f2352133980 arg=0x22b75e60) for events=0x300000 prio=1000

[node1:126706:0:126706] ib_mlx5_log.c:168  Local protection on mlx5_1:1/IB (synd 0x4 vend 0x53 hw_synd 0/157)
[node1:126706:0:126706] ib_mlx5_log.c:168  RC QP 0x6ff wqe[494]: SEND s-- [inl len 26] [va 0x7f1b36f31000 len 8230 lkey 0x23c46b] [rqpn 0x553 dlid=28 sl=0 port=1 src_path_bits=0]
==== backtrace (tid: 126706) ====
 0 0x0000000000025183 uct_ib_mlx5_completion_with_err()  /tmp/ucx-1.12.0/src/uct/ib/mlx5/ib_mlx5_log.c:162
 1 0x0000000000039b87 uct_ib_mlx5_poll_cq()  /tmp/ucx-1.12.0/src/uct/ib/mlx5/ib_mlx5.inl:91
 2 0x0000000000039b87 uct_rc_mlx5_iface_progress()  /tmp/ucx-1.12.0/src/uct/ib/rc/accel/rc_mlx5_iface.c:173
 3 0x0000000000039b87 uct_rc_mlx5_iface_progress_cyclic()  /tmp/ucx-1.12.0/src/uct/ib/rc/accel/rc_mlx5_iface.c:178
 4 0x000000000003b4da ucs_callbackq_dispatch()  /tmp/ucx-1.12.0/src/ucs/datastruct/callbackq.h:211
 5 0x000000000003b4da uct_worker_progress()  /tmp/ucx-1.12.0/src/uct/api/uct.h:2589
 6 0x000000000003b4da ucp_worker_progress()  /tmp/ucx-1.12.0/src/ucp/core/ucp_worker.c:2629
 7 0x00000000000418ac opal_progress()  /tmp/openmpi-4.1.2/opal/runtime/opal_progress.c:231
 8 0x00000000000804ad sync_wait_st()  /tmp/openmpi-4.1.2/ompi/../opal/threads/wait_sync.h:83
 9 0x00000000000804ad ompi_request_default_wait_all()  /tmp/openmpi-4.1.2/ompi/request/req_wait.c:234
10 0x00000000000a7e2c PMPI_Waitall()  /tmp/openmpi-4.1.2/ompi/mpi/c/profile/pwaitall.c:80
11 0x00000000038f0702 ???:0
12 0x0000000002a9fd31 ???:0
13 0x0000000002a241e9 ???:0
14 0x00000000028b7cf2 ???:0
15 0x000000000281c646 ???:0
16 0x00000000015d7759 ???:0
17 0x0000000001640f9d ???:0
18 0x0000000001069f1c ???:0
19 0x0000000000f8b1eb ???:0
20 0x0000000000f8b5bb ???:0
21 0x00000000008656d6 ???:0
22 0x00000000009b5fe1 ???:0
23 0x00000000009b743f ???:0
24 0x00000000009b794c ???:0
25 0x00000000009b86fe ???:0
26 0x000000000075d52e ???:0
27 0x000000000075e620 ???:0
28 0x0000000000481726 ???:0
29 0x0000000000022555 __libc_start_main()  ???:0
30 0x000000000074378f ???:0
=================================
[node1:126706] *** Process received signal ***
[node1:126706] Signal: Aborted (6)
[node1:126706] Signal code:  (-6)
[node1:126706] [ 0] /lib64/libpthread.so.0(+0xf630)[0x7f2355c56630]
[node1:126706] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x7f235390a387]
[node1:126706] [ 2] /lib64/libc.so.6(abort+0x148)[0x7f235390ba78]
[node1:126706] [ 3] /opt/tools/ucx/1.12.0/lib/libucs.so.0(ucs_fatal_error_message+0x55)[0x7f235212bbb5]
[node1:126706] [ 4] /opt/tools/ucx/1.12.0/lib/libucs.so.0(ucs_log_default_handler+0x614)[0x7f2352130804]
[node1:126706] [ 5] /opt/tools/ucx/1.12.0/lib/libucs.so.0(ucs_log_dispatch+0xdf)[0x7f2352130bbf]
[node1:126706] [ 6] /opt/tools/ucx/1.12.0/lib/ucx/libuct_ib.so.0(uct_ib_mlx5_completion_with_err+0x5c3)[0x7f2341860183]
[node1:126706] [ 7] /opt/tools/ucx/1.12.0/lib/ucx/libuct_ib.so.0(+0x39b87)[0x7f2341874b87]
[node1:126706] [ 8] /opt/tools/ucx/1.12.0/lib/libucp.so.0(ucp_worker_progress+0x6a)[0x7f23526c64da]
[node1:126706] [ 9] /opt/tools/openmpi/4.1.2/lib/libopen-pal.so.40(opal_progress+0x2c)[0x7f23529728ac]
[node1:126706] [10] /opt/tools/openmpi/4.1.2/lib/libmpi.so.40(ompi_request_default_wait_all+0xed)[0x7f23554ec4ad]
[node1:126706] [11] /opt/tools/openmpi/4.1.2/lib/libmpi.so.40(PMPI_Waitall+0x1c)[0x7f2355513e2c]
[node1:126706] [12] 
[node1:126706] [13] 
[node1:126706] [14] 
[node1:126706] [15] 
[node1:126706] [16] 
[node1:126706] [17] 
[node1:126706] [18] 
[node1:126706] [19] 
[node1:126706] [20] 
[node1:126706] [21] 
[node1:126706] [22] 
[node1:126706] [23] 
[node1:126706] [24] 
[node1:126706] [25] 
[node1:126706] [26] 
[node1:126706] [27] 
[node1:126706] [28] 
[node1:126706] [29] 
[node1:126706] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node node1 exited on signal 6 (Aborted).
--------------------------------------------------------------------------

@yosefe
Copy link
Contributor

yosefe commented Jan 22, 2022

@cgorac unfortuantely the log above does not indicate a problem with memory hooks; is it possible to upload a reproducer code?
if not, can you pls rebuild UCX with debug log enabled and collect log file file UCX_MEM_LOG_LEVEL=trace UCX_LOG_LEVEL=data (can be big)? see also https://github.com/openucx/ucx/wiki/Logging

@cgorac
Copy link
Author

cgorac commented Jan 24, 2022

Tried to create minimal reproducing example with MPI_Isend()/MPI_Irecv() from/to GPU memory buffers and then MPI_Waitall(), but it won't crash even if linked with static version of CUDA runtime library, and even if it allocates much more GPU memory than actually needed. So I'm attaching the log, with above mentioned logging options turned on, of the offending program.
crash.log

@yosefe
Copy link
Contributor

yosefe commented Jan 24, 2022

Seems there is a mismatch betwen memory region range

[1643015174.601669] [node1:19319:0]          rcache.c:957  UCX  TRACE mlx5_1: created region 0x239b6710 [0x7f4e42f37800..0x7f4e42f3c780] gt rw ref 2 lkey 0x23c43e rkey 0x23c43e atomic_rkey 0xffffffff

And send operation buffer:

[1643015174.601719] [node1:19319:0]       rc_mlx5.inl:482  UCX  DATA  QP 0x9b2 [494] SEND s-- [inl len 26] [va 0x7f4e42f31000 len 8230 lkey 0x23c43e] -- am 3 EGR_F tag 0 msgid a7eed382d5bf89aa len 13952

(0x7f4e42f31000 < 0x7f4e42f37800 which is wrong)

@cgorac can you pls check if #7791 resolves the issue?

@cgorac
Copy link
Author

cgorac commented Jan 24, 2022

I confirm that it fixes it.

Just to clarify it: that is an issue with UCX itself, right?

@yosefe
Copy link
Contributor

yosefe commented Jan 24, 2022

I confirm that it fixes it.

Just to clarify it: that is an issue with UCX itself, right?

Thanks! Yes, it's an issue with UCX memtype cache logic

@cgorac
Copy link
Author

cgorac commented Jan 24, 2022

Good, I hope then the fix lands in the next release. I have now performance issues to examine, but I'm closing this one. Many thanks for all your help!

@cgorac cgorac closed this as completed Jan 24, 2022
@yosefe
Copy link
Contributor

yosefe commented Jan 24, 2022

Good, I hope then the fix lands in the next release. I have now performance issues to examine, but I'm closing this one. Many thanks for all your help!

Thank you for reporting the issue! we plan to have the fix in v1.13.0 and v1.12.1.

@yosefe
Copy link
Contributor

yosefe commented Jan 24, 2022

reopening the issue ; will close when PR is merged

@yosefe
Copy link
Contributor

yosefe commented Feb 2, 2022

Fixed by #7881 and #7791

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants