UCT/IB/MLX5: Prevent compiler to use memmove #9692

Merged
merged 1 commit into from
Feb 20, 2024

@tvegas1 tvegas1 commented Feb 16, 2024

What

Help the compiler optimize without introducing any function call, so that it uses movl, movq or xmm registers instead, depending on the optimization level selected.

Why ?

At -O2 only, the compiler replaces UCS_WORD_COPY(uint64_t, dst, uint64_t, src, MLX5_SEND_WQE_BB); with a call to memmove(). This causes the crash below:

ib_mlx5_log.c:179  Local length error on mlx5_0:1/IB (synd 0x1 vend 0x68 hw_synd 0/141)
ib_mlx5_log.c:179  UD QP 0x15a6 wqe[18856]: SEND --- [rqpn 0x15a7 rlid 5] [inl len 24] [va 0x7f2e11f79fe0 len 4072 lkey 0x203400]

It is not entirely clear why, but it could be related to the out-of-order or block copy that memmove() introduces, or to added Intel prefetch instructions.
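For readers unfamiliar with the pattern, here is a minimal sketch of the kind of word-by-word copy loop that a compiler's loop-pattern recognition can turn into a mem*() library call. `word_copy_sketch` is a hypothetical stand-in and not the UCX macro. Whether the loop is actually replaced depends on the compiler version and flags.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a word-by-word copy helper (not UCX code).
 * The loop copies `size` bytes as 64-bit words, one store per word.
 * At some optimization levels a compiler may recognize this shape and
 * emit a call to memmove()/memcpy() instead of inline moves. */
static void word_copy_sketch(void *dst, const void *src, size_t size)
{
    uint64_t *d       = dst;
    const uint64_t *s = src;
    size_t i;

    /* The sketch only handles sizes that are a whole number of words */
    assert(size % sizeof(uint64_t) == 0);

    for (i = 0; i < size / sizeof(uint64_t); ++i) {
        d[i] = s[i];
    }
}
```

Compiling a file containing such a helper at different optimization levels and inspecting the disassembly (for example with objdump -S, as done later in this thread) shows whether a library call was introduced.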

Internal: 3774158

How ?

Adding a compiler fence inside UCS_WORD_COPY() disables the use of xmm registers at -O3, so it is not an option.

Repro

  • AMD EPYC 9654 96-Core Processor
  • gcc version 11.4.0 (Ubuntu 11.4.0-1ubuntu1~22.04)
  • GNU C Library (Ubuntu GLIBC 2.35-0ubuntu3.6) stable release version 2.35.
  • IB Firmware version: 20.39.1002

UCS_WORD_COPY(uint64_t, dst, uint64_t, src, MLX5_SEND_WQE_BB);
#else
/* Prevent the compiler to replace by memmove() */
*(uct_ib_mlx5_wqe_ctrl_seg_t *)dst = *(uct_ib_mlx5_wqe_ctrl_seg_t *)src;

Contributor

the previous patch looked cleaner, wqe may consist of several different segments and uct_ib_mlx5_wqe_ctrl_seg_t is less than 1 BB

Contributor Author

ah right thanks, this last minute change was bogus, force pushed the right fix since PR is so small.

brminich previously approved these changes Feb 16, 2024
tvegas1 commented Feb 16, 2024

rechecked on setup with UD and RC repros, issue is fixed.

} UCS_S_PACKED uct_ib_mlx5_send_wqe_bb_t;

/* Prevent the compiler to replace by memmove() */
*(uct_ib_mlx5_send_wqe_bb_t *)dst = *(uct_ib_mlx5_send_wqe_bb_t *)src;

Contributor

  1. can we use
    UCS_WORD_COPY(uct_ib_mlx5_send_wqe_bb_t, dst, uct_ib_mlx5_send_wqe_bb_t, src, MLX5_SEND_WQE_BB);
  2. does it really guarantee that? maybe use __attribute__((nooptimize))?

Contributor Author

fixed 1.

I think it's not an absolute guarantee. The actual flag that prevents replacing assignments with mem*() functions is -fno-tree-loop-distribute-patterns.

But because of inlining, it seems a pragma or __attribute__ is not enforced, so to get it applied every top-level caller would need to have the attribute set. (Even the existing restrict keywords should allow the compiler to replace the copy with at least memcpy() rather than memmove(), but that is not happening.)
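As an illustration of the per-function approach discussed here, the following is a minimal sketch, assuming GCC, of attaching the equivalent of -fno-tree-loop-distribute-patterns to a single function. `NO_LOOP_TO_LIBCALL_SKETCH` and `copy_words_no_libcall_sketch` are hypothetical names. As noted above, the attribute may not be honored once the function is inlined into a caller that lacks it, so this is not a guarantee.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* GCC accepts optimize("...") as a per-function attribute; Clang does not,
 * so the macro expands to nothing there. Hypothetical macro name. */
#if defined(__GNUC__) && !defined(__clang__)
#define NO_LOOP_TO_LIBCALL_SKETCH \
    __attribute__((optimize("no-tree-loop-distribute-patterns")))
#else
#define NO_LOOP_TO_LIBCALL_SKETCH
#endif

/* Copy `count` 64-bit words. The attribute asks GCC not to turn the loop
 * into a mem*() call for this function only; restrict tells the compiler
 * that the two buffers do not overlap. */
NO_LOOP_TO_LIBCALL_SKETCH
static void copy_words_no_libcall_sketch(uint64_t *restrict dst,
                                         const uint64_t *restrict src,
                                         size_t count)
{
    size_t i;

    assert(count == 0 || (dst != NULL && src != NULL));

    for (i = 0; i < count; ++i) {
        dst[i] = src[i];
    }
}
```

The same option can also be passed on the command line for a whole translation unit, which sidesteps the inlining question at the cost of affecting every loop in that file.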

In rdma-core we are populating each field in _mlx5_post_send(), so it will not be subject to this memmove() optimization.

Various options:

  • remove inline from uct_ib_mlx5_bf_copy_bb(): seems not best
  • add volatile, but on -O2, code becomes a sequence of mov/add/jne
  • replace by memcpy(): seems it might not even guarantee not being replaced, and we might not be able to get xmm-based code.
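The volatile option from the list above can be sketched as follows. `copy_words_volatile_sketch` is a hypothetical name. Because every access is volatile, the compiler must perform each load and store individually, which rules out a library call but also rules out wider vector moves, in line with the mov/add/jne sequence mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy `count` 64-bit words through volatile pointers. Each iteration is
 * one volatile load and one volatile store, so the loop cannot be merged
 * into a mem*() call or into wider register moves. */
static void copy_words_volatile_sketch(volatile uint64_t *dst,
                                       const volatile uint64_t *src,
                                       size_t count)
{
    size_t i;

    assert(count == 0 || (dst != NULL && src != NULL));

    for (i = 0; i < count; ++i) {
        dst[i] = src[i];
    }
}
```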

/* Prevent the compiler to replace by memmove() */
UCS_WORD_COPY(uct_ib_mlx5_send_wqe_bb_t, dst,
uct_ib_mlx5_send_wqe_bb_t, src,
sizeof(uct_ib_mlx5_send_wqe_bb_t));

Contributor

maybe MLX5_SEND_WQE_BB to align with prev lines?

Contributor Author

fixed
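The shape this review converged on, a word copy whose "word" is a whole building-block-sized struct, can be sketched as below. All names (`SKETCH_BB_SIZE`, `sketch_bb_t`, `SKETCH_WORD_COPY`, `sketch_copy_bb`) are hypothetical, the 64-byte block size is an assumption, and the struct layout is a guess because the diff only shows the closing line of the real type. With the element type as large as the block, the loop runs once and becomes a single struct assignment, which the compiler is free to lower to inline moves.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_BB_SIZE 64 /* assumed size of one send WQE building block */

/* Hypothetical block-sized type: one opaque chunk of SKETCH_BB_SIZE bytes */
typedef struct {
    uint8_t data[SKETCH_BB_SIZE];
} __attribute__((packed)) sketch_bb_t;

/* Hypothetical typed copy macro: copies _size bytes, one _src_type at a time */
#define SKETCH_WORD_COPY(_dst_type, _dst, _src_type, _src, _size) \
    do { \
        size_t _i; \
        for (_i = 0; _i < (_size) / sizeof(_src_type); ++_i) { \
            *((_dst_type*)(_dst) + _i) = *((const _src_type*)(_src) + _i); \
        } \
    } while (0)

/* Copy exactly one block; the loop above executes a single iteration */
static void sketch_copy_bb(void *dst, const void *src)
{
    static_assert(sizeof(sketch_bb_t) == SKETCH_BB_SIZE,
                  "block type must span exactly one building block");
    SKETCH_WORD_COPY(sketch_bb_t, dst, sketch_bb_t, src, SKETCH_BB_SIZE);
}
```

As the author notes earlier in the thread, this is not an absolute guarantee against a library call; checking the generated code remains the only way to be sure for a given compiler.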

@yosefe yosefe merged commit 14c74f3 into openucx:master Feb 20, 2024
134 checks passed

QuesarVII commented Feb 26, 2024

Thank you for patching this issue. I was the initial reporter of the issue through enterprise support. This patch fixes the ucx_perftest example that was being used to reproduce the issue, but unfortunately MPI jobs are still failing with this patch. I'm using the NAS parallel benchmarks with OpenMPI 4.1.6 to test MPI, using the "sp" test, class D.

I've made a working version by compiling the entire src/uct/ib subdirectory with "-O1" to prevent use of memmove (added "sed -e '/^CFLAGS = /s/-O2/-O1/' -i src/uct/ib/Makefile" in the rpm spec file between the configure and make steps). It seems there's another code path within src/uct/ib that is still an issue.

I found another ucx_perftest that still reproduces the issue with the new patched version to help simplify debugging again:

[root@node2 ~]# ucx_perftest -t tag_sync_bw -s `expr 1024 \* 1024` node3
[1708968667.550414] [node2:18683:0]        perftest.c:813  UCX  WARN  CPU affinity is not set (bound to 384 cpus). Performance may be impacted.
+--------------+--------------+------------------------------+---------------------+-----------------------+
|              |              |       overhead (usec)        |   bandwidth (MB/s)  |  message rate (msg/s) |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+
|    Stage     | # iterations | 50.0%ile | average | overall |  average |  overall |  average  |  overall  |
+--------------+--------------+----------+---------+---------+----------+----------+-----------+-----------+



[root@node3 ~]# ucx_perftest
[1708968686.053814] [node3:17951:0]        perftest.c:813  UCX  WARN  CPU affinity is not set (bound to 384 cpus). Performance may be impacted.
Waiting for connection...
Accepted connection from 10.0.0.2:57492
+----------------------------------------------------------------------------------------------------------+
| API:          protocol layer                                                                             |
| Test:         tag sync match bandwidth                                                                   |
| Data layout:  (automatic)                                                                                |
| Send memory:  host                                                                                       |
| Recv memory:  host                                                                                       |
| Message size: 1048576                                                                                    |
| Window size:  32                                                                                         |
+----------------------------------------------------------------------------------------------------------+
[node3:17951:0:17951] ib_mlx5_log.c:179  Remote access error on mlx5_0:1/IB (synd 0x13 vend 0x88 hw_synd 0/0)
[node3:17951:0:17951] ib_mlx5_log.c:179  RC QP 0x2ea wqe[22]: RDMA_READ s-- [rva 0x7f42b6a2c000 rkey 0xa0000] [va 0x7fda3306e000 len 1048576 lkey 0x183200] [rqpn 0x46a dlid=12 sl=0 port=1 src_path_bits=0]
==== backtrace (tid:  17951) ====
 0 0x00000000000294ed uct_ib_mlx5_completion_with_err()  ???:0
 1 0x0000000000046454 uct_rc_mlx5_devx_cleanup_srq()  ???:0
 2 0x000000000002986d uct_ib_mlx5_check_completion()  ???:0
 3 0x000000000003fb47 uct_rc_mlx5_iface_check_rx_completion()  ???:0
 4 0x0000000000045d0a ucp_worker_progress()  ???:0
 5 0x0000000000041a1a ???()  /usr/bin/ucx_perftest:0
 6 0x000000000002af72 ???()  /usr/bin/ucx_perftest:0
 7 0x00000000000091fc ???()  /usr/bin/ucx_perftest:0
 8 0x0000000000009bdf ???()  /usr/bin/ucx_perftest:0
 9 0x0000000000005d05 ???()  /usr/bin/ucx_perftest:0
10 0x000000000003feb0 __libc_start_call_main()  ???:0
11 0x000000000003ff60 __libc_start_main_alias_2()  :0
12 0x0000000000006905 ???()  /usr/bin/ucx_perftest:0
=================================
Aborted (core dumped)


That same test works properly with the src/uct/ib -O1 fixed build.

Thanks,
Rick Warner

tvegas1 commented Feb 26, 2024

What is the output of gcc --version? Are you able to provide output of objdump -S <install_dir>/libuct_ib.so in maybe a git gist?

@QuesarVII

I'm testing with Rocky 9 on the same hardware config that was first reported with Ubuntu 22 (different customer order).

Here is gcc-
gcc --version
gcc (GCC) 11.4.1 20230605 (Red Hat 11.4.1-2)

Here is the gist link: https://gist.github.com/QuesarVII/7d48cd2659861b51993970ceacc9e48c

Thanks!

tvegas1 commented Feb 26, 2024

> Here is the gist link: https://gist.github.com/QuesarVII/7d48cd2659861b51993970ceacc9e48c

Thanks! Assuming this library reproduces the issue, I do not see any memmove(). I tried to reproduce on test1/test2 without luck. Could you please provide access to the repro setup using private communication?

Edit: I do see memmove() when viewing the full logs. Would you be able to provide objdump output from a library that reproduces the issue, built with debug symbols? I guess we would still need setup access.

@QuesarVII

Yes, that objdump is from the latest git with the patch. Here's the objdump from the working build of version 1.14 with the uct/ib tree built with -O1: https://gist.github.com/QuesarVII/c79bcffab661d1e2da28b302b4d26b5c

I have a system set up for remote access for Nvidia already. That one is running Ubuntu 22. I can test this updated patch there, see if it reproduces as well, and give you the access info.

@QuesarVII

I cannot reproduce it under Ubuntu 22 - the updated patch is fixing it there. I checked objdump -S and did not see any memmove calls in there either. I will reinstall 2 of those test systems with Rocky 9 and set it up for your access.
