-
Notifications
You must be signed in to change notification settings - Fork 423
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
UCT/IB/MLX5: Prevent compiler to use memmove #9692
Conversation
src/uct/ib/mlx5/ib_mlx5.inl
Outdated
UCS_WORD_COPY(uint64_t, dst, uint64_t, src, MLX5_SEND_WQE_BB); | ||
#else | ||
/* Prevent the compiler to replace by memmove() */ | ||
*(uct_ib_mlx5_wqe_ctrl_seg_t *)dst = *(uct_ib_mlx5_wqe_ctrl_seg_t *)src; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the previous patch looked cleaner, wqe may consist of several different segments and uct_ib_mlx5_wqe_ctrl_seg_t
is less than 1 BB
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah right thanks, this last minute change was bogus, force pushed the right fix since PR is so small.
074160b
to
3836f2c
Compare
rechecked on setup with UD and RC repros, issue is fixed. |
src/uct/ib/mlx5/ib_mlx5.inl
Outdated
} UCS_S_PACKED uct_ib_mlx5_send_wqe_bb_t; | ||
|
||
/* Prevent the compiler to replace by memmove() */ | ||
*(uct_ib_mlx5_send_wqe_bb_t *)dst = *(uct_ib_mlx5_send_wqe_bb_t *)src; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- can we use
UCS_WORD_COPY(uct_ib_mlx5_send_wqe_bb_t, dst, uct_ib_mlx5_send_wqe_bb_t, src, MLX5_SEND_WQE_BB);
- does it really guarantee ? maybe use attribute((nooptimize)) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed 1.
I think it's not an absolute guarantee. The actual flag to prevent replacing assignments by mem*()
functions is no-tree-loop-distribute-patterns
.
But due to inlinings, seems pragma/__attribute__
are not enforced, so to get it applied we would need all top callers to have the needed attribute to be set (even existing restrict
keywords should allow compiler to replace at least by memcpy()
not memmove()
, but it's not happening).
In rdma-core
we are populating each field in _mlx5_post_send()
, so it will not be subject to this memmove()
optimization.
Various options:
- remove inline from
uct_ib_mlx5_bf_copy_bb()
: seems not best - add
volatile
, but on-O2
, code becomes a sequence ofmov
/add
/jne
- replace by
memcpy()
: seems it might not even guarantee not being replaced, and we might not be able to getxmm
-based code.
src/uct/ib/mlx5/ib_mlx5.inl
Outdated
/* Prevent the compiler to replace by memmove() */ | ||
UCS_WORD_COPY(uct_ib_mlx5_send_wqe_bb_t, dst, | ||
uct_ib_mlx5_send_wqe_bb_t, src, | ||
sizeof(uct_ib_mlx5_send_wqe_bb_t)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe MLX5_SEND_WQE_BB to align with prev lines?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
321c6a0
to
e838d7c
Compare
Thank you for patching this issue. I was the initial reporter of the issue through enterprise support. This patch fixes the ucx_perftest example that was being used to reproduce the issue, but unfortunately MPI jobs are still failing with this patch. I'm using the NAS parallel benchmarks with OpenMPI 4.1.6 to test MPI, using the "sp" test, class D. I've made a working version by compiling the entire src/uct/ib subdirectory with "-O1" to prevent use of memmove (added "sed -e '/^CFLAGS = /s/-O2/-O1/' -i src/uct/ib/Makefile" in the rpm spec file between the configure and make steps). It seems there's another code path within src/uct/ib that is still an issue. I found another ucx_perftest that still reproduces the issue with the new patched version to help simplify debugging again:
That same test works properly with the src/uct/ib -O1 fixed build. Thanks, |
What is the output of |
I'm testing with Rocky 9 on the same hardware config that was first reported with Ubuntu 22 (different customer order). Here is gcc- Here is the gist link: https://gist.github.com/QuesarVII/7d48cd2659861b51993970ceacc9e48c Thanks! |
Thanks! Assuming this library reproduces the issue, I do not see any Edit: I do see |
Yes, that objdump is from the latest git with the patch. Here's the objdump from the working build of version 1.14 with the uct/ib tree built with -O1: https://gist.github.com/QuesarVII/c79bcffab661d1e2da28b302b4d26b5c I have a system setup for remote access for Nvidia already. That 1 is running Ubuntu 22. I can test this updated patch there and see if it reproduces as well and give you the access info. |
I cannot reproduce it under Ubuntu 22 - the updated patch is fixing it there. I checked objdump -S and did not see any memmove calls in there either. I will reinstall 2 of those test systems with Rocky 9 and set it up for your access. |
What
Help compiler to optimize without introducting any function call, using instead
movl
,movq
orxmm
registers, depending on the optimization level selected.Why ?
With
-O2
only, the compiler replacesUCS_WORD_COPY(uint64_t, dst, uint64_t, src, MLX5_SEND_WQE_BB);
bymemmove()
. This is causing crash below:It is not entirely clear why, but this could be related to introduced out-of-order/block copy, added prefetch intel instruction...
Internal: 3774158
How ?
Adding compiler fence inside
UCS_WORD_COPY()
disables usage ofxmm
registers at-O3
so it is not an option.Repro