ucx_perftest crash in uct_ib_mlx5_completion_with_err #7863
Comments
@cgorac seems there is a problem with GPUDirect on this setup.
Indeed, that's it. I've used
One more question here: I realized later that MPI programs (it's OpenMPI in particular) that are linked with the static version of the CUDA runtime library still crash, with the same error, even if I turned off ACS on all PLX PCI bridges. If I add UCX_MEMTYPE_CACHE=0, then they work (but performance this way is visibly worse than if the program copies data between GPU and host memory itself and then passes only host buffer pointers to MPI calls). I understood that as of UCX 1.12.0, using UCX_MEMTYPE_CACHE=0 is not needed any more; is that correct, or are there still some exceptions, like this one?
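For context, a minimal sketch of how one might pass that variable to all ranks with Open MPI's `mpirun` (the binary name `./app` is a placeholder, not from this issue):

```sh
# -x forwards an environment variable from the launching shell to every rank,
# so UCX inside each process picks up the memtype-cache workaround.
mpirun -np 2 -x UCX_MEMTYPE_CACHE=0 ./app
```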
@cgorac does it happen only with statically linked programs (i.e., a dynamically linked build of the same program works fine)?
Yes, the program in question crashes with the same error as mentioned above only if linked with the static version of the CUDA runtime library. It doesn't use cudaMallocAsync().
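For reference, whether the CUDA runtime is linked statically or dynamically is chosen at build time; a minimal sketch with `nvcc`, where the source file name `app.cu` is a placeholder:

```sh
# Static CUDA runtime (nvcc's default): libcudart is embedded in the binary.
nvcc --cudart=static app.cu -o app_static

# Shared CUDA runtime: the binary depends on libcudart.so at run time.
nvcc --cudart=shared app.cu -o app_shared
```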
@cgorac thanks for the clarification,
Sure, the log is below (note that I removed function names etc. from the program itself in the stack traces, as these are not relevant anyway). The program in question uses just a couple of cudaMalloc() calls to allocate memory; it actually pre-allocates most of the memory it needs with one of these calls and then implements its own pool allocator. At the moment of the crash, the program issues some MPI_Isend()/MPI_Irecv() calls, passing pointers to GPU memory from the above-mentioned pool as arguments to these calls, and then it calls MPI_Waitall(). As mentioned above, the crash won't happen if I run with UCX_MEMTYPE_CACHE=0; maybe using the pool is actually the cause of the problem, because it clashes with how UCX employs the memtype cache? Here is the log:
@cgorac unfortunately the log above does not indicate a problem with memory hooks; is it possible to upload reproducer code?
Tried to create a minimal reproducing example with MPI_Isend()/MPI_Irecv() from/to GPU memory buffers and then MPI_Waitall(), but it won't crash even if linked with the static version of the CUDA runtime library, and even if it allocates much more GPU memory than actually needed. So I'm attaching the log of the offending program, with the above-mentioned logging options turned on.
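For illustration only, a minimal sketch of the kind of reproducer described above: a single large `cudaMalloc()` allocation used as a simple pool, with non-blocking MPI transfers on device pointers. All names and sizes here are assumptions, not taken from the actual program, and a CUDA-aware MPI (e.g. Open MPI over UCX) is required for device pointers to be passed directly:

```c
/* Minimal sketch: device buffers carved from one cudaMalloc() "pool",
 * exchanged between two ranks with MPI_Isend/MPI_Irecv + MPI_Waitall. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 2) {
        if (rank == 0) fprintf(stderr, "run with exactly 2 ranks\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* One large device allocation acting as a simple pool (sizes illustrative). */
    const size_t pool_size = 256 << 20;   /* 256 MiB */
    const size_t msg_size  = 1 << 20;     /* 1 MiB per message */
    char *pool = NULL;
    if (cudaMalloc((void **)&pool, pool_size) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* Carve send/recv buffers out of the pool at arbitrary offsets. */
    char *send_buf = pool;
    char *recv_buf = pool + pool_size / 2;

    MPI_Request reqs[2];
    int peer = 1 - rank;
    MPI_Irecv(recv_buf, (int)msg_size, MPI_BYTE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(send_buf, (int)msg_size, MPI_BYTE, peer, 0, MPI_COMM_WORLD, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    if (rank == 0) printf("exchange completed\n");
    cudaFree(pool);
    MPI_Finalize();
    return 0;
}
```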
Seems there is a mismatch between the memory region range:
And the send operation buffer:
I confirm that it fixes it. Just to clarify: that is an issue with UCX itself, right?
Thanks! Yes, it's an issue with the UCX memtype cache logic.
Good, I hope the fix then lands in the next release. I now have performance issues to examine, but I'm closing this one. Many thanks for all your help!
Thank you for reporting the issue! We plan to have the fix in v1.13.0 and v1.12.1.
Reopening the issue; will close when the PR is merged.
I have MLNX_OFED installed on a couple of RH 7.9 machines, with ConnectX-5 (MT27800 Family) adapters. The machines also have 4 V100 GPUs, and have the `gdrcopy` and `nvidia_peermem` (tried with `nv_peer_memory` instead, with the same outcome) drivers loaded. The UCX version is 1.12.0; I tried both the version that is pre-built and delivered along with MLNX_OFED, and one that I've built from source (using the same flags as the MLNX_OFED one reports through `ucx_info -b`). In both cases, when I try to run `ucx_perftest` to measure GPUDirect RDMA bandwidth, by running one instance on machine `node1` and then another on machine `node2`, a crash in `ucx_perftest` occurs, with the following printed out:

Because "protection" is mentioned in the output above, I've tried running both `ucx_perftest` instances as the root user, but the same thing happens. I've also tried changing various UCX-related environment variables, like setting `UCX_NET_DEVICES`, or using `UCX_MEMTYPE_CACHE=0`, but the outcome is always the same. Of course, if I set `UCX_IB_GPU_DIRECT_RDMA=no` then it works (it also works if I use `-m host` instead of `-m cuda` on the `ucx_perftest` command line). I also tried UCX 1.11.2 built from source; the same happens. I also tried every RDMA test listed, for example, here, and everything works fine. So my question is: any hint on what else to try in order to get GPUDirect RDMA working from UCX on these machines?
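The exact `ucx_perftest` command lines are not preserved above; a typical GPUDirect RDMA bandwidth measurement looks roughly like the following, where the test type and message size are assumptions rather than the original options:

```sh
# On node1, start the server side of the test with CUDA memory buffers:
ucx_perftest -t tag_bw -m cuda -s 1048576

# On node2, connect to node1 and run the same test:
ucx_perftest node1 -t tag_bw -m cuda -s 1048576
```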
Here is some additional info about my setup:
Finally, here is the output of the crashing `ucx_perftest` run, but this time with `UCX_LOG_LEVEL=debug`: