Hang when using polling fd #1492
Comments
Since it reproduces only with large messages, it looks like a side effect of the RNDV (rendezvous) protocol. The receiver catches only the RTS packet, and the corresponding event is processed before the UCX request completes.
This reverts commit 71474ca.
@artpol84 @evgeny-leksikov the problem is with the test (which was modeled after the buggy ucp_hello_world example) --
@yosefe I'm not sure I understand 100%. Will the buggy hello world example be fixed so I can see it in the code?
yes
Going to sleep = calling epoll?
@artpol84 yes. You need to call probe before epoll; it can be before or after arm, as long as a successful probe at that point means you do not go to sleep.
I see what you are saying. In the original ucp_hello_world.c example, the do-while loop skips the very first probe. The reason I asked twice was that I had the impression that you should probe before arm (at least that made more sense to me), and from your original note I got the feeling that it is mandatory to call probe after arm. Now I see that my understanding was OK.
@artpol84 exactly. Just to stress the problem: the wait() for the send request may call ucp_worker_progress(). That call might already get a receive completion and put the message on the unexpected queue. Then we go to sleep and never wake up (since the message was already received).
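The race described above can be modeled without UCX at all. The following is a minimal sketch, assuming a pipe as a stand-in for the worker's event fd and two counters as a stand-in for the wire and the unexpected queue. Every name in it (`sim_worker_t`, `sim_deliver`, `sim_progress`, `sim_probe`, `sim_sleep`, `receive_one`) is hypothetical and is not part of the UCX API; it only illustrates why a probe must precede the blocking call.

```c
#include <assert.h>
#include <poll.h>
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical stand-in for a worker: a wakeup pipe plus two counters. */
typedef struct {
    int wake_fd[2];  /* [0] is polled by the receiver, [1] is written on arrival */
    int on_wire;     /* messages that arrived but were not progressed yet */
    int unexpected;  /* messages already progressed into the unexpected queue */
} sim_worker_t;

static void sim_init(sim_worker_t *w)
{
    int rc = pipe(w->wake_fd);
    assert(rc == 0);
    (void)rc;
    w->on_wire = 0;
    w->unexpected = 0;
}

/* A message arrives: make the wakeup fd readable and count the message. */
static void sim_deliver(sim_worker_t *w)
{
    char byte = 1;
    ssize_t n = write(w->wake_fd[1], &byte, 1);
    assert(n == 1);
    (void)n;
    w->on_wire++;
}

/* Progress consumes pending events and queues the messages as unexpected. */
static void sim_progress(sim_worker_t *w)
{
    char byte;
    while (w->on_wire > 0) {
        ssize_t n = read(w->wake_fd[0], &byte, 1);
        assert(n == 1);
        (void)n;
        w->on_wire--;
        w->unexpected++;
    }
}

/* Probe only inspects the unexpected queue; it does not touch the fd. */
static bool sim_probe(const sim_worker_t *w)
{
    return w->unexpected > 0;
}

/* Sleep on the wakeup fd; true if an event arrived before the timeout. */
static bool sim_sleep(sim_worker_t *w, int timeout_ms)
{
    struct pollfd pfd = { .fd = w->wake_fd[0], .events = POLLIN, .revents = 0 };
    return poll(&pfd, 1, timeout_ms) > 0;
}

/* Returns true if the receiver sees the message, false if it "hangs"
 * (the 100 ms timeout stands in for sleeping forever). */
bool receive_one(bool probe_before_sleep)
{
    sim_worker_t w;
    bool got = false;

    sim_init(&w);
    sim_deliver(&w);
    sim_progress(&w);  /* models wait() on the send request progressing the worker */

    if (probe_before_sleep && sim_probe(&w)) {
        got = true;    /* message already queued: do not go to sleep */
    } else if (sim_sleep(&w, 100)) {
        sim_progress(&w);
        got = sim_probe(&w);
    }

    close(w.wake_fd[0]);
    close(w.wake_fd[1]);
    return got;
}
```

In this model `receive_one(false)` sleeps until the timeout because the earlier progress call already consumed the only event, while `receive_one(true)` finds the queued message and never blocks. In real code the probe, arm, and sleep steps would be the UCX and epoll calls discussed in this thread; the simulation makes no claim about their exact behavior.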
I'm seeing hangs when using my custom-made latency test that was derived from ucp_hello_world.c:
https://github.com/artpol84/poc/tree/ucx_hang_demo/ucx/latency.
Command line is the same, here is how you can reproduce the issue:
Since communication is symmetric, a hang may occur on either side.
In one particular case, the sender side is not hanging but waiting for the message send to complete:
Receiver side (the one that actually hangs):
If I manually interrupt poll, fake the return value, and force it to proceed to processing, I see that the receive event has actually occurred, but poll wasn't interrupted for some reason (or *_arm hasn't captured the existing event).
If I introduce an artificial delay at the sender side:
https://github.com/artpol84/poc/blob/ucx_hang_demo/ucx/latency/ucp_latency.c#L270
the hang goes away.