Hang when using polling fd #1492

Closed
artpol84 opened this issue May 10, 2017 · 8 comments

@artpol84
Contributor

I'm seeing hangs when using my custom-made latency test that was derived from ucp_hello_world.c:
https://github.com/artpol84/poc/tree/ucx_hang_demo/ucx/latency

The command line is the same; here is how to reproduce the issue:

server: ./launch.sh ./ucp_latency -f
client: ./launch.sh ./ucp_latency -f -n vegas06 -p 13337

Since the communication is symmetric, the hang may occur on either side.
In one particular case, the sender side is not hung, but is waiting for the message send to complete:

(gdb) bt
#0  0x00007ffff6516213 in pthread_spin_lock () from /usr/lib64/libpthread.so.0
#1  0x00007ffff562b9e6 in mlx5_poll_cq () from /usr/lib64/libmlx5-rdmav2.so
#2  0x00007ffff72a0e33 in ibv_poll_cq (wc=0x7fffffffd888, num_entries=<optimized out>, cq=<optimized out>)
    at /usr/include/infiniband/verbs.h:1360
#3  uct_ib_poll_cq (wcs=0x7fffffffd888, count=<synthetic pointer>, cq=<optimized out>)
    at /hpc/local/benchmarks/hpcx_install_Tuesday/src/hpcx-gcc-redhat7.2/ucx-master/src/uct/ib/base/ib_device.h:270
#4  uct_ud_verbs_iface_poll_rx (is_async=0, iface=0x6bd020) at ib/ud/verbs/ud_verbs.c:320
#5  uct_ud_verbs_iface_progress (arg=0x6bd020) at ib/ud/verbs/ud_verbs.c:368
#6  0x00007ffff727d68e in ucs_callbackq_dispatch (cbq=0x648f18, cbq=0x648f18)
    at /hpc/local/benchmarks/hpcx_install_Tuesday/src/hpcx-gcc-redhat7.2/ucx-master/src/ucs/datastruct/callbackq.inl:39
#7  uct_worker_progress (worker=0x648f10) at base/uct_md.c:233
#8  0x00007ffff7bc71dd in ucp_worker_progress (worker=0x656030) at core/ucp_worker.c:719
#9  0x0000000000401a77 in wait (ucp_worker=0x656030, context=0x6e1748) at ucp_latency.c:123
#10 0x0000000000401ec3 in send_msg (ucp_worker=0x656030, server_ep=0x611280, msg_len=262144) at ucp_latency.c:284
#11 0x0000000000401f35 in pong_ping (ucp_worker=0x656030, ep=0x611280) at ucp_latency.c:301
#12 0x000000000040215e in run_ucx_client (ucp_worker=0x656030) at ucp_latency.c:372
#13 0x0000000000402553 in run_test (server=0x7fffffffe34e "vegas06", ucp_worker=0x656030) at ucp_latency.c:486
#14 0x00000000004029e4 in main (argc=6, argv=0x7fffffffdfe8) at ucp_latency.c:617

Receiver side (the one that actually hangs):

(gdb) bt
#0  0x00007ffff75bfd13 in epoll_wait () from /usr/lib64/libc.so.6
#1  0x0000000000401b5d in test_poll_wait (ucp_worker=0x61f2d0) at ucp_latency.c:161
#2  0x0000000000401bd4 in recv_msg (ucp_worker=0x61f2d0, server_ep=0x6248e0) at ucp_latency.c:194
#3  0x0000000000401f73 in ping_pong (ucp_worker=0x61f2d0, ep=0x6248e0, len=262144) at ucp_latency.c:307
#4  0x000000000040247c in run_ucx_server (ucp_worker=0x61f2d0) at ucp_latency.c:470
#5  0x0000000000402561 in run_test (server=0x0, ucp_worker=0x61f2d0) at ucp_latency.c:488
#6  0x00000000004029e4 in main (argc=2, argv=0x7fffffffe018) at ucp_latency.c:617

If I manually interrupt the poll, fake the return value, and force execution to proceed to processing, I can see that a receive event has actually occurred, but the poll wasn't interrupted for some reason (or *_arm didn't capture the already-pending event):

(gdb) fini
Run till exit from #0  0x00007ffff75bfd13 in epoll_wait () from /usr/lib64/libc.so.6
0x0000000000401b5d in test_poll_wait (ucp_worker=0x61f2d0) at ucp_latency.c:161
161             ret = epoll_wait(epoll_fd_local, &ev, 1, -1);
(gdb) n
162         } while ((ret == -1) && (errno == EINTR));
(gdb) set ret = 0
(gdb) n
164         ret = UCS_OK;
(gdb)
167         close(epoll_fd_local);
(gdb)
169         return ret;
(gdb)
170     }
(gdb)
recv_msg (ucp_worker=0x61f2d0, server_ep=0x6248e0) at ucp_latency.c:195
195                 if (status != UCS_OK) {
(gdb)
201             ucp_worker_progress(ucp_worker);
(gdb)
204             msg_tag = ucp_tag_probe_nb(ucp_worker, tag, tag_mask, 1, &info_tag);
(gdb)
205         } while (msg_tag == NULL);
(gdb)
207         if( same_buf ) {
(gdb) p msg_tag
$1 = (ucp_tag_message_h) 0x7ffff49ed2a9

If I introduce an artificial delay on the sender side:
https://github.com/artpol84/poc/blob/ucx_hang_demo/ucx/latency/ucp_latency.c#L270
the hang goes away.

@shamisp shamisp added this to the v1.2 - release milestone May 10, 2017
@yosefe yosefe added the Bug label May 11, 2017
@evgeny-leksikov
Contributor

Since it reproduces only for large messages, it looks like a side effect of the RNDV protocol: the receiver catches only the event for the RTS packet, which is processed earlier than the UCX request is completed.

@yosefe yosefe modified the milestones: v1.2 - release, v1.3 May 15, 2017
evgeny-leksikov pushed a commit to evgeny-leksikov/ucx that referenced this issue Jun 2, 2017
evgeny-leksikov added a commit to evgeny-leksikov/ucx that referenced this issue Jun 2, 2017
evgeny-leksikov added a commit to evgeny-leksikov/ucx that referenced this issue Jun 4, 2017
evgeny-leksikov added a commit to evgeny-leksikov/ucx that referenced this issue Jun 4, 2017
@yosefe
Contributor

yosefe commented Jun 17, 2017

@artpol84 @evgeny-leksikov the problem is with the test (which was modeled after the buggy ucp_hello_world example):
the last thing done before going to sleep should be a ucp_tag_probe_nb() call. Otherwise, a message might already have been received by ucp_worker_progress(), but the test would still go to sleep.
See the documentation for ucp_worker_arm(), particularly that "check_for_events" is done before arming the worker and going to sleep.
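For reference, here is a minimal sketch of the loop shape being described. It assumes the worker's event fd (obtained via ucp_worker_get_efd()) has already been registered with epoll_fd; the helper name wait_for_tag and its parameters are hypothetical, not taken from the example code:

#include <ucp/api/ucp.h>
#include <sys/epoll.h>
#include <errno.h>

/* Hypothetical helper: block until a message matching tag/tag_mask can be
 * probed. Assumes epoll_fd already watches the fd returned by
 * ucp_worker_get_efd(). */
static ucp_tag_message_h wait_for_tag(ucp_worker_h worker, int epoll_fd,
                                      ucp_tag_t tag, ucp_tag_t tag_mask,
                                      ucp_tag_recv_info_t *info)
{
    ucp_tag_message_h msg;
    struct epoll_event ev;
    ucs_status_t status;
    int ret;

    for (;;) {
        ucp_worker_progress(worker);

        /* Probe *before* sleeping: the message may already be sitting on
         * the unexpected queue, in which case no further event will ever
         * wake us up. */
        msg = ucp_tag_probe_nb(worker, tag, tag_mask, 1, info);
        if (msg != NULL) {
            return msg;
        }

        /* Arm the worker. UCS_ERR_BUSY means events are pending, so we
         * must progress and probe again instead of sleeping. */
        status = ucp_worker_arm(worker);
        if (status == UCS_ERR_BUSY) {
            continue;
        } else if (status != UCS_OK) {
            return NULL; /* real code should handle the error */
        }

        /* Only now is it safe to block. */
        do {
            ret = epoll_wait(epoll_fd, &ev, 1, -1);
        } while ((ret == -1) && (errno == EINTR));
    }
}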

@artpol84
Contributor Author

@yosefe I'm not sure I understand 100%. Will the buggy hello world example be fixed, so that I can see the fix in the code?

@yosefe
Contributor

yosefe commented Jun 18, 2017

Yes, the fix is to probe for the message before going to sleep.

@artpol84
Contributor Author

Going to sleep = calling epoll?
So we call the probe in between arming and epoll?

@yosefe
Contributor

yosefe commented Jun 18, 2017

@artpol84 yes, the probe needs to be called before epoll; it can be before or after arm, as long as a successful probe at that point means we do not go to sleep.

@artpol84
Contributor Author

I see what you are saying. In the original ucp_hello_world.c example, the do-while loop skips the very first probe:
https://github.com/openucx/ucx/blob/master/test/examples/ucp_hello_world.c#L195:L212
Indeed, the same is true in my latency test (see the sketch below). Thanks for pointing it out.
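To make the difference concrete, here is a paraphrase of the two loop shapes (test_poll_wait() being the arm + epoll_wait helper from the backtraces above; tag, tag_mask, and info_tag as in the example):

/* Buggy shape: the first iteration sleeps before any probe, so a message
 * that already arrived during an earlier ucp_worker_progress() call is
 * never noticed. */
do {
    test_poll_wait(ucp_worker);       /* arm + epoll_wait: may block forever */
    ucp_worker_progress(ucp_worker);
    msg_tag = ucp_tag_probe_nb(ucp_worker, tag, tag_mask, 1, &info_tag);
} while (msg_tag == NULL);

/* Fixed shape: probe first, and sleep only when the probe came up empty. */
for (;;) {
    msg_tag = ucp_tag_probe_nb(ucp_worker, tag, tag_mask, 1, &info_tag);
    if (msg_tag != NULL) {
        break;
    }
    test_poll_wait(ucp_worker);
    ucp_worker_progress(ucp_worker);
}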

The reason I asked twice was that I had the impression the probe should be done before arm (at least that made more sense to me), while from your original note I got the feeling that it was mandatory to call the probe after arm. Now I see that my understanding was correct.

@yosefe
Contributor

yosefe commented Jun 18, 2017

@artpol84 exactly. Just to stress the problem: the wait() for the send request may call ucp_worker_progress(), which might already pick up a receive completion and put the message on the unexpected queue. Then we go to sleep and never wake up (since the message was already received).
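To illustrate, a hypothetical interleaving on the side that ends up hanging (function names taken from the backtraces above):

wait(send_request)
  ucp_worker_progress()    <--  consumes the incoming receive completion;
                                the message lands on the unexpected queue
recv_msg()
  test_poll_wait()
    ucp_worker_arm()            no further event will be generated
    epoll_wait(..., -1)         blocks forever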

@yosefe yosefe closed this as completed Jun 19, 2017
artpol84 added a commit to artpol84/poc that referenced this issue Jun 29, 2017