UCX 1.10 issues #668

Closed
pentschev opened this issue Jan 14, 2021 · 5 comments

@pentschev
Member

After the UCX 1.10 package was created, we started seeing, and receiving reports of, several issues. The first is caused by a change of default: UCX_SOCKADDR_CM_ENABLE now defaults to y, whereas it was disabled up to and including 1.9. This causes:

[1610542356.468568] [dgx13:13389:0]    ucp_context.c:1080 UCX  ERROR UCX_SOCKADDR_CM_ENABLE is set to yes but none of the available components supports SOCKADDR_CM

If we revert that change, as done in #667, in an attempt to restore pre-1.10 UCX behavior, we see segfaults:

[1610567155.174521] [dgx13:78463:0]         tcp_ep.c:233  UCX  DEBUG tcp_ep 0x5631358d6730: created on iface 0x5631350c4890, fd -1
[1610567155.174526] [dgx13:78463:0]         tcp_cm.c:104  UCX  DEBUG tcp_ep 0x5631358d6730: CLOSED -> CONNECTING for the [10.33.225.163:35587]<->[10.33.225.163:46367]:0 connection [-:-]
[1610567155.174535] [dgx13:78463:0]         tcp_cm.c:104  UCX  DEBUG tcp_ep 0x5631358d6730: CONNECTING -> CONNECTED for the [10.33.225.163:35587]<->[10.33.225.163:46367]:0 connection [-:Rx]
[dgx13:78463:0:78463] Caught signal 11 (Segmentation fault: address not mapped to object at address 0xfffffffffffffff8)
==== backtrace (tid:  78463) ====
 0  /datasets/pentschev/miniconda3/envs/ucx-1.10-110-0.18.210113/lib/libucs.so.0(ucs_handle_error+0x10c) [0x7f74f710e3cc]
 1  /datasets/pentschev/miniconda3/envs/ucx-1.10-110-0.18.210113/lib/libucs.so.0(+0x2b74c) [0x7f74f710e74c]
 2  /datasets/pentschev/miniconda3/envs/ucx-1.10-110-0.18.210113/lib/libucs.so.0(+0x2b9c4) [0x7f74f710e9c4]
 3  /lib/x86_64-linux-gnu/libpthread.so.0(+0x12890) [0x7f75d8e35890]
 4  /datasets/pentschev/miniconda3/envs/ucx-1.10-110-0.18.210113/lib/libuct.so.0(+0x21040) [0x7f74ff648040]
 5  /datasets/pentschev/miniconda3/envs/ucx-1.10-110-0.18.210113/lib/libuct.so.0(+0x25570) [0x7f74ff64c570]
 6  /datasets/pentschev/miniconda3/envs/ucx-1.10-110-0.18.210113/lib/libucs.so.0(ucs_event_set_wait+0x103) [0x7f74f7115443]
 7  /datasets/pentschev/miniconda3/envs/ucx-1.10-110-0.18.210113/lib/libuct.so.0(uct_tcp_iface_progress+0x93) [0x7f74ff64c663]
 8  /datasets/pentschev/miniconda3/envs/ucx-1.10-110-0.18.210113/lib/libucp.so.0(+0x340ec) [0x7f74ff8a10ec]
 9  /datasets/pentschev/miniconda3/envs/ucx-1.10-110-0.18.210113/lib/libucs.so.0(+0x1eb9a) [0x7f74f7101b9a]
10  /datasets/pentschev/miniconda3/envs/ucx-1.10-110-0.18.210113/lib/libucp.so.0(ucp_worker_progress+0x6a) [0x7f74ff8a6c8a]
11  /datasets/pentschev/miniconda3/envs/ucx-1.10-110-0.18.210113/lib/python3.8/site-packages/ucp/_libs/ucx_api.cpython-38-x86_64-linux-gnu.so(+0x17049) [0x7f74ff351049]
12  /datasets/pentschev/miniconda3/envs/ucx-1.10-110-0.18.210113/bin/python(+0x182df8) [0x563131303df8]
13  /datasets/pentschev/miniconda3/envs/ucx-1.10-110-0.18.210113/bin/python(PyObject_Call+0x5e) [0x5631312936be]
14  /datasets/pentschev/miniconda3/envs/ucx-1.10-110-0.18.210113/bin/python(_PyEval_EvalFrameDefault+0x21ba) [0x56313134486a]
15  /datasets/pentschev/miniconda3/envs/ucx-1.10-110-0.18.210113/bin/python(_PyEval_EvalCodeWithName+0x2c3) [0x56313132b053]
16  /datasets/pentschev/miniconda3/envs/ucx-1.10-110-0.18.210113/bin/python(_PyFunction_Vectorcall+0x378) [0x56313132c428]
17  /datasets/pentschev/miniconda3/envs/ucx-1.10-110-0.18.210113/bin/python(_PyEval_EvalFrameDefault+0xa4b) [0x5631313430fb]
18  /datasets/pentschev/miniconda3/envs/ucx-1.10-110-0.18.210113/bin/python(+0x172073) [0x5631312f3073]
19  /datasets/pentschev/miniconda3/envs/ucx-1.10-110-0.18.210113/lib/python3.8/lib-dynload/_asyncio.cpython-38-x86_64-linux-gnu.so(+0xb846) [0x7f75d7177846]
20  /datasets/pentschev/miniconda3/envs/ucx-1.10-110-0.18.210113/bin/python(_PyObject_MakeTpCall+0x31e) [0x5631312a4ade]
21  /datasets/pentschev/miniconda3/envs/ucx-1.10-110-0.18.210113/bin/python(+0x20f3f7) [0x5631313903f7]
22  /datasets/pentschev/miniconda3/envs/ucx-1.10-110-0.18.210113/bin/python(+0x115662) [0x563131296662]
23  /datasets/pentschev/miniconda3/envs/ucx-1.10-110-0.18.210113/bin/python(PyVectorcall_Call+0x6e) [0x5631312a13ce]
24  /datasets/pentschev/miniconda3/envs/ucx-1.10-110-0.18.210113/bin/python(_PyEval_EvalFrameDefault+0x5c0b) [0x5631313482bb]
25  /datasets/pentschev/miniconda3/envs/ucx-1.10-110-0.18.210113/bin/python(_PyFunction_Vectorcall+0x1a6) [0x56313132c256]
26  /datasets/pentschev/miniconda3/envs/ucx-1.10-110-0.18.210113/bin/python(_PyEval_EvalFrameDefault+0xa4b) [0x5631313430fb]
27  /datasets/pentschev/miniconda3/envs/ucx-1.10-110-0.18.210113/bin/python(_PyFunction_Vectorcall+0x1a6) [0x56313132c256]
28  /datasets/pentschev/miniconda3/envs/ucx-1.10-110-0.18.210113/bin/python(_PyEval_EvalFrameDefault+0xa4b) [0x5631313430fb]
29  /datasets/pentschev/miniconda3/envs/ucx-1.10-110-0.18.210113/bin/python(_PyFunction_Vectorcall+0x1a6) [0x56313132c256]
30  /datasets/pentschev/miniconda3/envs/ucx-1.10-110-0.18.210113/bin/python(_PyEval_EvalFrameDefault+0xa4b) [0x5631313430fb]
31  /datasets/pentschev/miniconda3/envs/ucx-1.10-110-0.18.210113/bin/python(_PyFunction_Vectorcall+0x1a6) [0x56313132c256]
32  /datasets/pentschev/miniconda3/envs/ucx-1.10-110-0.18.210113/bin/python(_PyEval_EvalFrameDefault+0xa4b) [0x5631313430fb]
33  /datasets/pentschev/miniconda3/envs/ucx-1.10-110-0.18.210113/bin/python(_PyEval_EvalCodeWithName+0x2c3) [0x56313132b053]
34  /datasets/pentschev/miniconda3/envs/ucx-1.10-110-0.18.210113/bin/python(_PyFunction_Vectorcall+0x378) [0x56313132c428]
35  /datasets/pentschev/miniconda3/envs/ucx-1.10-110-0.18.210113/bin/python(_PyEval_EvalFrameDefault+0xa4b) [0x5631313430fb]
36  /datasets/pentschev/miniconda3/envs/ucx-1.10-110-0.18.210113/bin/python(_PyEval_EvalCodeWithName+0x2c3) [0x56313132b053]
37  /datasets/pentschev/miniconda3/envs/ucx-1.10-110-0.18.210113/bin/python(_PyFunction_Vectorcall+0x378) [0x56313132c428]
38  /datasets/pentschev/miniconda3/envs/ucx-1.10-110-0.18.210113/bin/python(+0x1ac097) [0x56313132d097]
39  /datasets/pentschev/miniconda3/envs/ucx-1.10-110-0.18.210113/bin/python(PyObject_Call+0x5e) [0x5631312936be]
40  /datasets/pentschev/miniconda3/envs/ucx-1.10-110-0.18.210113/bin/python(_PyEval_EvalFrameDefault+0x21ba) [0x56313134486a]
41  /datasets/pentschev/miniconda3/envs/ucx-1.10-110-0.18.210113/bin/python(_PyFunction_Vectorcall+0x1a6) [0x56313132c256]
42  /datasets/pentschev/miniconda3/envs/ucx-1.10-110-0.18.210113/bin/python(+0x1ac097) [0x56313132d097]
43  /datasets/pentschev/miniconda3/envs/ucx-1.10-110-0.18.210113/bin/python(PyObject_Call+0x5e) [0x5631312936be]
44  /datasets/pentschev/miniconda3/envs/ucx-1.10-110-0.18.210113/bin/python(_PyEval_EvalFrameDefault+0x21ba) [0x56313134486a]
45  /datasets/pentschev/miniconda3/envs/ucx-1.10-110-0.18.210113/bin/python(_PyFunction_Vectorcall+0x1a6) [0x56313132c256]
46  /datasets/pentschev/miniconda3/envs/ucx-1.10-110-0.18.210113/bin/python(_PyEval_EvalFrameDefault+0xa4b) [0x5631313430fb]
47  /datasets/pentschev/miniconda3/envs/ucx-1.10-110-0.18.210113/bin/python(_PyEval_EvalCodeWithName+0x2c3) [0x56313132b053]
48  /datasets/pentschev/miniconda3/envs/ucx-1.10-110-0.18.210113/bin/python(_PyFunction_Vectorcall+0x378) [0x56313132c428]
49  /datasets/pentschev/miniconda3/envs/ucx-1.10-110-0.18.210113/bin/python(_PyEval_EvalFrameDefault+0xa4b) [0x5631313430fb]
50  /datasets/pentschev/miniconda3/envs/ucx-1.10-110-0.18.210113/bin/python(_PyFunction_Vectorcall+0x1a6) [0x56313132c256]
51  /datasets/pentschev/miniconda3/envs/ucx-1.10-110-0.18.210113/bin/python(_PyEval_EvalFrameDefault+0x92f) [0x563131342fdf]
52  /datasets/pentschev/miniconda3/envs/ucx-1.10-110-0.18.210113/bin/python(_PyEval_EvalCodeWithName+0x2c3) [0x56313132b053]
53  /datasets/pentschev/miniconda3/envs/ucx-1.10-110-0.18.210113/bin/python(_PyFunction_Vectorcall+0x378) [0x56313132c428]
54  /datasets/pentschev/miniconda3/envs/ucx-1.10-110-0.18.210113/bin/python(_PyEval_EvalFrameDefault+0x178c) [0x563131343e3c]
55  /datasets/pentschev/miniconda3/envs/ucx-1.10-110-0.18.210113/bin/python(_PyEval_EvalCodeWithName+0x2c3) [0x56313132b053]
56  /datasets/pentschev/miniconda3/envs/ucx-1.10-110-0.18.210113/bin/python(PyEval_EvalCodeEx+0x39) [0x56313132c0a9]
57  /datasets/pentschev/miniconda3/envs/ucx-1.10-110-0.18.210113/bin/python(PyEval_EvalCode+0x1b) [0x5631313cd13b]
58  /datasets/pentschev/miniconda3/envs/ucx-1.10-110-0.18.210113/bin/python(+0x24c1d3) [0x5631313cd1d3]
59  /datasets/pentschev/miniconda3/envs/ucx-1.10-110-0.18.210113/bin/python(+0x26b983) [0x5631313ec983]
60  /datasets/pentschev/miniconda3/envs/ucx-1.10-110-0.18.210113/bin/python(PyRun_StringFlags+0x7d) [0x5631313f16ad]
61  /datasets/pentschev/miniconda3/envs/ucx-1.10-110-0.18.210113/bin/python(PyRun_SimpleStringFlags+0x3d) [0x5631313f170d]
=================================
e.c:227  UCX  DEBUG sockcm_iface 0x5638801ff560: accepted connection from 127.0.0.1:56348 at fd 90 Resource temporarily unavailable

After discussing offline with some UCX devs, I've been told that starting with UCX 1.10 we should move to the new tcp_sockcm. That involves changing the variables we use today, specifically removing sockcm, so that the base set of variables becomes: UCX_TLS=tcp,cuda_copy UCX_SOCKADDR_TLS_PRIORITY=tcp UCX_SOCKADDR_CM_ENABLE=y. We still need cuda_ipc and rc to enable NVLink and IB, respectively.
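As a rough sketch of the variable set described above, the helper below builds the environment for the new tcp_sockcm setup. The function name and its nvlink/infiniband flags are hypothetical, and appending cuda_ipc and rc to UCX_TLS in this order is an assumption for illustration, not something the UCX devs prescribed.

```python
def ucx_env_for_tcp_sockcm(nvlink=False, infiniband=False):
    """Build the UCX environment variables for the new tcp_sockcm (UCX 1.10+).

    Hypothetical helper: the base values come from the discussion above,
    the optional transports are appended only when requested.
    """
    # Base transports: note that sockcm is no longer listed.
    tls = ["tcp", "cuda_copy"]
    if nvlink:
        tls.append("cuda_ipc")  # needed for NVLink
    if infiniband:
        tls.append("rc")  # needed for InfiniBand
    return {
        "UCX_TLS": ",".join(tls),
        "UCX_SOCKADDR_TLS_PRIORITY": "tcp",
        "UCX_SOCKADDR_CM_ENABLE": "y",
    }


if __name__ == "__main__":
    for name, value in ucx_env_for_tcp_sockcm(nvlink=True).items():
        print(f"{name}={value}")
```

The returned dictionary could be merged into os.environ before UCX is initialized, since UCX reads these variables at context creation.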

Moving to the new tcp_sockcm, we still see issues though, particularly:

  1. The loopback interface isn't supported, which causes a segfault. This matters in some very common Dask cases: connecting a Client to a LocalCUDACluster appears to use loopback unless we pass host to the cluster to prevent it from binding there, and requiring that would break a lot of user code today.
  2. I'm still seeing segfaults when endpoints disconnect, but only with multiple workers, and not for any consistent set of workers. For instance, I see segfaults with CUDA devices 0,1,2,3, but not with 0,1,2,4. This still seems like a bug in UCX.
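For the loopback limitation in item 1, one possible workaround is to pick a non-loopback address and pass it as host. The sketch below uses only the standard library; both helper names are hypothetical, and the UDP "connect" trick is just one common way to discover an outbound address, which may not pick the interface you want on multi-homed machines.

```python
import ipaddress
import socket


def is_loopback(host):
    """Return True if the given IP address string is a loopback address."""
    return ipaddress.ip_address(host).is_loopback


def pick_non_loopback_host():
    """Best-effort guess of a non-loopback IPv4 address, or None.

    Connecting a UDP socket sends no packets; it only makes the OS choose
    the local address it would use to reach the given destination.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.connect(("10.255.255.255", 1))
        addr = sock.getsockname()[0]
    except OSError:
        return None  # no usable route, e.g. no network configured
    finally:
        sock.close()
    return None if is_loopback(addr) else addr


if __name__ == "__main__":
    # The result would then be passed as host to LocalCUDACluster,
    # as described in item 1 above.
    print(pick_non_loopback_host())
```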

With all the above said, using UCX 1.10 is not viable for UCX-Py at the time of writing. I'm working with @dmitrygx and @alinask to check whether these issues and limitations can be solved. In the meantime, the UCX 1.10 conda package is breaking our nightly build users, who have to pin ucx=1.8 or they will experience segfaults. We can still create 1.9 packages, but that would require us to either delete the current UCX package from Anaconda or pin 1.9 in our metapackages. Any ideas or preferences on which path we should follow, @quasiben @jakirkham?

@beckernick
Member

cc @randerzander for visibility

@jakirkham
Member

Just to update this thread, we have both pulled the problematic ucx 1.10.0rc1 packages and started publishing 1.9.0 packages. Hopefully this addresses the issue for now. We are discussing with the UCX team how to prioritize TCP support so that we can address this in future ucx versions.

@pentschev
Member Author

We need the fixes from openucx/ucx#6001 and openucx/ucx#6157 to be backported to 1.10 for UCX-Py; discussion is ongoing. We will also need TCP loopback support in the new transport for 1.10, which UCX devs are also looking into.

Apart from that, we will need to adjust transports in https://github.com/dask/distributed/blob/9442d9b3f2847bf6d0252a8ed671d342a5379501/distributed/comm/ucx.py#L479-L480 , which I'll do once there's a new 1.10 RC with all the patches we need. #667 won't be necessary, and I'll close it now.

@jakirkham
Member

Thanks for the update Peter and keeping track of all of these threads! 😀

@pentschev
Member Author

This isn't relevant anymore. The resolution is that we should avoid UCX 1.10, but UCX 1.11 onwards will be fully supported. Closing.
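A minimal sketch of how that resolution could be turned into a guard, assuming the UCX version is available as a (major, minor, patch) tuple. The helper name is hypothetical, and treating pre-1.10 releases as acceptable is an assumption based on the 1.8 and 1.9 packages mentioned earlier in this thread.

```python
def is_supported_ucx(version):
    """Return False only for the UCX 1.10 series, which should be avoided.

    `version` is a (major, minor, patch) tuple. Hypothetical helper:
    1.11 onwards is supported per the resolution above; earlier releases
    are assumed to keep working as before.
    """
    major, minor = version[0], version[1]
    return (major, minor) != (1, 10)


if __name__ == "__main__":
    for candidate in [(1, 9, 0), (1, 10, 0), (1, 11, 0)]:
        status = "ok" if is_supported_ucx(candidate) else "avoid"
        print(candidate, status)
```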
