TCP transport may cause Connection reset by peer when closing shortly after ucp_tag_send_nb #6922

Closed
pentschev opened this issue Jun 9, 2021 · 4 comments

@pentschev
Contributor

pentschev commented Jun 9, 2021

Describe the bug

Calling ucp_ep_close_nb with UCP_EP_CLOSE_MODE_FORCE immediately after ucp_tag_send_nb may cause a "Connection reset by remote peer" error on the remote end when the TCP transport is used.

Steps to Reproduce

This can be reproduced with the UCX-Py send-recv benchmark:

$ UCX_TLS=tcp python benchmarks/send-recv.py --n-bytes 1GB --n-iter 3
Server Running at 10.33.225.165:43799
Client connecting to server at 10.33.225.165:43799
Process SpawnProcess-2:
Traceback (most recent call last):
  File "/datasets/pentschev/miniconda3/envs/ucx-master-112-21.08.210608/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/datasets/pentschev/miniconda3/envs/ucx-master-112-21.08.210608/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/datasets/pentschev/src/ucx-py/benchmarks/send-recv.py", line 205, in client
    loop.run_until_complete(run())
  File "/datasets/pentschev/miniconda3/envs/ucx-master-112-21.08.210608/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "/datasets/pentschev/src/ucx-py/benchmarks/send-recv.py", line 197, in run
    await ep.recv(msg_recv_list[i])
  File "/datasets/pentschev/miniconda3/envs/ucx-master-112-21.08.210608/lib/python3.8/site-packages/ucp/core.py", line 704, in recv
    ret = await comm.tag_recv(self._ep, buffer, nbytes, tag, name=log)
ucp.exceptions.UCXError: <[Recv #002] ep: 0x7f14900da000, tag: 0xa3397626c5a3c734, nbytes: 1000000000, type: <class 'numpy.ndarray'>>: Connection reset by remote peer
Traceback (most recent call last):
  File "benchmarks/send-recv.py", line 412, in <module>
    main()
  File "benchmarks/send-recv.py", line 404, in main
    assert not p2.exitcode
AssertionError

Setup and versions

Additional information (depending on the issue)

I discussed this issue offline with @dmitrygx, who confirmed this is likely an issue in the TCP transport closing protocol.
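
For readers unfamiliar with how a reset can surface at the socket level, here is a minimal plain-socket sketch. It is not UCX code and makes no claim about what the UCX TCP transport does internally. It assumes a Linux-style struct linger layout (two C ints) and a working loopback interface; the function name abortive_close_demo and its parameters are made up for illustration. The sender closes abortively right after sending, so the receiver, which is still waiting for more bytes, sees a connection reset instead of a clean end of stream.

```python
import socket
import struct


def abortive_close_demo(payload=b"x" * 1024, expected=4096):
    """Return (bytes_received, saw_reset) when the sender closes abortively."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)

    sender = socket.create_connection(listener.getsockname())
    receiver, _ = listener.accept()
    receiver.settimeout(5.0)  # avoid hanging if the platform behaves differently

    sender.sendall(payload)
    # Assumption: struct linger is two C ints (true on Linux).
    # l_onoff=1, l_linger=0 makes close() abort the connection with a RST.
    sender.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
    sender.close()

    received = 0
    saw_reset = False
    try:
        # The receiver expects more bytes than were ever sent.
        while received < expected:
            chunk = receiver.recv(expected - received)
            if not chunk:  # orderly shutdown (FIN), not expected here
                break
            received += len(chunk)
    except ConnectionResetError:
        saw_reset = True
    finally:
        receiver.close()
        listener.close()
    return received, saw_reset
```

Calling abortive_close_demo() returns how many bytes the receiver got before the error and whether a reset was observed. Whether the bytes sent before the reset are still readable is platform dependent, so the sketch only relies on the reset itself.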

@pentschev pentschev added the Bug label Jun 9, 2021
@dmitrygx dmitrygx self-assigned this Jun 9, 2021
@wangvsa

wangvsa commented Jun 27, 2021

Hi, I observed the same issue when running the example code ucp_client_server.c and changing close_nbx(...) to ucp_ep_close_nb(ep, UCP_EP_CLOSE_MODE_FLUSH).

Any temporary fix or hack that I can try?

pentschev added a commit to pentschev/distributed that referenced this issue Jul 21, 2021
@pentschev
Contributor Author

I also don't have a solution for this, but it would be great if it could be prioritized. Users have started to report issues with Dask in rapidsai/dask-cuda#677. As a workaround, I added a 100 ms sleep between ucp_tag_send_nb and ucp_ep_close_nb. It seems to be OK for that one case, but I imagine it may be problematic if the other endpoint takes longer than that to react and receive the last message sent.
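
The shape of that workaround can be sketched as follows. This is an illustration only, not the actual change made in the referenced commit: send_then_close, FakeEndpoint, and demo are hypothetical names, and the endpoint is assumed to be any object with awaitable send and close methods (UCX-Py endpoints look like a fit, but treat that as an assumption).

```python
import asyncio


async def send_then_close(ep, msg, delay=0.1):
    """Send msg, wait a fixed grace period, then close the endpoint."""
    await ep.send(msg)
    # Fixed delay (100 ms by default); too short if the receiver is slow.
    await asyncio.sleep(delay)
    await ep.close()


class FakeEndpoint:
    """Hypothetical stand-in endpoint that records the order of calls."""

    def __init__(self):
        self.calls = []

    async def send(self, msg):
        self.calls.append(("send", msg))

    async def close(self):
        self.calls.append(("close", None))


def demo():
    """Run the workaround against the stand-in and return the recorded calls."""
    ep = FakeEndpoint()
    asyncio.run(send_then_close(ep, b"last message", delay=0.1))
    return ep.calls
```

The delay is a guess about how long the receiver needs, which is why it can still fail when the remote side is slower than expected.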

@dmitrygx
Member

dmitrygx commented Aug 8, 2021

Fixed by #7140 (master) and #7188 (v1.11.x).

@dmitrygx dmitrygx closed this as completed Aug 8, 2021
@dmitrygx
Member

dmitrygx commented Aug 8, 2021

@wangvsa could you check whether #7140 (master) or #7188 (v1.11.x) fixes the issue you observed? Thanks!
