Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TCP/TEST: Fix simultaneous ep close with ucp_hello_world #7224

Merged

Conversation

yosefe
Copy link
Contributor

@yosefe yosefe commented Aug 11, 2021

Why

As a result of #7140, ucp_hello_world fails in docker when it uses TCP transport, like this:

2021-08-10T19:42:56.9329573Z + taskset -c 0 ./examples/ucp_hello_world -f -m host -p 16001
2021-08-10T19:42:56.9651931Z INFO: UCP_HELLO_WORLD mode = 2 server = (null) port = 16001, pid = 20460
2021-08-10T19:43:11.9295628Z + '[' 0 -eq 1 ']'
2021-08-10T19:43:11.9304753Z + client_pid=20476
2021-08-10T19:43:11.9305794Z + wait 20476
2021-08-10T19:43:11.9307301Z + taskset -c 1 ./examples/ucp_hello_world -f -m host -n 0231793f6e3f -p 16001
...
2021-08-10T19:43:12.0205215Z [0231793f6e3f:20476:0:20476]      ucp_ep.c:2822 Assertion `ucp_ep_config(ep)->key.err_mode != UCP_ERR_HANDLING_MODE_NONE' failed
2021-08-10T19:43:12.0206896Z ==== backtrace (tid:  20476) ====
2021-08-10T19:43:12.0208701Z  0  /__w/1/s/build-test/src/ucs/.libs/libucs.so.0(ucs_handle_error+0xfc) [0x7f0a5481f44c]
2021-08-10T19:43:12.0210714Z  1  /__w/1/s/build-test/src/ucs/.libs/libucs.so.0(ucs_fatal_error_message+0x58) [0x7f0a5481c228]
2021-08-10T19:43:12.0212782Z  2  /__w/1/s/build-test/src/ucs/.libs/libucs.so.0(ucs_fatal_error_format+0xd1) [0x7f0a5481c3b1]
2021-08-10T19:43:12.0214903Z  3  /__w/1/s/build-test/src/ucp/.libs/libucp.so.0(ucp_ep_invoke_err_cb+0x1cb) [0x7f0a5450b36b]
2021-08-10T19:43:12.0216926Z  4  /__w/1/s/build-test/src/ucp/.libs/libucp.so.0(ucp_ep_set_failed+0x1f2) [0x7f0a54510102]
2021-08-10T19:43:12.0218867Z  5  /__w/1/s/build-test/src/ucp/.libs/libucp.so.0(+0x4ec2a) [0x7f0a54525c2a]
2021-08-10T19:43:12.0221995Z  6  /__w/1/s/build-test/src/uct/.libs/libuct.so.0(uct_tcp_ep_set_failed+0x61) [0x7f0a542b3cc1]
2021-08-10T19:43:12.0224157Z  7  /__w/1/s/build-test/src/uct/.libs/libuct.so.0(+0x256bf) [0x7f0a542b66bf]
2021-08-10T19:43:12.0225985Z  8  /__w/1/s/build-test/src/uct/.libs/libuct.so.0(+0x29e84) [0x7f0a542bae84]
2021-08-10T19:43:12.0227836Z  9  /__w/1/s/build-test/src/ucs/.libs/libucs.so.0(ucs_event_set_wait+0xf1) [0x7f0a5482ccf1]
2021-08-10T19:43:12.0229713Z 10  /__w/1/s/build-test/src/uct/.libs/libuct.so.0(uct_tcp_iface_progress+0x74) [0x7f0a542bad94]
2021-08-10T19:43:12.0231969Z 11  /__w/1/s/build-test/src/ucp/.libs/libucp.so.0(ucp_worker_progress+0x6a) [0x7f0a54523bca]
2021-08-10T19:43:12.0233840Z 12  /__w/1/s/build-test/examples/.libs/lt-ucp_hello_world() [0x401ed2]
2021-08-10T19:43:12.0235177Z 13  /lib64/libc.so.6(__libc_start_main+0xf5) [0x7f0a53190555]
2021-08-10T19:43:12.0236900Z 14  /__w/1/s/build-test/examples/.libs/lt-ucp_hello_world() [0x402746]
2021-08-10T19:43:12.0238169Z =================================
2021-08-10T19:43:12.0239342Z [0231793f6e3f:20476:0:20476] Process frozen..

(https://dev.azure.com/ucfconsort/0b36e3f0-8ab9-4a48-b68b-4b2350e02c88/_apis/build/builds/22781/logs/389)

How

  • UCP: Don't try to invoke error callback if endpoint closed but not flushed yet. If user started to close endpoint, callbacks should not be called on this endpoint anymore. This change avoids the assert fail above.
  • TCP: Flush outstanding put operations in case of error. Without this change, endpoint close (flush stage) does not finish.
  • Test: run ucp_hello_world with TCP

@yosefe yosefe force-pushed the topic/tcp-test-fix-simultaneous-ep-close-with branch 2 times, most recently from 4e06533 to 6956fc9 Compare August 11, 2021 22:33
@yosefe
Copy link
Contributor Author

yosefe commented Aug 11, 2021

@dmitrygx @evgeny-leksikov can you pls take a look?

@yosefe yosefe force-pushed the topic/tcp-test-fix-simultaneous-ep-close-with branch from 6956fc9 to 7ff8a89 Compare August 11, 2021 23:59
examples/ucp_hello_world.c Show resolved Hide resolved
@yosefe
Copy link
Contributor Author

yosefe commented Aug 13, 2021

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 2 pipeline(s).

@yosefe yosefe force-pushed the topic/tcp-test-fix-simultaneous-ep-close-with branch from 357a650 to 68de674 Compare August 13, 2021 11:56
@yosefe
Copy link
Contributor Author

yosefe commented Aug 15, 2021

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 2 pipeline(s).

@yosefe
Copy link
Contributor Author

yosefe commented Aug 16, 2021

@hoopoepg can you pls review latest commit ? i saw you added this line but maybe there was another intention i missed

@yosefe yosefe force-pushed the topic/tcp-test-fix-simultaneous-ep-close-with branch from 2ffd517 to 58407ae Compare August 16, 2021 21:05
@yosefe
Copy link
Contributor Author

yosefe commented Aug 17, 2021

bot:pipe:retest

@yosefe yosefe force-pushed the topic/tcp-test-fix-simultaneous-ep-close-with branch from 9d9d0ea to 097e250 Compare August 17, 2021 07:50
Fixes an assertion failure when running ucp_hello_world over TCP
transport.

UCP: When both sides close their TCP endpoints, one side can receive
connection reset event while it's trying to close-flush its endpoint. We
should not try to invoke user error callback in such case. Instead, the
close operation should complete with status CONNECTION_RESET.

UCT/TCP: Need to purge outstanding PUT operations when getting an error.

Test: Run ucp_hello_world over several transports. Currently it used
TCP only when ran inside a docker, so issue was not detected.
Fix barrier in hello_world test to prevent failures.
@yosefe yosefe force-pushed the topic/tcp-test-fix-simultaneous-ep-close-with branch from 097e250 to f7a4757 Compare August 17, 2021 15:50
@yosefe yosefe merged commit 70c347e into openucx:master Aug 17, 2021
@yosefe yosefe deleted the topic/tcp-test-fix-simultaneous-ep-close-with branch August 17, 2021 21:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants