UCP/PROTO: Handle AM short failure correctly [v1.10.x] #6164

Merged
merged 1 commit into from
Jan 22, 2021


dmitrygx
Member

@dmitrygx dmitrygx commented Jan 20, 2021

What

Handle UCT AM short failure correctly in UCP progress functions.

Why?

If AM short fails in a progress function (i.e. the status is neither UCS_OK nor UCS_ERR_NO_RESOURCE), the UCP request has to be completed with that status, but the progress function itself should return UCS_OK. This satisfies ucp_request_try_send(), which expects progress functions to return only UCS_OK, UCS_INPROGRESS, or UCS_ERR_NO_RESOURCE.

How?

  1. Introduce a common ucp_am_short_handle_status_from_pending() function for AM Short, to be used in all TAG/AM/STREAM functions, as we already have for AM Bcopy.
  2. Use the new function to handle the status returned by uct_ep_am_short() in the TAG/AM/STREAM short progress functions.
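
The status mapping described above can be sketched in isolation. This is a minimal, self-contained illustration and not the code from this PR: the `sketch_*` types, enum values, and helper names are hypothetical stand-ins for the UCX request and status types, and the signature of the real ucp_am_short_handle_status_from_pending() is assumed rather than taken from the source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for UCX status codes (not the real ucs_status_t). */
typedef enum {
    SKETCH_OK              =  0,
    SKETCH_INPROGRESS      =  1,
    SKETCH_ERR_NO_RESOURCE = -2,
    SKETCH_ERR_IO          = -3  /* represents any other failure */
} sketch_status_t;

/* Hypothetical stand-in for a UCP send request. */
typedef struct {
    int             completed; /* nonzero once the request has been completed */
    sketch_status_t status;    /* status reported to the user on completion */
} sketch_request_t;

/* Complete the request and record the status the user will see. */
static void sketch_request_complete(sketch_request_t *req,
                                    sketch_status_t status)
{
    req->completed = 1;
    req->status    = status;
}

/*
 * Translate the status of a short send into the value a progress function
 * returns (assumed behavior, based on the PR description):
 * - no resources: leave the request pending and propagate the status,
 *   so the caller can retry later;
 * - anything else (success or a real failure): complete the request with
 *   that status and return OK, so the caller never sees an unexpected
 *   error code from the progress function.
 */
static sketch_status_t
sketch_am_short_handle_status_from_pending(sketch_request_t *req,
                                           sketch_status_t status)
{
    assert(req != NULL);

    if (status == SKETCH_ERR_NO_RESOURCE) {
        return SKETCH_ERR_NO_RESOURCE;
    }

    sketch_request_complete(req, status);
    return SKETCH_OK;
}
```

In this sketch a failed send still returns `SKETCH_OK` to the caller, and the failure is visible only through the completed request's `status` field, which mirrors the "complete with the status, return UCS_OK" rule stated in the description.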

Backport of PR #6157.

@brminich
Contributor

Looks like PR description is taken from wrong PR

@dmitrygx
Member Author

> Looks like PR description is taken from wrong PR

@brminich thanks!
fixed

@dmitrygx
Member Author

unrelated failure on r-vmb-ppc-jenkins node:

shm_ib/test_ucp_sockaddr_destroy_ep_on_err.onesided_client_sforce/4

http://hpc-master.lab.mtl.com:8080/blue/organizations/jenkins/ucx/detail/ucx/9292/pipeline/568

bot:pipe:retest

@brminich
Contributor

can error be relevant?

2021-01-20T20:44:57.0460335Z [ RUN      ] rc_mlx5/uct_flush_test.am_flush_ep_no_comp/0 <rc_mlx5/mlx5_1:1>
2021-01-20T20:44:57.0580903Z [     INFO ] Testing component: ib
2021-01-20T20:44:57.1214206Z [swx-rdmz-ucx-arm-hwi:14347:0:14347]       rc_ep.c:506  Assertion `!(ep->flags & UCT_RC_EP_FLAG_FLUSH_CANCEL)' failed
2021-01-20T20:44:57.1230226Z ==== backtrace (tid:  14347) ====
2021-01-20T20:44:57.1232260Z  0  /scrap/azure/agent-06/AZP_WORKSPACE/1/s/build-test/src/ucs/.libs/libucs.so.0(ucs_fatal_error_message+0xa0) [0xffff91b14fe8]
2021-01-20T20:44:57.1234277Z  1  /scrap/azure/agent-06/AZP_WORKSPACE/1/s/build-test/src/ucs/.libs/libucs.so.0(ucs_fatal_error_format+0xbc) [0xffff91b150a8]
2021-01-20T20:44:57.1236817Z  2  /scrap/azure/agent-06/AZP_WORKSPACE/1/s/build-test/src/uct/ib/.libs/libuct_ib.so.0(uct_rc_ep_check+0) [0xffff916baf80]
2021-01-20T20:44:57.1238718Z  3  /scrap/azure/agent-06/AZP_WORKSPACE/1/s/build-test/src/uct/ib/.libs/libuct_ib.so.0(uct_rc_mlx5_ep_flush+0x44) [0xffff916cfe60]
2021-01-20T20:44:57.1240323Z  4  /scrap/azure/agent-06/AZP_WORKSPACE/1/s/build-test/test/gtest/gtest() [0x61717c]
2021-01-20T20:44:57.1241747Z  5  /scrap/azure/agent-06/AZP_WORKSPACE/1/s/build-test/test/gtest/gtest() [0x6183f4]
2021-01-20T20:44:57.1243182Z  6  /scrap/azure/agent-06/AZP_WORKSPACE/1/s/build-test/test/gtest/gtest() [0x5bc58c]
2021-01-20T20:44:57.1244630Z  7  /scrap/azure/agent-06/AZP_WORKSPACE/1/s/build-test/test/gtest/gtest() [0x59fd84]
2021-01-20T20:44:57.1246050Z  8  /scrap/azure/agent-06/AZP_WORKSPACE/1/s/build-test/test/gtest/gtest() [0x59d4d0]
2021-01-20T20:44:57.1247495Z  9  /scrap/azure/agent-06/AZP_WORKSPACE/1/s/build-test/test/gtest/gtest() [0x59d1dc]
2021-01-20T20:44:57.1248982Z 10  /scrap/azure/agent-06/AZP_WORKSPACE/1/s/build-test/test/gtest/gtest() [0x59cf28]
2021-01-20T20:44:57.1250416Z 11  /scrap/azure/agent-06/AZP_WORKSPACE/1/s/build-test/test/gtest/gtest() [0x58cdd0]
2021-01-20T20:44:57.1251823Z 12  /scrap/azure/agent-06/AZP_WORKSPACE/1/s/build-test/test/gtest/gtest() [0x5a0bf4]
2021-01-20T20:44:57.1253580Z 13  /scrap/azure/agent-06/AZP_WORKSPACE/1/s/build-test/test/gtest/gtest() [0x58c8cc]
2021-01-20T20:44:57.1255080Z 14  /scrap/azure/agent-06/AZP_WORKSPACE/1/s/build-test/test/gtest/gtest() [0x5a8e14]
2021-01-20T20:44:57.1256476Z 15  /lib/aarch64-linux-gnu/libc.so.6(__libc_start_main+0xe0) [0xffff913386e0]
2021-01-20T20:44:57.1257858Z 16  /scrap/azure/agent-06/AZP_WORKSPACE/1/s/build-test/test/gtest/gtest() [0x57a0f4]
2021-01-20T20:44:57.1258712Z =================================
2021-01-20T20:44:57.1259883Z [swx-rdmz-ucx-arm-hwi:14347:0:14347] Process frozen...

@dmitrygx
Member Author

> can error be relevant?

No, this is a UCT test.
The changes in this PR are in UCP only.

@yosefe
Contributor

yosefe commented Jan 21, 2021

@brminich @dmitrygx i think we are missing some fix in v1.10. where this assertion was replaced by ignoring?

@dmitrygx
Member Author

> @brminich @dmitrygx i think we are missing some fix in v1.10. where this assertion was replaced by ignoring?

yes, good catch. did you mean #6055?
should we port it?

@yosefe
Contributor

yosefe commented Jan 21, 2021

> yes, good catch. did you mean #6055?
> should we port it?

Let's port, to make v1.10 pass tests

@yosefe yosefe merged commit 0427865 into openucx:v1.10.x Jan 22, 2021