Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UCP/CORE/RNDV/GTEST: Handle status from AM/TAG RNDV RTS/data correctly #6163

Merged
merged 1 commit into from
Jan 27, 2021

Conversation

dmitrygx
Copy link
Member

@dmitrygx dmitrygx commented Jan 20, 2021

What

Handle status from AM/TAG RNDV RTS/data correctly in UCP progress functions.

Why ?

If AM/TAG RNDV RTS/data sending failed in a progress function (i.e. the status is neither UCS_OK nor UCS_ERR_NO_RESOURCE), a UCP request has to be completed with the status, but UCS_OK should be returned from a function to satisfy ucp_request_try_send() expectations that UCS_OK/UCS_INPROGRESS/UCS_ERR_NO_RESOURCE statuses could be returned from progress functions.

How ?

  1. Introduced ucp_rndv_rts_handle_status_from_pending() to handle status returned from RTS send progress function.
  2. Used the new function to handle status returned for sending AM/TAG RNDV RTS packet.
  3. Fixed possible failure during sending RNDV data - used ucp_am_bcopy_handle_status_from_pending() w/o completing a request.
  4. Fixed ucp_proto_progress_am_single().
  5. Updated GTEST/UCP test_ucp_sockaddr_protocols_err to reproduce the bug fixed in this PR:
  • removed closing UCP EP for a sender/receiver in error handling callback.
  • added sending of a message using sender/receiver's EP after error handling callback was called to test that sending after error detection works fine.
  1. Updated GTEST/UCP test_ucp_sockaddr_protocols_err to reproduce the bug fixed in UCP/PROTO: Handle AM short failure correctly #6157 PR:
  • added the new test to check TAG/Eager for 32 bytes.
  • added the GPU copying TLs to the test in order to always allocate a request instead doing of a TAG inline send (that's done when no CUDA and RoCM MDs).

@dmitrygx dmitrygx force-pushed the topic/ucp/rndv_rts branch 3 times, most recently from 4eef417 to 44e567b Compare January 21, 2021 20:14
@dmitrygx dmitrygx changed the title UCP/CORE/RNDV/GTEST: Handle status from AM/TAG RNDV RTS correctly UCP/CORE/RNDV/GTEST: Handle status from AM/TAG RNDV RTS/data correctly Jan 21, 2021
} else if (status == UCP_STATUS_PENDING_SWITCH) {
status = UCS_OK;

ret_status = ucp_am_bcopy_handle_status_from_pending(self, !single, 0, 0,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the only place, where no_complete is passed to ucp_am_bcopy_handle_status_from_pending.
I'd remove changes in ucp_am_bcopy_handle_status_from_pending and rewrite code here without it

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -1137,6 +1134,16 @@ class test_ucp_sockaddr_protocols : public test_ucp_sockaddr {
ucp_request_release(sreq);
}

void extra_send_before_disconenct(entity &e, const std::string &send_buf,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

disconnect

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Comment on lines 1646 to 1652
"RNDV_THRESH=0", "RNDV_SCHEME=get_zcopy")
{
test_tag_send_recv(64 * UCS_KBYTE, false, false);
}

UCS_TEST_P(test_ucp_sockaddr_protocols_err, tag_rndv_unexp_get_scheme,
"RNDV_THRESH=0", "RNDV_SCHEME=get_zcopy")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the difference in these tests

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when we are asking for GET_ZCOPY, but transport doesn't support it (e.g. TCP), it fallbacks to RNDV AM
the idea is to cover as much as possible cases

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these two tests are identical (this one and the previous one)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@dmitrygx
Copy link
Member Author

bot:pipe:retest

@dmitrygx
Copy link
Member Author

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@dmitrygx
Copy link
Member Author

bot:pipe:retest

@yosefe
Copy link
Contributor

yosefe commented Jan 24, 2021

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

src/ucp/proto/proto_am.inl Outdated Show resolved Hide resolved
src/ucp/rndv/rndv.c Outdated Show resolved Hide resolved
UCP_INSTANTIATE_TEST_CASE_TLS(_test_case, rcx, "rc_x") \
UCP_INSTANTIATE_TEST_CASE_TLS(_test_case, ib, "ib") \
UCP_INSTANTIATE_TEST_CASE_TLS(_test_case, tcp, "tcp") \
UCP_INSTANTIATE_TEST_CASE_TLS(_test_case, dcudx, "dc_x,ud," UCP_TEST_GPU_COPY_TLS) \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we reuse UCP_INSTANTIATE_TEST_CASE_GPU_AWARE?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unfortunately, they have different sets of TLs instantiated

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible to unite some common parts?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yosefe is it ok to have UCP_INSTANTIATE_TEST_CASE_TLS_GPU_AWARE macro to instantiate a test case with GPU support?
I didn't find a better way to reuse UCP_INSTANTIATE_TEST_CASE_GPU_AWARE macro for CM instantiation macro.

test/gtest/ucp/test_ucp_sockaddr.cc Show resolved Hide resolved
@dmitrygx
Copy link
Member Author

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

src/ucp/rndv/rndv.h Outdated Show resolved Hide resolved
@dmitrygx
Copy link
Member Author

@yosefe could you review pls?

@dmitrygx
Copy link
Member Author

@yosefe ok to merge?

@brminich brminich merged commit 99f60ee into openucx:master Jan 27, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants