UCP/NBX: fixed external request free from CB #5998

Merged

hoopoepg
Contributor

@hoopoepg hoopoepg commented Dec 8, 2020

  • fixed crash in completion callback when the user tries to free an
    external request
  • added UCP_REQUEST_FLAG_EXTERNAL (replaced
    UCP_REQUEST_DEBUG_FLAG_EXTERNAL)
  • added gtest

fixes #5991

     (_req)->status = (_status); \
     if (ucs_likely((_req)->flags & UCP_REQUEST_FLAG_CALLBACK)) { \
         (_req)->_cb((_req) + 1, (_status), ## __VA_ARGS__); \
     } \
-    if (ucs_unlikely(((_req)->flags |= UCP_REQUEST_FLAG_COMPLETED) & \
+    if (ucs_unlikely(!external && \
+                     ((_req)->flags |= UCP_REQUEST_FLAG_COMPLETED) & \
Contributor

extra space before |=

Contributor Author

fixed

@@ -171,7 +171,8 @@ ucp_rma_nonblocking(ucp_ep_h ep, const void *buffer, size_t length,
{return UCS_STATUS_PTR(UCS_ERR_NO_MEMORY);});

 status = ucp_rma_request_init(req, ep, buffer, length, remote_addr, rkey,
-                              progress_cb, zcopy_thresh, 0);
+                              progress_cb, zcopy_thresh,
+                              req->flags & UCP_REQUEST_FLAG_EXTERNAL);
Contributor

why?
if it is already set, there is no need to reset it.

Contributor Author

ucp_request_get_param only selects how to obtain the request (allocate it or take it from params) and sets the EXTERNAL flag;
ucp_rma_request_init doesn't respect any flags already set in the request, which is fixed here

Contributor

why is it necessary to override the request? It looks weird.

Contributor

also, req->flags may be undefined for an internal request

Contributor Author

set it to 0

Contributor

req->flags is already initialized, why does it need to be set again?

Contributor Author

more flags are added there
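
The point argued in this thread can be shown in isolation: an init routine that assigns the flags field wholesale drops any bit set earlier, so the caller has to pass in the bit it wants kept. This is a minimal sketch only; `sketch_req_t`, `SKETCH_FLAG_*`, `sketch_req_init` and `sketch_send` are hypothetical stand-ins, not the UCX definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag bits; the real UCX values differ. */
enum {
    SKETCH_FLAG_EXTERNAL = 1u << 0, /* request memory is owned by the user */
    SKETCH_FLAG_SEND     = 1u << 1  /* set by the init routine itself      */
};

typedef struct {
    uint32_t flags;
    uint64_t length;
} sketch_req_t;

/* Init assigns flags instead of OR-ing them, so every field is defined
 * afterwards; any bit that must survive is passed in by the caller. */
static void sketch_req_init(sketch_req_t *req, uint64_t length,
                            uint32_t keep_flags)
{
    req->flags  = keep_flags | SKETCH_FLAG_SEND;
    req->length = length;
}

/* Caller-side pattern from the diff: forward only the EXTERNAL bit. */
static void sketch_send(sketch_req_t *req, uint64_t length)
{
    sketch_req_init(req, length, req->flags & SKETCH_FLAG_EXTERNAL);
}
```

With this shape the init routine stays the single place where flags are written, and the mask keeps unrelated stale bits from leaking through.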

@@ -292,7 +292,7 @@ UCS_PROFILE_FUNC(ucs_status_ptr_t, ucp_tag_send_nbx,
datatype, contig_length, param);
} else {
 ucp_tag_send_req_init(req, ep, buffer, datatype, memory_type, count,
-                      tag, 0);
+                      tag, req->flags & UCP_REQUEST_FLAG_EXTERNAL);
Contributor

why is a re-init needed?

Contributor Author

this is not a re-init; req_init fills in all values in the request

Contributor

req->flags is already initialized, why does it need to be set again? req_init can just do |=

Contributor Author

more flags are added there

@@ -64,11 +64,13 @@

#define ucp_request_complete(_req, _cb, _status, ...) \
{ \
uint32_t external = (_req)->flags & UCP_REQUEST_FLAG_EXTERNAL; \
Member

__external?

Contributor Author

renamed

};

const size_t test_ucp_tag_fallback::MSG_SIZE = 4 * 1024 * ucs_get_page_size();
const size_t test_ucp_tag_nbx::MSG_SIZE = 4 * 1024 * ucs_get_page_size();
Member

4 * UCS_KBYTE * ...

Contributor Author

fixed

@@ -94,6 +96,7 @@
} \
} else { \
__req = ((ucp_request_t*)(_param)->request) - 1; \
__req->flags |= UCP_REQUEST_FLAG_EXTERNAL; \
Member

align by =

Contributor Author

aligned

Contributor

need to clear this flag (or init the flags to 0) if it is an internal request

Contributor Author

ok, set to 0

Contributor

where? Also, why do you use |=? The user may not touch the UCX part of the request at all, so req->flags may contain garbage

Contributor Author

re-pushed
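
The concern about `|=` on memory the user never initialized can be illustrated with a small sketch. It assumes, as the `- 1` in the diff suggests, that the library header sits immediately before the pointer the user passes in. `sketch_hdr_t`, `SKETCH_FLAG_EXTERNAL` and the helper functions are hypothetical names, and plain assignment is shown as one possible remedy, not necessarily the change that was re-pushed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical flag bit; the real UCX value differs. */
enum { SKETCH_FLAG_EXTERNAL = 1u << 0 };

/* Hypothetical library-private header placed before the user part. */
typedef struct {
    uint32_t flags;
    int      status;
} sketch_hdr_t;

/* Map the user-visible pointer back to the header in front of it. */
static sketch_hdr_t *sketch_hdr_from_user(void *user_ptr)
{
    return (sketch_hdr_t*)user_ptr - 1;
}

/* Unsafe on user-provided memory: keeps whatever bits were there. */
static void sketch_mark_external_or(void *user_ptr)
{
    sketch_hdr_from_user(user_ptr)->flags |= SKETCH_FLAG_EXTERNAL;
}

/* Safe: flags are fully defined regardless of the prior contents. */
static void sketch_mark_external_assign(void *user_ptr)
{
    sketch_hdr_from_user(user_ptr)->flags = SKETCH_FLAG_EXTERNAL;
}
```

If the header bytes happen to hold stale data, the OR variant leaves stray bits that later flag checks may misread, while the assignment variant does not depend on what the user left in the buffer.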

#if UCS_ENABLE_ASSERT
UCP_REQUEST_FLAG_STREAM_RECV = UCS_BIT(18),
UCP_REQUEST_DEBUG_FLAG_EXTERNAL = UCS_BIT(19)
UCP_REQUEST_FLAG_STREAM_RECV = UCS_BIT(19),
#else
UCP_REQUEST_FLAG_STREAM_RECV = 0,
Member

no need for trailing ,

Contributor Author

fixed

#if UCS_ENABLE_ASSERT
UCP_REQUEST_FLAG_STREAM_RECV = UCS_BIT(18),
UCP_REQUEST_DEBUG_FLAG_EXTERNAL = UCS_BIT(19)
UCP_REQUEST_FLAG_STREAM_RECV = UCS_BIT(19),
Member

no need for trailing ,

Contributor Author

yep, fixed

@brminich
Contributor

brminich commented Dec 9, 2020

@dmitrygx, what is the case when ucp_request_imm_cmpl_param() is invoked, but req->status is not set?

@dmitrygx
Member

dmitrygx commented Dec 9, 2020

> @dmitrygx, what is the case when ucp_request_imm_cmpl_param() is invoked, but req->status is not set?

ucp_request_imm_cmpl_param(param, req, send);

and we set status before for recv:

req->status = status;

@brminich
Contributor

brminich commented Dec 9, 2020

> @dmitrygx, what is the case when ucp_request_imm_cmpl_param() is invoked, but req->status is not set?
>
> ucp_request_imm_cmpl_param(param, req, send);
>
> and we set status before for recv:
>
> req->status = status;

But where do we set UCP_REQUEST_FLAG_COMPLETED, which is required for calling ucp_request_imm_cmpl_param? Don't we set the status there?

@dmitrygx
Member

dmitrygx commented Dec 9, 2020

> But where do we set UCP_REQUEST_FLAG_COMPLETED, which is required for calling ucp_request_imm_cmpl_param? Don't we set the status there?

it seems it is set here, in ucp_request_complete(); does that mean the request status inside the callback is UCS_OK then?

@hoopoepg
Contributor Author

/azp run

@hoopoepg
Contributor Author

bot:pipe:retest

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@hoopoepg
Contributor Author

timed out

Comment on lines 67 to 68
uint32_t _external = ((_req)->flags |= UCP_REQUEST_FLAG_COMPLETED) & \
UCP_REQUEST_FLAG_EXTERNAL; \
Contributor Author

@dmitrygx as you wish

@hoopoepg hoopoepg force-pushed the topic/fixed-external-request-free-crash branch from f7e6def to addbfff Compare December 10, 2020 18:05
@hoopoepg
Contributor Author

completely re-implemented the fix; squashed because of the full re-implementation

@@ -64,12 +64,15 @@

#define ucp_request_complete(_req, _cb, _status, ...) \
{ \
/* NOTE: we have to store "RELEASED" flag here to provide backward */ \
Contributor

what does this mean?

Contributor Author

the ucp_request_release function doesn't cancel the completion callback; to support that behavior we have to store the "RELEASED" flag here

Contributor Author

updated comment
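
The idea of storing the flag before the callback runs can be sketched as follows. It is a simplified model under stated assumptions, not the `ucp_request_complete` macro itself: `sketch_req_t`, the `SKETCH_FLAG_*` bits, `sketch_req_put` and `sketch_free_in_cb` are all hypothetical, and the pool is modeled with plain `malloc`/`free`.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical flag bits; the real UCX values differ. */
enum {
    SKETCH_FLAG_COMPLETED = 1u << 0,
    SKETCH_FLAG_RELEASED  = 1u << 1, /* user already released the handle */
    SKETCH_FLAG_CALLBACK  = 1u << 2  /* a completion callback is set     */
};

typedef struct sketch_req sketch_req_t;
typedef void (*sketch_cb_t)(sketch_req_t *req, int status);

struct sketch_req {
    uint32_t    flags;
    int         status;
    sketch_cb_t cb;
};

static int sketch_put_count = 0; /* counts returns to the imaginary pool */

static void sketch_req_put(sketch_req_t *req)
{
    ++sketch_put_count;
    free(req);
}

static void sketch_req_complete(sketch_req_t *req, int status)
{
    uint32_t flags;

    req->status = status;
    /* Snapshot the flags before the callback: the callback may free the
     * request, so req must not be dereferenced afterwards. */
    flags = (req->flags |= SKETCH_FLAG_COMPLETED);
    if (flags & SKETCH_FLAG_CALLBACK) {
        req->cb(req, status);
    }
    /* Only the saved copy is consulted here; req itself is read no more. */
    if (flags & SKETCH_FLAG_RELEASED) {
        sketch_req_put(req);
    }
}

/* A user callback that frees its own, externally allocated request. */
static void sketch_free_in_cb(sketch_req_t *req, int status)
{
    (void)status;
    free(req);
}
```

Reading `req->flags` after the callback instead of the local copy is the kind of access that would touch freed memory when the callback frees the request.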

@hoopoepg hoopoepg force-pushed the topic/fixed-external-request-free-crash branch from addbfff to 6170d08 Compare December 11, 2020 09:43
@yosefe
Contributor

yosefe commented Dec 11, 2020

@hoopoepg test failure seems relevant

[2020-12-11T10:57:37.848Z] [----------] 1 test from self/test_ucp_tag_nbx
[2020-12-11T10:57:37.848Z] [ RUN      ] self/test_ucp_tag_nbx.external_request_free/0 <self>
[2020-12-11T10:57:37.848Z] [hpc-arm-cavium-jenkins:30991:0:30991]     offload.c:242  Assertion `wiface != ((void *)0)' failed
[2020-12-11T10:57:37.848Z] 
[2020-12-11T10:57:37.848Z] /scrap/jenkins/workspace/ucx-6/contrib/../src/ucp/tag/offload.c: [ ucp_tag_offload_cancel_inner() ]
[2020-12-11T10:57:37.848Z]       ...
[2020-12-11T10:57:37.848Z]       239     ucs_status_t status;
[2020-12-11T10:57:37.848Z]       240 
[2020-12-11T10:57:37.848Z]       241     ucs_assert(wiface != NULL);
[2020-12-11T10:57:37.848Z] ==>   242     status = uct_iface_tag_recv_cancel(wiface->iface, &req->recv.uct_ctx,
[2020-12-11T10:57:37.848Z]       243                                        mode & UCP_TAG_OFFLOAD_CANCEL_FORCE);
[2020-12-11T10:57:37.848Z]       244     if (status != UCS_OK) {
[2020-12-11T10:57:37.848Z]       245         ucs_error("Failed to cancel recv in the transport: %s",
[2020-12-11T10:57:37.848Z] 
[2020-12-11T10:57:38.785Z] ==== backtrace (tid:  30991) ====
[2020-12-11T10:57:38.785Z]  0 0x00000000000565b8 ucs_debug_print_backtrace()  /scrap/jenkins/workspace/ucx-6/contrib/../src/ucs/debug/debug.c:656
[2020-12-11T10:57:38.785Z]  1 0x00000000000850c4 ucp_tag_offload_cancel_inner()  /scrap/jenkins/workspace/ucx-6/contrib/../src/ucp/tag/offload.c:242
[2020-12-11T10:57:38.785Z]  2 0x00000000000850c4 ucp_tag_offload_cancel()  /scrap/jenkins/workspace/ucx-6/contrib/../src/ucp/tag/offload.c:235
[2020-12-11T10:57:38.785Z]  3 0x0000000000075dac ucp_tag_offload_try_cancel()  /scrap/jenkins/workspace/ucx-6/contrib/../src/ucp/tag/offload.h:105
[2020-12-11T10:57:38.785Z]  4 0x0000000000075dac ucp_tag_rndv_process_rts()  /scrap/jenkins/workspace/ucx-6/contrib/../src/ucp/tag/tag_rndv.c:46
[2020-12-11T10:57:38.785Z]  5 0x0000000000056dec ucp_rndv_rts_handler_inner()  /scrap/jenkins/workspace/ucx-6/contrib/../src/ucp/rndv/rndv.c:1333
[2020-12-11T10:57:38.785Z]  6 0x000000000001db50 uct_iface_invoke_am()  /scrap/jenkins/workspace/ucx-6/contrib/../src/uct/base/uct_iface.h:662
[2020-12-11T10:57:38.785Z]  7 0x000000000001db50 uct_self_iface_sendrecv_am()  /scrap/jenkins/workspace/ucx-6/contrib/../src/uct/sm/self/self.c:149
[2020-12-11T10:57:38.785Z]  8 0x000000000001e098 uct_self_ep_am_short()  /scrap/jenkins/workspace/ucx-6/contrib/../src/uct/sm/self/self.c:262
[2020-12-11T10:57:38.785Z]  9 0x000000000003de54 uct_ep_am_short()  /scrap/jenkins/workspace/ucx-6/contrib/../src/uct/api/uct.h:2632
[2020-12-11T10:57:38.785Z] 10 0x00000000000755d4 ucp_proto_progress_rndv_rts_inner()  /scrap/jenkins/workspace/ucx-6/contrib/../src/ucp/tag/tag_rndv.c:89
[2020-12-11T10:57:38.785Z] 11 0x000000000007f000 ucp_request_try_send()  /scrap/jenkins/workspace/ucx-6/contrib/../src/ucp/core/ucp_request.inl:242
[2020-12-11T10:57:38.785Z] 12 0x000000000007f000 ucp_request_send()  /scrap/jenkins/workspace/ucx-6/contrib/../src/ucp/core/ucp_request.inl:267
[2020-12-11T10:57:38.785Z] 13 0x000000000007f000 ucp_tag_send_req()  /scrap/jenkins/workspace/ucx-6/contrib/../src/ucp/tag/tag_send.c:116
[2020-12-11T10:57:38.785Z] 14 0x000000000007f000 ucp_tag_send_nbx_inner()  /scrap/jenkins/workspace/ucx-6/contrib/../src/ucp/tag/tag_send.c:296
[2020-12-11T10:57:38.785Z] 15 0x000000000007f000 ucp_tag_send_nbx()  /scrap/jenkins/workspace/ucx-6/contrib/../src/ucp/tag/tag_send.c:234
[2020-12-11T10:57:38.785Z] 16 0x00000000007d8080 test_ucp_tag_nbx_external_request_free_Test::test_body()  /scrap/jenkins/workspace/ucx-6/contrib/../test/gtest/ucp/test_ucp_tag.cc:549
[2020-12-11T10:57:38.785Z] 17 0x000000000058b88c ucs::test_base::run()  /scrap/jenkins/workspace/ucx-6/contrib/../test/gtest/common/test.cc:344
[2020-12-11T10:57:38.785Z] 18 0x000000000058b88c ucs::test_base::TestBodyProxy()  /scrap/jenkins/workspace/ucx-6/contrib/../test/gtest/common/test.cc:370
[2020-12-11T10:57:38.785Z] 19 0x000000000056e00c HandleSehExceptionsInMethodIfSupported<testing::Test, void>()  /scrap/jenkins/workspace/ucx-6/contrib/../test/gtest/common/gtest-all.cc:3562
[2020-12-11T10:57:38.785Z] 20 0x000000000056e00c testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>()  /scrap/jenkins/workspace/ucx-6/contrib/../test/gtest/common/gtest-all.cc:3598
[2020-12-11T10:57:38.785Z] 21 0x00000000005627c4 testing::Test::Run()  /scrap/jenkins/workspace/ucx-6/contrib/../test/gtest/common/gtest-all.cc:3635
[2020-12-11T10:57:38.785Z] 22 0x0000000000562894 testing::TestInfo::Run()  /scrap/jenkins/workspace/ucx-6/contrib/../test/gtest/common/gtest-all.cc:3812
[2020-12-11T10:57:38.785Z] 23 0x0000000000562a04 testing::TestCase::Run()  /scrap/jenkins/workspace/ucx-6/contrib/../test/gtest/common/gtest-all.cc:3930
[2020-12-11T10:57:38.785Z] 24 0x0000000000566e14 testing::internal::UnitTestImpl::RunAllTests()  /scrap/jenkins/workspace/ucx-6/contrib/../test/gtest/common/gtest-all.cc:5808
[2020-12-11T10:57:38.785Z] 25 0x0000000000567130 testing::internal::UnitTestImpl::RunAllTests()  /scrap/jenkins/workspace/ucx-6/contrib/../test/gtest/common/gtest-all.cc:5725
[2020-12-11T10:57:38.785Z] 26 0x0000000000567130 HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>()  /scrap/jenkins/workspace/ucx-6/contrib/../test/gtest/common/gtest-all.cc:3618
[2020-12-11T10:57:38.785Z] 27 0x0000000000567130 testing::UnitTest::Run()  /scrap/jenkins/workspace/ucx-6/contrib/../test/gtest/common/gtest-all.cc:5422
[2020-12-11T10:57:38.785Z] 28 0x0000000000510908 RUN_ALL_TESTS()  /scrap/jenkins/workspace/ucx-6/contrib/../test/gtest/common/gtest.h:20059
[2020-12-11T10:57:38.785Z] 29 0x00000000000215d4 __libc_start_main()  :0
[2020-12-11T10:57:38.785Z] 30 0x000000000054d0fc _start()  :0
[2020-12-11T10:57:38.785Z] =================================
[2020-12-11T10:57:38.785Z] Sending notification to sergeyo@nvidia.com
[2020-12-11T10:57:44.065Z] [hpc-arm-cavium-jenkins:30991:0:30991] Process frozen...
[2020-12-11T13:37:11.727Z] make: *** [test] Terminated
script returned exit code 124

@hoopoepg hoopoepg force-pushed the topic/fixed-external-request-free-crash branch from 6170d08 to 1360f39 Compare December 11, 2020 17:06
@hoopoepg
Contributor Author

yep, the issue was in the tag-offload buffer dereg.
fixed

@hoopoepg
Contributor Author

@yosefe ok to merge?

/* Cancel req in transport if it was offloaded, because it arrived
as unexpected */
ucp_tag_offload_try_cancel(worker, rreq, UCP_TAG_OFFLOAD_CANCEL_FORCE);
ucp_tag_rndv_matched(worker, rreq, rts_hdr);
Contributor

what does it fix?

Contributor Author

ucp_tag_rndv_matched may free the request in the self transport, after which the request can no longer be accessed. The fix is to cancel the tag-offload buffer first (it is not used in expected recv) and then process the rndv operation.

this fix was proposed by @brminich
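
The ordering described here is a general pattern: do every access to an object before calling anything that may free it. The sketch below models it with hypothetical stand-ins (`sketch_rreq_t`, `sketch_offload_cancel`, `sketch_rndv_matched`, `sketch_process_rts`); it does not reproduce the UCX functions or their signatures.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical receive request. */
typedef struct {
    int offloaded; /* nonzero while the transport still holds the buffer */
} sketch_rreq_t;

static int sketch_cancelled = 0; /* number of offload cancels performed */

/* Stand-in for cancelling an offloaded receive; reads and writes rreq. */
static void sketch_offload_cancel(sketch_rreq_t *rreq)
{
    if (rreq->offloaded) {
        rreq->offloaded = 0;
        ++sketch_cancelled;
    }
}

/* Stand-in for a "matched" handler that can complete and free the
 * request inline, as may happen on a loopback transport. */
static void sketch_rndv_matched(sketch_rreq_t *rreq)
{
    free(rreq);
}

static void sketch_process_rts(sketch_rreq_t *rreq)
{
    sketch_offload_cancel(rreq); /* last access to rreq */
    sketch_rndv_matched(rreq);   /* rreq must not be used after this */
}
```

Swapping the two calls in `sketch_process_rts` would make the cancel step read freed memory whenever the matched handler completes the request inline.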

- fixed crash in completion callback when the user tries to free an
  external request
- added gtest
- added function wait_for_value for UCP tests
@hoopoepg hoopoepg force-pushed the topic/fixed-external-request-free-crash branch from 1360f39 to 1795c1f Compare December 14, 2020 06:03
@hoopoepg
Contributor Author

bot:pipe:retest

@hoopoepg
Contributor Author

@yosefe ok to merge?

@yosefe yosefe merged commit 6887a67 into openucx:master Dec 14, 2020
Successfully merging this pull request may close these issues.

Q: implementation of ucp_request_alloc
4 participants