Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UCP/CORE: Drop packets with invalid REQ or UCP EP IDs #6001

Merged
merged 1 commit into from
Jan 13, 2021

Conversation

dmitrygx
Copy link
Member

@dmitrygx dmitrygx commented Dec 9, 2020

What

Drop packets with invalid REQ or UCP EP IDs

Why ?

Just drop packets instead of asserting that we expect that requests or EPs are found by their ids.

How ?

  1. Add additional if to check the result from ucp_worker_get_ep_by_id() and ucp_worker_get_request_by_id(). If it unsuccessful to find a request or an EP, just drop a packet.
  2. Add gtests to reproduce an invalid EP or request ID handling during peer failure flow.

@dmitrygx
Copy link
Member Author

failure was #6029
bot:pipe:retest

@dmitrygx dmitrygx force-pushed the topic/ucp/drop_by_id branch 5 times, most recently from a304d9f to ec740b3 Compare December 16, 2020 07:27
@dmitrygx
Copy link
Member Author

tests are passing, unrelated change was rolled back.
@hoopoepg @brminich could you review pls?

@dmitrygx
Copy link
Member Author

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

src/ucp/core/ucp_request.inl Outdated Show resolved Hide resolved
src/ucp/proto/proto_am.inl Outdated Show resolved Hide resolved
src/ucp/rma/amo_sw.c Outdated Show resolved Hide resolved
src/ucp/rma/rma_sw.c Outdated Show resolved Hide resolved
src/ucp/rma/rma_sw.c Outdated Show resolved Hide resolved
src/ucp/rma/rma_sw.c Outdated Show resolved Hide resolved
src/ucp/rndv/rndv.c Outdated Show resolved Hide resolved
src/ucp/rndv/rndv.c Show resolved Hide resolved
src/ucp/tag/eager_snd.c Outdated Show resolved Hide resolved
test/gtest/ucp/test_ucp_sockaddr.cc Outdated Show resolved Hide resolved
@dmitrygx
Copy link
Member Author

botLpipe:retest

1 similar comment
@dmitrygx
Copy link
Member Author

botLpipe:retest

@dmitrygx
Copy link
Member Author

bot:pipe:retest

@dmitrygx
Copy link
Member Author

@brminich @hoopoepg could you review pls?

src/ucp/core/ucp_am.c Outdated Show resolved Hide resolved
src/ucp/proto/proto_am.inl Outdated Show resolved Hide resolved
src/ucp/rma/amo_sw.c Outdated Show resolved Hide resolved
src/ucp/rma/rma_sw.c Outdated Show resolved Hide resolved
src/ucp/rma/rma_sw.c Outdated Show resolved Hide resolved
src/ucp/rma/rma_sw.c Outdated Show resolved Hide resolved
src/ucp/tag/eager_snd.c Outdated Show resolved Hide resolved
src/ucp/wireup/wireup.c Outdated Show resolved Hide resolved
test/gtest/ucp/test_ucp_sockaddr.cc Outdated Show resolved Hide resolved
@dmitrygx
Copy link
Member Author

dmitrygx commented Dec 18, 2020

@brminich @yosefe I'd keep the behavior as is (except dropping packets if EP doesn't exist anymore in PTR MAP), since there are many tests (e.g. test_ucp_wireup_1sided.send_disconnect_reply1) which wait for a sender's operation completion after receiver's ep issued ucp_ep_close_nbx(). e.g. if this is the SW PUT implementation, a target side receives PUT_REQ and checks whether EP exists or not and then copies data to its buffer and answers with completion to PUT operation. But if we will check EP state (CLOSED or not), we will not answer with completion. So, an initiator of PUT will hang and operation will be timed out, since no completion was received and no errors were detected (since those tests don't request error handling support).
I think we should handle this behavior, we should wait for #5821 and enable error handling in those tests.

@dmitrygx
Copy link
Member Author

failure was #6011
bot:pipe:retest

@dmitrygx
Copy link
Member Author

bot:pipe:retest

@dmitrygx
Copy link
Member Author

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@yosefe
Copy link
Contributor

yosefe commented Dec 19, 2020

bot:pipe:retest

@dmitrygx
Copy link
Member Author

@brminich failure is not related, could you review pls?

@dmitrygx
Copy link
Member Author

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@dmitrygx
Copy link
Member Author

dmitrygx commented Jan 4, 2021

@yosefe @hoopoepg could you review pls?

src/ucp/core/ucp_ep.inl Outdated Show resolved Hide resolved
src/ucp/tag/eager_snd.c Outdated Show resolved Hide resolved
src/ucp/core/ucp_ep.inl Outdated Show resolved Hide resolved
src/ucp/core/ucp_ep.inl Outdated Show resolved Hide resolved
src/ucp/core/ucp_worker.inl Show resolved Hide resolved
src/ucp/core/ucp_worker.inl Outdated Show resolved Hide resolved
src/ucp/rndv/rndv.c Outdated Show resolved Hide resolved
rreq = ucp_tag_recv_nb(receiver().worker(), &recv_buf[0], size,
ucp_dt_make_contig(1), 0, 0,
rtag_complete_cb);
ucp_tag_recv_info_t recv_info = {};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put this is separate function

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 1154 to 1156
if (!err_handling) {
request_wait(sreq);
request_wait(rreq);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make std::vector of the requests we need to wait for

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

if (!err_handling) {
compare_buffers(send_buf, recv_buf);
} else {
delete slh;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use ucs::auto_ptr<scoped_log_handler>

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Comment on lines 1568 to 1573
(get_variant_value() &
SEND_STOP) ?
send_stop : NULL,
(get_variant_value() &
RECV_STOP) ?
recv_stop : NULL,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we check the variant inside test_tag_send_recv?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can't since this is the function from the base class
but we need to check variants and apply them only warmup

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any other way to simplify this code?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed as you suggested, but it was required to move functions to test_ucp_sockaddr_protocols class

Comment on lines 1586 to 1587
(err_str.find("ptr map ") != std::string::npos) ||
(err_str.find("was not returned to mpool ucp_requests") !=
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are we OK to have leaks?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, it will be fixed then by the tracking UCP requests
it is ok, since we close EP for send, we don't expect that recv request won't be completed and vice versa

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we force-release the leaked request somehow?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unfortunately, it could be problematic, since we need to fix C++ compilation of ucp_request.inl file to use ucp_request_put() function for force-release of UCP requests and more complex thing is to release PTR map key since it is needed in RNDV, synchronized protocols
tracking UCP requests will allow us just remove the warn_leak_hander() and we would expect that no warnings/errors are printed by UCP

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we force-release the leaked request somehow?

@yosefe done by forcibly setting COMPLETED flag and releasing a request ID if a request is in PTR MAP.

@dmitrygx
Copy link
Member Author

dmitrygx commented Jan 7, 2021

@yosefe could you review pls?

@yosefe
Copy link
Contributor

yosefe commented Jan 7, 2021

@dmitrygx is it possible to remove the merge commit(496192b) and squash the fixes (starting from 94bbe10) to one commit?
it's not easy to see what has changed since last review (bd9ece1)

@dmitrygx
Copy link
Member Author

dmitrygx commented Jan 7, 2021

@dmitrygx is it possible to remove the merge commit(496192b) and squash the fixes (starting from 94bbe10) to one commit?
it's not easy to see what has changed since last review (bd9ece1)

@yosefe hope it is what you mean
removed the merge commit, but rebased on master due to conflicts
pls review the last commit af2d0b1 - it has fixes for the latest comments

@dmitrygx
Copy link
Member Author

dmitrygx commented Jan 7, 2021

bot:pipe:retest

@dmitrygx
Copy link
Member Author

@yosefe could you review pls?

src/ucp/core/ucp_am.c Outdated Show resolved Hide resolved
src/ucp/core/ucp_ep.inl Outdated Show resolved Hide resolved
src/ucp/core/ucp_worker.inl Outdated Show resolved Hide resolved
src/ucp/rma/amo_sw.c Outdated Show resolved Hide resolved
src/ucp/rndv/rndv.c Outdated Show resolved Hide resolved
test/gtest/ucp/test_ucp_sockaddr.cc Outdated Show resolved Hide resolved
test/gtest/ucp/test_ucp_sockaddr.cc Outdated Show resolved Hide resolved
test/gtest/ucp/test_ucp_sockaddr.cc Outdated Show resolved Hide resolved
test/gtest/ucp/test_ucp_sockaddr.cc Outdated Show resolved Hide resolved
test/gtest/ucp/ucp_test.cc Outdated Show resolved Hide resolved
test/gtest/ucp/test_ucp_sockaddr.cc Outdated Show resolved Hide resolved
test/gtest/ucp/test_ucp_sockaddr.cc Show resolved Hide resolved
test/gtest/ucp/test_ucp_sockaddr.cc Outdated Show resolved Hide resolved
test/gtest/ucp/test_ucp_sockaddr.cc Outdated Show resolved Hide resolved
test/gtest/ucp/test_ucp_sockaddr.cc Outdated Show resolved Hide resolved
src/ucp/core/ucp_am.c Outdated Show resolved Hide resolved
src/ucp/rndv/rndv.c Outdated Show resolved Hide resolved
src/ucp/rndv/rndv.c Outdated Show resolved Hide resolved
@dmitrygx
Copy link
Member Author

@brminich @yosefe are you ok with the latest changes? there is merge-conflict, I guess better to fix it when squashing

@yosefe
Copy link
Contributor

yosefe commented Jan 11, 2021

@brminich @yosefe are you ok with the latest changes? there is merge-conflict, I guess better to fix it when squashing

need to fix it to run tests

@dmitrygx
Copy link
Member Author

@brminich @yosefe are you ok with the latest changes? there is merge-conflict, I guess better to fix it when squashing

need to fix it to run tests

@yosefe done

@dmitrygx
Copy link
Member Author

static checkers found errors:
Error: CPPCHECK_WARNING:
/src/ucm/util/reloc.c:71: error[unknownMacro]: There is an unknown macro here somewhere. Configuration is required. If ElfW is a macro then please configure it.
##[error]static checkers found errors

the error is unrelated and will be fixed by @gleon99 in #6112

@dmitrygx
Copy link
Member Author

@yosefe could you review pls? failures are unrelated

@dmitrygx
Copy link
Member Author

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@dmitrygx
Copy link
Member Author

@yosefe ok to merge?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants