-
Notifications
You must be signed in to change notification settings - Fork 423
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
UCS/CALLBACKQ: fix recursive one-shot #5926
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in general LGTM, just need to fix commit title short->shot and build issues
d4b22b8
to
1fb7e41
Compare
1fb7e41
to
9c30b34
Compare
src/ucp/rma/flush.c
Outdated
@@ -115,7 +118,7 @@ static void ucp_ep_flush_progress(ucp_request_t *req) | |||
} | |||
} else { | |||
ucp_ep_flush_error(req, status); | |||
break; | |||
req->send.flush.started_lanes |= UCS_BIT(lane); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we do
--req->send.state.uct_comp.count;
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move req->send.flush.started_lanes |= UCS_BIT(lane);
to ucp_ep_flush_error()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we do
--req->send.state.uct_comp.count;
?
it's done in ucp_ep_flush_error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move
req->send.flush.started_lanes |= UCS_BIT(lane);
toucp_ep_flush_error()
?
it will be wrong for pending_add
failure
0a83780
to
9c30b34
Compare
bot:pipe:retest |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
bot:pipe:retest |
1 similar comment
bot:pipe:retest |
@yosefe the failure is relevant, reproduced locally when network interface is TCP but selected transport is RCX |
bot:pipe:retest |
1 similar comment
bot:pipe:retest |
src/ucp/proto/proto_am.inl
Outdated
@@ -348,6 +348,9 @@ ucs_status_t ucp_do_am_zcopy_multi(uct_pending_req_t *self, uint8_t am_id_first, | |||
ucp_send_request_add_reg_lane(req, req->send.lane); | |||
} else { | |||
req->send.lane = ucp_ep_get_am_lane(ep); | |||
if (req->send.state.dt.offset == 0) { | |||
ucp_send_request_add_reg_lane(req, req->send.lane); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe move this after line 354?
seems only case ucp_send_request_add_reg_lane is not called, is when enable_am_bw is false
bot:pipe:retest |
pls squash |
+ reg send lane on first iter in UCP zcopy multy protocol
087e922
to
b8ab8b0
Compare
@evgeny-leksikov failure could be related. AFAIR, didn't see such failure before
|
yep, reproduced locally, this is the EP reconfiguration issue when request is in pending queue in case of |
Did this PR introduce the issue, or just "uncover" it? |
uncover |
failure is #6194 |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
@yosefe ok to merge? |
What
Fix recursive one-shot
Why ?
To avoid a stuck until out-of-memory
How ?
limit iteration count by actual value before the loop