UCS/CALLBACKQ: fix recursive one-shot #5926

Merged: 1 commit, Jan 26, 2021

Conversation

Contributor

@evgeny-leksikov evgeny-leksikov commented Nov 18, 2020

What

Fix recursive one-shot

Why?

To avoid getting stuck in a loop until the process runs out of memory.

How?

Limit the iteration count to the actual value taken before the loop.
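A minimal sketch of the idea described above, not the actual UCX code: the number of queued one-shot callbacks is read once before the loop, so a callback that re-adds itself during dispatch runs on the next dispatch instead of extending the current loop indefinitely. All names here (`oneshot_queue_t`, `oneshot_queue_add`, `oneshot_queue_dispatch`, `rearm_cb`, `QUEUE_CAPACITY`) are hypothetical and chosen for illustration only.

```c
#include <assert.h>
#include <stddef.h>

#define QUEUE_CAPACITY 64 /* hypothetical fixed capacity for the sketch */

typedef struct oneshot_queue oneshot_queue_t;
typedef void (*oneshot_cb_t)(oneshot_queue_t *q, void *arg);

typedef struct {
    oneshot_cb_t cb;
    void         *arg;
} oneshot_elem_t;

struct oneshot_queue {
    oneshot_elem_t elems[QUEUE_CAPACITY]; /* ring buffer of callbacks */
    size_t         head;                  /* index of the oldest element */
    size_t         count;                 /* number of queued elements */
};

/* Append a one-shot callback to the tail of the ring buffer. */
static void oneshot_queue_add(oneshot_queue_t *q, oneshot_cb_t cb, void *arg)
{
    size_t tail;

    assert(q->count < QUEUE_CAPACITY);
    tail               = (q->head + q->count) % QUEUE_CAPACITY;
    q->elems[tail].cb  = cb;
    q->elems[tail].arg = arg;
    q->count++;
}

/*
 * Run every callback that was queued when dispatch started, exactly once.
 * The count is snapshotted BEFORE the loop: callbacks added while
 * dispatching are left for the next call, so the loop always terminates.
 * Returns the number of callbacks that were run.
 */
static size_t oneshot_queue_dispatch(oneshot_queue_t *q)
{
    size_t pending = q->count; /* snapshot, not re-read inside the loop */
    size_t i;

    for (i = 0; i < pending; ++i) {
        oneshot_elem_t elem = q->elems[q->head];

        /* Remove the element first, then call it (one-shot semantics). */
        q->head = (q->head + 1) % QUEUE_CAPACITY;
        q->count--;
        elem.cb(q, elem.arg);
    }
    return pending;
}

/* A "recursive" one-shot: counts its calls and re-adds itself. */
static void rearm_cb(oneshot_queue_t *q, void *arg)
{
    int *calls = arg;

    ++(*calls);
    oneshot_queue_add(q, rearm_cb, arg);
}
```

If the loop condition were `q->count > 0` instead of the snapshot, adding `rearm_cb` once would make a single dispatch call spin forever; with the snapshot, each dispatch runs it once and leaves the re-added entry queued.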

Contributor

@yosefe yosefe left a comment

In general LGTM; just need to fix the commit title (short -> shot) and the build issues.

@@ -115,7 +118,7 @@ static void ucp_ep_flush_progress(ucp_request_t *req)
}
} else {
ucp_ep_flush_error(req, status);
break;
req->send.flush.started_lanes |= UCS_BIT(lane);
Member

should we do
--req->send.state.uct_comp.count;?

Member

move req->send.flush.started_lanes |= UCS_BIT(lane); to ucp_ep_flush_error()?

Contributor Author

> should we do
> --req->send.state.uct_comp.count;?

It's done in ucp_ep_flush_error().

Contributor Author

> move req->send.flush.started_lanes |= UCS_BIT(lane); to ucp_ep_flush_error()?

That would be wrong in the pending_add failure case.

@evgeny-leksikov
Contributor Author

bot:pipe:retest

@evgeny-leksikov
Contributor Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@evgeny-leksikov
Contributor Author

bot:pipe:retest

@yosefe
Contributor

yosefe commented Nov 27, 2020

bot:pipe:retest

@evgeny-leksikov
Contributor Author

@yosefe the failure is relevant: I reproduced it locally when the network interface is TCP but the selected transport is RCX

@evgeny-leksikov
Contributor Author

bot:pipe:retest

@evgeny-leksikov
Contributor Author

bot:pipe:retest

@@ -348,6 +348,9 @@ ucs_status_t ucp_do_am_zcopy_multi(uct_pending_req_t *self, uint8_t am_id_first,
ucp_send_request_add_reg_lane(req, req->send.lane);
} else {
req->send.lane = ucp_ep_get_am_lane(ep);
if (req->send.state.dt.offset == 0) {
ucp_send_request_add_reg_lane(req, req->send.lane);
Contributor

Maybe move this after line 354?
It seems the only case where ucp_send_request_add_reg_lane is not called is when enable_am_bw is false.

@yosefe
Contributor

yosefe commented Dec 4, 2020

bot:pipe:retest

@yosefe
Contributor

yosefe commented Jan 11, 2021

pls squash

 + reg send lane on first iter in UCP zcopy multy protocol
@yosefe
Contributor

yosefe commented Jan 12, 2021

@evgeny-leksikov failure could be related. AFAIR, didn't see such failure before

[2021-01-11T12:36:36.864Z] [ RUN      ] all/test_ucp_sockaddr_protocols.stream_bcopy_4k_exp/0 <all>
[2021-01-11T12:36:37.120Z] [     INFO ] < rdma_bind_addr(addr=10.209.45.93:0) failed: No such device >
[2021-01-11T12:36:37.120Z] [     INFO ] server listening on 10.209.45.93:36814
[2021-01-11T12:36:38.488Z] [r-vmb-ppc-jenkins:26934:0:26934] ucp_request.inl:254  Fatal: unexpected error: Invalid parameter
[2021-01-11T12:36:38.785Z] 
[2021-01-11T12:36:38.785Z] /scrap/jenkins/workspace/ucx-5/contrib/../src/ucp/core/ucp_request.inl: [ ucp_request_try_send() ]
[2021-01-11T12:36:38.785Z]       ...
[2021-01-11T12:36:38.785Z]       251     }
[2021-01-11T12:36:38.785Z]       252 
[2021-01-11T12:36:38.785Z]       253     ucs_fatal("unexpected error: %s", ucs_status_string(status));
[2021-01-11T12:36:38.785Z] ==>   254 }
[2021-01-11T12:36:38.785Z]       255 
[2021-01-11T12:36:38.785Z]       256 /**
[2021-01-11T12:36:38.785Z]       257  * Start sending a request.
[2021-01-11T12:36:38.785Z] 
[2021-01-11T12:36:39.366Z] ==== backtrace (tid:  26934) ====
[2021-01-11T12:36:39.366Z]  0 0x000000000005f848 ucs_debug_print_backtrace()  /scrap/jenkins/workspace/ucx-5/contrib/../src/ucs/debug/debug.c:656
[2021-01-11T12:36:39.366Z]  1 0x00000000000af96c ucp_request_try_send()  /scrap/jenkins/workspace/ucx-5/contrib/../src/ucp/core/ucp_request.inl:254
[2021-01-11T12:36:39.366Z]  2 0x00000000000af96c ucp_request_send()  /scrap/jenkins/workspace/ucx-5/contrib/../src/ucp/core/ucp_request.inl:267
[2021-01-11T12:36:39.366Z]  3 0x00000000000af96c ucp_wireup_replay_pending_request()  /scrap/jenkins/workspace/ucx-5/contrib/../src/ucp/wireup/wireup.c:753
[2021-01-11T12:36:39.366Z]  4 0x00000000000af96c ucp_wireup_replay_pending_requests()  /scrap/jenkins/workspace/ucx-5/contrib/../src/ucp/wireup/wireup.c:763
[2021-01-11T12:36:39.366Z]  5 0x00000000000ab7cc ucp_wireup_ep_progress()  /scrap/jenkins/workspace/ucx-5/contrib/../src/ucp/wireup/wireup_ep.c:99
[2021-01-11T12:36:39.366Z]  6 0x0000000000051d04 ucs_callbackq_slow_proxy()  /scrap/jenkins/workspace/ucx-5/contrib/../src/ucs/datastruct/callbackq.c:402
[2021-01-11T12:36:39.366Z]  7 0x0000000000047894 ucs_callbackq_dispatch()  /scrap/jenkins/workspace/ucx-5/contrib/../src/ucs/datastruct/callbackq.h:211
[2021-01-11T12:36:39.366Z]  8 0x0000000000047894 uct_worker_progress()  /scrap/jenkins/workspace/ucx-5/contrib/../src/uct/api/uct.h:2436
[2021-01-11T12:36:39.366Z]  9 0x0000000000047894 ucp_worker_progress()  /scrap/jenkins/workspace/ucx-5/contrib/../src/ucp/core/ucp_worker.c:2403

@evgeny-leksikov
Contributor Author

> @evgeny-leksikov failure could be related. AFAIR, didn't see such failure before

yep, reproduced locally, this is the EP reconfiguration issue when request is in pending queue in case of UCX_CM_USE_ALL_DEVICES, initial AM lane is TCP, then switching to IB

@yosefe
Contributor

yosefe commented Jan 12, 2021

> yep, reproduced locally, this is the EP reconfiguration issue when request is in pending queue in case of UCX_CM_USE_ALL_DEVICES, initial AM lane is TCP, then switching to IB

Did this PR introduce the issue, or just "uncover" it?

@evgeny-leksikov
Contributor Author

> yep, reproduced locally, this is the EP reconfiguration issue when request is in pending queue in case of UCX_CM_USE_ALL_DEVICES, initial AM lane is TCP, then switching to IB
>
> Did this PR introduce the issue, or just "uncover" it?

uncover

@evgeny-leksikov
Contributor Author

failure is #6194

@evgeny-leksikov
Contributor Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@evgeny-leksikov
Contributor Author

@yosefe ok to merge?

@yosefe yosefe merged commit 07b22d4 into openucx:master Jan 26, 2021