-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
UCT/UCP: Fixes for am_zcopy sends #125
Conversation
@brminich @evgeny-leksikov can you pls take a look? it fixes am_zcopy flow which is apparently not tested because ZCOPY_THRESH==RNDV_THRESH in mtt |
src/uct/ib/rc/accel/rc_mlx5_ep.c
Outdated
@@ -221,6 +220,7 @@ ucs_status_t uct_rc_mlx5_ep_put_zcopy(uct_ep_h tl_ep, const uct_iov_t *iov, size | |||
UCT_CHECK_LENGTH(uct_iov_total_length(iov, iovcnt), 0, UCT_IB_MAX_MESSAGE_SIZE, | |||
"put_zcopy"); | |||
UCT_RC_CHECK_NUM_RDMA_READ(iface); | |||
UCT_RC_CHECK_RES(iface, &ep->super); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should it be done before UCT_RC_CHECK_NUM_RDMA_READ
? Or how does it help?
src/uct/ib/rc/accel/rc_mlx5_ep.c
Outdated
@@ -269,6 +269,7 @@ ucs_status_t uct_rc_mlx5_ep_get_zcopy(uct_ep_h tl_ep, const uct_iov_t *iov, size | |||
iface->super.super.config.max_inl_resp + 1, | |||
iface->super.config.max_get_zcopy, "get_zcopy"); | |||
UCT_RC_CHECK_NUM_RDMA_READ(&iface->super); | |||
UCT_RC_CHECK_RES(&iface->super, &ep->super); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should it be done before UCT_RC_CHECK_NUM_RDMA_READ? Or how does it help?
src/uct/ib/rc/accel/rc_mlx5_ep.c
Outdated
UCT_RC_CHECK_NUM_RDMA_READ(&iface->super); | ||
UCT_RC_CHECK_RES(&iface->super, &ep->super); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
UCT_RC_CHECK_RMA_RES
instead of both macros?
src/uct/ib/rc/accel/rc_mlx5_ep.c
Outdated
UCT_RC_CHECK_NUM_RDMA_READ(iface); | ||
UCT_RC_CHECK_RES(iface, &ep->super); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
UCT_RC_CHECK_RMA_RES
instead of both macros?
src/ucp/proto/proto_am.inl
Outdated
@@ -391,6 +391,8 @@ ucs_status_t ucp_do_am_zcopy_multi(uct_pending_req_t *self, uint8_t am_id_first, | |||
} | |||
ucs_assert(status == UCS_INPROGRESS); | |||
return UCS_OK; | |||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need for else
i.e. can fix it in the same way as in master:
https://github.com/openucx/ucx/blob/8d3801b94512e91c508a7d2a357b58bb9e5d4b6c/src/ucp/proto/proto_am.inl#L444
1. ucp_do_am_zcopy_multi() did not return UCS_ERR_NO_RESOURCE when such status was returned from uct_ep_am_zcopy, as a result the send request was removed from pending queue and did not progress (io_demo test stuck) 2. rc_mlx5 am_zcopy checked pending queue assertion before checking send resources in uct_rc_mlx5_ep_zcopy_post() which leads to wrong assertion. Move resource checking from zcopy_post to calling functions. 3. Invalid send flags passed in zcopy_post: SOLICITED was not set when comp != NULL.
20aacb7
to
7df1246
Compare
status was returned from uct_ep_am_zcopy, as a result the send
request was removed from pending queue and did not progress (io_demo
test stuck)
send resources in uct_rc_mlx5_ep_zcopy_post() which leads to wrong
assertion. Move resource checking from zcopy_post to calling
functions.
comp != NULL.