Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UCT/UCP: Fixes for am_zcopy sends #125

Merged
merged 1 commit into from
Jan 19, 2021
Merged

Conversation

yosefe
Copy link
Owner

@yosefe yosefe commented Jan 18, 2021

  1. ucp_do_am_zcopy_multi() did not return UCS_ERR_NO_RESOURCE when such
    status was returned from uct_ep_am_zcopy, as a result the send
    request was removed from pending queue and did not progress (io_demo
    test stuck)
  2. rc_mlx5 am_zcopy checked pending queue assertion before checking
    send resources in uct_rc_mlx5_ep_zcopy_post() which leads to wrong
    assertion. Move resource checking from zcopy_post to calling
    functions.
  3. Invalid send flags passed in zcopy_post: SOLICITED was not set when
    comp != NULL.

@yosefe
Copy link
Owner Author

yosefe commented Jan 18, 2021

@brminich @evgeny-leksikov can you pls take a look? it fixes am_zcopy flow which is apparently not tested because ZCOPY_THRESH==RNDV_THRESH in mtt

@@ -221,6 +220,7 @@ ucs_status_t uct_rc_mlx5_ep_put_zcopy(uct_ep_h tl_ep, const uct_iov_t *iov, size
UCT_CHECK_LENGTH(uct_iov_total_length(iov, iovcnt), 0, UCT_IB_MAX_MESSAGE_SIZE,
"put_zcopy");
UCT_RC_CHECK_NUM_RDMA_READ(iface);
UCT_RC_CHECK_RES(iface, &ep->super);
Copy link
Collaborator

@brminich brminich Jan 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should it be done before UCT_RC_CHECK_NUM_RDMA_READ? Or how does it help?

@@ -269,6 +269,7 @@ ucs_status_t uct_rc_mlx5_ep_get_zcopy(uct_ep_h tl_ep, const uct_iov_t *iov, size
iface->super.super.config.max_inl_resp + 1,
iface->super.config.max_get_zcopy, "get_zcopy");
UCT_RC_CHECK_NUM_RDMA_READ(&iface->super);
UCT_RC_CHECK_RES(&iface->super, &ep->super);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should it be done before UCT_RC_CHECK_NUM_RDMA_READ? Or how does it help?

Comment on lines 271 to 272
UCT_RC_CHECK_NUM_RDMA_READ(&iface->super);
UCT_RC_CHECK_RES(&iface->super, &ep->super);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UCT_RC_CHECK_RMA_RES instead of both macros?

Comment on lines 222 to 223
UCT_RC_CHECK_NUM_RDMA_READ(iface);
UCT_RC_CHECK_RES(iface, &ep->super);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UCT_RC_CHECK_RMA_RES instead of both macros?

@@ -391,6 +391,8 @@ ucs_status_t ucp_do_am_zcopy_multi(uct_pending_req_t *self, uint8_t am_id_first,
}
ucs_assert(status == UCS_INPROGRESS);
return UCS_OK;
} else {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1. ucp_do_am_zcopy_multi() did not return UCS_ERR_NO_RESOURCE when such
   status was returned from uct_ep_am_zcopy, as a result the send
   request was removed from pending queue and did not progress (io_demo
   test stuck)
2. rc_mlx5 am_zcopy checked pending queue assertion before checking
   send resources in uct_rc_mlx5_ep_zcopy_post() which leads to wrong
   assertion. Move resource checking from zcopy_post to calling
   functions.
3. Invalid send flags passed in zcopy_post: SOLICITED was not set when
   comp != NULL.
@yosefe
Copy link
Owner Author

yosefe commented Jan 19, 2021

@dmitrygx @brminich thanks for the comments, fixed, force-pushed due to small size

@yosefe yosefe merged commit 5cb3ace into integration3 Jan 19, 2021
@yosefe yosefe deleted the topic/mlx5-send-fixes branch January 19, 2021 22:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants