
UCP:rndv multirail support infrastructure + protocol #1894

Closed without merging (6 commits)

Conversation

@hoopoepg (Contributor) commented Oct 6, 2017

  • added infrastructure to support multi-rail on rndv-RMA:
    • updated rndv-rma lane to array
    • added memory handles to store registered buffer and remote keys
    • added wireup configuration for mrail
    • added the UCX_RNDV_RAILS context parameter to enable mrail (disabled by default: UCX_RNDV_RAILS=1)
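As a hedged illustration, the new parameter would be set from the environment like any other UCX variable; only the variable name and its default come from this PR, the value 2 below is just an example:

```shell
# UCX_RNDV_RAILS is the context parameter added by this PR;
# the default value of 1 keeps multirail disabled.
export UCX_RNDV_RAILS=2   # request two rendezvous rails
echo "$UCX_RNDV_RAILS"
```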

@swx-jenkins1: Test PASSed. See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/2766/ for details.

@mellanox-github: Test FAILed. See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/4757/ for details (Mellanox internal link).

@hoopoepg (Author) commented Oct 6, 2017:
bot:mlx:retest

@mellanox-github: Test FAILed. See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/4758/ for details (Mellanox internal link).

@swx-jenkins1: Test PASSed. See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/2767/ for details.

@mellanox-github: Test PASSed. See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/4759/ for details (Mellanox internal link).

@evgeny-leksikov (Contributor) left a comment:

I would add a ucp_request_send* prefix to all corresponding functions that work with the request.send structure, and ucp_request_[send]_state* to the functions that work with the state (#1874). BTW, do we plan any mrail infrastructure on recv (request.recv)? It could potentially be needed in the case of multiple protocols over rails.

@hoopoepg (Author) commented Oct 7, 2017:

I would add a ucp_request_send* prefix to all corresponding functions that work with the request.send structure, and ucp_request_[send]_state* to the functions that work with the state (#1874).

OK, I will rebase my changes onto your update.

BTW, do we plan any mrail infrastructure on recv (request.recv)? It could potentially be needed in the case of multiple protocols over rails.

maybe later

@yosefe added the WIP-DNM Work in progress / Do not review label Oct 8, 2017
static inline int ucp_ep_is_rndv_mrail_present(ucp_ep_h ep)
{
    return (ucp_ep_config(ep)->key.rndv_lanes[0] != UCP_NULL_LANE &&
            ucp_ep_config(ep)->key.rndv_lanes[1] != UCP_NULL_LANE);
}
Contributor:

isn't it enough to check just [1] element?

Contributor Author:

No; some initialization paths clear only the rndv_lanes[0] element and leave all the other lanes uninitialized. For such cases we check that the first two lanes are non-NULL.

@brminich (Contributor) commented Oct 9, 2017:

Don't you want to check: if there is more than one RNDV lane, then mrail is present?

Contributor Author:

lanes[0] must be initialized (to NULL_LANE or a real lane number), but lanes[1] may be uninitialized, so we have to check both.

Contributor:

I guess we have to initialize all RNDV lanes (it's part of the key, and you perform the check on it). Then lanes[1] would be either NULL_LANE or a real lane number (which means mrail is used).

Contributor Author:

will update

@@ -115,6 +115,12 @@ struct ucp_request {
uintptr_t remote_request; /* pointer to the sender's send request */
uct_rkey_bundle_t rkey_bundle;
ucp_request_t *rreq; /* receive request on the recv side */
unsigned use_mrail;
Contributor:

by how much does this increase request size?

Contributor Author:

8 * (1 + sizeof(rkey_bundle))
what is the problem here?

@@ -175,7 +175,8 @@ void ucp_iov_buffer_memh_dereg(uct_md_h uct_md, uct_mem_h *memh,
UCS_PROFILE_FUNC(ucs_status_t, ucp_request_memory_reg,
(context, rsc_index, buffer, length, datatype, state),
ucp_context_t *context, ucp_rsc_index_t rsc_index, void *buffer,
size_t length, ucp_datatype_t datatype, ucp_dt_state_t *state)
size_t length, ucp_datatype_t datatype,
Contributor:

why does the line need to be split?

Contributor Author:

will fix

@@ -196,7 +202,8 @@ void ucp_request_send_buffer_dereg(ucp_request_t *req, ucp_lane_index_t lane);

ucs_status_t ucp_request_memory_reg(ucp_context_t *context, ucp_rsc_index_t rsc_index,
void *buffer, size_t length,
ucp_datatype_t datatype, ucp_dt_state_t *state);
ucp_datatype_t datatype,
Contributor:

why does the line need to be split?

Contributor Author:

will join


static UCS_F_ALWAYS_INLINE int
ucp_request_is_empty_rail(ucp_dt_state_t *state, int rail) {
    return state->dt.mrail[rail].memh == UCT_MEM_HANDLE_NULL ||
Contributor:

isn't checking just the lane enough?

Contributor Author:

technically we may have a situation where some lanes don't require a mem-handle. Not sure if that is possible in the real world, but we should handle such a configuration. I will look at it after wireup is complete.

Contributor:

that's why checking the lane for NULL seems to be enough

Contributor Author:

ok, will fix

ucs_assert(key->rndv_lane == UCP_NULL_LANE);
key->rndv_lane = lane;
/* TODO: add rndv sort */
ucs_assert(key->rndv_lanes[0] == UCP_NULL_LANE);
Contributor:

How will you select what lanes are suitable for RNDV?

Contributor Author:

don't know yet; that is in the nearest plans :)

/* dereg the original send request since we are going to send on the AM lane next */
ucp_rndv_rma_request_send_buffer_dereg(sreq);
sreq->send.state.dt.contig.memh = UCT_MEM_HANDLE_NULL;
ucp_request_clear_rails(&sreq->send.state);
} else if (ucp_request_have_rails(&sreq->send.state)) {
ucp_rndv_rma_request_send_buffer_dereg(sreq);
Contributor:

I don't get this. What if am_lane is the same as one of the rail lanes? Then you would not need to deregister the send buffer, right?

Contributor Author:

for multirail we have to deregister the buffers to avoid resource leaks. The problem is in deregistration: we can't deregister lanes 1 and up while leaving lane 0 registered.

Contributor:

why can't we do so?

Contributor Author:

because it would require deeper refactoring of request/memory registration, which affects non-rndv functionality too

(ucp_ep_get_am_lane(ep) != ucp_ep_get_rndv_get_lane(ep))) {
ucp_request_clear_rails(&sreq->send.state);
} else if ((ucp_ep_is_rndv_lane_present(ep, 0)) &&
(ucp_ep_get_am_lane(ep) != ucp_ep_get_rndv_get_lane(ep, 0))) {
Contributor:

indentation (though it was not added by you :)

Contributor Author:

will fix

@@ -288,16 +354,33 @@ UCS_PROFILE_FUNC(ucs_status_t, ucp_proto_progress_rndv_get_zcopy, (self),
offset, (uintptr_t)rndv_req->send.buffer % align,
(void*)rndv_req->send.buffer + offset, length);

if ((rndv_req->send.rndv_get.use_mrail) &&
((ucp_request_is_empty_rail(&rndv_req->send.state, rndv_req->send.rndv_get.rail_idx)) ||
Contributor:

indentation

Contributor Author:

will fix, but the current indentation is more "beautiful"

src/ucp/dt/dt.h Outdated
@@ -24,8 +25,14 @@ typedef struct ucp_dt_state {
size_t offset; /* Total offset in overall payload. */
union {
struct {
uct_mem_h memh;
} contig;
struct {
Contributor:

maybe it is worth having a single struct and using just one element for the contig case?

Contributor Author:

there is logic that uses the same lane for am/get, and using a single element would make that processing complicated

@swx-jenkins1: Test PASSed. See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/2775/ for details.

@mellanox-github: Test PASSed. See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/4776/ for details (Mellanox internal link).

@swx-jenkins1: Test PASSed. See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/2776/ for details.

@mellanox-github: Test PASSed. See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/4777/ for details (Mellanox internal link).

@swx-jenkins1: Test FAILed. See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/2777/ for details.

@mellanox-github: Test FAILed. See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/4778/ for details (Mellanox internal link).

@swx-jenkins1: Test FAILed. See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/2778/ for details.

@hoopoepg (Author) commented Oct 9, 2017:
bot:retest

@mellanox-github: Test FAILed. See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/4779/ for details (Mellanox internal link).

@swx-jenkins1: Test PASSed. See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/2779/ for details.

@mellanox-github: Test FAILed. See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/4780/ for details (Mellanox internal link).

@hoopoepg (Author) commented Oct 9, 2017:
bot:mlx:retest

@mellanox-github: Test PASSed. See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/4782/ for details (Mellanox internal link).

@mellanox-github: Test FAILed. See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/4807/ for details (Mellanox internal link).

@hoopoepg force-pushed the multirail-rma-sw branch 2 times, most recently from 28f9af0 to bdda5df, October 13, 2017 14:48
Sergey Oblomov added 5 commits November 6, 2017 06:46
- added infrastructure to support multi-rail on rndv-RMA:
  - updated rndv-rma lane to array
  - added memory handles to store registered buffer and remote keys
  - added wireup configuration for mrail
  - added context parameter to enable mrail (disabled by default) UCX_RNDV_RAILS=1
- refactoring of datatypes: reduced size of core structures (request & dt)
@mellanox-github: Test FAILed. See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/5070/ for details (Mellanox internal link).

@mellanox-github: Test FAILed. See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/5073/ for details (Mellanox internal link).

@@ -344,3 +352,52 @@ ucp_request_send_start(ucp_request_t *req, ssize_t max_short,

return UCS_ERR_NO_PROGRESS;
}

int ucp_request_rndv_reg(ucp_request_t *req)
Contributor:

rename to ucp_request_rndv_mem_reg(); also, I see the return value is not really needed, so we might as well make it void

Contributor Author:

done


ucs_assert(UCP_DT_IS_CONTIG(req->send.datatype));

ucp_dt_clear_rndv_lanes(state);
Contributor:

don't have to clear all lanes - can set memh to NULL as an 'else' clause in line 378

Contributor Author:

ok, but this infrastructure may be used for AM rndv too, and then we would have to add complex logic to clean up the handles

}

req->send.reg_rsc = UCP_NULL_RESOURCE;
ucp_dt_clear_rndv_lanes(state);
Contributor:

can clear only what has been deregistered

Contributor Author:

fixed

*/
typedef struct ucp_rndv_get_rkey {
ucp_lane_index_t lane_num; /* number of rkeys obtained from peer */
ucp_lane_index_t lane_idx;
Contributor:

please add a comment describing this field

Contributor Author:

added

* available - only one rkey is proceeded
*/
typedef struct ucp_rndv_get_rkey {
ucp_lane_index_t lane_num; /* number of rkeys obtained from peer */
Contributor:

why keep "lane_num" (which should be renamed to num_lanes) and "lane_idx" in this struct rather than directly in ucp_request.send.rndv_get? That way you could avoid this struct altogether:

            struct {
                     uint64_t             remote_address; /* address of the sender's data buffer */
                     uintptr_t            remote_request; /* pointer to the sender's send request */
                     ucp_lane_index_t     num_lanes;
                     ucp_lane_index_t     lane_idx;
                     ucp_request_t        *rreq;           /* receive request on the recv side */
                     uct_rkey_bundle_t    *rkey_bundle;      /* rendezvous-get remote keys */
            } rndv_get;

Contributor Author:

is there free space to add more values?

ucp_wireup_add_lane_desc(lane_descs, num_lanes_p, rsc_index, addr_index,
address_list[addr_index].md_index, score,
UCP_WIREUP_LANE_USAGE_RNDV, 0);
if ((params->field_mask & UCP_EP_PARAM_FIELD_ERR_HANDLING_MODE) &&
Contributor:

I thought we disable rndv altogether if error handling is enabled... @evgeny-leksikov?

@hoopoepg (Author) commented Nov 7, 2017:

in the original code the rndv lane is always configured; it is disabled only for the peer error handling mode

tl_bitmap, remote_md_map, 0,
&rsc_index, &addr_index, &score);
if (status == UCS_OK) {
/* Add lane description and remove all occurrences of the remote md */
Contributor:

why is it needed to remove occurrences of the remote md?
If you want to avoid multiple rndv lanes on the same device, you need to check the device name rather than the md.

/* In case if selected SHM transport - leave it alone, disable multirail */
break;
}
} else {
Contributor:

i would just do 'if (status!=OK) break' and avoid the 'else'

Contributor Author:

fixed

ucs_qsort_r(key->rma_lanes, UCP_MAX_LANES, sizeof(ucp_lane_index_t),
ucp_wireup_compare_lane_rma_score, lane_descs);
ucs_qsort_r(key->amo_lanes, UCP_MAX_LANES, sizeof(ucp_lane_index_t),
ucp_wireup_compare_lane_amo_score, lane_descs);

/* Do not sort rndv lanes to save original sequence */
Contributor:

it seems to contradict the comment '/* Sort RNDV, RMA and AMO lanes according to score */'

Contributor Author:

comment is updated

@@ -95,6 +95,12 @@

/**
* Size of statically-declared array
* May be used for arrays in structure values
*/
#define ucs_countof(_array) (sizeof(_array) / sizeof((_array)[0]))
Contributor:

not needed anymore?

Contributor Author:

removed.
ucs_static_array_size is not used either; remove it too?
BTW, ucs_static_array_size fails to compile with the Intel compiler (due to a static assert inside).

Contributor:

yes, let's remove it

@jenkinsornl: Build finished.

@jenkinsornl: Build finished.

@swx-jenkins1: Test FAILed. See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/3005/ for details.

@jenkinsornl: Build finished.

@mellanox-github: Test FAILed. See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/5078/ for details (Mellanox internal link).

- functions/values renaming, minor optimization
@swx-jenkins1: Test FAILed. See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/3007/ for details.

@jenkinsornl: Build finished.

@swx-jenkins1: Test FAILed. See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/3008/ for details.

@mellanox-github: Test FAILed. See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/5080/ for details (Mellanox internal link).

@swx-jenkins1: Test FAILed. See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/3009/ for details.

@mellanox-github: Test FAILed. See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/5081/ for details (Mellanox internal link).

@mellanox-github: Test FAILed. See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/5082/ for details (Mellanox internal link).

@yosefe added the WIP-DNM Work in progress / Do not review label Nov 24, 2017
@yosefe (Contributor) commented Dec 14, 2017:

@hoopoepg can we close this?

@hoopoepg (Author): Yes, I will make another PR.

@yosefe yosefe closed this Dec 14, 2017
Labels: Feature (New feature), Optimization (Code / performance optimization), WIP-DNM (Work in progress / Do not review)
8 participants