UCX/RNDV: multirail - updated EP configuration #1981

Merged on Nov 16, 2017 (4 commits)

Contributor

@hoopoepg hoopoepg commented Nov 8, 2017

  • added array of lanes to be used by rndv-mrail
  • added array of memory handles for multirail

This PR is split out from #1894.
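As a rough illustration of what "an array of lanes" in the endpoint configuration could look like, here is a minimal sketch. All names in it (`sketch_ep_key_t`, `SKETCH_MAX_RNDV_LANES`, `SKETCH_NULL_LANE`, and the helpers) are hypothetical stand-ins, not the actual UCX definitions. Only the idea of a lane array whose unused slots hold a null-lane sentinel is taken from the diff excerpts quoted in the review below, and the assumption that valid lanes are packed at the front is mine.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the UCX types; not the real definitions. */
typedef uint8_t sketch_lane_index_t;

#define SKETCH_MAX_RNDV_LANES 4                          /* assumed upper bound */
#define SKETCH_NULL_LANE      ((sketch_lane_index_t)-1)  /* "no lane" sentinel */

typedef struct sketch_ep_key {
    /* Lanes usable for rendezvous; unused slots hold SKETCH_NULL_LANE. */
    sketch_lane_index_t rndv_lanes[SKETCH_MAX_RNDV_LANES];
} sketch_ep_key_t;

/* Mark every slot as unused. */
static inline void sketch_ep_key_reset(sketch_ep_key_t *key)
{
    int i;

    for (i = 0; i < SKETCH_MAX_RNDV_LANES; ++i) {
        key->rndv_lanes[i] = SKETCH_NULL_LANE;
    }
}

/* Nonzero if slot idx holds a usable lane. */
static inline int sketch_ep_is_rndv_lane_present(const sketch_ep_key_t *key,
                                                 int idx)
{
    return key->rndv_lanes[idx] != SKETCH_NULL_LANE;
}

/* Count the leading valid lanes (assumes valid lanes are packed first). */
static inline int sketch_ep_rndv_num_lanes(const sketch_ep_key_t *key)
{
    int n = 0;

    while ((n < SKETCH_MAX_RNDV_LANES) &&
           sketch_ep_is_rndv_lane_present(key, n)) {
        ++n;
    }
    return n;
}
```

A sentinel-terminated fixed array keeps the key trivially comparable and copyable, at the cost of a small scan when the lane count is needed.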

@jenkinsornl

Build finished.

@swx-jenkins1

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/3021/ for details.

@mellanox-github
Contributor

Test FAILed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/5094/ for details (Mellanox internal link).

@hoopoepg
Contributor Author

hoopoepg commented Nov 8, 2017

bot:mlx:retest

@swx-jenkins1

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/3028/ for details.

@jenkinsornl

Build finished.

@mellanox-github
Contributor

Test FAILed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/5100/ for details (Mellanox internal link).

@hoopoepg
Contributor Author

hoopoepg commented Nov 8, 2017

bot:ornl:retest

@mellanox-github
Contributor

Test FAILed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/5101/ for details (Mellanox internal link).

@hoopoepg
Contributor Author

hoopoepg commented Nov 8, 2017

bot:mlx:retest

@jenkinsornl

Build finished.

@mellanox-github
Contributor

Test FAILed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/5107/ for details (Mellanox internal link).

@jenkinsornl

Build finished.

@swx-jenkins1

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/3035/ for details.

@mellanox-github
Contributor

Test FAILed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/5114/ for details (Mellanox internal link).

@swx-jenkins1

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/3038/ for details.

@mellanox-github
Contributor

Test PASSed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/5117/ for details (Mellanox internal link).

@jenkinsornl

Build finished.

@hoopoepg
Contributor Author

hoopoepg commented Nov 9, 2017

bot:ornl:retest

@jenkinsornl

Build finished.

@hoopoepg
Contributor Author

hoopoepg commented Nov 9, 2017

@MattBBaker could you take a look at what is wrong?

@MattBBaker
Contributor

So the test actually succeeds, but then it hits a hiccup with the JSON it gets from GitHub when setting the build status to success.

@MattBBaker
Contributor

bot:ornl:retest

@hoopoepg
Contributor Author

hoopoepg commented Nov 9, 2017

ok, thank you

@jenkinsornl

Build finished.

@hoopoepg
Contributor Author

hoopoepg commented Nov 9, 2017

@MattBBaker something is wrong again

@hoopoepg
Contributor Author

hoopoepg commented Nov 9, 2017

bot:ornl:retest

@jenkinsornl

Build finished.

@swx-jenkins1

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/3052/ for details.

@mellanox-github
Contributor

Test FAILed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/5132/ for details (Mellanox internal link).

- additional fix for lane config

@jenkinsornl

Build finished.

@swx-jenkins1

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/3053/ for details.

@mellanox-github
Contributor

Test PASSed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/5133/ for details (Mellanox internal link).

Contributor

@brminich brminich left a comment

Maybe it would be better to first remove the RNDV lane and use the RMA lane instead? Otherwise you will need to remove half of the newly written code in the next PRs.

@@ -100,6 +100,10 @@ static ucs_config_field_t ucp_config_table[] = {
"the eager_zcopy protocol",
ucs_offsetof(ucp_config_t, ctx.rndv_perf_diff), UCS_CONFIG_TYPE_DOUBLE},

{"MAX_RNDV_LANES", "1",
"Set max multirail-get rendezvous lane numbers",
Contributor

Should we get rid of the word "multirail" here as well?

Contributor

multi-rail is still here

Contributor Author

@brminich @yosefe
Would "Set max rendezvous-get lane numbers" work?

Contributor

"Maximal number of devices on which a rendezvous operation may be executed in parallel."

Contributor Author

updated
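For context, a minimal sketch of how a setting like this could be described and parsed. It is not the UCX config machinery: the struct below keeps only the name, default, and help string seen in the diff excerpt and omits the offset and type fields, and `sketch_config_field_t`, `SKETCH_LANES_LIMIT`, and `sketch_parse_max_rndv_lanes` are hypothetical names. Clamping to a compile-time limit is also an assumption.

```c
#include <stdlib.h>

#define SKETCH_LANES_LIMIT 4  /* hypothetical compile-time upper bound */

/* Simplified table entry: name, default value, help text. */
typedef struct sketch_config_field {
    const char *name;
    const char *dfl_value;
    const char *doc;
} sketch_config_field_t;

static const sketch_config_field_t sketch_max_rndv_lanes_field = {
    "MAX_RNDV_LANES", "1",
    "Maximal number of devices on which a rendezvous operation may be "
    "executed in parallel."
};

/*
 * Parse the setting (falling back to the default when value is NULL) and
 * clamp the result to [1, SKETCH_LANES_LIMIT]. Malformed input yields 1.
 */
static unsigned sketch_parse_max_rndv_lanes(const char *value)
{
    const char *str = (value != NULL) ? value
                                      : sketch_max_rndv_lanes_field.dfl_value;
    char *end;
    long parsed = strtol(str, &end, 10);

    if ((end == str) || (*end != '\0') || (parsed < 1)) {
        return 1;
    }
    return (parsed > SKETCH_LANES_LIMIT) ? SKETCH_LANES_LIMIT
                                         : (unsigned)parsed;
}
```

With the default of "1", behavior stays single-lane unless the user opts in, which matches the default shown in the diff.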

@@ -53,6 +53,8 @@ typedef struct ucp_context_config {
int use_mt_mutex;
/** On-demand progress */
int adaptive_progress;
/** Rendezvous multirail support */
Contributor

multilanes?

Contributor

Do you agree with multilanes instead of multi-rail?

return ucp_ep_config(ep)->key.rndv_lanes[idx] != UCP_NULL_LANE;
}

static inline int ucp_ep_rndv_num_lanes(ucp_ep_h ep)
Contributor

Better to use UCS_F_ALWAYS_INLINE.

Contributor Author

All the other functions there are static inline.
I will add a separate PR to set UCS_F_ALWAYS_INLINE.
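To illustrate the distinction being discussed: plain `static inline` is only a hint, while macros of this kind typically expand to the GCC/Clang `always_inline` attribute. The sketch below uses a hypothetical macro, `SKETCH_F_ALWAYS_INLINE`, and assumes a GCC-compatible compiler; the exact UCX definition is not shown in this thread.

```c
/* Hypothetical macro for illustration; falls back to plain inline elsewhere. */
#if defined(__GNUC__)
#  define SKETCH_F_ALWAYS_INLINE inline __attribute__((always_inline))
#else
#  define SKETCH_F_ALWAYS_INLINE inline
#endif

/* Hint only: the compiler may or may not inline this call. */
static inline int sketch_add_hint(int a, int b)
{
    return a + b;
}

/* Forced: a GCC-compatible compiler inlines this even without optimization. */
static SKETCH_F_ALWAYS_INLINE int sketch_add_forced(int a, int b)
{
    return a + b;
}
```

Both functions behave identically; the difference is only in whether inlining is left to the optimizer.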

src/ucp/dt/dt.h Outdated
@@ -41,4 +42,26 @@ typedef struct ucp_dt_state {
size_t ucp_dt_pack(ucp_datatype_t datatype, void *dest, const void *src,
ucp_dt_state_t *state, size_t length);

#endif
static UCS_F_ALWAYS_INLINE void
ucp_dt_clear_rndv_lanes(ucp_dt_state_t *state)
Contributor

Maybe it is better not to mention RNDV, since the code is rather generic and lives in dt.h.

Contributor Author

renamed
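A minimal sketch of what a protocol-neutral helper of this kind might look like, assuming the datatype state carries one memory handle per lane (as the PR description's "array of memory handles" suggests). The function body is not shown in this thread, so everything here, including `sketch_dt_state_t`, `sketch_mem_h`, `SKETCH_DT_MAX_LANES`, and the helper names, is hypothetical.

```c
#include <stddef.h>

#define SKETCH_DT_MAX_LANES 4   /* hypothetical upper bound on lanes */

typedef void *sketch_mem_h;     /* stand-in for a transport memory handle */

typedef struct sketch_dt_state {
    size_t       offset;                      /* progress within the buffer */
    sketch_mem_h memh[SKETCH_DT_MAX_LANES];   /* one handle slot per lane */
} sketch_dt_state_t;

/* Reset all per-lane handles; the name deliberately avoids any protocol. */
static inline void sketch_dt_clear_lanes(sketch_dt_state_t *state)
{
    int i;

    for (i = 0; i < SKETCH_DT_MAX_LANES; ++i) {
        state->memh[i] = NULL;
    }
}

/* Nonzero if the given lane slot currently holds a handle. */
static inline int sketch_dt_is_lane_registered(const sketch_dt_state_t *state,
                                               int idx)
{
    return state->memh[idx] != NULL;
}
```

Keeping the name generic means other protocols that register memory on several lanes could reuse the same helper.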

@@ -340,7 +357,7 @@ static void ucp_rndv_handle_recv_contig(ucp_request_t *rndv_req, ucp_request_t *
} else {
if (rndv_rts_hdr->flags & UCP_RNDV_RTS_FLAG_PACKED_RKEY) {
UCS_PROFILE_CALL(uct_rkey_unpack, rndv_rts_hdr + 1,
&rndv_req->send.rndv_get.rkey_bundle);
&rndv_req->send.rndv_get.rkey_bundle);
Contributor

indentation

@jenkinsornl

Build finished.

@swx-jenkins1

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/3055/ for details.

@mellanox-github
Contributor

Test PASSed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/5135/ for details (Mellanox internal link).

@yosefe
Contributor

yosefe commented Nov 13, 2017

@brminich pls re-review

- updated comment & help wording

@jenkinsornl

Build finished.

@swx-jenkins1

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/3069/ for details.

@mellanox-github
Contributor

Test FAILed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/5152/ for details (Mellanox internal link).

@hoopoepg
Contributor Author

bot:mlx:retest

@mellanox-github
Contributor

Test FAILed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/5155/ for details (Mellanox internal link).

@hoopoepg
Contributor Author

bot:mlx:retest

@mellanox-github
Contributor

Test FAILed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/5183/ for details (Mellanox internal link).

@hoopoepg
Contributor Author

bot:mlx:retest

@mellanox-github
Contributor

Test FAILed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/5189/ for details (Mellanox internal link).

@hoopoepg
Contributor Author

bot:mlx:retest

@mellanox-github
Contributor

Test PASSed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/5191/ for details (Mellanox internal link).

@hoopoepg
Contributor Author

hoopoepg commented Nov 16, 2017

@miked-mellanox @alinask are you OK to merge this PR?
@yosefe and @brminich have approved it.

This PR is really blocking my work.

@mike-dubman mike-dubman merged commit 4bd4c13 into openucx:master Nov 16, 2017
8 participants