UCT: fix #1502, #1513 #1532

Merged
merged 1 commit into from
May 24, 2017

evgeny-leksikov
Contributor

  • Fix hang in MPI_Finalize with UCX_TLS=rc[_x],sm

ucs_wtimer_add(&iface->async.slow_timer, &ep->slow_timer,
               uct_ud_slow_tick());
/* Cool down the timer on rescheduling/resending */
ep->tx.slow_tick *= 2;
Contributor

Let's make it a configuration option: UD_SLOW_TIMER_BACKOFF=2.0
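
A minimal sketch of what a configurable backoff could look like, replacing the hard-coded `*= 2`. Every type and function name here is hypothetical and only illustrates the idea; none of it is the UCX API, and the way the option value would reach the interface config is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins, not the real UCX structures. */
typedef struct {
    double   slow_timer_backoff; /* assumed to come from UD_SLOW_TIMER_BACKOFF */
    uint64_t slow_tick_base;     /* initial resend interval, in timer ticks */
} sketch_iface_config_t;

typedef struct {
    uint64_t slow_tick;          /* current resend interval of this endpoint */
} sketch_ep_tx_t;

/* Restore the initial interval, e.g. once the peer makes progress. */
static void sketch_ep_reset_tick(sketch_ep_tx_t *tx,
                                 const sketch_iface_config_t *cfg)
{
    tx->slow_tick = cfg->slow_tick_base;
}

/* Return the interval to arm the timer with now, then cool the timer
 * down by the configured factor for the next resend. */
static uint64_t sketch_ep_next_tick(sketch_ep_tx_t *tx,
                                    const sketch_iface_config_t *cfg)
{
    uint64_t current = tx->slow_tick;

    assert(cfg->slow_timer_backoff > 0.0);
    tx->slow_tick = (uint64_t)((double)current * cfg->slow_timer_backoff);
    return current;
}
```

With a factor of 2.0 and a base of 100 ticks, successive resends would be armed with 100, 200, 400, and so on until the interval is reset.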

@yosefe added the Bugfix label May 21, 2017
@yosefe added this to the v1.3 milestone May 21, 2017
@mellanox-github
Contributor

Test FAILed.
See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/1700/ for details.

@mellanox-github
Contributor

Test FAILed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/3638/ for details (Mellanox internal link).

@mellanox-github
Contributor

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/1704/ for details.

@mellanox-github
Contributor

Test FAILed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/3641/ for details (Mellanox internal link).

@mellanox-github
Contributor

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/1712/ for details.

@mellanox-github
Contributor

Test PASSed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/3649/ for details (Mellanox internal link).

@evgeny-leksikov
Contributor Author

@yosefe pls take a look again

@shamisp
Contributor

shamisp commented May 22, 2017

Looks like a bugfix. Do we need it for v1.2?

@@ -134,6 +134,7 @@ uct_ud_iface_complete_tx_inl(uct_ud_iface_t *iface, uct_ud_ep_t *ep,
uct_ud_iface_get_async_time(iface) -
ucs_twheel_get_time(&iface->async.slow_timer) +
uct_ud_slow_tick());
ep->tx.slow_tick = uct_ud_slow_tick();
Contributor

Let's avoid calling uct_ud_slow_tick() twice in the fast path: move this line before ucs_wtimer_add(), and pass ep->tx.slow_tick to ucs_wtimer_add(). Same for the other locations in the code.
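
The reordering suggested here could be sketched roughly as follows. The `demo_*` names are hypothetical stand-ins for `uct_ud_slow_tick()` and `ucs_wtimer_add()`, the tick value is made up, and the real delta computation in the diff has more terms than shown; the point is only that the tick is computed once and the stored value is reused.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Call counter, present only to make the saved call visible. */
static int demo_tick_calls = 0;

/* Hypothetical stand-in for uct_ud_slow_tick(); the value is arbitrary. */
static uint64_t demo_slow_tick(void)
{
    ++demo_tick_calls;
    return 100;
}

typedef struct {
    uint64_t slow_tick;   /* mirrors ep->tx.slow_tick */
    uint64_t armed_delta; /* delay the timer was last armed with */
} demo_ep_t;

/* Hypothetical stand-in for ucs_wtimer_add(): records the delay only. */
static void demo_timer_add(demo_ep_t *ep, uint64_t delta)
{
    ep->armed_delta = delta;
}

/* Assign the tick first, then pass the stored value to the timer,
 * so the tick function is called once per completion. */
static void demo_complete_tx(demo_ep_t *ep)
{
    assert(ep != NULL);
    ep->slow_tick = demo_slow_tick();
    demo_timer_add(ep, ep->slow_tick);
}
```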

-    /* Ignore warnings about empty memory pool */
-    if (level == UCS_LOG_LEVEL_ERROR) {
+    /* Ignore warnings about empty memory pool or EP failure */
+    if ((level == UCS_LOG_LEVEL_ERROR) || (level == UCS_LOG_LEVEL_WARN)) {
Contributor

IMHO it would be better to change ucs_warn("Error %s was not handled for ep %p"...) to ucs_error("Unhandled error %s for ep %p"...)

@@ -433,6 +433,14 @@ UCS_CLASS_INIT_FUNC(uct_ud_iface_t, uct_ud_iface_ops_t *ops, uct_md_h md,
self->config.tx_qp_len = config->super.tx.queue_len;
self->config.peer_timeout = ucs_time_from_sec(config->peer_timeout);

if (config->slow_timer_backoff <= 0.) {
    ucs_error("The slow timer back off should be > 0(%lf)",
Contributor

pls add a space after the 0, i.e. "> 0 (%lf)"
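
For illustration, a standalone sketch of the same check with the corrected message spacing. `cfg_check_backoff` and its return convention are hypothetical; the actual code reports through ucs_error() inside the interface init function rather than into a buffer.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical validator: returns 0 if the backoff factor is usable,
 * -1 otherwise, in which case a diagnostic is written into buf. */
static int cfg_check_backoff(double backoff, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);

    if (backoff <= 0.0) {
        /* Note the space between "0" and the offending value. */
        snprintf(buf, len, "The slow timer back off should be > 0 (%lf)",
                 backoff);
        return -1;
    }

    buf[0] = '\0'; /* no diagnostic on success */
    return 0;
}
```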

@@ -402,6 +406,7 @@ uct_ud_ep_process_ack(uct_ud_iface_t *iface, uct_ud_ep_t *ep,

ucs_arbiter_group_schedule(&iface->tx.pending_q, &ep->tx.pending.group);

ep->tx.slow_tick = uct_ud_slow_tick();
Contributor

Maybe it would be better to remember the "backoff" multiplier instead of "slow_tick", so that resetting the backoff would simply be assigning 1.0 instead of a floating-point multiplication?

Contributor Author

Then maybe it's even better to cache the uct_ud_slow_tick() value in ud_iface.async?

Contributor

yes, let's do that
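
One way the agreed approach might look, as a rough sketch: the base tick is stored once in the interface, resetting an endpoint becomes a plain assignment, and the multiplication happens only on the resend path. All `cache_*` names and the struct layout are hypothetical, not the merged UCX code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical interface: the base tick is cached at init time. */
typedef struct {
    struct {
        uint64_t slow_tick;      /* cached base tick */
    } async;
    double slow_timer_backoff;   /* configured backoff factor */
} cache_iface_t;

typedef struct {
    uint64_t slow_tick;          /* current resend interval of the endpoint */
} cache_ep_tx_t;

/* base_tick is assumed to be what uct_ud_slow_tick() would return. */
static void cache_iface_init(cache_iface_t *iface, uint64_t base_tick,
                             double backoff)
{
    assert(backoff > 0.0);
    iface->async.slow_tick    = base_tick;
    iface->slow_timer_backoff = backoff;
}

/* ACK path (fast path): reset is a load and a store, nothing else. */
static void cache_ep_on_ack(const cache_iface_t *iface, cache_ep_tx_t *tx)
{
    tx->slow_tick = iface->async.slow_tick;
}

/* Resend path (slow path): the only place the multiplication happens. */
static void cache_ep_on_resend(const cache_iface_t *iface, cache_ep_tx_t *tx)
{
    tx->slow_tick = (uint64_t)((double)tx->slow_tick *
                               iface->slow_timer_backoff);
}
```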

@yosefe
Contributor

yosefe commented May 23, 2017

@alex-mikheev pls take a look as well

@yosefe
Contributor

yosefe commented May 23, 2017

@shamisp yes, will need to port to v1.2

@mellanox-github
Contributor

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/1718/ for details.

@mellanox-github
Contributor

Test FAILed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/3659/ for details (Mellanox internal link).

@shamisp mentioned this pull request May 23, 2017
@mellanox-github
Contributor

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/1722/ for details.

Contributor

@yosefe left a comment

Looks good, can you please squash?

- Fix hang in MPI_Finalize with UCX_TLS=rc[_x],sm
@evgeny-leksikov
Contributor Author

@yosefe done

@mellanox-github
Contributor

Test FAILed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/3662/ for details (Mellanox internal link).

@mellanox-github
Contributor

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/1725/ for details.

@mellanox-github
Contributor

Test FAILed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/3665/ for details (Mellanox internal link).

@evgeny-leksikov
Contributor Author

bot:retest

@mellanox-github
Contributor

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/1731/ for details.

@mellanox-github
Contributor

Test PASSed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/3672/ for details (Mellanox internal link).

@mellanox-github
Contributor

Test PASSed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/3673/ for details (Mellanox internal link).

@yosefe merged commit c8f891f into openucx:master May 24, 2017

4 participants