-
Notifications
You must be signed in to change notification settings - Fork 423
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
UCT: fix #1502, #1513 #1532
UCT: fix #1502, #1513 #1532
Conversation
evgeny-leksikov
commented
May 21, 2017
- Fix hang in MPI_Finalize with UCX_TLS=rc[_x],sm
src/uct/ib/ud/base/ud_ep.c
Outdated
ucs_wtimer_add(&iface->async.slow_timer, &ep->slow_timer, | ||
uct_ud_slow_tick()); | ||
/* Cool down the timer on rescheduling/resending */ | ||
ep->tx.slow_tick *= 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's make it configuration option - UD_SLOW_TIMER_BACKOFF=2.0
Test FAILed. |
Test FAILed. |
Test PASSed. |
Test FAILed. |
Test PASSed. |
Test PASSed. |
@yosefe pls take a look again |
Looks like bugfix. Do we need it for v1.2 |
src/uct/ib/ud/base/ud_inl.h
Outdated
@@ -134,6 +134,7 @@ uct_ud_iface_complete_tx_inl(uct_ud_iface_t *iface, uct_ud_ep_t *ep, | |||
uct_ud_iface_get_async_time(iface) - | |||
ucs_twheel_get_time(&iface->async.slow_timer) + | |||
uct_ud_slow_tick()); | |||
ep->tx.slow_tick = uct_ud_slow_tick(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's avoid calling uct_ud_slow_tick() twice in fast path - move this line before ucs_wtimer_add(), and pass ep->tx.slow_tick to ucs_wtimer_add(). same for other locations in the code.
test/gtest/common/test.cc
Outdated
/* Ignore warnings about empty memory pool */ | ||
if (level == UCS_LOG_LEVEL_ERROR) { | ||
/* Ignore warnings about empty memory pool or EP failure */ | ||
if ((level == UCS_LOG_LEVEL_ERROR) || (level == UCS_LOG_LEVEL_WARN)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO would be better to change ucs_warn("Error %s was not handled for ep %p"...) to ucs_error("Unhandled error %s for ep %p"...)
src/uct/ib/ud/base/ud_iface.c
Outdated
@@ -433,6 +433,14 @@ UCS_CLASS_INIT_FUNC(uct_ud_iface_t, uct_ud_iface_ops_t *ops, uct_md_h md, | |||
self->config.tx_qp_len = config->super.tx.queue_len; | |||
self->config.peer_timeout = ucs_time_from_sec(config->peer_timeout); | |||
|
|||
if (config->slow_timer_backoff <= 0.) { | |||
ucs_error("The slow timer back off should be > 0(%lf)", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls add space after 0
src/uct/ib/ud/base/ud_ep.c
Outdated
@@ -402,6 +406,7 @@ uct_ud_ep_process_ack(uct_ud_iface_t *iface, uct_ud_ep_t *ep, | |||
|
|||
ucs_arbiter_group_schedule(&iface->tx.pending_q, &ep->tx.pending.group); | |||
|
|||
ep->tx.slow_tick = uct_ud_slow_tick(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe would be better to remember then "backoff" multiplier instead of "slow_tick", so resetting the backoff would be simply assigning 1.0 instead of floating point multiplication?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then maybe even better to cache uct_ud_slow_tick()
value in ud_iface.async?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, let's do that
@alex-mikheev pls take a look as well |
@shamisp yes, will need to port to v1.2 |
Test PASSed. |
Test FAILed. |
Test PASSed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, can you please squash?
- Fix hang in MPI_Finalize with UCX_TLS=rc[_x],sm
31679e1
to
d2a722a
Compare
@yosefe done |
Test FAILed. |
Test PASSed. |
Test FAILed. |
bot:retest |
Test PASSed. |
Test PASSed. |
Test PASSed. |