UCT/IB: Use separate resource domain for mlx5 transports #2338
mlx5 transports manage their own blue-flame register state. To avoid inconsistencies with verbs transports, the two must not share that register. This change uses the verbs "resource domain" API to force the separation, since QPs created on different resource domains have separate blue-flame register space.
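To illustrate why shared register state is a problem, here is a toy model in C. It is not driver code: every name is hypothetical, and it assumes the register expects successive doorbells to alternate between two halves of its buffer while each software layer tracks the next half privately.

```c
/* Toy model, not driver code. Assumption: the modeled register expects
 * successive doorbell writes to alternate between two buffer halves. */
typedef struct {
    int expected_half; /* half the modeled hardware expects next */
    int mismatches;    /* doorbells written to the wrong half */
} toy_bf_reg_t;

/* Each software layer keeps its own private view of which half is next. */
typedef struct {
    toy_bf_reg_t *reg;
    int           next_half;
} toy_owner_t;

static void toy_ring_doorbell(toy_owner_t *owner)
{
    if (owner->next_half != owner->reg->expected_half) {
        owner->reg->mismatches++;
    }
    owner->reg->expected_half ^= 1; /* register advances on every write */
    owner->next_half          ^= 1; /* owner only accounts for its own writes */
}

/* Interleave doorbells from two owners; return the total mismatch count. */
static int toy_run(toy_bf_reg_t *reg_a, toy_bf_reg_t *reg_b, int rounds)
{
    toy_owner_t verbs = { reg_a, 0 };
    toy_owner_t mlx5  = { reg_b, 0 };
    int i;

    for (i = 0; i < rounds; i++) {
        toy_ring_doorbell(&verbs);
        toy_ring_doorbell(&mlx5);
    }
    return (reg_a == reg_b) ? reg_a->mismatches
                            : reg_a->mismatches + reg_b->mismatches;
}
```

Passing the same `toy_bf_reg_t` for both owners produces mismatches as soon as the two interleave, while giving each owner its own register produces none. In this model, that is the effect of placing the QPs on different resource domains.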
Test FAILed.
src/uct/ib/base/ib_iface.c
    return res_domain->ibv_domain->context == dev->ibv_context;
}
missing #if HAVE_IBV_EXP_RES_DOMAIN
fixed
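A minimal sketch of the kind of guard the comment asks for. The `toy_*` types are hypothetical stand-ins for the verbs and UCT structures, and the fallback branch is an assumption made only so the sketch builds both ways, not the behavior of the actual patch.

```c
/* Normally defined by configure; defaulted here only so the sketch builds. */
#ifndef HAVE_IBV_EXP_RES_DOMAIN
#  define HAVE_IBV_EXP_RES_DOMAIN 1
#endif

/* Hypothetical stand-ins for the verbs and UCT types. */
typedef struct { int id; } toy_ibv_context_t;
typedef struct { toy_ibv_context_t *context; } toy_ibv_res_domain_t;
typedef struct { toy_ibv_res_domain_t *ibv_domain; } toy_res_domain_t;
typedef struct { toy_ibv_context_t *ibv_context; } toy_ib_device_t;

/* Returns nonzero if the resource domain belongs to the given device. */
static int toy_res_domain_cmp(const toy_res_domain_t *res_domain,
                              const toy_ib_device_t *dev)
{
#if HAVE_IBV_EXP_RES_DOMAIN
    return res_domain->ibv_domain->context == dev->ibv_context;
#else
    /* Assumption: without resource domain support there is nothing to
     * compare, so any entry is treated as a match. */
    (void)res_domain;
    (void)dev;
    return 1;
#endif
}
```

Because autoconf leaves the macro undefined when the check fails, `#if HAVE_IBV_EXP_RES_DOMAIN` evaluates to 0 in that case and the field access is never compiled.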
src/uct/ib/rc/base/rc_iface.c
@@ -919,7 +920,7 @@ ucs_status_t uct_rc_iface_qp_create(uct_rc_iface_t *iface, int qp_type,
 # if HAVE_DECL_IBV_EXP_ATOMIC_HCA_REPLY_BE
     if (dev->dev_attr.exp_atomic_cap == IBV_EXP_ATOMIC_HCA_REPLY_BE) {
-        qp_init_attr.comp_mask |= IBV_EXP_QP_INIT_ATTR_CREATE_FLAGS;
+        qp_init_attr.comp_mask |= IBV_EXP_QP_INIT_ATTR_CREATE_FLAGS;
previous alignment was correct
fixed
@@ -236,7 +236,8 @@ static UCS_CLASS_INIT_FUNC(uct_rc_mlx5_iface_t, uct_md_h md, uct_worker_h worker
     UCS_CLASS_CALL_SUPER_INIT(uct_rc_iface_t, &uct_rc_mlx5_iface_ops, md, worker,
                               params, &config->super, 0,
-                              sizeof(uct_rc_fc_request_t), IBV_EXP_TM_CAP_RC);
+                              sizeof(uct_rc_fc_request_t), IBV_EXP_TM_CAP_RC,
+                              UCT_IB_MLX5_RES_DOMAIN_KEY);
Is it ok that all accelerated transports use the same domain key?
Yes, because they also share the BF register structure (with the same key).
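One way to picture the shared key is a per-worker table keyed by an integer, where every transport passing the same key gets the same object. The sketch below is hypothetical throughout (`toy_worker_t`, `toy_domain_get`, the key values) and is not the UCX implementation.

```c
#include <stddef.h>

#define TOY_MAX_DOMAINS 8

/* Hypothetical keys: one for verbs transports, one for all mlx5 transports. */
enum { TOY_KEY_VERBS = 1, TOY_KEY_MLX5 = 2 };

typedef struct {
    int key;
    int refcount; /* 0 means the slot is unused */
} toy_keyed_domain_t;

typedef struct {
    toy_keyed_domain_t slots[TOY_MAX_DOMAINS];
} toy_worker_t;

/* Return the domain registered under 'key', creating it on first use.
 * Returns NULL if the table is full. */
static toy_keyed_domain_t *toy_domain_get(toy_worker_t *worker, int key)
{
    toy_keyed_domain_t *free_slot = NULL;
    size_t i;

    for (i = 0; i < TOY_MAX_DOMAINS; i++) {
        toy_keyed_domain_t *d = &worker->slots[i];
        if ((d->refcount > 0) && (d->key == key)) {
            d->refcount++; /* same key: share the existing domain */
            return d;
        }
        if ((d->refcount == 0) && (free_slot == NULL)) {
            free_slot = d;
        }
    }

    if (free_slot != NULL) {
        free_slot->key      = key;
        free_slot->refcount = 1;
    }
    return free_slot;
}

/* Drop one reference; the slot becomes reusable when the count hits zero. */
static void toy_domain_put(toy_keyed_domain_t *domain)
{
    if (domain->refcount > 0) {
        domain->refcount--;
    }
}
```

With this layout, two mlx5 transports calling `toy_domain_get(worker, TOY_KEY_MLX5)` receive the same pointer, while a verbs transport using `TOY_KEY_VERBS` gets a different one.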
And if you don't separate these? I guess verbs and mlx5 will step on each other, and you will end up with more dropped doorbells than you want?
@shamisp yes, they step on each other. I've observed data corruption, though dropped doorbells can happen too.
Ouch... this is a critical one. BTW, can we add some check (ucx info?) validating that the majority of BF doorbells are successful?
Reading the counters is not possible without root privileges.
yes...
…-res-domain Conflicts: src/uct/ib/mlx5/ib_mlx5.h
Test FAILed.
bot:mlx:retest
Test FAILed.
ibv_exp_destroy_res_domain],
[AC_DEFINE([HAVE_IBV_EXP_RES_DOMAIN], 1, [IB resource domain])],
[AC_MSG_WARN([Cannot use mlx5 accel because resource domains are not supported])
 AC_MSG_WARN([Please upgrade MellanoxOFED to 3.1 or above])
add URL
Test FAILed.
Test PASSed.
bot:bgate:retest
Test PASSed.
b17b836 to fe17377
Test PASSed.
Test FAILed.
Test PASSed.
@brminich pls take a look
@yosefe
Fixes #1926