Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UCT/IB: Use separate resource domain for mlx5 trasnports #2338

Merged
merged 8 commits into from
Mar 26, 2018

Conversation

yosefe
Copy link
Contributor

@yosefe yosefe commented Feb 21, 2018

Fixes #1926

mlx5 transports manage their own blue-flame register state. In order to
avoid inconsistencies with verbs transports, they must not share that
register.
This change is using the verbs "resource domain" API in order to force
the separation, since QPs created on different resource domains will
have separate blue-flame register space.
@mellanox-github
Copy link
Contributor

Test FAILed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/6336/ for details (Mellanox internal link).

@swx-jenkins1
Copy link

Test FAILed.
See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/4059/ for details.

@mellanox-github
Copy link
Contributor

Test FAILed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/6339/ for details (Mellanox internal link).

@mellanox-github
Copy link
Contributor

Test FAILed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/6340/ for details (Mellanox internal link).


return res_domain->ibv_domain->context == dev->ibv_context;
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing #if HAVE_IBV_EXP_RES_DOMAIN

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@@ -919,7 +920,7 @@ ucs_status_t uct_rc_iface_qp_create(uct_rc_iface_t *iface, int qp_type,

# if HAVE_DECL_IBV_EXP_ATOMIC_HCA_REPLY_BE
if (dev->dev_attr.exp_atomic_cap == IBV_EXP_ATOMIC_HCA_REPLY_BE) {
qp_init_attr.comp_mask |= IBV_EXP_QP_INIT_ATTR_CREATE_FLAGS;
qp_init_attr.comp_mask |= IBV_EXP_QP_INIT_ATTR_CREATE_FLAGS;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

previous alignment was correct

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@@ -236,7 +236,8 @@ static UCS_CLASS_INIT_FUNC(uct_rc_mlx5_iface_t, uct_md_h md, uct_worker_h worker

UCS_CLASS_CALL_SUPER_INIT(uct_rc_iface_t, &uct_rc_mlx5_iface_ops, md, worker,
params, &config->super, 0,
sizeof(uct_rc_fc_request_t), IBV_EXP_TM_CAP_RC);
sizeof(uct_rc_fc_request_t), IBV_EXP_TM_CAP_RC,
UCT_IB_MLX5_RES_DOMAIN_KEY);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it ok that all accelerated transports use the same domain key?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes because they share the BF register structure as well (with the same key)

@yosefe yosefe added the WIP-DNM Work in progress / Do not review label Feb 22, 2018
@shamisp
Copy link
Contributor

shamisp commented Feb 26, 2018

And if you don't separate these ? I guess verbs and mlx5 will step on each other and you will end up having more dropped doorbells than you want ?

@yosefe
Copy link
Contributor Author

yosefe commented Feb 26, 2018

@shamisp yes, they step on each other. i've observed data corruption, though dropped doorbells can happen too

@shamisp
Copy link
Contributor

shamisp commented Feb 26, 2018

ouch... this is critical one.

BTW, Can we add some check (ucx info ?) validating that majority of BF doorbells are successful ?

@yosefe
Copy link
Contributor Author

yosefe commented Feb 26, 2018

reading the counters is not possible without root privileges..

@shamisp
Copy link
Contributor

shamisp commented Feb 26, 2018 via email

@yosefe yosefe added Bugfix and removed WIP-DNM Work in progress / Do not review labels Mar 8, 2018
@swx-jenkins1
Copy link

Test FAILed.
See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/4183/ for details.

@mellanox-github
Copy link
Contributor

Test FAILed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/6491/ for details (Mellanox internal link).

@swx-jenkins1
Copy link

Test FAILed.
See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/4191/ for details.

@mellanox-github
Copy link
Contributor

Test FAILed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/6499/ for details (Mellanox internal link).

@yosefe yosefe changed the title UCT/IB: Use separate resourcve domain for mlx5 trasnports UCT/IB: Use separate resource domain for mlx5 trasnports Mar 12, 2018
@yosefe
Copy link
Contributor Author

yosefe commented Mar 21, 2018

bot:mlx:retest

@mellanox-github
Copy link
Contributor

Test FAILed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/6600/ for details (Mellanox internal link).

@mellanox-github
Copy link
Contributor

Test FAILed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/6610/ for details (Mellanox internal link).

ibv_exp_destroy_res_domain],
[AC_DEFINE([HAVE_IBV_EXP_RES_DOMAIN], 1, [IB resource domain])],
[AC_MSG_WARN([Cannot use mlx5 accel because resource domains are not supported])
AC_MSG_WARN([Please upgrade MellanoxOFED to 3.1 or above])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add URL

@swx-jenkins1
Copy link

Test FAILed.
See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/4290/ for details.

@mellanox-github
Copy link
Contributor

Test PASSed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/6612/ for details (Mellanox internal link).

@yosefe
Copy link
Contributor Author

yosefe commented Mar 23, 2018

bot:bgate:retest

@swx-jenkins1
Copy link

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/4301/ for details.

@yosefe yosefe force-pushed the topic/uct-ib-mlx5-res-domain branch from b17b836 to fe17377 Compare March 25, 2018 11:30
@swx-jenkins1
Copy link

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/4304/ for details.

@mellanox-github
Copy link
Contributor

Test FAILed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/6629/ for details (Mellanox internal link).

@mellanox-github
Copy link
Contributor

Test PASSed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/6631/ for details (Mellanox internal link).

@yosefe
Copy link
Contributor Author

yosefe commented Mar 26, 2018

@brminich pls take a look

@yosefe yosefe merged commit 81939df into openucx:master Mar 26, 2018
@yosefe yosefe deleted the topic/uct-ib-mlx5-res-domain branch March 26, 2018 09:59
@artpol84
Copy link
Contributor

@yosefe
I confirm that lost doorbells was the case in my multi-threaded tests

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants