
UCT/CUDA_IPC: MD_FLAG_INVALIDATE stub implementation #7522

Merged (1 commit, Oct 31, 2021)

Conversation

@Akshay-Venkatesh (Contributor)

What

Support MD_FLAG_INVALIDATE using rcache to track memory regions
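For background, deregistration with invalidation goes through the UCT v2 API. A minimal caller-side sketch, assuming the field and flag names from uct_v2.h (`md`, `memh`, and `comp` stand in for the caller's own state):

```cpp
/* Sketch only: requesting deregistration with invalidation via the UCT v2
 * API. `md`, `memh`, and `comp` are assumed to be set up by the caller. */
uct_md_mem_dereg_params_t params;

params.field_mask = UCT_MD_MEM_DEREG_FIELD_MEMH |
                    UCT_MD_MEM_DEREG_FIELD_FLAGS |
                    UCT_MD_MEM_DEREG_FIELD_COMPLETION;
params.memh       = memh;                             /* handle to deregister */
params.flags      = UCT_MD_MEM_DEREG_FLAG_INVALIDATE; /* invalidate remote keys */
params.comp       = &comp;          /* invoked once the invalidation completes */

ucs_status_t status = uct_md_mem_dereg_v2(md, &params);
```

The MD must advertise UCT_MD_FLAG_INVALIDATE in its capabilities for this to succeed, which is what this PR adds for cuda_ipc.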

@Akshay-Venkatesh (Contributor, Author)

@pentschev Can you check whether this PR prevents cuda_ipc from being dropped when error callbacks are used in UCX-Py?

@swx-jenkins4

Can one of the admins verify this patch?

@yosefe (Contributor) commented Oct 9, 2021

ok to test

@Akshay-Venkatesh (Contributor, Author)

@yosefe are the errors related here? I see tests unexpectedly ending with a terminated status.

@pentschev (Contributor)

Sorry for the delayed response, but I wanted to make sure I tested performance as well as our full test suite. Everything is looking good with this: performance is restored. Thanks so much @Akshay-Venkatesh !

@yosefe (Contributor) commented Oct 19, 2021

/azp run

@azure-pipelines

Azure Pipelines successfully started running 3 pipeline(s).

@yosefe (Contributor) commented Oct 20, 2021

> @yosefe are the errors related here? I see tests unexpectedly ending with a terminated status.

@Akshay-Venkatesh the failures seem related to this PR

@pentschev (Contributor)

Interestingly, this branch seemed to work before, but I now see various issues, for example:

```
[1634739065.124670] [dgx14:67606:0]    cuda_copy_ep.c:150  UCX  ERROR cudaMemcpyAsync() failed: invalid argument
```

Followed by messages on the Python side:

```
distributed.utils - ERROR - Unable to allocate 7.15 EiB for an array with shape (8243383320426316033,) and data type uint8
```

It seems that the size is being corrupted somewhere. One thing that changed in our environment since then is that we upgraded from MOFED 4 to MOFED 5, in case that information helps.

Some other tests segfault entirely:

```
[1634739418.316511] [dgx14:71163:0]           ib_md.c:349  UCX  ERROR ibv_reg_mr(address=0x7fc05387ae20, length=140449725546512, access=0xf) failed: Cannot allocate memory
[1634739418.316532] [dgx14:71163:0]          ucp_mm.c:153  UCX  ERROR failed to register address 0x7fc05387ae2b mem_type bit 0x2 length 140449725546496 on md[3]=mlx5_1: Input/output error (md reg_mem_types 0x3)
[1634739418.316537] [dgx14:71163:0]     ucp_request.c:501  UCX  ERROR failed to register user buffer datatype 0x8 address 0x7fb951e1a300 len 100000000: Input/output error
[dgx14:71163:0:71163]        rndv.c:1955 Assertion `status == UCS_OK' failed
==== backtrace (tid:  71163) ====
 0  /datasets/pentschev/miniconda3/envs/gdf/lib/libucs.so.0(ucs_handle_error+0x14c) [0x7fc05340a36c]
 1  /datasets/pentschev/miniconda3/envs/gdf/lib/libucs.so.0(ucs_fatal_error_message+0x68) [0x7fc0534071d8]
 2  /datasets/pentschev/miniconda3/envs/gdf/lib/libucs.so.0(+0x2b309) [0x7fc053407309]
 3  /datasets/pentschev/miniconda3/envs/gdf/lib/libucp.so.0(ucp_rndv_rtr_handler+0x7fc) [0x7fc053af32bc]
 4  /datasets/pentschev/miniconda3/envs/gdf/lib/ucx/libuct_ib.so.0(+0x38504) [0x7fc040429504]
 5  /datasets/pentschev/miniconda3/envs/gdf/lib/libucp.so.0(ucp_worker_progress+0x6a) [0x7fc053acae4a]
 6  /datasets/pentschev/miniconda3/envs/gdf/lib/python3.8/site-packages/ucp/_libs/ucx_api.cpython-38-x86_64-linux-gnu.so(+0x4b099) [0x7fc04ac3c099]
 7  /datasets/pentschev/miniconda3/envs/gdf/bin/python(+0x195528) [0x55ca707e6528]
 8  /datasets/pentschev/miniconda3/envs/gdf/bin/python(PyObject_Call+0x5e) [0x55ca7077616e]
 9  /datasets/pentschev/miniconda3/envs/gdf/bin/python(_PyEval_EvalFrameDefault+0x21bf) [0x55ca7081f4ef]
10  /datasets/pentschev/miniconda3/envs/gdf/bin/python(_PyEval_EvalCodeWithName+0x2c3) [0x55ca707ffdb3]
11  /datasets/pentschev/miniconda3/envs/gdf/bin/python(_PyFunction_Vectorcall+0x378) [0x55ca70801198]
12  /datasets/pentschev/miniconda3/envs/gdf/bin/python(_PyEval_EvalFrameDefault+0xa63) [0x55ca7081dd93]
13  /datasets/pentschev/miniconda3/envs/gdf/bin/python(+0x1806f3) [0x55ca707d16f3]
14  /datasets/pentschev/miniconda3/envs/gdf/lib/python3.8/lib-dynload/_asyncio.cpython-38-x86_64-linux-gnu.so(+0xb896) [0x7fc075ebf896]
15  /datasets/pentschev/miniconda3/envs/gdf/bin/python(_PyObject_MakeTpCall+0x31e) [0x55ca7078c30e]
16  /datasets/pentschev/miniconda3/envs/gdf/bin/python(+0x21beaf) [0x55ca7086ceaf]
17  /datasets/pentschev/miniconda3/envs/gdf/bin/python(+0x129082) [0x55ca7077a082]
18  /datasets/pentschev/miniconda3/envs/gdf/bin/python(PyVectorcall_Call+0x6e) [0x55ca7077ce4e]
19  /datasets/pentschev/miniconda3/envs/gdf/bin/python(_PyEval_EvalFrameDefault+0x5f25) [0x55ca70823255]
20  /datasets/pentschev/miniconda3/envs/gdf/bin/python(_PyFunction_Vectorcall+0x1a6) [0x55ca70800fc6]
21  /datasets/pentschev/miniconda3/envs/gdf/bin/python(_PyEval_EvalFrameDefault+0xa63) [0x55ca7081dd93]
22  /datasets/pentschev/miniconda3/envs/gdf/bin/python(_PyFunction_Vectorcall+0x1a6) [0x55ca70800fc6]
23  /datasets/pentschev/miniconda3/envs/gdf/bin/python(_PyEval_EvalFrameDefault+0xa63) [0x55ca7081dd93]
24  /datasets/pentschev/miniconda3/envs/gdf/bin/python(_PyFunction_Vectorcall+0x1a6) [0x55ca70800fc6]
25  /datasets/pentschev/miniconda3/envs/gdf/bin/python(_PyEval_EvalFrameDefault+0xa63) [0x55ca7081dd93]
26  /datasets/pentschev/miniconda3/envs/gdf/bin/python(_PyFunction_Vectorcall+0x1a6) [0x55ca70800fc6]
27  /datasets/pentschev/miniconda3/envs/gdf/bin/python(_PyEval_EvalFrameDefault+0xa63) [0x55ca7081dd93]
28  /datasets/pentschev/miniconda3/envs/gdf/bin/python(_PyEval_EvalCodeWithName+0x2c3) [0x55ca707ffdb3]
29  /datasets/pentschev/miniconda3/envs/gdf/bin/python(_PyFunction_Vectorcall+0x378) [0x55ca70801198]
30  /datasets/pentschev/miniconda3/envs/gdf/bin/python(_PyEval_EvalFrameDefault+0xa63) [0x55ca7081dd93]
31  /datasets/pentschev/miniconda3/envs/gdf/bin/python(_PyEval_EvalCodeWithName+0x2c3) [0x55ca707ffdb3]
32  /datasets/pentschev/miniconda3/envs/gdf/bin/python(_PyFunction_Vectorcall+0x378) [0x55ca70801198]
33  /datasets/pentschev/miniconda3/envs/gdf/bin/python(+0x1b0dfc) [0x55ca70801dfc]
34  /datasets/pentschev/miniconda3/envs/gdf/bin/python(PyObject_Call+0x5e) [0x55ca7077616e]
35  /datasets/pentschev/miniconda3/envs/gdf/bin/python(_PyEval_EvalFrameDefault+0x21bf) [0x55ca7081f4ef]
36  /datasets/pentschev/miniconda3/envs/gdf/bin/python(_PyFunction_Vectorcall+0x1a6) [0x55ca70800fc6]
37  /datasets/pentschev/miniconda3/envs/gdf/bin/python(+0x1b0dfc) [0x55ca70801dfc]
38  /datasets/pentschev/miniconda3/envs/gdf/bin/python(PyObject_Call+0x5e) [0x55ca7077616e]
39  /datasets/pentschev/miniconda3/envs/gdf/bin/python(_PyEval_EvalFrameDefault+0x21bf) [0x55ca7081f4ef]
40  /datasets/pentschev/miniconda3/envs/gdf/bin/python(_PyFunction_Vectorcall+0x1a6) [0x55ca70800fc6]
41  /datasets/pentschev/miniconda3/envs/gdf/bin/python(_PyEval_EvalFrameDefault+0xa63) [0x55ca7081dd93]
42  /datasets/pentschev/miniconda3/envs/gdf/bin/python(_PyEval_EvalCodeWithName+0x2c3) [0x55ca707ffdb3]
43  /datasets/pentschev/miniconda3/envs/gdf/bin/python(_PyFunction_Vectorcall+0x378) [0x55ca70801198]
44  /datasets/pentschev/miniconda3/envs/gdf/bin/python(_PyEval_EvalFrameDefault+0xa63) [0x55ca7081dd93]
45  /datasets/pentschev/miniconda3/envs/gdf/bin/python(_PyFunction_Vectorcall+0x1a6) [0x55ca70800fc6]
46  /datasets/pentschev/miniconda3/envs/gdf/bin/python(_PyEval_EvalFrameDefault+0x947) [0x55ca7081dc77]
47  /datasets/pentschev/miniconda3/envs/gdf/bin/python(_PyEval_EvalCodeWithName+0x2c3) [0x55ca707ffdb3]
48  /datasets/pentschev/miniconda3/envs/gdf/bin/python(_PyFunction_Vectorcall+0x378) [0x55ca70801198]
49  /datasets/pentschev/miniconda3/envs/gdf/bin/python(_PyEval_EvalFrameDefault+0x181e) [0x55ca7081eb4e]
50  /datasets/pentschev/miniconda3/envs/gdf/bin/python(_PyEval_EvalCodeWithName+0x2c3) [0x55ca707ffdb3]
51  /datasets/pentschev/miniconda3/envs/gdf/bin/python(PyEval_EvalCodeEx+0x39) [0x55ca70800e19]
52  /datasets/pentschev/miniconda3/envs/gdf/bin/python(PyEval_EvalCode+0x1b) [0x55ca708a324b]
53  /datasets/pentschev/miniconda3/envs/gdf/bin/python(+0x2522e3) [0x55ca708a32e3]
54  /datasets/pentschev/miniconda3/envs/gdf/bin/python(+0x26e543) [0x55ca708bf543]
55  /datasets/pentschev/miniconda3/envs/gdf/bin/python(PyRun_StringFlags+0x7d) [0x55ca708c3dad]
56  /datasets/pentschev/miniconda3/envs/gdf/bin/python(PyRun_SimpleStringFlags+0x3d) [0x55ca708c3e0d]
57  /datasets/pentschev/miniconda3/envs/gdf/bin/python(Py_RunMain+0x158) [0x55ca708c4aa8]
58  /datasets/pentschev/miniconda3/envs/gdf/bin/python(Py_BytesMain+0x39) [0x55ca708c4e79]
59  /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7fc097887bf7]
60  /datasets/pentschev/miniconda3/envs/gdf/bin/python(+0x1e6d69) [0x55ca70837d69]
=================================
```

@yosefe (Contributor) commented Oct 20, 2021 via email

I assume UCX was rebuilt from source after this upgrade?

@pentschev (Contributor)

> I assume UCX was rebuilt from source after this upgrade?

Yes, several times, and it sometimes causes the processes to go into an uninterruptible sleep state, requiring a reboot.

@Akshay-Venkatesh (Contributor, Author)

@yosefe This is also a target for ucx-1.12, and I'm not sure whether the errors we're seeing are related to the PR.

@yosefe (Contributor) commented Oct 26, 2021

> @yosefe This is also a target for ucx-1.12, and I'm not sure whether the errors we're seeing are related to the PR.

I think it's related: this PR consistently fails all GPU tests, while all other PRs run to completion successfully.

@pentschev (Contributor)

I also think they are related, as per the details in #7522 (comment). And to be clear, this is a blocker for UCX-Py in UCX 1.12.

@Akshay-Venkatesh (Contributor, Author)

> I also think they are related, as per the details in #7522 (comment). And to be clear, this is a blocker for UCX-Py in UCX 1.12.

@pentschev I'm looking at a possible issue with rcache alignment, but I don't expect the following errors to come from this PR:

```
[1634739418.316511] [dgx14:71163:0]           ib_md.c:349  UCX  ERROR ibv_reg_mr(address=0x7fc05387ae20, length=140449725546512, access=0xf) failed: Cannot allocate memory
[1634739418.316532] [dgx14:71163:0]          ucp_mm.c:153  UCX  ERROR failed to register address 0x7fc05387ae2b mem_type bit 0x2 length 140449725546496 on md[3]=mlx5_1: Input/output error (md reg_mem_types 0x3)
[1634739418.316537] [dgx14:71163:0]     ucp_request.c:501  UCX  ERROR failed to register user buffer datatype 0x8 address 0x7fb951e1a300 len 100000000: Input/output error
```

Are you applying https://github.com/openucx/ucx/pull/7522.diff on top of master to test md_invalidate support? If not, can you try that? The errors resemble those we fixed with #7485.

@Akshay-Venkatesh (Contributor, Author)

@pentschev I've updated the PR again, without using rcache. Can you check whether you still get cuda_ipc support with this? (Hopefully without the errors you saw previously.)

cc @yosefe
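
For context, a stub along these lines can complete the invalidation immediately, since a cuda_ipc registration handle holds nothing that must be revoked before the callback fires. A rough sketch, assuming the uct_md_mem_dereg_params_t fields from uct_v2.h (illustrative, not the PR's exact diff):

```cpp
/* Illustrative sketch, not the PR's exact diff: a stub dereg path that
 * accepts UCT_MD_MEM_DEREG_FLAG_INVALIDATE without an rcache. */
static ucs_status_t
uct_cuda_ipc_mem_dereg(uct_md_h md, const uct_md_mem_dereg_params_t *params)
{
    if ((params->field_mask & UCT_MD_MEM_DEREG_FIELD_FLAGS) &&
        (params->flags & UCT_MD_MEM_DEREG_FLAG_INVALIDATE)) {
        /* Stub: nothing to revoke on the exporting side, so signal the
         * invalidation completion right away (params->comp is required
         * whenever the invalidate flag is set). */
        uct_invoke_completion(params->comp, UCS_OK);
    }

    ucs_free(params->memh); /* release the registration handle as usual */
    return UCS_OK;
}
```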

@Akshay-Venkatesh changed the title from "UCT/CUDA_IPC: add rcache instance to support MD_FLAG_INVALIDATE" to "UCT/CUDA_IPC: MD_FLAG_INVALIDATE stub implementation" on Oct 27, 2021
@pentschev (Contributor)

Sorry @Akshay-Venkatesh, you're right. I mixed things up: the ibv_reg_mr errors aren't related to this PR; rather, they're caused by UCX_MEMTYPE_CACHE=n no longer being set, as I mentioned in #7575. This PR seems OK (no segfaults/errors), but to be safe I'll run the full UCX-Py test/benchmark set with it and report back later today or tomorrow morning.

@pentschev (Contributor)

@Akshay-Venkatesh tests and benchmarks are looking good with the PR, including CUDA IPC performance, so definitely a +1 to get this in for UCX-Py. Thanks so much for all the work here!

```diff
@@ -655,6 +655,10 @@ UCS_TEST_SKIP_COND_P(test_md, invalidate, !check_caps(UCT_MD_FLAG_INVALIDATE))
     ucs_status_t status;
     uct_md_mem_dereg_params_t params;

+    if (!strcmp(GetParam().md_name.c_str(), "cuda_ipc")) {
```
Contributor:

```cpp
if (GetParam().md_name == "cuda_ipc")
```

Contributor:

or better: has_cuda_ipc()

@Akshay-Venkatesh (Contributor, Author)

@yosefe the errors look unrelated to the PR.

@yosefe (Contributor) left a review:

pls squash without rebase

```diff
@@ -653,6 +653,10 @@ UCS_TEST_SKIP_COND_P(test_md, invalidate, !check_caps(UCT_MD_FLAG_INVALIDATE))
     ucs_status_t status;
     uct_md_mem_dereg_params_t params;

+    if (GetParam().md_name == "cuda_ipc") {
```
Contributor:

how about using has_cuda_ipc()?

Contributor (Author):

Noticed that has_cuda_ipc is a virtual method defined in the uct_test class, which is not accessible from this function.

Contributor:

ok, test_md does not inherit from uct_test
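
For reference, the guard that emerged from this thread would presumably look like the sketch below; UCS_TEST_SKIP_R is the skip macro used elsewhere in the UCX gtest suite, and the message text here is illustrative:

```cpp
// Sketch of the skip guard discussed above; since test_md does not inherit
// from uct_test, it matches the MD name instead of calling has_cuda_ipc().
if (GetParam().md_name == "cuda_ipc") {
    // cuda_ipc only provides a stub MD_FLAG_INVALIDATE implementation, so
    // the full invalidation flow cannot be exercised by this test.
    UCS_TEST_SKIP_R("stub MD_FLAG_INVALIDATE implementation for cuda_ipc");
}
```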
