Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set UCT parameters from UCP API #7019

Merged
merged 1 commit into from
Sep 24, 2021
Merged

Conversation

changchengx
Copy link
Contributor

@changchengx changchengx commented Jun 29, 2021

What

  1. cache UCT configurations in ucp_config_modify API, apply UCT configurations when creating ucp_context and ucp_worker
  2. print UCT configurations in ucp_config_print API
  3. print unused cached UCT configurations in ucp_warn_unused_uct_config API

Why ?

Feture requirement

How ?

  1. cache input key/value in the ucp_config object
  2. copy the cached key/value into ucp_context, apply them to UCT configurations when creating ucp_context
  3. apply the cached key/value into UCT configurations when creating ucp_worker

@changchengx changchengx changed the title Set UCT parameters from UCP API [DNM/WIP]Set UCT parameters from UCP API Jun 29, 2021
@swx-jenkins3
Copy link
Collaborator

Can one of the admins verify this patch?

@changchengx changchengx force-pushed the set_uct_env branch 2 times, most recently from da464be to ecf088d Compare July 1, 2021 08:11
@changchengx changchengx force-pushed the set_uct_env branch 2 times, most recently from 5d802ee to d85f2e9 Compare July 13, 2021 06:52
@changchengx
Copy link
Contributor Author

It seems there's something wrong with the CI system. I checked some other PRs, they also meet with the same failure case.

@changchengx changchengx changed the title [DNM/WIP]Set UCT parameters from UCP API Set UCT parameters from UCP API Jul 15, 2021
src/ucp/core/ucp_context.c Show resolved Hide resolved
src/ucp/core/ucp_context.c Outdated Show resolved Hide resolved
src/ucp/core/ucp_context.c Outdated Show resolved Hide resolved
src/ucp/core/ucp_context.c Outdated Show resolved Hide resolved
src/ucp/core/ucp_context.c Show resolved Hide resolved
src/ucs/config/parser.c Outdated Show resolved Hide resolved
src/ucp/core/ucp_context.c Outdated Show resolved Hide resolved
src/ucp/core/ucp_context.c Outdated Show resolved Hide resolved
src/ucp/core/ucp_context.c Outdated Show resolved Hide resolved
src/ucs/config/parser.c Outdated Show resolved Hide resolved
src/ucp/core/ucp_context.c Outdated Show resolved Hide resolved
src/ucp/core/ucp_context.c Outdated Show resolved Hide resolved
src/uct/base/uct_md.c Outdated Show resolved Hide resolved
src/ucs/config/parser.h Outdated Show resolved Hide resolved
src/ucp/core/ucp_context.c Outdated Show resolved Hide resolved
src/ucp/core/ucp_context.c Outdated Show resolved Hide resolved
@changchengx
Copy link
Contributor Author

There's one conflict here. Shoud I rebase the old used master branch to the current master branch?

@changchengx
Copy link
Contributor Author

changchengx commented Jul 29, 2021

Confirmed by Bin & Yossi: we don't need to check the key

@leibin2014
Copy link
Contributor

Confirmed by Bin & Yossi: we don't need to check the key

No, actually we checked, we just don't need to double check here.

@changchengx
Copy link
Contributor Author

@leibin2014
I'm not sure why there're 5 cancelled CI test cases. There's no obvious log shows the failure reason.

src/ucp/core/ucp_context.c Outdated Show resolved Hide resolved
src/ucp/core/ucp_context.c Outdated Show resolved Hide resolved
src/uct/base/uct_md.c Outdated Show resolved Hide resolved
src/ucp/core/ucp_context.c Outdated Show resolved Hide resolved
src/ucp/core/ucp_context.c Outdated Show resolved Hide resolved
src/ucp/core/ucp_context.c Outdated Show resolved Hide resolved
src/ucp/core/ucp_context.c Outdated Show resolved Hide resolved
@changchengx changchengx force-pushed the set_uct_env branch 3 times, most recently from c73237c to 555266d Compare August 8, 2021 13:16
src/ucp/core/ucp_context.c Outdated Show resolved Hide resolved
src/ucp/core/ucp_context.c Outdated Show resolved Hide resolved
src/ucp/core/ucp_context.c Outdated Show resolved Hide resolved
Copy link
Contributor

@leibin2014 leibin2014 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more minor comments. Otherwise is ok for me.
@yosefe Could you please start review?

src/ucp/core/ucp_context.c Outdated Show resolved Hide resolved
src/ucp/core/ucp_context.c Outdated Show resolved Hide resolved
src/ucp/core/ucp_context.c Outdated Show resolved Hide resolved
src/ucp/core/ucp_context.c Outdated Show resolved Hide resolved
src/ucp/core/ucp_context.c Outdated Show resolved Hide resolved
src/ucp/core/ucp_context.c Outdated Show resolved Hide resolved
src/ucp/core/ucp_context.c Outdated Show resolved Hide resolved
src/ucp/core/ucp_context.c Outdated Show resolved Hide resolved
src/ucp/core/ucp_context.c Outdated Show resolved Hide resolved
src/ucp/core/ucp_context.h Outdated Show resolved Hide resolved
src/ucp/core/ucp_context.h Outdated Show resolved Hide resolved
src/uct/api/uct.h Outdated Show resolved Hide resolved
@changchengx
Copy link
Contributor Author

@leibin2014 Please help trigger test

@leibin2014
Copy link
Contributor

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 3 pipeline(s).

@changchengx
Copy link
Contributor Author

@yosefe I've changed the code according to your comments. I checked the failed CI cases, it seems that they're not related with this PR. Some other PRs also failed at these CI cases around these days.

@changchengx
Copy link
Contributor Author

@yosefe Pass all CI test cases.

@changchengx
Copy link
Contributor Author

It seems that below error is not related with this PR
https://dev.azure.com/ucfconsort/ucx/_build/results?buildId=25219&view=logs&jobId=388a871c-1c79-55fb-9b87-030460a06453

2021-09-13T02:55:09.0312560Z [1631501708.994886] [cb23a5575958:25813:0]       ib_device.c:1332 UCX  ERROR   ibv_create_ah(dlid=49152 sl=0 port=1 src_path_bits=0 dgid=fe80::bace:f6ff:fe0a:11f8 sgid_index=0 traffic_class=0) for UD mlx5 connect on mlx5_bond_1 failed: No such device
2021-09-13T02:55:09.0313481Z [cb23a5575958:25813:0:25813]    ud_iface.c:53   Fatal: iface 0x2505300: failed to get peer address

@changchengx
Copy link
Contributor Author

@changchengx
Copy link
Contributor Author

I finished code rebase because there's conflict with master branch.
All CI test cases have been passed.

src/ucp/core/ucp_context.c Show resolved Hide resolved
src/ucp/core/ucp_context.c Outdated Show resolved Hide resolved
src/ucp/core/ucp_context.c Outdated Show resolved Hide resolved
src/ucp/core/ucp_context.c Outdated Show resolved Hide resolved
src/ucp/core/ucp_context.c Outdated Show resolved Hide resolved
src/ucp/core/ucp_context.c Outdated Show resolved Hide resolved
src/ucp/core/ucp_context.h Outdated Show resolved Hide resolved
src/ucs/config/parser.h Outdated Show resolved Hide resolved
src/ucs/config/parser.h Show resolved Hide resolved
test/gtest/ucp/ucp_test.cc Outdated Show resolved Hide resolved
@changchengx
Copy link
Contributor Author

@yosefe I've updated the PR according to your 2nd round code-review. All CI test cases have been passed.

src/ucp/api/ucp.h Outdated Show resolved Hide resolved
src/ucp/core/ucp_context.c Outdated Show resolved Hide resolved
src/ucp/core/ucp_context.c Outdated Show resolved Hide resolved
src/ucp/core/ucp_worker.c Outdated Show resolved Hide resolved
src/ucs/config/parser.h Outdated Show resolved Hide resolved
src/ucs/config/parser.h Outdated Show resolved Hide resolved
src/ucs/config/parser.h Outdated Show resolved Hide resolved
src/ucs/config/parser.h Outdated Show resolved Hide resolved
src/ucp/core/ucp_worker.c Show resolved Hide resolved
src/ucp/core/ucp_worker.c Outdated Show resolved Hide resolved
src/ucp/core/ucp_worker.c Show resolved Hide resolved
@changchengx
Copy link
Contributor Author

@changchengx
Copy link
Contributor Author

@yosefe All CR comments have been fixed and CI test cases have been passed.

Copy link
Contributor

@yosefe yosefe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls squash without rebase

Signed-off-by: Changcheng Liu <jerrliu@nvidia.com>
@yosefe yosefe merged commit b6197c8 into openucx:master Sep 24, 2021
@changchengx changchengx deleted the set_uct_env branch September 25, 2021 14:42
changchengx added a commit to changchengx/ucx that referenced this pull request Sep 26, 2021
Backport below PR from master to integration3 branch
openucx#7019

Signed-off-by: Changcheng Liu <jerrliu@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants