Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UCP/WIREUP: Consider local distance during slow lanes dropping - v1.16.x - revert #9781

Conversation

ivankochin
Copy link
Contributor

What

Reverting the backport of #9512 since it breaks wire compatibility.

@ivankochin ivankochin changed the base branch from master to v1.16.x March 26, 2024 10:54
@ivankochin ivankochin self-assigned this Mar 26, 2024
@ivankochin ivankochin changed the title UCP/WIREUP: Consider local distance during slow lanes dropping - v1.16.x UCP/WIREUP: Consider local distance during slow lanes dropping - v1.16.x - revert Mar 26, 2024
@yosefe
Copy link
Contributor

yosefe commented Mar 26, 2024

Do we need to revert it on master branch?
How did you find the issue? Why we didn't catch it in CI?

@ivankochin
Copy link
Contributor Author

ivankochin commented Mar 26, 2024

Do we need to revert it on master branch?

No, we didn't, the fix works OK on master.

How did you find the issue?

It was just a logical conclusion. The original fixes that works for master checked that UCX version is >= 17 and then applied new logic, with older UCX version it leaves the logic as is to avoid different lanes performance on old UCX. If we somehow change that for UCX 1.16, it will break the logic for master branch.

Why we didn't catch it in CI?

I think this error can require some specific setup to be reproduced.

@yosefe
Copy link
Contributor

yosefe commented Mar 26, 2024

I think this error can require some specific setup to be reproduced.

Maybe because CI doesn't test current v1.16.x vs master branch?
@brminich WDYT?

@ivankochin
Copy link
Contributor Author

I think this error can require some specific setup to be reproduced.

Maybe because CI doesn't test current v1.16.x vs master branch? @brminich WDYT?

Yes, that should be the case, but also it should be the system with several NICs and these NICs should have different distance to GPU.

The original issue can be reproduced on DGX1

@yosefe
Copy link
Contributor

yosefe commented Mar 26, 2024

ok
but probably need to fix the commit title by force push

@ivankochin
Copy link
Contributor Author

ok but probably need to fix the commit title by force push

What's wrong with the commit title?

@brminich
Copy link
Contributor

I think this error can require some specific setup to be reproduced.

Maybe because CI doesn't test current v1.16.x vs master branch? @brminich WDYT?

we do test it https://github.com/openucx/ucx/blob/master/buildlib/pr/wire_compat.yml#L11
i think we need a specific server config to get it

@yosefe
Copy link
Contributor

yosefe commented Mar 26, 2024

I think this error can require some specific setup to be reproduced.

Maybe because CI doesn't test current v1.16.x vs master branch? @brminich WDYT?

we do test it https://github.com/openucx/ucx/blob/master/buildlib/pr/wire_compat.yml#L11 i think we need a specific server config to get it

I think in the current CI config - future master PRs would have failed after we merged something bad to v1.16.x
but we need a bad v1.16.x PR to fail. maybe need to test master branch PRs vs. release tags, and release branch PRs vs. master?

@ivankochin ivankochin force-pushed the ucp/wireup/consider-local-distance-duing-slow-lanes-dropping-1.16.x-revert branch from efe33b8 to 0287705 Compare March 27, 2024 08:58
@ivankochin
Copy link
Contributor Author

@yosefe pls review

@ivankochin ivankochin merged commit aca9748 into openucx:v1.16.x Mar 31, 2024
107 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants