PSA: UCX 1.16 defaulting to protov2 #992

Open
pentschev opened this issue Oct 3, 2023 · 0 comments

pentschev commented Oct 3, 2023

The upcoming UCX 1.16 release will default to UCP protov2 (UCX_PROTO_ENABLE=y). We have been testing Dask-CUDA and UCX-Py with protov2 for a while and are not aware of any issues. In fact, some known issues on systems that are not fully NVLink-connected (e.g., DGX-1) are now fixed. Historically, however, large changes like this have carried an increased risk of new problems. To mitigate that risk, we strongly encourage duplicating all UCX-related testing with UCX 1.15.0 (released September 28, 2023, and already available on conda-forge), running the duplicate with UCX_PROTO_ENABLE=y. If possible, testing against the latest UCX master is preferred, although that requires building UCX from source.
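
The duplicated test runs can be scripted. The following is a minimal sketch, assuming the test suite is launched from a command line. The helpers `proto_env`, `run_with_proto`, and `run_duplicated` are hypothetical names for illustration and are not part of UCX-Py or Dask-CUDA; only the `UCX_PROTO_ENABLE` variable comes from this issue.

```python
import os
import subprocess
import sys


def proto_env(enable, base_env=None):
    """Return a copy of the environment with UCX_PROTO_ENABLE set to 'y' or 'n'."""
    env = dict(os.environ if base_env is None else base_env)
    env["UCX_PROTO_ENABLE"] = "y" if enable else "n"
    return env


def run_with_proto(cmd, enable):
    """Run cmd in a child process with the requested setting; return its exit code."""
    return subprocess.run(cmd, env=proto_env(enable)).returncode


def run_duplicated(cmd):
    """Run cmd once per setting; return exit codes keyed by the UCX_PROTO_ENABLE value."""
    return {flag: run_with_proto(cmd, flag == "y") for flag in ("n", "y")}
```

For example, `run_duplicated([sys.executable, "-m", "pytest", "tests/"])` would run a suite under both settings, where `tests/` is a placeholder path. Setting the variable per child process keeps the two runs isolated from each other and from the parent shell.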

Mitigation strategy: if we encounter any blocking issues, it is still possible to fall back to protov1 by explicitly setting UCX_PROTO_ENABLE=n, which can be set as the default in UCX-Py. This is a last resort: protov2 should become the new norm, and we need to adapt to it and resolve any issues we encounter.
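
As an illustration of the fallback, here is a minimal sketch of defaulting to protov1 from application code while still honoring an explicit user choice. `apply_proto_fallback` is a hypothetical helper, not a UCX-Py API, and how UCX-Py would actually wire in such a default is not specified in this issue. The sketch assumes it runs before UCX is initialized, since UCX reads its `UCX_*` configuration from the environment.

```python
import os


def apply_proto_fallback(environ=os.environ):
    """Default to protov1 unless UCX_PROTO_ENABLE is already set.

    Assumption: this is called before UCX is initialized (for example,
    before importing or initializing UCX-Py), so the value is picked up.
    """
    # setdefault leaves an explicit user setting (e.g. "y") untouched.
    environ.setdefault("UCX_PROTO_ENABLE", "n")
    return environ["UCX_PROTO_ENABLE"]
```

Using `setdefault` rather than plain assignment means anyone who exports `UCX_PROTO_ENABLE=y` can still opt in to protov2 while the fallback is in place.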
