
Switch to using strict channel priority during RAPIDS builds #84

Open

vyasr opened this issue Jul 22, 2024 · 2 comments

@vyasr (Contributor) commented Jul 22, 2024

RAPIDS conda packages currently do not install successfully when using strict channel priority. This has caused some difficulty for users in the past. Strict channel priority also generally leads to faster solves. The reason that RAPIDS requires flexible channel priority is that some packages have historically been published to both the rapidsai[-nightly] and conda-forge channels. Typically this occurred because RAPIDS needed specific versions/builds of packages that were not yet available on conda-forge. However, in recent years we have moved to a much stronger reliance on building and maintaining conda-forge packages as needed, so most of the packages we've done this for in the past (ucx, nccl) are now made regularly available on conda-forge and are no longer updated on the rapidsai[-nightly] channel.
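For context, strict channel priority can be enabled either globally or per solve. A minimal sketch (the environment name, channel list, and package specs here are illustrative, not a prescribed setup):

# Enable strict channel priority globally; this persists in ~/.condarc.
conda config --set channel_priority strict

# Or apply it to a single solve without changing the global config.
mamba create -n rapids-test --strict-channel-priority \
  -c rapidsai -c conda-forge rapids python=3.10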

We should clean out the old packages in the rapidsai[-nightly] channel that prevent strict solving from working. Rather than removing them altogether, we can move them under a new label so that old versions could still be installed by specifying that label (although in general installing old versions will be quite challenging without a fully specified environment lock file anyway, given how conda-forge's global pinnings move and how other packages there are released).
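Installing an old version from such a label would then look something like the following (the label name "legacy" is a hypothetical placeholder; no label has been chosen yet):

# Pull a package from a dedicated label on the rapidsai channel.
# "legacy" is illustrative; substitute whatever label we settle on.
conda install -c rapidsai/label/legacy -c conda-forge ucx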

@raydouglass (Member) commented

This is mostly documenting some of my tests for installing and running older versions of RAPIDS.

We also need to test arm64 installs, because RAPIDS supported arm64 before many conda-forge packages did, and we released those arm64 packages in our rapidsai conda channel.

This is the test script I used to check for import errors. It is not comprehensive. https://gist.github.com/raydouglass/ff100a114c2a370b68131af55959afc0
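The gist is the authoritative script, but a rough sketch of the kind of smoke test it performs looks like this (the package list is illustrative; the real script exercises more functionality):

# Try importing each core RAPIDS package and print its version;
# report any package whose import fails.
for pkg in cudf cuml cugraph rmm; do
  python -c "import ${pkg}; print('${pkg}', ${pkg}.__version__)" \
    || echo "FAILED: ${pkg}"
done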

Test machine:

  • Driver 550.78
  • System CTK 12.3
  • x86_64
  • Ubuntu 22.04.4
  • 2x Quadro RTX 8000
  • Tests were run bare-metal unless otherwise stated

Here is the conda list output for each environment below: https://gist.github.com/raydouglass/5948d6cab3d3c9f29cc02533bb2b4d25

23.02

Solves & tested: mamba create -n rapids-23.02 python=3.10 cudatoolkit=11.8 rapids=23.02

22.02

Solved with mamba create -n rapids-22.02 python=3.9 cudatoolkit=11.5 rapids=22.02.

Test errored with:

Traceback (most recent call last):
  File "/home/rdouglass/workspace/snippets/test_rapids.py", line 2, in <module>
    import cudf
  File "/home/rdouglass/mambaforge/envs/rapids-22.02/lib/python3.9/site-packages/cudf/__init__.py", line 5, in <module>
    validate_setup()
  File "/home/rdouglass/mambaforge/envs/rapids-22.02/lib/python3.9/site-packages/cudf/utils/gpu_utils.py", line 20, in validate_setup
    from rmm._cuda.gpu import (
  File "/home/rdouglass/mambaforge/envs/rapids-22.02/lib/python3.9/site-packages/rmm/__init__.py", line 16, in <module>
    from rmm import mr
  File "/home/rdouglass/mambaforge/envs/rapids-22.02/lib/python3.9/site-packages/rmm/mr.py", line 14, in <module>
    from rmm._lib.memory_resource import (
  File "/home/rdouglass/mambaforge/envs/rapids-22.02/lib/python3.9/site-packages/rmm/_lib/__init__.py", line 15, in <module>
    from .device_buffer import DeviceBuffer
  File "rmm/_lib/device_buffer.pyx", line 1, in init rmm._lib.device_buffer
TypeError: C function cuda.ccudart.cudaStreamSynchronize has wrong signature (expected __pyx_t_4cuda_7ccudart_cudaError_t (__pyx_t_4cuda_7ccudart_cudaStream_t), got cudaError_t (cudaStream_t))

I think this is a system CTK issue since running the script in the original unmodified rapidsai/rapidsai:22.02-cuda11.5-runtime-ubuntu20.04-py3.9 image works for cudf/cuml. I did not reinstall the rapids package in the container.
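For anyone reproducing that container check, the invocation would be along these lines (a sketch; the bind-mount path and script name are assumptions, and it presumes the NVIDIA Container Toolkit is installed):

# Run the import test inside the unmodified 22.02 image, exposing
# the host GPUs and bind-mounting the test script into the container.
docker run --rm --gpus all \
  -v "$PWD/test_rapids.py:/test_rapids.py" \
  rapidsai/rapidsai:22.02-cuda11.5-runtime-ubuntu20.04-py3.9 \
  python /test_rapids.py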

0.10

This is the first version with the rapids meta package.

Solves with mamba create -n rapids-0.10 python=3.6 cudatoolkit=9.2 rapids=0.10

I did not test this.

@vyasr (Contributor, Author) commented Sep 23, 2024

Now that we have an idea of what works, the next step is to figure out what could break with strict channel priority once packages are removed. The approach I would follow is to run the same installation commands as above, but with the --strict-channel-priority flag added. For a first pass, a dry run should be sufficient.

For each version of RAPIDS tested, inspect the output list of packages and find which ones are being installed from the rapidsai channel. If any of them are packages that we plan to remove from the rapidsai channel, we should add those to the command with a channel specifier, e.g. for ucx: mamba create ... rapids=${VERSION} conda-forge::ucx. This forces the conda solver to pull the ucx package from conda-forge instead of rapidsai[-nightly].

The dry runs should be sufficient to put together this list and evaluate what will solve. Once a complete list is compiled going back to 23.02 (selected because it was the last working version tested above, though we could go back further), we should actually create the environments (no dry run) and run the test script posted above to see whether any results change. I expect the dry runs will tell us most of what we need to know, though: unless there are incompatible binaries of the same package with the same version on two different channels (hopefully unlikely), a successful solve means we will get package versions that should work according to the constraints in our dependency spec.
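A sketch of that first pass (the version list, channel flags, and output filtering are illustrative; per-release python/cudatoolkit pins are omitted for brevity):

# Dry-run each release with strict priority and list anything that
# would still be pulled from the rapidsai channel.
for version in 23.02 23.04 23.06; do
  mamba create -n "rapids-${version}" --dry-run --strict-channel-priority \
    -c rapidsai -c conda-forge "rapids=${version}" 2>&1 \
    | grep -i "rapidsai" \
    || echo "rapids=${version}: nothing solved from rapidsai"
done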
