-
Notifications
You must be signed in to change notification settings - Fork 52
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DO NOT MERGE] create a new 'rapids' conda environment instead of installing packages in the 'base' environment #713
Conversation
The current state of this PR (c836a9d) is at least sufficient to get all CI passing, which confirms that #712 was the root cause here (not issues with the notebooks or library code). But as we've been discussing on #712, it by itself would be an unacceptably large amount of breakage, and there might be ways to make this less disruptive: #712 (comment). I'll explore those next. |
PATH=/opt/conda/envs/rapids/bin:/opt/conda/condabin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin \ | ||
PROJ_DATA=/opt/conda/envs/rapids/share/proj \ | ||
PROJ_NETWORK=ON \ | ||
XML_CATALOG_FILES="file:///opt/conda/envs/rapids/etc/xml/catalog file:///etc/xml/catalog" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All of this is to make it feel like conda activate rapids
was run even though it wasn't, in situations where you can't rely on the entrypoint script being run (described by @jacobtomlinson in #712 (comment)).
It's a hack and could hopefully be reverted completely in RAPIDS 24.12 through some combination of the following:
conda
relaxing itsmamba
constraint and therefore itsfmt
constraint- (e.g.
conda install -y -n base conda 'fmt>=11'
working) - (see [BUG] Notebook tests failing on latest 24.10 nightlies #712 (comment))
- (e.g.
- RAPIDS removing its runtime dependency on
fmt
(Investigate removing spdlog from all public APIs build-planning#104)
If reviewers agree with this approach, I'll write up a separate issue to track that work of reverting all of this.
NOTE: I'm intentionally not doing this in the raft-ann-bench
images... those are expected to be used with explicit entrypoints, as far as I can tell, and I've modified all those entrypoints with the appropriate conda activation commands.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with this if it works for the use cases that @jacobtomlinson defined in #712.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for digging into this @jameslamb. This seems like it should be a good workaround.
Once it has been merged and nightlies are pushed I'll check a few of the deployment places and verify things are working as expected.
.github/workflows/test-notebooks.yml
Outdated
@@ -88,7 +88,9 @@ jobs: | |||
rapids-logger "nvidia-smi" | |||
nvidia-smi | |||
- name: Test notebooks | |||
run: /home/rapids/test_notebooks.py -i /home/rapids/notebooks -o /home/rapids/notebooks_output | |||
run: | | |||
. /opt/conda/etc/profile.d/conda.sh; conda activate rapids |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the environment hacking is this necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah you're right, it shouldn't be! I'll try reverting it.
Put a Will come back and update in a bit, once this build hopefully publishes new |
Very happy to say.... upstream changes made this unnecessary 😁 details: #712 (comment) Thanks for the help everyone! |
Fixes #712
Caused by rapidsai/build-planning#56.
Nightly notebook runs have been failing here for the last couple days. It looks like the root cause is "mismatched RAPIDS nightly versions, e.g. using a newer
cuml
with an olderlibraft
, becauseconda
refuses to install packages depending onfmt>=11
".This proposes fixing that by installing all packages into a new
rapids
environment in the images produced here, instead of into thebase
environment. See #712 (comment) for more details.Notes for Reviewers
Benefits of this change
Ensures that we can produce container images for 24.10, and all future releases as RAPIDS now pins to
fmt>=11.0.2,<12
.Reduces the risk of similar conflicts the next time that
conda
/mamba
(fromconda-forge
) and RAPIDS are using incompatible versions offmt
,spdlog
, or any other dependencies they share.Costs of this change
This is a user-facing breaking change for anyone relying on these images having the RAPIDS packages in the
base
environment. Moving to thebase
environment was an explicit requirement of therapidsai/docker
overhaul done about 1.5 years ago in #539.It'll require some changes to the RAPIDS deployment docs, for any places documenting the
base
environment, e.g.https://github.com/rapidsai/deployment/blob/003524e074cf0df70730d63296618b12893b2e9c/source/examples/rapids-sagemaker-higgs/Dockerfile#L6-L10
How I tested this
On an x86_64 machine with CUDA 12.2, built the
notebooks
image locally:Ran it with not command / entrypoint specified. Saw Jupyter Lab come up successfully, and I was able to run the
cuml/arima_demo.ipynb
notebook end-to-end successfully 🎉docker run \ --rm \ --gpus "0,1" \ -p 1234:8888 \ -it rapidsai/base:delete-me-local
Confirmed that if you bypass the entrypoint, the libraries are still all found (click me)