[DO NOT MERGE] create a new 'rapids' conda environment instead of installing packages in the 'base' environment #713

Closed
wants to merge 15 commits into from

jameslamb
Member

@jameslamb jameslamb commented Sep 30, 2024

Fixes #712

Caused by rapidsai/build-planning#56.

Nightly notebook runs have been failing here for the last couple days. It looks like the root cause is "mismatched RAPIDS nightly versions, e.g. using a newer cuml with an older libraft, because conda refuses to install packages depending on fmt>=11".

This proposes fixing that by installing all packages into a new rapids environment in the images produced here, instead of into the base environment. See #712 (comment) for more details.

Notes for Reviewers

Benefits of this change

Ensures that we can produce container images for 24.10, and all future releases as RAPIDS now pins to fmt>=11.0.2,<12.

Reduces the risk of similar conflicts the next time that conda / mamba (from conda-forge) and RAPIDS are using incompatible versions of fmt, spdlog, or any other dependencies they share.
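
To make the pin concrete, here is a minimal sketch of which versions the range `fmt>=11.0.2,<12` accepts. This is only an approximation and not conda's own version-matching logic. The helper name `version_in_range` and the sample versions are made up for illustration, and the sketch assumes GNU `sort -V` is available.

```shell
#!/usr/bin/env bash

# Hypothetical helper: succeeds when version $1 is >= $2 and < $3.
# Ordering comes from GNU `sort -V`, not from conda's solver.
version_in_range() {
    local ver="$1" lower="$2" upper="$3"
    # ver >= lower: the lower bound must sort first (or be equal).
    [ "$(printf '%s\n%s\n' "$lower" "$ver" | sort -V | head -n1)" = "$lower" ] || return 1
    # ver < upper: ver must differ from the upper bound and sort before it.
    [ "$ver" != "$upper" ] || return 1
    [ "$(printf '%s\n%s\n' "$ver" "$upper" | sort -V | head -n1)" = "$ver" ]
}

# Sample versions are illustrative only.
for v in 10.2.1 11.0.2 11.2.0; do
    if version_in_range "$v" 11.0.2 12; then
        echo "fmt $v satisfies >=11.0.2,<12"
    else
        echo "fmt $v does not satisfy >=11.0.2,<12"
    fi
done
```

If another package in the same environment can only work with an `fmt` outside this range, the two cannot be installed together, which is the kind of conflict a separate environment avoids.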

Costs of this change

This is a user-facing breaking change for anyone relying on these images having the RAPIDS packages in the base environment. Moving to the base environment was an explicit requirement of the rapidsai/docker overhaul done about 1.5 years ago in #539.

It'll require some changes to the RAPIDS deployment docs, for any places documenting the base environment, e.g.

https://github.com/rapidsai/deployment/blob/003524e074cf0df70730d63296618b12893b2e9c/source/examples/rapids-sagemaker-higgs/Dockerfile#L6-L10

How I tested this

On an x86_64 machine with CUDA 12.2, built the notebooks image locally:

docker build \
    --build-arg CUDA_VER=11.8.0 \
    --build-arg PYTHON_VER=3.12 \
    --build-arg LINUX_DISTRO=ubuntu \
    --build-arg LINUX_DISTRO_VER=22.04 \
    --build-arg LINUX_VER=ubuntu22.04 \
    --build-arg RAPIDS_VER=24.10 \
    --target notebooks \
    -f ./Dockerfile \
    -t rapidsai/base:delete-me-local \
    context/

Ran it with no command / entrypoint specified. Saw Jupyter Lab come up successfully, and I was able to run the cuml/arima_demo.ipynb notebook end-to-end successfully 🎉

docker run \
    --rm \
    --gpus "0,1" \
    -p 1234:8888 \
    -it rapidsai/base:delete-me-local

Confirmed that if you bypass the entrypoint, the libraries are still all found:

docker run \
    --rm \
    --gpus "0,1" \
    --entrypoint="" \
    -it rapidsai/base:delete-me-local \
    python -c "import cudf; print(cudf.__version__)"

# 24.10.00a400

docker run \
    --rm \
    --gpus "0,1" \
    --entrypoint="" \
    -it rapidsai/base:delete-me-local \
    which jupyter
# /opt/conda/envs/rapids/bin/jupyter

docker run \
    --rm \
    --gpus "0,1" \
    --entrypoint="" \
    -it rapidsai/base:delete-me-local \
    jupyter labextension list

# /opt/conda/envs/rapids/share/jupyter/labextensions
#         jupyterlab_pygments v0.3.0 enabled OK (python, jupyterlab_pygments)
#         dask-labextension v7.0.0 enabled OK (python, dask_labextension)
#        ...
#
# Disabled extensions:
#     @jupyterlab/apputils-extension:announcements

@jameslamb jameslamb added bug Something isn't working 2 - In Progress Currenty a work in progress labels Sep 30, 2024
@jameslamb jameslamb changed the title WIP: switch from 'mamba' to 'conda' executable WIP: create a new 'rapids' conda environment instead of installing packages in the 'base' environment Oct 1, 2024
@jameslamb jameslamb added the breaking Breaking change label Oct 1, 2024
@jameslamb
Member Author

The current state of this PR (c836a9d) is at least sufficient to get all CI passing, which confirms that #712 was the root cause here (not issues with the notebooks or library code).

(build link)

But as we've been discussing on #712, on its own it would cause an unacceptably large amount of breakage, and there might be ways to make this less disruptive: #712 (comment). I'll explore those next.

PATH=/opt/conda/envs/rapids/bin:/opt/conda/condabin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin \
PROJ_DATA=/opt/conda/envs/rapids/share/proj \
PROJ_NETWORK=ON \
XML_CATALOG_FILES="file:///opt/conda/envs/rapids/etc/xml/catalog file:///etc/xml/catalog"
Member Author

All of this is to make it feel like conda activate rapids was run even though it wasn't, in situations where you can't rely on the entrypoint script being run (described by @jacobtomlinson in #712 (comment)).

It's a hack and could hopefully be reverted completely in RAPIDS 24.12 through some combination of the following:

If reviewers agree with this approach, I'll write up a separate issue to track that work of reverting all of this.


NOTE: I'm intentionally not doing this in the raft-ann-bench images... those are expected to be used with explicit entrypoints, as far as I can tell, and I've modified all those entrypoints with the appropriate conda activation commands.
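
A rough illustration of why putting the environment's `bin` directory first on `PATH` makes executables resolve as if the environment had been activated. The temporary directories and the `demo-tool` executable are hypothetical stand-ins for `/opt/conda/bin` and `/opt/conda/envs/rapids/bin`; nothing here touches a real conda install.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Throwaway directories standing in for the base and rapids bin directories.
root="$(mktemp -d)"
mkdir -p "$root/base/bin" "$root/envs/rapids/bin"

# Put a fake `demo-tool` in each directory; it prints which one it lives in.
for env in base envs/rapids; do
    printf '#!/bin/sh\necho "%s"\n' "$env" > "$root/$env/bin/demo-tool"
    chmod +x "$root/$env/bin/demo-tool"
done

# Prepend the environment's bin directory ahead of the base one,
# mirroring the ordering of the PATH value in the ENV block.
PATH="$root/envs/rapids/bin:$root/base/bin:$PATH"

# The first match on PATH wins, so the environment's copy is picked up.
resolved="$(command -v demo-tool)"
demo_output="$(demo-tool)"
echo "resolved to: $resolved"
echo "tool reports: $demo_output"

rm -rf "$root"
```

This only covers executable lookup. A real activation also sets other variables, such as the `PROJ_DATA` and `XML_CATALOG_FILES` values shown in the ENV block, which is why those are set explicitly as well.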

@jameslamb jameslamb changed the title WIP: create a new 'rapids' conda environment instead of installing packages in the 'base' environment create a new 'rapids' conda environment instead of installing packages in the 'base' environment Oct 1, 2024
@jameslamb jameslamb marked this pull request as ready for review October 1, 2024 21:29
@jameslamb jameslamb requested a review from a team as a code owner October 1, 2024 21:29
Contributor

@bdice bdice left a comment

I'm fine with this if it works for the use cases that @jacobtomlinson defined in #712.

@jameslamb jameslamb added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currenty a work in progress labels Oct 1, 2024
Member

@jacobtomlinson jacobtomlinson left a comment

Thanks for digging into this @jameslamb. This seems like it should be a good workaround.

Once it has been merged and nightlies are pushed I'll check a few of the deployment places and verify things are working as expected.

@@ -88,7 +88,9 @@ jobs:
rapids-logger "nvidia-smi"
nvidia-smi
- name: Test notebooks
-run: /home/rapids/test_notebooks.py -i /home/rapids/notebooks -o /home/rapids/notebooks_output
+run: |
+. /opt/conda/etc/profile.d/conda.sh; conda activate rapids
Member

With the environment hacking, is this necessary?

Member Author

Ah you're right, it shouldn't be! I'll try reverting it.
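
As a sketch of how a CI step could check this instead of activating unconditionally: the helper names `env_bin_is_first_on_path` and `activate_rapids_if_needed` are hypothetical, and the check only looks at `PATH`, not at the other variables an activation would set. The two paths are the ones that appear in this PR.

```shell
#!/usr/bin/env bash

RAPIDS_ENV_BIN="/opt/conda/envs/rapids/bin"
CONDA_SH="/opt/conda/etc/profile.d/conda.sh"

# Hypothetical helper: succeeds when $1 is the first entry of PATH.
env_bin_is_first_on_path() {
    local env_bin="$1"
    [ "${PATH%%:*}" = "$env_bin" ]
}

# Hypothetical helper: only source conda.sh and activate when PATH
# does not already point at the rapids environment.
activate_rapids_if_needed() {
    if env_bin_is_first_on_path "$RAPIDS_ENV_BIN"; then
        echo "rapids env already first on PATH; skipping activation"
    elif [ -f "$CONDA_SH" ]; then
        . "$CONDA_SH"; conda activate rapids
    else
        echo "conda.sh not found; nothing to activate" >&2
        return 1
    fi
}

# Report the current state without changing anything.
if env_bin_is_first_on_path "$RAPIDS_ENV_BIN"; then
    echo "activation looks unnecessary"
else
    echo "activation would be needed"
fi
```

In an image built with the ENV block from this PR, the first branch is the one expected to run, which matches the conclusion that the explicit `conda activate rapids` in the workflow can be reverted.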

@jameslamb jameslamb added the 5 - DO NOT MERGE Hold off on merging; see PR for details label Oct 2, 2024
@jameslamb
Member Author

Put a DO NOT MERGE label on this... I'm hoping that conda-forge/mamba-feedstock#253 will resolve the root cause and we won't need this PR at all.

Will come back and update in a bit, once this build hopefully publishes new libmambapy packages: https://dev.azure.com/conda-forge/feedstock-builds/_build/results?buildId=1044497&view=results

@jameslamb jameslamb changed the title create a new 'rapids' conda environment instead of installing packages in the 'base' environment [DO NOT MERGE] create a new 'rapids' conda environment instead of installing packages in the 'base' environment Oct 2, 2024
@jameslamb
Member Author

Very happy to say.... upstream changes made this unnecessary 😁

details: #712 (comment)

Thanks for the help everyone!

@jameslamb jameslamb closed this Oct 2, 2024
@jameslamb jameslamb deleted the fix/notebook-tests branch October 2, 2024 19:07
Labels
3 - Ready for Review Ready for review by team 5 - DO NOT MERGE Hold off on merging; see PR for details breaking Breaking change bug Something isn't working
Development

Successfully merging this pull request may close these issues.

[BUG] Notebook tests failing on latest 24.10 nightlies
3 participants