
Add pytorch-notebook image variants with cuda 11 and 12 (x86_64 versions only) #2091

Merged: 19 commits merged into jupyter:main on Feb 24, 2024
Conversation

@johanna-reiml-hpi (Contributor) commented Feb 6, 2024

Describe your changes

This reworks the build pipeline to allow building CUDA variants of the pytorch-notebook image. Example images: https://quay.io/repository/ai-hpi/pytorch-notebook

  • Commit "feat: build cuda variants of pytorch" (43f2d32): a minimal example of how to add CUDA variants for the pytorch-notebook image.
  • Commit "feat: build with variant tag" (6198575): reworks the pipeline to build these images under the pytorch-notebook tag instead (adding a "cuda-" and "cuda11-" prefix). This logic is implemented by adding a new "variant" argument to the GitHub Actions and the tagging scripts. I also refactored docker.yaml to get rid of the redundancy between the aarch64 and x86_64 builds.

In the issue there were discussions about using the official CUDA images instead of ubuntu:22.04 as the root container. In my opinion this is not desirable, as those images are not updated frequently and are based on older ubuntu:22.04 releases. A simpler approach works instead: packages like pytorch and tensorflow bundle their own CUDA binaries via pip these days, as sketched below.
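
A minimal sketch of this approach (the exact package set in the merged Dockerfile may differ; the index URL matches the one discussed later in this thread):

      # The cu121 wheel index serves torch builds that pull in the CUDA
      # runtime as ordinary pip dependencies (nvidia-* wheels), so no
      # CUDA base image is needed.
      pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cu121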

I looked into CUDA builds for TensorFlow and PyTorch (I am not sure if CUDA support is really a priority for other notebooks). In order to support NVIDIA driver versions older than 525, builds for CUDA 11.8 should also be included.

  • For PyTorch (https://pytorch.org/get-started/locally/) everything works as expected. I also added NVIDIA's pip index for their most recent packages as an install argument --extra-index-url=https://pypi.nvidia.com, but it's likely not needed. (A quick GPU smoke test is sketched after this list.)
  • For TensorFlow (https://www.tensorflow.org/install/source?hl=en#gpu) there were multiple issues:
    • CUDA 11: The most recent version (2.15.0) updated the CUDA dependencies from 11.8 to 12.2, so the last version which supports CUDA 11.8 is 2.14.0. However, that version is not functional under Python 3.11, because a dependency (tensorrt==8.5.3.1) is not available for Python 3.11 (see tensorflow/tensorflow#61986, "tensorrt==8.5.3.1 from [and-cuda] not available in Python 3.11").
    • CUDA 12 (latest): Using pip install --no-cache-dir --extra-index-url=https://pypi.nvidia.com tensorrt tensorflow[and-cuda] didn't work for me; tensorflow was unable to find tensorrt. I assume the workaround in the linked issue would work, but it seems better to either build the package manually, try nightly builds, or wait for 2.16.0.

Issue ticket if applicable

Fix (for pytorch): #1557

Checklist (especially for first-time contributors)

@johanna-reiml-hpi (Contributor, Author) commented Feb 6, 2024

From what I can see, the build for aarch64 fails due to running out of disk space (it seems to error when archiving the image; I had similar errors for CUDA builds running out of disk space). Is there an issue with also enabling this step for the self-hosted aarch64 runners? Right now it only runs for x86_64.

      # Image with CUDA needs extra disk space
      - name: Free disk space 🧹
        if: contains(inputs.variant, 'cuda') && inputs.platform == 'x86_64'
        uses: jlumbroso/free-disk-space@54081f138730dfa15788a46383842cd2f914a1be
        with:
          tool-cache: false
          android: true
          dotnet: true
          haskell: true
          large-packages: false
          docker-images: false
          swap-storage: false

@mathbunnyru (Member) commented:
> Is there an issue with also enabling this part for the self-hosted aarch64 runners?

Self-hosted runners are not usually bloated with lots of software. I created our aarch64 runners using ubuntu provided by Google Cloud and then ran this script on top: https://github.com/jupyter/docker-stacks/blob/main/aarch64-runner/setup.sh

Could you please tell me how much free disk space is needed to build one GPU image?
I am not sure people need aarch64 GPU images, so let's not build them in this PR.
Also, this will increase the build time (we have a limited number of self-hosted runners).

@mathbunnyru (Member) commented:
@johanna-reiml-hpi I think the quality of this PR is awesome 👍
Great job.
I left a few comments on things that could be improved; please take a look.

We now have a policy for merging new images and adding new packages 🎉

@mathbunnyru, @consideRatio, @yuvipanda, and @manics, please vote 👍 to accept this change and 👎 not to accept it (use a reaction to this message)
The voting deadline is the 8th of March (a month since I posted this message).
The change is accepted if there are at least 2 positive votes.

We can have a discussion until the deadline, so please express your opinions.

@johanna-reiml-hpi (Contributor, Author) commented:

> Is there an issue with also enabling this part for the self-hosted aarch64 runners?
>
> Self-hosted runners are not usually bloated with lots of software. I created our aarch64 runners using ubuntu provided by Google Cloud and then ran this script on top: https://github.com/jupyter/docker-stacks/blob/main/aarch64-runner/setup.sh
>
> Could you please tell me how much free disk space is needed to build one GPU image? I am not sure people need aarch64 GPU images, so let's not build them in this PR. Also, this will increase the build time (we have a limited number of self-hosted runners).

I am not sure exactly how much space is needed, but the 14 GB provided by ubuntu-latest weren't enough and resulted in failed builds. I assume around 20 GB in total would be sufficient.

@mathbunnyru (Member) left a review:

Only small comments. If you have some time, let's change these small details as well.

@mathbunnyru (Member) commented:
> Is there an issue with also enabling this part for the self-hosted aarch64 runners?
>
> Self-hosted runners are not usually bloated with lots of software. I created our aarch64 runners using ubuntu provided by Google Cloud and then ran this script on top: https://github.com/jupyter/docker-stacks/blob/main/aarch64-runner/setup.sh
> Could you please tell me how much free disk space is needed to build one GPU image? I am not sure people need aarch64 GPU images, so let's not build them in this PR. Also, this will increase the build time (we have a limited number of self-hosted runners).
>
> I am not sure exactly how much space is needed, but the 14 GB provided by ubuntu-latest weren't enough and resulted in failed builds. I assume around 20 GB in total would be sufficient.

I cleaned up the space on our aarch64 runners.
Unfortunately, Docker has an issue where cleaning up doesn't remove everything it should, so I had to clean up manually.

@mathbunnyru (Member) commented Feb 12, 2024:

It would be nice to have a GPU version of tensorflow as well.
I saw the comment in the description, so thank you for trying.

I would suggest also trying the conda-forge version of the tensorflow package once we merge this PR.
If it works, create another PR after this one is merged (it should be small and easy); if not, there's nothing to do.
In both cases, please write in this PR whether it worked.

@mathbunnyru (Member) commented:
> Is there an issue with also enabling this part for the self-hosted aarch64 runners?
>
> Self-hosted runners are not usually bloated with lots of software. I created our aarch64 runners using ubuntu provided by Google Cloud and then ran this script on top: https://github.com/jupyter/docker-stacks/blob/main/aarch64-runner/setup.sh
> Could you please tell me how much free disk space is needed to build one GPU image? I am not sure people need aarch64 GPU images, so let's not build them in this PR. Also, this will increase the build time (we have a limited number of self-hosted runners).
>
> I am not sure exactly how much space is needed, but the 14 GB provided by ubuntu-latest weren't enough and resulted in failed builds. I assume around 20 GB in total would be sufficient.
>
> I cleaned up the space on our aarch64 runners. Unfortunately, Docker has an issue where cleaning up doesn't remove everything it should, so I had to clean up manually.

Seems to work - the build is green 🎉

@mathbunnyru (Member) commented Feb 12, 2024:

One more thought - I don't know a lot about cuda versioning, but it might be worth adding a more precise cuda version as a tag prefix (so we'd need to add a new tagger for cuda-flavored images).
People might want to (or be forced to) stay on 12.1 even when a 12.2 pytorch-cuda is released.

So, the prefix tag will stay cuda12- (nothing to change here), but the final image could look something like jupyter/pytorch-notebook:cuda12-cuda12.1, so people will be able to pin their cuda version more precisely.

What do you think?
I don't consider this to be a blocker for this PR, but it would definitely be nice to add this tag later.
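
To sketch what I mean (purely illustrative; run_in_container is a stand-in helper, not our actual tagging API):

      # Hypothetical tagger: derive a precise CUDA tag from the torch wheel
      # installed in the image. The real taggers in tagging/ may differ.
      def cuda_version_tag(container) -> str:
          # run_in_container is assumed to execute a command in the image
          # under test and return its stdout
          version = run_in_container(
              container, "python -c 'import torch; print(torch.version.cuda)'"
          ).strip()  # e.g. "12.1"
          return f"cuda{version}"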

@johanna-reiml-hpi (Contributor, Author) commented:
I had another look at the official pytorch Dockerfile, CUDA Dockerfiles and nvidia documentation for some env variables, so here are a few extra points to consider:

  • The following ENV should be added to the Dockerfile according to the linked NVIDIA documentation:
ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility
  • I don't think the CUDA installation for aarch64 actually worked. The CUDA variant build times are identical to the CPU ones. I checked the official install guide, and when selecting MacOS you get: # CUDA is not available on MacOS, please use default package. The official pytorch Dockerfile also only installs the CPU version for aarch64. I will exclude aarch64 from this PR.
  • The official pytorch Dockerfile installs the CUDA dependencies via conda. This adds CUDA binaries to conventional paths instead of having CUDA as a pip package (ENV LD_LIBRARY_PATH /usr/local/nvidia/lib:/usr/local/nvidia/lib64, ENV PATH /usr/local/nvidia/bin:/usr/local/cuda/bin:$PATH). I am not sure which approach is better. The current version might be nicer in terms of simplicity (and follows the official install guide); see the condensed sketch after this list.
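
A condensed sketch of the pip-based variant for reference (the base image and package list here are illustrative, not the exact merged Dockerfile):

      # Illustrative CUDA variant; image and package names are examples only.
      FROM quay.io/jupyter/scipy-notebook

      # CUDA libraries arrive as pip dependencies of the torch wheel, so no
      # CUDA base image or system-wide CUDA installation is required.
      RUN pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cu121

      # Required by the NVIDIA container runtime (per the NVIDIA docs)
      ENV NVIDIA_VISIBLE_DEVICES=all
      ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility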

@johanna-reiml-hpi (Contributor, Author) commented:
> One more thought - I don't know a lot about cuda versioning, but it might be worth adding a more precise cuda version as a tag prefix (so we'd need to add a new tagger for cuda-flavored images). People might want to (or be forced to) stay on 12.1 even when a 12.2 pytorch-cuda is released.
>
> So, the prefix tag will stay cuda12- (nothing to change here), but the final image could look something like jupyter/pytorch-notebook:cuda12-cuda12.1, so people will be able to pin their cuda version more precisely.
>
> What do you think? I don't consider this to be a blocker for this PR, but it would definitely be nice to add this tag later.

Seems okay to add. However, it's also possible that pytorch might add a --index-url 'https://download.pytorch.org/whl/cu122' alongside the current --index-url 'https://download.pytorch.org/whl/cu121'. In that case image variants would have to be built for both versions. I think it might be less confusing to do something like this instead:

  • jupyter/pytorch-notebook:cuda12 (latest, so cuda12.2)
  • jupyter/pytorch-notebook:cuda12.2
  • jupyter/pytorch-notebook:cuda12.1

@mathbunnyru (Member) commented Feb 12, 2024:

> One more thought - I don't know a lot about cuda versioning, but it might be worth adding a more precise cuda version as a tag prefix (so we'd need to add a new tagger for cuda-flavored images). People might want to (or be forced to) stay on 12.1 even when a 12.2 pytorch-cuda is released.
> So, the prefix tag will stay cuda12- (nothing to change here), but the final image could look something like jupyter/pytorch-notebook:cuda12-cuda12.1, so people will be able to pin their cuda version more precisely.
> What do you think? I don't consider this to be a blocker for this PR, but it would definitely be nice to add this tag later.
>
> Seems okay to add. However, it's also possible that pytorch might add a --index-url 'https://download.pytorch.org/whl/cu122' alongside the current --index-url 'https://download.pytorch.org/whl/cu121'. In that case image variants would have to be built for both versions. I think it might be less confusing to do something like this instead:
>
>   • jupyter/pytorch-notebook:cuda12 (latest, so cuda12.2)
>   • jupyter/pytorch-notebook:cuda12.2
>   • jupyter/pytorch-notebook:cuda12.1

If you can implement this purely as a tag feature, and not as a whole new image, then that would be great.
It might be easier to implement it as I mentioned, though (adding a new tagger which adds a more explicit cuda version).

I don't want to have a separate image for each X.Y version of cuda.
We usually only support the latest versions of packages.
In the case of cuda it makes sense to support both cuda11 and cuda12, but having to always support every cuda12.Y is too much.

@mathbunnyru (Member) commented:
@consideRatio @manics please, vote here: #2091 (comment)

I am ready to merge this before the vote deadline if I receive one more positive vote.

@consideRatio (Collaborator) left a review:

Thank you @johanna-reiml-hpi for working this so thoroughly!!!

A significant value for users

My experience is that installing CUDA next to PyTorch and/or Tensorflow can be a hurdle, so if this project takes on providing image variants of pytorch-notebook and/or tensorflow-notebook, that's of great value to end users, who then don't have to do it themselves, but it could be a notable maintenance burden for this project.

Vote to support CUDA ambitions for pytorch-notebook

I think the added complexity could be worth the value, and since @mathbunnyru, who does the most maintenance in this repo, has voted 👍, I'm 👍 on seeing this repo commit to providing cuda variants.

Vote to drop CUDA ambitions for tensorflow-notebook

Since there are already issues getting CUDA and Tensorflow installed, I think that should be seen as an indication that it's going to be a hurdle long term as well. My experience is that the tensorflow ecosystem is less friendly for users installing it, and the experience reported by @johanna-reiml-hpi strengthens this belief.

Due to this, I suggest that jupyter/docker-stacks doesn't go for CUDA variants of tensorflow-notebook, thinking that the maintenance burden is too large. I understand pytorch is more popular than tensorflow, so I think it's reasonable to invest more effort in pytorch than in tensorflow.

Misc PR review

CUDA variants policy

  • Supported architectures
    I think right now this only provides an x86_64 build of the image. If this PR is merged to only provide that, it would be good to clarify why aarch64 isn't included. It would also be good to reflect it in the PR title if this is an x86_64-only feature.
  • Variant description
    I understand that the variant is an image variant. I think it would be good if the description of variant flags clarified that some images include image variants, and that this results in a tag suffix being appended.
  • PR title
    • Update the PR title to not reference an issue number, as that's often not practically useful; it's far easier if it's instead referenced in the PR description as "fixes #..." or "partially resolves #...", as then it's clickable, for example.
    • Example PR title: Add pytorch-notebook image variants with cuda 11 and 12 (x86_64 versions only)
  • CUDA policy
    I think we should provide a maintenance-constraining policy, describing that no more than the latest two major versions of CUDA will be supported, so when the next major version comes out and this project builds it successfully, the project can stop building new images with CUDA 11. Having a note / policy about this already would be good.

@mathbunnyru (Member) commented:
> Vote to drop CUDA ambitions for tensorflow-notebook

If someone manages to enable a cuda-enabled tensorflow image for our docker stacks in a sane way (installing pip/mamba packages, no manual build), then it means the ecosystem is mature enough for us to support it.
I won't be against merging such a change.
In any case, after merging this PR, it should be easy to experiment.

I mostly agree with @consideRatio's "Misc PR review" section - @johanna-reiml-hpi please take a look.

> Supported architectures

👍

> Variant description

I prefer not to change anything here (I really like that all the cuda11 image tags have a long common prefix like jupyter/pytorch-notebook:cuda11-)

> PR Title

👍

> Cuda policy

👍

@consideRatio (Collaborator) commented:
Ah, I thought it was a suffix, seeing f"{platform}-{variant}" somewhere, and that it should be seen as a suffix being put last there. But I think I misunderstood this and it's really a high-priority prefix in the tag, so it makes sense to keep referring to it as a prefix.

I think the description could still be improved though, as I struggled to piece things together. Now I think something like "Some images have variants; they will be published with the same image name, but get a prefix added to the tag." could have been helpful to read somewhere.

@mathbunnyru changed the title from "#1557 Add cuda tag prefix for pytorch-notebook" to "Add pytorch-notebook image variants with cuda 11 and 12 (x86_64 versions only)" on Feb 24, 2024
@mathbunnyru mathbunnyru merged commit eccda24 into jupyter:main Feb 24, 2024
73 checks passed
@mathbunnyru (Member) commented Feb 24, 2024:

The code is great, and the voting is also successful, so I am merging this one - I will add a policy and some documentation myself.

@mathbunnyru (Member) commented:

New images are pushed and ready to use 🎉
https://quay.io/repository/jupyter/pytorch-notebook?tab=tags
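
To try one out with GPU access (the tag here is just an example; see the quay.io page above for the full list):

      # Example: run the CUDA 12 variant with GPUs exposed to the container
      docker run -it --rm --gpus all -p 8888:8888 \
          quay.io/jupyter/pytorch-notebook:cuda12-latest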

@mathbunnyru (Member) commented:
@johanna-reiml-hpi thank you for this PR.
It's nice to have a GPU-enabled image at least for pytorch, and I think the variant feature might be useful in other circumstances.

max-muoto pushed a commit to max-muoto/docker-stacks that referenced this pull request on Mar 10, 2024:
Add pytorch-notebook image variants with cuda 11 and 12 (x86_64 versions only) (jupyter#2091)

* feat: build cuda variants of pytorch

* feat: build with variant tag

* style: remove unused import

* refactor: rename get_prefix params

(cherry picked from commit 12b50af)

* revert: drop ROOT_CONTAINER addition from Makefile

(cherry picked from commit f423145)

* style: use consistent three empty lines in Makefile

(cherry picked from commit 446b45a)

* refactor: add default value for parent-image

(cherry picked from commit 32955ce)

* revert: use original workflow structure

(cherry picked from commit 68c6744)

* refactor: use single build image step

(cherry picked from commit 5f1ac0a)

* fix: run merge tags regardless of repository owner

(cherry picked from commit 3fce366)

* refactor: build cuda12 instead of cuda tag

(cherry picked from commit 217144e)

* docs: add note about CUDA tags to documentation

* refactor: add default value for variant in build-test-upload

* refactor: swap ordering of cuda11/cuda12 variants

* refactor: remove optional str type in arg parser

* fix: add proper env variables to CUDA Dockerfiles

* fix: remove CUDA build for aarch64

* fix: use latest NVIDIA documentation link

* fix: skip aarch64 tags file for CUDA variants

---------

Co-authored-by: zynaa <[email protected]>