Add pytorch-notebook image variants with cuda 11 and 12 (x86_64 versions only) #2091
Conversation
From what I can see, the build for aarch64 seems to fail due to running out of disk space (it seems to error when archiving the image; I had similar errors for CUDA builds running out of disk space). Is there an issue with also enabling this part for the self-hosted aarch64 runners? Right now it only runs for x86_64.

```yaml
# Image with CUDA needs extra disk space
- name: Free disk space 🧹
  if: contains(inputs.variant, 'cuda') && inputs.platform == 'x86_64'
  uses: jlumbroso/free-disk-space@54081f138730dfa15788a46383842cd2f914a1be
  with:
    tool-cache: false
    android: true
    dotnet: true
    haskell: true
    large-packages: false
    docker-images: false
    swap-storage: false
```
Self-hosted runners are not usually bloated with lots of software. I created our aarch64 runners using Ubuntu provided by Google Cloud and ran this on top: https://github.com/jupyter/docker-stacks/blob/main/aarch64-runner/setup.sh Could you please tell me how much free disk space is needed to build one GPU image?
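One way to answer this question is to measure free space on the runner before and after a build. A minimal standard-library sketch (the path and the measurement points are illustrative, not part of the repo's tooling):

```python
import shutil


def free_gib(path: str = "/") -> float:
    """Free disk space at `path` in GiB."""
    return shutil.disk_usage(path).free / 2**30


# Call once before and once after `docker build` to estimate
# how much disk a CUDA image build actually consumes.
before = free_gib()
# ... run the image build here ...
after = free_gib()
print(f"free before: {before:.1f} GiB, free after: {after:.1f} GiB")
```

The difference between the two readings gives a rough lower bound for sizing runner disks.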
@johanna-reiml-hpi I think the quality of this PR is awesome 👍 We now have a policy for merging new images and adding new packages 🎉 @mathbunnyru, @consideRatio, @yuvipanda, and @manics, please vote 👍 to accept this change and 👎 to not accept it (use a reaction to this message). We can have a discussion until the deadline, so please express your opinions.
I am not sure exactly how much space is needed, but the 14 GB provided for
Only small comments. If you have some time, let's change these small details as well.
I cleaned up the space on our aarch64 runners.
It would be nice to have a GPU version of tensorflow as well. I would suggest trying to install the conda-forge version of the tensorflow package as well when we merge this PR.
Seems to work - the build is green 🎉
One more thought - I don't know a lot about CUDA versioning, but it might be worth adding a more precise CUDA version as a tag prefix (so we need to add a new tagger for CUDA-flavored images). So, the prefix tag will stay

What do you think?
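A tagger deriving the prefix from the CUDA major version could be sketched like this (function and parameter names are hypothetical, not the repo's actual tagger API):

```python
def cuda_tag_prefix(variant: str, cuda_version: str) -> str:
    """Map a variant plus a full CUDA version to a precise tag prefix.

    E.g. a 'cuda' variant built against CUDA 12.1 yields 'cuda12-',
    so the tag encodes the major CUDA version instead of a bare 'cuda-'.
    """
    if "cuda" not in variant:
        return ""  # non-CUDA variants get no prefix
    major = cuda_version.split(".")[0]
    return f"cuda{major}-"


print(cuda_tag_prefix("cuda", "12.1"))     # cuda12-
print(cuda_tag_prefix("default", "12.1"))  # (empty string)
```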
I had another look at the official pytorch Dockerfile, CUDA Dockerfiles, and NVIDIA documentation for some env variables, so here are a few extra points to consider:
Seems okay to add. However, it's also possible that pytorch might add a
If you can implement this as purely a tag feature, and not a whole new image - then it would be great. I don't want to have a separate image for each
@consideRatio @manics please vote here: #2091 (comment) I am ready to merge this before the vote deadline if I receive one more positive vote.
Thank you @johanna-reiml-hpi for working on this so thoroughly!!!
A significant value for users
My experience is that installing CUDA next to PyTorch and/or TensorFlow can be a hurdle, so if this project takes on providing image variants of pytorch-notebook and/or tensorflow-notebook, it's a big value for end users who don't have to do it themselves, but it could be a notable maintenance burden for this project.
Vote to support CUDA ambitions for pytorch-notebook
I think the complexity-versus-value tradeoff could be worth it, and since @mathbunnyru, who does the most maintenance in this repo, has voted 👍, I'm 👍 to seeing this repo commit to providing cuda variants.
Vote to drop CUDA ambitions for tensorflow-notebook
Since there are already issues getting CUDA and TensorFlow installed, I think it should be seen as an indication that it's going to be a hurdle long term as well. My experience is that the tensorflow ecosystem is less friendly for users installing it, and the experience reported by @johanna-reiml-hpi strengthens this belief.
Due to this, I suggest that jupyter/docker-stacks doesn't go for CUDA variants of tensorflow-notebook, thinking that the maintenance burden is too large. I understand pytorch is more popular than tensorflow, so I think it's also reasonable to invest more effort in pytorch than in tensorflow.
Misc PR review
CUDA variants policy
- Supported architectures
  I think right now, this only provides an x86_64 build of this image. If this PR is merged to only provide that, it would be good to clarify why not also aarch64. It would also be good if the PR title reflects that this is an x86_64-only feature.
- Variant description
  I understand that the `variant` is an image variant. I think it would be good if the description of the `variant` flags clarified that some images include image variants, and that a variant results in a tag suffix being appended.
- PR title
  - Update the PR to not reference an issue in the title, as it's often not practically useful; it's far easier if the issue is instead referenced in the PR description as "fixes #..." or "partially resolves #...", as then it's clickable, for example.
  - Example PR title: Add pytorch-notebook image variants with cuda 11 and 12 (x86_64 versions only)
- CUDA policy
  I think we should provide a maintenance-constraining policy describing that no more than the latest two major versions of CUDA will be supported, so when the next major version comes out and this project builds it successfully, the project can stop building new images with CUDA 11. Having a note / policy about this already would be good.
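The proposed policy (keep only the two newest CUDA major versions) is mechanical enough to express in code; this function is purely illustrative, not part of the repo:

```python
def cuda_majors_to_build(available_majors: list[int], keep: int = 2) -> list[int]:
    """Newest `keep` CUDA major versions that should still get image builds."""
    return sorted(set(available_majors), reverse=True)[:keep]


# Today both cuda11 and cuda12 are built; once CUDA 13 builds succeed,
# CUDA 11 images would be dropped automatically under this policy:
print(cuda_majors_to_build([11, 12]))      # [12, 11]
print(cuda_majors_to_build([11, 12, 13]))  # [13, 12]
```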
If someone manages to enable a cuda-enabled tensorflow image for our docker stacks in a sane way (installing pip/mamba packages, no manual builds), then it means that the ecosystem is mature enough for us to support it. I mostly agree with @consideRatio's "Misc PR review" section - @johanna-reiml-hpi please take a look.
👍
I prefer not to change anything here (I really like all the
👍
👍
Ah, I thought it was a suffix after seeing f"{platform}-{variant}" somewhere, and that it should be seen as a suffix being put last there. But I think I misunderstood this and that it's really a high-priority prefix in the tag, so it makes sense to keep referencing it as a prefix. I think the description could still be improved, though, as I struggled to piece things together. Now I think something like "Some images have variants; they will be published with the same image name, but get a prefix added to the tag." could have been helpful to read somewhere.
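In other words, the variant contributes a prefix inside the tag while the image name stays the same. A small sketch of that tag assembly (the exact helper names in the repo's tagging scripts differ; this only mirrors the f"{platform}-{variant}" shape mentioned above):

```python
def full_tag(platform: str, variant: str, base_tag: str) -> str:
    """Compose '<platform>-<variant>-<base>' with the variant as a tag prefix."""
    # The default variant contributes nothing; others become a prefix in the tag.
    variant_prefix = "" if variant == "default" else f"{variant}-"
    return f"{platform}-{variant_prefix}{base_tag}"


print(full_tag("x86_64", "cuda12", "latest"))    # x86_64-cuda12-latest
print(full_tag("aarch64", "default", "latest"))  # aarch64-latest
```

The image name (e.g. pytorch-notebook) never changes; only the tag grows a `cuda12-` style prefix.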
The code is great, and the voting is also successful, so I am merging this one - I will add a policy and some documentation myself. |
New images are pushed and ready to use 🎉
@johanna-reiml-hpi thank you for this PR. |
…ons only) (jupyter#2091) * feat: build cuda variants of pytorch * feat: build with variant tag * style: remove unused import * refactor: rename get_prefix params (cherry picked from commit 12b50af) * revert: drop ROOT_CONTAINER addition from Makefile (cherry picked from commit f423145) * style: use consistent three empty lines in Makefile (cherry picked from commit 446b45a) * refactor: add default value for parent-image (cherry picked from commit 32955ce) * revert: use original workflow structure (cherry picked from commit 68c6744) * refactor: use single build image step (cherry picked from commit 5f1ac0a) * fix: run merge tags regardless of repository owner (cherry picked from commit 3fce366) * refactor: build cuda12 instead of cuda tag (cherry picked from commit 217144e) * docs: add note about CUDA tags to documentation * refactor: add default value for variant in build-test-upload * refactor: swap ordering of cuda11/cuda12 variants * refactor: remove optional str type in arg parser * fix: add proper env variables to CUDA Dockerfiles * fix: remove CUDA build for aarch64 * fix: use latest NVIDIA documentation link * fix: skip aarch64 tags file for CUDA variants --------- Co-authored-by: zynaa <[email protected]>
Describe your changes
This reworks the build pipeline to allow building CUDA variants for the `pytorch-notebook` image. Example images: https://quay.io/repository/ai-hpi/pytorch-notebook

The variants are published as tags of `pytorch-notebook` instead (adding "cuda-" and "cuda11-" prefixes). This logic is implemented by adding a new "variant" argument for the GitHub Actions and the tagging scripts. I also refactored `docker.yaml` to get rid of the redundancy for the aarch64 vs x86_64 build.

In the issue there were discussions about using the official CUDA images instead of `ubuntu:22.04` as the root container. This is not desirable in my opinion, as these images are not updated frequently and are based on older `ubuntu:22.04` images. Instead, a simpler approach works: packages like pytorch and tensorflow bundle their own CUDA binaries via pip these days.

I looked into CUDA builds for TensorFlow and PyTorch (I am not sure if CUDA support is really a priority for other notebooks). In order to support NVIDIA driver versions older than 525, builds for CUDA 11.8 should also be included.
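That the CUDA runtime arrives via pip can be verified by listing the nvidia-* wheels that a pip-installed torch pulls in. A standard-library sketch (the `nvidia-` name prefix is an assumption about how NVIDIA publishes these wheels, e.g. nvidia-cuda-runtime-cu12):

```python
from importlib.metadata import distributions


def bundled_cuda_wheels() -> list[str]:
    """Installed pip packages whose names start with 'nvidia-'.

    These wheels ship the CUDA libraries that a pip-installed pytorch
    links against, so no system CUDA toolkit is required in the image.
    """
    return sorted(
        name
        for d in distributions()
        if (name := d.metadata["Name"]) and name.startswith("nvidia-")
    )


print(bundled_cuda_wheels())  # empty list if no CUDA wheels are installed
```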
`--extra-index-url=https://pypi.nvidia.com`, but it's likely not needed.

`pip install --no-cache-dir --extra-index-url=https://pypi.nvidia.com tensorrt tensorflow[and-cuda]` didn't seem to work for me, and tensorflow was unable to find tensorrt. I assume the workaround in the linked issue would work, but it seems better to either manually build the package, try nightly builds, or wait for 2.16.0.

Issue ticket if applicable
Fix (for pytorch): #1557
Checklist (especially for first-time contributors)