-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DO NOT MERGE] Use distroless to build the python container #29001
Conversation
Codecov Report
@@ Coverage Diff @@
## master #29001 +/- ##
=======================================
Coverage 38.29% 38.30%
=======================================
Files 690 690
Lines 102048 102048
=======================================
+ Hits 39082 39085 +3
+ Misses 61382 61380 -2
+ Partials 1584 1583 -1
Flags with carried forward coverage won't be shown. Click here to find out more. see 1 file with indirect coverage changes 📣 Codecov offers a browser extension for seamless coverage viewing on GitHub. Try it in Chrome or Firefox today! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! It makes sense that we have to use multi-stage, and this looks a bit more complex.
Can we write some header comments just sharing in general what is the Dockerfile doing? This will help others get some context instead of figuring out why we have some duplicated steps.
What is the underlying motivation here?
…On Wed, Oct 18, 2023, 3:24 PM tvalentyn ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In sdks/python/container/Dockerfile
<#29001 (comment)>:
>
- pip install --upgrade pip setuptools wheel && \
+RUN python -m venv venv-beam
Beam SDK workers create a venv as well, which is superimposed upon the
default environment. I think having Beam dependencies packages in a
separate venv would break this setup.
—
Reply to this email directly, view it on GitHub
<#29001 (review)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AADWVAOZSTTFQKPC3E7CVYLYABJKDAVCNFSM6AAAAAA6BMBU6WVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMYTMOBWGMYDANRTGE>
.
You are receiving this because you are subscribed to this thread.Message
ID: ***@***.***>
|
I also wonder if the maintainers of python images are aware of the potential vulnerability concerns and perhaps those issues could be resolved upstream of Beam without us not getting into building Python images from scratch. |
We just create the temp venv in the python-base image to install all the packages and later copy them to the distroless image. |
Tested the wordcount example using Dataflow. It works. |
Assigning reviewers. If you would like to opt out of this review, comment R: @AnandInguva for label python. Available commands:
The PR bot will only process comments in the main thread (not review comments). |
Run Python_Transforms PreCommit |
ARG TARGETARCH | ||
|
||
# copy commands & libs to distroless | ||
COPY --from=python-base /bin /bin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for my information:
- how did we choose what content to copy from base?
- do distroless images have a package manager (
apt
) ? Will users be able to install additional software into these images if they want?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/bin contains some system commands like ls
. We do not need them in theory.
no apt on distroless. If you are talking about the customer container, it will be better for users to use other images as base or use the way in this PR to copy the packages over.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we have to announce this in breaking changes in CHANGES.MD and in make adjustments to custom container documentation in Beam / Dataflow docs. I do see that this could complicate the UX for some custom container users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. I will update CHANGES.md after we all agree with using distroless.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a blocker here, just a general comment: I'd rather pick specific binaries rather than copying every /bin
binary. I understand it probably means more breakages, but it's also a smaller vulnerability surface.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think lack of apt
is the main inconvenience for me. I think this is a real friction point, while the vulnerabilities reported by docker image checkers more often than not, are not applicable to execution of Beam pipelines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think we need /bin
. I added it here only for the convenience. @tvalentyn I doubt users care about apt
since if they need to touch this file, they usually build their own images. The purpose for us to adopt distorless is only to reduce the vulnerabilities when users use our containers as the default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before custom containers we also mentioned running custom commands in:
https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/#nonpython
Not sure how much usage this has though.
Run Python ValidatesContainer Dataflow ARM 3.8 |
Run Python ValidatesContainer Dataflow ARM 3.11 |
Run Python Dataflow ValidatesContainer |
Run Python ValidatesContainer Dataflow ARM 3.11 |
Run Python_Integration PreCommit 3.8 |
Run Python Dataflow ValidatesContainer |
Run Python ValidatesContainer Dataflow ARM 3.11 |
I can find test results - could you paste a link for container test suites please after we iron out any remaining issues with test infra? |
I am not sure why ARM suite is not showing up, it's on github actions, I thought only Jenkins had issues |
Run Python ValidatesContainer Dataflow ARM 3.8 |
Run Python ValidatesContainer Dataflow ARM |
Run Python 3.8 Postcommit |
Regular postcommits wouldn't pick up these changes. ARM postcommits should |
Run Python PostCommit Arm |
Reminder, please take a look at this pr: @AnandInguva |
Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment R: @riteshghorse for label python. Available commands:
|
Close this for now since we plan to support more container images with different purposes. |
Fix #28991.
distroless does not provide the images with different python versions. And the Beam python containers also contains extras system packages.
Ideas:
FROM python:"${py_version}"-bookworm as python-base
as the python base image to install everything Beam needs including both system packages and python packages under/venv
Using the python bullseye as the base,
Using the distroless as the base,
Fix #28991
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
addresses #123
), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, commentfixes #<ISSUE NUMBER>
instead.CHANGES.md
with noteworthy changes.See the Contributor Guide for more tips on how to make review process smoother.
To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.