[DO NOT MERGE] Use distroless to build the python container #29001

Closed
wants to merge 9 commits into from

Collaborator

@liferoad liferoad commented Oct 16, 2023

Fix #28991.

Distroless does not provide images for different Python versions, and the Beam Python containers also contain extra system packages.

Ideas:

  • Use FROM python:"${py_version}"-bookworm as python-base for the Python base image, and install everything Beam needs there, including both system packages and Python packages, under /venv.
  • Copy all of the packages over to the distroless image.

Using Python bullseye as the base:

image

Using distroless as the base:

image


@codecov

codecov bot commented Oct 16, 2023

Codecov Report

Merging #29001 (a8dbd7b) into master (87ca614) will increase coverage by 0.00%.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master   #29001   +/-   ##
=======================================
  Coverage   38.29%   38.30%           
=======================================
  Files         690      690           
  Lines      102048   102048           
=======================================
+ Hits        39082    39085    +3     
+ Misses      61382    61380    -2     
+ Partials     1584     1583    -1     
Flag Coverage Δ
go 53.45% <ø> (+<0.01%) ⬆️
python 29.88% <ø> (ø)

Flags with carried forward coverage won't be shown.

see 1 file with indirect coverage changes


Contributor

@bvolpato bvolpato left a comment

Thanks! It makes sense that we have to use a multi-stage build, though this looks a bit more complex.

Can we add some header comments explaining in general what the Dockerfile is doing? That would give others some context instead of leaving them to figure out why some steps are duplicated.

@robertwb
Contributor

robertwb commented Oct 18, 2023 via email

@tvalentyn
Contributor

I also wonder whether the maintainers of the Python images are aware of the potential vulnerability concerns; perhaps those issues could be resolved upstream of Beam, without us getting into building Python images from scratch.

@liferoad
Collaborator Author

What is the underlying motivation here?

On Wed, Oct 18, 2023, 3:24 PM tvalentyn wrote:

tvalentyn commented on this pull request, in sdks/python/container/Dockerfile:

> - pip install --upgrade pip setuptools wheel && \
> +RUN python -m venv venv-beam

Beam SDK workers create a venv as well, which is superimposed upon the default environment. I think having Beam's dependency packages in a separate venv would break this setup.

We only create the temporary venv in the python-base image to install all the packages; they are later copied to the distroless image.
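
To make the venv concern above concrete, here is a minimal sketch using only Python's standard-library venv module. It assumes that a "superimposed" worker venv means one created with system site packages enabled; that is an assumption for illustration, not something this thread confirms about Beam's worker boot process. The helper name create_overlay_venv is hypothetical.

```python
import tempfile
import venv
from pathlib import Path


def create_overlay_venv(target_dir: str) -> dict:
    """Create a venv that can also see the base interpreter's site-packages.

    Returns the key/value pairs parsed from the venv's pyvenv.cfg.
    """
    # system_site_packages=True mirrors `python -m venv --system-site-packages`.
    # with_pip=False keeps the sketch fast and avoids depending on ensurepip.
    builder = venv.EnvBuilder(system_site_packages=True, with_pip=False)
    builder.create(target_dir)

    config = {}
    for line in (Path(target_dir) / "pyvenv.cfg").read_text().splitlines():
        key, sep, value = line.partition("=")
        if sep:
            config[key.strip()] = value.strip()
    return config


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cfg = create_overlay_venv(str(Path(tmp) / "overlay"))
        # prints "true"
        print(cfg["include-system-site-packages"])
        # Directory of the base interpreter whose site-packages stay visible.
        print(cfg["home"])
```

A venv created this way falls through to the site-packages of the interpreter recorded under home, not to any other venv. Under that assumption, packages that exist only inside a separate build-time venv would need to end up in the final image's default environment for a superimposed worker venv to see them.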

@liferoad liferoad marked this pull request as ready for review October 28, 2023 04:45
@liferoad
Collaborator Author

Tested the wordcount example using Dataflow. It works.

@github-actions
Contributor

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @AnandInguva for label python.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

@liferoad
Collaborator Author

Run Python_Transforms PreCommit

sdks/python/container/Dockerfile
ARG TARGETARCH

# copy commands & libs to distroless
COPY --from=python-base /bin /bin
Contributor

for my information:

  1. how did we choose what content to copy from base?
  2. do distroless images have a package manager (apt) ? Will users be able to install additional software into these images if they want?

Collaborator Author

/bin contains some system commands like ls. In theory, we do not need them.

There is no apt on distroless. If you are talking about custom containers, it would be better for users to use other images as the base, or to copy the packages over the way this PR does.

Contributor

I think we have to announce this under breaking changes in CHANGES.md and adjust the custom container documentation in the Beam / Dataflow docs. I can see that this could complicate the UX for some custom container users.

Collaborator Author

Sounds good. I will update CHANGES.md after we all agree with using distroless.

Contributor

Not a blocker here, just a general comment: I'd rather pick specific binaries than copy every /bin binary. I understand that probably means more breakages, but it also means a smaller vulnerability surface.

Contributor

@tvalentyn tvalentyn Nov 7, 2023

I think the lack of apt is the main inconvenience for me. It is a real friction point, whereas the vulnerabilities reported by Docker image checkers are, more often than not, not applicable to the execution of Beam pipelines.

Collaborator Author

I do not think we need /bin; I added it here only for convenience. @tvalentyn I doubt users care about apt, since if they need to touch this file, they usually build their own images. Our only purpose in adopting distroless is to reduce vulnerabilities when users use our containers as the default.

Contributor

Before custom containers, we also mentioned running custom commands in:

https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/#nonpython

Not sure how much usage this has though.
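
As a rough illustration of the friction discussed above, the sketch below shows how a custom setup command could detect a missing package manager and fail with a clear message instead of an obscure "command not found". It assumes such custom commands shell out to apt-get or apt, which this thread does not confirm; the function names find_package_manager and install_system_packages are hypothetical.

```python
import shutil
import subprocess


def find_package_manager(candidates=("apt-get", "apt")):
    """Return the path of the first package manager found on PATH, else None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


def install_system_packages(packages, candidates=("apt-get", "apt")):
    """Install packages if a package manager exists; otherwise fail clearly."""
    manager = find_package_manager(candidates)
    if manager is None:
        # On a distroless base there is no apt, so report that explicitly.
        raise RuntimeError(
            "No package manager found on PATH; this image may be distroless. "
            "Bake system packages into a custom container instead."
        )
    subprocess.run([manager, "install", "-y", *packages], check=True)
```

On an image without apt, install_system_packages raises RuntimeError before attempting any subprocess call, which makes the incompatibility visible at the point where the custom command runs.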

@tvalentyn
Contributor

Run Python ValidatesContainer Dataflow ARM 3.8

@tvalentyn
Contributor

Run Python ValidatesContainer Dataflow ARM 3.11

@tvalentyn
Contributor

Run Python Dataflow ValidatesContainer

@tvalentyn
Contributor

Run Python ValidatesContainer Dataflow ARM 3.11

@liferoad
Collaborator Author

liferoad commented Nov 4, 2023

Run Python_Integration PreCommit 3.8

@liferoad
Collaborator Author

liferoad commented Nov 7, 2023

Run Python Dataflow ValidatesContainer

@liferoad
Collaborator Author

liferoad commented Nov 7, 2023

Run Python ValidatesContainer Dataflow ARM 3.11

@tvalentyn
Contributor

I can't find the test results. Could you paste a link to the container test suites, please, after we iron out any remaining issues with the test infra?

@tvalentyn
Contributor

I am not sure why the ARM suite is not showing up. It's on GitHub Actions; I thought only Jenkins had issues.

@tvalentyn
Contributor

Run Python ValidatesContainer Dataflow ARM 3.8

@tvalentyn
Contributor

Run Python ValidatesContainer Dataflow ARM

@tvalentyn
Contributor

Run Python 3.8 Postcommit

@tvalentyn
Contributor

Regular postcommits wouldn't pick up these changes. ARM postcommits should.

@tvalentyn
Contributor

Run Python PostCommit Arm

@liferoad liferoad changed the title Use distroless to build the python container [DO NOT MERGE] Use distroless to build the python container Nov 9, 2023
Contributor

Reminder, please take a look at this PR: @AnandInguva

Contributor

Assigning a new set of reviewers because the PR has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:

R: @riteshghorse for label python.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

@liferoad
Collaborator Author

Closing this for now, since we plan to support more container images with different purposes.

@liferoad liferoad closed this Nov 21, 2023
Development

Successfully merging this pull request may close these issues.

[Task]: Use "Distroless" Container Images to reduce potential vulnerabilities for Beam Python SDK containers
4 participants