[DO NOT MERGE] Use distroless to build the python container #29001

Closed
wants to merge 9 commits into from
37 changes: 28 additions & 9 deletions sdks/python/container/Dockerfile
@@ -17,18 +17,15 @@
###############################################################################

ARG py_version
FROM python:"${py_version}"-bullseye as beam
LABEL Author "Apache Beam <[email protected]>"
ARG TARGETOS
ARG TARGETARCH

COPY target/base_image_requirements.txt /tmp/base_image_requirements.txt
COPY target/apache-beam.tar.gz /opt/apache/beam/tars/
COPY target/launcher/${TARGETOS}_${TARGETARCH}/boot target/LICENSE target/NOTICE target/LICENSE.python /opt/apache/beam/
FROM python:"${py_version}"-bookworm as python-base
liferoad marked this conversation as resolved.

ENV CLOUDSDK_CORE_DISABLE_PROMPTS yes
ENV PATH $PATH:/usr/local/gcloud/google-cloud-sdk/bin

COPY target/base_image_requirements.txt /tmp/base_image_requirements.txt
COPY target/apache-beam.tar.gz /opt/apache/beam/tars/

# Use one RUN command to reduce the number of layers.
RUN \
# Install native bindings required for dependencies.
@@ -44,8 +41,7 @@ RUN \
libgeos-dev \
&& \
rm -rf /var/lib/apt/lists/* && \

pip install --upgrade pip setuptools wheel && \
pip install --upgrade setuptools wheel && \

# Install required packages for Beam Python SDK and common dependencies used by users.
# use --no-deps to ensure the list includes all transitive dependencies.
@@ -82,6 +78,29 @@ RUN \
# Remove pip cache.
rm -rf /root/.cache/pip

FROM gcr.io/distroless/cc-debian12 as beam
LABEL Author "Apache Beam <[email protected]>"
ARG TARGETOS
ARG TARGETARCH

# copy commands & libs to distroless
COPY --from=python-base /bin /bin

Contributor

For my information:

  1. How did we choose what content to copy from the base image?
  2. Do distroless images have a package manager (apt)? Will users be able to install additional software into these images if they want to?

Collaborator Author

/bin contains system commands such as ls. In theory, we do not need them.

There is no apt on distroless. If you are talking about custom containers, it would be better for users to use another image as the base, or to copy the packages over the way this PR does.
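
A minimal sketch of the first option (a different base image that still has apt), following the pattern in the Beam custom container docs. The Python and Beam version tags are placeholders, and `libgeos-dev` is reused from the RUN step above only as an example of an apt package a user might need:

```dockerfile
# Assumption: any Debian-based Python image with apt; the tag is a placeholder.
FROM python:3.11-slim-bookworm

# Install extra system packages with apt, which is not available on distroless.
RUN apt-get update && \
    apt-get install -y --no-install-recommends libgeos-dev && \
    rm -rf /var/lib/apt/lists/*

# Install the Beam SDK; the version must match the SDK image copied from below.
RUN pip install --no-cache-dir "apache-beam[gcp]==2.52.0"

# Copy the boot launcher and license files from the released SDK image.
COPY --from=apache/beam_python3.11_sdk:2.52.0 /opt/apache/beam /opt/apache/beam

# Same entrypoint as the default Beam container.
ENTRYPOINT ["/opt/apache/beam/boot"]
```

With this layout the user image does not depend on which tools the default Beam image ships; it only takes /opt/apache/beam from it.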

Contributor

I think we have to announce this under breaking changes in CHANGES.md and adjust the custom container documentation in the Beam and Dataflow docs. I do see that this could complicate the UX for some custom container users.

Collaborator Author

Sounds good. I will update CHANGES.md once we all agree on using distroless.

Contributor

Not a blocker here, just a general comment: I'd rather pick specific binaries than copy every /bin binary. I understand that probably means more breakage, but it also means a smaller vulnerability surface.
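
A rough sketch of what such an allowlist could look like, reusing the `python-base` and `beam` stage names from this diff. The `py_version` default and the choice of `sh` and `ls` are hypothetical; the thread does not settle which binaries the boot launcher or users actually need:

```dockerfile
# Placeholder default so the sketch builds without --build-arg.
ARG py_version=3.11
FROM python:${py_version}-bookworm AS python-base

FROM gcr.io/distroless/cc-debian12 AS beam

# Hypothetical allowlist: copy individual binaries instead of all of /bin.
COPY --from=python-base /bin/sh /bin/sh
COPY --from=python-base /bin/ls /bin/ls

# This one is already copied individually in the PR.
COPY --from=python-base /usr/bin/which /usr/bin/which
```

Each copied binary still needs its shared libraries in the final image; the PR covers that by copying /lib from `python-base`.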

Contributor

@tvalentyn, Nov 7, 2023

I think the lack of apt is the main inconvenience for me. It is a real friction point, whereas the vulnerabilities reported by Docker image checkers are, more often than not, not applicable to the execution of Beam pipelines.

Collaborator Author

I do not think we need /bin; I added it here only for convenience. @tvalentyn I doubt users care about apt, since if they need to touch this file they usually build their own images. Our only purpose in adopting distroless is to reduce the vulnerabilities users are exposed to when they use our containers as the default.

Contributor

Before custom containers existed, we also documented running custom commands in:

https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/#nonpython

Not sure how much usage this has though.

COPY --from=python-base /lib lib
COPY --from=python-base /etc/ld.so.cache /etc/ld.so.cache
COPY --from=python-base /usr/bin/which /usr/bin/which
liferoad marked this conversation as resolved.

# copy packages to distroless
COPY --from=python-base /usr/local/lib /usr/local/lib
COPY --from=python-base /usr/local/gcloud /usr/local/gcloud
COPY --from=python-base /usr/local/bin /usr/local/bin

ENV PATH="/usr/local/bin:/usr/local/gcloud/google-cloud-sdk/bin:$PATH"

COPY target/launcher/${TARGETOS}_${TARGETARCH}/boot target/LICENSE target/NOTICE target/LICENSE.python /opt/apache/beam/

ENV CLOUDSDK_CORE_DISABLE_PROMPTS yes
ENV PATH $PATH:/usr/local/gcloud/google-cloud-sdk/bin

ENTRYPOINT ["/opt/apache/beam/boot"]

####