Create Comet docker file #675

Merged: 3 commits into apache:main on Jul 17, 2024

Conversation

comphead (Contributor):

Which issue does this PR close?

Closes #643.

Rationale for this change

Create a Comet Dockerfile that produces a Spark-based image with the Comet jar integrated.
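
For illustration only, the resulting image could be built locally with something along these lines (the tag name comet-spark is just an example, not defined in this PR):

    # Build the Comet Spark image from the repository root (illustrative tag)
    docker build -t comet-spark -f kube/Dockerfile .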

What changes are included in this PR?

How are these changes tested?

kube/Dockerfile Outdated
Comment on lines 1 to 26
FROM apache/spark:3.4.2

USER root

# Installing JDK11 as the image comes with JRE
RUN apt update \
&& apt install -y git \
&& apt install -y curl \
&& apt install -y openjdk-11-jdk \
&& apt clean

RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
ENV PATH="/root/.cargo/bin:${PATH}"
ENV RUSTFLAGS="-C debuginfo=line-tables-only -C incremental=false"
ENV SPARK_VERSION=3.4
ENV SCALA_VERSION=2.12

# Pick the JDK instead of JRE to compile Comet
RUN cd /opt \
&& git clone https://github.com/apache/datafusion-comet.git \
&& cd datafusion-comet \
&& JAVA_HOME=$(readlink -f $(which javac) | sed "s/\/bin\/javac//") make release PROFILES="-Pspark-$SPARK_VERSION -Pscala-$SCALA_VERSION"

RUN cp /opt/datafusion-comet/spark/target/comet-spark-spark${SPARK_VERSION}_$SCALA_VERSION-0.1.0-SNAPSHOT.jar $SPARK_HOME/jars

USER ${spark_uid}
Member:

We can reduce the size of the final image from ~8 GB to ~1 GB by using a multi-stage build.

With the Dockerfile in this PR:

REPOSITORY                         TAG       IMAGE ID       CREATED          SIZE
andygrove/comet                    latest    223fb8a90579   21 seconds ago   8.42GB

With the version suggested below:

REPOSITORY                         TAG       IMAGE ID       CREATED          SIZE
andygrove/comet                    latest    44695163de5d   31 seconds ago   996MB
Suggested change (the single-stage Dockerfile above, rewritten as a multi-stage build):
FROM apache/spark:3.4.2 AS builder
USER root
# Installing JDK11 as the image comes with JRE
RUN apt update \
&& apt install -y git \
&& apt install -y curl \
&& apt install -y openjdk-11-jdk \
&& apt clean
RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
ENV PATH="/root/.cargo/bin:${PATH}"
ENV RUSTFLAGS="-C debuginfo=line-tables-only -C incremental=false"
ENV SPARK_VERSION=3.4
ENV SCALA_VERSION=2.12
# Pick the JDK instead of JRE to compile Comet
RUN cd /opt \
&& git clone https://github.com/apache/datafusion-comet.git \
&& cd datafusion-comet \
&& JAVA_HOME=$(readlink -f $(which javac) | sed "s/\/bin\/javac//") make release PROFILES="-Pspark-$SPARK_VERSION -Pscala-$SCALA_VERSION"
FROM apache/spark:3.4.2
ENV SPARK_VERSION=3.4
ENV SCALA_VERSION=2.12
USER root
COPY --from=builder /opt/datafusion-comet/spark/target/comet-spark-spark${SPARK_VERSION}_$SCALA_VERSION-0.1.0-SNAPSHOT.jar $SPARK_HOME/jars
USER ${spark_uid}

Locally I am seeing this warning:

 1 warning found (use --debug to expand):
 - UndefinedVar: Usage of undefined variable '$spark_uid' (line 33)
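
A minimal sketch of one way to silence this (assuming the default Spark UID of 185 used by the upstream apache/spark images; adjust if the base image differs): declare the variable in the final stage before it is used.

# spark_uid is not otherwise defined in this Dockerfile; declare it so that
# USER ${spark_uid} resolves (185 is the default UID in the apache/spark images)
ARG spark_uid=185
USER ${spark_uid}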

Contributor Author:

Done

kube/Dockerfile Outdated
@@ -0,0 +1,26 @@
FROM apache/spark:3.4.2
Member:

Need to add ASF header
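
For reference, the standard ASF license header, written as Dockerfile comments (the same boilerplate used across the project's other source files):

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND,
# either express or implied.  See the License for the
# specific language governing permissions and limitations
# under the License.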

Comment on lines +5 to +11
# Installing JDK11 as the image comes with JRE
RUN apt update \
&& apt install -y git \
&& apt install -y curl \
&& apt install -y openjdk-11-jdk \
&& apt clean

Member:

Doesn't apache/spark:3.4.2 have Java installed?

Member:

I think it has the JRE but not the JDK, based on the comment.

Contributor Author:

Yeah, it's only the JRE. The folder name is confusing because it says openjdk, but in fact there is no javac in the bin folder.
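
As a quick sanity check (illustrative command, not part of the PR), one can verify that the base image ships no compiler:

    # Prints the javac path if present, otherwise reports that it is missing
    docker run --rm --entrypoint sh apache/spark:3.4.2 -c 'command -v javac || echo "no javac in base image"'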

@comphead comphead marked this pull request as draft July 16, 2024 19:28
@comphead comphead marked this pull request as ready for review July 16, 2024 21:47
@comphead comphead requested review from andygrove and viirya July 16, 2024 21:47
@andygrove (Member) left a comment

The Dockerfile builds without error for me. I have not tried deploying it in k8s.

Should there be a README as well, or any update to the user or contributor guides?

@comphead (Contributor Author):

Thanks @andygrove, I'll add docs later once I wrap up with local kubectl. I'm not sure how useful just a raw Dockerfile is; the user most likely wants to run it in a cluster?

@andygrove (Member):

Thanks @andygrove, I'll add docs later once I wrap up with local kubectl. I'm not sure how useful just a raw Dockerfile is; the user most likely wants to run it in a cluster?

Personally, I'm fine with adding the Dockerfile first and docs later.

kube/Dockerfile Outdated
# limitations under the License.
#

FROM apache/spark:3.4.2 AS builder
Member:

We are using 3.4.3.

Member:

It would be good to keep this consistent with our main branch.

ENV SPARK_VERSION=3.4
ENV SCALA_VERSION=2.12
USER root
COPY --from=builder /opt/datafusion-comet/spark/target/comet-spark-spark${SPARK_VERSION}_$SCALA_VERSION-0.1.0-SNAPSHOT.jar $SPARK_HOME/jars
Contributor:

Perhaps we should change 0.1.0-SNAPSHOT to an ENV variable?
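
For illustration, the jar version could be pulled out into a variable along these lines (COMET_VERSION is a hypothetical name, not something defined in this PR):

ENV COMET_VERSION=0.1.0-SNAPSHOT
COPY --from=builder /opt/datafusion-comet/spark/target/comet-spark-spark${SPARK_VERSION}_${SCALA_VERSION}-${COMET_VERSION}.jar $SPARK_HOME/jars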

Contributor Author:

We're going to parameterize the variables in a follow-up PR.

kube/Dockerfile Outdated
# limitations under the License.
#

FROM apache/spark:3.4.2 AS builder
Contributor:

We could use ARG to parameterize this -

ARG BASE_IMAGE=apache/spark:3.4.3
FROM ${BASE_IMAGE}

Then the user can specify a different version of the Spark image -

    docker build -f kube/Dockerfile --platform linux/amd64 \
          --build-arg BASE_IMAGE=MY_BASE_IMAGE \
          ...

I don't think we need to parameterize this as part of this PR. But potentially we could parameterize a bunch of the ENV variables similarly.

See also: https://docs.docker.com/build/guide/build-args/
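
A rough sketch of what that follow-up parameterization might look like (variable names and defaults here are illustrative, not what the project settled on):

# BASE_IMAGE is declared before the first FROM so it can be used in FROM lines
ARG BASE_IMAGE=apache/spark:3.4.3

FROM ${BASE_IMAGE} AS builder
# ARG values must be re-declared inside each stage that uses them
ARG SPARK_VERSION=3.4
ARG SCALA_VERSION=2.12
# ... clone and build Comet as in the Dockerfile above ...

FROM ${BASE_IMAGE}
ARG SPARK_VERSION=3.4
ARG SCALA_VERSION=2.12
COPY --from=builder /opt/datafusion-comet/spark/target/comet-spark-spark${SPARK_VERSION}_${SCALA_VERSION}-0.1.0-SNAPSHOT.jar $SPARK_HOME/jars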

Contributor Author:

That's a neat idea. I'll do the parameterization in a follow-up PR; it also requires some documentation.

@comphead comphead merged commit 7ac2fb9 into apache:main Jul 17, 2024
74 checks passed
himadripal pushed a commit to himadripal/datafusion-comet that referenced this pull request Sep 7, 2024
* Create Comet docker file