Create Comet docker file #675
Conversation
kube/Dockerfile (outdated)
FROM apache/spark:3.4.2

USER root

# Installing JDK11 as the image comes with JRE
RUN apt update \
    && apt install -y git \
    && apt install -y curl \
    && apt install -y openjdk-11-jdk \
    && apt clean

RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
ENV PATH="/root/.cargo/bin:${PATH}"
ENV RUSTFLAGS="-C debuginfo=line-tables-only -C incremental=false"
ENV SPARK_VERSION=3.4
ENV SCALA_VERSION=2.12

# Pick the JDK instead of JRE to compile Comet
RUN cd /opt \
    && git clone https://github.com/apache/datafusion-comet.git \
    && cd datafusion-comet \
    && JAVA_HOME=$(readlink -f $(which javac) | sed "s/\/bin\/javac//") make release PROFILES="-Pspark-$SPARK_VERSION -Pscala-$SCALA_VERSION"

RUN cp /opt/datafusion-comet/spark/target/comet-spark-spark${SPARK_VERSION}_$SCALA_VERSION-0.1.0-SNAPSHOT.jar $SPARK_HOME/jars

USER ${spark_uid}
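The JAVA_HOME derivation buried in the build step above is easy to misread, so here is a small replay of just the sed part on a sample path (a sketch; the path is a typical Debian/Ubuntu OpenJDK 11 install location, assumed for illustration rather than taken from this image):

```shell
# Replays the JAVA_HOME derivation from the Dockerfile's build step:
#   JAVA_HOME=$(readlink -f $(which javac) | sed "s/\/bin\/javac//")
# Here a sample resolved path stands in for readlink/which output.
javac_path="/usr/lib/jvm/java-11-openjdk-amd64/bin/javac"
JAVA_HOME=$(echo "$javac_path" | sed "s/\/bin\/javac//")
echo "$JAVA_HOME"
# → /usr/lib/jvm/java-11-openjdk-amd64
```

The sed expression simply strips the trailing /bin/javac from the resolved compiler path, which yields the JDK root.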
We can reduce the size of the final image from ~8 GB to ~1 GB by using a multi stage build.
With the Dockerfile in this PR:
REPOSITORY TAG IMAGE ID CREATED SIZE
andygrove/comet latest 223fb8a90579 21 seconds ago 8.42GB
With the version suggested below:
REPOSITORY TAG IMAGE ID CREATED SIZE
andygrove/comet latest 44695163de5d 31 seconds ago 996MB
Suggested multi-stage Dockerfile:

FROM apache/spark:3.4.2 AS builder

USER root

# Installing JDK11 as the image comes with JRE
RUN apt update \
    && apt install -y git \
    && apt install -y curl \
    && apt install -y openjdk-11-jdk \
    && apt clean

RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
ENV PATH="/root/.cargo/bin:${PATH}"
ENV RUSTFLAGS="-C debuginfo=line-tables-only -C incremental=false"
ENV SPARK_VERSION=3.4
ENV SCALA_VERSION=2.12

# Pick the JDK instead of JRE to compile Comet
RUN cd /opt \
    && git clone https://github.com/apache/datafusion-comet.git \
    && cd datafusion-comet \
    && JAVA_HOME=$(readlink -f $(which javac) | sed "s/\/bin\/javac//") make release PROFILES="-Pspark-$SPARK_VERSION -Pscala-$SCALA_VERSION"

FROM apache/spark:3.4.2

ENV SPARK_VERSION=3.4
ENV SCALA_VERSION=2.12

USER root

COPY --from=builder /opt/datafusion-comet/spark/target/comet-spark-spark${SPARK_VERSION}_$SCALA_VERSION-0.1.0-SNAPSHOT.jar $SPARK_HOME/jars

USER ${spark_uid}
Locally I am seeing this warning:
1 warning found (use --debug to expand):
- UndefinedVar: Usage of undefined variable '$spark_uid' (line 33)
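One way to address that warning (a sketch, not part of this PR) is to declare the build arg with a default before it is referenced. The upstream apache/spark Dockerfiles use uid 185, though that should be verified against the base image:

```dockerfile
# Hypothetical fix for the UndefinedVar warning: declare the arg with a
# default before referencing it. 185 is the uid used by the upstream
# apache/spark images (an assumption - verify against the base image).
ARG spark_uid=185
USER ${spark_uid}
```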
Done
kube/Dockerfile (outdated)
@@ -0,0 +1,26 @@
FROM apache/spark:3.4.2
Need to add ASF header
# Installing JDK11 as the image comes with JRE
RUN apt update \
    && apt install -y git \
    && apt install -y curl \
    && apt install -y openjdk-11-jdk \
    && apt clean
Doesn't apache/spark:3.4.2 already have Java installed?
I think it has a JRE but not a JDK, based on the comment.
Yeah, it's only a JRE. The folder name is confusing because it says openjdk, but in fact there is no javac in the bin folder.
The Dockerfile builds without error for me. I have not tried deploying it in k8s.
Should there be a README as well, or any update to the user or contributor guides?
Thanks @andygrove, I'll add docs later when I wrap up the local kubectl work. I'm not sure how useful a raw Dockerfile is on its own; the user most likely wants to run it in a cluster?
Personally, I'm fine with adding the Dockerfile first and docs later.
kube/Dockerfile (outdated)
# limitations under the License.
#

FROM apache/spark:3.4.2 AS builder
We are using 3.4.3.
It would be good to keep this consistent with our main branch.
ENV SPARK_VERSION=3.4
ENV SCALA_VERSION=2.12
USER root
COPY --from=builder /opt/datafusion-comet/spark/target/comet-spark-spark${SPARK_VERSION}_$SCALA_VERSION-0.1.0-SNAPSHOT.jar $SPARK_HOME/jars
Perhaps we should change 0.1.0-SNAPSHOT to an ENV variable?
We are going to parameterize the variables in a follow-up PR.
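For what it's worth, that follow-up might look roughly like this (a sketch; COMET_VERSION is a hypothetical build arg, not something defined in this PR):

```dockerfile
# Hypothetical parameterization of the Comet version (names are illustrative).
ARG COMET_VERSION=0.1.0-SNAPSHOT
ENV COMET_VERSION=${COMET_VERSION}

RUN cp /opt/datafusion-comet/spark/target/comet-spark-spark${SPARK_VERSION}_${SCALA_VERSION}-${COMET_VERSION}.jar $SPARK_HOME/jars
```

A user could then override the version at build time with docker build --build-arg COMET_VERSION=0.2.0, for example.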
kube/Dockerfile (outdated)
# limitations under the License.
#

FROM apache/spark:3.4.2 AS builder
We could use ARG to parameterize this:

ARG BASE_IMAGE=apache/spark:3.4.3
FROM ${BASE_IMAGE}

Then the user can specify a different version of the Spark image:

docker build -f kube/Dockerfile --platform linux/amd64 \
  --build-arg BASE_IMAGE=MY_BASE_IMAGE \
  ...

I don't think we need to parameterize this as part of this PR, but we could potentially parameterize a number of the ENV variables similarly.
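One wrinkle worth noting if the ENV variables are parameterized this way: ARG values exist only at build time, and in a multi-stage build an ARG must be declared (or re-declared) inside each stage that uses it. A sketch of the ARG-to-ENV bridge, with defaults matching this Dockerfile:

```dockerfile
# Sketch: bridge build args into ENV so the values survive into the image
# and remain overridable via --build-arg.
ARG SPARK_VERSION=3.4
ARG SCALA_VERSION=2.12
ENV SPARK_VERSION=${SPARK_VERSION}
ENV SCALA_VERSION=${SCALA_VERSION}
```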
That's a neat idea. I'll do the parameterization in a follow-up PR; it also requires some documentation.
* Create Comet docker file
Which issue does this PR close?
Closes #643 .
Rationale for this change
Create a Comet Dockerfile that builds a Spark-based image with the Comet jar integrated.
What changes are included in this PR?
How are these changes tested?