chore: Create initial release process scripts for official ASF source release #429

Merged: 13 commits into apache:main, Jun 8, 2024

Conversation

@andygrove (Member) commented May 14, 2024:

Which issue does this PR close?

Part of #394

Rationale for this change

We need to establish the basic ASF source release process.

What changes are included in this PR?

  • Copy create-tarball, release-tarball, verify-release-candidate scripts from DataFusion and modify them for Comet
  • Change the crate name from comet to datafusion-comet (comet was taken, and I have reserved datafusion-comet in crates.io)
  • Add placeholder README

How are these changes tested?

They are not tested yet. I will test them prior to our first release.

@@ -16,7 +16,7 @@
 # under the License.

 [package]
-name = "comet"
+name = "datafusion-comet"
@andygrove (Member, Author) commented:

There is already a comet crate, so I renamed this. I have reserved the crate: https://crates.io/crates/datafusion-comet

@andygrove marked this pull request as ready for review on May 14, 2024 16:14
@andygrove requested review from sunchao and viirya on May 14, 2024 16:14
@andygrove (Member, Author) commented:

@snmvaughan @parthchandra please review when you have the time

@viirya (Member) left a comment:

Hmm, I think this is for publishing a cargo package. But I think Comet should be released as a maven package to a maven repository? So users can simply add Comet as a dependency in their pom file?

@andygrove (Member, Author) commented:

> Hmm, I think this is for publishing a cargo package. But I think Comet should be released as a maven package to a maven repository? So users can simply add Comet as a dependency in their pom file?

It depends on whether we want to allow other Rust projects to use the Spark-compatible expressions that we are implementing, outside of the Comet/Spark integration.

The maven release will be needed for sure.

@andygrove changed the title from "chore: Create initial release process scripts" to "chore: Create initial release process scripts for official ASF source release" on May 14, 2024
core/Cargo.toml (outdated, resolved)
@andygrove (Member, Author) commented:

> Hmm, I think this is for publishing a cargo package. But I think Comet should be released as a maven package to a maven repository? So users can simply add Comet as a dependency in their pom file?

Also, regardless of whether we choose to publish the crate or not, I think it is better not to use the name of an existing crate, and it is good to have the author information consistent with other DataFusion crates.

@andygrove (Member, Author) commented:

@viirya @comphead @parthchandra @kazuyukitanimura @huaxingao @advancedxy Could I get more feedback on this PR so that we can prepare for the upcoming 0.1.0 source release? We are planning to release once we have switched back to depending on an official DataFusion release.

@advancedxy (Contributor) left a comment:

Left some minor comments. I think the scripts are in good shape.

core/Cargo.toml Outdated
homepage = "https://datafusion.apache.org/comet"
repository = "https://github.com/apache/datafusion-comet"
authors = ["Apache DataFusion <[email protected]>"]
description = "Apache DataFusion Comet: Apache Spark native query execution plugin"
Contributor commented:

To be consistent with the README.md in the repo, this could be changed to "Apache DataFusion Comet: a high-performance accelerator for Apache Spark"?

@andygrove (Member, Author) commented:

It would be good to standardize our description.

Contributor commented:

I would vote for the description in the README.
Nit: high performance is two words (i.e., not hyphenated).

@andygrove (Member, Author) commented:

I have updated this.


set -e

SOURCE_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
Contributor commented:

I would rather call it DEV_RELEASE_DIR.

(cd "${SOURCE_TOP_DIR}" && git archive ${release_hash} --prefix ${release}/ | gzip > ${tarball})

echo "Running rat license checker on ${tarball}"
${SOURCE_DIR}/run-rat.sh ${tarball}
Contributor commented:

I think you missed this run-rat.sh in this commit.
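
For reference, a minimal sketch of what such a run-rat.sh typically looks like, modeled on the DataFusion/Arrow release scripts (the RAT version and download URL here are illustrative assumptions; the report-filtering lines match the snippet reviewed later in this PR):

RELEASE_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
RAT_VERSION=0.13

# download Apache RAT if it is not already present (version/URL illustrative)
if [ ! -f apache-rat-${RAT_VERSION}.jar ]; then
  curl -s "https://repo1.maven.org/maven2/org/apache/rat/apache-rat/${RAT_VERSION}/apache-rat-${RAT_VERSION}.jar" > apache-rat-${RAT_VERSION}.jar
fi
RAT="java -jar apache-rat-${RAT_VERSION}.jar -x"

# generate the rat report for the tarball passed as $1, then filter out known exclusions
$RAT $1 > rat.txt
python3 $RELEASE_DIR/check-rat-report.py $RELEASE_DIR/rat_exclude_files.txt rat.txt > filtered_rat.txt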

Comment on lines 124 to 125
cargo build
cargo test --all --features=avro
Contributor commented:

If we are going to build and test the Rust code, should we do it for the Java/Scala code as well?

It just occurred to me that maybe we should turn this script into a GitHub Action and provide it in the vote email, so that the basic verification is already performed and checked. The community is also welcome to test more combinations.
This is not a blocker.

@andygrove (Member, Author) commented:

Great point about testing the scala code. I will add that.
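
A sketch of the kind of addition discussed, assuming the verify script keeps its current shape (the mvnw invocation here is illustrative, not the exact change that landed):

# build and test the Rust code
pushd core
cargo build
cargo test --all --features=avro
popd
# also build and test the Java/Scala side so the release is verified end to end
./mvnw clean package -P"spark-3.4"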

# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
Contributor commented:

We do have a RAT check from maven; is this check a replacement, or does it cover another use case?

@andygrove (Member, Author) commented:

It looks like we somehow missed adding the RAT check in maven. I was also surprised.

@andygrove (Member, Author) commented:

@comphead I misread your comment. Where is the rat check in maven?

@andygrove (Member, Author) commented:

Never mind, I see it now.

We need to run the rat check against the source tarball before we release it. The maven project is checking the contents of the repo, not the release tarball, so maybe we need both. I will check how we are doing this in DataFusion.

Contributor commented:

Yeah, ideally, to remove duplication, perhaps we can remove the RAT check from maven and keep only the release one.

@andygrove (Member, Author) commented:

DataFusion uses the same rat script from CI and from verify-release.sh.

If we do the same with Comet and remove the maven task, it could cause more work for developers, because they would not catch issues as part of a standard maven build and would have to run a separate script. I am not sure it is worth the effort to change this right now.

@kazuyukitanimura (Contributor) left a comment:

Mostly looks good. I feel we can merge and start using this.

pushd core
RUSTFLAGS="-Ctarget-cpu=native" cargo build --release
popd
./mvnw install -Prelease -DskipTests -P"spark-3.4" -Dmaven.gitcommitid.skip=true
Contributor commented:

Spark-3.4 is hardcoded now. What about making it a parameter?

Also, does it make sense to use one of our make commands?

@andygrove (Member, Author) commented:

We don't currently have a make command that specifies -Dmaven.gitcommitid.skip=true, although one is being added in #512.

Contributor commented:

Right, one hack could be something like make release PROFILES="-Dmaven.gitcommitid.skip=true"

Contributor commented:

Should we run tests as well? All comet tests pass for me when I run them, though they do take a while to run.

Contributor commented:

> Right, one hack could be something like make release PROFILES="-Dmaven.gitcommitid.skip=true"

That would be a hack. I would leave it as it is. I assume that we will update/deprecate some of this once we move to publishing binary releases.

@andygrove (Member, Author) commented:

> Should we run tests as well? All comet tests pass for me when I run them, though they do take a while to run.

I think that the tests take too long. We create the release candidate tarball from a tag in GitHub where we have already run all of the tests, so running them all again is potentially redundant.

Perhaps there is some subset of tests that we could run? 🤔
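
One hypothetical way to run a subset, assuming the scalatest-maven-plugin is wired up and a suite name along these lines exists (both are assumptions, not something this PR adds):

# run a single named suite instead of the full test pass (suite name hypothetical)
./mvnw test -P"spark-3.4" -Dsuites="org.apache.comet.CometExpressionSuite"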

Contributor commented:

I think we should at least build the Scala/Java code here; the tests could be skipped, since all the tests should be run in CI actions.

Contributor commented:

> Spark-3.4 is hardcoded now. What about making it a parameter?

+1 for this. It could default to 3.4, though.

> ./mvnw install -Prelease -DskipTests

IIRC, we should use ./mvnw verify instead? The install command will also deploy the comet jar into the local repository, which might mess up other tests/projects that depend on the comet jar in the future.
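
A minimal sketch of the parameterization being suggested (illustrative only; as the reply below notes, the PR instead kept the hardcoded version with an explanatory comment):

# accept the Spark version as an optional argument, defaulting to 3.4
SPARK_VERSION="${1:-3.4}"
./mvnw verify -Prelease -DskipTests -P"spark-${SPARK_VERSION}" -Dmaven.gitcommitid.skip=true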

@andygrove (Member, Author) commented:

> Spark-3.4 is hardcoded now. What about making it a parameter?

I added a comment explaining that we are testing the latest supported Spark version.

I think we can make more improvements in the future, but I would like to get the basic scripts in place first so that we have the ability to make a release.

@parthchandra (Contributor) left a comment:

LGTM. Some minor comments.


(cd ${distdir} && shasum -a 512 ${tarname}) > ${tarball}.sha512


echo "Uploading to datafusion dist/dev to ${url}"
Contributor commented:

Can committers do this or only PMC?

Contributor commented:

We should probably add to the README that the entire release process requires PMC (or committer, if that is sufficient).

@andygrove (Member, Author) commented:

Some steps can be completed by committers (such as creating the PR to update the project version number and generating the changelog), but some steps do require PMC members.

I can add this to the README.

Contributor commented:

I think committer permission should be sufficient? The PMC's bandwidth is limited, so it would be great if we could allow committers to do this. When releasing on Apache Uniffle, I don't think we needed a PMC member to be the release manager.

@andygrove (Member, Author) commented:

I have expanded the README to explain the different steps in the release process and who can do each part.

@andygrove (Member, Author) commented:

I also added some better info on verifying release candidates.
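
For context, manually verifying an ASF release candidate typically boils down to checks like these (the file names are illustrative):

# check the GPG signature and the SHA-512 checksum of the candidate tarball
gpg --verify apache-datafusion-comet-0.1.0-rc1.tar.gz.asc
shasum -a 512 -c apache-datafusion-comet-0.1.0-rc1.tar.gz.sha512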

@andygrove (Member, Author) commented:

> I think committer permission should be sufficient? The PMC's bandwidth is limited, so it would be great if we could allow committers to do this. When releasing on Apache Uniffle, I don't think we needed a PMC member to be the release manager.

Only PMC members have write access to the subversion repositories, as far as I know.
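
For anyone unfamiliar with this step, the dist/dev upload amounts to a subversion commit along these lines, which is why svn write access matters (URL, paths, and version are illustrative assumptions):

# stage the candidate artifacts in the ASF dev dist area
svn checkout https://dist.apache.org/repos/dist/dev/datafusion svn-dist-dev
mkdir -p svn-dist-dev/apache-datafusion-comet-0.1.0-rc1
cp ${tarball} ${tarball}.asc ${tarball}.sha512 svn-dist-dev/apache-datafusion-comet-0.1.0-rc1/
(cd svn-dist-dev && svn add apache-datafusion-comet-0.1.0-rc1 && svn commit -m "Apache DataFusion Comet 0.1.0 RC1")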

Contributor commented:

Also, IIRC, when we do binary releases for maven artifacts, promoting release candidates to releases also requires PMC.

Contributor commented:

Thanks for the information, good to know.



@advancedxy (Contributor) left a comment:

Left some minor comments, otherwise LGTM.


# generate the rat report
$RAT $1 > rat.txt
python3 $RELEASE_DIR/check-rat-report.py $RELEASE_DIR/rat_exclude_files.txt rat.txt > filtered_rat.txt
cat filtered_rat.txt
@advancedxy (Contributor) commented Jun 5, 2024:

I don't think this line is needed?

@andygrove (Member, Author) commented:

This shows the list of files that failed the rat test.

Contributor commented:

I see. Nit: how about moving this line into the else block at L41 and L42, so that it doesn't pollute stdout?

@andygrove (Member, Author) commented:

Good idea. I have made that change.
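
The change amounts to something like this sketch (the test and messages are illustrative; the real script already wraps these lines in an if/else):

# only dump the filtered report when unapproved licenses are found
if [ -s filtered_rat.txt ]; then
  echo "Found unapproved licenses:"
  cat filtered_rat.txt
  exit 1
else
  echo "No unapproved licenses"
fi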

core/Cargo.lock
core/testdata/backtrace.txt
core/testdata/stacktrace.txt
dev/copyright/scala-header.txt
Contributor commented:

Why this line? I think scala-header.txt is just a soft link to java-header.txt; they are identical.



@andygrove (Member, Author) commented:

@parthchandra @advancedxy I think I have addressed all of the feedback now.

@parthchandra (Contributor) commented:

> @parthchandra @advancedxy I think I have addressed all of the feedback now.

LGTM

@andygrove (Member, Author) commented:

@viirya @kazuyukitanimura @huaxingao Could I get a committer review/approval?

RUSTFLAGS="-Ctarget-cpu=native" cargo build --release
popd
# test with the latest supported version of Spark
./mvnw verify -Prelease -DskipTests -P"spark-3.4" -Dmaven.gitcommitid.skip=true
Contributor commented:

Does this mean that if we support Spark 3.5 or Spark 4.0, we need to change this line?
Wondering whether we should run it against all supported versions?

@andygrove (Member, Author) commented:

It probably makes sense to verify against the latest Spark version here. I'm hoping that people verify the release by building for their Spark version and running it against their workflows. This script is just a basic verification that the release builds OK.

@comphead (Contributor) left a comment:

LGTM, thanks @andygrove. Left a small comment.

@andygrove merged commit c87fabe into apache:main on Jun 8, 2024
43 checks passed
himadripal pushed a commit to himadripal/datafusion-comet that referenced this pull request on Sep 7, 2024
@andygrove deleted the release-process branch on December 3, 2024