chore: Create initial release process scripts for official ASF source release #429

Merged 13 commits on Jun 8, 2024
4 changes: 4 additions & 0 deletions .gitignore
@@ -11,3 +11,7 @@ dependency-reduced-pom.xml
core/src/execution/generated
prebuild
.flattened-pom.xml
rat.txt
filtered_rat.txt
dev/dist
apache-rat-*.jar
10 changes: 7 additions & 3 deletions core/Cargo.toml
@@ -16,17 +16,21 @@
# under the License.

[package]
name = "comet"
name = "datafusion-comet"
Review comment (Member Author):
There is already a comet crate, so I renamed this. I have reserved the crate: https://crates.io/crates/datafusion-comet

version = "0.1.0"
homepage = "https://datafusion.apache.org/comet"
repository = "https://github.com/apache/datafusion-comet"
authors = ["Apache DataFusion <dev@datafusion.apache.org>"]
description = "Apache DataFusion Comet: High performance accelerator for Apache Spark"
readme = "README.md"
license = "Apache-2.0"
edition = "2021"
include = [
"benches/*.rs",
"src/**/*.rs",
"Cargo.toml",
]

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
parquet-format = "4.0.0" # This must be kept in sync with that from parquet crate
arrow = { git = "https://github.com/viirya/arrow-rs.git", rev = "3f1ae0c", features = ["prettyprint", "ffi", "chrono-tz"] }
85 changes: 85 additions & 0 deletions dev/release/README.md
@@ -0,0 +1,85 @@
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->

# Comet Release Process

This document describes how to create an official source release of Apache DataFusion Comet.

The release process is based on the parent Apache DataFusion project, so please refer to the
[DataFusion Release Process](https://github.com/apache/datafusion/blob/main/dev/release/README.md) for detailed
instructions if you are not familiar with the release process here.

Here is a brief overview of the steps involved in creating a release:

## Creating the Release Candidate

This part of the process can be performed by any committer.

- Create and merge a PR to update the version number & update the changelog
- Push a release candidate tag (e.g. 0.1.0-rc1) to the Apache repository

## Publishing the Release Candidate

Most of this part of the process can only be performed by a PMC member.

- Run the create-tarball script to create the source tarball and upload it to the dev subversion repository
- Start an email voting thread
- Once the vote passes, run the release-tarball script to move the tarball to the release subversion repository
- Register the release with the [Apache Reporter Service](https://reporter.apache.org/addrelease.html?datafusion) using
a version such as `COMET-0.1.0`
- Delete old release candidates and releases from the subversion repositories
- Push a release tag (e.g. 0.1.0) to the Apache repository
- Reply to the vote thread to close the vote and announce the release
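
As a rough illustration of the naming used in the steps above, the following Python sketch derives the release candidate tag, tarball name, and dev-area URL from a version and RC number. It mirrors the variables in `dev/release/create-tarball.sh` from this PR; the `ReleaseNames` class and the `release_names` function are hypothetical and exist only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReleaseNames:
    """Names derived from a version and release candidate number (illustrative only)."""
    tag: str
    release: str
    tarball: str
    dev_url: str


def release_names(version: str, rc: int) -> ReleaseNames:
    """Build the names the same way the variables in create-tarball.sh are built."""
    release = f"apache-datafusion-comet-{version}"
    return ReleaseNames(
        tag=f"{version}-rc{rc}",
        release=release,
        tarball=f"{release}.tar.gz",
        dev_url=f"https://dist.apache.org/repos/dist/dev/datafusion/{release}-rc{rc}",
    )


if __name__ == "__main__":
    names = release_names("0.1.0", 1)
    print(names.tag)      # 0.1.0-rc1
    print(names.tarball)  # apache-datafusion-comet-0.1.0.tar.gz
    print(names.dev_url)
```

Note that the tarball name itself carries no `-rc` suffix; only the tag and the directory in the dev area do.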

## Publishing JAR Files to Maven

The process for publishing JAR files to Maven is not defined yet.

## Publishing to crates.io

We may choose to publish the `datafusion-comet` crate to crates.io so that other Rust projects can use the
Spark-compatible operators and expressions outside of Spark.

## Verifying Release Candidates

The vote email links to this section of the document, so this is where the instructions for verifying a release
candidate belong.

The `dev/release/verify-release-candidate.sh` script in this repository can assist in the verification
process. It checks the hashes and runs the build. It does not run the test suite because the tests take a long time
for this project and already run in CI before we create the release candidate, so running them
again is somewhat redundant.

```shell
./dev/release/verify-release-candidate.sh 0.1.0 1
```
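
The hash check can also be done by hand. The following is a minimal Python sketch, not the logic of `verify-release-candidate.sh` itself (that script is not shown here). It assumes the `.sha512` file uses the `<hex digest>  <file name>` layout that `shasum -a 512` writes; the helper names `sha512_of` and `verify_sha512` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha512_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-512 digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_sha512(tarball: Path, checksum_file: Path) -> bool:
    """Compare a tarball's digest with the first field of its checksum file."""
    fields = checksum_file.read_text().split()
    if not fields:
        # An empty checksum file cannot confirm anything.
        return False
    return sha512_of(tarball) == fields[0].lower()
```

For example, `verify_sha512(Path("apache-datafusion-comet-0.1.0.tar.gz"), Path("apache-datafusion-comet-0.1.0.tar.gz.sha512"))` returns `True` only when the digests match. This covers the checksum only; the detached `.asc` signature still needs to be verified separately with `gpg`.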

We hope that users will verify the release beyond running this script by testing the release candidate with their
existing Spark jobs and reporting any functional issues or performance regressions.

Another way of verifying the release is to follow the
[Comet Benchmarking Guide](https://datafusion.apache.org/comet/contributor-guide/benchmarking.html) and compare
performance with the previous release.

## Post Release Activities

Writing a blog post about the release is a great way to generate more interest in the project. We typically create a
Google document where the community can collaborate on a blog post. Once the content is agreed, a PR can be
created against the [datafusion-site](https://github.com/apache/datafusion-site) repository to add the blog post. Any
contributor can drive this process.
59 changes: 59 additions & 0 deletions dev/release/check-rat-report.py
@@ -0,0 +1,59 @@
#!/usr/bin/python
##############################################################################
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
Review comment (Contributor):
We do have a RAT check in Maven. Is this check a replacement, or does it cover another use case?

Review comment (Member Author):
It looks like we somehow missed adding the RAT check in maven. I was also surprised.

Review comment (Member Author):
@comphead I misread your comment. Where is the rat check in maven?

Review comment (Member Author):
nm, I see it now.

We need to run the rat check against the source tarball before we release it. The Maven build checks the contents of the repo, not the release tarball, so we may need both. I will check how we are doing this in DataFusion.

Review comment (Contributor):
Yeah, ideally, to remove the duplication, perhaps we can remove the RAT check in Maven and rely only on the release one.

Review comment (Member Author):
DataFusion uses the same rat script from CI and from verify-release.sh.

If we do the same with Comet and remove the maven task it could cause more work for developers because they would not catch issues as part of a standard maven build and would have to run a separate script. I am not sure if it is worth the effort to change this right now.

#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
##############################################################################
import fnmatch
import re
import sys
import xml.etree.ElementTree as ET

if len(sys.argv) != 3:
    sys.stderr.write("Usage: %s exclude_globs.lst rat_report.xml\n" %
                     sys.argv[0])
    sys.exit(1)

exclude_globs_filename = sys.argv[1]
xml_filename = sys.argv[2]

# One glob pattern per line in the exclude file
with open(exclude_globs_filename, "r") as f:
    globs = [line.strip() for line in f]

tree = ET.parse(xml_filename)
root = tree.getroot()
resources = root.findall('resource')

all_ok = True
for r in resources:
    approvals = r.findall('license-approval')
    if not approvals or approvals[0].attrib['name'] == 'true':
        continue
    # Strip the leading directory (the tarball prefix) before matching
    clean_name = re.sub('^[^/]+/', '', r.attrib['name'])
    excluded = False
    for g in globs:
        if fnmatch.fnmatch(clean_name, g):
            excluded = True
            break
    if not excluded:
        sys.stdout.write("NOT APPROVED: %s (%s): %s\n" % (
            clean_name, r.attrib['name'], approvals[0].attrib['name']))
        all_ok = False

if not all_ok:
    sys.exit(1)

print('OK')
sys.exit(0)
135 changes: 135 additions & 0 deletions dev/release/create-tarball.sh
@@ -0,0 +1,135 @@
#!/bin/bash
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
#

# Adapted from https://github.com/apache/arrow-rs/tree/master/dev/release/create-tarball.sh

# This script creates a signed tarball in
# dev/dist/apache-datafusion-comet-<version>-rc<rc>/apache-datafusion-comet-<version>.tar.gz,
# uploads it to the "dev" area of the dist.apache.org repository, and prints a
# draft email to send to the dev@datafusion.apache.org list for a formal vote.
#
# See dev/release/README.md for full release instructions
#
# Requirements:
#
# 1. gpg set up for signing, with your public key uploaded
# to https://pgp.mit.edu/
#
# 2. Logged into the apache svn server with the appropriate
# credentials
#
# 3. Install the requests python package
#
#
# Based in part on 02-source.sh from apache/arrow
#

set -e

DEV_RELEASE_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
DEV_RELEASE_TOP_DIR="$(cd "${DEV_RELEASE_DIR}/../../" && pwd)"

if [ "$#" -ne 2 ]; then
    echo "Usage: $0 <version> <rc>"
    echo "ex. $0 4.1.0 2"
    exit 1
fi

if [[ -z "${GH_TOKEN}" ]]; then
    echo "Please set personal github token through GH_TOKEN environment variable"
    exit 1
fi

version=$1
rc=$2
tag="${version}-rc${rc}"

release_hash=$(cd "${DEV_RELEASE_TOP_DIR}" && git rev-list --max-count=1 "${tag}")

if [ -z "$release_hash" ]; then
    echo "Cannot continue: unknown git tag: ${tag}"
    exit 1
fi

release=apache-datafusion-comet-${version}
distdir=${DEV_RELEASE_TOP_DIR}/dev/dist/${release}-rc${rc}
tarname=${release}.tar.gz
tarball=${distdir}/${tarname}
url="https://dist.apache.org/repos/dist/dev/datafusion/${release}-rc${rc}"

echo "Attempting to create ${tarball} from tag ${tag}"

echo "Draft email for dev@datafusion.apache.org mailing list"
echo ""
echo "---------------------------------------------------------"
cat <<MAIL
To: dev@datafusion.apache.org
Subject: [VOTE] Release Apache DataFusion Comet ${version} RC${rc}
Hi,

I would like to propose a release of Apache DataFusion Comet version ${version}.

This release candidate is based on commit: ${release_hash} [1]
The proposed release tarball and signatures are hosted at [2].
The changelog is located at [3].

Please download, verify checksums and signatures, run the unit tests, and vote
on the release. The vote will be open for at least 72 hours.

Only votes from PMC members are binding, but all members of the community are
encouraged to test the release and vote with "(non-binding)".

The standard verification procedure is documented at https://github.com/apache/datafusion-comet/blob/main/dev/release/README.md#verifying-release-candidates.

[ ] +1 Release this as Apache DataFusion Comet ${version}
[ ] +0
[ ] -1 Do not release this as Apache DataFusion Comet ${version} because...

Here is my vote:

+1

[1]: https://github.com/apache/datafusion-comet/tree/${release_hash}
[2]: ${url}
[3]: https://github.com/apache/datafusion-comet/blob/${release_hash}/CHANGELOG.md
MAIL
echo "---------------------------------------------------------"


# create <tarball> containing the files in git at $release_hash
# the files in the tarball are prefixed with ${release}/ (e.g. apache-datafusion-comet-0.1.0/)
mkdir -p ${distdir}
(cd "${DEV_RELEASE_TOP_DIR}" && git archive ${release_hash} --prefix ${release}/ | gzip > ${tarball})

echo "Running rat license checker on ${tarball}"
${DEV_RELEASE_DIR}/run-rat.sh ${tarball}

echo "Signing tarball and creating checksums"
gpg --armor --output ${tarball}.asc --detach-sig ${tarball}
# create checksums using the relative path of the tarball
# so that they can be verified with a command such as
# shasum --check apache-datafusion-comet-0.1.0.tar.gz.sha512
(cd ${distdir} && shasum -a 256 ${tarname}) > ${tarball}.sha256
(cd ${distdir} && shasum -a 512 ${tarname}) > ${tarball}.sha512


echo "Uploading to datafusion dist/dev at ${url}"
Review comment (Contributor):
Can committers do this or only PMC?

Review comment (Contributor):
We should probably add to the README that the entire release process requires PMC (or committer if that is sufficient).

Review comment (Member Author):
Some steps can be completed by committers (such as creating the PR to update the project version number and generating the changelog) but some steps do require PMC members.

I can add this to the README.

Review comment (Contributor):
I think committer permission should be sufficient? The bandwidth of the PMC is limited, so it would be great if we could make committer permission sufficient. When releasing Apache Uniffle, I don't think we needed a PMC member to be the release manager.

Review comment (Member Author):
I have expanded the README to explain the different steps in the release process and who can do each part.

Review comment (Member Author):
I also added some better info on verifying release candidates.

Review comment (Member Author):
> I think committer permission should be sufficient? The bandwidth of the PMC is limited, so it would be great if we could make committer permission sufficient. When releasing Apache Uniffle, I don't think we needed a PMC member to be the release manager.

Only PMC members have write access to the subversion repositories as far as I know.

Review comment (Contributor):
Also, IIRC, when we do binary releases, for maven artifacts, promoting release candidates to releases also requires PMC.

Review comment (Contributor):
Thanks for the information, good to know.

svn co --depth=empty https://dist.apache.org/repos/dist/dev/datafusion ${DEV_RELEASE_TOP_DIR}/dev/dist
svn add ${distdir}
svn ci -m "Apache DataFusion Comet ${version} ${rc}" ${distdir}
16 changes: 16 additions & 0 deletions dev/release/rat_exclude_files.txt
@@ -0,0 +1,16 @@
*.gitignore
*.dockerignore
.github/pull_request_template.md
.gitmodules
core/Cargo.lock
core/testdata/backtrace.txt
core/testdata/stacktrace.txt
docs/spark_builtin_expr_coverage.txt
docs/source/contributor-guide/benchmark-results/**/*.json
rust-toolchain
spark/src/test/resources/tpcds-query-results/*.out
spark/src/test/resources/tpcds-plan-stability/approved-plans*/**/explain.txt
spark/src/test/resources/tpcds-plan-stability/approved-plans*/**/simplified.txt
spark/src/test/resources/tpch-query-results/*.out
spark/src/test/resources/tpch-extended/q1.sql
spark/inspections/CometTPC*results.txt
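
To make the matching behaviour of these patterns concrete, here is a small Python sketch that follows the same two steps as `check-rat-report.py` above: strip the leading directory from the reported name, then test the result against each glob with `fnmatch`. Note that `fnmatch` does not treat `/` specially, so `*` (and therefore `**`) can match across directory separators. The helper name `is_excluded` and the sample paths are made up for illustration.

```python
import fnmatch
import re
from typing import Iterable


def is_excluded(reported_name: str, globs: Iterable[str]) -> bool:
    """Return True if the name, minus its leading directory, matches any glob."""
    # The RAT report names files relative to the tarball prefix directory,
    # so drop everything up to and including the first "/".
    clean_name = re.sub(r"^[^/]+/", "", reported_name)
    return any(fnmatch.fnmatch(clean_name, g) for g in globs)


# A few patterns taken from the exclude list above
sample_globs = [
    "*.gitignore",
    "core/Cargo.lock",
    "spark/src/test/resources/tpch-query-results/*.out",
]

if __name__ == "__main__":
    prefix = "apache-datafusion-comet-0.1.0/"
    print(is_excluded(prefix + "core/Cargo.lock", sample_globs))   # True
    print(is_excluded(prefix + "docs/.gitignore", sample_globs))   # True: "*" also matches "docs/"
    print(is_excluded(prefix + "core/src/lib.rs", sample_globs))   # False
```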