-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add code change guide contributor-doc #30879
Conversation
d916d86
to
dceb59e
Compare
Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment |
Assigning reviewers. If you would like to opt out of this review, comment R: @robertwb added as fallback since no labels match configuration Available commands:
The PR bot will only process comments in the main thread (not review comments). |
R: @rszper |
Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control |
``` | ||
./gradlew :runners:google-cloud-dataflow-java:worker:shadowJar | ||
``` | ||
The jar is located in the build output |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The jar is located in the build output | |
The jar is located in the build output. |
./gradlew :runners:google-cloud-dataflow-java:worker:shadowJar | ||
``` | ||
The jar is located in the build output | ||
2. Pass the pipelineOption |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2. Pass the pipelineOption | |
2. Use the following command to pass `pipelineOption`: |
--dataflowWorkerJar=/.../beam-runners-google-cloud-dataflow-java-legacy-worker-2.XX.0-SNAPSHOT.jar | ||
``` | ||
|
||
* [Dataflow runner v2] If `sdks/java/harness` or its dependency (like `sdks/java/core`) are changed: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* [Dataflow runner v2] If `sdks/java/harness` or its dependency (like `sdks/java/core`) are changed: | |
* If you're using Dataflow Runner v2 and `sdks/java/harness` or its dependency (like `sdks/java/core`) have changed, do the following: |
``` | ||
|
||
* [Dataflow runner v2] If `sdks/java/harness` or its dependency (like `sdks/java/core`) are changed: | ||
1. Build SDK harness container |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1. Build SDK harness container | |
1. Use the following command to build the SDK harness container: |
"us.gcr.io/apache-beam-testing/beam_java11_sdk:2.49.0-custom" # change to your container registry | ||
docker push "us.gcr.io/apache-beam-testing/beam_java11_sdk:2.49.0-custom" | ||
``` | ||
2. Run a Runner v2 pipeline with the following options: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2. Run a Runner v2 pipeline with the following options: | |
2. Run the pipeline with the following options: |
--sdkContainerImage="us.gcr.io/apache-beam-testing/beam_java11_sdk:2.49.0-custom" | ||
``` | ||
|
||
# Python Guide |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# Python Guide | |
# Python guide |
|
||
# Python Guide | ||
|
||
Beam Python SDK is distributed as a single wheel (which is much simpler than Java SDK), and the Python development is consequently more straightforward. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Beam Python SDK is distributed as a single wheel (which is much simpler than Java SDK), and the Python development is consequently more straightforward. | |
The Beam Python SDK is distributed as a single wheel, which is more straightforward than the Java SDK. Python development is consequently more less complicated. |
|
||
## Console (shell) setup | ||
|
||
In the following, working directory is set to `sdks/python`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the following, working directory is set to `sdks/python`. | |
These instructions explain how to configure your console. In this example, the working directory is set to `sdks/python`. |
|
||
In the following, working directory is set to `sdks/python`. | ||
|
||
1. Install python interpreter with pyenv (recommended) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1. Install python interpreter with pyenv (recommended) | |
1. Recommended: Install the Python interpreter by using `pyenv`. Use the following commands: |
a. install prerequisites | ||
b. `curl https://pyenv.run | bash` | ||
c. `pyenv install 3.X` (a supported Python version, refer to python_version in [project property](https://github.com/apache/beam/blob/master/gradle.properties)) | ||
2. Setup and activate virtual environment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2. Setup and activate virtual environment | |
2. Use the following commands to set up and activate the virtual environment: |
|
||
2. Install the Beam SDK under your Python virtual environment with necessary extensions (e.g. `pip install /path/to/apache-beam.tar.gz[gcp]`) | ||
|
||
3. Initiate your Python script running the Pipeline with command like |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
3. Initiate your Python script running the Pipeline with command like | |
3. Initiate your Python script, running the Pipeline with command similar to the following example: |
python my_pipeline.py --runner=DataflowRunner --sdk_location=/path/to/apache-beam.tar.gz --project=my_project --region=us-central1 --temp_location=gs://my-bucket/temp ... | ||
``` | ||
|
||
Tips for Dataflow runner: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tips for Dataflow runner: | |
Tips for using the Dataflow runner: |
|
||
Tips for Dataflow runner: | ||
|
||
* Python worker will install the provided Apache Beam SDK before processing work items, so usually no need to provide custom worker container unless your GCP VM does not have Internet access and transient dependencies are changed from the officially released container images. In this case, refer to "Building containers for modified SDK code" instruction above |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* Python worker will install the provided Apache Beam SDK before processing work items, so usually no need to provide custom worker container unless your GCP VM does not have Internet access and transient dependencies are changed from the officially released container images. In this case, refer to "Building containers for modified SDK code" instruction above | |
* The Python worker installs the Apache Beam SDK before processing work items. Therefore, you don't usually need to provide a custom worker container. If your Google Cloud VM doesn't have Internet access and transient dependencies are changed from the officially released container images, you do need to provide a custom worker container. In this case, see the section "Build containers for modified SDK code." |
|
||
* Python worker will install the provided Apache Beam SDK before processing work items, so usually no need to provide custom worker container unless your GCP VM does not have Internet access and transient dependencies are changed from the officially released container images. In this case, refer to "Building containers for modified SDK code" instruction above | ||
|
||
* Installing Beam Python SDK from source is known to be slow (3.5 min for `n1-standard-1` machine). Alternatively, if the host machine is amd64 architecture, one can build a wheel (instead of a tarball) with command with command (e.g. `./gradle :sdks:python:bdistPy311linux` for Python 3.11) and pass the built wheel to `--sdk_location`. The installation should then just take seconds. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* Installing Beam Python SDK from source is known to be slow (3.5 min for `n1-standard-1` machine). Alternatively, if the host machine is amd64 architecture, one can build a wheel (instead of a tarball) with command with command (e.g. `./gradle :sdks:python:bdistPy311linux` for Python 3.11) and pass the built wheel to `--sdk_location`. The installation should then just take seconds. | |
* Installing the Beam Python SDK from source can be slow (3.5 minutes for a`n1-standard-1` machine). As an alternative, if the host machine uses amd64 architecture, you can build a wheel instead of a tarball by using a command similar to `./gradle :sdks:python:bdistPy311linux` (for Python 3.11). Pass the built wheel using the `--sdk_location` option. That installation completes in seconds. |
|
||
### Caveat - `save_main_session` | ||
|
||
* `NameError` when running DoFn on remote runner |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* `NameError` when running DoFn on remote runner | |
* `NameError` when running `DoFn` on remote runner |
Co-authored-by: Rebecca Szper <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks much better! Just a bit more cleanup.
Co-authored-by: Rebecca Szper <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a lot of good information here, but it's unclear if it's targeted at the casual (I'm just trying to contribute a single transform) vs. advanced (I just got hired to develop Beam for a living) developer. Also, how much redundancy is there with the existing docs/wiki?
* `sdks/java/core` Java core | ||
* `sdks/java/harness` SDK harness (entrypoint of SDK container) | ||
|
||
* `runners` runner supports, written in Java. For example, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are only relevant for Java. It might make sense to state that for everything but Java, all relevant files are in sdks/LANG, whereas for Java the code is split up more widely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added this information to code path:
"Java code paths are mainly found in two directories:" ...
"For SDKS in other language, all relevant files are in sdks/LANG
, for example," ...
|
||
To check if your environment is set up, follow these steps: | ||
|
||
Your `PATH` needs to have the following elements configured. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are not needed for all development, e.g. one can do much Go/Python development without java installed, and vice versa.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clarified to "Depending on the languages involved, your PATH
needs to have ..." and for each item below there is a comment about when it is needed
* A Go environment. Install the latest Go version. | ||
* Needed for Go SDK development and SDK container change (for all SDKs), because | ||
the container entrypoint scripts are written in Go. | ||
* A Docker environment. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need docker (and, consequently, go) for cross-language anymore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, I noted "some cross-language functionality (if you run a SDK container image)". To clarify further, added note "not required since Beam 2.53.0")
It aims to help both casual and future long-term developers quickstart. As for redundancy, it is a combination of several internal docs (see internal tracker 283183339). But every often we find user want to need to patch Beam code to fix issues on their side, and thus need a public facing guide, and ideally a self contained one (that easy to follow) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a "last updated" note somewhere in this file? Any thoughts on keeping things up to date as Beam evolves?
added "last updated" timestamp. Keeping up-to-date is not easy. Our wiki has some similar instructions but largely outdated (they refer largely to Jenkins). I try to avoid language that known to be outdated in the future, e.g. specific java and python version. If not possible noted like "Java 8 as of 2024" |
Please add a meaningful description for your change here
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
addresses #123
), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, commentfixes #<ISSUE NUMBER>
instead.CHANGES.md
with noteworthy changes.See the Contributor Guide for more tips on how to make review process smoother.
To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.