From 143566c0ecb5ab0ea6aad6b69208d555b3a8b8fb Mon Sep 17 00:00:00 2001
From: tvalentyn
Date: Fri, 8 Mar 2024 20:15:04 -0800
Subject: [PATCH] Document requirements cache behavior differences. (#30493)

* Revise documentation on managing dependencies.

* Describe the staging of requirements cache behavior.

* Document that setup_file option doesn't stage dependencies.

* Apply suggestions from code review

Co-authored-by: Rebecca Szper <98840847+rszper@users.noreply.github.com>

---------

Co-authored-by: Rebecca Szper <98840847+rszper@users.noreply.github.com>
---
 .../sdks/python-pipeline-dependencies.md | 21 +++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md b/website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md
index 2261c92eef09..286e90a37053 100644
--- a/website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md
+++ b/website/www/site/content/en/documentation/sdks/python-pipeline-dependencies.md
@@ -46,11 +46,21 @@ To supply a requirements.txt file:
 
 The runner will use the `requirements.txt` file to install your additional dependencies onto the remote workers.
 
-> **NOTE**: An alternative to `pip freeze` is to use a library like [pip-tools](https://github.com/jazzband/pip-tools) to compile all the dependencies required for the pipeline from a `--requirements_file`, where only top-level dependencies are mentioned.
+> **NOTE**: As an alternative to `pip freeze`, use a library like [pip-tools](https://github.com/jazzband/pip-tools) to compile all of the dependencies required for the pipeline from a `requirements.in` file. In the `requirements.in` file, only the top-level dependencies are mentioned.
+
+When you supply the `--requirements_file` pipeline option, during pipeline submission, Beam downloads
+the specified packages locally into a requirements cache directory,
+and then stages the requirements cache directory to the runner.
+At runtime, when available, Beam installs packages from the requirements cache.
+This mechanism makes it possible to stage the dependency packages to the runner
+at submission. At runtime, the runner workers might be able to install the
+packages from the cache without needing a connection to PyPI. To disable staging the
+requirements, use the `--requirements_cache=skip` pipeline option.
+For more information, see the [help descriptions of these pipeline options](https://beam.apache.org/releases/pydoc/current/_modules/apache_beam/options/pipeline_options.html#SetupOptions).
 
 ## Custom Containers {#custom-containers}
 
-You can pass a [container](https://hub.docker.com/search?q=apache%2Fbeam&type=image) image with all the dependencies that are needed for the pipeline instead of `requirements.txt`. [Follow the instructions on how to run pipeline with Custom Container images](/documentation/runtime/environments/#running-pipelines).
+You can pass a [container](https://hub.docker.com/search?q=apache%2Fbeam&type=image) image with all the dependencies that are needed for the pipeline. [Follow the instructions the show how to run the pipeline with custom container images](/documentation/runtime/environments/#running-pipelines).
 
 1. If you are using a custom container image, we recommend that you install the dependencies from the `--requirements_file` directly into your image at build time. In this case, you do not need to pass `--requirements_file` option at runtime, which will reduce the pipeline startup time.
 
@@ -118,7 +128,10 @@ Often, your pipeline code spans multiple files. To run your project remotely, yo
 
 --setup_file /path/to/setup.py
 
-**Note:** If you [created a requirements.txt file](#pypi-dependencies) and your project spans multiple files, you can get rid of the `requirements.txt` file and instead, add all packages contained in `requirements.txt` to the `install_requires` field of the setup call (in step 1).
+**Note:** It is not necessary to supply the `--requirements_file` [option](#pypi-dependencies) if the dependencies of your package are defined in the `install_requires` field of the `setup.py` file (see step 1).
+However unlike with the `--requirements_file` option, when you use the `--setup_file` option, Beam doesn't stage the dependent packages to the runner.
+Only the pipeline package is staged. If they aren't already provided in the runtime environment,
+the package dependencies are installed from PyPI at runtime.
 
 ## Non-Python Dependencies or PyPI Dependencies with Non-Python Dependencies {#nonpython}
 
@@ -132,7 +145,7 @@ Otherwise, you must perform the following steps.
 
 **Note:** You must verify that these commands run on the remote worker. For example, if you use `apt`, the remote worker needs `apt` support.
 
-4. Run your pipeline with the following command-line option:
+3. Run your pipeline with the following command-line option:
 
 --setup_file /path/to/setup.py