From 07543abf7d66eadfb249de6fa5d05181107c8ac0 Mon Sep 17 00:00:00 2001 From: Oleh Borysevych Date: Mon, 4 Mar 2024 05:26:21 +0100 Subject: [PATCH 1/4] transform service prompt added --- .../49_transform_service.md | 58 +++++++++++++++++++ .../documentation-lookup/48_sdk_harness.md | 0 .../49_transform_service.md | 58 +++++++++++++++++++ 3 files changed, 116 insertions(+) create mode 100644 learning/prompts/documentation-lookup-nolinks/49_transform_service.md create mode 100644 learning/prompts/documentation-lookup/48_sdk_harness.md create mode 100644 learning/prompts/documentation-lookup/49_transform_service.md diff --git a/learning/prompts/documentation-lookup-nolinks/49_transform_service.md b/learning/prompts/documentation-lookup-nolinks/49_transform_service.md new file mode 100644 index 000000000000..0073a0fc22e7 --- /dev/null +++ b/learning/prompts/documentation-lookup-nolinks/49_transform_service.md @@ -0,0 +1,58 @@ +Prompt: +What is a Transformer Service in Apache Beam? + +Response: +The `Transform service` is Docker Compose service included into Apache Beam SDK versions 2.49.0 and later. You can use Transform service to upgrade or downgrade Beam SDK version of individual supported transforms in your pipeline without changing the Beam version of the pipeline. This feature is only supported for Beam Java SDK 2.53.0 and later. + +To use this feature simply execute Java pipeline with additional pipeline options specifying the URNs of the transforms you want to upgrade and the Beam version you want to upgrade to: + +```java +--transformsToOverride=beam:transform:org.apache.beam:bigquery_read:v1 --transformServiceBeamVersion=2.xy.z +``` + +In the above example, `--transformsToOverride` specifies the URN of the transform you want to upgrade or downgrade, and `--transformServiceBeamVersion` specifies the Beam version you want to upgrade to. 
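
As a purely illustrative sketch (not part of any Beam SDK), the pairing of the two options above can be composed as plain strings before they are handed to a Java launcher. Only the option names come from the text; the helper name and the assumption that multiple URNs could be comma-joined are hypothetical:

```python
# Hypothetical helper: compose the transform-upgrade pipeline options
# described above. Not part of the Beam API; the comma-joining of
# multiple URNs is an assumption made for this sketch.
def upgrade_options(transform_urns, beam_version):
    """Build the pipeline-option strings for a transform upgrade."""
    if not transform_urns:
        raise ValueError("at least one transform URN is required")
    return [
        "--transformsToOverride=" + ",".join(transform_urns),
        "--transformServiceBeamVersion=" + beam_version,
    ]

opts = upgrade_options(
    ["beam:transform:org.apache.beam:bigquery_read:v1"], "2.53.0"
)
```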
+ +The framework will automatically download the specified version of Docker containers for the transforms and use them in the pipeline. You must have Docker installed on the machine that starts the service. + +Currently the following transforms are supported: + +* BigQuery read transform: `beam:transform:org.apache.beam:bigquery_read:v1` +* BigQuery write transform: `beam:transform:org.apache.beam:bigquery_write:v1` +* Kafka read transform: `beam:transform:org.apache.beam:kafka_read_with_metadata:v2` +* Kafka write transform: `beam:transform:org.apache.beam:kafka_write:v2` + +Transform service implement Beam expansion API. This means you can use the Transform service to construct and execute multi-language pipelines. For example, you can build a Python pipeline that uses Java `KafkaIO` transform and execute in without installing Java locally. + +Transform service can be started automatically by Apache Beam SDK or manually by users: + +```java +java -jar beam-sdks-java-transform-service-app-.jar --port --beam_version --project_name --command up +``` + +Beam transform service includes a number of transforms implemented in the Apache Beam Java and Python SDKs: + +* Java transforms: Google Cloud I/O connectors, the Kafka I/O connector, and the JDBC I/O connector +* Python transforms: all portable transforms implemented within the Apache Beam Python SDK, such as RunInference and DataFrame transforms. + + + + + + + + + + + + + + + + + + + + + + + diff --git a/learning/prompts/documentation-lookup/48_sdk_harness.md b/learning/prompts/documentation-lookup/48_sdk_harness.md new file mode 100644 index 000000000000..e69de29bb2d1 diff --git a/learning/prompts/documentation-lookup/49_transform_service.md b/learning/prompts/documentation-lookup/49_transform_service.md new file mode 100644 index 000000000000..fad2e5159f56 --- /dev/null +++ b/learning/prompts/documentation-lookup/49_transform_service.md @@ -0,0 +1,58 @@ +Prompt: +What is a Transformer Service in Apache Beam? 
+ +Response: +The `Transform service` is [Docker Compose service](https://docs.docker.com/compose/) included into Apache Beam SDK versions 2.49.0 and later. You can use Transform service to upgrade or downgrade Beam SDK version of individual supported transforms in your pipeline without changing the Beam version of the pipeline. This feature is only supported for Beam Java SDK 2.53.0 and later. + +To use this feature simply execute Java pipeline with additional pipeline options specifying the URNs of the transforms you want to upgrade and the Beam version you want to upgrade to: + +```java +--transformsToOverride=beam:transform:org.apache.beam:bigquery_read:v1 --transformServiceBeamVersion=2.xy.z +``` + +In the above example, `--transformsToOverride` specifies the URN of the transform you want to upgrade or downgrade, and `--transformServiceBeamVersion` specifies the Beam version you want to upgrade to. + +The framework will automatically download the specified version of Docker containers for the transforms and use them in the pipeline. You must have Docker installed on the machine that starts the service. + +Currently the following transforms are supported: + +* BigQuery read transform: `beam:transform:org.apache.beam:bigquery_read:v1` +* BigQuery write transform: `beam:transform:org.apache.beam:bigquery_write:v1` +* Kafka read transform: `beam:transform:org.apache.beam:kafka_read_with_metadata:v2` +* Kafka write transform: `beam:transform:org.apache.beam:kafka_write:v2` + +Transform service implement Beam expansion API. This means you can use the Transform service to construct and execute multi-language pipelines. For example, you can build a Python pipeline that uses Java `KafkaIO` transform and execute in without installing Java locally. 
+ +Transform service can be started automatically by Apache Beam SDK or manually by users: + +```java +java -jar beam-sdks-java-transform-service-app-.jar --port --beam_version --project_name --command up +``` + +Beam transform service includes a number of transforms implemented in the Apache Beam Java and Python SDKs: + +* Java transforms: Google Cloud I/O connectors, the Kafka I/O connector, and the JDBC I/O connector +* Python transforms: all portable transforms implemented within the Apache Beam Python SDK, such as RunInference and DataFrame transforms. + + + + + + + + + + + + + + + + + + + + + + + From 18bc5649ea2a9f73698c14616de3735def8f2e72 Mon Sep 17 00:00:00 2001 From: Oleh Borysevych Date: Tue, 5 Mar 2024 23:55:21 +0100 Subject: [PATCH 2/4] SDK harness added --- .../48_sdk_harness.md | 19 +++++++++++++++++ .../documentation-lookup/48_sdk_harness.md | 21 +++++++++++++++++++ 2 files changed, 40 insertions(+) create mode 100644 learning/prompts/documentation-lookup-nolinks/48_sdk_harness.md diff --git a/learning/prompts/documentation-lookup-nolinks/48_sdk_harness.md b/learning/prompts/documentation-lookup-nolinks/48_sdk_harness.md new file mode 100644 index 000000000000..3bfa5b22c47d --- /dev/null +++ b/learning/prompts/documentation-lookup-nolinks/48_sdk_harness.md @@ -0,0 +1,19 @@ +Prompt: +What is SDK Harness in Apache Beam? + +Response: +Interoperability between Apache Beam SDKs and runners is a key aspect of Apache Beam. The portability framework introduces well-defined, language-neutral data structures and protocols between the SDK and runner. This interop layer – called the `Portability API` – ensures that SDKs and runners can work with each other uniformly, reducing the interoperability burden for both SDKs and runners to a constant effort. + +The portability API consists of a set of smaller contracts that isolate SDKs and runners for job submission, management and execution. These contracts use `protobuf`s and `gRPC` for broad language support. 
All SDKs currently support the portability framework. + +The SDK harness is a SDK-provided program responsible for executing user code and is run separately from the runner. SDK harness initialization relies on the Provision and `Artifact API`s for obtaining staged files, pipeline options and environment information. + +Apache Beam allows configuration of the SDK harness to accommodate varying cluster setups: + +* **environment_type**: determines where user code will be executed: + * **DOCKER**: User code is executed within a container started on each worker node. This requires docker to be installed on worker nodes (default). Use `environment_config` to specify the Docker image URL. Official Docker images are used by default. Alternatively, you can build your own image. Prebuilt SDK container images are released per supported language during Beam releases and pushed to Docker Hub. + * **PROCESS**: User code is executed by processes that are automatically started by the runner on each worker node. + * **EXTERNAL**: User code will be dispatched to an external service. Use `environment_config` to specify the address for the external service, e.g. `localhost:50000`. + * **LOOPBACK**: User code is executed within the same process that submitted the pipeline. + +* **sdk_worker_parallelism**: sets the number of SDK workers that run on each worker node. The default is 1. If 0, the value is automatically set by the runner by looking at different parameters, such as the number of CPU cores on the worker machine. \ No newline at end of file diff --git a/learning/prompts/documentation-lookup/48_sdk_harness.md b/learning/prompts/documentation-lookup/48_sdk_harness.md index e69de29bb2d1..c563a370f510 100644 --- a/learning/prompts/documentation-lookup/48_sdk_harness.md +++ b/learning/prompts/documentation-lookup/48_sdk_harness.md @@ -0,0 +1,21 @@ +Prompt: +What is SDK Harness in Apache Beam? 
+ +Response: +Interoperability between Apache Beam SDKs and runners is a key aspect of Apache Beam. The [portability framework](https://beam.apache.org/roadmap/portability/) introduces well-defined, language-neutral data structures and protocols between the SDK and runner. This interop layer – called the `Portability API` – ensures that SDKs and runners can work with each other uniformly, reducing the interoperability burden for both SDKs and runners to a constant effort. + +The portability API consists of a set of smaller contracts that isolate SDKs and runners for job submission, management and execution. These contracts use `protobuf`s and `gRPC` for broad language support. + +All SDKs currently support the portability framework. See the [Portability support table](https://docs.google.com/spreadsheets/d/1KDa_FGn1ShjomGd-UUDOhuh2q73de2tPz6BqHpzqvNI/edit#gid=0) for details. + +The SDK harness is a SDK-provided program responsible for executing user code and is run separately from the runner. SDK harness initialization relies on the Provision and `Artifact API`s for obtaining staged files, pipeline options and environment information. + +Apache Beam allows configuration of the SDK harness to accommodate varying cluster setups: + +* **environment_type**: determines where user code will be executed: + * **DOCKER**: User code is executed within a container started on each worker node. This requires docker to be installed on worker nodes (default). Use `environment_config` to specify the Docker image URL. Official Docker images are used by default. Alternatively, you can build your own image by following the instructions [here](https://beam.apache.org/documentation/runtime/environments/). Prebuilt SDK container images are released per supported language during Beam releases and pushed to [Docker Hub](https://hub.docker.com/search?q=apache%2Fbeam&type=image). 
+ * **PROCESS**: User code is executed by processes that are automatically started by the runner on each worker node. + * **EXTERNAL**: User code will be dispatched to an external service. Use `environment_config` to specify the address for the external service, e.g. `localhost:50000`. + * **LOOPBACK**: User code is executed within the same process that submitted the pipeline. + +* **sdk_worker_parallelism**: sets the number of SDK workers that run on each worker node. The default is 1. If 0, the value is automatically set by the runner by looking at different parameters, such as the number of CPU cores on the worker machine. \ No newline at end of file From f9354835528304b84567f7e83d0765ed3d76bcf9 Mon Sep 17 00:00:00 2001 From: Oleh Borysevych Date: Wed, 6 Mar 2024 03:28:19 +0100 Subject: [PATCH 3/4] fix whitespace --- .../documentation-lookup-nolinks/49_transform_service.md | 4 ++-- learning/prompts/documentation-lookup/49_transform_service.md | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/learning/prompts/documentation-lookup-nolinks/49_transform_service.md b/learning/prompts/documentation-lookup-nolinks/49_transform_service.md index 0073a0fc22e7..961c40855730 100644 --- a/learning/prompts/documentation-lookup-nolinks/49_transform_service.md +++ b/learning/prompts/documentation-lookup-nolinks/49_transform_service.md @@ -10,7 +10,7 @@ To use this feature simply execute Java pipeline with additional pipeline option --transformsToOverride=beam:transform:org.apache.beam:bigquery_read:v1 --transformServiceBeamVersion=2.xy.z ``` -In the above example, `--transformsToOverride` specifies the URN of the transform you want to upgrade or downgrade, and `--transformServiceBeamVersion` specifies the Beam version you want to upgrade to. +In the above example, `--transformsToOverride` specifies the URN of the transform you want to upgrade or downgrade, and `--transformServiceBeamVersion` specifies the Beam version you want to upgrade to. 
The framework will automatically download the specified version of Docker containers for the transforms and use them in the pipeline. You must have Docker installed on the machine that starts the service. @@ -24,7 +24,7 @@ Currently the following transforms are supported: Transform service implement Beam expansion API. This means you can use the Transform service to construct and execute multi-language pipelines. For example, you can build a Python pipeline that uses Java `KafkaIO` transform and execute in without installing Java locally. Transform service can be started automatically by Apache Beam SDK or manually by users: - + ```java java -jar beam-sdks-java-transform-service-app-.jar --port --beam_version --project_name --command up ``` diff --git a/learning/prompts/documentation-lookup/49_transform_service.md b/learning/prompts/documentation-lookup/49_transform_service.md index fad2e5159f56..29c541c2ed4b 100644 --- a/learning/prompts/documentation-lookup/49_transform_service.md +++ b/learning/prompts/documentation-lookup/49_transform_service.md @@ -10,7 +10,7 @@ To use this feature simply execute Java pipeline with additional pipeline option --transformsToOverride=beam:transform:org.apache.beam:bigquery_read:v1 --transformServiceBeamVersion=2.xy.z ``` -In the above example, `--transformsToOverride` specifies the URN of the transform you want to upgrade or downgrade, and `--transformServiceBeamVersion` specifies the Beam version you want to upgrade to. +In the above example, `--transformsToOverride` specifies the URN of the transform you want to upgrade or downgrade, and `--transformServiceBeamVersion` specifies the Beam version you want to upgrade to. The framework will automatically download the specified version of Docker containers for the transforms and use them in the pipeline. You must have Docker installed on the machine that starts the service. 
@@ -24,7 +24,7 @@ Currently the following transforms are supported: Transform service implement Beam expansion API. This means you can use the Transform service to construct and execute multi-language pipelines. For example, you can build a Python pipeline that uses Java `KafkaIO` transform and execute in without installing Java locally. Transform service can be started automatically by Apache Beam SDK or manually by users: - + ```java java -jar beam-sdks-java-transform-service-app-.jar --port --beam_version --project_name --command up ``` From 07d2d6c08f39cbd04d5066627b0b4bfa054ebc0d Mon Sep 17 00:00:00 2001 From: Oleh Borysevych Date: Mon, 11 Mar 2024 08:42:40 +0100 Subject: [PATCH 4/4] Apply suggestions from code review Co-authored-by: Daria Bezkorovaina <99457894+dariabezkorovaina@users.noreply.github.com> --- .../48_sdk_harness.md | 24 ++++----- .../49_transform_service.md | 52 +++++-------------- .../documentation-lookup/48_sdk_harness.md | 26 +++++----- .../49_transform_service.md | 52 +++++-------------- 4 files changed, 51 insertions(+), 103 deletions(-) diff --git a/learning/prompts/documentation-lookup-nolinks/48_sdk_harness.md b/learning/prompts/documentation-lookup-nolinks/48_sdk_harness.md index 3bfa5b22c47d..3611ac78fd48 100644 --- a/learning/prompts/documentation-lookup-nolinks/48_sdk_harness.md +++ b/learning/prompts/documentation-lookup-nolinks/48_sdk_harness.md @@ -1,19 +1,17 @@ Prompt: -What is SDK Harness in Apache Beam? +What is SDK harness in Apache Beam? Response: -Interoperability between Apache Beam SDKs and runners is a key aspect of Apache Beam. The portability framework introduces well-defined, language-neutral data structures and protocols between the SDK and runner. This interop layer – called the `Portability API` – ensures that SDKs and runners can work with each other uniformly, reducing the interoperability burden for both SDKs and runners to a constant effort. 
+Interoperability between Apache Beam SDKs and runners is a key aspect of Apache Beam. The portability framework introduces well-defined, language-neutral data structures and protocols between the SDK and runner. This interoperability layer, known as the 'portability API', ensures that SDKs and runners can seamlessly work with each other, reducing the interoperability burden for both SDKs and runners to a constant effort. -The portability API consists of a set of smaller contracts that isolate SDKs and runners for job submission, management and execution. These contracts use `protobuf`s and `gRPC` for broad language support. All SDKs currently support the portability framework. +The portability API consists of a set of smaller contracts that isolate SDKs and runners for job submission, management, and execution. These contracts utilize protocols like `protobuf` and `gRPC` to provide broad language support. Currently, all SDKs support the portability framework. -The SDK harness is a SDK-provided program responsible for executing user code and is run separately from the runner. SDK harness initialization relies on the Provision and `Artifact API`s for obtaining staged files, pipeline options and environment information. +The SDK harness is a program responsible for executing user code. This program is provided by an SDK and runs separately from the runner. SDK harness initialization relies on the provision and artifact APIs for obtaining staged files, pipeline options, and environment information. -Apache Beam allows configuration of the SDK harness to accommodate varying cluster setups: - -* **environment_type**: determines where user code will be executed: - * **DOCKER**: User code is executed within a container started on each worker node. This requires docker to be installed on worker nodes (default). Use `environment_config` to specify the Docker image URL. Official Docker images are used by default. Alternatively, you can build your own image. 
Prebuilt SDK container images are released per supported language during Beam releases and pushed to Docker Hub. - * **PROCESS**: User code is executed by processes that are automatically started by the runner on each worker node. - * **EXTERNAL**: User code will be dispatched to an external service. Use `environment_config` to specify the address for the external service, e.g. `localhost:50000`. - * **LOOPBACK**: User code is executed within the same process that submitted the pipeline. - -* **sdk_worker_parallelism**: sets the number of SDK workers that run on each worker node. The default is 1. If 0, the value is automatically set by the runner by looking at different parameters, such as the number of CPU cores on the worker machine. \ No newline at end of file +Apache Beam offers configuration options for the SDK harness to cater to diverse cluster setups. These options include: +1. **`environment_type`**: determines where user code is executed. The `environment_config` parameter configures the environment based on the value of `environment_type`: + * `DOCKER`: executes user code within a container on each worker node. Docker must be installed on worker nodes. You can specify the Docker image URL using the `environment_config` parameter. Prebuilt SDK container images are available with each Apache Beam release and pushed to Docker Hub. You can also build your custom image. + * `PROCESS`: executes user code through processes that are automatically initiated by the runner on each worker node. + * `EXTERNAL`: dispatches user code to an external service. Use the `environment_config` parameter to specify the service address, for example, `localhost:50000`. + * `LOOPBACK`: executes user code within the same process that submitted the pipeline. +2. **`sdk_worker_parallelism`**: determines the number of SDK workers per worker node. 
The default value is 1, but setting it to 0 enables automatic determination by the runner based on factors like the number of CPU cores on the worker machine. \ No newline at end of file diff --git a/learning/prompts/documentation-lookup-nolinks/49_transform_service.md b/learning/prompts/documentation-lookup-nolinks/49_transform_service.md index 961c40855730..1be6b89b6f76 100644 --- a/learning/prompts/documentation-lookup-nolinks/49_transform_service.md +++ b/learning/prompts/documentation-lookup-nolinks/49_transform_service.md @@ -1,58 +1,34 @@ Prompt: -What is a Transformer Service in Apache Beam? +What is Transform service in Apache Beam? Response: -The `Transform service` is Docker Compose service included into Apache Beam SDK versions 2.49.0 and later. You can use Transform service to upgrade or downgrade Beam SDK version of individual supported transforms in your pipeline without changing the Beam version of the pipeline. This feature is only supported for Beam Java SDK 2.53.0 and later. +The Transform service is a Docker Compose service included in Apache Beam SDK versions 2.49.0 and later. It enables you to upgrade or downgrade the Beam SDK version of individual supported transforms in your pipeline without changing the overall Beam version of the pipeline. Additionally, you can utilize the Transform service to create and execute multi-language pipelines without needing to install support for additional language runtimes. -To use this feature simply execute Java pipeline with additional pipeline options specifying the URNs of the transforms you want to upgrade and the Beam version you want to upgrade to: +You can utilize the Transform service to upgrade specific transforms only if you are using Beam Java SDK 2.53.0 and later. 
To employ this feature, execute a Java pipeline with additional pipeline options specifying the URNs of the transforms you want to upgrade and the desired Apache Beam version: ```java --transformsToOverride=beam:transform:org.apache.beam:bigquery_read:v1 --transformServiceBeamVersion=2.xy.z ``` -In the above example, `--transformsToOverride` specifies the URN of the transform you want to upgrade or downgrade, and `--transformServiceBeamVersion` specifies the Beam version you want to upgrade to. +In the provided example, `--transformsToOverride` specifies the URN of the transform to upgrade or downgrade, while `--transformServiceBeamVersion` specifies the target Beam version. -The framework will automatically download the specified version of Docker containers for the transforms and use them in the pipeline. You must have Docker installed on the machine that starts the service. +The framework automatically downloads the specified version of Docker containers for the transforms and uses them in the pipeline. You must have Docker installed on the machine that starts the service. -Currently the following transforms are supported: +Currently, the Transform service can upgrade the following transforms: +* BigQuery read: `beam:transform:org.apache.beam:bigquery_read:v1` +* BigQuery write: `beam:transform:org.apache.beam:bigquery_write:v1` +* Kafka read: `beam:transform:org.apache.beam:kafka_read_with_metadata:v2` +* Kafka write: `beam:transform:org.apache.beam:kafka_write:v2` -* BigQuery read transform: `beam:transform:org.apache.beam:bigquery_read:v1` -* BigQuery write transform: `beam:transform:org.apache.beam:bigquery_write:v1` -* Kafka read transform: `beam:transform:org.apache.beam:kafka_read_with_metadata:v2` -* Kafka write transform: `beam:transform:org.apache.beam:kafka_write:v2` +The Transform service implements the Beam expansion API, enabling multi-language pipelines to leverage it for expanding supported transforms. 
This feature allows you to create and run multi-language pipelines without additional language runtimes. For instance, you can build a Python pipeline that utilizes a Java `KafkaIO` transform without the need to install Java locally. -Transform service implement Beam expansion API. This means you can use the Transform service to construct and execute multi-language pipelines. For example, you can build a Python pipeline that uses Java `KafkaIO` transform and execute in without installing Java locally. - -Transform service can be started automatically by Apache Beam SDK or manually by users: +In some cases, Apache Beam SDKs can start the Transform service automatically, provided that Docker is available locally. You can also start the Transform service manually by running the following command: ```java java -jar beam-sdks-java-transform-service-app-.jar --port --beam_version --project_name --command up ``` -Beam transform service includes a number of transforms implemented in the Apache Beam Java and Python SDKs: - -* Java transforms: Google Cloud I/O connectors, the Kafka I/O connector, and the JDBC I/O connector +The Beam Transform service includes several portable transforms implemented in the Apache Beam Java and Python SDKs: +* Java transforms: Google Cloud I/O connectors, the Kafka I/O connector, and the JDBC I/O connector. * Python transforms: all portable transforms implemented within the Apache Beam Python SDK, such as RunInference and DataFrame transforms. - - - - - - - - - - - - - - - - - - - - - - diff --git a/learning/prompts/documentation-lookup/48_sdk_harness.md b/learning/prompts/documentation-lookup/48_sdk_harness.md index c563a370f510..084c55182002 100644 --- a/learning/prompts/documentation-lookup/48_sdk_harness.md +++ b/learning/prompts/documentation-lookup/48_sdk_harness.md @@ -1,21 +1,19 @@ Prompt: -What is SDK Harness in Apache Beam? +What is SDK harness in Apache Beam? 
Response: -Interoperability between Apache Beam SDKs and runners is a key aspect of Apache Beam. The [portability framework](https://beam.apache.org/roadmap/portability/) introduces well-defined, language-neutral data structures and protocols between the SDK and runner. This interop layer – called the `Portability API` – ensures that SDKs and runners can work with each other uniformly, reducing the interoperability burden for both SDKs and runners to a constant effort. +Interoperability between Apache Beam SDKs and runners is a key aspect of Apache Beam. The [portability framework](https://beam.apache.org/roadmap/portability/) introduces well-defined, language-neutral data structures and protocols between the SDK and runner. This interoperability layer, known as the 'portability API', ensures that SDKs and runners can seamlessly work with each other, reducing the interoperability burden for both SDKs and runners to a constant effort. -The portability API consists of a set of smaller contracts that isolate SDKs and runners for job submission, management and execution. These contracts use `protobuf`s and `gRPC` for broad language support. +The portability API consists of a set of smaller contracts that isolate SDKs and runners for job submission, management, and execution. These contracts utilize protocols like `protobuf` and `gRPC` to provide broad language support. -All SDKs currently support the portability framework. See the [Portability support table](https://docs.google.com/spreadsheets/d/1KDa_FGn1ShjomGd-UUDOhuh2q73de2tPz6BqHpzqvNI/edit#gid=0) for details. +Currently, all SDKs support the portability framework. For the latest information on portability support across SDKs, features, and runners, refer to the [Apache Beam Portability Support Matrix](https://docs.google.com/spreadsheets/d/1KDa_FGn1ShjomGd-UUDOhuh2q73de2tPz6BqHpzqvNI/edit#gid=0). -The SDK harness is a SDK-provided program responsible for executing user code and is run separately from the runner. 
-SDK harness initialization relies on the Provision and `Artifact API`s for obtaining staged files, pipeline options and environment information.
+The SDK harness is a program responsible for executing user code. This program is provided by an SDK and runs separately from the runner. SDK harness initialization relies on the provision and artifact APIs for obtaining staged files, pipeline options, and environment information.
-Apache Beam allows configuration of the SDK harness to accommodate varying cluster setups:
-
-* **environment_type**: determines where user code will be executed:
-    * **DOCKER**: User code is executed within a container started on each worker node. This requires docker to be installed on worker nodes (default). Use `environment_config` to specify the Docker image URL. Official Docker images are used by default. Alternatively, you can build your own image by following the instructions [here](https://beam.apache.org/documentation/runtime/environments/). Prebuilt SDK container images are released per supported language during Beam releases and pushed to [Docker Hub](https://hub.docker.com/search?q=apache%2Fbeam&type=image).
-    * **PROCESS**: User code is executed by processes that are automatically started by the runner on each worker node.
-    * **EXTERNAL**: User code will be dispatched to an external service. Use `environment_config` to specify the address for the external service, e.g. `localhost:50000`.
-    * **LOOPBACK**: User code is executed within the same process that submitted the pipeline.
-
-* **sdk_worker_parallelism**: sets the number of SDK workers that run on each worker node. The default is 1. If 0, the value is automatically set by the runner by looking at different parameters, such as the number of CPU cores on the worker machine.
\ No newline at end of file
+Apache Beam offers configuration options for the SDK harness to cater to diverse cluster setups. These options include:
+1. **`environment_type`**: determines where user code is executed. The `environment_config` parameter configures the environment based on the value of `environment_type`:
+    * `DOCKER`: executes user code within a container on each worker node. Docker must be installed on worker nodes. You can specify the Docker image URL using the `environment_config` parameter. Prebuilt SDK container images are available with each Apache Beam release and pushed to [Docker Hub](https://hub.docker.com/search?q=apache%2Fbeam&type=image). You can also [build your custom image](https://beam.apache.org/documentation/runtime/environments/).
+    * `PROCESS`: executes user code through processes that are automatically initiated by the runner on each worker node.
+    * `EXTERNAL`: dispatches user code to an external service. Use the `environment_config` parameter to specify the service address, for example, `localhost:50000`.
+    * `LOOPBACK`: executes user code within the same process that submitted the pipeline.
+2. **`sdk_worker_parallelism`**: determines the number of SDK workers per worker node. The default value is 1, but setting it to 0 enables automatic determination by the runner based on factors like the number of CPU cores on the worker machine.
\ No newline at end of file
diff --git a/learning/prompts/documentation-lookup/49_transform_service.md b/learning/prompts/documentation-lookup/49_transform_service.md
index 29c541c2ed4b..6f803b9b6578 100644
--- a/learning/prompts/documentation-lookup/49_transform_service.md
+++ b/learning/prompts/documentation-lookup/49_transform_service.md
@@ -1,58 +1,34 @@
 Prompt:
-What is a Transformer Service in Apache Beam?
+What is Transform service in Apache Beam?
 
 Response:
-The `Transform service` is [Docker Compose service](https://docs.docker.com/compose/) included into Apache Beam SDK versions 2.49.0 and later. You can use Transform service to upgrade or downgrade Beam SDK version of individual supported transforms in your pipeline without changing the Beam version of the pipeline. This feature is only supported for Beam Java SDK 2.53.0 and later.
+The Transform service is a [Docker Compose](https://docs.docker.com/compose/) service included in Apache Beam SDK versions 2.49.0 and later. It enables you to upgrade or downgrade the Beam SDK version of individual supported transforms in your pipeline without changing the overall Beam version of the pipeline. Additionally, you can utilize the Transform service to create and execute multi-language pipelines without needing to install support for additional language runtimes.
 
-To use this feature simply execute Java pipeline with additional pipeline options specifying the URNs of the transforms you want to upgrade and the Beam version you want to upgrade to:
+You can utilize the Transform service to upgrade specific transforms only if you are using Beam Java SDK 2.53.0 and later. To employ this feature, execute a Java pipeline with additional pipeline options specifying the URNs of the transforms you want to upgrade and the desired Apache Beam version:
 
 ```java
 --transformsToOverride=beam:transform:org.apache.beam:bigquery_read:v1 --transformServiceBeamVersion=2.xy.z
 ```
 
-In the above example, `--transformsToOverride` specifies the URN of the transform you want to upgrade or downgrade, and `--transformServiceBeamVersion` specifies the Beam version you want to upgrade to.
+In the provided example, `--transformsToOverride` specifies the URN of the transform to upgrade or downgrade, while `--transformServiceBeamVersion` specifies the target Beam version.
 
-The framework will automatically download the specified version of Docker containers for the transforms and use them in the pipeline. You must have Docker installed on the machine that starts the service.
+The framework automatically downloads the specified version of Docker containers for the transforms and uses them in the pipeline. You must have Docker installed on the machine that starts the service.
 
-Currently the following transforms are supported:
+Currently, the Transform service can upgrade the following transforms:
+* BigQuery read: `beam:transform:org.apache.beam:bigquery_read:v1`
+* BigQuery write: `beam:transform:org.apache.beam:bigquery_write:v1`
+* Kafka read: `beam:transform:org.apache.beam:kafka_read_with_metadata:v2`
+* Kafka write: `beam:transform:org.apache.beam:kafka_write:v2`
 
-* BigQuery read transform: `beam:transform:org.apache.beam:bigquery_read:v1`
-* BigQuery write transform: `beam:transform:org.apache.beam:bigquery_write:v1`
-* Kafka read transform: `beam:transform:org.apache.beam:kafka_read_with_metadata:v2`
-* Kafka write transform: `beam:transform:org.apache.beam:kafka_write:v2`
+The Transform service implements the Beam expansion API, enabling multi-language pipelines to leverage it for expanding supported transforms. This feature allows you to create and run multi-language pipelines without additional language runtimes. For instance, you can build a Python pipeline that utilizes a Java `KafkaIO` transform without the need to install Java locally.
 
-Transform service implement Beam expansion API. This means you can use the Transform service to construct and execute multi-language pipelines. For example, you can build a Python pipeline that uses Java `KafkaIO` transform and execute in without installing Java locally.
-
-Transform service can be started automatically by Apache Beam SDK or manually by users:
+In some cases, Apache Beam SDKs can start the Transform service automatically, provided that Docker is available locally. You can also start the Transform service manually by running the following command:
 
 ```java
 java -jar beam-sdks-java-transform-service-app-<version>.jar --port <port> --beam_version <Beam version> --project_name <project name> --command up
 ```
 
-Beam transform service includes a number of transforms implemented in the Apache Beam Java and Python SDKs:
-
-* Java transforms: Google Cloud I/O connectors, the Kafka I/O connector, and the JDBC I/O connector
+The Beam Transform service includes several portable transforms implemented in the Apache Beam Java and Python SDKs:
+* Java transforms: Google Cloud I/O connectors, the Kafka I/O connector, and the JDBC I/O connector.
 * Python transforms: all portable transforms implemented within the Apache Beam Python SDK, such as RunInference and DataFrame transforms.
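For readers who script their pipeline launches, the upgrade flags introduced in this patch can be assembled programmatically. The sketch below is illustrative only: the flag names and transform URNs are taken from the documentation text above, while the helper function and its name are hypothetical, not part of the Beam API.

```python
# A hypothetical helper for assembling the transform-upgrade pipeline options.
# The flag names and URNs come from the documentation; the helper is illustrative.

# URNs of the transforms the Transform service can currently upgrade.
SUPPORTED_TRANSFORM_URNS = {
    "bigquery_read": "beam:transform:org.apache.beam:bigquery_read:v1",
    "bigquery_write": "beam:transform:org.apache.beam:bigquery_write:v1",
    "kafka_read": "beam:transform:org.apache.beam:kafka_read_with_metadata:v2",
    "kafka_write": "beam:transform:org.apache.beam:kafka_write:v2",
}


def upgrade_options(transform_name: str, beam_version: str) -> list[str]:
    """Build the extra pipeline options for upgrading one supported transform."""
    if transform_name not in SUPPORTED_TRANSFORM_URNS:
        raise ValueError(f"Transform service cannot upgrade: {transform_name!r}")
    return [
        f"--transformsToOverride={SUPPORTED_TRANSFORM_URNS[transform_name]}",
        f"--transformServiceBeamVersion={beam_version}",
    ]


# These options would be appended to the arguments of a Java pipeline launch.
print(upgrade_options("bigquery_read", "2.53.0"))
```

Keeping the supported URNs in one mapping makes it easy to fail fast when a pipeline requests an upgrade for a transform the service does not support.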
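The manual start command can likewise be scripted. The sketch below only composes the argv list for the launcher invocation shown above; the jar name and flag names are taken from the documentation, while the version slot in the jar name and all concrete values are assumptions for illustration.

```python
# A hypothetical wrapper around the manual Transform service launcher command.
# The jar name and flags mirror the documentation; concrete values are illustrative.

def transform_service_command(jar_version: str, port: int, beam_version: str,
                              project_name: str, command: str = "up") -> list[str]:
    """Return the argv list for starting (or stopping) the Transform service."""
    return [
        "java", "-jar",
        # Assumption: the launcher jar is versioned like other Beam artifacts.
        f"beam-sdks-java-transform-service-app-{jar_version}.jar",
        "--port", str(port),
        "--beam_version", beam_version,
        "--project_name", project_name,
        "--command", command,
    ]


# Example invocation with illustrative values; pass the list to subprocess.run().
argv = transform_service_command("2.53.0", 12345, "2.53.0", "my_transform_service")
print(" ".join(argv))
```

Passing `command="down"` instead of the default would compose the corresponding stop invocation under the same assumptions.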