Duet AI Transform service SDK Harness prompts (apache#30499)
* transform service prompt added

* SDK harness added

* fix whitespace

* Apply suggestions from code review

Co-authored-by: Daria Bezkorovaina <[email protected]>

2 people authored and hjtran committed Apr 4, 2024
1 parent 143566c commit e5c7b1d
Showing 4 changed files with 104 additions and 0 deletions.
17 changes: 17 additions & 0 deletions learning/prompts/documentation-lookup-nolinks/48_sdk_harness.md
@@ -0,0 +1,17 @@
Prompt:
What is the SDK harness in Apache Beam?

Response:
Interoperability between Apache Beam SDKs and runners is a key aspect of Apache Beam. The portability framework introduces well-defined, language-neutral data structures and protocols between the SDK and runner. This interoperability layer, known as the 'portability API', ensures that SDKs and runners can seamlessly work with each other, reducing the interoperability burden for both SDKs and runners to a constant effort.

The portability API consists of a set of smaller contracts that isolate SDKs and runners for job submission, management, and execution. These contracts use protocols such as `protobuf` and `gRPC` to provide broad language support. Currently, all SDKs support the portability framework.

The SDK harness is a program responsible for executing user code. This program is provided by an SDK and runs separately from the runner. SDK harness initialization relies on the provision and artifact APIs for obtaining staged files, pipeline options, and environment information.

Apache Beam offers configuration options for the SDK harness to cater to diverse cluster setups. These options include:
1. **`environment_type`**: determines where user code is executed. The `environment_config` parameter configures the environment based on the value of `environment_type`:
    * `DOCKER`: executes user code within a container on each worker node. Docker must be installed on worker nodes. You can specify the Docker image URL using the `environment_config` parameter. Prebuilt SDK container images are available with each Apache Beam release and pushed to Docker Hub. You can also build your custom image.
    * `PROCESS`: executes user code through processes that are automatically initiated by the runner on each worker node.
    * `EXTERNAL`: dispatches user code to an external service. Use the `environment_config` parameter to specify the service address, for example, `localhost:50000`.
    * `LOOPBACK`: executes user code within the same process that submitted the pipeline.
2. **`sdk_worker_parallelism`**: determines the number of SDK workers per worker node. The default value is 1, but setting it to 0 enables automatic determination by the runner based on factors like the number of CPU cores on the worker machine.
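
The options above are ordinary pipeline options, so they can be assembled as command-line style flags. The following is a minimal sketch, not part of Apache Beam: the helper `harness_flags` is hypothetical, and only the option names and the four environment types come from the text above.

```python
from typing import List, Optional

# The four environment types named in the list above.
ENVIRONMENT_TYPES = {"DOCKER", "PROCESS", "EXTERNAL", "LOOPBACK"}


def harness_flags(
    environment_type: str,
    environment_config: Optional[str] = None,
    sdk_worker_parallelism: int = 1,
) -> List[str]:
    """Build flags for the SDK harness options (hypothetical helper)."""
    if environment_type not in ENVIRONMENT_TYPES:
        raise ValueError(f"unknown environment_type: {environment_type!r}")
    if sdk_worker_parallelism < 0:
        raise ValueError("sdk_worker_parallelism must be 0 (automatic) or positive")

    flags = [f"--environment_type={environment_type}"]
    # environment_config is only emitted when the caller supplies one.
    if environment_config is not None:
        flags.append(f"--environment_config={environment_config}")
    flags.append(f"--sdk_worker_parallelism={sdk_worker_parallelism}")
    return flags


# EXTERNAL with the example service address from the text.
external_flags = harness_flags("EXTERNAL", environment_config="localhost:50000")
print(external_flags)
```

In the Python SDK, a list of flags like this could be passed to `PipelineOptions`; which combinations a given runner accepts depends on the runner.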
@@ -0,0 +1,34 @@
Prompt:
What is the Transform service in Apache Beam?

Response:
The Transform service is a Docker Compose service included in Apache Beam SDK versions 2.49.0 and later. It enables you to upgrade or downgrade the Beam SDK version of individual supported transforms in your pipeline without changing the overall Beam version of the pipeline. Additionally, you can use the Transform service to create and run multi-language pipelines without installing support for additional language runtimes.

You can use the Transform service to upgrade specific transforms only if you are using Beam Java SDK 2.53.0 or later. To use this feature, run a Java pipeline with additional pipeline options that specify the URNs of the transforms you want to upgrade and the target Apache Beam version:

```java
--transformsToOverride=beam:transform:org.apache.beam:bigquery_read:v1 --transformServiceBeamVersion=2.xy.z
```

In the provided example, `--transformsToOverride` specifies the URN of the transform to upgrade or downgrade, while `--transformServiceBeamVersion` specifies the target Beam version.

The framework automatically downloads the specified version of Docker containers for the transforms and uses them in the pipeline. You must have Docker installed on the machine that starts the service.

Currently, the Transform service can upgrade the following transforms:
* BigQuery read: `beam:transform:org.apache.beam:bigquery_read:v1`
* BigQuery write: `beam:transform:org.apache.beam:bigquery_write:v1`
* Kafka read: `beam:transform:org.apache.beam:kafka_read_with_metadata:v2`
* Kafka write: `beam:transform:org.apache.beam:kafka_write:v2`
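
The two override options can be generated from the URNs listed above. This is a minimal sketch under stated assumptions: the short labels and the helper `override_options` are made up for illustration, and `2.xy.z` is the same version placeholder used in the text, not a real release.

```python
from typing import List

# URNs copied from the list above; the dictionary keys are made-up labels.
UPGRADABLE_TRANSFORMS = {
    "bigquery_read": "beam:transform:org.apache.beam:bigquery_read:v1",
    "bigquery_write": "beam:transform:org.apache.beam:bigquery_write:v1",
    "kafka_read": "beam:transform:org.apache.beam:kafka_read_with_metadata:v2",
    "kafka_write": "beam:transform:org.apache.beam:kafka_write:v2",
}


def override_options(label: str, beam_version: str) -> List[str]:
    """Return the two pipeline options for one supported transform (hypothetical helper)."""
    if label not in UPGRADABLE_TRANSFORMS:
        raise KeyError(f"no upgradable transform registered under {label!r}")
    return [
        f"--transformsToOverride={UPGRADABLE_TRANSFORMS[label]}",
        f"--transformServiceBeamVersion={beam_version}",
    ]


# "2.xy.z" is a placeholder; substitute the Beam version you want to target.
print(" ".join(override_options("kafka_write", "2.xy.z")))
```

Rejecting unknown labels up front keeps a typo from silently producing an option that the pipeline would ignore or fail on later.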

The Transform service implements the Beam expansion API, so multi-language pipelines can use it to expand supported transforms. This lets you create and run multi-language pipelines without installing additional language runtimes. For instance, you can build a Python pipeline that uses a Java `KafkaIO` transform without installing Java locally.

In some cases, Apache Beam SDKs can start the Transform service automatically, provided that Docker is available locally. You can also start the Transform service manually by running the following command:

```sh
java -jar beam-sdks-java-transform-service-app-<Beam version for the jar>.jar --port <port> --beam_version <Beam version for the transform service> --project_name <a unique ID for the transform service> --command up
```
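
If you start the service from a script, the command above can be built as an argument list. The sketch below is illustrative only: the helper name, the port `12345`, and the project name `my-transform-service` are assumptions, and `2.xy.z` stands in for a real Beam version. Only the jar name pattern and the four flags come from the command shown above.

```python
import shlex
from typing import List


def transform_service_command(
    jar_version: str, port: int, beam_version: str, project_name: str
) -> List[str]:
    """Assemble the manual start command as an argv list (hypothetical helper)."""
    return [
        "java",
        "-jar",
        f"beam-sdks-java-transform-service-app-{jar_version}.jar",
        "--port",
        str(port),
        "--beam_version",
        beam_version,
        "--project_name",
        project_name,
        "--command",
        "up",
    ]


# All argument values here are placeholders chosen for the example.
argv = transform_service_command("2.xy.z", 12345, "2.xy.z", "my-transform-service")
print(shlex.join(argv))
```

Building a list rather than a single string means the arguments could be handed to `subprocess.run(argv, check=True)` without shell quoting concerns, assuming Java, Docker, and the jar are available on the machine.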

The Beam Transform service includes several portable transforms implemented in the Apache Beam Java and Python SDKs:
* Java transforms: Google Cloud I/O connectors, the Kafka I/O connector, and the JDBC I/O connector.
* Python transforms: all portable transforms implemented within the Apache Beam Python SDK, such as RunInference and DataFrame transforms.

19 changes: 19 additions & 0 deletions learning/prompts/documentation-lookup/48_sdk_harness.md
@@ -0,0 +1,19 @@
Prompt:
What is the SDK harness in Apache Beam?

Response:
Interoperability between Apache Beam SDKs and runners is a key aspect of Apache Beam. The [portability framework](https://beam.apache.org/roadmap/portability/) introduces well-defined, language-neutral data structures and protocols between the SDK and runner. This interoperability layer, known as the 'portability API', ensures that SDKs and runners can seamlessly work with each other, reducing the interoperability burden for both SDKs and runners to a constant effort.

The portability API consists of a set of smaller contracts that isolate SDKs and runners for job submission, management, and execution. These contracts use protocols such as `protobuf` and `gRPC` to provide broad language support.

Currently, all SDKs support the portability framework. For the latest information on portability support across SDKs, features, and runners, refer to the [Apache Beam Portability Support Matrix](https://docs.google.com/spreadsheets/d/1KDa_FGn1ShjomGd-UUDOhuh2q73de2tPz6BqHpzqvNI/edit#gid=0).

The SDK harness is a program responsible for executing user code. This program is provided by an SDK and runs separately from the runner. SDK harness initialization relies on the provision and artifact APIs for obtaining staged files, pipeline options, and environment information.

Apache Beam offers configuration options for the SDK harness to cater to diverse cluster setups. These options include:
1. **`environment_type`**: determines where user code is executed. The `environment_config` parameter configures the environment based on the value of `environment_type`:
    * `DOCKER`: executes user code within a container on each worker node. Docker must be installed on worker nodes. You can specify the Docker image URL using the `environment_config` parameter. Prebuilt SDK container images are available with each Apache Beam release and pushed to [Docker Hub](https://hub.docker.com/search?q=apache%2Fbeam&type=image). You can also [build your custom image](https://beam.apache.org/documentation/runtime/environments/).
    * `PROCESS`: executes user code through processes that are automatically initiated by the runner on each worker node.
    * `EXTERNAL`: dispatches user code to an external service. Use the `environment_config` parameter to specify the service address, for example, `localhost:50000`.
    * `LOOPBACK`: executes user code within the same process that submitted the pipeline.
2. **`sdk_worker_parallelism`**: determines the number of SDK workers per worker node. The default value is 1, but setting it to 0 enables automatic determination by the runner based on factors like the number of CPU cores on the worker machine.
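
To make the meaning of `sdk_worker_parallelism` concrete, here is a hedged sketch of how the value might be interpreted. It is not the logic of any actual runner: the function `resolve_sdk_workers` is hypothetical, and "one worker per CPU core" is an assumption standing in for whatever factors a real runner considers when the value is 0.

```python
import os
from typing import Optional


def resolve_sdk_workers(
    sdk_worker_parallelism: int = 1, cpu_count: Optional[int] = None
) -> int:
    """Illustrate the option's semantics; real runners may decide differently."""
    if sdk_worker_parallelism < 0:
        raise ValueError("sdk_worker_parallelism must be 0 (automatic) or positive")
    if sdk_worker_parallelism > 0:
        # An explicit positive value is used as-is.
        return sdk_worker_parallelism
    # 0 means "let the runner decide". Assumption: one SDK worker per CPU core.
    cores = cpu_count if cpu_count is not None else os.cpu_count()
    return max(1, cores or 1)


print(resolve_sdk_workers())                # default of 1 worker per node
print(resolve_sdk_workers(0, cpu_count=8))  # automatic, on a node assumed to have 8 cores
```

The `cpu_count` parameter exists only so the example can be exercised without depending on the machine it runs on.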
34 changes: 34 additions & 0 deletions learning/prompts/documentation-lookup/49_transform_service.md
@@ -0,0 +1,34 @@
Prompt:
What is the Transform service in Apache Beam?

Response:
The Transform service is a [Docker Compose](https://docs.docker.com/compose/) service included in Apache Beam SDK versions 2.49.0 and later. It enables you to upgrade or downgrade the Beam SDK version of individual supported transforms in your pipeline without changing the overall Beam version of the pipeline. Additionally, you can use the Transform service to create and run multi-language pipelines without installing support for additional language runtimes.

You can use the Transform service to upgrade specific transforms only if you are using Beam Java SDK 2.53.0 or later. To use this feature, run a Java pipeline with additional pipeline options that specify the URNs of the transforms you want to upgrade and the target Apache Beam version:

```java
--transformsToOverride=beam:transform:org.apache.beam:bigquery_read:v1 --transformServiceBeamVersion=2.xy.z
```

In the provided example, `--transformsToOverride` specifies the URN of the transform to upgrade or downgrade, while `--transformServiceBeamVersion` specifies the target Beam version.

The framework automatically downloads the specified version of Docker containers for the transforms and uses them in the pipeline. You must have Docker installed on the machine that starts the service.

Currently, the Transform service can upgrade the following transforms:
* BigQuery read: `beam:transform:org.apache.beam:bigquery_read:v1`
* BigQuery write: `beam:transform:org.apache.beam:bigquery_write:v1`
* Kafka read: `beam:transform:org.apache.beam:kafka_read_with_metadata:v2`
* Kafka write: `beam:transform:org.apache.beam:kafka_write:v2`

The Transform service implements the Beam expansion API, so multi-language pipelines can use it to expand supported transforms. This lets you create and run multi-language pipelines without installing additional language runtimes. For instance, you can build a Python pipeline that uses a Java `KafkaIO` transform without installing Java locally.

In some cases, Apache Beam SDKs can start the Transform service automatically, provided that Docker is available locally. You can also start the Transform service manually by running the following command:

```sh
java -jar beam-sdks-java-transform-service-app-<Beam version for the jar>.jar --port <port> --beam_version <Beam version for the transform service> --project_name <a unique ID for the transform service> --command up
```

The Beam Transform service includes several portable transforms implemented in the Apache Beam Java and Python SDKs:
* Java transforms: Google Cloud I/O connectors, the Kafka I/O connector, and the JDBC I/O connector.
* Python transforms: all portable transforms implemented within the Apache Beam Python SDK, such as RunInference and DataFrame transforms.
