From 6707c2c991ef91562a3775d21ea9e823bcaa0326 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Ferran=20Fern=C3=A1ndez=20Garrido?= Date: Thu, 10 Oct 2024 16:54:04 +0200 Subject: [PATCH 01/17] [blog] - Beam YAML protobuf blogpost --- .../site/content/en/blog/beam-yaml-proto.md | 274 ++++++++++++++++++ 1 file changed, 274 insertions(+) create mode 100644 website/www/site/content/en/blog/beam-yaml-proto.md diff --git a/website/www/site/content/en/blog/beam-yaml-proto.md b/website/www/site/content/en/blog/beam-yaml-proto.md new file mode 100644 index 000000000000..1d0590a99304 --- /dev/null +++ b/website/www/site/content/en/blog/beam-yaml-proto.md @@ -0,0 +1,274 @@ +--- +title: "Efficient Streaming Data Processing with Beam YAML and Protobuf" +date: "2024-09-20T11:53:38+02:00" +categories: + - blog +authors: + - ffernandez92 +--- + + +As streaming data processing grows, so do its maintenance, complexity, and costs. +This post will explain how to efficiently scale pipelines using [Protobuf](https://protobuf.dev/), +ensuring they are reusable and quick to deploy. Our goal is to keep this process simple +for engineers to implement using [Beam YAML](https://beam.apache.org/documentation/sdks/yaml/). + + + +# Simplifying Pipelines with Beam YAML + +Creating a pipeline in Beam can be somewhat difficult, especially for newcomers with little experience with Beam. +Setting up the project, managing dependencies, and so on can be challenging. +Beam YAML helps eliminate most of the boilerplate code, +allowing you to focus solely on the most important part: data transformation. + +Some of the main key benefits include: + +* **Readability:** By using a declarative language ([YAML](https://yaml.org/)), we improve the human readability + aspect of the pipeline configuration. +* **Reusability:** It is much simpler to reuse the same components across different pipelines. +* **Maintainability:** It simplifies pipeline maintenance and updates. + +The following template shows an example of reading events from a [Kafka](https://kafka.apache.org/intro) topic and +writing them into [BigQuery](https://cloud.google.com/bigquery?hl=en). + +```yaml +pipeline: + transforms: + - type: ReadFromKafka + name: ReadProtoMovieEvents + config: + topic: 'TOPIC_NAME' + format: RAW/AVRO/JSON/PROTO + bootstrap_servers: 'BOOTSTRAP_SERVERS' + schema: 'SCHEMA' + - type: WriteToBigQuery + name: WriteMovieEvents + input: ReadProtoMovieEvents + config: + table: 'PROJECT_ID.DATASET.MOVIE_EVENTS_TABLE' + useAtLeastOnceSemantics: true + +options: + streaming: true + dataflow_service_options: [streaming_mode_at_least_once] +``` + +# Bringing It All Together + +### Let's create a simple proto event: + +```protobuf +// events/v1/movie_event.proto + +syntax = "proto3"; + +package event.v1; + +import "bq_field.proto"; +import "bq_table.proto"; +import "buf/validate/validate.proto"; +import "google/protobuf/wrappers.proto"; + +message MovieEvent { + option (gen_bq_schema.bigquery_opts).table_name = "movie_table"; + google.protobuf.StringValue event_id = 1 [(gen_bq_schema.bigquery).description = "Unique Event ID"]; + google.protobuf.StringValue user_id = 2 [(gen_bq_schema.bigquery).description = "Unique User ID"]; + google.protobuf.StringValue movie_id = 3 [(gen_bq_schema.bigquery).description = "Unique Movie ID"]; + google.protobuf.Int32Value rating = 4 [(buf.validate.field).int32 = { + // validates the average rating is at least 0 + gte: 0, + // validates the average rating is at most 100 + lte: 100 + }, (gen_bq_schema.bigquery).description = "Movie rating"]; + string event_dt = 5 [ + (gen_bq_schema.bigquery).type_override = "DATETIME", + (gen_bq_schema.bigquery).description = "UTC Datetime representing when we received this event. Format: YYYY-MM-DDTHH:MM:SS", + (buf.validate.field) = { + string: { + pattern: "^\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}$" + }, + ignore_empty: false, + } + ]; +} +``` + +As you can see here, there are important points to consider. Since we are planning to write these events to BigQuery, +we have imported the *[bq_field](https://buf.build/googlecloudplatform/bq-schema-api/file/main:bq_field.proto)* +and *[bq_table](https://buf.build/googlecloudplatform/bq-schema-api/file/main:bq_table.proto)* proto. +These proto files help generate the BigQuery JSON schema. +In our example, we are also advocating for a shift-left approach, which means we want to move testing, quality, +and performance as early as possible in the development process. This is why we have included the *buf.validate* +elements to ensure that only valid events are generated from the source. + +Once we have our *movie_event.proto* in the *events/v1* folder, we can generate +the necessary [file descriptor](https://buf.build/docs/reference/descriptors). +Essentially, a file descriptor is a compiled representation of the schema that allows various tools and systems +to understand and work with Protobuf data dynamically. To simplify the process, we are using Buf in this example, +so we will need the following configuration files. + + +Buf configuration: + +```yaml +# buf.yaml + +version: v2 +deps: + - buf.build/googlecloudplatform/bq-schema-api + - buf.build/bufbuild/protovalidate +breaking: + use: + - FILE +lint: + use: + - DEFAULT +``` + +```yaml +# buf.gen.yaml + +version: v2 +managed: + enabled: true +plugins: + # Python Plugins + - remote: buf.build/protocolbuffers/python + out: gen/python + - remote: buf.build/grpc/python + out: gen/python + + # Java Plugins + - remote: buf.build/protocolbuffers/java:v25.2 + out: gen/maven/src/main/java + - remote: buf.build/grpc/java + out: gen/maven/src/main/java + + # BQ Schemas + - remote: buf.build/googlecloudplatform/bq-schema:v1.1.0 + out: protoc-gen/bq_schema + +``` + +Running the following two commands we will generate the necessary Java, Python, BigQuery schema, and Descriptor File: + +```bash +// Generate the buf.lock file +buf deps update + +// It will generate the descriptor in descriptor.binp. +buf build . -o descriptor.binp --exclude-imports + +// It will generate the Java, Python and BigQuery schema as described in buf.gen.yaml +buf generate --include-imports +``` + +# Let’s make our Beam YAML read proto: + +These are the modifications we need to make to the YAML file: + +```yaml +# movie_events_pipeline.yml + +pipeline: + transforms: + - type: ReadFromKafka + name: ReadProtoMovieEvents + config: + topic: 'movie_proto' + format: PROTO + bootstrap_servers: '' + file_descriptor_path: 'gs://my_proto_bucket/movie/v1.0.0/descriptor.binp' + message_name: 'event.v1.MovieEvent' + - type: WriteToBigQuery + name: WriteMovieEvents + input: ReadProtoMovieEvents + config: + table: '.raw.movie_table' + useAtLeastOnceSemantics: true +options: + streaming: true + dataflow_service_options: [streaming_mode_at_least_once] +``` + +As you can see, we just changed the format to be *PROTO* and added the *file_descriptor_path* and the *message_name*. + +### Let’s use Terraform to deploy it + +We can consider using [Terraform](https://www.terraform.io/) to deploy our Beam YAML pipeline +with [Dataflow](https://cloud.google.com/products/dataflow?hl=en) as the runner. +The following Terraform code example demonstrates how to achieve this: + +```hcl +// Enable Dataflow API. +resource "google_project_service" "enable_dataflow_api" { + project = var.gcp_project_id + service = "dataflow.googleapis.com" +} + +// DF Beam YAML +resource "google_dataflow_flex_template_job" "data_movie_job" { + provider = google-beta + project = var.gcp_project_id + name = "movie-proto-events" + container_spec_gcs_path = "gs://dataflow-templates-${var.gcp_region}/latest/flex/Yaml_Template" + region = var.gcp_region + on_delete = "drain" + machine_type = "n2d-standard-4" + enable_streaming_engine = true + subnetwork = var.subnetwork + skip_wait_on_job_termination = true + parameters = { + yaml_pipeline_file = "gs://${var.bucket_name}/yamls/${var.package_version}/movie_events_pipeline.yml" + max_num_workers = 40 + worker_zone = var.gcp_zone + } + depends_on = [google_project_service.enable_dataflow_api] +} +``` + +Assuming we have created the BigQuery table, which can also be done using Terraform and Proto as described earlier, +the previous code should create a Dataflow job using our Beam YAML code that reads Proto events from +Kafka and writes them into BigQuery. + +# Improvements and Conclusions + +Some potential improvements that can be done as part of community contributions to the previous Beam YAML code are: + +* **Supporting Schema Registries:** Integrate with schema registries such as Buf Registry or Apicurio for +better schema management. In the current solution, we generate the descriptors via Buf and store them in GCS. +We could store them in a schema registry instead. + + +* **Enhanced Monitoring:** Implement advanced monitoring and alerting mechanisms to quickly identify and address +issues in the data pipeline. + +As a conclusion, by leveraging Beam YAML and Protobuf, we have streamlined the creation and maintenance of +data processing pipelines, significantly reducing complexity. This approach ensures that engineers can more +efficiently implement and scale robust, reusable pipelines, compared to writing the equivalent Beam code manually. + +## Contributing + +Developers who wish to help build out and add functionalities are welcome to start contributing to the effort in the +Beam YAML module found [here](https://github.com/apache/beam/tree/master/sdks/python/apache_beam/yaml). + +There is also a list of open [bugs](https://github.com/apache/beam/issues?q=is%3Aopen+is%3Aissue+label%3Ayaml) found +on the GitHub repo - now marked with the 'yaml' tag. + +While Beam YAML has been marked stable as of Beam 2.52, it is still under heavy development, with new features being +added with each release. Those who wish to be part of the design decisions and give insights to how the framework is +being used are highly encouraged to join the dev mailing list as those discussions will be directed there. A link to +the dev list can be found [here](https://beam.apache.org/community/contact-us/). From f717393db57271b159259f64c118b7e0fc4baed8 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Ferran=20Fern=C3=A1ndez=20Garrido?= Date: Thu, 10 Oct 2024 17:17:56 +0200 Subject: [PATCH 02/17] [blog] - update authors --- website/www/site/data/authors.yml | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/website/www/site/data/authors.yml b/website/www/site/data/authors.yml index 6f16ae12f01a..324e675bfc8a 100644 --- a/website/www/site/data/authors.yml +++ b/website/www/site/data/authors.yml @@ -283,4 +283,7 @@ jkinard: email: jkinard@google.com jkim: name: Jaehyeon Kim - email: dottami@gmail.com \ No newline at end of file + email: dottami@gmail.com +ffernandez92: + name: Ferran Fernandez + email: ffernandez.upc@gmail.com From bc8eb6650c6d0c58fac3991b292eef196c9a6626 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Ferran=20Fern=C3=A1ndez=20Garrido?= Date: Thu, 10 Oct 2024 17:35:33 +0200 Subject: [PATCH 03/17] [blog] - trail whitespace --- .../site/content/en/blog/beam-yaml-proto.md | 20 +++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/website/www/site/content/en/blog/beam-yaml-proto.md b/website/www/site/content/en/blog/beam-yaml-proto.md index 1d0590a99304..833dcee8ef32 100644 --- a/website/www/site/content/en/blog/beam-yaml-proto.md +++ b/website/www/site/content/en/blog/beam-yaml-proto.md @@ -20,7 +20,7 @@ See the License for the specific language governing permissions and limitations under the License. --> -As streaming data processing grows, so do its maintenance, complexity, and costs. +As streaming data processing grows, so do its maintenance, complexity, and costs. This post will explain how to efficiently scale pipelines using [Protobuf](https://protobuf.dev/), ensuring they are reusable and quick to deploy. Our goal is to keep this process simple for engineers to implement using [Beam YAML](https://beam.apache.org/documentation/sdks/yaml/). @@ -37,7 +37,7 @@ allowing you to focus solely on the most important part: data transformation. Some of the main key benefits include: * **Readability:** By using a declarative language ([YAML](https://yaml.org/)), we improve the human readability - aspect of the pipeline configuration. +aspect of the pipeline configuration. * **Reusability:** It is much simpler to reuse the same components across different pipelines. * **Maintainability:** It simplifies pipeline maintenance and updates. @@ -107,14 +107,14 @@ message MovieEvent { ``` As you can see here, there are important points to consider. Since we are planning to write these events to BigQuery, -we have imported the *[bq_field](https://buf.build/googlecloudplatform/bq-schema-api/file/main:bq_field.proto)* -and *[bq_table](https://buf.build/googlecloudplatform/bq-schema-api/file/main:bq_table.proto)* proto. +we have imported the *[bq_field](https://buf.build/googlecloudplatform/bq-schema-api/file/main:bq_field.proto)* +and *[bq_table](https://buf.build/googlecloudplatform/bq-schema-api/file/main:bq_table.proto)* proto. These proto files help generate the BigQuery JSON schema. In our example, we are also advocating for a shift-left approach, which means we want to move testing, quality, and performance as early as possible in the development process. This is why we have included the *buf.validate* elements to ensure that only valid events are generated from the source. -Once we have our *movie_event.proto* in the *events/v1* folder, we can generate +Once we have our *movie_event.proto* in the *events/v1* folder, we can generate the necessary [file descriptor](https://buf.build/docs/reference/descriptors). Essentially, a file descriptor is a compiled representation of the schema that allows various tools and systems to understand and work with Protobuf data dynamically. To simplify the process, we are using Buf in this example, @@ -169,7 +169,7 @@ Running the following two commands we will generate the necessary Java, Python, // Generate the buf.lock file buf deps update -// It will generate the descriptor in descriptor.binp. +// It will generate the descriptor in descriptor.binp. buf build . -o descriptor.binp --exclude-imports // It will generate the Java, Python and BigQuery schema as described in buf.gen.yaml @@ -208,8 +208,8 @@ As you can see, we just changed the format to be *PROTO* and added the *file_des ### Let’s use Terraform to deploy it -We can consider using [Terraform](https://www.terraform.io/) to deploy our Beam YAML pipeline -with [Dataflow](https://cloud.google.com/products/dataflow?hl=en) as the runner. +We can consider using [Terraform](https://www.terraform.io/) to deploy our Beam YAML pipeline +with [Dataflow](https://cloud.google.com/products/dataflow?hl=en) as the runner. The following Terraform code example demonstrates how to achieve this: ```hcl @@ -248,7 +248,7 @@ Kafka and writes them into BigQuery. Some potential improvements that can be done as part of community contributions to the previous Beam YAML code are: -* **Supporting Schema Registries:** Integrate with schema registries such as Buf Registry or Apicurio for +* **Supporting Schema Registries:** Integrate with schema registries such as Buf Registry or Apicurio for better schema management. In the current solution, we generate the descriptors via Buf and store them in GCS. We could store them in a schema registry instead. @@ -256,7 +256,7 @@ We could store them in a schema registry instead. * **Enhanced Monitoring:** Implement advanced monitoring and alerting mechanisms to quickly identify and address issues in the data pipeline. -As a conclusion, by leveraging Beam YAML and Protobuf, we have streamlined the creation and maintenance of +As a conclusion, by leveraging Beam YAML and Protobuf, we have streamlined the creation and maintenance of data processing pipelines, significantly reducing complexity. This approach ensures that engineers can more efficiently implement and scale robust, reusable pipelines, compared to writing the equivalent Beam code manually. From bade784f3e78968d048d0d8591e451756ff56ee9 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Ferran=20Fern=C3=A1ndez=20Garrido?= Date: Thu, 10 Oct 2024 17:46:57 +0200 Subject: [PATCH 04/17] [blog] - trail whitespace2 --- website/www/site/content/en/blog/beam-yaml-proto.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/website/www/site/content/en/blog/beam-yaml-proto.md b/website/www/site/content/en/blog/beam-yaml-proto.md index 833dcee8ef32..44a3d0f4352f 100644 --- a/website/www/site/content/en/blog/beam-yaml-proto.md +++ b/website/www/site/content/en/blog/beam-yaml-proto.md @@ -41,7 +41,7 @@ aspect of the pipeline configuration. * **Reusability:** It is much simpler to reuse the same components across different pipelines. * **Maintainability:** It simplifies pipeline maintenance and updates. -The following template shows an example of reading events from a [Kafka](https://kafka.apache.org/intro) topic and +The following template shows an example of reading events from a [Kafka](https://kafka.apache.org/intro) topic and writing them into [BigQuery](https://cloud.google.com/bigquery?hl=en). ```yaml From b20f36e2e584135ba069e3b9a82a596198d74dee Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Ferran=20Fern=C3=A1ndez=20Garrido?= Date: Fri, 11 Oct 2024 13:47:50 +0200 Subject: [PATCH 05/17] Update website/www/site/content/en/blog/beam-yaml-proto.md Co-authored-by: Rebecca Szper <98840847+rszper@users.noreply.github.com> --- website/www/site/content/en/blog/beam-yaml-proto.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/website/www/site/content/en/blog/beam-yaml-proto.md b/website/www/site/content/en/blog/beam-yaml-proto.md index 44a3d0f4352f..71fbef015cd0 100644 --- a/website/www/site/content/en/blog/beam-yaml-proto.md +++ b/website/www/site/content/en/blog/beam-yaml-proto.md @@ -21,7 +21,7 @@ limitations under the License. --> As streaming data processing grows, so do its maintenance, complexity, and costs. -This post will explain how to efficiently scale pipelines using [Protobuf](https://protobuf.dev/), +This post explains how to efficiently scale pipelines by using [Protobuf](https://protobuf.dev/), ensuring they are reusable and quick to deploy. Our goal is to keep this process simple for engineers to implement using [Beam YAML](https://beam.apache.org/documentation/sdks/yaml/). From c5c83b2e7bf47f642f36e3d12e80c7ff2883ec27 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Ferran=20Fern=C3=A1ndez=20Garrido?= Date: Fri, 11 Oct 2024 13:48:34 +0200 Subject: [PATCH 06/17] Update website/www/site/content/en/blog/beam-yaml-proto.md Co-authored-by: Rebecca Szper <98840847+rszper@users.noreply.github.com> --- website/www/site/content/en/blog/beam-yaml-proto.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/website/www/site/content/en/blog/beam-yaml-proto.md b/website/www/site/content/en/blog/beam-yaml-proto.md index 71fbef015cd0..7d95af50b8ca 100644 --- a/website/www/site/content/en/blog/beam-yaml-proto.md +++ b/website/www/site/content/en/blog/beam-yaml-proto.md @@ -22,7 +22,7 @@ limitations under the License. As streaming data processing grows, so do its maintenance, complexity, and costs. This post explains how to efficiently scale pipelines by using [Protobuf](https://protobuf.dev/), -ensuring they are reusable and quick to deploy. Our goal is to keep this process simple +which ensures that pipelines are reusable and quick to deploy. The goal is to keep this process simple for engineers to implement using [Beam YAML](https://beam.apache.org/documentation/sdks/yaml/). From b224fb9f9da79a47dad9d0a6d3f5c9d3f512990f Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Ferran=20Fern=C3=A1ndez=20Garrido?= Date: Fri, 11 Oct 2024 13:48:41 +0200 Subject: [PATCH 07/17] Update website/www/site/content/en/blog/beam-yaml-proto.md Co-authored-by: Rebecca Szper <98840847+rszper@users.noreply.github.com> --- website/www/site/content/en/blog/beam-yaml-proto.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/website/www/site/content/en/blog/beam-yaml-proto.md b/website/www/site/content/en/blog/beam-yaml-proto.md index 7d95af50b8ca..a5c39c1021e6 100644 --- a/website/www/site/content/en/blog/beam-yaml-proto.md +++ b/website/www/site/content/en/blog/beam-yaml-proto.md @@ -27,7 +27,7 @@ for engineers to implement using [Beam YAML](https://beam.apache.org/documentati -# Simplifying Pipelines with Beam YAML +## Simplify pipelines with Beam YAML Creating a pipeline in Beam can be somewhat difficult, especially for newcomers with little experience with Beam. Setting up the project, managing dependencies, and so on can be challenging. From 42db58d10a680eaa02d61b0a66a31ca92ca5d948 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Ferran=20Fern=C3=A1ndez=20Garrido?= Date: Fri, 11 Oct 2024 13:48:50 +0200 Subject: [PATCH 08/17] Update website/www/site/content/en/blog/beam-yaml-proto.md Co-authored-by: Rebecca Szper <98840847+rszper@users.noreply.github.com> --- website/www/site/content/en/blog/beam-yaml-proto.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/website/www/site/content/en/blog/beam-yaml-proto.md b/website/www/site/content/en/blog/beam-yaml-proto.md index a5c39c1021e6..420b328efa29 100644 --- a/website/www/site/content/en/blog/beam-yaml-proto.md +++ b/website/www/site/content/en/blog/beam-yaml-proto.md @@ -29,7 +29,7 @@ for engineers to implement using [Beam YAML](https://beam.apache.org/documentati ## Simplify pipelines with Beam YAML -Creating a pipeline in Beam can be somewhat difficult, especially for newcomers with little experience with Beam. +Creating a pipeline in Beam can be somewhat difficult, especially for new Apache Beam users. Setting up the project, managing dependencies, and so on can be challenging. Beam YAML helps eliminate most of the boilerplate code, allowing you to focus solely on the most important part: data transformation. From d213d38ac1c8f9bc6739cb06d7bef462b7786d78 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Ferran=20Fern=C3=A1ndez=20Garrido?= Date: Fri, 11 Oct 2024 13:49:02 +0200 Subject: [PATCH 09/17] Update website/www/site/content/en/blog/beam-yaml-proto.md Co-authored-by: Rebecca Szper <98840847+rszper@users.noreply.github.com> --- website/www/site/content/en/blog/beam-yaml-proto.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/website/www/site/content/en/blog/beam-yaml-proto.md b/website/www/site/content/en/blog/beam-yaml-proto.md index 420b328efa29..1a3e70ac0eb7 100644 --- a/website/www/site/content/en/blog/beam-yaml-proto.md +++ b/website/www/site/content/en/blog/beam-yaml-proto.md @@ -31,7 +31,7 @@ for engineers to implement using [Beam YAML](https://beam.apache.org/documentati Creating a pipeline in Beam can be somewhat difficult, especially for new Apache Beam users. Setting up the project, managing dependencies, and so on can be challenging. -Beam YAML helps eliminate most of the boilerplate code, +By using Beam YAML, you can eliminate most of the boilerplate code, allowing you to focus solely on the most important part: data transformation. Some of the main key benefits include: From e6dc05f879ef68426b4cafd509c1fc486d517f45 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Ferran=20Fern=C3=A1ndez=20Garrido?= Date: Fri, 11 Oct 2024 13:51:42 +0200 Subject: [PATCH 10/17] Update website/www/site/content/en/blog/beam-yaml-proto.md Co-authored-by: Rebecca Szper <98840847+rszper@users.noreply.github.com> --- website/www/site/content/en/blog/beam-yaml-proto.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/website/www/site/content/en/blog/beam-yaml-proto.md b/website/www/site/content/en/blog/beam-yaml-proto.md index 1a3e70ac0eb7..4ddf06133753 100644 --- a/website/www/site/content/en/blog/beam-yaml-proto.md +++ b/website/www/site/content/en/blog/beam-yaml-proto.md @@ -32,7 +32,7 @@ for engineers to implement using [Beam YAML](https://beam.apache.org/documentati Creating a pipeline in Beam can be somewhat difficult, especially for new Apache Beam users. Setting up the project, managing dependencies, and so on can be challenging. By using Beam YAML, you can eliminate most of the boilerplate code, -allowing you to focus solely on the most important part: data transformation. +which allows you to focus on the most important part of the work: data transformation. Some of the main key benefits include: From 7b5bf93e3745de6dd77ff4ccee633a1169998696 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Ferran=20Fern=C3=A1ndez=20Garrido?= Date: Fri, 11 Oct 2024 13:51:52 +0200 Subject: [PATCH 11/17] Update website/www/site/content/en/blog/beam-yaml-proto.md Co-authored-by: Rebecca Szper <98840847+rszper@users.noreply.github.com> --- website/www/site/content/en/blog/beam-yaml-proto.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/website/www/site/content/en/blog/beam-yaml-proto.md b/website/www/site/content/en/blog/beam-yaml-proto.md index 4ddf06133753..3cdfe9476eb3 100644 --- a/website/www/site/content/en/blog/beam-yaml-proto.md +++ b/website/www/site/content/en/blog/beam-yaml-proto.md @@ -266,7 +266,7 @@ Developers who wish to help build out and add functionalities are welcome to sta Beam YAML module found [here](https://github.com/apache/beam/tree/master/sdks/python/apache_beam/yaml). There is also a list of open [bugs](https://github.com/apache/beam/issues?q=is%3Aopen+is%3Aissue+label%3Ayaml) found -on the GitHub repo - now marked with the 'yaml' tag. +on the GitHub repo - now marked with the `yaml` tag. While Beam YAML has been marked stable as of Beam 2.52, it is still under heavy development, with new features being added with each release. Those who wish to be part of the design decisions and give insights to how the framework is From cb3c673f512f3cb5daa6218f68e74d3d72ad8f99 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Ferran=20Fern=C3=A1ndez=20Garrido?= Date: Fri, 11 Oct 2024 13:52:05 +0200 Subject: [PATCH 12/17] Update website/www/site/content/en/blog/beam-yaml-proto.md Co-authored-by: Rebecca Szper <98840847+rszper@users.noreply.github.com> --- website/www/site/content/en/blog/beam-yaml-proto.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/website/www/site/content/en/blog/beam-yaml-proto.md b/website/www/site/content/en/blog/beam-yaml-proto.md index 3cdfe9476eb3..1c67add75c80 100644 --- a/website/www/site/content/en/blog/beam-yaml-proto.md +++ b/website/www/site/content/en/blog/beam-yaml-proto.md @@ -268,7 +268,7 @@ Beam YAML module found [here](https://github.com/apache/beam/tree/master/sdks/py There is also a list of open [bugs](https://github.com/apache/beam/issues?q=is%3Aopen+is%3Aissue+label%3Ayaml) found on the GitHub repo - now marked with the `yaml` tag. -While Beam YAML has been marked stable as of Beam 2.52, it is still under heavy development, with new features being +Although Beam YAML is marked stable as of Beam 2.52, it is still under heavy development, with new features being added with each release. Those who wish to be part of the design decisions and give insights to how the framework is being used are highly encouraged to join the dev mailing list as those discussions will be directed there. A link to the dev list can be found [here](https://beam.apache.org/community/contact-us/). From 94cb34a4a9b5709689f26c353943354ffae7f51f Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Ferran=20Fern=C3=A1ndez=20Garrido?= Date: Fri, 11 Oct 2024 13:52:14 +0200 Subject: [PATCH 13/17] Update website/www/site/content/en/blog/beam-yaml-proto.md Co-authored-by: Rebecca Szper <98840847+rszper@users.noreply.github.com> --- website/www/site/content/en/blog/beam-yaml-proto.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/website/www/site/content/en/blog/beam-yaml-proto.md b/website/www/site/content/en/blog/beam-yaml-proto.md index 1c67add75c80..9126135fae80 100644 --- a/website/www/site/content/en/blog/beam-yaml-proto.md +++ b/website/www/site/content/en/blog/beam-yaml-proto.md @@ -269,6 +269,6 @@ There is also a list of open [bugs](https://github.com/apache/beam/issues?q=is%3 on the GitHub repo - now marked with the `yaml` tag. Although Beam YAML is marked stable as of Beam 2.52, it is still under heavy development, with new features being -added with each release. Those who wish to be part of the design decisions and give insights to how the framework is +added with each release. Those who want to be part of the design decisions and give insights to how the framework is being used are highly encouraged to join the dev mailing list as those discussions will be directed there. A link to the dev list can be found [here](https://beam.apache.org/community/contact-us/). From 2af4991280febf5906b67dfbc80d8be0141b647c Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Ferran=20Fern=C3=A1ndez=20Garrido?= Date: Fri, 11 Oct 2024 13:52:23 +0200 Subject: [PATCH 14/17] Update website/www/site/content/en/blog/beam-yaml-proto.md Co-authored-by: Rebecca Szper <98840847+rszper@users.noreply.github.com> --- website/www/site/content/en/blog/beam-yaml-proto.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/website/www/site/content/en/blog/beam-yaml-proto.md b/website/www/site/content/en/blog/beam-yaml-proto.md index 9126135fae80..d83f8b8016c7 100644 --- a/website/www/site/content/en/blog/beam-yaml-proto.md +++ b/website/www/site/content/en/blog/beam-yaml-proto.md @@ -270,5 +270,5 @@ on the GitHub repo - now marked with the `yaml` tag. Although Beam YAML is marked stable as of Beam 2.52, it is still under heavy development, with new features being added with each release. Those who want to be part of the design decisions and give insights to how the framework is -being used are highly encouraged to join the dev mailing list as those discussions will be directed there. A link to +being used are highly encouraged to join the [dev mailing list](https://beam.apache.org/community/contact-us/), where those discussions are occurring. the dev list can be found [here](https://beam.apache.org/community/contact-us/). From 923f053cfb8d4a0232b904641457893e98f3f338 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Ferran=20Fern=C3=A1ndez=20Garrido?= Date: Fri, 11 Oct 2024 13:52:34 +0200 Subject: [PATCH 15/17] Update website/www/site/content/en/blog/beam-yaml-proto.md Co-authored-by: Rebecca Szper <98840847+rszper@users.noreply.github.com> --- website/www/site/content/en/blog/beam-yaml-proto.md | 1 - 1 file changed, 1 deletion(-) diff --git a/website/www/site/content/en/blog/beam-yaml-proto.md b/website/www/site/content/en/blog/beam-yaml-proto.md index d83f8b8016c7..440f183c0277 100644 --- a/website/www/site/content/en/blog/beam-yaml-proto.md +++ b/website/www/site/content/en/blog/beam-yaml-proto.md @@ -271,4 +271,3 @@ on the GitHub repo - now marked with the `yaml` tag. Although Beam YAML is marked stable as of Beam 2.52, it is still under heavy development, with new features being added with each release. Those who want to be part of the design decisions and give insights to how the framework is being used are highly encouraged to join the [dev mailing list](https://beam.apache.org/community/contact-us/), where those discussions are occurring. -the dev list can be found [here](https://beam.apache.org/community/contact-us/). From f1bfaa30fe7fe271bd2a739b93e5da1f367bf1ac Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Ferran=20Fern=C3=A1ndez=20Garrido?= Date: Fri, 11 Oct 2024 13:54:44 +0200 Subject: [PATCH 16/17] Apply suggestions from code review Co-authored-by: Rebecca Szper <98840847+rszper@users.noreply.github.com> --- .../site/content/en/blog/beam-yaml-proto.md | 74 +++++++++---------- 1 file changed, 37 insertions(+), 37 deletions(-) diff --git a/website/www/site/content/en/blog/beam-yaml-proto.md b/website/www/site/content/en/blog/beam-yaml-proto.md index 440f183c0277..db56154750d5 100644 --- a/website/www/site/content/en/blog/beam-yaml-proto.md +++ b/website/www/site/content/en/blog/beam-yaml-proto.md @@ -34,12 +34,11 @@ Setting up the project, managing dependencies, and so on can be challenging. By using Beam YAML, you can eliminate most of the boilerplate code, which allows you to focus on the most important part of the work: data transformation. -Some of the main key benefits include: +Some of the key benefits of Beam YAML include: -* **Readability:** By using a declarative language ([YAML](https://yaml.org/)), we improve the human readability -aspect of the pipeline configuration. -* **Reusability:** It is much simpler to reuse the same components across different pipelines. -* **Maintainability:** It simplifies pipeline maintenance and updates. +* **Readability:** By using a declarative language ([YAML](https://yaml.org/)), the pipeline configuration is more human readable. +* **Reusability:** Reusing the same components across different pipelines is simplified. +* **Maintainability:** Pipeline maintenance and updates are easier. The following template shows an example of reading events from a [Kafka](https://kafka.apache.org/intro) topic and writing them into [BigQuery](https://cloud.google.com/bigquery?hl=en). @@ -66,9 +65,11 @@ options: dataflow_service_options: [streaming_mode_at_least_once] ``` -# Bringing It All Together +## The complete workflow -### Let's create a simple proto event: +This section demonstrates the complete workflow for this pipeline. + +### Create a simple proto event ```protobuf // events/v1/movie_event.proto @@ -106,19 +107,18 @@ message MovieEvent { } ``` -As you can see here, there are important points to consider. Since we are planning to write these events to BigQuery, -we have imported the *[bq_field](https://buf.build/googlecloudplatform/bq-schema-api/file/main:bq_field.proto)* -and *[bq_table](https://buf.build/googlecloudplatform/bq-schema-api/file/main:bq_table.proto)* proto. +Because these events are written to BigQuery, +the [`bq_field`](https://buf.build/googlecloudplatform/bq-schema-api/file/main:bq_field.proto) proto +and the [`bq_table`](https://buf.build/googlecloudplatform/bq-schema-api/file/main:bq_table.proto) proto are imported. These proto files help generate the BigQuery JSON schema. -In our example, we are also advocating for a shift-left approach, which means we want to move testing, quality, -and performance as early as possible in the development process. This is why we have included the *buf.validate* -elements to ensure that only valid events are generated from the source. +This example also demonstrates a shift-left approach, which moves testing, quality, +and performance as early as possible in the development process. For example, to ensure that only valid events are generated from the source, the `buf.validate` elements are included. -Once we have our *movie_event.proto* in the *events/v1* folder, we can generate +After you create the `movie_event.proto` proto in the `events/v1` folder, you can generate the necessary [file descriptor](https://buf.build/docs/reference/descriptors). -Essentially, a file descriptor is a compiled representation of the schema that allows various tools and systems -to understand and work with Protobuf data dynamically. To simplify the process, we are using Buf in this example, -so we will need the following configuration files. +A file descriptor is a compiled representation of the schema that allows various tools and systems +to understand and work with protobuf data dynamically. To simplify the process, this example uses Buf, +which requires the following configuration files. Buf configuration: @@ -163,22 +163,22 @@ plugins: ``` -Running the following two commands we will generate the necessary Java, Python, BigQuery schema, and Descriptor File: +Run the following two commands to generate the necessary Java, Python, BigQuery schema, and Descriptor file: ```bash // Generate the buf.lock file buf deps update -// It will generate the descriptor in descriptor.binp. +// It generates the descriptor in descriptor.binp. buf build . -o descriptor.binp --exclude-imports -// It will generate the Java, Python and BigQuery schema as described in buf.gen.yaml +// It generates the Java, Python and BigQuery schema as described in buf.gen.yaml buf generate --include-imports ``` -# Let’s make our Beam YAML read proto: +### Make the Beam YAML read proto -These are the modifications we need to make to the YAML file: +Make the following modifications to the to the YAML file: ```yaml # movie_events_pipeline.yml @@ -204,11 +204,11 @@ options: dataflow_service_options: [streaming_mode_at_least_once] ``` -As you can see, we just changed the format to be *PROTO* and added the *file_descriptor_path* and the *message_name*. +This step changes the format to `PROTO` and adds the `file_descriptor_path` and the `message_name`. -### Let’s use Terraform to deploy it +### Deploy the pipeline with Terraform -We can consider using [Terraform](https://www.terraform.io/) to deploy our Beam YAML pipeline +You can use [Terraform](https://www.terraform.io/) to deploy the Beam YAML pipeline with [Dataflow](https://cloud.google.com/products/dataflow?hl=en) as the runner. The following Terraform code example demonstrates how to achieve this: @@ -240,30 +240,30 @@ resource "google_dataflow_flex_template_job" "data_movie_job" { } ``` -Assuming we have created the BigQuery table, which can also be done using Terraform and Proto as described earlier, -the previous code should create a Dataflow job using our Beam YAML code that reads Proto events from +Assuming the BigQuery table exists, which you can do by using Terraform and Proto, +this code creates a Dataflow job by using the Beam YAML code that reads Proto events from Kafka and writes them into BigQuery. -# Improvements and Conclusions +## Improvements and conclusions -Some potential improvements that can be done as part of community contributions to the previous Beam YAML code are: +The following community contributions could improve the Beam YAML code in this example: -* **Supporting Schema Registries:** Integrate with schema registries such as Buf Registry or Apicurio for -better schema management. In the current solution, we generate the descriptors via Buf and store them in GCS. -We could store them in a schema registry instead. +* **Support schema registries:** Integrate with schema registries such as Buf Registry or Apicurio for +better schema management. The current workflow generates the descriptors by using Buf and store them in Google Cloud Storage. +The descriptors could be stored in a schema registry instead. * **Enhanced Monitoring:** Implement advanced monitoring and alerting mechanisms to quickly identify and address issues in the data pipeline. -As a conclusion, by leveraging Beam YAML and Protobuf, we have streamlined the creation and maintenance of +Leveraging Beam YAML and Protobuf lets us streamline the creation and maintenance of data processing pipelines, significantly reducing complexity. This approach ensures that engineers can more -efficiently implement and scale robust, reusable pipelines, compared to writing the equivalent Beam code manually. +efficiently implement and scale robust, reusable pipelines without needs to manually write Beam code. -## Contributing +## Contribute -Developers who wish to help build out and add functionalities are welcome to start contributing to the effort in the -Beam YAML module found [here](https://github.com/apache/beam/tree/master/sdks/python/apache_beam/yaml). +Developers who want to help build out and add functionalities are welcome to start contributing to the effort in the +[Beam YAML module](https://github.com/apache/beam/tree/master/sdks/python/apache_beam/yaml). There is also a list of open [bugs](https://github.com/apache/beam/issues?q=is%3Aopen+is%3Aissue+label%3Ayaml) found on the GitHub repo - now marked with the `yaml` tag. From f713b3e30e9ef1c6b56df81abdd32e22c20209ff Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Ferran=20Fern=C3=A1ndez=20Garrido?= Date: Fri, 11 Oct 2024 23:00:54 +0200 Subject: [PATCH 17/17] Apply suggestions from code review Co-authored-by: Rebecca Szper <98840847+rszper@users.noreply.github.com> --- website/www/site/content/en/blog/beam-yaml-proto.md | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/website/www/site/content/en/blog/beam-yaml-proto.md b/website/www/site/content/en/blog/beam-yaml-proto.md index db56154750d5..995b59b978c1 100644 --- a/website/www/site/content/en/blog/beam-yaml-proto.md +++ b/website/www/site/content/en/blog/beam-yaml-proto.md @@ -20,6 +20,8 @@ See the License for the specific language governing permissions and limitations under the License. --> +# Efficient Streaming Data Processing with Beam YAML and Protobuf + As streaming data processing grows, so do its maintenance, complexity, and costs. This post explains how to efficiently scale pipelines by using [Protobuf](https://protobuf.dev/), which ensures that pipelines are reusable and quick to deploy. The goal is to keep this process simple @@ -31,7 +33,7 @@ for engineers to implement using [Beam YAML](https://beam.apache.org/documentati Creating a pipeline in Beam can be somewhat difficult, especially for new Apache Beam users. Setting up the project, managing dependencies, and so on can be challenging. -By using Beam YAML, you can eliminate most of the boilerplate code, +Beam YAML eliminates most of the boilerplate code, which allows you to focus on the most important part of the work: data transformation. Some of the key benefits of Beam YAML include: @@ -71,6 +73,8 @@ This section demonstrates the complete workflow for this pipeline. ### Create a simple proto event +The following code creates a simple movie event. + ```protobuf // events/v1/movie_event.proto