From 497ca670d6ce33d9cc4e1aa34873a19c632efa44 Mon Sep 17 00:00:00 2001 From: Claude Date: Tue, 19 Nov 2024 14:25:33 +0000 Subject: [PATCH 1/2] Add BigQuery enrichment bug to changes.md and blogs --- CHANGES.md | 27 +++++++++++++++++++ .../www/site/content/en/blog/beam-2.57.0.md | 4 +++ .../www/site/content/en/blog/beam-2.58.0.md | 4 +++ .../www/site/content/en/blog/beam-2.59.0.md | 5 +++- .../www/site/content/en/blog/beam-2.60.0.md | 5 +++- 5 files changed, 43 insertions(+), 2 deletions(-) diff --git a/CHANGES.md b/CHANGES.md index d0fede5cb5f9..adc384b1ff05 100644 --- a/CHANGES.md +++ b/CHANGES.md @@ -131,6 +131,10 @@ * (Java) Fixed tearDown not invoked when DoFn throws on Portable Runners ([#18592](https://github.com/apache/beam/issues/18592), [#31381](https://github.com/apache/beam/issues/31381)). * (Java) Fixed protobuf error with MapState.remove() in Dataflow Streaming Java Legacy Runner without Streaming Engine ([#32892](https://github.com/apache/beam/issues/32892)). * Adding flag to support conditionally disabling auto-commit in JdbcIO ReadFn ([#31111](https://github.com/apache/beam/issues/31111)) +* (Python) Fixed BigQuery Enrichment bug that can laed to multiple conditions returning duplicate rows, batching returning incorrect results and conditions not scoped by row during batching ([#32780](https://github.com/apache/beam/pull/32780)). +* Fixed BigQuery Enrichment (Python): The following issues are present when using the BigQuery enrichment transform ([#32780](https://github.com/apache/beam/pull/32780)): + * Duplicate Rows: Multiple conditions may be applied incorrectly, leading to the duplication of rows in the output. + * Incorrect Results with Batched Requests: Conditions may not be correctly scoped to individual rows within the batch, potentially causing inaccurate results. ## Security Fixes * Fixed (CVE-YYYY-NNNN)[https://www.cve.org/CVERecord?id=CVE-YYYY-NNNN] (Java/Python/Go) ([#X](https://github.com/apache/beam/issues/X)). @@ -186,6 +190,13 @@ when running on 3.8. ([#31192](https://github.com/apache/beam/issues/31192)) * (Java, Python, Go) Fixed PeriodicSequence backlog bytes reporting, which was preventing Dataflow Runner autoscaling from functioning properly ([#32506](https://github.com/apache/beam/issues/32506)). * (Java) Fix improper decoding of rows with schemas containing nullable fields when encoded with a schema with equal encoding positions but modified field order. ([#32388](https://github.com/apache/beam/issues/32388)). +## Known Issues + +* BigQuery Enrichment (Python): The following issues are present when using the BigQuery enrichment transform ([#32780](https://github.com/apache/beam/pull/32780)): + * Duplicate Rows: Multiple conditions may be applied incorrectly, leading to the duplication of rows in the output. + * Incorrect Results with Batched Requests: Conditions may not be correctly scoped to individual rows within the batch, potentially causing inaccurate results. + * Fixed in 2.61.0. + # [2.59.0] - 2024-09-11 ## Highlights @@ -229,6 +240,10 @@ when running on 3.8. ([#31192](https://github.com/apache/beam/issues/31192)) * If your pipeline is having difficulty with the Python or Java direct runners, but runs well on Prism, please let us know. * Java file-based IOs read or write lots (100k+) files could experience slowness and/or broken metrics visualization on Dataflow UI [#32649](https://github.com/apache/beam/issues/32649). +* BigQuery Enrichment (Python): The following issues are present when using the BigQuery enrichment transform ([#32780](https://github.com/apache/beam/pull/32780)): + * Duplicate Rows: Multiple conditions may be applied incorrectly, leading to the duplication of rows in the output. + * Incorrect Results with Batched Requests: Conditions may not be correctly scoped to individual rows within the batch, potentially causing inaccurate results. + * Fixed in 2.61.0. # [2.58.1] - 2024-08-15 @@ -240,6 +255,10 @@ when running on 3.8. ([#31192](https://github.com/apache/beam/issues/31192)) * Large Dataflow graphs using runner v2, or pipelines explicitly enabling the `upload_graph` experiment, will fail at construction time ([#32159](https://github.com/apache/beam/issues/32159)). * Python pipelines that run with 2.53.0-2.58.0 SDKs and read data from GCS might be affected by a data corruption issue ([#32169](https://github.com/apache/beam/issues/32169)). The issue will be fixed in 2.59.0 ([#32135](https://github.com/apache/beam/pull/32135)). To work around this, update the google-cloud-storage package to version 2.18.2 or newer. +* BigQuery Enrichment (Python): The following issues are present when using the BigQuery enrichment transform ([#32780](https://github.com/apache/beam/pull/32780)): + * Duplicate Rows: Multiple conditions may be applied incorrectly, leading to the duplication of rows in the output. + * Incorrect Results with Batched Requests: Conditions may not be correctly scoped to individual rows within the batch, potentially causing inaccurate results. + * Fixed in 2.61.0. # [2.58.0] - 2024-08-06 @@ -271,6 +290,10 @@ when running on 3.8. ([#31192](https://github.com/apache/beam/issues/31192)) * Large Dataflow graphs using runner v2, or pipelines explicitly enabling the `upload_graph` experiment, will fail at construction time ([#32159](https://github.com/apache/beam/issues/32159)). * Python pipelines that run with 2.53.0-2.58.0 SDKs and read data from GCS might be affected by a data corruption issue ([#32169](https://github.com/apache/beam/issues/32169)). The issue will be fixed in 2.59.0 ([#32135](https://github.com/apache/beam/pull/32135)). To work around this, update the google-cloud-storage package to version 2.18.2 or newer. * [KafkaIO] Records read with `ReadFromKafkaViaSDF` are redistributed and may contain duplicates regardless of the configuration. This affects Java pipelines with Dataflow v2 runner and xlang pipelines reading from Kafka, ([#32196](https://github.com/apache/beam/issues/32196)) +* BigQuery Enrichment (Python): The following issues are present when using the BigQuery enrichment transform ([#32780](https://github.com/apache/beam/pull/32780)): + * Duplicate Rows: Multiple conditions may be applied incorrectly, leading to the duplication of rows in the output. + * Incorrect Results with Batched Requests: Conditions may not be correctly scoped to individual rows within the batch, potentially causing inaccurate results. + * Fixed in 2.61.0. # [2.57.0] - 2024-06-26 @@ -327,6 +350,10 @@ when running on 3.8. ([#31192](https://github.com/apache/beam/issues/31192)) * Large Dataflow graphs using runner v2, or pipelines explicitly enabling the `upload_graph` experiment, will fail at construction time ([#32159](https://github.com/apache/beam/issues/32159)). * Python pipelines that run with 2.53.0-2.58.0 SDKs and read data from GCS might be affected by a data corruption issue ([#32169](https://github.com/apache/beam/issues/32169)). The issue will be fixed in 2.59.0 ([#32135](https://github.com/apache/beam/pull/32135)). To work around this, update the google-cloud-storage package to version 2.18.2 or newer. +* BigQuery Enrichment (Python): The following issues are present when using the BigQuery enrichment transform ([#32780](https://github.com/apache/beam/pull/32780)): + * Duplicate Rows: Multiple conditions may be applied incorrectly, leading to the duplication of rows in the output. + * Incorrect Results with Batched Requests: Conditions may not be correctly scoped to individual rows within the batch, potentially causing inaccurate results. + * Fixed in 2.61.0. # [2.56.0] - 2024-05-01 diff --git a/website/www/site/content/en/blog/beam-2.57.0.md b/website/www/site/content/en/blog/beam-2.57.0.md index b583b4ee3c51..7be75a7891c5 100644 --- a/website/www/site/content/en/blog/beam-2.57.0.md +++ b/website/www/site/content/en/blog/beam-2.57.0.md @@ -79,6 +79,10 @@ For more information on changes in 2.57.0, check out the [detailed release notes ## Known Issues * Python pipelines that run with 2.53.0-2.58.0 SDKs and read data from GCS might be affected by a data corruption issue ([#32169](https://github.com/apache/beam/issues/32169)). The issue will be fixed in 2.59.0 ([#32135](https://github.com/apache/beam/pull/32135)). To work around this, update the google-cloud-storage package to version 2.18.2 or newer. +* BigQuery Enrichment (Python): The following issues are present when using the BigQuery enrichment transform ([#32780](https://github.com/apache/beam/pull/32780)): + * Duplicate Rows: Multiple conditions may be applied incorrectly, leading to the duplication of rows in the output. + * Incorrect Results with Batched Requests: Conditions may not be correctly scoped to individual rows within the batch, potentially causing inaccurate results. + * Fixed in 2.61.0. For the most up to date list of known issues, see https://github.com/apache/beam/blob/master/CHANGES.md diff --git a/website/www/site/content/en/blog/beam-2.58.0.md b/website/www/site/content/en/blog/beam-2.58.0.md index cfdf23c725e0..0d944fd419d4 100644 --- a/website/www/site/content/en/blog/beam-2.58.0.md +++ b/website/www/site/content/en/blog/beam-2.58.0.md @@ -53,6 +53,10 @@ For more information about changes in 2.58.0, check out the [detailed release no * Python pipelines that run with 2.53.0-2.58.0 SDKs and read data from GCS might be affected by a data corruption issue ([#32169](https://github.com/apache/beam/issues/32169)). The issue will be fixed in 2.59.0 ([#32135](https://github.com/apache/beam/pull/32135)). To work around this, update the google-cloud-storage package to version 2.18.2 or newer. * [KafkaIO] Records read with `ReadFromKafkaViaSDF` are redistributed and may contain duplicates regardless of the configuration. This affects Java pipelines with Dataflow v2 runner and xlang pipelines reading from Kafka, ([#32196](https://github.com/apache/beam/issues/32196)) +* BigQuery Enrichment (Python): The following issues are present when using the BigQuery enrichment transform ([#32780](https://github.com/apache/beam/pull/32780)): + * Duplicate Rows: Multiple conditions may be applied incorrectly, leading to the duplication of rows in the output. + * Incorrect Results with Batched Requests: Conditions may not be correctly scoped to individual rows within the batch, potentially causing inaccurate results. + * Fixed in 2.61.0. For the most up to date list of known issues, see https://github.com/apache/beam/blob/master/CHANGES.md diff --git a/website/www/site/content/en/blog/beam-2.59.0.md b/website/www/site/content/en/blog/beam-2.59.0.md index 6ce81e7c48eb..846e45916d66 100644 --- a/website/www/site/content/en/blog/beam-2.59.0.md +++ b/website/www/site/content/en/blog/beam-2.59.0.md @@ -66,8 +66,11 @@ For more information on changes in 2.59.0, check out the [detailed release notes * In the 2.59.0 release, Prism passes most runner validations tests with the exceptions of pipelines using the following features: OrderedListState, OnWindowExpiry (eg. GroupIntoBatches), CustomWindows, MergingWindowFns, Trigger and WindowingStrategy associated features, Bundle Finalization, Looping Timers, and some Coder related issues such as with Python combiner packing, and Java Schema transforms, and heterogenous flatten coders. Processing Time timers do not yet have real time support. * If your pipeline is having difficulty with the Python or Java direct runners, but runs well on Prism, please let us know. - * Java file-based IOs read or write lots (100k+) files could experience slowness and/or broken metrics visualization on Dataflow UI [#32649](https://github.com/apache/beam/issues/32649). +* BigQuery Enrichment (Python): The following issues are present when using the BigQuery enrichment transform ([#32780](https://github.com/apache/beam/pull/32780)): + * Duplicate Rows: Multiple conditions may be applied incorrectly, leading to the duplication of rows in the output. + * Incorrect Results with Batched Requests: Conditions may not be correctly scoped to individual rows within the batch, potentially causing inaccurate results. + * Fixed in 2.61.0. For the most up to date list of known issues, see https://github.com/apache/beam/blob/master/CHANGES.md diff --git a/website/www/site/content/en/blog/beam-2.60.0.md b/website/www/site/content/en/blog/beam-2.60.0.md index 462bdaf16798..ae5e0284ccdd 100644 --- a/website/www/site/content/en/blog/beam-2.60.0.md +++ b/website/www/site/content/en/blog/beam-2.60.0.md @@ -70,7 +70,10 @@ when running on 3.8. ([#31192](https://github.com/apache/beam/issues/31192)) ## Known Issues -N/A +* BigQuery Enrichment (Python): The following issues are present when using the BigQuery enrichment transform ([#32780](https://github.com/apache/beam/pull/32780)): + * Duplicate Rows: Multiple conditions may be applied incorrectly, leading to the duplication of rows in the output. + * Incorrect Results with Batched Requests: Conditions may not be correctly scoped to individual rows within the batch, potentially causing inaccurate results. + * Fixed in 2.61.0. For the most up to date list of known issues, see https://github.com/apache/beam/blob/master/CHANGES.md From 2daf0668e3e6d4950e6360210497d23929231760 Mon Sep 17 00:00:00 2001 From: Claude Date: Tue, 19 Nov 2024 15:46:57 +0000 Subject: [PATCH 2/2] Address comments --- CHANGES.md | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/CHANGES.md b/CHANGES.md index adc384b1ff05..4c8118bcd292 100644 --- a/CHANGES.md +++ b/CHANGES.md @@ -131,10 +131,7 @@ * (Java) Fixed tearDown not invoked when DoFn throws on Portable Runners ([#18592](https://github.com/apache/beam/issues/18592), [#31381](https://github.com/apache/beam/issues/31381)). * (Java) Fixed protobuf error with MapState.remove() in Dataflow Streaming Java Legacy Runner without Streaming Engine ([#32892](https://github.com/apache/beam/issues/32892)). * Adding flag to support conditionally disabling auto-commit in JdbcIO ReadFn ([#31111](https://github.com/apache/beam/issues/31111)) -* (Python) Fixed BigQuery Enrichment bug that can laed to multiple conditions returning duplicate rows, batching returning incorrect results and conditions not scoped by row during batching ([#32780](https://github.com/apache/beam/pull/32780)). -* Fixed BigQuery Enrichment (Python): The following issues are present when using the BigQuery enrichment transform ([#32780](https://github.com/apache/beam/pull/32780)): - * Duplicate Rows: Multiple conditions may be applied incorrectly, leading to the duplication of rows in the output. - * Incorrect Results with Batched Requests: Conditions may not be correctly scoped to individual rows within the batch, potentially causing inaccurate results. +* (Python) Fixed BigQuery Enrichment bug that can lead to multiple conditions returning duplicate rows, batching returning incorrect results and conditions not scoped by row during batching ([#32780](https://github.com/apache/beam/pull/32780)). ## Security Fixes * Fixed (CVE-YYYY-NNNN)[https://www.cve.org/CVERecord?id=CVE-YYYY-NNNN] (Java/Python/Go) ([#X](https://github.com/apache/beam/issues/X)).