
Managed BigQueryIO #31486

Merged · 30 commits into apache:master · Nov 12, 2024

Conversation

ahmedabu98 (Contributor)

No description provided.

@ahmedabu98 (Contributor, Author)

R: @chamikaramj


github-actions bot commented Jun 3, 2024

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control

ahmedabu98 changed the title from "Supported BigQueryIO as a Managed Transform" to "Support BigQueryIO (Storage API) as a Managed Transform" (Jun 4, 2024)
ahmedabu98 removed this from the 2.57.0 Release milestone (Jun 4, 2024)
ahmedabu98 changed the title from "Support BigQueryIO (Storage API) as a Managed Transform" to "Managed BigQueryIO" (Jul 9, 2024)
ahmedabu98 added this to the 2.58.0 Release milestone (Jul 9, 2024)
@ahmedabu98 (Contributor, Author)

R: @chamikaramj
R: @robertwb

ahmedabu98 removed this from the 2.58.0 Release milestone (Jul 10, 2024)
@kennknowles (Member) left a comment

LGTM, but this is a lot, so it's good to get a second set of eyes; I could easily have missed something.

@@ -36,6 +36,9 @@ dependencies {
permitUnusedDeclared project(":sdks:java:io:google-cloud-platform") // BEAM-11761
implementation project(":sdks:java:extensions:schemaio-expansion-service")
permitUnusedDeclared project(":sdks:java:extensions:schemaio-expansion-service") // BEAM-11761
implementation project(":sdks:java:managed")
permitUnusedDeclared project(":sdks:java:managed") // BEAM-11761
Member:

Notes, no action required on this PR:

  • BEAM-11761 is a link to Jira, so there is probably a GitHub issue it has been migrated to.
  • This should be equivalent to runtimeOnly, since it is declared as "implementation" but has no static references. I would guess it works the same, unless the uber-jar plugin treats it differently.
  • In the case where these deps do end up in a container, putting them into the Docker container without building an uber jar would honestly be better, so we keep the original jar metadata.

@chamikaramj (Contributor) left a comment

Thanks.


// Map the user-supplied disposition string to BigQueryIO's CreateDisposition enum.
if (!Strings.isNullOrEmpty(configuration.getCreateDisposition())) {
  CreateDisposition createDisposition =
      CreateDisposition.valueOf(configuration.getCreateDisposition().toUpperCase());
Contributor:

As a larger point, I think we should do any transform overriding in job submission (BQ modes for batch/streaming etc.) so that we can just upgrade in the backend (at least in the first version).

Contributor Author:

Do you mean making this switch in the SDK (i.e. at construction time)? I assumed we had settled on making it a runner-side decision.

Some decisions are actually dependent on the runner (e.g. the at-least-once streaming mode in Dataflow).

Contributor:

> Do you mean making this switch in the SDK (i.e. at construction time)? I assumed we had settled on making it a runner-side decision.

Yeah. Added some comments to the relevant doc.

ahmedabu98 removed this from the 2.60.0 Release milestone (Oct 4, 2024)
liferoad added this to the 2.61.0 Release milestone (Oct 28, 2024)
@chamikaramj (Contributor)

Based on an offline discussion, we should do the forking to select the correct write transform at a single 'BigQueryWriteSchemaWriteTransformProvider' instead of expecting the caller to perform the expansion for the exact implementation. The pipeline options needed for this (for example, dataflowServiceOptions=streaming_mode_at_least_once) should be provided to this call (accessible via input.getPipeline().getOptions()). Also, we can use input.get(INPUT_ROWS_TAG).isBounded() to determine whether this is an unbounded call or not.

This should be possible since all the BQ write schema-transforms here share the same configuration.

Maybe we should keep the read side general in a similar manner, to support future expansions / read methods.

cc: @robertwb
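
[Editor's note: a minimal sketch of that single-provider forking, assuming hypothetical helpers writeViaFileLoads / writeViaStorageApi and a provider-level INPUT_ROWS_TAG; none of these names are confirmed to be the PR's actual code.]

    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.PCollectionRowTuple;
    import org.apache.beam.sdk.values.Row;

    // Inside the single write SchemaTransform returned by the provider:
    @Override
    public PCollectionRowTuple expand(PCollectionRowTuple input) {
      PCollection<Row> rows = input.get(INPUT_ROWS_TAG);
      // Options are visible at expansion time, e.g. to detect
      // dataflowServiceOptions=streaming_mode_at_least_once.
      PipelineOptions options = input.getPipeline().getOptions();
      // Bounded input -> file loads; unbounded input -> Storage Write API.
      boolean bounded = rows.isBounded() == PCollection.IsBounded.BOUNDED;
      return bounded ? writeViaFileLoads(rows, options) : writeViaStorageApi(rows, options);
    }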

@ahmedabu98 (Contributor, Author) commented Oct 31, 2024

I suggest we do the forking on the SDK side only if it's actually necessary. Keep in mind this approach will require adding the Dataflow runner as a dependency (to access dataflowServiceOptions) in Managed API or GCP IOs (runner-independent modules).

This PR used to include such forking but that logic was reverted due to this seemingly unavoidable dependency:
74bc178
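
[Editor's note: for illustration, the reverted logic needed access roughly along these lines (a sketch; DataflowPipelineOptions lives in the Dataflow runner module, which is exactly the dependency in question):]

    // Casting to DataflowPipelineOptions is what pulls in the Dataflow runner artifact:
    DataflowPipelineOptions opts =
        input.getPipeline().getOptions().as(DataflowPipelineOptions.class);
    boolean atLeastOnce =
        opts.getDataflowServiceOptions() != null
            && opts.getDataflowServiceOptions().contains("streaming_mode_at_least_once");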

@chamikaramj (Contributor) left a comment

Thanks!

@@ -61,7 +60,7 @@ protected SchemaTransform from(BigQueryWriteConfiguration configuration) {

  @Override
  public String identifier() {
-   return getUrn(ExternalTransforms.ManagedTransforms.Urns.BIGQUERY_FILE_LOADS);
+   return "beam:schematransform:org.apache.beam:bigquery_fileloads:v1";
Contributor:

I think it's still fine to define these URNs in the proto.

Contributor Author:

I feel like we don't need to right now, since we're using a wrapper over file loads and Storage API writes.

Contributor:

Maybe it will be useful for any Python wrappers that directly use specific schema-transforms?

Contributor Author:

In that case maybe we can do this in a separate PR that targets all schematransforms?
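
[Editor's note: for context, the proto-backed pattern under discussion looks roughly like this (a sketch following the convention in Beam's external_transforms.proto; exact message, field, and option names are assumptions):]

    // In external_transforms.proto (sketch):
    //   message ManagedTransforms {
    //     enum Urns {
    //       BIGQUERY_FILE_LOADS = 0 [(beam_urn) =
    //           "beam:schematransform:org.apache.beam:bigquery_fileloads:v1"];
    //     }
    //   }
    // Java then resolves the URN from the enum value's annotation, as in the old line above:
    String urn = getUrn(ExternalTransforms.ManagedTransforms.Urns.BIGQUERY_FILE_LOADS);

Keeping URNs in the proto gives all SDKs a single source of truth, which is one point in favor of the proto-based approach.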

@@ -100,7 +98,7 @@ protected SchemaTransform from(BigQueryWriteConfiguration configuration) {

  @Override
  public String identifier() {
-   return getUrn(ExternalTransforms.ManagedTransforms.Urns.BIGQUERY_STORAGE_WRITE);
+   return "beam:schematransform:org.apache.beam:bigquery_storage_write:v2";
Contributor:

Ditto.

PCollection<Row> outputRows =
    readPipeline
        .apply(Managed.read(Managed.BIGQUERY).withConfig(config))
        .getSinglePCollection();
PAssert.that(outputRows).containsInAnyOrder(ROWS);
Contributor:

Also confirm that we end up using the correct sink here.

Contributor Author:

Initially I did not know how we would perform such a check. Will think about it again and give it a try.

Contributor Author:

Added tests that verify this by looking at the pipeline proto, PTAL!
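
[Editor's note: a sketch of what such a proto-based check might look like; the helper name matches the diff below, but this body is an assumption rather than the PR's actual code, and PipelineTranslation's package has moved between Beam versions.]

    import static org.junit.Assert.assertTrue;

    import org.apache.beam.model.pipeline.v1.RunnerApi;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.util.construction.PipelineTranslation;

    static void assertPipelineContainsTransformIdentifier(Pipeline pipeline, String urn) {
      RunnerApi.Pipeline proto = PipelineTranslation.toProto(pipeline);
      // Look for any composite or leaf transform whose spec carries the expected URN.
      boolean found =
          proto.getComponents().getTransformsMap().values().stream()
              .anyMatch(t -> urn.equals(t.getSpec().getUrn()));
      assertTrue("Expected a transform with URN: " + urn, found);
    }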

@chamikaramj (Contributor)

LGTM. Feel free to merge once comments are addressed.

@@ -121,6 +153,8 @@ public void testStreamingStorageWriteRead() {
  // streaming write
  PCollectionRowTuple.of("input", getInput(writePipeline, true))
      .apply(Managed.write(Managed.BIGQUERY).withConfig(config));
+ assertPipelineContainsTransformIdentifier(
+     writePipeline, new BigQueryStorageWriteApiSchemaTransformProvider().identifier());
Contributor:

So the MANAGED_UNDERLYING_TRANSFORM_URN_KEY annotation of the "Managed" transform will mention the top-level BigQueryWriteSchemaTransform instead of the specific implementation for the write method, right?

Contributor:

Probably, we should add a unit test for this if we don't have one already.

Contributor Author:

> So the MANAGED_UNDERLYING_TRANSFORM_URN_KEY annotation of the "Managed" transform will mention the top-level BigQueryWriteSchemaTransform instead of the specific implementation for the write method, right?

Hmmm, good catch. I just switched it to mentioning the top-level URN. This means we lose information on what the underlying implementation is, though.
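
[Editor's note: for reference, inspecting that annotation in the pipeline proto could look roughly like this (a sketch; MANAGED_UNDERLYING_TRANSFORM_URN_KEY is the constant named in the discussion, and its string value is not given here):]

    import java.nio.charset.StandardCharsets;

    // Collect the underlying-transform-URN annotation values from every
    // transform in the pipeline proto.
    RunnerApi.Pipeline proto = PipelineTranslation.toProto(pipeline);
    proto.getComponents().getTransformsMap().values().stream()
        .map(t -> t.getAnnotationsMap().get(MANAGED_UNDERLYING_TRANSFORM_URN_KEY))
        .filter(java.util.Objects::nonNull)
        .map(b -> new String(b.toByteArray(), StandardCharsets.UTF_8))
        .forEach(System.out::println);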

@ahmedabu98 (Contributor, Author) commented Nov 8, 2024

> we should add a unit test for this

Adding unit tests to the respective schematransform test classes

Contributor Author:

I tweaked the test to look for the transform name instead. Not as clean as looking for a URN, but it does the job.
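
[Editor's note: the name-based check might then reduce to something like the following (again a sketch; expectedName is a stand-in parameter):]

    // Match on the transform's unique name in the pipeline proto instead of its URN.
    boolean found =
        proto.getComponents().getTransformsMap().values().stream()
            .anyMatch(t -> t.getUniqueName().contains(expectedName));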

@chamikaramj (Contributor) left a comment

Thanks. LGTM.

@ahmedabu98 (Contributor, Author)

P.S. Added one last commit to expose this to the Python SDK.

ahmedabu98 merged commit 628348b into apache:master (Nov 12, 2024)
113 of 116 checks passed