Managed BigQueryIO #31486
Conversation
R: @chamikaramj
…hods can instead be done in Dataflow service side
LGTM but this is a lot so good to get a second set of eyes. I could easily have missed something.
@@ -36,6 +36,9 @@ dependencies {
   permitUnusedDeclared project(":sdks:java:io:google-cloud-platform") // BEAM-11761
   implementation project(":sdks:java:extensions:schemaio-expansion-service")
   permitUnusedDeclared project(":sdks:java:extensions:schemaio-expansion-service") // BEAM-11761
+  implementation project(":sdks:java:managed")
+  permitUnusedDeclared project(":sdks:java:managed") // BEAM-11761
Notes, no action required on this PR:
- This is a link to Jira, so there is probably a GitHub issue it was migrated to.
- This should be equivalent to `runtimeOnly`, because it is `implementation` with no static references to it. I would guess this works the same, or else the uberjar plugin might not treat it right.
- Putting these deps into a Docker container without making an uber jar would honestly be better in the case where it does end up in a container, so we keep the original jar metadata.
Thanks.
if (!Strings.isNullOrEmpty(configuration.getCreateDisposition())) {
  CreateDisposition createDisposition =
      CreateDisposition.valueOf(configuration.getCreateDisposition().toUpperCase());
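The snippet above normalizes a user-supplied string before mapping it onto an enum. A minimal standalone sketch of that pattern follows; the enum here is a local stand-in for Beam's `BigQueryIO.Write.CreateDisposition`, and the `parse` helper name is illustrative, not part of the PR.

```java
// Minimal sketch of case-insensitive enum parsing, mirroring the guarded
// CreateDisposition handling in the diff above. The enum values mirror
// BigQuery's create dispositions but are defined locally for the example.
class DispositionParsing {
  enum CreateDisposition { CREATE_IF_NEEDED, CREATE_NEVER }

  // Returns null for an absent/empty config value, mirroring the
  // Strings.isNullOrEmpty guard in the snippet above.
  static CreateDisposition parse(String raw) {
    if (raw == null || raw.isEmpty()) {
      return null;
    }
    // Enum.valueOf is case-sensitive, so normalize the input first.
    return CreateDisposition.valueOf(raw.toUpperCase());
  }

  public static void main(String[] args) {
    System.out.println(parse("create_if_needed")); // prints CREATE_IF_NEEDED
    System.out.println(parse(""));                 // prints null
  }
}
```

Note that an unrecognized value makes `valueOf` throw `IllegalArgumentException`, which surfaces the bad configuration at construction time.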
As a larger point, I think we should do any transform overriding in job submission (BQ modes for batch/streaming etc.) so that we can just upgrade in the backend (at least in the first version).
Do you mean making this switch in the SDK (i.e., at construction time)? I assumed we had settled on making it a runner-side decision.
Some decisions are actually dependent on the runner (e.g. at least one streaming mode in Dataflow)
> Do you mean making this switch in the SDK (i.e., at construction time)? I assumed we had settled on making it a runner-side decision.

Yeah. Added some comments to the relevant doc.
Based on an offline discussion, we should do the forking to select the correct write transform at a single 'BigQueryWriteSchemaWriteTransformProvider' instead of expecting the caller to perform the expansion for the exact implementation. The pipeline options needed for this (for example, This should be possible since all BQ write schema-transforms here share the same configuration. Maybe we should keep read general in a similar manner to support future expansions / read methods. cc: @robertwb
I suggest we do the forking on the SDK side only if it's actually necessary. Keep in mind that this approach will require adding the Dataflow runner as a dependency (to access dataflowServiceOptions) in Managed API or GCP IOs (runner-independent modules). This PR used to include such forking, but that logic was reverted due to this seemingly unavoidable dependency:
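The runner-independent part of the forking discussed above can be sketched without any Dataflow dependency: choose an underlying write transform URN purely from properties visible at construction time. The URNs below are the ones shown in this PR's diffs; the chooser function itself is hypothetical, and runner-specific choices (such as Dataflow streaming modes) would still be deferred to the runner.

```java
// Hypothetical sketch of SDK-side forking between the BigQuery write
// schema-transforms, using only construction-time information (input
// boundedness) and no runner dependency.
class BigQueryWriteFork {
  // URNs as they appear in this PR's diffs.
  static final String FILE_LOADS =
      "beam:schematransform:org.apache.beam:bigquery_fileloads:v1";
  static final String STORAGE_WRITE =
      "beam:schematransform:org.apache.beam:bigquery_storage_write:v2";

  // Bounded (batch) input defaults to file loads; unbounded input needs
  // the Storage Write API. This split is an assumption for illustration.
  static String chooseWriteUrn(boolean inputIsBounded) {
    return inputIsBounded ? FILE_LOADS : STORAGE_WRITE;
  }

  public static void main(String[] args) {
    System.out.println(chooseWriteUrn(true));  // file loads URN
    System.out.println(chooseWriteUrn(false)); // storage write URN
  }
}
```

The point of keeping the chooser free of runner types is exactly the dependency concern raised above: the GCP I/O and Managed modules stay runner-independent.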
Thanks!
@@ -61,7 +60,7 @@ protected SchemaTransform from(BigQueryWriteConfiguration configuration) {

   @Override
   public String identifier() {
-    return getUrn(ExternalTransforms.ManagedTransforms.Urns.BIGQUERY_FILE_LOADS);
+    return "beam:schematransform:org.apache.beam:bigquery_fileloads:v1";
I think it's still fine to define these URNs in the proto.
I feel like we don't need to right now? Since we're using a wrapper over file loads and storage api writes
Maybe it will be useful for any Python wrappers that directly use specific schema-transforms?
In that case maybe we can do this in a separate PR that targets all schematransforms?
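If the identifiers stay as hard-coded strings rather than proto-generated URNs, a cheap guard is to check them against the naming convention they follow. The regex below encodes the `beam:schematransform:org.apache.beam:<name>:v<version>` shape seen in this PR's diffs; it is an assumption for illustration, not an official Beam rule.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical consistency check for hard-coded schema-transform URNs:
// when the URNs are no longer generated from the proto, a pattern check
// can at least keep their format uniform across providers.
class UrnFormatCheck {
  static final Pattern SCHEMATRANSFORM_URN =
      Pattern.compile("^beam:schematransform:org\\.apache\\.beam:[a-z0-9_]+:v\\d+$");

  static boolean isValid(String urn) {
    return SCHEMATRANSFORM_URN.matcher(urn).matches();
  }

  public static void main(String[] args) {
    // The two URNs introduced in this PR.
    List<String> urns = List.of(
        "beam:schematransform:org.apache.beam:bigquery_fileloads:v1",
        "beam:schematransform:org.apache.beam:bigquery_storage_write:v2");
    urns.forEach(u -> System.out.println(u + " -> " + isValid(u)));
  }
}
```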
@@ -100,7 +98,7 @@ protected SchemaTransform from(BigQueryWriteConfiguration configuration) {

   @Override
   public String identifier() {
-    return getUrn(ExternalTransforms.ManagedTransforms.Urns.BIGQUERY_STORAGE_WRITE);
+    return "beam:schematransform:org.apache.beam:bigquery_storage_write:v2";
Ditto.
readPipeline
    .apply(Managed.read(Managed.BIGQUERY).withConfig(config))
    .getSinglePCollection();
PAssert.that(outputRows).containsInAnyOrder(ROWS);
Also confirm that we end up using the correct sink here.
Initially did not know how we would perform such a check. Will think about it again and give it a try
Added tests that verify this by looking at the pipeline proto, PTAL!
LGTM. Feel free to merge once comments are addressed.
…t transform is chosen
@@ -121,6 +153,8 @@ public void testStreamingStorageWriteRead() {
   // streaming write
   PCollectionRowTuple.of("input", getInput(writePipeline, true))
       .apply(Managed.write(Managed.BIGQUERY).withConfig(config));
+  assertPipelineContainsTransformIdentifier(
+      writePipeline, new BigQueryStorageWriteApiSchemaTransformProvider().identifier());
So the MANAGED_UNDERLYING_TRANSFORM_URN_KEY annotation of the "Managed" transform will mention the top-level BigQueryWriteSchemaTransform instead of the specific implementation for the write method, right?
Probably, we should add a unit test for this if we don't have one already.
> So the MANAGED_UNDERLYING_TRANSFORM_URN_KEY annotation of the "Managed" transform will mention the top-level BigQueryWriteSchemaTransform instead of the specific implementation for the write method, right?

Hmmm, good catch. I just switched it to mentioning the top-level URN. This means we lose information about what the underlying implementation is, though.
> we should add a unit test for this

Adding unit tests to the respective schematransform test classes.
I tweaked the test to look for the transform name instead. Not as clean as looking for a URN, but it does the job
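The name-based check described above amounts to scanning the transforms' unique names, as they would appear in the pipeline proto, for the expected underlying transform name. A runnable stand-in for that logic (with made-up transform names, since the real ones come from the expanded pipeline) might look like this:

```java
import java.util.List;

// Illustrative stand-in for an assertPipelineContainsTransformIdentifier-style
// check: search the pipeline's transform unique names for the expected
// underlying transform name. The names below are invented for the example.
class TransformNameCheck {
  static boolean containsTransform(List<String> uniqueNames, String expected) {
    // Substring match, since unique names embed the transform name in a
    // slash-separated hierarchy.
    return uniqueNames.stream().anyMatch(n -> n.contains(expected));
  }

  public static void main(String[] args) {
    List<String> names = List.of(
        "Managed.write/BigQueryWrite/BigQueryStorageWriteApiSchemaTransform",
        "Managed.write/BigQueryWrite/PrepareWrite");
    System.out.println(
        containsTransform(names, "BigQueryStorageWriteApiSchemaTransform")); // prints true
  }
}
```

As the comment above notes, matching on names is more brittle than matching on URNs, but it works once the annotation only carries the top-level URN.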
…change tests to verify underlying transform name
Thanks. LGTM.
P.S. Added one last commit to expose it to the Python SDK.