-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Managed Transform protos & translation; Iceberg SchemaTransforms & translation #30910
Managed Transform protos & translation; Iceberg SchemaTransforms & translation #30910
Conversation
Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
...io/iceberg/src/main/java/org/apache/beam/io/iceberg/IcebergWriteSchemaTransformProvider.java
Outdated
Show resolved
Hide resolved
...io/iceberg/src/main/java/org/apache/beam/io/iceberg/IcebergWriteSchemaTransformProvider.java
Outdated
Show resolved
Hide resolved
...io/iceberg/src/main/java/org/apache/beam/io/iceberg/IcebergWriteSchemaTransformProvider.java
Outdated
Show resolved
Hide resolved
…t schema; add iceberg to IO expansion service
…erg_translation
…erg_write_schematransform
…erg_write_schematransform
…erg_translation Pulling Read connector and making a translation for that too.
… into iceberg_write_schematransform
… Managed and Iceberg urns from proto and use SCHEMA_TRANSFORM URN
.../iceberg/src/main/java/org/apache/beam/sdk/io/iceberg/IcebergSchemaTransformTranslation.java
Outdated
Show resolved
Hide resolved
…se conversion step from Python auto-xlang; spotless
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks.
I think this includes what we want in the release. But this won't work end-to-end for upgrading till we update the ExpansionService logic as I mentioned in a comment.
sdks/java/io/iceberg/src/main/java/org/apache/beam/sdk/io/iceberg/FileWriteResult.java
Outdated
Show resolved
Hide resolved
sdks/java/io/iceberg/src/main/java/org/apache/beam/sdk/io/iceberg/IcebergIO.java
Show resolved
Hide resolved
...ceberg/src/main/java/org/apache/beam/sdk/io/iceberg/IcebergSchemaTransformCatalogConfig.java
Show resolved
Hide resolved
.../iceberg/src/main/java/org/apache/beam/sdk/io/iceberg/IcebergSchemaTransformTranslation.java
Outdated
Show resolved
Hide resolved
.../iceberg/src/main/java/org/apache/beam/sdk/io/iceberg/IcebergSchemaTransformTranslation.java
Outdated
Show resolved
Hide resolved
.../iceberg/src/main/java/org/apache/beam/sdk/io/iceberg/IcebergSchemaTransformTranslation.java
Outdated
Show resolved
Hide resolved
…A_TRANSFORM urn, fetch underlying identifier
…ersions from python side
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. LGTM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your patience with all my comments, both on the CL and out of band.
There are a huge number of separable changes going on in this PR, but given the time constraint it probably isn't worth separating them into separate PRs/commits at this point, so we can get this in as is.
Thank you all for the valuable feedback. Merging this now |
* iceberg write schematransform and test * cleanup * IcebergIO translation and tests * add sanity check for building with Row; add documentation about output schema; add iceberg to IO expansion service * spotless * spotless * permitUnusedDeclared iceberg * Change ManagedSchemaTransformProvider to take a Row config instead of a Yaml string * don't auto generate external wrapper for this just yet * spotless * spotless * Read schematransform and tests * pulling in IcebergIO changes; spotless * icebergio translation; managed translation; protos * spotless * spotless; use underscore instead of camel case field names when translating managed transform config * add grpc dependency * updated proto description; fix gen xlang command * ManagedTransform explicit input/output types; move iceberg package to org.apache.beam.sdk.io.iceberg * externalizable IcebergCatalogConfig * externalizable IcebergCatalogConfig supports all properties; address some comments * unify iceberg urns and identifiers; update some comments * one source for all supported managed transform identifiers * add documentation * custom serialization for OneTableDynamicDestinations * add iceberg via managed API tests; update proto doc * rename config; change test schematransform location * spotless * add missing package-info file * spotless * replace icebergIO translation with iceberg schematransform translation; fix Schema::sorted to do recursive sorting * remove ExternalizableIcebergCatalogConfig (no longer needed) * pull identifiers from generated proto * remove unused hadoop dependency * update generate sequence wrapper after Schema sorting * managed transform translation uses default schema * yaml returns null row; cleanup * spotless * remove SchemaAwareTransformPayload and use SchemaTransformPayload instead; rename StandardSchemaAwareTransforms -> ManagedSchemaAwareTransforms * create a beam-schema-compatible class for Snapshot info * removed new proto file and moved Managed URNs to beam_runner_api.proto; we now use SchemaTransformPayload for all schematransforms, including Managed; adding a version number to FileWriteResult encoding so that we can use it to fork in the future whhen needed * Row and Schema snake_case <-> camelCase conversion logic * Row sorted() util * use Row::sorted to fetch Managed & Iceberg row configs * use snake_case convention when translating transforms to spec; remove Managed and Iceberg urns from proto and use SCHEMA_TRANSFORM URN * spotless * cleanup * DefaultSchemaProvider can now provide the underlying SchemaProvider * perform snake_case <-> camelCase conversions directly in TypedSchemaTransformProvider * update icebergIO and managed translations to reflect field name convention changes * sorted SnapshotInfo * update manual Python wrappers to use snake_case convention; remove case conversion step from Python auto-xlang; spotless * Row utils allow nullable * add FileWriteResult test for version number; fix existing Java and YAML tests * add schema-aware transform urn to transform annotations during translation * add comments why we sort and snake_case configuration schemas * add SchemaTransformTranslation abstraction. when encountering a SCHEMA_TRANSFORM urn, fetch underlying identifier * add documentation * prioritize registered providers; remove snake_case <-> camelCase conversions from python side * cleanup
sorted()
,toSnakeCase()
,toCamelCase()
)snake_case
as the convention for SchemaTransform configuration field names (Fixes [Task]: TypedSchemaTransformProvider should generate Schema field names withlower_snake_case
convention #31061)