[Feature Request]: Develop a way to evolve the construction schema used by transforms safely #31686

Open
1 of 16 tasks
chamikaramj opened this issue Jun 25, 2024 · 4 comments

Comments

@chamikaramj
Contributor

chamikaramj commented Jun 25, 2024

What would you like to happen?

We ran into several regressions where updates to the construction schema of I/O transforms broke transform upgrades via TransformService (when upgrading from older Beam versions).

The pattern is:
(1) We add a new field/property to the I/O transform
(2) We update the schema in the corresponding IOTranslation class (for example, [1] [2])
(3) We update the translation logic to write and parse the new field.

If we do not take special mitigation actions, this might result in breakages during upgrade. This is because the Row objects we try to parse might actually come from older Beam versions where the new field does not exist.
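
To make the failure mode concrete, here is a minimal, language-neutral sketch in Python. It does not use Beam's actual Row or Schema classes: plain dicts stand in for config rows, and the field names (`table`, `write_disposition`, `new_option`) and all functions are hypothetical, chosen only for illustration.

```python
# Plain dicts stand in for Beam config rows; every name here is hypothetical.

# A config row as an older Beam version would have produced it:
# the field added later ("new_option") is absent.
OLD_ROW = {"table": "project:dataset.table", "write_disposition": "WRITE_APPEND"}

# A config row produced after the field was added.
NEW_ROW = {**OLD_ROW, "new_option": True}


def from_config_row_naive(row):
    """Assumes every row carries the newest schema; fails on older rows."""
    return {
        "table": row["table"],
        "write_disposition": row["write_disposition"],
        "new_option": row["new_option"],  # KeyError for rows from older versions
    }


def from_config_row_guarded(row):
    """Reads the new field only if present, else falls back to the old behavior."""
    return {
        "table": row["table"],
        "write_disposition": row["write_disposition"],
        "new_option": row.get("new_option", False),
    }


def naive_fails_on(row):
    """Returns True if the naive parser raises on the given row."""
    try:
        from_config_row_naive(row)
    except KeyError:
        return True
    return False
```

The guarded variant is the kind of "special mitigation" meant above: the parser has to tolerate rows that predate the field, and the fallback value should match what the transform did before the field existed.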

Example regressions/fixes: [3] [4]

Potential strategies to prevent similar issues in the future:
(1) Develop utils to safely migrate the schema
(2) More integration/compat tests
(3) Move to proto
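
As a rough illustration of how strategies (1) and (2) could fit together, the sketch below pairs a small migration helper with a compat-style check over rows from several versions. This is an assumption about what such a utility might look like, not an existing Beam API: `FieldSpec`, `migrate_row`, `check_compat`, the field names, and the version labels are all hypothetical, and plain dicts again stand in for config rows.

```python
from dataclasses import dataclass
from typing import Any, Dict, List

_REQUIRED = object()  # sentinel: the field has no default and must be present


@dataclass(frozen=True)
class FieldSpec:
    """Hypothetical field declaration: a name plus an optional default."""
    name: str
    default: Any = _REQUIRED


def migrate_row(row: Dict[str, Any], current_schema: List[FieldSpec]) -> Dict[str, Any]:
    """Returns a copy of `row` that conforms to `current_schema`.

    Fields missing from an older row are filled from their declared default;
    a missing field without a default raises ValueError. Fields not listed in
    `current_schema` are dropped.
    """
    migrated = {}
    for spec in current_schema:
        if spec.name in row:
            migrated[spec.name] = row[spec.name]
        elif spec.default is not _REQUIRED:
            migrated[spec.name] = spec.default
        else:
            raise ValueError(f"row is missing required field {spec.name!r}")
    return migrated


CURRENT_SCHEMA = [
    FieldSpec("table"),
    FieldSpec("write_disposition"),
    FieldSpec("new_option", default=False),  # added later, so it declares a default
]

# Fixture rows as different versions might have emitted them (made-up labels).
HISTORICAL_ROWS = {
    "older": {"table": "t", "write_disposition": "WRITE_APPEND"},
    "current": {"table": "t", "write_disposition": "WRITE_APPEND", "new_option": True},
}


def check_compat() -> Dict[str, Dict[str, Any]]:
    """Compat-test-style loop: every historical row must migrate cleanly."""
    return {
        label: migrate_row(row, CURRENT_SCHEMA)
        for label, row in HISTORICAL_ROWS.items()
    }
```

Requiring a default for every newly added field turns a silent upgrade-time failure into something a test over stored fixture rows can catch before release.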

[1] https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIOTranslation.java
[2]


[3] #30551
[4] #31685

Issue Priority

Priority: 2 (default / most feature requests should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
@chamikaramj
Contributor Author

I think we might run into similar issues when we try to migrate the schema for SchemaCoder (for streaming update operations).

@chamikaramj
Contributor Author

cc: @kennknowles @ahmedabu98 @robertwb

@kennknowles
Member

This could even impact jobs just running as cronjobs, right?

@chamikaramj
Contributor Author

You mean as opposed to streaming updates? If so, yes. This gets hit when a pipeline tries to upgrade a single transform to a new Beam version (via TransformService, for example).

https://beam.apache.org/documentation/programming-guide/#transform-service-usage-upgrade
