Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Schema inference parameterized types #32757

Merged

Conversation

reuvenlax
Copy link
Contributor

@reuvenlax reuvenlax commented Oct 11, 2024

Beam automatically infers schema for common Java types: POJOs, JavaBean, AutoValue, AVRO, among others. In addition for it's use with schema transforms, this also provides a simple, efficient way of using these types in PCollections without needing to construct a Coder. A frequent complaint has been that this inference did not work in the presence of generic type parameters. E.g. this means that while Beam can handle PCollections of this class:

@DefaultSchema(JavaFieldSchema.class)
class MyType {
String field1;
}
PCollection myTypes = read();

Schema inference would fail for this:

@DefaultSchema(JavaFieldSchema.class)
class MyType<T> {
T field1;
}
PCollection<MyType> myTypes = read();

This PR adds support for generic type parameters to schema inference for common types, addressing the above issue.

Copy link
Contributor

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

@reuvenlax reuvenlax force-pushed the schema_inference_parameterized_types branch from 1957eea to 1366db3 Compare October 13, 2024 16:17
@reuvenlax reuvenlax requested a review from ahmedabu98 October 13, 2024 23:50
Copy link
Contributor

@ahmedabu98 ahmedabu98 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Functionality looks sound and tests look great

Left some nits

There's few changes to public, non-internal, method signatures that are breaking though. Might make sense to overload these methods just in case (would also narrow down the scope of this PR)

@reuvenlax
Copy link
Contributor Author

@ahmedabu98 Any more comments? Are you ok with marking these classes as Internal, or do you prefer adding overloads?

Copy link
Contributor

@ahmedabu98 ahmedabu98 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to marking classes as @Internal

As for overloads (do you mean the ReflectUtils.get__Type() methods?), it'd be an improvement but doesn't need to block this PR

LGTM 👍🏽

@ahmedabu98
Copy link
Contributor

I believe this change introduced a regression (see #32795)

Iceberg unit and integration tests are failing with the following error:

java.lang.IllegalArgumentException: unable to serialize SchemaCoder<Schema: Fields:
Field{...
Field{...

Tests are failing at HEAD, but passing when reverting this PR: #32802

@kennknowles
Copy link
Member

@reuvenlax
Copy link
Contributor Author

reuvenlax commented Oct 17, 2024 via email

@stankiewicz
Copy link
Contributor

I'm curious if prior work from #32081 had same issues. Can @tilgalas take a look as he explored this feature already and his PR was waiting in review queue for extended amount of time.

@ahmedabu98
Copy link
Contributor

Created an issue for the failing YAML tests: #32832

@kennknowles
Copy link
Member

The fix has been submitted - is that Yaml test still failing? Separately we should figure out why none of those tests triggered on my initial PR :(

On Wed, Oct 16, 2024 at 6:00 PM Kenn Knowles @.> wrote: Suspected culprit of breaking this one too https://github.com/apache/beam/actions/workflows/beam_PreCommit_Yaml_Xlang_Direct.yml?query=branch%3Amaster+is%3Acompleted — Reply to this email directly, view it on GitHub <#32757 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AFAYJVJWYRWOHOV74QY6PS3Z34D4HAVCNFSM6AAAAABPZOTUP2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDIMJYGI3TCNJYGE . You are receiving this because you modified the open/close state.Message ID: @.>

Yea we could see if there is a natural precommit trigger that would make sense. It may be that these are far enough apart in the codebase that postcommit is where we expect to catch it anyhow. Then we would roll back and roll forward touching a trigger file in .github/trigger_files

@Polber Polber mentioned this pull request Oct 17, 2024
3 tasks
reuvenlax added a commit to reuvenlax/incubator-beam that referenced this pull request Oct 21, 2024
reuvenlax added a commit that referenced this pull request Oct 22, 2024
reeba212 pushed a commit to reeba212/beam that referenced this pull request Dec 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants