-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[YAML] Increase re-use of providers with implicitly overlapping transforms. #30793
Conversation
…forms. We use the schema transform discovery service to determine all the transforms that a given provider vends which may be larger than those that were explicitly declared (e.g. the basic mapping transforms are part of java core) and use this to expand the possible set of transforms this provider can be used to instanciate (attaching the appropreate renaming, etc. as needed). This can greatly increase the chances that we can use the same provider, and hence same SDK and environment, for adjacent steps. This is done lazily as transforms are used to avoid instanciating all possible providers for all pipelines.
R: @Polber |
Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control |
Any update on this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took a bit for me to wrap my head around the logic, but this makes sense to me.
So now with this logic, if we were to package all turnkey external transforms in a common gradle target, a single provider and expansion service could be used, correct?
Yes. (Though the downside of a huge common target is that it may be much more heavyweight than what the user needs. There can also be dependency conflicts. But it's something we could experiment with.) |
Looks like this likely broke our python tests - https://github.com/apache/beam/actions/workflows/python_tests.yml?query=branch%3Amaster+event%3Apush - since this was merged, they've been permared/failing on yaml tests |
if not isinstance(other_provider, ExternalProvider): | ||
return | ||
|
||
all_urns = set(other_provider.schema_transforms().keys()).union( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line is part of the stack trace. I'm very confident this PR is the culprit, creating a revert for now since I'm not sure what the immediate solution is
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... the only thing red was coverage in the presubmits.
Do you have a rollback already or should I create one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#31169 - still waiting on presubmits to go green
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... the only thing red was coverage in the presubmits.
Yeah, I noticed this - I'm not sure why... It looks like the failures are around 2.57.0.dev, so it could be some versioning issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked that we've published snapshots and everything though
We use the schema transform discovery service to determine all the transforms that a given provider vends which may be larger than those that were explicitly declared (e.g. the basic mapping transforms are part of java core) and use this to expand the possible set of transforms this provider can be used to instanciate (attaching the appropreate renaming, etc. as needed). This can greatly increase the chances that we can use the same provider, and hence same SDK and environment, for adjacent steps.
This is done lazily as transforms are used to avoid instanciating all possible providers for all pipelines.
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
addresses #123
), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, commentfixes #<ISSUE NUMBER>
instead.CHANGES.md
with noteworthy changes.See the Contributor Guide for more tips on how to make review process smoother.
To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.