Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[YAML] Increase re-use of providers with implicitly overlapping transforms. #30793

Merged
merged 5 commits into from
May 1, 2024

Conversation

robertwb
Copy link
Contributor

We use the schema transform discovery service to determine all the transforms that a given provider vends which may be larger than those that were explicitly declared (e.g. the basic mapping transforms are part of java core) and use this to expand the possible set of transforms this provider can be used to instanciate (attaching the appropreate renaming, etc. as needed). This can greatly increase the chances that we can use the same provider, and hence same SDK and environment, for adjacent steps.

This is done lazily as transforms are used to avoid instanciating all possible providers for all pipelines.


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests
Go tests

See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.

…forms.

We use the schema transform discovery service to determine all the
transforms that a given provider vends which may be larger than those
that were explicitly declared (e.g. the basic mapping transforms
are part of java core) and use this to expand the possible set of
transforms this provider can be used to instanciate (attaching the
appropreate renaming, etc. as needed). This can greatly increase the
chances that we can use the same provider, and hence same SDK
and environment, for adjacent steps.

This is done lazily as transforms are used to avoid instanciating
all possible providers for all pipelines.
@robertwb
Copy link
Contributor Author

R: @Polber

Copy link
Contributor

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control

@robertwb
Copy link
Contributor Author

Any update on this?

Copy link
Contributor

@Polber Polber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took a bit for me to wrap my head around the logic, but this makes sense to me.

So now with this logic, if we were to package all turnkey external transforms in a common gradle target, a single provider and expansion service could be used, correct?

@robertwb
Copy link
Contributor Author

Took a bit for me to wrap my head around the logic, but this makes sense to me.

So now with this logic, if we were to package all turnkey external transforms in a common gradle target, a single provider and expansion service could be used, correct?

Yes. (Though the downside of a huge common target is that it may be much more heavyweight than what the user needs. There can also be dependency conflicts. But it's something we could experiment with.)

@robertwb robertwb merged commit dc5841d into apache:master May 1, 2024
71 of 72 checks passed
@damccorm
Copy link
Contributor

damccorm commented May 3, 2024

Looks like this likely broke our python tests - https://github.com/apache/beam/actions/workflows/python_tests.yml?query=branch%3Amaster+event%3Apush - since this was merged, they've been permared/failing on yaml tests

if not isinstance(other_provider, ExternalProvider):
return

all_urns = set(other_provider.schema_transforms().keys()).union(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is part of the stack trace. I'm very confident this PR is the culprit, creating a revert for now since I'm not sure what the immediate solution is

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... the only thing red was coverage in the presubmits.

Do you have a rollback already or should I create one?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#31169 - still waiting on presubmits to go green

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... the only thing red was coverage in the presubmits.

Yeah, I noticed this - I'm not sure why... It looks like the failures are around 2.57.0.dev, so it could be some versioning issue?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked that we've published snapshots and everything though

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants