Roll forward "Read API Source v2 (#25392)" fix data loss #28778

Merged
Abacn merged 4 commits into apache:master from revert25392 on Feb 27, 2024

Conversation

Abacn
Contributor

@Abacn Abacn commented Oct 2, 2023

Fixes #26354

This reverts commit 4ce8eed.

setEnableBundling has been defunct since it was added to the Beam repo, and opting in to it results in data loss; see #26354. After #26267, the number of sources returned by split is no longer an issue, so BigQueryStorageStreamBundleSource is not needed after all.

@Abacn
Contributor Author

Abacn commented Oct 2, 2023

The only conflict is the removal of the Experimental annotation on master, which is why this diff (+105 −2,863) is not the exact inverse of the original PR (+2,867 −105).

@Abacn
Contributor Author

Abacn commented Oct 2, 2023

R: @ahmedabu98

@github-actions
Contributor

github-actions bot commented Oct 2, 2023

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control

@ahmedabu98
Contributor

IIRC the setEnableBundling option was implemented to resolve RPC latency issues (design doc here)

@vachan-shetty is this mode no longer needed?

@Abacn
Contributor Author

Abacn commented Nov 9, 2023

Added --enableStorageReadApiV2 pipeline option to replace --enableBundling, tested on both runner v1 and v2:

For the TPC-DS 1T dataset, it gives "Read session returned 1954 streams", and all records are read. See the internal GCP project.

@vachan-shetty PTAL

@Abacn Abacn changed the title from Revert "Read API Source v2 (#25392)" to Roll forward "Read API Source v2 (#25392)" fix data loss on Nov 9, 2023
@damccorm
Contributor

@Abacn @vachan-shetty what are next steps here?

@Abacn Abacn closed this Jan 26, 2024
@Abacn Abacn deleted the revert25392 branch January 26, 2024 18:22
@Abacn Abacn restored the revert25392 branch January 26, 2024 18:25
@Abacn Abacn reopened this Jan 26, 2024
@vachan-shetty
Contributor

👍🏽

@Abacn
Contributor Author

Abacn commented Feb 26, 2024

R: @ahmedabu98 Rebased onto the latest master to resolve the merge conflict. PTAL

@liferoad
Collaborator

Can we update CHANGES.md to explain this breaking change?

@Abacn
Contributor Author

Abacn commented Feb 27, 2024

Can we update CHANGES.md to explain this breaking change?

I thought it was fine to remove it without mentioning it, since the addition of --enableBundling wasn't mentioned in CHANGES.md either and the option has not been used. However, we should indeed note that "setEnableStorageReadApiV2" is experimental, so I will update CHANGES.md.

Comment on lines +168 to +173
        "If set, BigQueryIO.Read will rely on the Read API backends to surface the appropriate"
            + " number of streams for read")
    @Default.Boolean(false)
-   Boolean getEnableBundling();
+   Boolean getEnableStorageReadApiV2();

-   void setEnableBundling(Boolean value);
+   void setEnableStorageReadApiV2(Boolean value);
Contributor

Should this option be removed entirely?

Contributor Author

The effect of this option is to set the number of streams requested to 0, so that the server decides on an appropriate number of read streams (that is, Read API source v2). It is preserved for now.
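
For readers skimming this thread, the behavior described above can be modeled roughly as follows. This is a minimal sketch and not Beam's actual source code: the class `StreamCountSketch`, the method `requestedStreamCount`, and the parameter `desiredStreamCount` are hypothetical names chosen for illustration. Only the rule itself (request 0 streams when the option is on, so the server picks the count) comes from the comment above.

```java
// Hypothetical model of the option's effect; names are illustrative, not Beam's code.
final class StreamCountSketch {
  private StreamCountSketch() {}

  /**
   * Returns the stream count to put into a read-session request.
   * A value of 0 means "let the server choose an appropriate number of streams".
   *
   * @param enableStorageReadApiV2 whether the (experimental) option is set
   * @param desiredStreamCount the count the client would otherwise ask for (assumed input)
   */
  static int requestedStreamCount(boolean enableStorageReadApiV2, int desiredStreamCount) {
    if (desiredStreamCount < 0) {
      throw new IllegalArgumentException("desiredStreamCount must be >= 0");
    }
    // With the option on, ignore the client-side estimate and defer to the server.
    return enableStorageReadApiV2 ? 0 : desiredStreamCount;
  }

  public static void main(String[] args) {
    System.out.println(requestedStreamCount(false, 10)); // prints 10
    System.out.println(requestedStreamCount(true, 10));  // prints 0
  }
}
```

In this model the option only changes what is requested; how many streams the server actually returns (for example, the 1954 streams reported earlier in this conversation) is decided on the server side.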

Contributor

I see, makes sense

@Abacn
Contributor Author

Abacn commented Feb 27, 2024

SpannerChangeStreamOrderedWithinKeyIT.testOrderedWithinKey is an unrelated flaky test; merging for now.

@Abacn Abacn merged commit 549faba into apache:master Feb 27, 2024
17 of 19 checks passed
@Abacn Abacn deleted the revert25392 branch February 27, 2024 20:35
Development

Successfully merging this pull request may close these issues.

[Bug]: BigQueryIO direct read not reading all rows when set --setEnableBundling=true
5 participants