Beam 12994 - Python SDK BigQuery - Promote schemaUpdateOptions to named arguments #21867

Closed
waltage wants to merge 7 commits


@waltage waltage commented Jun 14, 2022

Python - IO - GCP - BigQuery:

Promotes schema update options that are currently available as entries in the additional_bq_parameters dict to named and verified arguments (similar to the current Create and Write Dispositions).

addresses #21141
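
For readers unfamiliar with the idea, here is a minimal sketch of what "named and verified" could look like. It is not the code from this PR: the class name `SchemaUpdateOption` and its `validate` method are hypothetical, loosely modelled on how the dispositions are exposed as constants. The two string values are the ones BigQuery documents for `schemaUpdateOptions` on load jobs.

```python
from typing import Iterable, List, Optional


class SchemaUpdateOption:
    """Hypothetical constants holder, modelled loosely on BigQueryDisposition."""

    ALLOW_FIELD_ADDITION = 'ALLOW_FIELD_ADDITION'
    ALLOW_FIELD_RELAXATION = 'ALLOW_FIELD_RELAXATION'

    @staticmethod
    def validate(options: Optional[Iterable[str]]) -> List[str]:
        """Return the options as a list, or raise ValueError on a bad value."""
        if options is None:
            # A default of None stays valid: no schema update is requested.
            return []
        if isinstance(options, str):
            # Guard against a bare string being iterated character by character.
            raise ValueError(
                'schema_update_options must be a list of strings, got a '
                'single string: %r' % options)
        valid = (
            SchemaUpdateOption.ALLOW_FIELD_ADDITION,
            SchemaUpdateOption.ALLOW_FIELD_RELAXATION)
        options = list(options)
        unknown = [o for o in options if o not in valid]
        if unknown:
            raise ValueError(
                'Invalid schema update option(s) %r; expected a subset of %r'
                % (unknown, valid))
        return options
```

A transform initializer could call `SchemaUpdateOption.validate(schema_update_options)` at construction time, so a misspelled value fails before any load job is submitted.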

Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests

See CI.md for more information about GitHub Actions CI.

@asf-ci

asf-ci commented Jun 14, 2022

Can one of the admins verify this patch?

@codecov

codecov bot commented Jun 14, 2022

Codecov Report

Merging #21867 (1c6892b) into master (4b33a38) will increase coverage by 1.03%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master   #21867      +/-   ##
==========================================
+ Coverage   74.07%   75.11%   +1.03%     
==========================================
  Files         698      699       +1     
  Lines       92574    98173    +5599     
==========================================
+ Hits        68577    73739    +5162     
- Misses      22742    23179     +437     
  Partials     1255     1255              
Flag      Coverage Δ
python    84.43% <100.00%> (+0.66%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files                                          Coverage Δ
sdks/python/apache_beam/io/gcp/bigquery.py              72.15% <100.00%> (+2.07%) ⬆️
...s/python/apache_beam/io/gcp/bigquery_file_loads.py   88.76% <100.00%> (+1.06%) ⬆️
sdks/python/apache_beam/io/gcp/bigquery_tools.py        85.69% <100.00%> (+0.03%) ⬆️

... and 50 files with indirect coverage changes


@waltage waltage marked this pull request as ready for review June 15, 2022 17:12
@waltage
Author

waltage commented Jun 15, 2022

The next closest OWNERS file is from the IO module, so please excuse me if this is an incorrect reviewer choice:
R: @aaltay

@aaltay aaltay requested a review from pabloem June 15, 2022 17:31
@waltage
Author

waltage commented Jul 1, 2022

following up on the status of this review

@waltage
Author

waltage commented Jul 12, 2022

@aaltay @pabloem
I recognize this is a lower-priority issue/PR, but it's my first PR with Apache and on Beam. I'd like to move on to some higher-priority issues for you all, but this review is currently a blocker.

@aaltay
Member

aaltay commented Jul 14, 2022

R: @pabloem @johnjcasey - could one of you please review?

@waltage - could you please fix the test errors in the meantime?

@github-actions
Contributor

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control

@johnjcasey
Contributor

This looks structurally fine to me. My only question is whether we have an integration test suite that could be expanded to include a test that leverages these parameters, to verify they work as expected.

@waltage
Author

waltage commented Jul 15, 2022

> This looks structurally fine to me. My only question is whether we have an integration test suite that could be expanded to include a test that leverages these parameters, to verify they work as expected.

If you could point me in the right direction (like a particular directory or a test file), I can work on adding one.

My original thinking was that any integration test for this particular "feature" would reduce to a change-detector test, given that the implementation sets the parameter on the proto request itself (here). It is duly noted, though, that the string values themselves are not explicitly tested anywhere.

@pabloem
Member

pabloem commented Jul 15, 2022

Hi @waltage, I am so sorry for the delay on this. I prefer to avoid adding new parameters to the transform - it already has lots and lots of parameters, and it's hard to be sure which ones to fill in.

I can see it's valuable to be able to validate the parameter directly, so that's fair (although we could potentially validate it within additional_bq_parameters).

Can you tell me more about the reasoning to add it?
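
The alternative mentioned above, validating the value where it already lives, might look roughly like the following. This is only a sketch under assumptions: `check_additional_bq_parameters` is a hypothetical helper name, and it handles only the plain-dict form of `additional_bq_parameters`.

```python
from typing import Any, Dict, Optional

# The two values BigQuery documents for schemaUpdateOptions on load jobs.
VALID_SCHEMA_UPDATE_OPTIONS = frozenset(
    {'ALLOW_FIELD_ADDITION', 'ALLOW_FIELD_RELAXATION'})


def check_additional_bq_parameters(
        params: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    """Hypothetical check; returns a shallow copy of params if they are valid."""
    checked = dict(params or {})
    options = checked.get('schemaUpdateOptions')
    if options is None:
        # Nothing to validate; other keys are passed through untouched.
        return checked
    if not isinstance(options, (list, tuple)):
        raise TypeError(
            'schemaUpdateOptions must be a list or tuple of strings, got %s'
            % type(options).__name__)
    unknown = sorted(set(options) - VALID_SCHEMA_UPDATE_OPTIONS)
    if unknown:
        raise ValueError('Unknown schemaUpdateOptions: %r' % (unknown,))
    return checked
```

This keeps the initializer's surface unchanged, at the cost of the option remaining discoverable only through the dict key.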

@waltage
Author

waltage commented Jul 18, 2022

@pabloem no worries on the review timeline.

I am not the original reporter of the Jira issue, but it looks like this particular feature was requested with the objective of bringing the Python API closer in line with the Java API. Since the issue was ported over to GitHub and tagged with first issue, and since Ken in the original issue said it seemed like a good idea, I went ahead and worked on implementing it here.

I agree that this particular initializer is quite dense, regardless of whether this option is promoted to a parameter. I think there are a few ways I could go with this:

  1. (assuming I can update the other unit tests that use this transform, and provide sufficient integration coverage) Allow this option promotion in light of the already dense parameter set.
  2. Develop type and value checking for the ...bq_parameters... dict, which would allow the surface of the initializer to remain consistent.
  3. Close the issue/pr, as interest looks like it has subsided.
  4. Abstract away a majority of the transform's initializer parameters that are BigQuery-specific into a new type/value-checked Class that would allow more flexibility for changes like this (or changes to the BigQuery APIs themselves) without modifying existing surfaces of the transform itself.
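
To make option 4 more concrete, here is a rough sketch of a value-checked container. Everything about it is an assumption for illustration: the class `BigQueryLoadOptions`, its field names, and `to_load_parameters` are hypothetical and do not exist in the Beam SDK. Only the string values are taken from BigQuery's documented load job settings.

```python
from dataclasses import dataclass
from typing import Any, Dict, Tuple

_CREATE = ('CREATE_IF_NEEDED', 'CREATE_NEVER')
_WRITE = ('WRITE_APPEND', 'WRITE_TRUNCATE', 'WRITE_EMPTY')
_SCHEMA_UPDATE = ('ALLOW_FIELD_ADDITION', 'ALLOW_FIELD_RELAXATION')


@dataclass(frozen=True)
class BigQueryLoadOptions:
    """Hypothetical value object; not part of the Beam SDK."""

    create_disposition: str = 'CREATE_IF_NEEDED'
    write_disposition: str = 'WRITE_APPEND'
    schema_update_options: Tuple[str, ...] = ()

    def __post_init__(self) -> None:
        # Validate once, at construction, so the transform can trust the values.
        if self.create_disposition not in _CREATE:
            raise ValueError(
                'Invalid create disposition %r' % self.create_disposition)
        if self.write_disposition not in _WRITE:
            raise ValueError(
                'Invalid write disposition %r' % self.write_disposition)
        unknown = [
            o for o in self.schema_update_options if o not in _SCHEMA_UPDATE]
        if unknown:
            raise ValueError('Invalid schema update options %r' % (unknown,))

    def to_load_parameters(self) -> Dict[str, Any]:
        """Render the options using the camelCase keys of the load config."""
        params: Dict[str, Any] = {
            'createDisposition': self.create_disposition,
            'writeDisposition': self.write_disposition,
        }
        if self.schema_update_options:
            params['schemaUpdateOptions'] = list(self.schema_update_options)
        return params
```

With a container like this, a new BigQuery option would become a new field on the class rather than a new argument on the transform.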

On a related note, as a tie-breaker for 1 & 2 & 3, and in support of the design of 4 above, does the Beam team currently analyze any of the public usages of its Python libraries? I can look at how this transform is used "in the wild" and try to understand how often parameters are used vs. what keys of the dict are typically set (e.g., at first glance, it looks like partitioning specs are by far the most commonly used keys in the ...bq_parameters... dict).

I'm happy to work on any of the above, just let me know how best to proceed.

@pabloem
Member

pabloem commented Jul 22, 2022

Thanks for the explanation @waltage - I am going to be on vacation for a little while, so I can't advise at the moment.

@ahmedabu98 @Abacn @johnjcasey could y'all help Daniel figure out how to move forward? thanks

@johnjcasey
Contributor

I am ok with the dense parameter set. The Beam convention is to use named parameters in Python like we use builder parameters in Java. If something has a default, that default must be valid such that the user doesn't need to provide it.

I like bringing Python in line with Java, and I prefer having a named parameter to a dict for additional parameters.

Contributor

@ahmedabu98 ahmedabu98 left a comment

Thanks for working on this @waltage! Left a couple of comments.

I can't speak to the user demand for having this option as a parameter in WriteToBigQuery, but I don't see a problem with adding it as long as users can still specify these options in additional_load_parameters.

Comment on lines +534 to 537
schemaUpdateOptions=schema_update_options,
sourceFormat=source_format,
useAvroLogicalTypes=True,
autodetect=schema == 'SCHEMA_AUTODETECT',

We should confirm early in WriteToBigQuery that the schema update options are present in either schema_update_options or additional_load_parameters and resolve if options are present in both.

If options are present in additional_load_parameters at this point, we would have two schemaUpdateOptions parameters here.

We should also allow backwards compatibility. If users want to keep using additional_load_parameters instead to specify update options, would they be able to?
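
One possible way to address both review comments (resolve duplicates early, keep the dict route working) is sketched below. The helper name `resolve_schema_update_options` and the choice to raise on conflicting values are assumptions; the PR may have intended a different precedence rule.

```python
from typing import Any, Dict, List, Optional, Sequence, Tuple


def resolve_schema_update_options(
        schema_update_options: Optional[Sequence[str]] = None,
        additional_load_parameters: Optional[Dict[str, Any]] = None,
) -> Tuple[Optional[List[str]], Dict[str, Any]]:
    """Hypothetical helper: pick one source for schemaUpdateOptions.

    Returns the resolved options plus a copy of the additional parameters
    with the 'schemaUpdateOptions' key removed, so the value can be passed
    as a keyword exactly once.
    """
    remaining = dict(additional_load_parameters or {})
    legacy = remaining.pop('schemaUpdateOptions', None)

    if schema_update_options is None:
        # Backwards-compatible path: only the dict entry (if any) is used.
        resolved = list(legacy) if legacy is not None else None
    elif legacy is not None and set(legacy) != set(schema_update_options):
        raise ValueError(
            'schemaUpdateOptions given twice with different values: %r vs %r'
            % (list(schema_update_options), list(legacy)))
    else:
        resolved = list(schema_update_options)
    return resolved, remaining
```

Removing the key from the copied dict matters because, in Python, supplying the same keyword both explicitly and through `**` expansion in one call raises a `TypeError`.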

@ahmedabu98
Contributor

Agree with @johnjcasey that we need an integration test for this change, specifically to make sure that both additional_load_parameters and schema_update_options are viable. I would be happy to review that test.

@waltage we have a few BQ tests in apache_beam/io/gcp. This is an example test that uses BigQueryTableMatcher to test for options in additional_bq_parameters.

@ahmedabu98
Contributor

FYI, we would need to see which existing checks are performed against additional_parameters["schemaUpdateOptions"] and make sure the new schemaUpdateOptions argument is covered by them too, e.g. here.

@github-actions
Contributor

github-actions bot commented Mar 4, 2023

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@beam.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added stale and removed stale labels Mar 4, 2023
@github-actions
Contributor

github-actions bot commented Jun 8, 2023

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@beam.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added stale and removed stale labels Jun 8, 2023
@github-actions
Contributor

github-actions bot commented Aug 9, 2023

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@beam.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Aug 9, 2023
@github-actions
Contributor

This pull request has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Aug 17, 2023