Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[yaml] Normalize JdbcIO #28971

Merged
merged 3 commits into from
Oct 26, 2023
Merged

[yaml] Normalize JdbcIO #28971

merged 3 commits into from
Oct 26, 2023

Conversation

Polber
Copy link
Contributor

@Polber Polber commented Oct 12, 2023

This PR normalizes JdbcIO to work with the YAML framework. Tested with BigQueryIO read/write and built-in YAML transforms.


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests
Go tests

See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.

@Polber
Copy link
Contributor Author

Polber commented Oct 12, 2023

R: @robertwb

@github-actions
Copy link
Contributor

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control

@codecov
Copy link

codecov bot commented Oct 12, 2023

Codecov Report

Merging #28971 (e12a3ce) into master (d6b3467) will increase coverage by 0.00%.
The diff coverage is n/a.

❗ Current head e12a3ce differs from pull request most recent head d5ac56c. Consider uploading reports for the commit d5ac56c to get more accurate results

@@           Coverage Diff           @@
##           master   #28971   +/-   ##
=======================================
  Coverage   38.38%   38.38%           
=======================================
  Files         686      686           
  Lines      101665   101653   -12     
=======================================
- Hits        39021    39018    -3     
+ Misses      61064    61055    -9     
  Partials     1580     1580           
Flag Coverage Δ
python 30.00% <ø> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

see 9 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

driver_class_name: 'driverClassName'
jdbc_url: 'jdbcUrl'
username: 'username'
password: 'password'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should seriously think about if there's a better way to store these than in plain text...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's how the transform works today, but I can add a FR for KMS/Secret Manager support. Unless you think it's not worth offering without that support?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to a FR at least.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could also tie in with the ability to templatize things too.

sdks/python/apache_beam/yaml/standard_io.yaml Outdated Show resolved Hide resolved
sdks/python/apache_beam/yaml/standard_io.yaml Outdated Show resolved Hide resolved
username: 'username'
password: 'password'
table_name: 'location'
write_statement: 'writeStatement'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this mean? Is it needed for the common case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also debated this one...

The transform will loop over the Row fields and append them the a pre-written INSERT statement, i.e.
INSERT INTO {table} VALUES ({field1}?, {field2}?, {fieldN}?) which means that the function that feeds into it HAS to ensure the order is correct. Allowing write_statement would allow the user to supply a custom INSERT statement with named fields so they can supply the row with the columns in any order.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we instead construct the INSERT statement from the Row schema itself so it always has the "right" order?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is the SchemaTransform does not know the table schema, so we can't set the order of the values in the INSERT statement based off the table schema. We could theoretically do a INSERT INTO {table} (?, ?, ?) VALUES (?, ?, ?), but the current BeamRowPreparedStatementSetter does not support specifying columns (It just loops of the field values and replaces the ?'s iteratively), so it would require a major rewrite that would unlikely be completed by 2.52

I think that this might be worth another FR, but not blocking this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I'd rather not go out with the write_statement option 'cause it'd be hard to remove later. The right thing to do is clear, but if that's not possible for 2.52 we either declare this transform usable without an explicit write_statement or we defer adding it 'till 2.53. (Looking at this, BeamRowPreparedStatementSetter seems to always be instantiated at a place where we know the schema, so it shouldn't be too hard to fix.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, looking into this a bit more, we're constructing the default query right here: https://github.com/apache/beam/blob/e78a01a58f7b1e5894872217b2474fc84dea9956/sdks/java/io/jdbc/src/main/java/org/apache/beam/sdk/io/jdbc/JdbcWriteSchemaTransformProvider.java#L104

Can't we use the names of the schema to construct a query

INSERT INTO {table} (field1, field2, field3, ...) VALUES ...

Yes, this would mean that names in your schema would have to agree with the names of your table, but this seems to be very much the right (and safe) thing to do.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this possibly introduce a security risk exposing the column names directly in the query? I know we use the prepared statement for the values to protect against SQL injection since the values could possibly come from an unverified source. I think this is prevented for the column names since they are set in the row schema when the pipeline is written, so it should be safe, but do you think there could be a way to break this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went ahead and updated it just in case

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Yes, only the pipeline author can control the column names themselves.

sdks/python/apache_beam/yaml/standard_io.yaml Outdated Show resolved Hide resolved
sdks/python/apache_beam/yaml/yaml_mapping.py Outdated Show resolved Hide resolved
Signed-off-by: Jeffrey Kinard <[email protected]>
Signed-off-by: Jeffrey Kinard <[email protected]>
Signed-off-by: Jeffrey Kinard <[email protected]>
driver_class_name: 'driverClassName'
jdbc_url: 'jdbcUrl'
username: 'username'
password: 'password'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could also tie in with the ability to templatize things too.

username: 'username'
password: 'password'
table_name: 'location'
write_statement: 'writeStatement'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Yes, only the pipeline author can control the column names themselves.

@robertwb robertwb merged commit 16d68c1 into apache:master Oct 26, 2023
86 checks passed
@Polber Polber mentioned this pull request Nov 15, 2023
16 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants