[yaml] Normalize JdbcIO #28971

Polber · 2023-10-12T22:30:59Z

This PR normalizes JdbcIO to work with the YAML framework. Tested with BigQueryIO read/write and built-in YAML transforms.

Polber · 2023-10-12T22:31:14Z

R: @robertwb

github-actions · 2023-10-12T22:32:48Z

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control

codecov · 2023-10-12T22:54:20Z

Codecov Report

Merging #28971 (e12a3ce) into master (d6b3467) will increase coverage by 0.00%.
The diff coverage is n/a.

❗ Current head e12a3ce differs from pull request most recent head d5ac56c. Consider uploading reports for the commit d5ac56c to get more accurate results

@@           Coverage Diff           @@
##           master   #28971   +/-   ##
=======================================
  Coverage   38.38%   38.38%           
=======================================
  Files         686      686           
  Lines      101665   101653   -12     
=======================================
- Hits        39021    39018    -3     
+ Misses      61064    61055    -9     
  Partials     1580     1580

Flag	Coverage Δ
python	`30.00% <ø> (+<0.01%)`	⬆️

see 9 files with indirect coverage changes

robertwb · 2023-10-18T20:17:09Z

sdks/python/apache_beam/yaml/standard_io.yaml

+        driver_class_name: 'driverClassName'
+        jdbc_url: 'jdbcUrl'
+        username: 'username'
+        password: 'password'


We should seriously think about whether there's a better way to store these than in plain text...

That's how the transform works today, but I can add an FR for KMS/Secret Manager support. Unless you think it's not worth offering without that support?

+1 to an FR at least.

This could also tie in with the ability to templatize things too.
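
For illustration only, here is a minimal sketch of what templatizing the credentials could look like, so that they are not written into the pipeline file itself. It is not a feature of this PR or of Beam YAML: `render_config`, the `$JDBC_USERNAME` / `$JDBC_PASSWORD` placeholders, and the example driver and URL values are all assumptions made for the example. Only the four config keys come from the mapping above.

```python
import os
from string import Template


def render_config(text, env=None):
    """Fill $NAME placeholders in a config string from a mapping.

    Hypothetical helper, not part of Beam. Defaults to the process
    environment and raises KeyError if a placeholder has no value.
    """
    env = os.environ if env is None else env
    return Template(text).substitute(env)


# The keys match the normalized names in standard_io.yaml; the values are made up.
config_text = """\
driver_class_name: org.postgresql.Driver
jdbc_url: jdbc:postgresql://localhost:5432/example
username: $JDBC_USERNAME
password: $JDBC_PASSWORD
"""

# An explicit mapping is passed here so the example does not depend on the shell.
rendered = render_config(
    config_text, {"JDBC_USERNAME": "beam", "JDBC_PASSWORD": "s3cret"})
print(rendered)
```

This only moves the secret out of the file; a KMS or Secret Manager integration, as suggested in the FR, would still be the stronger option.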

robertwb · 2023-10-18T20:20:13Z

sdks/python/apache_beam/yaml/standard_io.yaml

+        username: 'username'
+        password: 'password'
+        table_name: 'location'
+        write_statement: 'writeStatement'


What does this mean? Is it needed for the common case?

I also debated this one...

The transform loops over the Row fields and appends them to a pre-written INSERT statement, i.e.
INSERT INTO {table} VALUES ({field1}?, {field2}?, {fieldN}?)
which means that whatever feeds into it HAS to ensure the field order is correct. Allowing write_statement would let the user supply a custom INSERT statement with named fields, so they can supply the row with the columns in any order.
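
The ordering hazard described here can be reproduced outside Beam. The sketch below uses Python's built-in sqlite3 module as a stand-in for a JDBC sink. The `users` table and the `positional_insert` helper are made up for illustration and are not part of JdbcIO; a dict stands in for a Beam Row.

```python
import sqlite3


def positional_insert(conn, table, row):
    """Mimic a statement of the form INSERT INTO {table} VALUES (?, ?, ...).

    Values are bound in the order the fields appear in `row`; the
    statement itself carries no column names.
    """
    placeholders = ", ".join("?" for _ in row)
    conn.execute(
        f"INSERT INTO {table} VALUES ({placeholders})", list(row.values()))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, city TEXT)")

# Fields in table order land in the intended columns.
positional_insert(conn, "users", {"name": "Ada", "city": "London"})
# Same kind of data, fields in a different order: the values are silently swapped.
positional_insert(conn, "users", {"city": "Paris", "name": "Grace"})

rows = conn.execute("SELECT name, city FROM users ORDER BY rowid").fetchall()
print(rows)  # the second row has 'Paris' stored in the name column
```

No error is raised for the second row because both columns are text, which is what makes the mistake easy to miss.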

Can we instead construct the INSERT statement from the Row schema itself so it always has the "right" order?

The problem is that the SchemaTransform does not know the table schema, so we can't set the order of the values in the INSERT statement based on the table schema. We could theoretically do an INSERT INTO {table} (?, ?, ?) VALUES (?, ?, ?), but the current BeamRowPreparedStatementSetter does not support specifying columns (it just loops over the field values and replaces the ?'s iteratively), so it would require a major rewrite that is unlikely to be completed by 2.52.

I think that this might be worth another FR, but not blocking this PR.

So I'd rather not go out with the write_statement option 'cause it'd be hard to remove later. The right thing to do is clear, but if that's not possible for 2.52 we either declare this transform usable without an explicit write_statement or we defer adding it 'till 2.53. (Looking at this, BeamRowPreparedStatementSetter seems to always be instantiated at a place where we know the schema, so it shouldn't be too hard to fix.)

Actually, looking into this a bit more, we're constructing the default query right here: https://github.com/apache/beam/blob/e78a01a58f7b1e5894872217b2474fc84dea9956/sdks/java/io/jdbc/src/main/java/org/apache/beam/sdk/io/jdbc/JdbcWriteSchemaTransformProvider.java#L104

Can't we use the field names from the schema to construct a query like this?

INSERT INTO {table} (field1, field2, field3, ...) VALUES ...

Yes, this would mean that names in your schema would have to agree with the names of your table, but this seems to be very much the right (and safe) thing to do.
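
As a rough sketch of this suggestion (in Python rather than the Java of JdbcWriteSchemaTransformProvider, and not the code that was merged), the statement can be built from the row's own field names so that each value stays paired with its column. `build_insert` and `named_insert` are hypothetical helpers, sqlite3 stands in for a JDBC database, and a dict stands in for a Beam Row.

```python
import sqlite3


def build_insert(table, field_names):
    """Build INSERT INTO {table} (f1, f2, ...) VALUES (?, ?, ...)."""
    columns = ", ".join(field_names)
    placeholders = ", ".join("?" for _ in field_names)
    return f"INSERT INTO {table} ({columns}) VALUES ({placeholders})"


def named_insert(conn, table, row):
    # Column names and bound values both come from the row, in the same order.
    conn.execute(build_insert(table, list(row)), list(row.values()))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, city TEXT)")

named_insert(conn, "users", {"name": "Ada", "city": "London"})
# Field order differs from the table, but the column list keeps values aligned.
named_insert(conn, "users", {"city": "Paris", "name": "Grace"})

rows = conn.execute("SELECT name, city FROM users ORDER BY rowid").fetchall()
statement = build_insert("users", ["city", "name"])
print(statement)
print(rows)
```

The trade-off is the one stated in the comment: the field names in the schema must match the column names in the table.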

Would this possibly introduce a security risk exposing the column names directly in the query? I know we use the prepared statement for the values to protect against SQL injection since the values could possibly come from an unverified source. I think this is prevented for the column names since they are set in the row schema when the pipeline is written, so it should be safe, but do you think there could be a way to break this?

I went ahead and updated it just in case

Thanks. Yes, only the pipeline author can control the column names themselves.
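
Identifiers such as table and column names cannot be bound through ? placeholders, so they have to be interpolated into the statement text. The thread concludes this is safe here because the pipeline author controls the names. If they ever came from a less trusted source, one defensive option would be to validate them first. The sketch below is an assumption for illustration, not something this PR does: `checked_identifier` and its deliberately strict pattern are hypothetical.

```python
import re

# Strict on purpose: letters, digits and underscores, not starting with a digit.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def checked_identifier(name):
    """Return `name` unchanged if it looks like a plain SQL identifier."""
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"unsafe SQL identifier: {name!r}")
    return name


def build_insert(table, field_names):
    """Build a named-column INSERT, validating every interpolated identifier."""
    columns = ", ".join(checked_identifier(f) for f in field_names)
    placeholders = ", ".join("?" for _ in field_names)
    return (
        f"INSERT INTO {checked_identifier(table)} "
        f"({columns}) VALUES ({placeholders})")


print(build_insert("users", ["name", "city"]))

try:
    build_insert("users", ["name) VALUES ('x'); DROP TABLE users; --"])
except ValueError as err:
    print("rejected:", err)
```

Values still go through the prepared statement as before; only the identifiers need this kind of check.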

github-actions bot added python java io jdbc yaml labels Oct 12, 2023

Polber force-pushed the jkinard/jdbc-io branch from 7660242 to cf7cbc2 Compare October 17, 2023 22:02

robertwb reviewed Oct 18, 2023

Polber force-pushed the jkinard/jdbc-io branch from e12a3ce to e78a01a Compare October 23, 2023 21:52

Polber added 3 commits October 25, 2023 13:35

68a0620 [yaml] Normalize JdbcIO
6b21c4c address initial comments
d5ac56c add fieldnames to jdbc insert

Each commit is signed off: Signed-off-by: Jeffrey Kinard <[email protected]>

Polber force-pushed the jkinard/jdbc-io branch from e78a01a to d5ac56c Compare October 25, 2023 17:55

robertwb approved these changes Oct 26, 2023

robertwb merged commit 16d68c1 into apache:master Oct 26, 2023
86 checks passed

Polber mentioned this pull request Nov 15, 2023: [yaml] Normalize JdbcIO #28682 (Closed, 16 tasks)
