-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[yaml] Normalize JdbcIO #28971
[yaml] Normalize JdbcIO #28971
Conversation
R: @robertwb |
Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control |
Codecov Report
@@ Coverage Diff @@
## master #28971 +/- ##
=======================================
Coverage 38.38% 38.38%
=======================================
Files 686 686
Lines 101665 101653 -12
=======================================
- Hits 39021 39018 -3
+ Misses 61064 61055 -9
Partials 1580 1580
Flags with carried forward coverage won't be shown. Click here to find out more. see 9 files with indirect coverage changes 📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
7660242
to
cf7cbc2
Compare
driver_class_name: 'driverClassName' | ||
jdbc_url: 'jdbcUrl' | ||
username: 'username' | ||
password: 'password' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should seriously think about if there's a better way to store these than in plain text...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's how the transform works today, but I can add a FR for KMS/Secret Manager support. Unless you think it's not worth offering without that support?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to a FR at least.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could also tie in with the ability to templatize things too.
username: 'username' | ||
password: 'password' | ||
table_name: 'location' | ||
write_statement: 'writeStatement' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this mean? Is it needed for the common case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also debated this one...
The transform will loop over the Row
fields and append them the a pre-written INSERT
statement, i.e.
INSERT INTO {table} VALUES ({field1}?, {field2}?, {fieldN}?)
which means that the function that feeds into it HAS to ensure the order is correct. Allowing write_statement
would allow the user to supply a custom INSERT
statement with named fields so they can supply the row with the columns in any order.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we instead construct the INSERT statement from the Row schema itself so it always has the "right" order?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is the SchemaTransform does not know the table schema, so we can't set the order of the values in the INSERT statement based off the table schema. We could theoretically do a INSERT INTO {table} (?, ?, ?) VALUES (?, ?, ?)
, but the current BeamRowPreparedStatementSetter
does not support specifying columns (It just loops of the field values and replaces the ?'s iteratively), so it would require a major rewrite that would unlikely be completed by 2.52
I think that this might be worth another FR, but not blocking this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I'd rather not go out with the write_statement
option 'cause it'd be hard to remove later. The right thing to do is clear, but if that's not possible for 2.52 we either declare this transform usable without an explicit write_statement or we defer adding it 'till 2.53. (Looking at this, BeamRowPreparedStatementSetter
seems to always be instantiated at a place where we know the schema, so it shouldn't be too hard to fix.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, looking into this a bit more, we're constructing the default query right here: https://github.com/apache/beam/blob/e78a01a58f7b1e5894872217b2474fc84dea9956/sdks/java/io/jdbc/src/main/java/org/apache/beam/sdk/io/jdbc/JdbcWriteSchemaTransformProvider.java#L104
Can't we use the names of the schema to construct a query
INSERT INTO {table} (field1, field2, field3, ...) VALUES ...
Yes, this would mean that names in your schema would have to agree with the names of your table, but this seems to be very much the right (and safe) thing to do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would this possibly introduce a security risk exposing the column names directly in the query? I know we use the prepared statement for the values to protect against SQL injection since the values could possibly come from an unverified source. I think this is prevented for the column names since they are set in the row schema when the pipeline is written, so it should be safe, but do you think there could be a way to break this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went ahead and updated it just in case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Yes, only the pipeline author can control the column names themselves.
e12a3ce
to
e78a01a
Compare
Signed-off-by: Jeffrey Kinard <[email protected]>
Signed-off-by: Jeffrey Kinard <[email protected]>
Signed-off-by: Jeffrey Kinard <[email protected]>
e78a01a
to
d5ac56c
Compare
driver_class_name: 'driverClassName' | ||
jdbc_url: 'jdbcUrl' | ||
username: 'username' | ||
password: 'password' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could also tie in with the ability to templatize things too.
username: 'username' | ||
password: 'password' | ||
table_name: 'location' | ||
write_statement: 'writeStatement' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Yes, only the pipeline author can control the column names themselves.
This PR normalizes JdbcIO to work with the YAML framework. Tested with BigQueryIO read/write and built-in YAML transforms.
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
addresses #123
), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, commentfixes #<ISSUE NUMBER>
instead.CHANGES.md
with noteworthy changes.See the Contributor Guide for more tips on how to make review process smoother.
To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.