add a new IO named DataLakeIO (#23074) #23075
Conversation
Assigning reviewers. If you would like to opt out of this review, comment R: @kileys for label java. Available commands:
The PR bot will only process comments in the main thread (not review comments).
Codecov Report
@@ Coverage Diff @@
## master #23075 +/- ##
=======================================
Coverage 73.58% 73.58%
=======================================
Files 716 716
Lines 95301 95301
=======================================
+ Hits 70124 70125 +1
+ Misses 23881 23880 -1
Partials 1296 1296
As a general comment, it may also be worth looking at the in-progress SparkReceiverIO to see if a more generic spark connection can be used.
private static class ReadFn<ParameterT, OutputT> extends DoFn<ParameterT, OutputT> {
Currently, we are trying to have all sources implemented using the SplittableDoFn pattern to enable scalability, and are doing our best to not include new sources that are not implemented as SDFs. Can this be re-implemented as an SDF instead?
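For context on the suggestion above: the core idea behind SplittableDoFn is that each element carries a "restriction" describing its work, which the runner can split dynamically at runtime to scale out. The following is a simplified, self-contained sketch of that restriction-splitting idea in plain Java. It deliberately does not use Beam's actual `OffsetRange`/`RestrictionTracker` classes; it only illustrates the mechanism a real SDF-based rewrite would rely on.

```java
// Simplified illustration of the SplittableDoFn restriction idea:
// a restriction describes a range of work, and a runner may split it
// so the two halves can be processed by different workers in parallel.
final class OffsetRange {
    final long from; // inclusive start of the remaining work
    final long to;   // exclusive end of the remaining work

    OffsetRange(long from, long to) {
        this.from = from;
        this.to = to;
    }

    long size() {
        return to - from;
    }

    // Split this restriction at the given fraction of remaining work,
    // as a runner would when rebalancing. Returns the original range
    // unchanged when it is too small to split.
    OffsetRange[] splitAt(double fraction) {
        long splitPoint = from + (long) (size() * fraction);
        if (splitPoint <= from || splitPoint >= to) {
            return new OffsetRange[] { this };
        }
        return new OffsetRange[] {
            new OffsetRange(from, splitPoint),
            new OffsetRange(splitPoint, to)
        };
    }
}
```

In a real Beam SDF, the `@ProcessElement` method would claim positions from a tracker over such a restriction, so the runner can checkpoint and split mid-bundle instead of processing each element monolithically as the `ReadFn` above does.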
Thanks, I'll learn SplittableDoFn, then re-implement it as an SDF instead.
Can you re-work these dependencies to match the pattern used for other IOs? See io/google-cloud-platform/build.gradle for an example.
New dependencies themselves are declared in buildSrc/src/main/groovy/org/apache/beam/gradle/BeamModulePlugin.groovy
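For illustration, the pattern used by other IOs looks roughly like the following. This is a hedged sketch: the `library.java.spark_*` keys shown here are hypothetical, and the actual map key names in BeamModulePlugin.groovy may differ.

```groovy
// In buildSrc/src/main/groovy/org/apache/beam/gradle/BeamModulePlugin.groovy,
// versions are centralized once (illustrative; key names not confirmed):
//   def spark_version = "3.1.2"
//   spark_sql: "org.apache.spark:spark-sql_2.12:$spark_version",

// The IO module's build.gradle then references those shared definitions
// instead of hard-coding Maven coordinates:
dependencies {
    implementation library.java.spark_sql
    implementation library.java.spark_core
    implementation library.java.spark_streaming
}
```

Centralizing versions this way keeps all Beam modules on the same dependency versions and makes upgrades a one-line change.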
Okay, I'll modify it according to this.
Reminder, please take a look at this PR: @kileys @Abacn @johnjcasey
waiting on author
Assigning new set of reviewers because the PR has gone too long without review. If you would like to opt out of this review, comment R: @robertwb for label java. Available commands:
implementation "org.apache.spark:spark-sql_2.12:3.1.2"
implementation "org.apache.spark:spark-core_2.12:3.1.2"
implementation "org.apache.spark:spark-streaming_2.12:3.1.2"
This IO would be really neat and I understand the motivation of using Spark underneath.
Nevertheless, the spark dependency is rather problematic and I'm very concerned about the consequences ...
There's also a Spark runner, which supports both Spark 2.4 and Spark >= 3.1. This IO would certainly conflict with the Spark 2.4 runner. The Spark 3 runner is built in a way that it supports various versions of Spark 3 (the path from 3.1 to 3.3 is full of breaking changes), and Spark dependencies are typically provided (as available on the cluster). Even further, Spark comes with a massive tail of dependencies prone to causing conflicts with versions used in Beam.
The one common candidate to mention here is Avro. Spark 3.1 is still using Avro 1.8 matching Beam's version, Spark 3.2 bumps Avro to 1.10 which is incompatible with Beam :/ This kinda exemplifies the maintenance headache ahead.
Have you evaluated any alternative to using Spark underneath?
Your advice is great! I'll consider it, and then think about other alternatives.
First of all - thanks for your contribution!
Before proceeding to review from my side, I'd like to know if there is a design doc or similar for this IO connector? It would be very helpful to understand the goals and the implementation of this connector in advance.
Also, several notes that are worth mentioning:
- Please, create a new GitHub issue for this feature.
- Please, avoid merging a master branch into your feature branch. Use git rebase instead.
- Run ./gradlew :sdks:java:io:datalake:check locally before pushing your changes to origin.
You can find a Beam contribution guide here:
https://beam.apache.org/contribute/get-started-contributing/
Thank you for your reply! I will make my changes, and create a new GitHub issue later.
Was there any progress on getting this IO into Beam?
We developed a new IO named DataLakeIO, which supports Beam in reading data from data lakes (Delta, Iceberg, Hudi) and writing data to data lakes (Delta, Iceberg, Hudi).
Because Delta, Iceberg, and Hudi do not provide a sufficient Java API for reading and writing, we use the Spark DataSource API to read and write data in DataLakeIO. Therefore, the Spark dependencies are needed.
BeamDeltaTest, BeamIcebergTest, and BeamHudiTest show how to use the above features.
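Based on the description above, a read through the proposed connector presumably looks something like the following. This is a hypothetical sketch only: the builder method names (`withFormat`, `withPath`) are illustrative and not confirmed by the PR, and internally the connector would delegate to the Spark DataSource API, e.g. `spark.read().format("delta").load(path)`.

```java
// Hypothetical usage sketch of the proposed DataLakeIO read path.
// Method names are illustrative, not the connector's confirmed API.
Pipeline pipeline = Pipeline.create(options);

pipeline.apply("ReadDeltaTable",
    DataLakeIO.read()
        .withFormat("delta")            // or "iceberg" / "hudi"
        .withPath("/path/to/table"));   // table location on storage

pipeline.run().waitUntilFinish();
```

This is the shape of API the review comments above are reacting to: the read step would pull in Spark (and its dependency tail) as a library inside the pipeline, independent of which Beam runner executes it.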