add a new IO named DataLakeIO (#23074) #23075
Conversation
Assigning reviewers. If you would like to opt out of this review, comment R: @kileys for label java. Available commands:
The PR bot will only process comments in the main thread (not review comments).
Codecov Report
@@ Coverage Diff @@
## master #23075 +/- ##
=======================================
Coverage 73.58% 73.58%
=======================================
Files 716 716
Lines 95301 95301
=======================================
+ Hits 70124 70125 +1
+ Misses 23881 23880 -1
Partials 1296 1296
As a general comment, it may also be worth looking at the in-progress SparkReceiverIO to see if a more generic spark connection can be used.
private static class ReadFn<ParameterT, OutputT> extends DoFn<ParameterT, OutputT> {
Currently, we are trying to have all sources implemented using the SplittableDoFn pattern to enable scalability, and are doing our best to not include new sources that are not implemented as SDFs. Can this be re-implemented as an SDF instead?
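For context on the suggestion above: the core idea behind SplittableDoFn is that each element carries a "restriction" describing its work, which the runner can split dynamically at runtime to scale out. The following is a simplified, self-contained sketch of that restriction-splitting idea in plain Java. It deliberately does not use Beam's actual `OffsetRange`/`RestrictionTracker` classes; it only illustrates the mechanism a real SDF-based rewrite would rely on.

```java
// Simplified illustration of the SplittableDoFn restriction idea:
// a restriction describes a range of work, and a runner may split it
// so the two halves can be processed by different workers in parallel.
final class OffsetRange {
    final long from; // inclusive start of the remaining work
    final long to;   // exclusive end of the remaining work

    OffsetRange(long from, long to) {
        this.from = from;
        this.to = to;
    }

    long size() {
        return to - from;
    }

    // Split this restriction at the given fraction of remaining work,
    // as a runner would when rebalancing. Returns the original range
    // unchanged when it is too small to split.
    OffsetRange[] splitAt(double fraction) {
        long splitPoint = from + (long) (size() * fraction);
        if (splitPoint <= from || splitPoint >= to) {
            return new OffsetRange[] { this };
        }
        return new OffsetRange[] {
            new OffsetRange(from, splitPoint),
            new OffsetRange(splitPoint, to)
        };
    }
}
```

In a real Beam SDF, the `@ProcessElement` method would claim positions from a tracker over such a restriction, so the runner can checkpoint and split mid-bundle instead of processing each element monolithically as the `ReadFn` above does.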
Thanks, I'll learn SplittableDoFn, then re-implement it as an SDF instead.
Can you re-work these dependencies to match the pattern used for other IOs? See io/google-cloud-platform/build.gradle for an example.
New dependencies themselves are declared in buildSrc/src/main/groovy/org/apache/beam/gradle/BeamModulePlugin.groovy
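For illustration, the pattern used by other IOs looks roughly like the following. This is a hedged sketch: the `library.java.spark_*` keys shown here are hypothetical, and the actual map key names in BeamModulePlugin.groovy may differ.

```groovy
// In buildSrc/src/main/groovy/org/apache/beam/gradle/BeamModulePlugin.groovy,
// versions are centralized once (illustrative; key names not confirmed):
//   def spark_version = "3.1.2"
//   spark_sql: "org.apache.spark:spark-sql_2.12:$spark_version",

// The IO module's build.gradle then references those shared definitions
// instead of hard-coding Maven coordinates:
dependencies {
    implementation library.java.spark_sql
    implementation library.java.spark_core
    implementation library.java.spark_streaming
}
```

Centralizing versions this way keeps all Beam modules on the same dependency versions and makes upgrades a one-line change.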
Okay, I'll modify it according to this.
Reminder, please take a look at this PR: @kileys @Abacn @johnjcasey
waiting on author
Assigning new set of reviewers because the PR has gone too long without review. If you would like to opt out of this review, comment R: @robertwb for label java. Available commands:
implementation "org.apache.spark:spark-sql_2.12:3.1.2"
implementation "org.apache.spark:spark-core_2.12:3.1.2"
implementation "org.apache.spark:spark-streaming_2.12:3.1.2"
This IO would be really neat and I understand the motivation of using Spark underneath.
Nevertheless, the spark dependency is rather problematic and I'm very concerned about the consequences ...
There's also a Spark runner, which supports both Spark 2.4 and Spark >= 3.1. This IO would certainly conflict with the Spark 2.4 runner. The Spark 3 runner is built in a way that it supports various versions of Spark 3 (the path from 3.1 to 3.3 is full of breaking changes), and Spark dependencies are typically provided (as available on the cluster). Even further, Spark comes with a massive tail of dependencies prone to causing conflicts with versions used in Beam.
The one common candidate to mention here is Avro. Spark 3.1 is still using Avro 1.8 matching Beam's version, Spark 3.2 bumps Avro to 1.10 which is incompatible with Beam :/ This kinda exemplifies the maintenance headache ahead.
Have you evaluated any alternative to using Spark underneath?
Your advice is great! I'll consider it, and then think about other alternatives.
First of all - thanks for your contribution!
Before proceeding to review from my side, I'd like to know if there is a design doc or similar for this IO connector? It would be very helpful to understand the goals and the implementation of this connector in advance.
Also, several notes that are worth mentioning:
- Please, create a new GitHub issue for this feature.
- Please, avoid merging a master branch into your feature branch. Use git rebase instead.
- Run ./gradlew :sdks:java:io:datalake:check locally before pushing your changes to origin.
You can find a Beam contribution guide here:
https://beam.apache.org/contribute/get-started-contributing/
Thank you for your reply! I will make my changes, and create a new GitHub issue later.
Was there any progress on getting this IO into Beam?
We developed a new IO named DataLakeIO, which supports Beam in reading data from data lakes (Delta, Iceberg, Hudi) and writing data to data lakes (Delta, Iceberg, Hudi).
Because Delta, Iceberg, and Hudi do not provide a sufficient Java API for reading and writing, we use the Spark DataSource API to read and write data in DataLakeIO. Therefore, the Spark dependencies are needed.
BeamDeltaTest, BeamIcebergTest, and BeamHudiTest show how to use the above features.
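Based on the description above, a read through the proposed connector presumably looks something like the following. This is a hypothetical sketch only: the builder method names (`withFormat`, `withPath`) are illustrative and not confirmed by the PR, and internally the connector would delegate to the Spark DataSource API, e.g. `spark.read().format("delta").load(path)`.

```java
// Hypothetical usage sketch of the proposed DataLakeIO read path.
// Method names are illustrative, not the connector's confirmed API.
Pipeline pipeline = Pipeline.create(options);

pipeline.apply("ReadDeltaTable",
    DataLakeIO.read()
        .withFormat("delta")            // or "iceberg" / "hudi"
        .withPath("/path/to/table"));   // table location on storage

pipeline.run().waitUntilFinish();
```

This is the shape of API the review comments above are reacting to: the read step would pull in Spark (and its dependency tail) as a library inside the pipeline, independent of which Beam runner executes it.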