Migrate to Spark 3's DataSource v2 interfaces #4

mojodna · 2024-04-03T11:22:20Z

This change lets the library run against Spark 3.x (tested against 3.3.x and 3.5.x, using assemblies built from this branch). Before updating the interfaces, I refreshed the dependencies (including migrating from Scala 2.11 to 2.12) and addressed deprecations (except for sbt's changes to how integration tests are defined).
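For anyone reproducing the build against a different Spark 3.x release, the relevant sbt settings might look roughly like the fragment below. This is a sketch under assumptions, not a copy of the build definition in this branch: the `spark.version` system property, the `sparkVersion` value, and the exact version numbers are illustrative.

```scala
// build.sbt (illustrative fragment; versions and the property name are assumptions)

// Spark 3.x is published for Scala 2.12, so the build targets 2.12.
ThisBuild / scalaVersion := "2.12.18"

// Hypothetical override hook: sbt -Dspark.version=3.3.4 assembly
val sparkVersion = sys.props.getOrElse("spark.version", "3.5.1")

// "Provided" keeps Spark itself out of the assembly jar, since the
// cluster supplies it at runtime.
libraryDependencies += "org.apache.spark" %% "spark-sql" % sparkVersion % Provided

// project/plugins.sbt (separate file) would enable the assembly task:
// addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "2.1.5")
```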

useLocal works in a cluster environment, but I'm unable to read PBFs larger than the included sample.pbf (e.g., Geofabrik extracts) from S3 directly: it begins to read the remote object, but the resulting DataFrame contains 0 rows. I suspect that this is due to how parallelpbf handles Hadoop's FSDataInputStream, potentially across multiple threads, but I've been unable to locate the source of the problem.
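One way to narrow down the suspicion above is to take the remote stream out of the picture: copy the object to local disk first, then parse the local copy. The sketch below assumes the caller already has an `InputStream` (for example, the `FSDataInputStream` returned by Hadoop's `FileSystem.open`). `LocalStaging` and `stageLocally` are hypothetical names; they are not part of this library or of parallelpbf.

```scala
import java.io.{ByteArrayInputStream, InputStream}
import java.nio.file.{Files, Path, StandardCopyOption}

object LocalStaging {
  /** Copies everything from `in` into a new temporary file and returns its path.
    * The stream is read sequentially on the calling thread and closed afterwards,
    * so the parser only ever sees a plain local file.
    */
  def stageLocally(in: InputStream,
                   prefix: String = "staged-",
                   suffix: String = ".pbf"): Path = {
    val target = Files.createTempFile(prefix, suffix)
    try {
      // createTempFile already created the file, so overwrite it.
      Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING)
      target
    } finally {
      in.close()
    }
  }
}
```

If a large extract parses correctly from the staged copy but still yields 0 rows when read directly, that points at the stream handling rather than at the file contents.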

I'd like to share this as-is to accelerate any potential efforts by others while recognizing that there is still work to be done.

Fixes #3.

Fixes woltapp#3

mojodna added 6 commits April 3, 2024 13:10

Build against Scala 2.12

70b9ad7

Enable deprecation warnings

3d71ae0

Upgrade non-Spark dependencies

3ce9f43

Upgrade sbt

0d123c5

Add assembly support

a16f07d

Migrate to Spark 3 DataSource v2 interfaces

7b75ff3

Fixes woltapp#3
