-
Notifications
You must be signed in to change notification settings - Fork 28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Should Geni have support for Delta Lake or should that be a separate library? #314
Comments
Hi @erp12, thank you for the issue! Sorry for the delayed reply, I only just realised that you wrote this issue. It must have fallen between the cracks! I'm not super familiar with Delta Lake, but it honestly sounds like a really good idea and slots it nicely with the general Spark API, so it makes sense to support it as an optional dependency! As for I also wonder if it's possible to add an |
I have finished a branch that provides support of Delta. Unfortunately, Delta 0.8 is incompatible with Spark 3.1.x which prevents me from testing some of the core functionality without downgrading geni to spark 3.0.2. I'm not sure what the best way forward is. We could revisit this after the next Delta release, or we could disable the incompatible functionality and leave a |
Hmm, honestly I'm not so sure myself if Geni should stay toe-to-toe with the latest version of Spark. I wonder if we should keep tabs on the latest, most compatible version of Spark with the ecosystem. @behrica brought this up as well previously with potential incompatibility with Spark NLP before they supported Spark 3. So, I'm not against downgrading to Spark 3.0.2 if that means unlocking awesome other features from the ecosystem. I'd be very interested in what you and other Geni users think about this! |
Very recently I started looking at switching our parquet lake to delta, so I appreciate this change very much. Maybe it makes sense to have the Spark version the highest possible that works with all of the features that Geni provides (I guess mainly Spark NLP and Delta then?!) ? Not too relevant for Geni but AWS EMR currently ships with max Spark 3.0.1 and since most of my other coworkers use pyspark we mainly develop and test against that. |
There are definitely trade-offs, and I'm not confident enough in any particular approach to make a concrete pitch. In my opinion, the appropriate strategy is different depending on the scope of Geni. What problems is Geni trying to solve vs not trying to solve? I will share my interpretation based on my reading of the design goals, the "why" page, and the Scicloj interview with @anthony-khong. Geni is:
Geni is not:
Feel free to correct me on this interpretation! Assuming I am correct that the primary goal of Geni is to do data science, I would say that Geni does not need to stay in sync with Spark, as long as it is built against a Spark version that enables its data science features. Furthermore, it might make sense to keep the Spark version as stable as possible so that data scientists can safely upgrade Geni without needing to upgrade their infrastructure. Alternatively, if the primary goal of Geni is simply providing "Spark for Clojure" then it is important to be agnostic to the specific Spark version. At a minimum, the Spark versions that have a decent market share should be supported. Stepping back from the question of Spark versions, I think there is a related topic that could use some more clarity as the project matures. Data science is a broad field, and it means different things to different people. In addition to stating the kind of person Geni is going to help (data scientists), can we describe the specific kinds of tasks that Geni is trying to help people do? For example, when I first opened this issue I wasn't sure if the functionality that Delta provides makes sense for Geni or not. I know that Delta (and the similar Iceberg project) have become important to the Spark ecosystem (especially for data engineering and DataOps), but do the problems that are solved by these technologies overlap with the problems Geni is trying to solve? It seems like "should Geni do ___?" is a common question. In addition to this Delta discussion, we have open discussions for GraphFrames (#321) and Spark NLP (#323). Also, Geni has already been integrated with other with non-spark tools like tech.ml.dataset, Excel, and XGBoost. Most tools pick a narrower, less fuzzy, scope than "data science" in general. Even within the Spark project they decompose into independent modules based on problem domain (SparkSQL, mllib, streaming, graphx). That said, Geni doesn't have to do the same. I don't think it is bad thing if Geni is an all-in-one, batteries included, kitchen-sink of a platform that provides a bunch of stuff that is typically useful to people who call themselves data scientists. There are lots of choices with lots of trade-offs that will inform future development. @anthony-khong I would be interested to hear what your vision for Geni's scope. What problems is Geni trying to solve, and what problems is Geni not trying to solve? I know that might seem like a vague question, and I can provide more specific probing questions if it would be helpful. :) |
jFYI Delta has released version 1.0 (https://delta.io/news/delta-lake-1-0-0-released/) with support for Spark 3.1. |
Delta Lake brings a lot of crucial features into the Spark ecosystem. Some of the highlights include:
In some toy projects, I have begun using Delta with Geni using Scala interop. I haven't worked out all the issues yet, but I think I am getting close to something that could be useful to others.
It would be great to get people's thoughts on the best way to manage Delta support. My specific question is: Would it be a good idea to add Delta Lake as an optional dep (like we do with xgboost) and implement the Delta Lake API directly in Geni?
If yes, I can move my code into a PR. If no, I can create a geni compatible Clojure lib specifically for Delta lake.
Some things to consider:
with-dynamic-import
when reaching into optional dependencies, but that is usually for 1 or 2 functions. Adding Delta would create a set of functions to mirror the public API of an entire Scala library. It could be a little awkward towith-dymanic-import
so much stuff. My editor has a hard time handling it, but maybe that is just me.Personally, I think option 1 is best (integrate into Geni) but I don't want to create more work for other people if I am one of the only people who would benefit. :)
The text was updated successfully, but these errors were encountered: