extend geni cli to "transform" data to arrow ? #286
Comments
Maybe an easier pathway to the above is:
Maybe in this case #284 is not needed at all.
I think #284 is still very relevant, because you want to go in-and-out of Geni and TMD in one REPL session seamlessly, and this could be one way to do it. I'll soon start working on Geni bindings for TMD, so we'll see! As for your CLI tool for Arrow conversion, I think it'd be straightforward to bake it into the current Geni CLI, so that
Maybe it helps to think about Geni + TMD in three scenarios:
A "geni repl" CLI tool would only support 1).
Bindings? Or making the Tablecloth API work with Geni?
One of the use cases I have in mind for Geni, and the reason I developed #284 as well, is to use Geni/Spark as a first step to transform "arbitrary data" into Arrow files (mainly for use in TMD).
Ideally I would have a CLI tool for this, which does the following operations:
Maybe the "geni" CLI could become this tool.
So it gets run as
"geni repl" -> as now
or alternatively like this:
I would hope that this "simple" case is enough for most uses. Eventually the "transform" needs to be extended to allow two more things:
The first would require extending #284 to allow writing several Arrow files, partitioned by group.
I am not sure whether this is even possible while assuming big data and therefore limited heap space.
And for it to be really useful, TMD needs "multi-file dataset support" for Arrow files in some form:
techascent/tech.ml.dataset#145
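On the heap-space question for the partitioned write: a rough, language-agnostic sketch of one way it might work, written in Python only for brevity. This is not Geni, Spark or TMD code. `write_partitioned` and `max_buffered_rows` are made-up names, and plain CSV files stand in for Arrow files. The idea is to stream the rows, keep a small buffer per group, and append every buffer to its own file whenever the total number of buffered rows hits a limit, so memory use is bounded by the limit rather than by the size of the data.

```python
import csv
import os
import tempfile
from collections import defaultdict


def write_partitioned(rows, key, out_dir, max_buffered_rows=1000):
    """Stream dict rows into one file per distinct value of `key`.

    Hypothetical helper: at most `max_buffered_rows` rows are held in
    memory at once. CSV is only a stand-in for Arrow here.
    Returns a dict mapping each group value to the path of its file.
    """
    buffers = defaultdict(list)  # group value -> rows not yet written
    buffered = 0                 # total rows currently held in memory
    written = {}                 # group value -> output path

    def flush():
        nonlocal buffered
        for group, batch in buffers.items():
            path = os.path.join(out_dir, f"{key}={group}.csv")
            first_write = group not in written
            # Append, so earlier batches of the same group are kept.
            with open(path, "a", newline="") as f:
                writer = csv.DictWriter(f, fieldnames=list(batch[0].keys()))
                if first_write:
                    writer.writeheader()
                writer.writerows(batch)
            written[group] = path
        buffers.clear()
        buffered = 0

    for row in rows:
        buffers[row[key]].append(row)
        buffered += 1
        if buffered >= max_buffered_rows:
            flush()
    flush()  # write whatever is left after the input is exhausted
    return written


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as out_dir:
        # A generator, so the input is never fully materialised.
        data = ({"country": c, "value": i}
                for i, c in enumerate(["a", "b", "a", "c", "b", "a"]))
        for group, path in sorted(
                write_partitioned(data, "country", out_dir, 2).items()):
            print(group, os.path.basename(path))
```

Whether the same approach carries over to Arrow depends on being able to append record batches to a file that was opened earlier, which I have not checked for #284. It also assumes the number of distinct groups is small enough that keeping one open buffer per group is cheap.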