extend geni cli to "transform" data to arrow ? #286

Open
behrica opened this issue Nov 3, 2020 · 5 comments

@behrica
Collaborator

behrica commented Nov 3, 2020

One of the use cases I have in mind for "geni", and the reason I developed #284, is to use geni/spark as a first step to transform "arbitrary data" into Arrow files (mainly for use in TMD).

Ideally I would have a CLI tool for this, which performs the following operations:

(->
 (g/read-xxx! path)          ; xxx -> "parquet" or "csv" or ...
 (g/repartition n)
 (g/collect-as-arrow m dir))

Maybe "geni" cli could become this tool.

So it could be run as
"geni repl" -> as now

or alternatively like this:

"geni to-arrow xxxx.csv 10 50000 /tmp"
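
A minimal sketch of how such a command line could be mapped onto the parameters of the pipeline above. Nothing here is part of Geni: the function name `parse-to-arrow-args` and the keys `:source`, `:n`, `:m` and `:dir` are hypothetical, and the sketch assumes that the four positional arguments are the input file, `n` (for `g/repartition`), `m` (for `g/collect-as-arrow`) and the output directory, in the order of the example invocation.

```clojure
;; Hypothetical helper, not part of Geni: parses the positional arguments
;; of a "geni to-arrow SOURCE N M DIR" call into a map.
(defn parse-to-arrow-args
  "Turns [\"xxxx.csv\" \"10\" \"50000\" \"/tmp\"] into a map of pipeline parameters.
  Throws ex-info on a wrong argument count or a non-positive / non-numeric N or M."
  [args]
  (when (not= 4 (count args))
    (throw (ex-info "usage: geni to-arrow SOURCE N M DIR" {:args args})))
  (let [[source n m dir] args
        parse-pos-int (fn [label s]
                        (let [v (try (Long/parseLong s)
                                     (catch NumberFormatException _ nil))]
                          (if (and v (pos? v))
                            v
                            (throw (ex-info (str label " must be a positive integer")
                                            {:value s})))))]
    {:source source
     :n      (parse-pos-int "N" n)
     :m      (parse-pos-int "M" m)
     :dir    dir}))
```

The resulting map could then be handed to whatever function actually runs the read, repartition and collect steps; validating the numbers up front avoids starting a Spark session only to fail on a typo.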

I would hope that this "simple" case is enough for most uses. Eventually the "transform" would need to be extended to allow two more things:

  • specify group-by columns and write arrow files partitioned
  • specify arbitrary "filter" criteria to shrink the data

The first would require extending #284 so that it can write several Arrow files, partitioned by the groups.
I am not sure whether this is even possible while assuming big data and therefore "limited heap space".

And for this to be really useful, TMD needs to have "multi-file dataset support" for Arrow files in some form:
techascent/tech.ml.dataset#145

@behrica
Collaborator Author

behrica commented Nov 3, 2020

Maybe an easier pathway to the above is:

  • let geni/spark do everything and let it write parquet files to disk
  • write a cli tool which can convert a "directory of parquet" files into a "directory of arrow files"

Maybe in this case #284 is not needed at all.

@anthony-khong
Member

I think #284 is still very relevant, because you want to go in-and-out of Geni and TMD in one REPL session seamlessly, and this could be one way to do it. I'll soon start working on Geni bindings for TMD, so we'll see!

As for your CLI tool for Arrow conversion, I think it'd be straightforward to bake it into the current Geni CLI, so that geni gives you the REPL and geni :to-arrow $SOURCE_PATH $DESTINATION_PATH does the Arrow conversion. We could also add a number of built-in, frequently used mini apps like that to the CLI.
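
A rough sketch of what such a dispatch could look like, assuming the first argument selects the mini app and no arguments means "start the REPL". The names `mini-apps` and `dispatch` are hypothetical, and the handlers only return a description of what they would do; the real REPL start-up and the actual conversion are deliberately left out.

```clojure
(require '[clojure.string :as str])

;; Hypothetical registry, not part of Geni:
;; sub-command keyword -> handler taking the remaining arguments.
(def mini-apps
  {:to-arrow (fn [[source destination]]
               ;; Placeholder: a real handler would run the Arrow conversion here.
               {:action :to-arrow :source source :destination destination})})

(defn dispatch
  "No arguments -> REPL (as the CLI does now); \":name ...\" -> the matching mini app."
  [args]
  (let [[cmd & more] args
        app (when (and cmd (str/starts-with? cmd ":"))
              (get mini-apps (keyword (subs cmd 1))))]
    (cond
      (nil? cmd) {:action :repl}
      app        (app more)
      :else      {:action :unknown-command :command cmd})))
```

With a registry like this, adding another built-in mini app would only mean adding one more entry to the map.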

@behrica
Collaborator Author

behrica commented Nov 5, 2020

Maybe it helps to think about Geni + TMD in three scenarios:

  1. I want to use TMD interactively, but I start from big data:
     • I know exactly how to filter my big dataset (no exploration of the big data needed)
  2. I have big data, but I don't know how to filter it yet. Exploration of the big data plus interactive work is needed.
  3. I write a complex ETL job that starts from big data, but then I want to continue in TMD or other Clojure "in-memory" libraries.

@behrica
Collaborator Author

behrica commented Nov 5, 2020

A "geni repl" CLI tool would only support 1).

Scenarios 2) and 3) require other forms of integration between Geni and TMD:
  • exchange of parquet files on disk
  • collect-to-arrow
  • collect-to-TMD (as recently implemented by Chris in TMD)

@behrica
Collaborator Author

behrica commented Nov 5, 2020

> I think #284 is still very relevant, because you want to go in-and-out of Geni and TMD in one REPL session seamlessly, and this could be one way to do it. I'll soon start working on Geni bindings for TMD, so we'll see!

Bindings? Or making the Tablecloth API work with Geni?
