Working with partitions #30
Comments
I am not immediately sure of the reason behind the issue you are seeing. Before trying to diagnose and fix this rather old code, I would like to point you to the upcoming V2 of Intake ("Take2"), in which you would be able to do:
import intake
data = intake.datatypes.Parquet(["...", "...", ....])
cat = intake.readers.entry.Catalog()
cat["mydata"] = data.to_reader("dask") # or pandas
cat.to_yaml_file(...)
This produces a dask.DataFrame and avoids ever having to edit the YAML file directly. The pandas version no longer requires dask, and you can also read parquet with other engines such as ray or spark without needing dask or pandas.
Exactly what discover() should do was always a little unclear. In the next release of Intake Take2, dask readers will indeed give you the dataframe head. However, pandas specifically has no way to say "read the first part" (at least not in a parquet-engine-independent way), and we would rather keep the code simple.
Thanks @martindurant for your feedback.
https://intake.readthedocs.io/en/reader/index2.html#take2 is the main place to look. https://github.com/intake/intake/blob/49c9d3b514f5c0d8d7f2e0c58ea6fd3dae385406/examples/Take2.ipynb is an example notebook which I demoed at PyData Global (recording not yet available). This has not been widely advertised yet and is only in pre-release.
We did some tests with the Take2 pre-release (congrats on the deep rewrite!), and they have been pretty conclusive.
We are working with a multi-file catalog, i.e. one where url_path is a list of files.
discover() is fine, and we can read_partition(0) and read_partition(1), but a full read() fails with ValueError: storage_options passed with buffer, or non-supported URL, probably because ParquetSource.read() does not handle an array in url_path.
discover() fails with a KeyError.
Thanks for maintaining this intake plugin!
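For anyone hitting the same full read() failure, here is a minimal sketch of one possible workaround: read each file separately and combine the results yourself. This is not the plugin's code. The helper names as_path_list and read_all are hypothetical, as are the file names in the comment, and the pandas usage assumes pandas and a parquet engine are installed.

```python
from typing import Any, Callable, List, Sequence, Union


def as_path_list(url_path: Union[str, Sequence[str]]) -> List[str]:
    """Return url_path as a list, whether it was one path or several."""
    if isinstance(url_path, str):
        return [url_path]
    return list(url_path)


def read_all(
    url_path: Union[str, Sequence[str]],
    read_one: Callable[[str], Any],
    combine: Callable[[List[Any]], Any],
) -> Any:
    """Read every file with read_one, then merge the parts with combine."""
    parts = [read_one(path) for path in as_path_list(url_path)]
    return combine(parts)


# Example with pandas (hypothetical file names):
#   import pandas as pd
#   df = read_all(["a.parquet", "b.parquet"], pd.read_parquet, pd.concat)
```

Passing the reader and the combine step as arguments keeps the sketch independent of any particular engine, so the same loop could be used with a different reader function.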