
Working with partitions #30

Open

remche opened this issue Dec 14, 2023 · 4 comments
Comments

@remche
Contributor

remche commented Dec 14, 2023

We are working with a multi-file catalog, e.g.:

plugins:
  source:
    - module: intake_parquet
sources:
  test:
    description: Short example parquet data
    driver: parquet
    args:
      urlpath: 
        - s3://bucket/path/file.parquet
        - s3://bucket/path/file2.parquet
        - s3://bucket/path/file3.parquet
      storage_options:
        anon: True
        client_kwargs:
          endpoint_url: https://example.com
  1. With only two entries, discover() is fine and we can read_partition(0) and read_partition(1), but a full read() fails with ValueError: storage_options passed with buffer, or non-supported URL, probably because ParquetSource.read() does not handle a list in urlpath.
  2. With more than two entries, discover() fails with a KeyError (both failures are sketched in the reproduction below).
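
A minimal reproduction sketch, assuming the catalog above is saved as catalog.yml:

import intake

# load the multi-file catalog shown above (Intake v1 API)
cat = intake.open_catalog("catalog.yml")
src = cat.test

src.discover()         # KeyError once urlpath has more than two entries
src.read_partition(0)  # works with two entries
src.read()             # ValueError: storage_options passed with buffer, ...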

Thanks for maintaining this intake plugin !

@martindurant
Member

I am not immediately sure of the reason behind the issue you are seeing.

Before trying to diagnose and fix this rather old code, I would like to point you to the upcoming V2 of Intake ("Take2"), in which you would be able to do:

import intake

# declare the parquet data files (paths elided here)
data = intake.datatypes.Parquet(["...", "...", ...])
cat = intake.readers.entry.Catalog()
cat["mydata"] = data.to_reader("dask")  # or "pandas"
cat.to_yaml_file(...)

This produces a dask.DataFrame and avoids ever having to edit the YAML file directly. The pandas version no longer requires dask, and you also have the choice to read the parquet with other engines such as Ray or Spark without needing dask or pandas.
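
A hedged sketch of consuming such a reader in the Take2 pre-release (the core reader method is read(); exact pre-release details may shift):

# sketch against the Take2 pre-release, continuing the snippet above
reader = cat["mydata"]  # look the reader entry up in the catalog
df = reader.read()      # dask.DataFrame for the "dask" reader, pandas.DataFrame for "pandas"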

Exactly what discover() should do was always a little unclear. In the next release of Intake Take2, dask readers will indeed give you the dataframe head. However, pandas specifically has no way to say "read the first part" (at least not in a parquet-engine-independent way), and we would rather keep the code simple.
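
For illustration only (not Intake API): a specific engine can read just the first part, e.g. pyarrow can read a single row-group, which is exactly the capability pandas lacks in an engine-independent form:

import pyarrow.parquet as pq

# illustration: engine-specific "first part" read with pyarrow
pf = pq.ParquetFile("file.parquet")      # local path for brevity
head = pf.read_row_group(0).to_pandas()  # only the first row-group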

@remche
Contributor Author

remche commented Dec 18, 2023

Thanks @martindurant for your feedback.
I was not aware of the development of this new version of Intake. Is there any documentation regarding the changes and the new API?

@martindurant
Member

https://intake.readthedocs.io/en/reader/index2.html#take2 is the main place to look. https://github.com/intake/intake/blob/49c9d3b514f5c0d8d7f2e0c58ea6fd3dae385406/examples/Take2.ipynb is an example notebook which I demoed at PyData Global (recording not yet available).

This has not been publicized widely yet, and is only in pre-release.

@remche
Contributor Author

remche commented Dec 22, 2023

We did some tests with the Take2 pre-release (congrats on the deep rewrite!), and they have been pretty conclusive.
You can close this issue, as we won't invest time in the old release.
Thanks again for your work on this package!
