
Working with partitions #30

Open

remche opened this issue Dec 14, 2023 · 4 comments
Comments

@remche
Contributor

remche commented Dec 14, 2023

We are working with a multi-file catalog, e.g.:

plugins:
  source:
    - module: intake_parquet
sources:
  test:
    description: Short example parquet data
    driver: parquet
    args:
      urlpath: 
        - s3://bucket/path/file.parquet
        - s3://bucket/path/file2.parquet
        - s3://bucket/path/file3.parquet
      storage_options:
        anon: True
        client_kwargs:
          endpoint_url: https://example.com
  1. With only two entries, discover() is fine and we can read_partition(0) and read_partition(1), but a full read() fails with ValueError: storage_options passed with buffer, or non-supported URL, probably because ParquetSource.read() does not handle a list in urlpath.
  2. With more than two entries, discover() fails with a KeyError (both failures are sketched in the reproduction below).
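
A minimal reproduction sketch, assuming the catalog above is saved as catalog.yml:

import intake

# load the multi-file catalog shown above (Intake v1 API)
cat = intake.open_catalog("catalog.yml")
src = cat.test

src.discover()         # KeyError once urlpath has more than two entries
src.read_partition(0)  # works with two entries
src.read()             # ValueError: storage_options passed with buffer, ...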

Thanks for maintaining this intake plugin !

@martindurant
Member

I am not immediately sure of the reason behind the issue you are seeing.

Before trying to diagnose and fix this rather old code, I would like to point you to the upcoming V2 of Intake ("Take2"), in which you would be able to do:

import intake

# declare the parquet data files (paths elided here)
data = intake.datatypes.Parquet(["...", "...", ...])
cat = intake.readers.entry.Catalog()
cat["mydata"] = data.to_reader("dask")  # or "pandas"
cat.to_yaml_file(...)

This produces a dask.DataFrame and avoids ever having to edit the YAML file directly. The pandas version no longer requires dask, and you also have the choice to read the parquet with other engines such as Ray or Spark without needing dask or pandas.
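
A hedged sketch of consuming such a reader in the Take2 pre-release (the core reader method is read(); exact pre-release details may shift):

# sketch against the Take2 pre-release, continuing the snippet above
reader = cat["mydata"]  # look the reader entry up in the catalog
df = reader.read()      # dask.DataFrame for the "dask" reader, pandas.DataFrame for "pandas"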

Exactly what discover() should do was always a little unclear. In the next release of Intake Take2, dask readers will indeed give you the dataframe head. However, pandas specifically has no way to say "read the first part" (at least not in a parquet-engine-independent way), and we would rather keep the code simple.
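
For illustration only (not Intake API): a specific engine can read just the first part, e.g. pyarrow can read a single row-group, which is exactly the capability pandas lacks in an engine-independent form:

import pyarrow.parquet as pq

# illustration: engine-specific "first part" read with pyarrow
pf = pq.ParquetFile("file.parquet")      # local path for brevity
head = pf.read_row_group(0).to_pandas()  # only the first row-group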

@remche
Contributor Author

remche commented Dec 18, 2023

Thanks @martindurant for your feedback.
I was not aware of the development of this new version of Intake. Is there any documentation regarding the changes and the new API?

@martindurant
Member

https://intake.readthedocs.io/en/reader/index2.html#take2 is the main place to look. https://github.com/intake/intake/blob/49c9d3b514f5c0d8d7f2e0c58ea6fd3dae385406/examples/Take2.ipynb is an example notebook which I demoed at PyData Global (recording not yet available).

This has not been publicized widely yet, and is only in pre-release.

@remche
Contributor Author

remche commented Dec 22, 2023

We did some tests with the Take2 pre-release (congrats on the deep rewrite!), and they have been pretty conclusive.
You can close this issue, as we won't invest time in the old release.
Thanks again for your work on this package!
