Feat hive partitioning #41

martibosch · 2024-06-24T12:04:36Z

Prework

I understand and agree to this repository's code of conduct.
I understand and agree to this repository's contributing guidelines.
I have already submitted an issue or discussion thread to discuss my idea with the maintainers.

What kind of change does this PR introduce? (check at least one)

Does this PR introduce a breaking change? (check one)

Yes
No

If yes, please describe the impact and communicate accordingly:

The PR fulfills these requirements:

It's submitted to a branch named as follows:
- Fix a bug: bugfix-<some_key>-<word>
- Improve the doc: doc-<some_key>-<word>
- Improve a tutorial: tutorial-<some_key>-<word>
- Add a new feature: feature-<some_key>-<word>
- Refactor some code: refactor-<some_key>-<word>
- Optimize some code: optimize-<some_key>-<word>
When resolving a specific issue, it's referenced in the PR's title (e.g. fix #xxx[,#xxx], where "xxx" is the issue number)
Don't forget to link PR to issue if you are solving one.
All tests are passing.
New/updated tests are included

If adding a new feature, the PR's description includes:

A convincing reason for adding this feature (to avoid wasting your time, it's best to open a suggestion issue first and wait for approval before working on it)

Other information:

Related GitHub issues and pull requests

Ref: #

Summary

Please explain the purpose and scope of your contribution.

martibosch · 2024-06-24T12:09:42Z

This PR forces the id/var dir names to adopt a hive partitioning scheme - should it be optional?

For the partitions, hive partitioning seems to be enforced/hardcoded at https://github.com/ltelab/tstore/blob/feat-hive-partitioning/tstore/archive/ts/writers/pyarrow.py#L77
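For context, a hive partitioning scheme encodes column values in directory names of the form key=value, so a reader can recover them from the path alone. A minimal plain-Python sketch of that mapping follows; `parse_hive_path` and the example path are illustrative assumptions, not tstore code.

```python
from pathlib import PurePosixPath


def parse_hive_path(path: str) -> dict[str, str]:
    """Collect the key=value directory segments of a hive-style path.

    Hypothetical helper for illustration only; it is not part of tstore.
    """
    pairs = {}
    # Only directories carry partition info, so the file name is skipped.
    for part in PurePosixPath(path).parent.parts:
        key, sep, value = part.partition("=")
        if sep and key:
            pairs[key] = value
    return pairs


# Assumed layout: id and variable levels plus a yearly time partition.
example = "my-tstore/store_id=1/variable=temperature/year=2021/part.parquet"
print(parse_hive_path(example))
# {'store_id': '1', 'variable': 'temperature', 'year': '2021'}
```

Directories without an `=` sign (such as `my-tstore`) contribute nothing, which is why a non-hive layout needs the partitioning info to be stored elsewhere.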

martibosch · 2024-06-26T08:31:40Z

I fixed some issues and rebased this to have a proper PR with this feature only, so it is now ready for review. I wonder: is it worth making the hive scheme optional at this point? I suggest we move forward with hive only. We may consider supporting further schemes later on.

ghiggi · 2024-06-26T09:31:13Z

I will review this tomorrow or Friday @martibosch.

But as a quick thought: I would not enforce hive partitioning for the "store_id=1" and "variable=ts_variable" directories, nor for the time series partitioning. If we do not use hive partitioning for time series objects and still want to enable the time filter function, we need to implement the listing of parquet files based on the partitioning info included in the TSTORE YAML file.
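A rough sketch of what such a listing could look like without hive names, assuming the directory levels hold plain values (e.g. `2021/01`) and that the partitioning info is a coarse-to-fine list of time components such as `["year", "month"]`. The function name, its arguments and the file layout are hypothetical; nothing here reflects the actual TSTORE YAML format.

```python
from datetime import date
from pathlib import PurePosixPath


def select_files(files, partitioning, start, end):
    """Keep files whose positional partition directories fall in [start, end].

    `partitioning` names the time component encoded by each directory level,
    ordered from coarse to fine (assumption), e.g. ["year", "month"].
    """
    # Truncate the bounds to the partition resolution, e.g. (2021, 1).
    lo = tuple(getattr(start, name) for name in partitioning)
    hi = tuple(getattr(end, name) for name in partitioning)
    selected = []
    for f in files:
        # The last len(partitioning) directories hold the partition values.
        dirs = PurePosixPath(f).parent.parts[-len(partitioning):]
        values = tuple(int(d) for d in dirs)
        if lo <= values <= hi:
            selected.append(f)
    return selected


files = [
    "2020/12/part.parquet",
    "2021/01/part.parquet",
    "2021/02/part.parquet",
    "2021/03/part.parquet",
]
print(select_files(files, ["year", "month"], date(2021, 1, 15), date(2021, 2, 10)))
# ['2021/01/part.parquet', '2021/02/part.parquet']
```

This only prunes whole partitions; rows inside the first and last selected partitions would still need a filter on the time column.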

Two further considerations.

A TS object (a partitioned parquet dataset) is readable by any dataframe library or query engine, in any language, that supports reading parquet files.

A TSTORE directory structure with hive partitioning is not readable:

- in LONG format, if a TS object contains more than one variable (i.e. it is a dataframe rather than a series);
- possibly not even in LONG format when the TS object contains a single variable. This should be tried out using e.g. duckdb or pyarrow.dataset, for cases where a variable occurs in some store_id and not in others. It might well be that pyarrow/duckdb just infer the dataframe columns from the first store_id directory ;)

martibosch · 2024-06-26T09:59:53Z

OK, I can make it optional. Do I understand correctly that, for now, we still leave the "hive" time partitioning hardcoded at https://github.com/ltelab/tstore/blob/feat-hive-partitioning/tstore/archive/ts/writers/pyarrow.py#L77 ?

martibosch · 2024-06-26T12:26:41Z

I amended the first commit in order to try to make this work not only for tslong but also for tsdf write/load.

martibosch · 2024-06-26T15:06:18Z

I have added a second commit drafting what I understand should be the rationale for the id_var argument. I am probably overthinking this, but judging from the "TODO" in https://github.com/ltelab/tstore/blob/main/tstore/tests/conftest.py#L236, I suppose that the use of id_var in a TSDF is not clear anyway.

martibosch · 2024-06-26T15:20:51Z

Sorry again for overthinking and for the likely premature optimization, but this is probably a good point to consider whether we need the time_var argument in the TSDF.

martibosch · 2024-06-26T15:22:43Z

Once the above issues are clear we can see how we make the id and var-level hive scheme optional, e.g., allow paths of the form my-tstore/1/temperature/year=2021/part.parquet (where 1 is an id value).
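To make the two layouts concrete, here is a small sketch of a path builder with an optional id/var hive scheme. The function name, the `hive_id_var` flag and the `store_id`/`variable` keys are assumptions for illustration; the time partition is kept in hive form, as it is currently hardcoded.

```python
from pathlib import PurePosixPath


def ts_file_path(base, id_value, var_name, year, hive_id_var=True):
    """Build the path of one yearly partition file (hypothetical helper)."""
    if hive_id_var:
        # Hive-style names: the key is part of the directory name.
        id_dir, var_dir = f"store_id={id_value}", f"variable={var_name}"
    else:
        # Plain names: only the values appear in the path.
        id_dir, var_dir = str(id_value), var_name
    return PurePosixPath(base) / id_dir / var_dir / f"year={year}" / "part.parquet"


print(ts_file_path("my-tstore", 1, "temperature", 2021))
# my-tstore/store_id=1/variable=temperature/year=2021/part.parquet
print(ts_file_path("my-tstore", 1, "temperature", 2021, hive_id_var=False))
# my-tstore/1/temperature/year=2021/part.parquet
```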

martibosch mentioned this pull request Jun 24, 2024

Comptability with DuckDB #37

Open

1 task

martibosch force-pushed the feat-hive-partitioning branch 2 times, most recently from c29b057 to d50c06f Compare June 26, 2024 08:28

martibosch force-pushed the feat-hive-partitioning branch from d50c06f to a87ea6f Compare June 26, 2024 12:25

martibosch added 2 commits June 26, 2024 16:57

feat: hive partitioning for id/var (TODO: make it optional?)

26a5d7a

feat: rationale for id_var in tsdf

e5f7b83

martibosch force-pushed the feat-hive-partitioning branch from a87ea6f to e5f7b83 Compare June 26, 2024 14:59
