ENH: Read and write pandas attrs to parquet with pyarrow engine #41545
Conversation
snowman2 commented on May 18, 2021
- closes #20521 (ENH: adding metadata argument to DataFrame.to_parquet)
- tests added / passed
- Ensure all linting tests pass, see here for how to run them
- whatsnew entry
lgtm, just a couple of minor comments
Force-pushed from 1a3fd8d to 45ea153
I think this is probably fine. Can you add a `versionadded 1.3` to the docstring of read/write? Also a note in the whatsnew enhancements section.
@snowman2 thanks for putting up the PR. There seem to be some conflicts...

Well, resolving conflicts via my phone was a bad idea. I will fix it later.

@snowman2 one more thing: if we are changing the metadata we store, we should probably also update the documentation about this: https://github.com/pandas-dev/pandas/blob/master/doc/source/development/developer.rst
While taking a closer look, I have some questions / concerns:

- This is also saving column / Series `attrs`, but to what extent do we actually already support `attrs` on the column level? (cc @TomAugspurger) For example, the `df.a.attrs = {...}` in the example/test only works because of the "item cache" of the DataFrame (once a column is accessed and converted into a Series, we cache this Series). If the item cache is cleared (so creating a new Series from the column), the attrs of the "column" are also lost.
- If saving `attrs` of the columns, this could also go into the `"columns"` field.

For the code:

- Can you also add a test that checks the actual generated metadata? (so testing the specification, which needs to be interoperable with other libraries, and not only the roundtrip from/to pandas, which might hide errors in the actual metadata)
- With the current code, I think it will generate duplicated `attrs` between "attrs" and "column_attrs" if you have only DataFrame-level `attrs`.

Also, as mentioned in the issue, this is actually something that ideally would be implemented in pyarrow / fastparquet, since that is where those metadata are currently constructed (but it can be fine to short-term start with a workaround here).
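The duplication concern can be made concrete. Here is a minimal sketch of how the attrs could be serialized into the parquet key-value metadata, de-duplicating column-level entries against the DataFrame-level ones; the field names `"attrs"` / `"column_attrs"` follow the discussion in this thread and are illustrative, not a finalized spec:

```python
import json

def build_attrs_metadata(df_attrs, column_attrs):
    """Sketch: serialize DataFrame- and column-level attrs into the
    JSON blob that would live in the parquet key-value metadata.
    Field names ("attrs", "column_attrs") are illustrative."""
    meta = {"attrs": df_attrs}
    # Drop column entries that merely echo the DataFrame-level attrs,
    # avoiding the duplication flagged in the review comment above.
    meta["column_attrs"] = {
        col: attrs
        for col, attrs in column_attrs.items()
        if attrs and attrs != df_attrs
    }
    return json.dumps(meta, sort_keys=True)

blob = build_attrs_metadata(
    {"source": "sensor"},
    {"a": {"units": "m"}, "b": {"source": "sensor"}},  # "b" echoes the frame attrs
)
```

With this shape, a DataFrame that has only frame-level attrs produces an empty `"column_attrs"` mapping instead of one repeated per column.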
.. warning:: This only works with the ``pyarrow`` engine as of ``pandas`` 1.3.

The attributes of both the ``DataFrame`` and each ``Series`` are written to and read
from using:
Probably also need to mention that this is an optional field?
I don’t really know how we could support nested attrs like that (on a series inside a Frame). I think we should recommend against relying on it.
Thanks Tom. Then I would propose to leave that out for this PR, and for now only focus on

That is unfortunate. Column level metadata was the main purpose of this PR for me at least.

Part of the motivation comes from this issue, where xarray attrs are passed to pandas attrs: pydata/xarray#5335. But it seems like pandas attrs needs more work for this to be a reasonable solution.

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.
I'm rethinking this a bit. I think in the narrow context of reading / writing a dataset, it should be just fine to allow attrs at the table and column level (if the file backend supports it). When I wrote that, I was more concerned with how to propagate attrs through operations at multiple levels, which sounds hard in general. But for IO, where the number of operations is limited, it may be feasible. Here's a link to the JIRA Alan created: https://issues.apache.org/jira/browse/ARROW-12823
Sounds good to me 👍. Should this go into
I am still hesitant to support a feature only in a directly following IO step, while it is not generally supported. That will cause confusion (users will expect it to work generally).

To restate one aspect of my comment above: storing attrs on a column currently only works because of the item cache:

```python
In [1]: df = pd.DataFrame({"a": [1], "b": [1]})

In [2]: df['a'].attrs = {"a": "column"}

In [3]: df['a'].attrs
Out[3]: {'a': 'column'}

In [4]: df._clear_item_cache()

# attrs are gone
In [5]: df['a'].attrs
Out[5]: {}
```

And we can internally clear the item cache in seemingly random (for the user) cases (when doing some modification, when a consolidation of the manager happens, ..). And in general, behaviour should never depend on the item cache being cleared or not, IMO (and it also makes our behaviour tied to this mechanism, while in the Copy-on-Write POC I might want to remove the item_cache altogether). One example:

```python
In [6]: df = pd.DataFrame({"a": [1]})

In [7]: df["b"] = 2

In [8]: df['a'].attrs = {"a": "column"}

In [9]: df['a'].attrs
Out[9]: {'a': 'column'}

# .values consolidates the frame -> consolidation clears the item cache -> attrs "lost"
In [10]: df.values
Out[10]: array([[1, 2]])

In [11]: df['a'].attrs
Out[11]: {}
```

Another brittleness is that "column" attrs get inherited from the dataframe if not set through the item_cache:

```python
In [19]: df = pd.DataFrame({"a": [1], "b": [1]})

In [20]: df.attrs = {"df": "frame"}

In [21]: df["a"].attrs
Out[21]: {'df': 'frame'}
```

That will also lead to duplicated attrs if saving both the dataframe and column attrs.
All of Joris' points are around pandas behavior after reading, so maybe we can limit the scope even further to writing attrs, if they exist? Then readers who need the attrs could use

If we go that route, I think the work would be done in pyarrow, to write dataset and column(?) metadata if it's present.
But even for writing, it still depends on the item_cache behaviour (which I would call an accidental side effect, and not an explicitly supported feature), and it still would only be supported if you set the

(given those aspects, I would not be keen on adding support for it in pyarrow)
For columns, to be clear. Adding support for DataFrame.attrs seems a nice addition (and in theory one can store column specific information at the DataFrame level?)
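That last idea can be sketched in a few lines: keep everything in `DataFrame.attrs`, including a nested mapping for per-column information, so nothing relies on the fragile per-Series attrs. The `"columns"` key below is an illustrative convention, not a pandas API:

```python
import pandas as pd

df = pd.DataFrame({"a": [1], "b": [1]})

# All metadata lives at the DataFrame level; the nested "columns"
# mapping is just a convention for column-specific information.
df.attrs = {
    "description": "example frame",
    "columns": {"a": {"units": "m"}, "b": {"units": "s"}},
}

units_a = df.attrs["columns"]["a"]["units"]
```

Because `DataFrame.attrs` is an ordinary dict on the frame itself, this layout sidesteps the item-cache problems entirely: no information is stored on cached Series objects.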