ENH: Read and write pandas attrs to parquet with pyarrow engine #41545
Conversation
snowman2 commented on May 18, 2021
- closes #20521 (ENH: adding metadata argument to DataFrame.to_parquet)
- tests added / passed
- Ensure all linting tests pass, see here for how to run them
- whatsnew entry
lgtm, just a couple of minor comments
Force-pushed from 1a3fd8d to 45ea153
I think this is probably fine. Can you add a `versionadded 1.3` to the docstring of read/write? Also a note in the whatsnew enhancements section.
@snowman2 thanks for putting up the PR. There seem to be some conflicts...

Well, resolving conflicts via my phone was a bad idea. I will fix it later.

@snowman2 one more thing: if we are changing the metadata we store, we should probably also update the documentation about this: https://github.com/pandas-dev/pandas/blob/master/doc/source/development/developer.rst
While taking a closer look, I have some questions / concerns:

- This is also saving column / Series `attrs`, but to what extent do we actually already support `attrs` on the column level? (cc @TomAugspurger) For example, the `df.a.attrs = {...}` in the example/test only works because of the "item cache" of the DataFrame (once a column is accessed and converted into a Series, we cache this Series). If the item cache is cleared (so creating a new Series from the column), the attrs of the "column" are also lost.
- If saving `attrs` of the columns, this could also go into the `"columns"` field.

For the code:

- Can you also add a test that checks the actual generated metadata? (so testing the specification, which needs to be interoperable with other libraries, and not only the roundtrip from/to pandas, which might hide errors in the actual metadata)
- With the current code, I think it will generate duplicated `attrs` between "attrs" and "column_attrs" if you have only DataFrame-level `attrs`.

Also, as mentioned in the issue, this is actually something that ideally would be implemented in pyarrow / fastparquet, since that is where those metadata are currently constructed (but it can be fine to short-term start with a workaround here).
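The duplication concern can be made concrete. Here is a minimal sketch of how the attrs could be serialized into the parquet key-value metadata, de-duplicating column-level entries against the DataFrame-level ones; the field names `"attrs"` / `"column_attrs"` follow the discussion in this thread and are illustrative, not a finalized spec:

```python
import json

def build_attrs_metadata(df_attrs, column_attrs):
    """Sketch: serialize DataFrame- and column-level attrs into the
    JSON blob that would live in the parquet key-value metadata.
    Field names ("attrs", "column_attrs") are illustrative."""
    meta = {"attrs": df_attrs}
    # Drop column entries that merely echo the DataFrame-level attrs,
    # avoiding the duplication flagged in the review comment above.
    meta["column_attrs"] = {
        col: attrs
        for col, attrs in column_attrs.items()
        if attrs and attrs != df_attrs
    }
    return json.dumps(meta, sort_keys=True)

blob = build_attrs_metadata(
    {"source": "sensor"},
    {"a": {"units": "m"}, "b": {"source": "sensor"}},  # "b" echoes the frame attrs
)
```

With this shape, a DataFrame that has only frame-level attrs produces an empty `"column_attrs"` mapping instead of one repeated per column.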
.. warning:: This only works with the ``pyarrow`` engine as of ``pandas`` 1.3.

The attributes of both the ``DataFrame`` and each ``Series`` are written to and read
from using:
Probably also need to mention that this is an optional field?
I don’t really know how we could support nested attrs like that (on a series inside a Frame). I think we should recommend against relying on it.
Thanks Tom. Then I would propose to leave that out for this PR, and for now only focus on

That is unfortunate. Column level metadata was the main purpose of this PR for me at least.

Part of the motivation comes from this issue, where xarray attrs are passed to pandas attrs: pydata/xarray#5335. But it seems like pandas attrs needs more work for this to be a reasonable solution.

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.
I'm rethinking this a bit. I think in the narrow context of reading / writing a dataset, it should be just fine to allow attrs at the table and column level (if the file backend supports it). When I wrote that, I was more concerned with how to propagate attrs through operations at multiple levels, which sounds hard in general. But for IO, where the number of operations is limited, it may be feasible. Here's a link to the JIRA Alan created: https://issues.apache.org/jira/browse/ARROW-12823
Sounds good to me 👍. Should this go into
I am still hesitant to support a feature only in a directly following IO step, while it is not generally supported. That will cause confusion (users will expect it to work generally).

To restate one aspect of my comment above: storing attrs on a column currently only works because of the item cache:

```python
In [1]: df = pd.DataFrame({"a": [1], "b": [1]})

In [2]: df['a'].attrs = {"a": "column"}

In [3]: df['a'].attrs
Out[3]: {'a': 'column'}

In [4]: df._clear_item_cache()

# attrs are gone
In [5]: df['a'].attrs
Out[5]: {}
```

And we can internally clear the item cache in seemingly random (for the user) cases (when doing some modification, when a consolidation of the manager happens, ..). And in general, behaviour should never depend on the item cache being cleared or not, IMO (and it also makes our behaviour tied to this mechanism, while in the Copy-on-Write POC I might want to remove the item_cache altogether). One example:

```python
In [6]: df = pd.DataFrame({"a": [1]})

In [7]: df["b"] = 2

In [8]: df['a'].attrs = {"a": "column"}

In [9]: df['a'].attrs
Out[9]: {'a': 'column'}

# .values consolidates the frame -> consolidation clears the item cache -> attrs "lost"
In [10]: df.values
Out[10]: array([[1, 2]])

In [11]: df['a'].attrs
Out[11]: {}
```

Another brittleness is that "column" attrs get inherited from the dataframe if not set through the item_cache:

```python
In [19]: df = pd.DataFrame({"a": [1], "b": [1]})

In [20]: df.attrs = {"df": "frame"}

In [21]: df["a"].attrs
Out[21]: {'df': 'frame'}
```

That will also lead to duplicated attrs if saving both the dataframe and column attrs.
All of Joris' points are around pandas behavior after reading, so maybe we can limit the scope even further to writing attrs, if they exist? Then readers who need the attrs could use

If we go that route, I think the work would be done in pyarrow, to write dataset and column(?) metadata if it's present.
But even for writing, it still depends on the item_cache behaviour (which I would call an accidental side effect, and not an explicitly supported feature), and it still would only be supported if you set the

(given those aspects, I would not be keen on adding support for it in pyarrow)
For columns, to be clear. Adding support for DataFrame.attrs seems a nice addition (and in theory one can store column specific information at the DataFrame level?)
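That last idea can be sketched in a few lines: keep everything in `DataFrame.attrs`, including a nested mapping for per-column information, so nothing relies on the fragile per-Series attrs. The `"columns"` key below is an illustrative convention, not a pandas API:

```python
import pandas as pd

df = pd.DataFrame({"a": [1], "b": [1]})

# All metadata lives at the DataFrame level; the nested "columns"
# mapping is just a convention for column-specific information.
df.attrs = {
    "description": "example frame",
    "columns": {"a": {"units": "m"}, "b": {"units": "s"}},
}

units_a = df.attrs["columns"]["a"]["units"]
```

Because `DataFrame.attrs` is an ordinary dict on the frame itself, this layout sidesteps the item-cache problems entirely: no information is stored on cached Series objects.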