Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DEPR: attrs #52166

Open
jbrockmendel opened this issue Mar 24, 2023 · 29 comments
Open

DEPR: attrs #52166

jbrockmendel opened this issue Mar 24, 2023 · 29 comments
Labels
Deprecate Functionality to remove in pandas Needs Discussion Requires discussion from core team before further action

Comments

@jbrockmendel
Copy link
Member

jbrockmendel commented Mar 24, 2023

Discussion broken off from #51280
PR #52152

Propagation of attrs in __finalize__ is a small-but-everywhere performance hit that we should deprecate.

@jbrockmendel jbrockmendel added Bug Needs Triage Issue that has not been reviewed by a pandas team member Deprecate Functionality to remove in pandas Needs Discussion Requires discussion from core team before further action and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Mar 24, 2023
@jorisvandenbossche
Copy link
Member

For context, this feature was added in pandas 1.0 (#29062, cc @TomAugspurger).

I personally have no idea how much attrs specifically is being used since it was introduced, but in general the ability to store metadata is a topic that has come up a lot. From a quick browsing of our issues, some related ones:

(from the last two issues, it seems there is certainly some user interest in the specific attrs feature)

@TomAugspurger
Copy link
Contributor

From Joris in #51280 (comment)

If performance is the main argument that we would want to deprecate the related features to metadata propagation (attrs/flags, #52165 and #52166, where it is currently the only argument), I think we need some more investigation / proof that this is actually a problem.

That's been bugging me too. I haven't looked at the performance, but copying the metadata should just be a dictionary merge / update.

At the end of the day we'll be making a value judgement: is the performance cost worth it. We'll need a clearer idea of performance cost.

@jbrockmendel
Copy link
Member Author

where it is currently the only argument

The other argument is that attrs/_metadata is only half-implemented, with a bunch of the test_finalize tests xfailed and a bunch more just wrong. And there is no real prospect of getting these fully working.

If we do decide this is worth keeping, we should have Only One way to do it. _metadata and attrs do effectively the same thing in slightly different ways.

@jorisvandenbossche
Copy link
Member

If we do decide this is worth keeping, we should have Only One way to do it. _metadata and attrs do effectively the same thing in slightly different ways.

That is not really true I think. attrs allows users to use this for standard DataFrames, _metadata allows subclasses to use custom metadata that is not directly exposed to users through attrs.

@MarcoGorelli
Copy link
Member

And there is no real prospect of getting these fully working.

Personally, this is the argument I find most persuading

I encountered this in the USC contract too, they said they couldn't use attrs as they'd tried to and it was too unreliable (having tried fixing up the finalize tests, I'm not surprised)

One could make the argument that some feature not working completely isn't a reason to deprecate it, but I'm not sure that's valid if the feature isn't being worked on (by contrast, datetime parsing has bugs, but it's actively being worked on, so the prospect of fixing them is realistic).
Given that it's marked as "experimental" anyway, I'd suggest just deprecating/removing it, rather than leaving it hanging around unfinished.

As for users wanting to store metadata - does any other DataFrame library support this? If not, we shouldn't be saying "yes" to everything, especially given how limited maintenance resources are.

As for what users should do - I'd suggest they define their own dataclass where one field is metadata and another is the dataframe, and then take care of how to propagate it themselves

@jorisvandenbossche
Copy link
Member

but I'm not sure that's valid if the feature isn't being worked on

To be clear I am not working on this myself, so I don't know the details. But I am not sure that this is true that it is not being worked on: judging by the the activity and linked PRs in #28283, there is some work going on to improve this? (it might have slowed down the last months, but for example generally speaking for the year 2022, quite some PRs have been merged related to this)

I think the bigger problem is that there is no longer an active champion following up on this within the core team

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Apr 1, 2023

I can chip away at these as I have free time.

As for users wanting to store metadata - does any other DataFrame library support this? If not, we shouldn't be saying "yes" to everything, especially given how limited maintenance resources are.

xarray does, and I think is a good analog here.

@MarcoGorelli
Copy link
Member

I can chip away at these as I have free time.

Awesome!

I think the bigger problem is that there is no longer an active champion following up on this within the core team

Yeah if someone's willing to step up and champion it (like it looks Tom might be doing?) then I have no objections to salvaging this, apologies for having made some too heavy-handed comments earlier on this

@chourmo
Copy link

chourmo commented Apr 2, 2023

An example of attrs use is one of my little personal projects : https://github.com/chourmo/netpandas
It uses it to keep track of the column name with ('from' 'to') to represent a graph structure in a standard dataframe.
This is just an example, do not make a decision just based on my use case :)

@theOehrly
Copy link
Contributor

Another project that subclasses pandas and uses _metadata is https://github.com/theOehrly/Fast-F1.
I have previously worked on some of the missing __finalize__ calls and tests. If you decide that you want to keep this, I can likely offer some time to work on this as well over the next few weeks if you want to get this to a fully working state then.

@jorisvandenbossche
Copy link
Member

@chourmo @theOehrly thanks a lot for chiming in! That's useful feedback, and it's good to see real-world examples so we can better evaluate this.

Another project that subclasses pandas and uses _metadata is https://github.com/theOehrly/Fast-F1.

@theOehrly I know you are aware of it, but for the general reader, the issue about subclasses/_metadata is at #51280

@TomAugspurger
Copy link
Contributor

Yeah if someone's willing to step up and champion it (like it looks Tom might be doing?)

Champion might be a bit strong :) It'll just be an hour or so on random weekend mornings.

@ssweber
Copy link

ssweber commented Apr 17, 2023

Another “using it!” chime.

Our library just converted to using dataframes for ResultSets.

attrs will store things like asc/desc sort order, if a inserted row is “virtual” (unsaved to db), etc.

@jorisvandenbossche jorisvandenbossche mentioned this issue May 2, 2023
5 tasks
@topper-123
Copy link
Contributor

I like the having a fixed location where users can store their own meta data. But at the same time I think that __finalize__ is not a nice concept, it's brittle.

If we can't make the propagation work, I'd be in favor of keeping attrs but then drop the propagation. There is IMO still value in having a location to put user data related to the dataframe.

@ssweber
Copy link

ssweber commented May 5, 2023

Update: After implementing and using, we only had to reattach attrs once, and it makes sense:

attrs = self.rows.attrs.copy()
row_series = pd.Series(row)
self.rows = pd.concat([self.rows, row_series.to_frame().T], ignore_index=True)
self.rows.attrs = attrs

@topper-123
Copy link
Contributor

can we say we agree that we deprecate giving attrs to __finalize__ but keep the attribute otherwise, i.e. make this a much simpler implementation?

@timhoffm
Copy link
Contributor

timhoffm commented Aug 2, 2023

Another user here. We use attrs to carry along additional metadata e.g. for timeseries.


Comments on above suggestions concerning removal:

As for what users should do - I'd suggest they define their own dataclass where one field is metadata and another is the dataframe, and then take care of how to propagate it themselves

This is quite inconvenient. You loose a lot of API. For example I can currently do (s1 + s2) / 2. For a dataframe or custom class one would have to reimplement all this. Also using the data will become more verbose: s1.dropna() will become s1.data.dropna() (or similar). If you mainly use the data and only rarely the attrs, sprinkling in .data everywhere not very readable or user-friendly.

can we say we agree that we deprecate giving attrs to finalize but keep the attribute otherwise, i.e. make this a much simpler implementation?

If I understand correctly, not handling attrs in __finalize__ would mean no propagation. The great value of attrs is that it's propagated. Without that usefulness is reduced by 90% because almost all operations create new series/dataframes and immediately loose all the user-set attrs, e.g. just doing a simple dropna() and your information is gone.


Comments on maintaining attrs

There are two aspects:

  1. Performance: As long as a __finalize__ concept exists, attrs propagation can be handled in there with almost zero overhead. We're essentially only doing a dict update, which (in the empty case) costs some tens of nanoseconds and is on the same order as a function call. There's likely also some micro-optimization potential here if needed.

  2. Behavior: What should the behavior be when multiple objects are involved (e.g. s1 + s2 or concat())? One should specify the desired behavoir (maybe just document the current one as it seems to work good enough for most people?). There may still be places without __finalize__. These can be fixed as we go. Or one could also document that "many operations will maintain attrs, but not all" - This relieves the project from the need to handle every non-working case as a bug, and at least I could live with such a weak guarantee. Again, in practice this seems to work mostly good enough already.

I'd be happy to go into discussion what's needed to keep propagated attrs around, and possibly could help out with some work here and there.

@MarcoGorelli
Copy link
Member

Many thanks @timhoffm for your comment!

I'm gonna reverse my previous stance then, it's really not too big of a deal to keep it. Furthermore, since I made my original comment, there have been PRs merged to improve attrs propagation

@scott-arne
Copy link

scott-arne commented Dec 23, 2023

I'd find it a pretty significant loss of functionality if attrs went away, especially Series-based attrs. Here are just a few ways that it is being used in several of my packages:

  1. To store custom custom rendering options that are set in conjunction with the register_series_accessor decorator. For example, I register a series accessor called highlight that lets users highlight chemical substructures in DataFrames in Jupyter. For lack of any better place to put that, I'm using attrs, since (until a recent change, see below) that data was propagated when the DataFrame was copied but it did not matter to me if it persisted on serialization.

  2. As the probably widest reported use case, for units. In other packages, I register custom accessors that retrieve data from our database (e.g., df.get_toxicity_data(["List", "of", "experiments"])). I bring back those units and attach them as Series metadata. Even though this would actually be nice to persist, I'm ok with letting my users make sure they standardize units as they wish before they serialize their DataFrames for downstream processing.

I'd be supportive of a _metadata-like approach for Series (as seems to now be documented for DataFrames). But getting rid of any way to store Series-level DataFrame would really make things sticky. I'm not a big fan of transparently returning a DataFrame with subclassed Series (I guess?) just to keep the metadata in attrs from disappearing.

Note from above:

The "see below" about the recent change is that Series attrs now disappear just when calling DataFrame.head() or DataFrame.tail(). I'm mentioning this in another open issue - but maybe worth mentioning here just to comment on the current state of attrs.

df = pd.DataFrame({"MySeries": [1, 2, 3]})
df.MySeries.attrs["metadata"] = "this is important"

# This prints an empty dictionary: {}
print(df.head().MySeries.attrs)

# This still prints the attrs: {'metadata': 'this is important'}
print(df.MySeries.attrs)

@TomAugspurger TomAugspurger mentioned this issue Dec 27, 2023
5 tasks
@rhshadrach
Copy link
Member

rhshadrach commented Feb 6, 2024

It seems that Copy-On-Write removes some of the utility of attrs. Is there a way to set attrs on a column of a DataFrame?

pd.set_option("mode.copy_on_write", True)

df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]})

df["a"].attrs["name"] = "x"
print(df["a"].attrs)
# {}

df.loc[:, "a"].attrs["name"] = "x"
print(df["a"].attrs)
# {}

ser = df["a"]
ser.attrs["name"] = 'x'
df["c"] = ser
print(df["c"].attrs)
# {}

I imagine the 3rd example can be made to work with CoW, but not the 1st and 2nd. cc @phofl

@phofl
Copy link
Member

phofl commented Feb 6, 2024

Yeah I think your conclusion is correct

@timhoffm
Copy link
Contributor

timhoffm commented Feb 6, 2024

IMHO the first two should error out. Setting attrs is a write operation, but we certainly don't want this to make a copy of the dataframe. Furthermore, attrs is a global property of the dataframe and modifying that through a partial view may be confusing. So the only reasonable behavior is to not allow setting attrs on views.

@phofl
Copy link
Member

phofl commented Feb 6, 2024

IMHO the first two should error out.

Yeah I think I agree, it is basically another version of chained assignment

@tfardet
Copy link

tfardet commented Jun 1, 2024

Am I correct in understanding that the consensus is now that attrs is here to stay?
I also use it and would like to increase my reliance on it (provided it doesn't go away).

If it's there to stay, would it be OK to remove the experimental warning in the doc and instead specify when it is not propagated?

@timhoffm
Copy link
Contributor

timhoffm commented Jun 4, 2024

Not a pandas core dev, but my take on this is that it's aspirational to support attrs. However, there are still some rough edges in particular also in connection with copy-on-write, so that it's not a first-class feature yet.

I would characterize it as:

  • Is here to stay.
  • Works reasonably well, but be prepared for some limitations / bugs.
  • Should gradually improve in the long run.

@LucaMarconato
Copy link

I want to add one item to the list of projects using .attrs 😊

In the spatialdata library (a framework developed for spatial biology, kind of an extension of microscopy which looks at molecular content in tissues), we use .attrs to store metadata (simple objects that are JSON-serializable) that is crucial for our data representation.

More precisely, the SpatialData class is a container for various objects: pd.DataFrame, xarray.DataArray, datatree.DataTree, geopandas.GeoDataFrame, dask.dataframe.DataFrame, anndata.AnnData and all these objects provide a .attrs (AnnData provides .uns, which works mostly as .attrs).

I discovered this discussion because recently dask.dataframe.DataFrame dropped the support for .attrs dask/dask#11146 and the whole library broke. I am now working on a PR to try to restore that functionality. I am happy to read that from this conversation that .attrs seems to be here to stay! 💯

@akors
Copy link

akors commented Nov 7, 2024

Not a pandas core dev, but my take on this is that it's aspirational to support attrs. However, there are still some rough edges in particular also in connection with copy-on-write, so that it's not a first-class feature yet.

I would characterize it as:

* Is here to stay.
* Works reasonably well, but be prepared for some limitations / bugs.
* Should gradually improve in the long run.

Could we have confirmation by a Pandas core dev that attrs are staying? For example by formally closing all Issues/PR's associated with the deprecation?

I would like to start using .attrs for tracking information about data generation and sources, and I'd love to attach them directly to the DataFrame instead of having to maintain extra data / files or using hacky workarounds suggested on StackOverflow such as adding Python attributes to the DataFrame objects.

@jobh
Copy link

jobh commented Nov 18, 2024

I've also used .attrs in several projects, and find it very useful — for units, provenance, dirty-state etc. I don't mind the support being incomplete. Two small additions I'd suggest to make the incomplete support feel more... complete? are:

  1. If there isn't already, put a terse list in the documentation of which operations support it, and how (by shallow copy, by reference if any).
  2. Add a .set_attrs() method that allows one-liners like return pd.concat(dfs).set_attrs(dfs[0].attrs)

@timhoffm
Copy link
Contributor

  1. If there isn't already, put a terse list in the documentation of which operations support it, and how (by shallow copy, by reference if any).

The Notes section in https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.attrs.html#pandas.DataFrame.attrs is the best we have. It's a user-facing paraphrasing of the implementation: attrs handling is done in __finalize__, which most operations use (but there may be exceptions, which is why it may be hard to make a definitive list). pd.concat has defensive special-casing because the dataframe is created from multiple inputs. Copies are always deep.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Deprecate Functionality to remove in pandas Needs Discussion Requires discussion from core team before further action
Projects
None yet
Development

No branches or pull requests