DEPR: attrs #52166
Comments
For context, this feature was added in pandas 1.0 (#29062, cc @TomAugspurger). I personally have no idea how much
(from the last two issues, it seems there is certainly some user interest in the specific |
From Joris in #51280 (comment)
That's been bugging me too. I haven't looked at the performance, but copying the metadata should just be a dictionary merge / update. At the end of the day we'll be making a value judgement: is the performance cost worth it. We'll need a clearer idea of performance cost. |
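To make the "dictionary merge / update" point concrete, here is a minimal pure-Python sketch of what copying metadata onto a result object amounts to. `Result` and `propagate_attrs` are hypothetical names used only for illustration; this is not the pandas implementation. Whether the real copy is shallow or deep is an assumption here.

```python
import copy
import timeit


class Result:
    """Hypothetical stand-in for a Series/DataFrame-like result object."""

    def __init__(self):
        self.attrs = {}


def propagate_attrs(result, source):
    """Copy the source's metadata dict onto the result (illustration only)."""
    # Assumption: a deep copy, so that mutating the result's attrs
    # cannot alter the source's attrs.
    result.attrs = copy.deepcopy(source.attrs)
    return result


source = Result()
source.attrs = {"units": "m/s", "origin": "sensor-7"}

derived = propagate_attrs(Result(), source)
derived.attrs["units"] = "km/h"  # does not leak back into source.attrs

# Rough per-call cost of the copy; the figure depends on machine and dict size.
runs = 10_000
per_call = timeit.timeit(lambda: propagate_attrs(Result(), source), number=runs) / runs
print(f"{per_call * 1e6:.2f} microseconds per propagation")
```

A single copy is cheap, so the question raised in this thread is about the cumulative cost when such a copy runs on nearly every operation.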
The other argument is that attrs/_metadata is only half-implemented, with a bunch of the test_finalize tests xfailed and a bunch more just wrong. And there is no real prospect of getting these fully working. If we do decide this is worth keeping, we should have Only One way to do it. _metadata and attrs do effectively the same thing in slightly different ways. |
That is not really true I think. |
Personally, this is the argument I find most persuasive. I encountered this in the USC contract too: they said they couldn't use One could make the argument that some feature not working completely isn't a reason to deprecate it, but I'm not sure that's valid if the feature isn't being worked on (by contrast, datetime parsing has bugs, but it's actively being worked on, so the prospect of fixing them is realistic). As for users wanting to store metadata: does any other DataFrame library support this? If not, we shouldn't be saying "yes" to everything, especially given how limited maintenance resources are. As for what users should do: I'd suggest they define their own dataclass where one field is metadata and another is the dataframe, and then take care of how to propagate it themselves |
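The dataclass suggestion above could look roughly like the following. This is a minimal sketch, not a recommended design: `FrameWithMeta`, its `apply` helper, and the `units` key are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

import pandas as pd


@dataclass
class FrameWithMeta:
    """Hypothetical container: a DataFrame plus user-managed metadata."""

    frame: pd.DataFrame
    metadata: Dict[str, Any] = field(default_factory=dict)

    def apply(self, func, *args, **kwargs):
        """Run func on the frame and carry the metadata over explicitly."""
        result = func(self.frame, *args, **kwargs)
        # Shallow-copy the dict so the new wrapper does not share it.
        return FrameWithMeta(frame=result, metadata=dict(self.metadata))


wrapped = FrameWithMeta(
    frame=pd.DataFrame({"speed": [1.0, 2.0, 3.0]}),
    metadata={"units": "m/s"},
)

# Propagation is the caller's decision, one operation at a time.
top = wrapped.apply(pd.DataFrame.head, 2)
print(top.metadata)
print(len(top.frame))
```

The trade-off is that DataFrame methods are no longer available directly on the wrapper; every operation has to go through `frame` or a helper such as `apply`.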
To be clear, I am not working on this myself, so I don't know the details. But I am not sure it is true that it is not being worked on: judging by the activity and linked PRs in #28283, there is some work going on to improve this? (It might have slowed down in the last months, but generally speaking, for the year 2022, quite a few PRs related to this have been merged.) I think the bigger problem is that there is no longer an active champion following up on this within the core team |
I can chip away at these as I have free time.
xarray does, and I think is a good analog here. |
Awesome!
Yeah, if someone's willing to step up and champion it (as it looks like Tom might be doing?) then I have no objections to salvaging this. Apologies for having made some overly heavy-handed comments earlier on this |
An example of attrs use is one of my little personal projects: https://github.com/chourmo/netpandas |
Another project that subclasses pandas and uses _metadata is https://github.com/theOehrly/Fast-F1. |
@chourmo @theOehrly thanks a lot for chiming in! That's useful feedback, and it's good to see real-world examples so we can better evaluate this.
@theOehrly I know you are aware of it, but for the general reader, the issue about subclasses/ |
Champion might be a bit strong :) It'll just be an hour or so on random weekend mornings. |
Another “using it!” chime. Our library just converted to using dataframes for ResultSets. attrs will store things like asc/desc sort order, whether an inserted row is “virtual” (unsaved to db), etc. |
I like having a fixed location where users can store their own metadata. But at the same time I think that If we can't make the propagation work, I'd be in favor of keeping |
Update: After implementing and using, we only had to reattach attrs once, and it makes sense:
attrs = self.rows.attrs.copy()
row_series = pd.Series(row)
self.rows = pd.concat([self.rows, row_series.to_frame().T], ignore_index=True)
self.rows.attrs = attrs |
can we say we agree that we deprecate giving |
Another user here. We use
Comments on above suggestions concerning removal:
This is quite inconvenient. You lose a lot of API. For example I can currently do
If I understand correctly, not handling attrs in
Comments on maintaining attrs
There are two aspects:
I'd be happy to go into discussion what's needed to keep propagated attrs around, and possibly could help out with some work here and there. |
Many thanks @timhoffm for your comment! I'm gonna reverse my previous stance then, it's really not too big of a deal to keep it. Furthermore, since I made my original comment, there have been PRs merged to improve attrs propagation |
I'd find it a pretty significant loss of functionality if attrs went away, especially Series-based attrs. Here are just a few ways that it is being used in several of my packages:
I'd be supportive of a
Note from above: The "see below" about the recent change is that Series attrs now disappear just when calling
df = pd.DataFrame({"MySeries": [1, 2, 3]})
df.MySeries.attrs["metadata"] = "this is important"
# This prints an empty dictionary: {}
print(df.head().MySeries.attrs)
# This still prints the attrs: {'metadata': 'this is important'}
print(df.MySeries.attrs) |
It seems that Copy-On-Write removes some of the utility of attrs. Is there a way to set attrs on a column of a DataFrame?
I imagine the 3rd example can be made to work with CoW, but not the 1st and 2nd. cc @phofl |
Yeah I think your conclusion is correct |
IMHO the first two should error out. Setting attrs is a write operation, but we certainly don't want this to make a copy of the dataframe. Furthermore, attrs is a global property of the dataframe and modifying that through a partial view may be confusing. So the only reasonable behavior is to not allow setting attrs on views. |
Yeah I think I agree, it is basically another version of chained assignment |
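One way to avoid writing through a column object is to keep per-column notes in the frame's own attrs, keyed by column name. The sketch below assumes a made-up convention: the `column_metadata` key and the `column_note` helper are hypothetical, not pandas API. Whether the dict survives later operations depends on how attrs propagation behaves in the installed pandas version.

```python
import pandas as pd

df = pd.DataFrame({"MySeries": [1, 2, 3]})

# Write to the frame's own attrs instead of to a column object obtained
# from it. The "column_metadata" key is a hypothetical convention.
df.attrs["column_metadata"] = {"MySeries": "this is important"}


def column_note(frame, column):
    """Look up the note stored for a column, or None if there is none."""
    return frame.attrs.get("column_metadata", {}).get(column)


print(column_note(df, "MySeries"))
print(column_note(df, "Other"))
```

The write happens once, on the object that owns the metadata, so it never depends on whether a column access returns a view, a cached object, or a fresh copy.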
Am I correct in understanding that the consensus is now that If it's there to stay, would it be OK to remove the experimental warning in the doc and instead specify when it is not propagated? |
Not a pandas core dev, but my take on this is that it's aspirational to support I would characterize it as:
|
I want to add one item to the list of projects using In the More precisely, the I discovered this discussion because recently |
Could we have confirmation by a Pandas core dev that I would like to start using |
I've also used
|
The Notes section in https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.attrs.html#pandas.DataFrame.attrs is the best we have. It's a user-facing paraphrasing of the implementation: attrs handling is done in |
Discussion broken off from #51280
PR #52152
Propagation of attrs in __finalize__ is a small-but-everywhere performance hit that we should deprecate.