Various methods don't call __finalize__ #28283

TomAugspurger · 2019-09-04T12:10:47Z

Improve coverage of NDFrame.__finalize__

Pandas uses NDFrame.__finalize__ to propagate metadata from one NDFrame to
another. This ensures that things like self.attrs and self.flags are not
lost. In general we would like that any operation that accepts one or more
NDFrames and returns an NDFrame should propagate metadata by calling
__finalize__.
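As a rough illustration of what propagation means here, the sketch below calls `__finalize__` by hand on a freshly built frame. It assumes a pandas version that provides `DataFrame.attrs` (1.0 or later); the column names and the `source` key are made up for the example.

```python
import pandas as pd

# Source frame carrying user metadata in .attrs.
df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df.attrs["source"] = "file.csv"

# A result built from raw values starts out with empty attrs ...
raw = pd.DataFrame(df.to_numpy() * 2, columns=df.columns)
attrs_before = dict(raw.attrs)

# ... until __finalize__ copies the metadata over from `df`.
result = raw.__finalize__(df)
attrs_after = dict(result.attrs)
```

A method that builds its result this way and skips the final step silently drops `attrs`, which is the kind of gap this issue tracks.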

The test file at
https://github.com/pandas-dev/pandas/blob/master/pandas/tests/generic/test_finalize.py
attempts to be an exhaustive suite of tests for all these cases. However there
are many tests currently xfailing, and there are likely many APIs not covered.

This is a meta-issue to improve the use of __finalize__. Here's a hopefully
accurate list of methods that don't currently call __finalize__.

Some general comments around __finalize__

- We don't have a good sense for what should happen to attrs when there are
  multiple NDFrames involved with differing attrs (e.g. in concat). The safest
  approach is probably to drop the attrs when they don't match, but this will
  need some thought.
- We need to be mindful of performance. __finalize__ can be somewhat expensive
  so we'd like to call it exactly once per user-facing method. This can be tricky
  for things like DataFrame.apply which is sometimes used internally. We may need
  to refactor some methods to have a user-facing DataFrame.apply that calls an internal
  DataFrame._apply. The internal method would not call __finalize__, just the
  user-facing DataFrame.apply would.
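The public/internal split described in the second point could look roughly like this. It is a toy sketch in plain Python, not pandas code: `Frame`, `finalize_calls`, and `double_then_increment` are hypothetical names used only to show that chained internal steps can still finalize exactly once.

```python
class Frame:
    """Toy stand-in for an NDFrame; not pandas' real class."""

    finalize_calls = 0  # counts how often metadata was propagated

    def __init__(self, values, attrs=None):
        self.values = list(values)
        self.attrs = dict(attrs or {})

    def __finalize__(self, other):
        # Copy metadata from `other` and record the (potentially costly) call.
        Frame.finalize_calls += 1
        self.attrs = dict(other.attrs)
        return self

    def _apply(self, func):
        # Internal helper: no metadata propagation, safe to call repeatedly.
        return Frame(func(v) for v in self.values)

    def apply(self, func):
        # User-facing entry point: propagate metadata exactly once.
        return self._apply(func).__finalize__(self)

    def double_then_increment(self):
        # Chains two internal steps but still finalizes only once.
        step = self._apply(lambda v: v * 2)._apply(lambda v: v + 1)
        return step.__finalize__(self)


frame = Frame([1, 2, 3], attrs={"source": "file.csv"})
out = frame.double_then_increment()
```

Had `double_then_increment` gone through the public `apply` twice, the counter would read 2 for a single user-facing call.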

If you're interested in working on this please post a comment indicating which method
you're working on. Un-xfail the test, then update the method to pass the test. Some of these
will be much more difficult to work on than others (e.g. groupby is going to be difficult). If you're
unsure whether a particular method is likely to be difficult, ask first.


TomAugspurger · 2019-09-04T21:07:39Z

How __finalize__ works with methods like concat requires a bit of discussion. How do we reconcile metadata from multiple sources?

In #27108, I'm using _metadata to propagate whether the NDFrame allows duplicate labels. In this situation, my ideal reconciliation function would be

from typing import List

from pandas import DataFrame


def reconcile_concat(others: List[DataFrame]) -> bool:
    """
    Allow duplicates only if all the inputs allowed them.

    If any disallow them, we disallow them.
    """
    return all(x.allows_duplicates for x in others)

However, that reconciliation strategy isn't valid / proper for arbitrary metadata, which I think argues for some kind of dispatch system for reconciling metadata, where the attribute gets to determine how things are handled.

allows_duplicate_meta = PandasMetadata("allows_duplicates")  # the attribute name

@allows_duplicate_meta.register(pd.concat)  # the method
def reconcile_concat():
    ...

Then we always pass method to NDFrame.__finalize__, which we'll use to look up the function for how to reconcile things. A default reconciliation can be provided.
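To make the dispatch idea concrete, here is a self-contained sketch in plain Python. `PandasMetadata` is the hypothetical class from the snippet above, not an existing pandas API. Keying reconcilers by method name (a string) and using "keep the value only if all inputs agree" as the default are assumptions made for this example, and the inputs are assumed to be non-empty.

```python
from typing import Any, Callable, Dict, List


class PandasMetadata:
    """Hypothetical per-attribute registry of reconciliation functions."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._reconcilers: Dict[str, Callable[[List[Any]], Any]] = {}

    def register(self, method: str):
        # Decorator: use `func` to reconcile this attribute for `method`.
        def decorator(func):
            self._reconcilers[method] = func
            return func
        return decorator

    def default(self, values: List[Any]) -> Any:
        # Conservative fallback: keep the value only if all inputs agree.
        first = values[0]
        return first if all(v == first for v in values[1:]) else None

    def reconcile(self, method: str, values: List[Any]) -> Any:
        # Look up the reconciler for `method`, falling back to the default.
        return self._reconcilers.get(method, self.default)(values)


allows_duplicate_meta = PandasMetadata("allows_duplicates")


@allows_duplicate_meta.register("concat")
def reconcile_concat(values: List[bool]) -> bool:
    # Allow duplicates only if every input allowed them.
    return all(values)


concat_result = allows_duplicate_meta.reconcile("concat", [True, False])
merge_result = allows_duplicate_meta.reconcile("merge", [True, True])
mismatch_result = allows_duplicate_meta.reconcile("merge", [True, False])
```

In this sketch, `concat` uses the registered function while `merge`, which has no registration, falls back to the default and drops the value (returns `None`) when the inputs disagree.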

cc @jbrockmendel, since I think you looked into metadata propagation in another issue.

jbrockmendel · 2019-09-04T22:10:42Z

IIRC I was looking at _metadata to try to implement units (this predated EAs). One of the biggest problems I had was that metadata on a Series didn't behave well when that Series is inserted into a DataFrame.

Do we have an idea of how often _metadata is used in the wild? i.e. could we deprecate it and make an EA-based implementation?

TomAugspurger · 2019-09-05T00:11:21Z

It’s essentially undocumented, so I’m OK with being aggressive here.

What would an EA-based implementation look like? For something like units, metadata may not be appropriate. I think an EA dtype makes more sense.

I’ll add this to the agenda for next week’s call.

jbrockmendel · 2019-09-05T00:24:01Z

> What would an EA-based implementation look like?

It's a bit ambiguous what this question is referring to, but big picture, something like _metadata with dispatch logic could be done along these lines (for this example I'm thinking of units in physics, where (4 meters) / (2 seconds) == 2 m/s):

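from pandas.api.extensions import ExtensionArray, ExtensionDtype


class MyDtype(ExtensionDtype):
    def __mul__(self, other):
        # Accept either a dtype or an array that carries one.
        other = getattr(other, "dtype", other)
        # Placeholder: the dtype describing the combined units.
        return some_new_dtype


class MyEA(ExtensionArray):
    def __mul__(self, other):
        # Multiply the values, then let the dtypes work out the new unit.
        result = self._data * other
        dtype = self.dtype * other
        return type(self)(result, dtype)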
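For readers who want something runnable, the same idea can be sketched without pandas. `UnitDtype` and `UnitArray` below are toy classes invented for this illustration (they do not implement the ExtensionDtype/ExtensionArray interfaces); they only show the dtype carrying the unit algebra for the (4 meters) / (2 seconds) example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class UnitDtype:
    """Toy dtype: a unit stored as sorted (name, exponent) pairs."""

    exponents: Tuple[Tuple[str, int], ...]

    def _combine(self, other: "UnitDtype", sign: int) -> "UnitDtype":
        merged = dict(self.exponents)
        for name, power in other.exponents:
            merged[name] = merged.get(name, 0) + sign * power
        # Drop units that cancelled out.
        kept = sorted((n, p) for n, p in merged.items() if p != 0)
        return UnitDtype(tuple(kept))

    def __mul__(self, other):
        # Accept either a dtype or an array that carries one.
        return self._combine(getattr(other, "dtype", other), +1)

    def __truediv__(self, other):
        return self._combine(getattr(other, "dtype", other), -1)


@dataclass(frozen=True)
class UnitArray:
    """Toy array: plain values plus the dtype describing their unit."""

    data: Tuple[float, ...]
    dtype: UnitDtype

    def __truediv__(self, other: "UnitArray") -> "UnitArray":
        # Divide the values, then let the dtypes work out the new unit.
        values = tuple(a / b for a, b in zip(self.data, other.data))
        return UnitArray(values, self.dtype / other)


meters = UnitDtype((("m", 1),))
seconds = UnitDtype((("s", 1),))
speed = UnitArray((4.0,), meters) / UnitArray((2.0,), seconds)
```

Here the result's unit comes entirely from the dtype arithmetic, so nothing needs to be propagated through a separate metadata channel.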

TomAugspurger · 2019-09-05T01:42:55Z

Right. Does the current EA interface suffice for that use case, or are there additional hooks needed?

TomAugspurger · 2019-11-12T17:14:19Z

Not a blocker for 1.0.

Progress towards pandas-dev#28283. This calls `finalize` for all the public series methods where I think it makes sense.

TomAugspurger · 2020-04-06T16:21:15Z

Do people think that accessor methods like .str and .dt should call __finalize__? IMO, yes they should.

jorisvandenbossche · 2020-04-06T17:48:36Z

But would you then handle name propagation there?

TomAugspurger · 2020-04-06T18:06:51Z

Name propagation isn't (currently) handled in __finalize__. I don't think it should be handled there currently, since the current __finalize__ isn't well suited to resolving the name when there are multiple inputs. In the future it might make sense.

My two motivating use-cases here are

- My allow_duplicate_labels PR, for disallowing duplicates
- A workload that preserves something like .attrs["source"] = "file.csv" through as many operations as makes sense.

jorisvandenbossche · 2020-04-06T18:14:11Z

> Name propagation isn't (currently) handled in __finalize__.

It actually is, no? At least in some cases (e.g. for a new Series originating from another Series, where other is that Series).

TomAugspurger · 2020-04-06T18:59:36Z

Apologies, I forgot that name was in _metadata. So yes, name handling could be handled there.

TomAugspurger · 2020-04-07T14:17:40Z

When should we call __finalize__? A high-level list:

Yes

- Reshape operations (stack, unstack, reset_index, set_index, to_frame)
- Indexing (take, loc, iloc, __getitem__, reindex, drop, assign, select_dtypes, nlargest?)
- "transforms" (repeat, explode, shift, diff, round, isin, fillna, isna, dropna, copy, rename, applymap, .transform, sort_values, sort_index)
- Accessor methods returning Series (.str, .dt, .cat)
- Binary ops (arithmetic, comparison, combine)
- ufuncs
- concat / merge / joins, append
- cumulative aggregations (cumsum)?

Unsure

- Reductions (DataFrame.sum(), etc., count, quantile, idxmin)
- groupby, pivot, pivot_table?
- DataFrame.apply?
- corr, cov, etc.

These are somewhat arbitrary. I can't really come up with a rule why a reduction like DataFrame.sum() shouldn't call __finalize__ while DataFrame.cumsum does. So perhaps the rule should be "any NDFrame method returning an NDFrame (or Tuple[NDFrame]) will call __finalize__". I like the simplicity of that.

Progress towards pandas-dev#28283. This adds tests that ensure `NDFrame.__finalize__` is called in more places. Thus far I've added tests for anything that meets the following rule:

> Pandas calls `NDFrame.__finalize__` on any NDFrame method that returns
> another NDFrame.

I think that given the generality of `__finalize__`, making any kind of list of which methods should call it is going to be somewhat arbitrary. That rule errs on the side of calling it too often, which I think is my preference.

TomAugspurger · 2020-09-03T13:58:12Z

I've updated the original post. Hopefully we can find some contributors interested in working on getting __finalize__ called in more places.

Sadin · 2020-09-03T14:12:56Z

@TomAugspurger I would be interested in contributing to pandas and start by helping to tackle some of these methods. Which methods might be good places to start?

theOehrly · 2022-06-02T15:49:00Z

.quantile can be ticked off as well. Done as of #47183

covertg · 2022-06-08T03:39:23Z

Hi, I'd be happy to help tick some of the boxes off here. Would love to see attrs move out of its experimental status. However, I'm new to contributing to pandas and would appreciate suggestions on which methods to start with.

hamedgibago · 2022-06-18T19:04:38Z

Hi, I want to take my first issue and I'm a new contributor to pandas. I'd like to take DataFrame.count as my first. Is it a good place to start? If not, please suggest another one. Thanks.

SomtochiUmeh · 2022-07-20T04:25:00Z

take

SomtochiUmeh · 2022-07-20T04:25:27Z

Starting with DataFrame.idxmax() and DataFrame.idxmin()


seljaks · 2022-09-07T11:35:00Z

Hi, I'm new to open source and I'm interested in contributing to this issue. I'd like to start with DataFrame.add and if that goes well move on to sub, mul, and div

yuanx749 · 2022-09-18T08:20:53Z

Worked on corr and cov.
Skip corrwith for the time being since it involves another NDFrame.

bobzhang-stack · 2023-04-01T04:12:17Z

Hello,
I'm new to contributing to open source and am interested in doing one of the tasks for this issue.
@TomAugspurger Is there any particular method (among those that still need the __finalize__ fix) that you would recommend I look into? Is pop alright to start with (it seemed like someone was working on it, but I didn't see a pull request resolving it)?

TomAugspurger · 2023-04-01T13:40:46Z

@bobzhang-stack it looks like pop might be done.

In [74]: df = pd.DataFrame({"A": [1, 2], "B": [1, 2]})

In [75]: df.attrs['a'] = 1

In [76]: df.pop("B").attrs
Out[76]: {'a': 1}

In [77]: df.attrs
Out[77]: {'a': 1}

It seems the majority of the remaining ones are related to operations between multiple objects with attrs (#49916). Aside from that, there's .eval with engine="numexpr". I'm not sure how hard that would be.

KartikeyBartwal · 2023-10-01T19:22:08Z

Hi I'm starting to work on DataFrame.merge


Frostbyte72 · 2024-11-22T16:11:51Z

Would it be possible for me to be assigned to df.merge() to attempt a fix?

TomAugspurger added this to the 1.0 milestone Sep 4, 2019

TomAugspurger added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Sep 4, 2019

TomAugspurger changed the title ~~DataFrame.reset_index doesn't call __finalize__~~ Various methods don't call call __finalize__ Sep 4, 2019

TomAugspurger modified the milestones: 1.0, Contributions Welcome Nov 12, 2019

TomAugspurger mentioned this issue Apr 6, 2020

Series .attrs is not correctly maintained/propagated in to_frame #31452

Closed

TomAugspurger added the metadata _metadata, .attrs label Apr 6, 2020

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Apr 6, 2020

API: Call finalize in more Series methods [WIP]

971f62b

Progress towards pandas-dev#28283. This calls `finalize` for all the public series methods where I think it makes sense.

TomAugspurger mentioned this issue Apr 7, 2020

API/TST: Call __finalize__ in more places #33379

Merged

TomAugspurger mentioned this issue May 14, 2020

BUG: Subclassed DataFrame doesn't persist _metadata properties across binary operations #34177

Open

3 tasks

RobertRosca mentioned this issue Jul 27, 2020

BUG: attrs lost for Series in DataFrame #35425

Open

3 tasks

raphaelvallat mentioned this issue Jul 30, 2020

Return X/Xw arrays in linear regression output? raphaelvallat/pingouin#112

Closed

TomAugspurger added the good first issue label Sep 3, 2020

ajcost mentioned this issue Jun 5, 2022

ENH: Dataframe metadata preservation on join operation #47238

Closed

github-actions bot assigned SomtochiUmeh Jul 20, 2022

SomtochiUmeh added a commit to SomtochiUmeh/pandas that referenced this issue Jul 22, 2022

Fixed metadata propagation in Dataframe.idxmax and Dataframe.idxmin p…

05fdeb3

…andas-dev#28283

SomtochiUmeh mentioned this issue Jul 22, 2022

Fixed metadata propagation in Dataframe.idxmax and Dataframe.idxmin #… #47821

Merged

1 task

seljaks mentioned this issue Sep 14, 2022

ENH: added finalize to binary operators on DataFrame, GH28283 #48551

Merged

4 tasks

yuanx749 mentioned this issue Sep 18, 2022

BUG: Fix metadata propagation in df.corr and df.cov, GH28283 #48616

Merged

4 tasks

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022

phofl mentioned this issue Oct 14, 2022

QST: When will attrs be not experimental? #49085

Closed

2 tasks

This was referenced Mar 27, 2023

DEPR: __finalize__ and _metadata #51280

Open

DEPR: attrs #52166

Open

jinlixiao mentioned this issue Apr 24, 2023

Call __finalize__ in Dataframe.combine and Dataframe.combine_first #52886

Merged

2 tasks

xiaoxiaoimg pushed a commit to xiaoxiaoimg/pandas that referenced this issue May 15, 2024

Fix for issue pandas-dev#28283: Ensure DataFrame.eval calls __finalize__

2895b9b

xiaoxiaoimg pushed a commit to xiaoxiaoimg/pandas that referenced this issue May 16, 2024

Fix for issue pandas-dev#28283: Ensure __finalize__ propagates metada…

1e283bf

…ta correctly
