API: how strict should the equals() method be? #33940

jorisvandenbossche · 2020-05-02T08:36:29Z

While adding an equals() method to ExtensionArray (#27081, #30652), some questions come up about how strict the method should be:

Other array-likes are not equivalent, even if they are all equal?
But subclasses are equivalent, when they are all equal?
Objects with different dtype are not equivalent (eg int8 vs int16), even if all values are equal?

And it seems that right now, we are somewhat inconsistent with this in pandas regarding being strict on the data type.

Series is strict about the dtype, while Index is not:

>>> pd.Series([1, 2, 3], dtype="int64").equals(pd.Series([1, 2, 3], dtype="int32"))
False

>>> pd.Index([1, 2, 3], dtype="int64").equals(pd.Index([1, 2, 3], dtype="int32"))
True

For Index, this not only gives True for different integer dtypes as above, but also for float/int, object/int (both examples give False for Series):

>>> pd.Index([1, 2, 3], dtype="int64").equals(pd.Index([1, 2, 3], dtype="object"))
True

>>> pd.Index([1, 2, 3], dtype="int64").equals(pd.Index([1, 2, 3], dtype="float64"))
True

Index and Series are consistent when it comes to not being equal with other array-likes:

# all those cases return False
pd.Series([1, 2, 3]).equals(np.array([1, 2, 3])) 
pd.Index([1, 2, 3]).equals(np.array([1, 2, 3])) 
pd.Series([1, 2, 3]).equals(pd.Index([1, 2, 3])) 
pd.Index([1, 2, 3]).equals(pd.Series([1, 2, 3]))
pd.Series([1, 2, 3]).equals([1, 2, 3])
pd.Index([1, 2, 3]).equals([1, 2, 3])

Both Index and Series also seem to allow subclasses:

class MySeries(pd.Series): 
    pass 

>>> pd.Series([1, 2, 3]).equals(MySeries([1, 2, 3]))
True

So in the end, I think the main discussion point is: should the dtype be exactly the same, or should only the values be equal?

For DataFrame, it shares the implementation with Series so follows that behaviour (except that for DataFrame there are some additional rules about how column names need to compare equal).

The text was updated successfully, but these errors were encountered:

dsaxton · 2020-05-03T18:01:44Z

It feels to me like when calling equals on some container you'd generally be interested in equality of the entire container not just the values, otherwise you'd use == (maybe modulo some differences around missing values). You might argue that even this behavior is questionable:

[ins] In [11]: pd.Series([1, 2, 3], name="abc").equals(pd.Series([1, 2, 3], name="xyz"))
Out[11]: True

jorisvandenbossche · 2020-05-04T07:53:57Z

If you want equality of the entire container, which I assume you typically do in testing (?), you also have the assert_.. functions to use.

On the other hand, in non-testing code, you might be more be interested in equal values (this is just guessing, I don't really know how people would like to use this).
For example, if you want to check if two columns contain the same elements (df['a'].equals(df['b'])), it might be annoying to have to rename first before being able to call equals.

otherwise you'd use == (maybe modulo some differences around missing values).

You could say that equals mainly exists to handle this "modulo" clause .. ? (but again, hard to say how it is exactly used)

jorisvandenbossche · 2020-05-04T07:55:13Z

Regarding use cases, it might already be interesting to check how this method is used internally (as the reason I started looking at this is because I actually needed EA.equals to implement Index.equals when storing EAs in the Index).

attack68 · 2020-09-02T14:29:52Z

Just to add to this from #36065 the testing mod equals also behave inconsistently:

import pandas as pd
import pandas._testing as tm

i1 = pd.date_range("2008-01-01", periods=1000, freq="12H")
i2 = pd.date_range("2008-01-01", periods=1000, freq="12H")
i2.freq = None

i1.equals(i2)  # True
tm.assert_index_equal(i1, i2)  # True

df = pd.DataFrame({'dates': i1})
df2 = pd.DataFrame({'dates': i2})

df.equals(df2)  # True
tm.assert_frame_equal(df, df2)  # True

df.index = i1
df2.index = i2

df.equals(df2)  #True
tm.assert_frame_equal(df, df2)  # FAILS lidx.freq != ridx.freq

The reason being that for assert_series_equals after checking the index it checks the frequency of the index. A very easy change would be to add the freq check to the assert_index_equals with argument and remove it from the assert_series_equals.

YarShev · 2020-09-02T16:17:31Z

@attack68 , it's a good point!

jbrockmendel · 2020-09-23T20:32:57Z

I think this is at the heart of #33531

jbrockmendel · 2020-09-28T21:42:09Z

Now that EA has a .equals method, we should consider having both Index.equal and Series.equals wrap self.array.equals to ensure consistent behavior. e.g. ATM we have

dti = pd.date_range("2016-01-01", periods=3)
ser = pd.Series(dti)

>>> ser.equals(pd.Series(dti.asi8))
False

>>> dti.equals(pd.Index(dti.asi8))
True

jbrockmendel · 2021-11-01T23:27:42Z

For Index.equals, I think the main internal usage is intended as a check on whether we can short-circuit/fastpath setops/get_indexer/reindex/join. Those would benefit from making equals stricter, particularly for DTI/TDI/PI.

jorisvandenbossche · 2021-11-26T10:20:12Z

I understand that for internal use the strict version might be more useful, but for external use cases, the loose version might be more useful if you care about equal values (and not eg int32 vs int64). In such a case, this equals method is mostly useful as a shortcut for (obj1 == obj2).all() that also takes care of ignoring NAs at the same location.

Now, whatever option we would like to choose long-term as default, do we want to add a keyword to control this "strictness"? (eg like check_dtype or equal_dtype)

ronanpaixao · 2022-06-29T13:49:47Z

Please, also extend the consideration on differences to the compare() method. Current behavior, for example on handling NaN versus <NA>, is confusing (tested with Pandas version 1.4.2):

>>> df1 = pd.DataFrame([pd.NA])
>>> df2 = pd.DataFrame([np.nan])
>>> df1.equals(df2)
False
>>> df1.compare(df2)
Empty DataFrame
Columns: []
Index: []

batterseapower · 2023-10-26T23:23:51Z

When I compare two indexes "a" and "b" using equals I'm usually trying to answer the question "will pd.Series(index=a).loc[b] and pd.Series(index=b).loc[a] both work?". Since Pandas generally converts dtypes in .loc I would not expect Index.equal to consider dtypes, and empirically this is how Index.equals usually works. I raised #55694 because it seems to be an exception to this general rule.

rhshadrach · 2024-11-05T16:57:43Z

What about an argument check_dtype? We could default to False, but use True (especially internally).

jorisvandenbossche added Needs Discussion Requires discussion from core team before further action API - Consistency Internal Consistency of API/Behavior labels May 2, 2020

jorisvandenbossche mentioned this issue May 2, 2020

ENH: .equals for Extension Arrays #30652

Merged

5 tasks

jorisvandenbossche mentioned this issue May 27, 2020

BUG/API: other object type check in Series/DataFrame.equals #34402

Merged

galipremsagar mentioned this issue Jul 10, 2020

[REVIEW] Add support for DataFrame.append, Index.append, Index.difference and Index.empty rapidsai/cudf#5454

Merged

14 tasks

simonjayhawkins mentioned this issue Aug 8, 2020

CLN: consistent signatures for equals methods #35636

Merged

dsaxton mentioned this issue Sep 2, 2020

BUG: Index.equals returns incorrect result #36065

Closed

3 tasks

YarShev mentioned this issue Sep 3, 2020

FEAT-#1839: Update pandas dependency and pandas APIs to match modin-project/modin#1840

Merged

6 tasks

attack68 mentioned this issue Sep 23, 2020

BUG: tz_localize drops freq from DatetimIndex #36575

Open

3 tasks

mroeschke added the API Design label Aug 7, 2021

jorisvandenbossche mentioned this issue Nov 26, 2021

ENH: allow storing ExtensionArrays in Index #43930

Merged

7 tasks

jbrockmendel mentioned this issue Mar 11, 2022

BUG: Index.equals different behavior for EA vs non-EA types #46164

Closed

3 tasks

phofl mentioned this issue Jun 17, 2022

Mixed dtypes of column index for data frames raises en error #47382

Closed

jbrockmendel mentioned this issue Oct 26, 2023

BUG: DatetimeIndex.equals returns false if units mismatch #55694

Open

3 tasks

jorisvandenbossche mentioned this issue Nov 5, 2024

BUG: Index.equals is not commutative for string and category dtypes #60194

Closed

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

API: how strict should the equals() method be? #33940

API: how strict should the equals() method be? #33940

jorisvandenbossche commented May 2, 2020 •

edited

Loading

dsaxton commented May 3, 2020

jorisvandenbossche commented May 4, 2020

jorisvandenbossche commented May 4, 2020

attack68 commented Sep 2, 2020

YarShev commented Sep 2, 2020

jbrockmendel commented Sep 23, 2020

jbrockmendel commented Sep 28, 2020

jbrockmendel commented Nov 1, 2021

jorisvandenbossche commented Nov 26, 2021

ronanpaixao commented Jun 29, 2022

batterseapower commented Oct 26, 2023

rhshadrach commented Nov 5, 2024

API: how strict should the equals() method be? #33940

API: how strict should the equals() method be? #33940

Comments

jorisvandenbossche commented May 2, 2020 • edited Loading

dsaxton commented May 3, 2020

jorisvandenbossche commented May 4, 2020

jorisvandenbossche commented May 4, 2020

attack68 commented Sep 2, 2020

YarShev commented Sep 2, 2020

jbrockmendel commented Sep 23, 2020

jbrockmendel commented Sep 28, 2020

jbrockmendel commented Nov 1, 2021

jorisvandenbossche commented Nov 26, 2021

ronanpaixao commented Jun 29, 2022

batterseapower commented Oct 26, 2023

rhshadrach commented Nov 5, 2024

jorisvandenbossche commented May 2, 2020 •

edited

Loading