API: how strict should the equals() method be? #33940
Comments
It feels to me like when calling equals on some container you'd generally be interested in equality of the entire container, not just the values; otherwise you'd use == (maybe modulo some differences around missing values). You might argue that even this behavior is questionable:
In [11]: pd.Series([1, 2, 3], name="abc").equals(pd.Series([1, 2, 3], name="xyz"))
Out[11]: True |
If you want equality of the entire container, which I assume you typically do in testing (?), you also have the On the other hand, in non-testing code, you might be more interested in equal values (this is just a guess; I don't really know how people would like to use this).
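As an illustration of the distinction drawn in this comment, the sketch below contrasts `Series.equals`, which ignores the name, with `pandas.testing.assert_series_equal`, which by default also compares it. Choosing the testing helper as the "entire container" check is an assumption made for this example; the comment itself does not say which function it has in mind.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], name="abc")
s2 = pd.Series([1, 2, 3], name="xyz")

# equals() compares index and values; the name is not part of the check.
values_equal = bool(s1.equals(s2))

# The testing helper compares names by default and raises on a mismatch.
try:
    pd.testing.assert_series_equal(s1, s2)
    names_match = True
except AssertionError:
    names_match = False

# Opting out of the name check makes the assertion pass.
pd.testing.assert_series_equal(s1, s2, check_names=False)

print("equals:", values_equal)
print("assert_series_equal passed with default settings:", names_match)
```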
You could say that |
Regarding use cases, it might already be interesting to check how this method is used internally (as the reason I started looking at this is because I actually needed |
Just to add to this from #36065: the equality assertions in the testing module also behave inconsistently:
The reason is that assert_series_equal, after checking the index, also checks the frequency of the index. A very easy change would be to add the freq check to assert_index_equal with argument and remove it from assert_series_equal. |
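The inconsistency described here can be reproduced with a small sketch, assuming a pandas version in which `assert_series_equal` accepts the `check_freq` keyword. The two indexes hold the same timestamps and differ only in their `freq` attribute; the variable names are illustrative.

```python
import pandas as pd

# Same timestamps, but only the first index carries a frequency.
idx_with_freq = pd.date_range("2020-01-01", periods=3, freq="D")
idx_no_freq = pd.DatetimeIndex(idx_with_freq.to_numpy())

# The index-level assertion does not look at freq, so this passes.
pd.testing.assert_index_equal(idx_with_freq, idx_no_freq)

left = pd.Series([1, 2, 3], index=idx_with_freq)
right = pd.Series([1, 2, 3], index=idx_no_freq)

# The series-level assertion checks the index freq by default.
try:
    pd.testing.assert_series_equal(left, right)
    series_assert_passed = True
except AssertionError:
    series_assert_passed = False

# Disabling the freq check makes the two series compare equal.
pd.testing.assert_series_equal(left, right, check_freq=False)

print("assert_series_equal passed with default settings:", series_assert_passed)
```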
@attack68 , it's a good point! |
I think this is at the heart of #33531 |
Now that EA has a .equals method, we should consider having both Index.equals and Series.equals wrap
|
For Index.equals, I think the main internal usage is intended as a check on whether we can short-circuit/fastpath setops/get_indexer/reindex/join. Those would benefit from making equals stricter, particularly for DTI/TDI/PI. |
I understand that for internal use the strict version might be more useful, but for external use cases, the loose version might be more useful if you care about equal values (and not eg int32 vs int64). In such a case, this Now, whatever option we would like to choose long-term as default, do we want to add a keyword to control this "strictness"? (eg like |
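To make the "strictness keyword" idea concrete, here is a minimal sketch of a hypothetical helper. The function `series_equals` and its `check_dtype` parameter are invented for illustration and are not part of pandas; the loose branch assumes numeric data without missing values, since `np.array_equal` treats NaN as unequal by default.

```python
import numpy as np
import pandas as pd


def series_equals(left: pd.Series, right: pd.Series, check_dtype: bool = True) -> bool:
    """Hypothetical helper, not a pandas API."""
    if check_dtype:
        # Strict mode: defer to Series.equals, which requires matching dtypes.
        return bool(left.equals(right))
    # Loose mode: same index labels and same values, dtype ignored.
    if not left.index.equals(right.index):
        return False
    return bool(np.array_equal(left.to_numpy(), right.to_numpy()))


a = pd.Series([1, 2, 3], dtype="int64")
b = pd.Series([1, 2, 3], dtype="int32")

strict = series_equals(a, b)
loose = series_equals(a, b, check_dtype=False)
print("strict:", strict, "loose:", loose)
```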
Please also extend the consideration of differences to the
>>> import numpy as np
>>> import pandas as pd
>>> df1 = pd.DataFrame([pd.NA])
>>> df2 = pd.DataFrame([np.nan])
>>> df1.equals(df2)
False
>>> df1.compare(df2)
Empty DataFrame
Columns: []
Index: [] |
When I compare two indexes "a" and "b" using |
What about an argument |
While adding an equals() method to ExtensionArray (#27081, #30652), some questions come up about how strict the method should be:
And it seems that right now, we are somewhat inconsistent with this in pandas regarding being strict on the data type.
Series is strict about the dtype, while Index is not:
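The original snippet for this point did not survive in the page text. A minimal sketch of the kind of comparison being described, using int64 versus int32 as the assumed pair of dtypes:

```python
import pandas as pd

values = [1, 2, 3]

# Series: differing integer dtypes are reported as not equal.
series_result = bool(
    pd.Series(values, dtype="int64").equals(pd.Series(values, dtype="int32"))
)

# Index: the same comparison is reported as equal.
index_result = bool(
    pd.Index(values, dtype="int64").equals(pd.Index(values, dtype="int32"))
)

print("Series.equals:", series_result)
print("Index.equals: ", index_result)
```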
For Index, this not only gives True for different integer dtypes as above, but also for float/int, object/int (both examples give False for Series):
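A sketch of the float/int and object/int cases mentioned here, with illustrative values chosen for this example. The comments restate what the issue reports; results may differ between pandas versions.

```python
import pandas as pd

int_idx = pd.Index([1, 2, 3], dtype="int64")
float_idx = pd.Index([1.0, 2.0, 3.0], dtype="float64")
obj_idx = pd.Index([1, 2, 3], dtype=object)

# Index: reported as equal in both cases.
index_float_vs_int = bool(float_idx.equals(int_idx))
index_object_vs_int = bool(obj_idx.equals(int_idx))

int_ser = pd.Series([1, 2, 3], dtype="int64")
float_ser = pd.Series([1.0, 2.0, 3.0], dtype="float64")
obj_ser = pd.Series([1, 2, 3], dtype=object)

# Series: reported as not equal in both cases.
series_float_vs_int = bool(float_ser.equals(int_ser))
series_object_vs_int = bool(obj_ser.equals(int_ser))

print("Index  float/int:", index_float_vs_int, " object/int:", index_object_vs_int)
print("Series float/int:", series_float_vs_int, " object/int:", series_object_vs_int)
```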
Index and Series are consistent when it comes to not being equal with other array-likes:
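A sketch of this case, comparing against a NumPy array and a plain list holding the same values (the choice of array-likes is an assumption for illustration):

```python
import numpy as np
import pandas as pd

values = [1, 2, 3]

# Neither Series nor Index treats a non-pandas array-like as equal,
# even when the contained values match.
series_vs_array = bool(pd.Series(values).equals(np.array(values)))
series_vs_list = bool(pd.Series(values).equals(values))
index_vs_array = bool(pd.Index(values).equals(np.array(values)))
index_vs_list = bool(pd.Index(values).equals(values))

print(series_vs_array, series_vs_list, index_vs_array, index_vs_list)
```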
Both Index and Series also seem to allow subclasses:
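A sketch of the subclass case. `MySeries` is a hypothetical subclass defined only for this example, and `RangeIndex` is used as a built-in `Index` subclass; the original issue may have used different classes.

```python
import pandas as pd


class MySeries(pd.Series):
    """Hypothetical subclass used only for this illustration."""


base = pd.Series([1, 2, 3])
sub = MySeries([1, 2, 3])

# A Series and an instance of its subclass compare equal in both directions.
series_vs_subclass = bool(base.equals(sub))
subclass_vs_series = bool(sub.equals(base))

# RangeIndex is a subclass of Index; equal labels compare equal.
index_vs_range = bool(pd.Index([0, 1, 2]).equals(pd.RangeIndex(3)))
range_vs_index = bool(pd.RangeIndex(3).equals(pd.Index([0, 1, 2])))

print(series_vs_subclass, subclass_vs_series, index_vs_range, range_vs_index)
```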
So in the end, I think the main discussion point is: should the dtype be exactly the same, or should only the values be equal?
For DataFrame, it shares the implementation with Series so follows that behaviour (except that for DataFrame there are some additional rules about how column names need to compare equal).
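A sketch of the DataFrame behaviour. The second comparison follows an example from the `DataFrame.equals` documentation, where column labels of different types but equal values still compare equal; it is offered as one illustration of the column-name rules, not a full account of them.

```python
import pandas as pd

df_int64 = pd.DataFrame({"a": [1, 2, 3]}, dtype="int64")
df_int32 = df_int64.astype("int32")

# Like Series, DataFrame is strict about the dtype of the data.
data_dtype_result = bool(df_int64.equals(df_int32))

# Column labels 1 and 1.0 differ in type but compare equal as labels.
df_int_labels = pd.DataFrame({1: [10], 2: [20]})
df_float_labels = pd.DataFrame({1.0: [10], 2.0: [20]})
column_label_result = bool(df_int_labels.equals(df_float_labels))

print("different data dtype:", data_dtype_result)
print("different column label type:", column_label_result)
```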