ENH: Parameter keep_equal in .compare(...) method should not rely on the keep_shape parameter #49510

Open
1 of 3 tasks
it176131 opened this issue Nov 3, 2022 · 3 comments
Labels
Enhancement Needs Discussion Requires discussion from core team before further action

Comments


it176131 commented Nov 3, 2022

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

The docs for the keep_equal parameter say:

keep_equal: bool, default False
If true, the result keeps values that are equal. Otherwise, equal values are shown as NaNs.

I would expect the following code to return rows where the values in each Series are equal, but instead it is empty.

import numpy as np  # 1.23.3
import pandas as pd  # 1.5.1

np.random.seed(0)

s0 = pd.Series(np.random.random(size=(5)))
s1 = s0.copy()

print(s0.compare(s1, keep_equal=True))
Empty DataFrame
Columns: [self, other]
Index: []

Setting keep_shape=True produces the expected result:

import numpy as np  # 1.23.3
import pandas as pd  # 1.5.1

np.random.seed(0)

s0 = pd.Series(np.random.random(size=(5)))
s1 = s0.copy()

print(s0.compare(s1, keep_shape=True, keep_equal=True))
       self     other
0  0.548814  0.548814
1  0.715189  0.715189
2  0.602763  0.602763
3  0.544883  0.544883
4  0.423655  0.423655

I would like the latter output to be returned without having to change the keep_shape argument.

Feature Description

Add logic to .compare(...) method so keep_equal can swap the underlying mask.

NOTE -- this has not been tested!

def compare(
    self,
    other,
    align_axis: Axis = 1,
    keep_shape: bool_t = False,
    keep_equal: bool_t = False,
    result_names: Suffixes = ("self", "other"),
):
    from pandas.core.reshape.concat import concat

    if type(self) is not type(other):
        cls_self, cls_other = type(self).__name__, type(other).__name__
        raise TypeError(
            f"can only compare '{cls_self}' (not '{cls_other}') with '{cls_self}'"
        )

    mask = ~((self == other) | (self.isna() & other.isna()))

    if not keep_equal:
        self = self.where(mask)
        other = other.where(mask)
    else:
        # TODO -- negate the mask!
        self = self.mask(mask)
        other = other.mask(mask)

    ...
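
To make the interaction between the two flags concrete, here is a minimal sketch of the behaviour reported in this issue, for Series only. The function name compare_sketch is hypothetical, and this is not pandas' actual implementation; it assumes both Series share an identical index.

```python
import pandas as pd


def compare_sketch(
    left: pd.Series,
    right: pd.Series,
    keep_shape: bool = False,
    keep_equal: bool = False,
) -> pd.DataFrame:
    """Hypothetical, simplified stand-in for Series.compare (not pandas code)."""
    # True where the two Series differ; NaN in both positions counts as equal.
    differs = ~((left == right) | (left.isna() & right.isna()))

    if not keep_equal:
        # Blank out equal values so that only the differences stay visible.
        left = left.where(differs)
        right = right.where(differs)

    if not keep_shape:
        # Rows without any difference are dropped, whatever keep_equal says.
        left = left[differs]
        right = right[differs]

    return pd.concat([left, right], axis=1, keys=["self", "other"])


s0 = pd.Series([1.0, 2.0, 3.0])
s1 = s0.copy()

print(compare_sketch(s0, s1, keep_equal=True))
print(compare_sketch(s0, s1, keep_shape=True, keep_equal=True))
```

Under these assumptions, the row filter runs after the keep_equal step, which is why identical Series give an empty result unless keep_shape=True.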

Alternative Solutions

Use the .compare(...) method as is, but remember to change the keep_shape argument from False to True.

import numpy as np  # 1.23.3
import pandas as pd  # 1.5.1

np.random.seed(0)

s0 = pd.Series(np.random.random(size=(5)))
s1 = s0.copy()

print(s0.compare(s1, keep_shape=True, keep_equal=True))
       self     other
0  0.548814  0.548814
1  0.715189  0.715189
2  0.602763  0.602763
3  0.544883  0.544883
4  0.423655  0.423655
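
If the goal is specifically the rows where both Series agree, another option is to build the mask by hand rather than going through .compare(...). This is only a sketch: the variable names equal and equal_rows are illustrative, and one value in s1 is changed here so that the filter has something to exclude.

```python
import numpy as np
import pandas as pd

np.random.seed(0)

s0 = pd.Series(np.random.random(size=5))
s1 = s0.copy()
s1.iloc[0] = -1.0  # introduce a single difference at position 0

# True where both values match; NaN in both positions is treated as a match.
equal = s0.eq(s1) | (s0.isna() & s1.isna())

# Side-by-side view restricted to the rows where the two Series are equal.
equal_rows = pd.concat([s0[equal], s1[equal]], axis=1, keys=["self", "other"])
print(equal_rows)
```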

Additional Context

I could not find any existing issues related to the keep_equal parameter in .compare. However, I did find this issue related to tolerance parameters.

@it176131 it176131 added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Nov 3, 2022
@vamsi-verma-s
Contributor

Hi @it176131

I would expect the following code to return rows where the values in each Series are equal, but instead it is empty.

I think this expectation is a bit off, as the function is supposed to show differences.
From the docs: Compare to another Series and show the differences.

By default, equal values are shown as NaNs, which makes the differences easier to see (this may not make much sense for a Series, but it helps in the context of a DataFrame).

If keep_shape=True were the default and this were used to compare any Series with more than a few values, it would show everything, and the user would have to figure out where the differences are.

Having both keep_shape=True and keep_equal=True just puts both Series side by side, leaving the user to figure out where the differences are, if any.

@it176131
Author

it176131 commented Nov 8, 2022

Thanks for looking at this @vamsi-verma-s

I created this example to hopefully show more of what I think is an issue.

import numpy as np  # 1.23.3
import pandas as pd  # 1.5.1

np.random.seed(0)

s0 = pd.Series(np.random.random(size=(5)))
s1 = s0.copy()

# change the 0th element in `s1` to something else
s1.iloc[0] = "a different value"

print(s0.compare(s1, keep_equal=True))

With a Series, keep_equal=True doesn't change anything:

       self              other
0  0.548814  a different value

So I convert to a DataFrame like you mentioned.

import numpy as np  # 1.23.3
import pandas as pd  # 1.5.1

np.random.seed(0)

s0 = pd.Series(np.random.random(size=(5)))
s1 = s0.copy()

# change the 0th element in `s1` to something else
s1.iloc[0] = "a different value"

f0 = s0.to_frame("col0")
f1 = s1.to_frame("col0")

print(f0.compare(f1, keep_equal=True))

But this doesn't change anything either:

       col0                   
       self              other
0  0.548814  a different value

To keep the method working as intended, should the docstring for keep_equal be edited so the user knows that it may not return what's expected unless keep_shape=True?
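
The behaviour described above can be checked across all four flag combinations. The snippet below is a small sketch using toy float values (not the random data from the earlier examples); the variable names are illustrative.

```python
import pandas as pd

s0 = pd.Series([1.0, 2.0, 3.0])
s1 = pd.Series([1.0, 5.0, 3.0])  # differs from s0 only at index 1

default = s0.compare(s1)
equal_only = s0.compare(s1, keep_equal=True)
shape_only = s0.compare(s1, keep_shape=True)
both = s0.compare(s1, keep_shape=True, keep_equal=True)

# For a Series, keep_equal=True on its own returns the same rows and values
# as the default call, because rows without a difference are dropped anyway.
for name, result in [
    ("default", default),
    ("keep_equal", equal_only),
    ("keep_shape", shape_only),
    ("both", both),
]:
    print(name, result.shape, int(result.isna().sum().sum()))
```

Only the last two calls keep all rows, and only the last one keeps the equal values instead of NaNs.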

@vamsi-verma-s
Contributor

I think it's more useful for comparing DataFrames that have multiple columns: keep_equal=True shows the equal values so that you get better context; otherwise they are shown as NaNs to make the differences easier to see. But yes, it does not seem very useful for a Series or for DataFrames that have a single column.

In [41]: df1
Out[41]:
   A  B   C
0  1  5   9
1  2  6  10
2  3  7  11
3  4  8  12

In [42]: df2
Out[42]:
   A  B   C
0  1  5   9
1  2 -2  10
2 -1  7  11
3  4  8  12

In [43]: df1.compare(df2, keep_equal=True)
Out[43]:
     A          B
  self other self other
1    2     2    6    -2
2    3    -1    7     7

In [44]: df1.compare(df2)
Out[44]:
     A          B
  self other self other
1  NaN   NaN  6.0  -2.0
2  3.0  -1.0  NaN   NaN
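
For anyone who wants to reproduce the session above, df1 and df2 can be rebuilt from the printed values. The construction below is an assumption about how the frames were created (only their printed contents appear in the comment), and the names with_context and only_diffs are illustrative.

```python
import pandas as pd

# Rebuild the two frames from the values printed in the session above.
df1 = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8], "C": [9, 10, 11, 12]})
df2 = df1.copy()
df2.loc[1, "B"] = -2  # difference in column B, row 1
df2.loc[2, "A"] = -1  # difference in column A, row 2

# keep_equal=True keeps the matching values in rows/columns that contain
# at least one difference, which gives context around each change.
with_context = df1.compare(df2, keep_equal=True)

# The default call replaces those matching values with NaN.
only_diffs = df1.compare(df2)

print(with_context)
print(only_diffs)
```

Column C and rows 0 and 3 contain no differences, so they are absent from both results.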

@mroeschke mroeschke added Needs Discussion Requires discussion from core team before further action and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 16, 2024
3 participants