-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ENH] case_when function #55390
[ENH] case_when function #55390
Conversation
@rhshadrach @phofl @erfannariman @mroeschke could you kindly have a look at this PR? what's the proper way to tag reviewers? Thanks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking this up! I think it's looking good. The main issue I see is combining different indices - methods like Series.mask use alignment, and I think we should do so here. My guess is that we could use the logic of pd.concat(..., axis=1)
to determine the resulting index. We may also want to allow both inner or outer joins when doing this.
pandas/core/case_when.py
Outdated
for replacement in replacements[::-1]: | ||
if isinstance(replacement, ABCSeries): | ||
default_index = replacement.index | ||
break | ||
default = Series(default, index=default_index, dtype=common_dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assuming we respect alignment, why the reverse iteration here and using replacements instead of conditions? If we always use the first condition (it has to have an index), I think that is a more natural choice: it agrees with the idea of iteratively refining the values of the first condition/value pair.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just focusing on the core behavior for now.
pandas/core/case_when.py
Outdated
for condition in conditions[::-1]: | ||
if isinstance(condition, ABCSeries): | ||
default_index = condition.index | ||
break |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't consider that the first condition could be a non-pandas object. With this, it seems to me this has a bit of an odd behavior of using the first indexed condition rather than always using the first one. In hindsight, maybe even my suggestion of using the first condition for the index is guessing at what the user wants to do, and is something we should avoid.
When default
has no index, we could check that for those that are Series, the conditions and/or values have the same index. If they aren't all equal, we raise. In this case they are all equal, it's unambiguous what the resulting index should be. I worry this is an expensive check. But in what I think is the common case where a single DataFrame is used (e.g. pd.case_when(df['a'] > 5, 10, df['b'] < 3, 5, default=0)
), this check is very cheap.
If we do go this route, we can also entertain adding an optional index
argument to the function specifically to tell us what the user wants the resulting index to be. But this isn't necessary.
What do you think @samukweku? Also wouldn't mind getting a few more eyes on this, cc @mroeschke @jorisvandenbossche
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For a non-pandas, list-like condition, I would assume the "index" would be the same as the original object unless I'm not totally understanding the scenario
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mroeschke - I'm guessing you're thinking something like Series.case_when
or DataFrame.case_when
. This function is top-level: pd.case_when
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we get rid of the top level function and just assign it to a Series/DataFrame instead? In that case, we could allow mask/where to support multiple conditions and replacements
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@samukweku - In the linked issue I've advocated that Series.case_when
would be quite useful - but then you'd remove the default
argument (the Series values are the default). However, that would not cover a use case like
result = pd.case_when(df['a'] > 5, 1, df['b'] < 3, 2, default=0)
The Series alternative would be
result = pd.Series(0, index=df.index).case_when(df['a'] > 5, 1, df['b'] < 3)
While I like the explicitness of the 2nd version and do prefer it, I can understand if others find the first more natural.
I don't know what DataFrame.case_when
would do. Modify all columns? This doesn't seem like a common use case.
default = default.mask( | ||
condition, other=replacement, axis=0, inplace=False, level=level | ||
) | ||
except Exception as error: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would remove this try except and have mask raise it's error normally
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea here is to keep track of which condition, replacement
failed. More like condition1
failed, this is why it failed. Devolving to mask error directly and you lose the error tracking. I assume the error tracking would be useful to the user.
@mroeschke @rhshadrach made some changes based on your feedback. Looking forward to your feedback. Thanks |
@rhshadrach made some changes to the code. Let me know your thoughts. Thanks |
Thanks for the ping @samukweku - going to get to this either tonight or tomorrow. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the Series.case_when looks great; I still feel uncertain about the pd.case_when logic but don't see any way it could be made better. I'm okay going forward with it.
@rhshadrach getting some unrelated failing tests. I have also updated the code based on your feedback. |
had issues with my repo, so I had to delete and reinstall. I do not know how to reconnect with this PR. opened a new PR. @rhshadrach if there is a git fu to connect to this, I'll gladly run it. thanks |
reconnecting to this PR |
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.Continues the work started by @ELHoussineT(#50343)
pd.Series.mask
under the hood