Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] case_when function #55390

Closed
wants to merge 2 commits into from
Closed

[ENH] case_when function #55390

wants to merge 2 commits into from

Conversation

samukweku
Copy link
Contributor

@samukweku samukweku commented Oct 4, 2023

Continues the work started by @ELHoussineT(#50343)

  • uses pd.Series.mask under the hood
  • function implementation, as well as a Series method

@samukweku samukweku mentioned this pull request Oct 4, 2023
5 tasks
@samukweku
Copy link
Contributor Author

@rhshadrach @phofl @erfannariman @mroeschke could you kindly have a look at this PR? what's the proper way to tag reviewers? Thanks

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking this up! I think it's looking good. The main issue I see is combining different indices - methods like Series.mask use alignment, and I think we should do so here. My guess is that we could use the logic of pd.concat(..., axis=1) to determine the resulting index. We may also want to allow both inner or outer joins when doing this.

pandas/core/case_when.py Outdated Show resolved Hide resolved
Comment on lines 195 to 199
for replacement in replacements[::-1]:
if isinstance(replacement, ABCSeries):
default_index = replacement.index
break
default = Series(default, index=default_index, dtype=common_dtype)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming we respect alignment, why the reverse iteration here and using replacements instead of conditions? If we always use the first condition (it has to have an index), I think that is a more natural choice: it agrees with the idea of iteratively refining the values of the first condition/value pair.

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just focusing on the core behavior for now.

Comment on lines 134 to 137
for condition in conditions[::-1]:
if isinstance(condition, ABCSeries):
default_index = condition.index
break
Copy link
Member

@rhshadrach rhshadrach Oct 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't consider that the first condition could be a non-pandas object. With this, it seems to me this has a bit of an odd behavior of using the first indexed condition rather than always using the first one. In hindsight, maybe even my suggestion of using the first condition for the index is guessing at what the user wants to do, and is something we should avoid.

When default has no index, we could check that for those that are Series, the conditions and/or values have the same index. If they aren't all equal, we raise. In this case they are all equal, it's unambiguous what the resulting index should be. I worry this is an expensive check. But in what I think is the common case where a single DataFrame is used (e.g. pd.case_when(df['a'] > 5, 10, df['b'] < 3, 5, default=0)), this check is very cheap.

If we do go this route, we can also entertain adding an optional index argument to the function specifically to tell us what the user wants the resulting index to be. But this isn't necessary.

What do you think @samukweku? Also wouldn't mind getting a few more eyes on this, cc @mroeschke @jorisvandenbossche

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For a non-pandas, list-like condition, I would assume the "index" would be the same as the original object unless I'm not totally understanding the scenario

Copy link
Member

@rhshadrach rhshadrach Oct 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mroeschke - I'm guessing you're thinking something like Series.case_when or DataFrame.case_when. This function is top-level: pd.case_when.

Copy link
Contributor Author

@samukweku samukweku Oct 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we get rid of the top level function and just assign it to a Series/DataFrame instead? In that case, we could allow mask/where to support multiple conditions and replacements

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@samukweku - In the linked issue I've advocated that Series.case_when would be quite useful - but then you'd remove the default argument (the Series values are the default). However, that would not cover a use case like

result = pd.case_when(df['a'] > 5, 1, df['b'] < 3, 2, default=0)

The Series alternative would be

result = pd.Series(0, index=df.index).case_when(df['a'] > 5, 1, df['b'] < 3)

While I like the explicitness of the 2nd version and do prefer it, I can understand if others find the first more natural.

I don't know what DataFrame.case_when would do. Modify all columns? This doesn't seem like a common use case.

pandas/core/case_when.py Outdated Show resolved Hide resolved
pandas/core/case_when.py Outdated Show resolved Hide resolved
default = default.mask(
condition, other=replacement, axis=0, inplace=False, level=level
)
except Exception as error:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would remove this try except and have mask raise it's error normally

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea here is to keep track of which condition, replacement failed. More like condition1 failed, this is why it failed. Devolving to mask error directly and you lose the error tracking. I assume the error tracking would be useful to the user.

@samukweku
Copy link
Contributor Author

samukweku commented Oct 29, 2023

@mroeschke @rhshadrach made some changes based on your feedback. Looking forward to your feedback. Thanks

@samukweku
Copy link
Contributor Author

Just focusing on the core behavior for now.

@rhshadrach made some changes to the code. Let me know your thoughts. Thanks

@rhshadrach
Copy link
Member

Thanks for the ping @samukweku - going to get to this either tonight or tomorrow.

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the Series.case_when looks great; I still feel uncertain about the pd.case_when logic but don't see any way it could be made better. I'm okay going forward with it.

doc/source/whatsnew/v2.2.0.rst Outdated Show resolved Hide resolved
doc/source/whatsnew/v2.2.0.rst Show resolved Hide resolved
pandas/core/case_when.py Outdated Show resolved Hide resolved
pandas/core/case_when.py Outdated Show resolved Hide resolved
pandas/core/case_when.py Outdated Show resolved Hide resolved
pandas/core/case_when.py Outdated Show resolved Hide resolved
pandas/core/case_when.py Outdated Show resolved Hide resolved
pandas/core/series.py Outdated Show resolved Hide resolved
pandas/core/series.py Outdated Show resolved Hide resolved
pandas/core/series.py Show resolved Hide resolved
@samukweku samukweku requested a review from rhshadrach November 13, 2023 20:22
@samukweku
Copy link
Contributor Author

samukweku commented Nov 17, 2023

@rhshadrach getting some unrelated failing tests. I have also updated the code based on your feedback.

@samukweku samukweku closed this by deleting the head repository Nov 19, 2023
@samukweku
Copy link
Contributor Author

had issues with my repo, so I had to delete and reinstall. I do not know how to reconnect with this PR. opened a new PR. @rhshadrach if there is a git fu to connect to this, I'll gladly run it. thanks

@samukweku
Copy link
Contributor Author

reconnecting to this PR

@samukweku samukweku mentioned this pull request Nov 19, 2023
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: Dedicated method for creating conditional columns
4 participants