[ENH] case_when function #55390

samukweku · 2023-10-04T01:53:04Z

closes ENH: Dedicated method for creating conditional columns #39154
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Continues the work started by @ELHoussineT (#50343).

Uses pd.Series.mask under the hood.
Provides a function implementation as well as a Series method.

samukweku · 2023-10-07T21:05:23Z

@rhshadrach @phofl @erfannariman @mroeschke could you kindly have a look at this PR? Also, what's the proper way to tag reviewers? Thanks

rhshadrach

Thanks for taking this up! I think it's looking good. The main issue I see is combining different indices - methods like Series.mask use alignment, and I think we should do so here. My guess is that we could use the logic of pd.concat(..., axis=1) to determine the resulting index. We may also want to allow both inner or outer joins when doing this.
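To illustrate the pd.concat(..., axis=1) idea mentioned above, here is a minimal sketch of how the resulting index could be derived for outer and inner joins. The two condition Series are made-up examples, and this is not the code in the PR.

```python
import pandas as pd

# Two hypothetical conditions with partially overlapping indexes.
cond_a = pd.Series([True, False, True], index=[0, 1, 2])
cond_b = pd.Series([False, True, True], index=[1, 2, 3])

# pd.concat aligns its inputs on the index; `join` selects union or intersection.
outer_index = pd.concat([cond_a, cond_b], axis=1, join="outer").index
inner_index = pd.concat([cond_a, cond_b], axis=1, join="inner").index

print(list(outer_index))  # [0, 1, 2, 3]
print(list(inner_index))  # [1, 2]
```

With an outer join, rows present in only one condition would come out as missing values and would need a rule for how they are treated.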

pandas/core/case_when.py

rhshadrach · 2023-10-15T12:31:17Z

pandas/core/case_when.py

+        for replacement in replacements[::-1]:
+            if isinstance(replacement, ABCSeries):
+                default_index = replacement.index
+                break
+        default = Series(default, index=default_index, dtype=common_dtype)


Assuming we respect alignment, why the reverse iteration here and using replacements instead of conditions? If we always use the first condition (it has to have an index), I think that is a more natural choice: it agrees with the idea of iteratively refining the values of the first condition/value pair.

rhshadrach

Just focusing on the core behavior for now.

rhshadrach · 2023-10-19T20:58:38Z

pandas/core/case_when.py

+        for condition in conditions[::-1]:
+            if isinstance(condition, ABCSeries):
+                default_index = condition.index
+                break


I didn't consider that the first condition could be a non-pandas object. With this, it seems to me this has a bit of an odd behavior of using the first indexed condition rather than always using the first one. In hindsight, maybe even my suggestion of using the first condition for the index is guessing at what the user wants to do, and is something we should avoid.

When default has no index, we could check that, among the arguments that are Series, the conditions and/or values all have the same index. If they aren't all equal, we raise. If they are all equal, it's unambiguous what the resulting index should be. I worry this is an expensive check in general, but in what I think is the common case, where a single DataFrame is used (e.g. pd.case_when(df['a'] > 5, 10, df['b'] < 3, 5, default=0)), it is very cheap.

If we do go this route, we can also entertain adding an optional index argument to the function specifically to tell us what the user wants the resulting index to be. But this isn't necessary.

What do you think @samukweku? Also wouldn't mind getting a few more eyes on this, cc @mroeschke @jorisvandenbossche
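The index check proposed in this comment could be sketched roughly as follows. The helper name resolve_index and its index keyword are hypothetical and only illustrate the idea; they are not part of pandas or of this PR.

```python
import pandas as pd


def resolve_index(*args, index=None):
    """Hypothetical helper: return the one index shared by every Series in args.

    An explicit `index` wins; otherwise all Series must agree or we raise.
    """
    if index is not None:
        return pd.Index(index)
    indexes = [arg.index for arg in args if isinstance(arg, pd.Series)]
    if not indexes:
        raise ValueError("no Series arguments found; pass index= explicitly")
    first = indexes[0]
    for other in indexes[1:]:
        # Compare every other index against the first one.
        if not first.equals(other):
            raise ValueError("Series arguments do not share the same index")
    return first


df = pd.DataFrame({"a": [1, 6, 9], "b": [1, 2, 7]})
# Scalars are skipped; only the two conditions contribute an index.
idx = resolve_index(df["a"] > 5, 10, df["b"] < 3, 5)
print(list(idx))  # [0, 1, 2]
```

Raising on a mismatch keeps the function from guessing which index the user meant, which is the concern raised above.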

mroeschke · 2023-10-19

For a non-pandas, list-like condition, I would assume the "index" would be the same as the original object, unless I'm not totally understanding the scenario.

rhshadrach · 2023-10-20

@mroeschke - I'm guessing you're thinking of something like Series.case_when or DataFrame.case_when. This function is top-level: pd.case_when.

samukweku · 2023-10-20

Should we get rid of the top-level function and just attach it to Series/DataFrame instead? In that case, we could allow mask/where to support multiple conditions and replacements.

rhshadrach · 2023-10-28

@samukweku - In the linked issue I've advocated that Series.case_when would be quite useful, but then you'd remove the default argument (the Series values are the default). However, that would not cover a use case like

result = pd.case_when(df['a'] > 5, 1, df['b'] < 3, 2, default=0)

The Series alternative would be

result = pd.Series(0, index=df.index).case_when(df['a'] > 5, 1, df['b'] < 3, 2)

While I like the explicitness of the 2nd version and do prefer it, I can understand if others find the first more natural.

I don't know what DataFrame.case_when would do. Modify all columns? This doesn't seem like a common use case.
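For readers following the two spellings discussed above, here is a rough sketch of the intended semantics built only on Series.mask. The function case_when_sketch is a hypothetical stand-in, not the implementation under review, and it assumes that the first matching condition should win.

```python
import pandas as pd


def case_when_sketch(default, *args):
    """Hypothetical stand-in: args are condition, value, condition, value, ..."""
    if len(args) % 2:
        raise ValueError("each condition needs a matching replacement")
    pairs = list(zip(args[::2], args[1::2]))
    result = default
    # Apply the pairs last-to-first so the first matching condition wins.
    for condition, replacement in reversed(pairs):
        result = result.mask(condition, other=replacement)
    return result


df = pd.DataFrame({"a": [1, 6, 9], "b": [1, 2, 7]})
default = pd.Series(0, index=df.index)
result = case_when_sketch(default, df["a"] > 5, 1, df["b"] < 3, 2)
print(result.tolist())  # [2, 1, 1]
```

Row 1 satisfies both conditions and receives 1, the replacement paired with the first condition.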

pandas/core/case_when.py

mroeschke · 2023-10-19T23:14:43Z

pandas/core/case_when.py

+            default = default.mask(
+                condition, other=replacement, axis=0, inplace=False, level=level
+            )
+        except Exception as error:


I would remove this try/except and have mask raise its error normally.

samukweku · 2023-10-20

The idea here is to keep track of which condition/replacement pair failed, along the lines of "condition1 failed, and this is why it failed". If we defer to the mask error directly, we lose that tracking, and I assume it would be useful to the user.
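The trade-off in this thread can be made concrete with a small sketch that keeps the original mask error as the cause while naming the failing pair. The function apply_pairs and the message wording are assumptions for illustration, not the PR's code.

```python
import pandas as pd


def apply_pairs(default, pairs):
    """Hypothetical loop that reports which (condition, replacement) pair failed."""
    result = default
    # Walk the pairs last-to-first so earlier pairs take precedence.
    for position, (condition, replacement) in reversed(list(enumerate(pairs))):
        try:
            result = result.mask(condition, other=replacement)
        except Exception as error:
            # Chain the original exception so its message and traceback survive.
            raise ValueError(
                f"Failed to apply condition{position} and replacement{position}."
            ) from error
    return result


base = pd.Series([0, 0, 0])
ok = apply_pairs(base, [([True, False, False], 1), ([False, True, False], 2)])
print(ok.tolist())  # [1, 2, 0]
```

Passing a condition of the wrong length, for example [True, False] against a three-element Series, would surface as a ValueError that names the pair's position, with the error from mask attached as its cause.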

samukweku · 2023-10-29T09:29:21Z

@mroeschke @rhshadrach made some changes based on your feedback. Looking forward to your feedback. Thanks

samukweku · 2023-11-04T00:13:43Z

Just focusing on the core behavior for now.

@rhshadrach made some changes to the code. Let me know your thoughts. Thanks

rhshadrach · 2023-11-06T21:53:51Z

Thanks for the ping @samukweku - going to get to this either tonight or tomorrow.

rhshadrach

I think the Series.case_when looks great; I still feel uncertain about the pd.case_when logic but don't see any way it could be made better. I'm okay going forward with it.

doc/source/whatsnew/v2.2.0.rst

pandas/core/case_when.py

pandas/core/series.py

samukweku · 2023-11-17T08:58:55Z

@rhshadrach getting some unrelated failing tests. I have also updated the code based on your feedback.

samukweku · 2023-11-19T07:24:54Z

Had issues with my repo, so I had to delete and reinstall it. I do not know how to reconnect it with this PR, so I opened a new PR. @rhshadrach if there is some git-fu to reconnect to this one, I'll gladly run it. Thanks

samukweku · 2023-11-19T07:25:33Z

reconnecting to this PR

samukweku mentioned this pull request Oct 4, 2023

ENH: case_when function #55306

Closed

5 tasks

rhshadrach requested changes Oct 8, 2023

samukweku requested review from MarcoGorelli and WillAyd as code owners October 12, 2023 08:07

samukweku requested a review from rhshadrach October 15, 2023 03:36

rhshadrach reviewed Oct 15, 2023

rhshadrach reviewed Oct 15, 2023

rhshadrach requested changes Oct 15, 2023

samukweku requested a review from rhshadrach October 18, 2023 09:22

simonjayhawkins added the Enhancement and API Design labels Oct 18, 2023

rhshadrach reviewed Oct 19, 2023

mroeschke reviewed Oct 19, 2023

mroeschke reviewed Oct 19, 2023

mroeschke reviewed Oct 19, 2023

samukweku requested review from mroeschke and rhshadrach October 28, 2023 11:23

rhshadrach requested changes Nov 9, 2023

samukweku requested a review from rhshadrach November 13, 2023 20:22

samukweku added 2 commits November 19, 2023 08:45

updates

39a8929

Merge branch 'main' into samukweku/case_when_function

551e1e8

samukweku closed this by deleting the head repository Nov 19, 2023

samukweku mentioned this pull request Nov 19, 2023

ENH: Add case_when method #56059

Merged

5 tasks
