Allow NaN Intervals and to allow NaN as a Groupby Category #28927

Closed
JoshuaC3 opened this issue Oct 11, 2019 · 2 comments
Labels
Categorical, Closing Candidate, Enhancement, Groupby, Interval

Comments

@JoshuaC3

JoshuaC3 commented Oct 11, 2019

Code Sample

import numpy as np
import pandas as pd

df = pd.DataFrame(
    data={
        'age': [12, 13, 14, 12, np.nan, 16, 16.5, 13.5, 10, 18]
    }
)

intervals = [pd.Interval(i, i + 3) for i in range(10, 18, 3)]
# - 1 intervals = intervals + [pd.Interval(np.nan, np.nan)]
interval_idx = pd.IntervalIndex(intervals)

cutted = pd.cut(df.age, interval_idx)
# - 2 cutted = cutted.fillna(-99)
cutted
0    [10.0, 13.0)
1    [13.0, 16.0)
2    [13.0, 16.0)
3    [10.0, 13.0)
4             NaN
5    [16.0, 19.0)
6    [16.0, 19.0)
7    [13.0, 16.0)
8              NaN
9    [16.0, 19.0)
Name: age, dtype: category
Categories (3, interval[int64]): [(10, 13] < (13, 16] < (16, 19]]
# NOTE: assumes df also has a 'score' column, which the sample above omits
df.groupby(cutted).score.mean()
age
(10, 13]    51.000000
(13, 16]    54.333333
(16, 19]    35.000000
Name: score, dtype: float64

Problem description

Warning: This is something of an XYZ problem.

Ultimately, I want to group by Intervals and include NaN as one of the groups. Ideally this group would be displayed as NaN, but groupby drops NaN categories. Displaying it as -99, for example, would also be acceptable.

If the only NaN were the one already present before the cut, I could simply pre-fill it (fillna) with -99. Here, however, the cut itself generates another NaN for the value 10, which falls outside every interval.

So I tried fillna on the result of the cut (code comment # - 2), but a categorical fillna does not accept a value that is not among the categories: ValueError: fill value must be in categories

To get around this, I tried adding a "NaN Interval" to the bins (code comment # - 1). This throws ValueError: left side of interval must be <= right side.
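Both failures can be reproduced in isolation. The following is a minimal sketch, not the reporter's exact script: the short `ages` series and the variable names are made up for illustration, and the exception type raised by the categorical `fillna` appears to differ between pandas versions (the report shows `ValueError`; newer releases may raise `TypeError`), so both are caught.

```python
import numpy as np
import pandas as pd

# Attempt "# - 1": a "NaN interval" cannot be constructed, because
# NaN <= NaN is False and Interval requires left <= right.
try:
    pd.Interval(np.nan, np.nan)
    nan_interval_error = None
except ValueError as exc:
    nan_interval_error = exc

# Attempt "# - 2": filling a categorical with a value that is not one of
# its categories is rejected. (Illustrative data, not the original frame.)
ages = pd.Series([12, np.nan, 10, 18], name="age")
bins = pd.IntervalIndex.from_breaks([10, 13, 16, 19])  # (10, 13], (13, 16], (16, 19]
cutted = pd.cut(ages, bins)  # NaN stays NaN; 10 falls outside (10, 13]

try:
    cutted.fillna(-99)
    fillna_error = None
except (TypeError, ValueError) as exc:  # type depends on the pandas version
    fillna_error = exc

print(type(nan_interval_error).__name__, nan_interval_error)
print(type(fillna_error).__name__, fillna_error)
```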

Expected Output

Ultimately,

age
(NaN, NaN]    55.500000
(10, 13]      51.000000
(13, 16]      54.333333
(16, 19]      35.000000
Name: score, dtype: float64

But it seems there are a few issues to tackle along the way.

@rhshadrach
Member

I think you just need to specify dropna=False:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    data={
        'age': [12, 13, 14, 12, np.nan, 16, 16.5, 13.5, 10, 18],
        'score': 5,
    }
)

intervals = [pd.Interval(i, i + 3) for i in range(10, 18, 3)]
# - 1 intervals = intervals + [pd.Interval(np.nan, np.nan)]
interval_idx = pd.IntervalIndex(intervals)

cutted = pd.cut(df.age, interval_idx)
print(df.groupby(cutted, dropna=False).score.mean())
# (10.0, 13.0]    5.0
# (13.0, 16.0]    5.0
# (16.0, 19.0]    5.0
# NaN             5.0
# Name: score, dtype: float64

This was likely fixed in 1.3-1.5 where categoricals with dropna got a few bug fixes. I don't think we need tests - we have tests for this in test_groupby_dropna.
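For anyone wanting to check this on their own install, here is a self-contained variant of the snippet above. It is only a sketch: the `score` values are made up so that each group has a distinct mean, and `observed=False` is passed explicitly on the assumption that the installed pandas would otherwise warn about the default for categorical groupers.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "age": [12, 13, 14, 12, np.nan, 16, 16.5, 13.5, 10, 18],
        # Made-up scores (not from the original report), one per row.
        "score": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
    }
)

# Right-closed bins: (10, 13], (13, 16], (16, 19]
bins = pd.IntervalIndex.from_breaks([10, 13, 16, 19])
binned = pd.cut(df["age"], bins)

# dropna=False keeps rows whose bin is NaN as their own group.
means = df.groupby(binned, dropna=False, observed=False)["score"].mean()
print(means)

# The NaN group should hold both the missing age (row 4) and the
# out-of-range age 10 (row 8), so its mean is (50 + 90) / 2.
nan_mean = means[means.index.isna()].iloc[0]
expected_nan_mean = df.loc[binned.isna(), "score"].mean()
```

The NaN group is shown as a plain `NaN` label rather than a `(NaN, NaN]` interval, so no special interval has to be added to the bins.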

rhshadrach added the Closing Candidate label on Nov 17, 2023
@mroeschke
Member

Yeah, if we think this is covered already, closing.
