Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DEPR: groupby.grouper #56521

Merged
merged 5 commits into from
Dec 18, 2023
Merged

Conversation

rhshadrach
Copy link
Member

@rhshadrach rhshadrach commented Dec 16, 2023

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Ref: #55738 (comment)

This reverts the deprecation of grouper.reconstructed_codes and grouper.group_keys_seq from #56149 in favor of deprecating all of grouper.

@rhshadrach rhshadrach added Groupby Deprecate Functionality to remove in pandas labels Dec 16, 2023
@rhshadrach rhshadrach added this to the 2.2 milestone Dec 16, 2023
@rhshadrach rhshadrach marked this pull request as draft December 16, 2023 22:54
@rhshadrach rhshadrach marked this pull request as ready for review December 17, 2023 16:25
@@ -472,7 +472,7 @@ Other Deprecations
- Deprecated strings ``H``, ``S``, ``U``, and ``N`` denoting units in :func:`to_timedelta` (:issue:`52536`)
- Deprecated strings ``H``, ``T``, ``S``, ``L``, ``U``, and ``N`` denoting units in :class:`Timedelta` (:issue:`52536`)
- Deprecated strings ``T``, ``S``, ``L``, ``U``, and ``N`` denoting frequencies in :class:`Minute`, :class:`Second`, :class:`Milli`, :class:`Micro`, :class:`Nano` (:issue:`52536`)
- Deprecated the :class:`.BaseGrouper` attributes ``group_keys_seq`` and ``reconstructed_codes``; these will be removed in a future version of pandas (:issue:`56148`)
- Deprecated the :class:`.DataFrameGroupBy.grouper` and :class:`SeriesGroupBy.grouper`; these will be removed in a future version of pandas (:issue:`56521`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

either :attr: or the word "attribute" somewhere in here?

gen = gb.grouper.get_iterator(obj, axis=gb.axis)
msg = "DataFrameGroupBy.grouper is deprecated"
with tm.assert_produces_warning(FutureWarning, match=msg):
gen = gb.grouper.get_iterator(obj, axis=gb.axis)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just change to _grouper?

assert s.groupby(s).grouper.names == ["name"]
msg = "SeriesGroupBy.grouper is deprecated"
with tm.assert_produces_warning(FutureWarning, match=msg):
assert s.groupby(s).grouper.names == ["name"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pretty much all of these can just change grouper->_grouper i think, as long as there are a couple that test the deprecation

)
with tm.assert_produces_warning(FutureWarning, match=msg):
tm.assert_numpy_array_equal(
gr.grouper.group_info[1], np.array([], dtype=np.dtype(np.intp))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if for some reason this doesnt change to _grouper, i have a medium-strong preference for doing as little inside the context as possible, so e.g. gpr = gr.grouper would go inside the context and the assert_numpy... would go in a separate line

@rhshadrach
Copy link
Member Author

Thanks @jbrockmendel - changes made.

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merge when ready @jbrockmendel

@jbrockmendel jbrockmendel merged commit ac170fd into pandas-dev:main Dec 18, 2023
44 checks passed
@jbrockmendel
Copy link
Member

thanks @rhshadrach

@rhshadrach rhshadrach deleted the depr_gb_grouper branch December 18, 2023 23:20
cbpygit pushed a commit to cbpygit/pandas that referenced this pull request Jan 2, 2024
* DEPR: groupby.grouper

* DEPR: groupby.grouper

* fix whatsnew, tests

* Restore test
@jorisvandenbossche
Copy link
Member

@rhshadrach one usage of this that I noticed in downstream (seaborn, mwaskom/seaborn#3615, https://github.com/mwaskom/seaborn/blob/a3cb0f1b33551eb25fed1907d0abdd0237fac215/seaborn/categorical.py#L640) is to access the result_index.

Do we have a good alternative for that?
One could do something like pd.Index(grouped.indices.keys()), but that's less robust (converts from python scalars)

I haven't followed the PR where you are refactoring result_index, so don't know if that will change anything (or only how it is calculated), but would we want to expose something similar on the groupby object itself as alternative?

@rhshadrach
Copy link
Member Author

Using result_index is not currently reliable:

df = pd.DataFrame(
    {
        "a": pd.Categorical([1], categories=[1, 2]),
        "b": pd.Categorical([1], categories=[1, 2]),
        "c": 5,
    }
)
gb = df.groupby(["a", "b"])
print(gb.grouper.result_index)
# MultiIndex([(1, 1)], names=['a', 'b'])
print(gb.size())
# a  b
# 1  1    1
#    2    0
# 2  1    0
#    2    0
# dtype: int64

though #55738 will fix this in all cases I know of. An alternative is to perform a reduction and take the index from the result. I recommend size for this as it doesn't require as much computation as other methods.

size = 100_000
df = pd.DataFrame({'a': np.random.randint(0, 100, size), 'b': np.random.randint(0, 100, size), 'c': np.random.random(size)})
%timeit df.groupby(['a', 'b']).size().index
# 3.68 ms ± 6.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df.groupby(['a', 'b'])._grouper.result_index
# 3.36 ms ± 31.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

We do not need to deprecate grouper for #55738, only certain grouper attributes (see #56149)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Deprecate Functionality to remove in pandas Groupby
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants