DOC: Grouper.ngroups vs Grouper.group_info[2] #49980

Closed
jbrockmendel opened this issue Nov 30, 2022 · 2 comments · Fixed by #55738

Comments

@jbrockmendel
Member

jbrockmendel commented Nov 30, 2022

In the groupby code we sometimes do ids, _, ngroups = self.group_info and other times do ngroups = self.ngroups. In all existing test cases (as of the last time I checked, which was probably over a year ago), these two ways of getting ngroups match. Can we prove that this will always be the case? If so, let's make one a property based on the other. If not, let's add a test with a counter-example and a comment.

@natmokval
Contributor

I would like to work on this.

@rhshadrach
Member

rhshadrach commented Nov 15, 2023

I feel "proof" is too strong, but from what I can tell, they are always the same.

Case I: there is a single grouping ping. Here, result_index is ping.result_index and Grouper.group_info[2] is the length of ping.group_index. But ping.result_index is the same as ping.group_index, possibly with a modification to the categories. Thus these are the same.

Case II: there are multiple groupings. Here we have

comp_ids, obs_group_ids = self._get_compressed_codes()

with group_info[2] being len(obs_group_ids). obs_group_ids are the unique values that arise from self.codes when considering all groupings, whereas comp_ids are the enumeration of the groups themselves. In particular, the number of unique values of comp_ids is equal to the length of obs_group_ids. On the other hand, we get reconstructed_codes from calling decons_obs_group_ids on comp_ids. The function decons_obs_group_ids returns a list of NumPy arrays, all of the same length as the number of unique values in comp_ids. Since result_index is built from reconstructed_codes, its length is that same number. Therefore, len(obs_group_ids) equals len(result_index).

#55738 uses len(result_index) for both of these values.
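
For anyone who wants to poke at the Case II argument without stepping through the pandas internals, here is a rough pure-Python sketch of the same counting idea. It is not the pandas implementation: compress_codes and reconstruct_codes are hypothetical helpers written only for illustration, they loosely mirror what _get_compressed_codes and decons_obs_group_ids are described as doing above, and they assume there are no missing (-1) codes.

```python
def compress_codes(codes_per_grouping, level_sizes):
    """Combine per-grouping codes into one flat id per row, then compress.

    Hypothetical helper for illustration only, not a pandas function.
    Assumes every code is non-negative (no missing values).
    """
    n_rows = len(codes_per_grouping[0])
    flat_ids = []
    for row in range(n_rows):
        flat = 0
        # Mixed-radix encoding: one "digit" per grouping.
        for codes, size in zip(codes_per_grouping, level_sizes):
            flat = flat * size + codes[row]
        flat_ids.append(flat)
    # Observed group ids: the distinct flat ids, in sorted order.
    obs_group_ids = sorted(set(flat_ids))
    position = {gid: i for i, gid in enumerate(obs_group_ids)}
    # comp_ids enumerate the observed groups as 0..n-1, one entry per row.
    comp_ids = [position[gid] for gid in flat_ids]
    return comp_ids, obs_group_ids


def reconstruct_codes(obs_group_ids, level_sizes):
    """Split each observed flat id back into one code per grouping."""
    reconstructed = [[] for _ in level_sizes]
    for gid in obs_group_ids:
        digits = []
        for size in reversed(level_sizes):
            digits.append(gid % size)
            gid //= size
        for level, code in enumerate(reversed(digits)):
            reconstructed[level].append(code)
    return reconstructed


# Two groupings with 2 and 3 possible levels; 4 of the 6 combinations occur.
codes_a = [0, 1, 0, 1, 0]
codes_b = [2, 0, 2, 1, 0]
level_sizes = [2, 3]

comp_ids, obs_group_ids = compress_codes([codes_a, codes_b], level_sizes)
reconstructed = reconstruct_codes(obs_group_ids, level_sizes)

# The three counts discussed in the comment above all agree.
n_from_comp_ids = len(set(comp_ids))
n_from_obs_ids = len(obs_group_ids)
n_from_reconstructed = len(reconstructed[0])
print(n_from_comp_ids, n_from_obs_ids, n_from_reconstructed)
```

In this toy version every reconstructed array has one entry per observed group by construction, which is the same reason the two ways of computing ngroups line up in the argument above.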
