-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: unstack with sort=False fails when used with the level parameter… #56357
Conversation
e057c9f
to
3b17596
Compare
Thanks for the PR @renanffernando. I only took a quick look, I think there are perhaps more performant ways of carrying out the changes you have here. Running the ASVs as-is but adding
Perhaps some of the performance can be clawed back - I'll be taking a more detailed look in the next few days. |
3b17596
to
dcb44c4
Compare
I did not know about these performance tests @rhshadrach. I did some experiments and the cause of this poor performance is the creation of the new label's codes. However, I tested some alternatives, but no one performed well. Anyway, I updated the mapping to the best version I found. Do you know a better alternative to do this mapping? |
5d39ff9
to
c27929d
Compare
It sounds like you're looking for factorize with |
…pandas-dev#54987) Assign new codes to labels when sort=False. This is done so that the data appears to be already sorted, fixing the bug.
c27929d
to
3bd4fdf
Compare
I changed the code to use the factorize function as you recommended @rhshadrach, and the performance decrease is much better now. However, it still has a performance decrease by a factor of two in some tests, as I present below. Do you think that is ok?
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great update! I think the performance penalty we're seeing looks good for fixing the behavior. Also needs a note in the 3.0.0 whatsnew under the Reshaping section (once you merge main).
pandas/core/reshape/reshape.py
Outdated
|
||
if not self.sort: | ||
# Create new codes considering that labels are already sorted | ||
codes = [np.array(factorize(code)[0], dtype=code.dtype) for code in codes] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you don't need to wrap with np.array(...)
- is that right?
pandas/core/reshape/reshape.py
Outdated
def sorted_labels(self) -> list[np.ndarray]: | ||
if self.sort: | ||
indexer, _ = self._indexer_and_to_sort | ||
return self.labels |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems a bit confusing to me: what was sorted_labels
has become labels
, and the code for sorted_labels
returns the same result as labels
when sort=True
. If there are good reasons behind these names, maybe add a short docstring to make that clear? Otherwise, perhaps a renaming is in order.
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this. |
@renanffernando - are you interested in continuing here? If not, I plan to finish this up. |
With the latest commit, I'm seeing:
|
@@ -2703,16 +2703,3 @@ def test_pivot_table_with_margins_and_numeric_column_names(self): | |||
index=Index(["a", "b", "All"], name=0), | |||
) | |||
tm.assert_frame_equal(result, expected) | |||
|
|||
@pytest.mark.parametrize("m", [1, 10]) | |||
def test_unstack_shares_memory(self, m): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to keep the test, it still ensures correct behavior. You can remove the shares memory assertion, but everything else should still pass. A failure would imply a legit bug
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure thing - done. Renamed since shares_memory
was no longer accurate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks OK. Just needs a whatsnew
@mroeschke - good to merge? |
One comment about the test, the tests should stay, just removing the assertion on shares memory is the way to go |
Thanks @renanffernando and @rhshadrach |
Thanks for fixing. Is it possible to backport this fix to 2.2.x? |
Backports are reserved for bugs introduce in recent prior versions, and from the issues it doesn't appear that this is the case. |
When sort = False, the previous implementation of unstack assumes that the data will appear in the new data frame in the same order as the old data, but this is not always true. Moreover, the code uses the sorted labels to map the nan values, but it led to a wrong result when sort = False.
To fix the first problem, I assign a 'code id' for each label in a way to simulate that they are already sorted. In this way, the indexer created will map the old data in the new one correctly. The second issue is solved simply by using the old labels.
unstack
withsort=False
fails when used with the level parameter #54987 (Replace xxxx with the GitHub issue number)doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.