Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: unstack with sort=False fails when used with the level parameter… #56357

Merged
merged 11 commits into from
May 21, 2024

Conversation

renanffernando
Copy link
Contributor

@renanffernando renanffernando commented Dec 6, 2023

When sort = False, the previous implementation of unstack assumes that the data will appear in the new data frame in the same order as the old data, but this is not always true. Moreover, the code uses the sorted labels to map the nan values, but it led to a wrong result when sort = False.

To fix the first problem, I assign a 'code id' for each label in a way to simulate that they are already sorted. In this way, the indexer created will map the old data in the new one correctly. The second issue is solved simply by using the old labels.

@renanffernando renanffernando marked this pull request as draft December 6, 2023 05:20
@renanffernando renanffernando force-pushed the franco-unstack branch 3 times, most recently from e057c9f to 3b17596 Compare December 8, 2023 23:03
@renanffernando renanffernando marked this pull request as ready for review December 8, 2023 23:48
@renanffernando renanffernando changed the title [WIP] BUG: unstack with sort=False fails when used with the level parameter… BUG: unstack with sort=False fails when used with the level parameter… Dec 8, 2023
@mroeschke mroeschke requested a review from rhshadrach December 18, 2023 19:27
@rhshadrach
Copy link
Member

Thanks for the PR @renanffernando. I only took a quick look, I think there are perhaps more performant ways of carrying out the changes you have here. Running the ASVs as-is but adding sort=False to the unstack benchmarks, I'm seeing this:

       before           after         ratio
     [14cf864c]       [3b175962]
       <main>        <franco-unstack> 
+         913±2μs      10.9±0.02ms    11.93  reshape.ReshapeExtensionDtype.time_unstack_fast('Period[s]')
+         932±9μs      10.9±0.02ms    11.72  reshape.ReshapeExtensionDtype.time_unstack_fast('datetime64[ns, US/Pacific]')
+     1.71±0.01ms      11.7±0.05ms     6.87  reshape.ReshapeExtensionDtype.time_unstack_slow('datetime64[ns, US/Pacific]')
+        1.70±0ms      11.7±0.05ms     6.85  reshape.ReshapeExtensionDtype.time_unstack_slow('Period[s]')
+     1.47±0.01ms      10.0±0.02ms     6.79  reshape.SimpleReshape.time_unstack
+     2.04±0.01ms       12.2±0.1ms     5.98  reshape.ReshapeMaskedArrayDtype.time_unstack_slow('Int64')
+     2.03±0.01ms      12.1±0.08ms     5.94  reshape.ReshapeMaskedArrayDtype.time_unstack_slow('Float64')
+     5.22±0.05ms      15.2±0.04ms     2.92  reshape.ReshapeMaskedArrayDtype.time_unstack_fast('Float64')
+     5.21±0.06ms      15.2±0.06ms     2.91  reshape.ReshapeMaskedArrayDtype.time_unstack_fast('Int64')
+      23.3±0.6ms       44.5±0.1ms     1.91  reshape.Unstack.time_full_product('int')
+         925±2μs      1.69±0.01ms     1.82  reshape.SparseIndex.time_unstack
+      16.6±0.3ms       25.4±0.3ms     1.53  reshape.Unstack.time_full_product('category')
+      17.7±0.2ms       26.9±0.3ms     1.52  reshape.Unstack.time_without_last_row('category')
+      52.5±0.3ms       74.7±0.2ms     1.42  reshape.Unstack.time_without_last_row('int')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.

Perhaps some of the performance can be clawed back - I'll be taking a more detailed look in the next few days.

@rhshadrach rhshadrach added Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Dec 18, 2023
@renanffernando
Copy link
Contributor Author

renanffernando commented Dec 21, 2023

I did not know about these performance tests @rhshadrach.

I did some experiments and the cause of this poor performance is the creation of the new label's codes. However, I tested some alternatives, but no one performed well. Anyway, I updated the mapping to the best version I found.

Do you know a better alternative to do this mapping?

@renanffernando renanffernando force-pushed the franco-unstack branch 3 times, most recently from 5d39ff9 to c27929d Compare December 21, 2023 16:43
@rhshadrach
Copy link
Member

Do you know a better alternative to do this mapping?

It sounds like you're looking for factorize with sort=False: https://pandas.pydata.org/docs/reference/api/pandas.factorize.html

…pandas-dev#54987)

Assign new codes to labels when sort=False. This is done so that the data appears to be already sorted,
fixing the bug.
@renanffernando
Copy link
Contributor Author

I changed the code to use the factorize function as you recommended @rhshadrach, and the performance decrease is much better now. However, it still has a performance decrease by a factor of two in some tests, as I present below. Do you think that is ok?

Change Before [46c8da3] <v2.3.0.dev0~122> After [3bd4fdf] Ratio Benchmark (Parameter)
+ 26.9±0.1ms 57.9±3ms 2.16 reshape.Unstack.time_full_product('int')
+ 2.35±0.01ms 4.61±0.9ms 1.96 reshape.SimpleReshape.time_stack
+ 877±9μs 1.58±0.2ms 1.8 reshape.SparseIndex.time_unstack
+ 65.0±1ms 108±10ms 1.67 reshape.Unstack.time_without_last_row('int')
+ 1.42±0.02ms 2.18±0.4ms 1.53 reshape.SimpleReshape.time_unstack
+ 1.96±0.01ms 2.96±0.5ms 1.51 reshape.ReshapeMaskedArrayDtype.time_unstack_slow('Float64')
+ 1.97±0.03ms 2.87±0.4ms 1.45 reshape.ReshapeMaskedArrayDtype.time_unstack_slow('Int64')
+ 874±4μs 1.23±0.05ms 1.41 reshape.ReshapeExtensionDtype.time_unstack_fast('datetime64[ns, US/Pacific]')
+ 866±6μs 1.15±0.03ms 1.33 reshape.ReshapeExtensionDtype.time_unstack_fast('Period[s]')
+ 5.16±0.1ms 6.71±2ms 1.3 reshape.ReshapeMaskedArrayDtype.time_unstack_fast('Int64')
+ 7.95±0.07ms 10.2±4ms 1.29 reshape.ReshapeMaskedArrayDtype.time_stack('Int64')
+ 1.71±0.09ms 2.18±0.2ms 1.27 reshape.ReshapeExtensionDtype.time_unstack_slow('Period[s]')
+ 3.33±0.02ms 3.95±0.3ms 1.19 reshape.Cut.time_cut_datetime(4)
+ 16.5±0.3ms 19.2±0.4ms 1.17 reshape.Unstack.time_full_product('category')
+ 7.94±0.06ms 9.21±0.6ms 1.16 reshape.ReshapeMaskedArrayDtype.time_stack('Float64')
+ 40.1±0.06ms 45.8±2ms 1.14 reshape.Crosstab.time_crosstab_normalize_margins
+ 4.17±0.02ms 4.74±0.2ms 1.14 reshape.Cut.time_cut_datetime(10)
+ 105±2ms 120±6ms 1.14 reshape.Pivot.time_reshape_pivot_time_series
+ 31.1±0.2ms 34.9±2ms 1.12 reshape.PivotTable.time_pivot_table_margins_only_column
+ 3.78±0.02ms 4.25±0.2ms 1.12 reshape.ReshapeExtensionDtype.time_stack('Period[s]')
+ 5.21±0.1ms 5.79±0.5ms 1.11 reshape.ReshapeMaskedArrayDtype.time_unstack_fast('Float64')
+ 27.7±0.2ms 30.6±0.8ms 1.1 reshape.PivotTable.time_pivot_table_agg

@rhshadrach rhshadrach added this to the 3.0 milestone Jan 28, 2024
Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great update! I think the performance penalty we're seeing looks good for fixing the behavior. Also needs a note in the 3.0.0 whatsnew under the Reshaping section (once you merge main).


if not self.sort:
# Create new codes considering that labels are already sorted
codes = [np.array(factorize(code)[0], dtype=code.dtype) for code in codes]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you don't need to wrap with np.array(...) - is that right?

Comment on lines 190 to 192
def sorted_labels(self) -> list[np.ndarray]:
if self.sort:
indexer, _ = self._indexer_and_to_sort
return self.labels
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems a bit confusing to me: what was sorted_labels has become labels, and the code for sorted_labels returns the same result as labels when sort=True. If there are good reasons behind these names, maybe add a short docstring to make that clear? Otherwise, perhaps a renaming is in order.

Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Feb 28, 2024
@rhshadrach
Copy link
Member

@renanffernando - are you interested in continuing here? If not, I plan to finish this up.

@rhshadrach rhshadrach removed the Stale label Feb 28, 2024
@rhshadrach rhshadrach self-assigned this Feb 28, 2024
@rhshadrach
Copy link
Member

With the latest commit, I'm seeing:

asv continuous -f 1.1 -E virtualenv upstream/main HEAD -b "^reshape"
BENCHMARKS NOT SIGNIFICANTLY CHANGED.

@@ -2703,16 +2703,3 @@ def test_pivot_table_with_margins_and_numeric_column_names(self):
index=Index(["a", "b", "All"], name=0),
)
tm.assert_frame_equal(result, expected)

@pytest.mark.parametrize("m", [1, 10])
def test_unstack_shares_memory(self, m):
Copy link
Member

@rhshadrach rhshadrach May 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test was added in #57487. To fix this bug, we now need to unconditionally do a take in _make_sorted_values. I don't think the fact that it shared memory in the m=1 case was important - I think that was the only reason this test was added.

cc @phofl

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to keep the test, it still ensures correct behavior. You can remove the shares memory assertion, but everything else should still pass. A failure would imply a legit bug

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure thing - done. Renamed since shares_memory was no longer accurate.

@rhshadrach rhshadrach requested a review from mroeschke May 2, 2024 22:40
Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks OK. Just needs a whatsnew

@rhshadrach
Copy link
Member

@mroeschke - good to merge?

@phofl
Copy link
Member

phofl commented May 16, 2024

One comment about the test, the tests should stay, just removing the assertion on shares memory is the way to go

@mroeschke mroeschke merged commit b991274 into pandas-dev:main May 21, 2024
47 checks passed
@mroeschke
Copy link
Member

Thanks @renanffernando and @rhshadrach

@bluss
Copy link

bluss commented Jun 18, 2024

Thanks for fixing. Is it possible to backport this fix to 2.2.x?

@mroeschke
Copy link
Member

Is it possible to backport this fix to 2.2.x?

Backports are reserved for bugs introduce in recent prior versions, and from the issues it doesn't appear that this is the case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
5 participants