Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CoW: Use exponential backoff when clearing dead references #55518

Merged
merged 9 commits into from
Oct 22, 2023
Merged
Show file tree
Hide file tree
Changes from 5 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v2.1.2.rst
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,7 @@ Fixed regressions
- Fixed regression in :meth:`DataFrame.join` where result has missing values and dtype is arrow backed string (:issue:`55348`)
- Fixed regression in :meth:`DataFrame.resample` which was extrapolating back to ``origin`` when ``origin`` was outside its bounds (:issue:`55064`)
- Fixed regression in :meth:`DataFrame.sort_index` which was not sorting correctly when the index was a sliced :class:`MultiIndex` (:issue:`55379`)
- Fixed performance regression in Copy-on-Write mechanism (:issue:`55256`, :issue:`55245`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The regression occurs without Copy-on-Write too. I think we should mention that here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I struggled a bit with the wording, any suggestions?

Copy link
Member

@lithomas1 lithomas1 Oct 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe

Fixed performance regression in DataFrame copying, DataFrame iteration, and groupby methods taking user defined functions.

?

I think it's better to leave the copy-on-write part out - I personally couldn't find a way to word it without making it seem like the issue was with Copy-on-Write on only.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I am not really happy with listing methods, since this affects all kinds of things with wide data frames

phofl marked this conversation as resolved.
Show resolved Hide resolved

.. ---------------------------------------------------------------------------
.. _whatsnew_212.bug_fixes:
Expand Down
24 changes: 18 additions & 6 deletions pandas/_libs/internals.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -890,17 +890,29 @@ cdef class BlockValuesRefs:
"""
cdef:
public list referenced_blocks
public int clear_counter

def __cinit__(self, blk: Block | None = None) -> None:
if blk is not None:
self.referenced_blocks = [weakref.ref(blk)]
else:
self.referenced_blocks = []

def _clear_dead_references(self) -> None:
self.referenced_blocks = [
ref for ref in self.referenced_blocks if ref() is not None
]
self.clear_counter = 500 # set reasonably high

def _clear_dead_references(self, force=False) -> None:
jreback marked this conversation as resolved.
Show resolved Hide resolved
# Use exponential backoff to decide when we want to clear references
# if force=False. Clearing for every insertion causes slowdowns if
# all these objects stay alive, e.g. df.items() for wide DataFrames
# see GH#55245 and GH#55008
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a reference issue to eventually change this to a WeakSet-like implementation? IIRC I saw that discussed somewhere

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll open an issue about that as a follow up when this is in

if force or len(self.referenced_blocks) > self.clear_counter:
self.referenced_blocks = [
ref for ref in self.referenced_blocks if ref() is not None
]
nr_of_refs = len(self.referenced_blocks)
if nr_of_refs < self.clear_counter // 2:
self.clear_counter = max(self.clear_counter // 2, 500)
Copy link
Contributor

@wangwillian0 wangwillian0 Oct 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest a shrink factor of 4 or more. If it's the same as the growth factor it can create a few corner cases that will still have O(n^2). e.g. length going back and forth between (500*2^n)-1 and (500*2^n)+1

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't this happen as well for a shrink factor of 4? And this would only happen If we have this interleaved with inlace modifications, e.g if force=True, correct? Merging for now, but happy to follow up

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merging for now, but happy to follow up

+1

Copy link
Contributor

@wangwillian0 wangwillian0 Oct 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For a factor of 4 you would need to change the length to the extremes of the range [500*2^(n-1); 500*2^n], which is at least 500 (and more for a larger n), this is much better than triggering the slow operation on just adding and removing 3 references.

elif nr_of_refs > self.clear_counter:
self.clear_counter = max(self.clear_counter * 2, nr_of_refs)

def add_reference(self, blk: Block) -> None:
"""Adds a new reference to our reference collection.
Expand Down Expand Up @@ -934,6 +946,6 @@ cdef class BlockValuesRefs:
-------
bool
"""
self._clear_dead_references()
self._clear_dead_references(force=True)
# Checking for more references than block pointing to itself
return len(self.referenced_blocks) > 1
Loading