CoW: Use weakref callbacks to track dead references #55539

wangwillian0 · 2023-10-16T01:02:18Z

- closes PERF: hash_pandas_object performance regression between 2.1.0 and 2.1.1 #55245
- Tests added and passed if fixing a bug or adding a new feature
- All code checks passed.
- Added type annotations to new arguments/methods/functions.
- Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

This is an alternative solution to #55518. What is happening here:

- Uses the weakref.ref callback to count the number of dead references present in the referenced_blocks list.
- The constructor takes about 20% more time because of the function declaration.
  - The function declaration can be moved to add_reference and add_index_reference, but this just moves the performance hit to those methods.
- Everything is O(1) now, including has_reference (_clear_dead_references is amortized O(1)).
- If the number of dead references is more than 256 and more than 50% of the list, these references are pruned.
  - The implication is that the slow, linear cleaning of dead references happens close to the GC calls, which I think is more intuitive behavior.
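The points above can be sketched in pure Python. This is a minimal illustration, not the PR's Cython code: `RefTracker` and `Block` are hypothetical stand-ins, the `> 1` test in `has_reference` is an assumption about what counts as "shared", and only the 256 / 50% thresholds come from the description.

```python
import weakref


class Block:
    """Stand-in for a pandas Block; any weak-referenceable object works."""


class RefTracker:
    """Hypothetical tracker that counts dead weakrefs instead of scanning."""

    PRUNE_MIN_DEAD = 256  # absolute threshold from the PR description

    def __init__(self, blk=None):
        self.referenced_blocks = []
        self.dead_counter = 0

        # The callback only holds a weak reference to the tracker, so the
        # callback does not keep the tracker alive.
        def _weakref_cb(item, selfref=weakref.ref(self)):
            tracker = selfref()
            if tracker is not None:
                tracker.dead_counter += 1

        self._weakref_cb = _weakref_cb
        if blk is not None:
            self.add_reference(blk)

    def add_reference(self, blk):
        self._clear_dead_references()
        self.referenced_blocks.append(weakref.ref(blk, self._weakref_cb))

    def _clear_dead_references(self):
        # Prune only when dead refs exceed 256 AND more than half of the list.
        total = len(self.referenced_blocks)
        if self.dead_counter > self.PRUNE_MIN_DEAD and 2 * self.dead_counter > total:
            self.referenced_blocks = [
                ref for ref in self.referenced_blocks if ref() is not None
            ]
            self.dead_counter = 0

    def has_reference(self):
        # O(1): no scan, just arithmetic on the list length and the counter.
        return len(self.referenced_blocks) - self.dead_counter > 1
```

The linear list rebuild only runs after at least 257 references have died, which is why the cleaning cost is amortized over many insertions.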

(Also see the other possibilities discussed in #55008.)

Let me know if I should continue working on this solution.

WillAyd · 2023-10-16T12:03:53Z

pandas/_libs/internals.pyx

+        def _weakref_cb(item: weakref.ref, selfref: weakref.ref = weakref.ref(self)) -> None:
+            self = selfref()
+            if self is not None:
+                self.dead_counter += 1


Is this ever reachable without the GIL? This doesn't appear thread safe so thinking through what impacts that might have

Sorry, can you explain a bit more? Is pandas preparing to be thread-safe and to use the no-GIL feature of the upcoming Python version?

We release the GIL in quite a few places in the Cython code today, mostly for downstream libraries like Dask to take advantage of. Though I guess this in its current state wouldn't work without the GIL since it is already using weakref on main

WillAyd · 2023-10-16T12:34:26Z

Assuming this is an alternate solution to @phofl's PR in #55518

phofl · 2023-10-16T12:36:53Z

#55518 needs to be back ported. This is something that we can consider for main and 2.2, but I want to test this more before we rely on it.

wangwillian0 · 2023-11-05T20:01:56Z

I just fixed the tests. Any thoughts about it @phofl @WillAyd?

WillAyd

Generally looks good though will defer to @phofl who is more familiar with this. I think he is traveling this week so thanks for your patience in advance

WillAyd · 2023-11-06T14:00:05Z

pandas/_libs/internals.pyx


    def __cinit__(self, blk: Block | None = None) -> None:
+        def _weakref_cb(item: weakref.ref, selfref: weakref.ref = weakref.ref(self)) -> None:


Not sure of all of the impacts but I think cinit is mostly used for C-level structures. I think setting this callback should instead be done in init

I just updated it

Actually, it doesn't work because cinit is called before init, so we need the callback ready there, unless we move everything to init. I changed it back 😅
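A pure-Python analogy of this ordering constraint, assuming `__new__` stands in for Cython's `__cinit__` (both run before `__init__`). The `Tracker` and `Block` classes are hypothetical and only illustrate why the callback has to exist before the first weakref is created.

```python
import weakref


class Block:
    """Stand-in for a weak-referenceable pandas Block (hypothetical)."""


class Tracker:
    def __new__(cls, blk=None):
        # Runs first, in the same way __cinit__ runs before __init__ in Cython.
        self = super().__new__(cls)
        self.dead_counter = 0

        def _weakref_cb(item, selfref=weakref.ref(self)):
            tracker = selfref()
            if tracker is not None:
                tracker.dead_counter += 1

        self._weakref_cb = _weakref_cb
        # The first weakref is created here, so the callback must already exist.
        self.referenced_blocks = (
            [] if blk is None else [weakref.ref(blk, _weakref_cb)]
        )
        return self

    def __init__(self, blk=None):
        # Runs second; defining the callback here would be too late for the
        # weakref that __new__ already created.
        self.callback_ready_before_init = hasattr(self, "_weakref_cb")
```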

Hmm OK. And this definitely doesn't cause any memory leaks or issues with reference counting then right?

It should be correct as long as the reference list is not modified manually, bypassing the class's methods, and the code is not run concurrently.

Another thing is that taking the length of the list of refs directly won't give the correct length either (the actual value is len(referenced_blocks) - dead_counter). From what I saw, the exact length is not used anywhere else in the code right now, but I can implement the len method if needed.

If someone modified the list manually, bypassing the class's methods, the counter would be wrong, the memory consumption could grow a lot (just like the current solution on main), and has_reference() would be plain wrong.

Sorry, I meant multi threading

Yes I understood this, but again, this is a realistic scenario

The problem would be that self.dead_counter += 1 isn't atomic. I can add a lock to it but, correct me if I'm wrong, the rest of the code isn't thread-safe either, e.g. the list comprehension in _clear_dead_references doesn't seem to be thread-safe.

Here is a sample of how _clear_dead_references is not really thread safe:

    import threading


    class Test:
        def __init__(self):
            self.v = list(range(1000))

        def add(self, x):
            self.v.append(x)

        def rebuild(self):
            # Same pattern as the list comprehension in _clear_dead_references
            self.v = [x for x in self.v]

        def simulate(self):
            for _ in range(1000):
                self.add(1)
                self.rebuild()

        def race(self):
            threads = [threading.Thread(target=self.simulate) for _ in range(10)]
            for t in threads:
                t.start()
            for t in threads:
                t.join()
            print(len(self.v))


    test1 = Test()
    test2 = Test()
    test1.race()
    test2.race()

I couldn't reproduce the same with pandas code, but I think it's more about timing than something else. The point is, is this class supposed to be thread-safe?
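For reference, the lock mentioned above could look roughly like the following. This is only a sketch: `LockedCounter` and `run_threads` are hypothetical names, and it covers just the counter increment, not the list rebuild in _clear_dead_references.

```python
import threading


class LockedCounter:
    """Hypothetical counter whose increment is guarded by a lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self.value = 0

    def increment(self):
        # The read-modify-write of `+=` happens entirely under the lock.
        with self._lock:
            self.value += 1


def run_threads(counter, n_threads=4, n_increments=5_000):
    """Increment the counter from several threads and return the final value."""

    def worker():
        for _ in range(n_increments):
            counter.increment()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value
```

Weakref callbacks can fire at arbitrary points, so a real implementation would also have to consider a callback running while the same thread already holds the lock. The sketch ignores that case.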

Nope that works, was just not sure if I was missing something obvious

wangwillian0 · 2023-12-01T15:25:30Z

Any news?

mroeschke · 2023-12-01T18:50:58Z

Personally I would be more partial to exploring #55631 (IMO I think the mental model of using a set would be simpler)

phofl · 2023-12-01T18:55:20Z

Can you run the ctors.py benchmarks?

wangwillian0 · 2023-12-05T01:58:15Z

Sorry for the delay. Here are the benchmarks:

| Change | Before [`593fa85`] | After [`98addad`] | Ratio | Benchmark (Parameter) |
|---|---|---|---|---|
| + | 5.74±0.2ms | 7.37±0.2ms | 1.29 | frame_ctor.FromArrays.time_frame_from_arrays_int |
| + | 7.65±0.08ms | 8.63±0.2ms | 1.13 | frame_ctor.FromArrays.time_frame_from_arrays_sparse |
| + | 1.43±0.07ms | 1.47±0.06ms | 1.03 | frame_ctor.FromRecords.time_frame_from_records_generator(1000) |

phofl · 2023-12-05T23:20:22Z

How does this correspond to the other timings in the issue?

wangwillian0 · 2023-12-06T03:07:19Z

> How does this correspond to the other timings in the issue?

Not sure if I fully understand what is being asked, but this ~20% increase comes from the overhead in the constructor. Where this PR does better is mainly has_reference, which will be O(1) regardless of the number of references. Insertions should also be a little faster because there are no periodic calls to the cleaning method, but this is probably not meaningful.
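To make the O(1) versus O(n) distinction concrete, here is a small sketch contrasting a scan-based check with a counter-based one. All names are hypothetical, the `> 1` threshold is an assumption, and the scan-based version is only an approximation of a "walk the list" approach, not a copy of the code on main.

```python
import weakref


class Block:
    """Stand-in for a weak-referenceable pandas Block (hypothetical)."""


class DeadTally:
    """Callable used as the weakref callback; counts referents that died."""

    def __init__(self):
        self.dead = 0

    def __call__(self, ref):
        self.dead += 1


def has_reference_scan(refs):
    """O(n): walk the whole list and count the weakrefs that are still alive."""
    live = [ref for ref in refs if ref() is not None]
    return len(live) > 1


def has_reference_counted(refs, tally):
    """O(1): subtract the callback-maintained dead count from the length."""
    return len(refs) - tally.dead > 1
```

Both functions give the same answer as long as every weakref in the list was created with the tally as its callback. The counted version never touches the list elements.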

wangwillian0 · 2023-12-13T14:05:48Z

Updates?

github-actions · 2024-01-13T00:05:53Z

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

Co-authored-by: José Lucas Silva Mayer <[email protected]>

wangwillian0 · 2024-01-15T04:46:16Z

Hey @phofl, what is the current status of this PR?

wangwillian0 · 2024-02-08T01:24:28Z

@phofl?

wangwillian0 · 2024-02-17T21:11:56Z

Hi, @phofl, can you give an update?

mroeschke · 2024-08-19T17:11:48Z

Thanks for the PR, but it doesn't seem there is much bandwidth or interest from the core team to pursue this approach, so closing this out for now. Can reopen if there's renewed interest.

wangwillian0 requested a review from WillAyd as a code owner October 16, 2023 01:02

wangwillian0 mentioned this pull request Oct 16, 2023

CoW: Clear dead references every time we add a new one #55008

Merged

5 tasks

wangwillian0 force-pushed the weakref-callback branch 2 times, most recently from 9cc2a3e to e47fb58 Compare October 16, 2023 03:04

WillAyd reviewed Oct 16, 2023


wangwillian0 force-pushed the weakref-callback branch 4 times, most recently from 6eb63ce to c0f401d Compare November 5, 2023 19:24

WillAyd reviewed Nov 6, 2023


wangwillian0 force-pushed the weakref-callback branch 2 times, most recently from 583b908 to c1b9b83 Compare November 17, 2023 22:16

wangwillian0 force-pushed the weakref-callback branch from c1b9b83 to 98addad Compare December 5, 2023 01:18

github-actions bot added the Stale label Jan 13, 2024

CoW: Use weakref callbacks to track dead references

3c79e28

Co-authored-by: José Lucas Silva Mayer <[email protected]>

wangwillian0 force-pushed the weakref-callback branch from 98addad to 3c79e28 Compare January 15, 2024 00:37

simonjayhawkins added the Regression (functionality that used to work in a prior pandas version) label Feb 8, 2024

simonjayhawkins added the Needs Review label Feb 8, 2024

mroeschke closed this Aug 19, 2024



