BUG: Pickling the subset of dataframe isn't zero-copy #55781
Comments
That is surprising. Can you track where the copy is being made? (an example without ray would be helpful)
@jbrockmendel here is the example without ray:
import pandas as pd
import numpy as np
import pickle
df = pd.DataFrame(np.zeros((100, 100)))
subset = df.iloc[:, 0 : 32]
subset_buffers = []
dumped_subset = pickle.dumps(subset, protocol=5, buffer_callback=subset_buffers.append)
# This assertion fails: no buffers are collected, so out-of-band serialization does not happen.
assert len(subset_buffers) != 0
"without ray" could have been more specific: is there a pandas-only reproducer? I don't know cloudpickle any more than I know ray.
@jbrockmendel it's not that easy to get a reproducer with just pandas. However, the problem can also be reproduced with the standard Python pickle module.
Can we patch ndarray.copy to raise/breakpoint?
@jbrockmendel I see the difference at the block level:
pickle.dumps(df._mgr.blocks[0], protocol=5, buffer_callback=subset_buffers.append)
# vs
pickle.dumps(subset._mgr.blocks[0], protocol=5, buffer_callback=subset_buffers.append)
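The block-level difference can also be observed on bare NumPy arrays, without touching pandas internals. The sketch below is an illustration, not the code path pandas itself uses: `values` and `subset_values` are hypothetical stand-ins for the two blocks, assuming the full block is a (100, 100) Fortran-ordered array and the subset block is a view of its first 32 rows, as the shapes and flags reported in this thread suggest. The helper names are made up for this example.

```python
import pickle

import numpy as np


def count_out_of_band_buffers(obj):
    """Pickle obj with protocol 5 and return how many buffers went out-of-band."""
    buffers = []
    pickle.dumps(obj, protocol=5, buffer_callback=buffers.append)
    return len(buffers)


def reduces_to_picklebuffer(arr):
    """Return True if the protocol-5 reduction of arr hands over a PickleBuffer."""
    _, args, *_ = arr.__reduce_ex__(5)
    return any(isinstance(a, pickle.PickleBuffer) for a in args)


# Stand-in for the full block: Fortran-ordered, hence F-contiguous.
values = np.asfortranarray(np.zeros((100, 100)))
# Stand-in for the subset block: a strided view, neither C- nor F-contiguous.
subset_values = values[0:32]

full_count = count_out_of_band_buffers(values)
subset_count = count_out_of_band_buffers(subset_values)

print("full block:  ", full_count, reduces_to_picklebuffer(values))
print("subset block:", subset_count, reduces_to_picklebuffer(subset_values))
```

Only the contiguous array exposes a `PickleBuffer` to the `buffer_callback`; the strided view is serialized in-band as bytes.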
The reason is that the internal numpy array is not a contiguous piece of memory. Do you think it makes sense to call the
(Pdb) subset._mgr.blocks[0].values.__reduce_ex__(5)
(<built-in function _reconstruct>, (<class 'numpy.ndarray'>, (0,), b'b'), (1, (32, 100), dtype('float64'), False, b[bytes]\x
(Pdb) df._mgr.blocks[0].values.__reduce_ex__(5)
(<function _frombuffer at 0x00000273DA6C4C10>, (<pickle.PickleBuffer object at 0x0000027435C65B40>, dtype('float64'), (100, 100), 'F'))
(Pdb) subset._mgr.blocks[0].values.flags
C_CONTIGUOUS : False
F_CONTIGUOUS : False
OWNDATA : False
WRITEABLE : True
ALIGNED : True
WRITEBACKIFCOPY : False
(Pdb) df._mgr.blocks[0].values.flags
C_CONTIGUOUS : False
F_CONTIGUOUS : True
OWNDATA : False
WRITEABLE : True
ALIGNED : True
WRITEBACKIFCOPY : False
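To make the role of contiguity concrete, here is a minimal NumPy-only sketch. It assumes nothing about pandas internals, and the thread does not propose it as the fix. It shows that copying the strided view into a contiguous array with `np.ascontiguousarray` makes the data eligible for out-of-band pickling. This does not make pickling the slice zero-copy: it moves the one unavoidable copy to an explicit step, after which the buffer can be handed over out-of-band. The name `dumps_out_of_band` is a hypothetical helper.

```python
import pickle

import numpy as np


def dumps_out_of_band(arr):
    """Pickle arr with protocol 5, returning the payload and collected buffers."""
    buffers = []
    payload = pickle.dumps(arr, protocol=5, buffer_callback=buffers.append)
    return payload, buffers


base = np.asfortranarray(np.arange(10_000, dtype="float64").reshape(100, 100))
view = base[0:32]  # strided view: neither C- nor F-contiguous

# The strided view is pickled in-band: all data ends up inside the payload.
payload_view, buffers_view = dumps_out_of_band(view)

# One explicit copy produces a C-contiguous array ...
compact = np.ascontiguousarray(view)
# ... whose data can then travel out-of-band, leaving only metadata in the payload.
payload_compact, buffers_compact = dumps_out_of_band(compact)

# Round trip: the buffers must be supplied again when loading.
restored = pickle.loads(payload_compact, buffers=buffers_compact)

print(len(buffers_view), len(payload_view))
print(len(buffers_compact), len(payload_compact))
```

Whether such a copy belongs inside pandas or in the caller is a design question this thread leaves open.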
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
If a subset of a DataFrame could be pickled zero-copy, it would be an excellent optimization for memory use.
It looks like the logic responsible for this behavior is implemented in internals.pyx.
@jbrockmendel I saw that you recently edited code nearby; maybe you know whether and how this can be done?
Expected Behavior
Pickling a subset of a DataFrame is zero-copy.
Installed Versions