BUG: Pickling the subset of dataframe isn't zero-copy #55781

Open
2 of 3 tasks
anmyachev opened this issue Oct 31, 2023 · 7 comments
Labels
Bug IO Pickle read_pickle, to_pickle

Comments

@anmyachev
Contributor

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np
import ray

df = pd.DataFrame(np.zeros((100, 100)))

# Pickling the entire dataframe is zero-copy
try:
    ray.get(ray.put(df)).values[0,0] = 10
except ValueError as err:
    # ValueError: assignment destination is read-only
    print("Zero-copy pickling")
else:
    raise RuntimeError("Not zero-copy pickling")

# Pickling the subset of dataframe isn't zero-copy
subset = df.iloc[:, 0 : 32]

# It looks like when an object is recreated, additional memory is allocated
# Can this be avoided?
ray.get(ray.put(subset)).values[0,0] = 10

Issue Description

If it is possible to zero-copy pickle a subset of a dataframe, this would be an excellent memory optimization.

It looks like the logic responsible for this behavior is implemented in internals.pyx.

@jbrockmendel I saw that you recently edited pieces of code nearby, maybe you know how and whether this can be done?

Expected Behavior

Pickling a subset of a dataframe is zero-copy.

Installed Versions

numpy: 1.25.2
pandas: 2.1.2
ray: 2.7.1
anmyachev added the Bug and Needs Triage labels on Oct 31, 2023
@jbrockmendel
Member

That is surprising. Can you track where the copy is being made? (an example without ray would be helpful)

@anmyachev
Contributor Author

anmyachev commented Nov 2, 2023

@jbrockmendel here is the example without ray:

import pandas as pd
import numpy as np
import pickle

df = pd.DataFrame(np.zeros((100, 100)))

subset = df.iloc[:, 0 : 32]
subset_buffers = []
dumped_subset = pickle.dumps(subset, protocol=5, buffer_callback=subset_buffers.append)
# The assertion below fails: no buffers were collected, so out-of-band serialization was not used
assert len(subset_buffers) != 0
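
For contrast, here is a minimal sketch of what a successful out-of-band round trip looks like. It uses a plain contiguous NumPy array as a stand-in for the data of the full dataframe; the names `arr`, `payload`, and `restored` are illustrative only and not taken from pandas.

```python
import pickle

import numpy as np

# A contiguous array, standing in for the data held by the full dataframe.
arr = np.zeros((100, 100))

# With protocol 5, contiguous array data is handed to buffer_callback
# instead of being copied into the pickle stream.
buffers = []
payload = pickle.dumps(arr, protocol=5, buffer_callback=buffers.append)

# The collected buffers must be passed back when loading.
restored = pickle.loads(payload, buffers=buffers)

print(len(buffers))                     # number of out-of-band buffers
print(np.shares_memory(arr, restored))  # whether the round trip avoided a copy
print(len(payload), arr.nbytes)         # pickle stream size vs. raw data size
```

When the buffer list stays empty, as in the subset case above, the array data has been copied into the pickle stream itself.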

@jbrockmendel
Member

"Without ray" could have been more specific: is there a pandas-only reproducer? I don't know cloudpickle any better than I know ray.

@anmyachev
Contributor Author

@jbrockmendel it's not that easy to get a reproducer with just pandas. However, the problem can also be reproduced with the standard Python pickle module.

@jbrockmendel
Member

Can we patch ndarray.copy to raise/breakpoint?

@anmyachev
Contributor Author

@jbrockmendel I see the difference at the block level:

pickle.dumps(df._mgr.blocks[0], protocol=5, buffer_callback=subset_buffers.append)
# vs
pickle.dumps(subset._mgr.blocks[0], protocol=5, buffer_callback=subset_buffers.append)

@anmyachev
Contributor Author

The reason is that the internal numpy array is not a contiguous piece of memory. Do you think it makes sense to call the numpy.ascontiguousarray function?

(Pdb) subset._mgr.blocks[0].values.__reduce_ex__(5)     
(<built-in function _reconstruct>, (<class 'numpy.ndarray'>, (0,), b'b'), (1, (32, 100), dtype('float64'), False, b[bytes]\x
(Pdb) df._mgr.blocks[0].values.__reduce_ex__(5)     
(<function _frombuffer at 0x00000273DA6C4C10>, (<pickle.PickleBuffer object at 0x0000027435C65B40>, dtype('float64'), (100, 100), 'F'))
(Pdb) subset._mgr.blocks[0].values.flags       
  C_CONTIGUOUS : False
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False

(Pdb) df._mgr.blocks[0].values.flags     
  C_CONTIGUOUS : False
  F_CONTIGUOUS : True
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
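
The contiguity effect can be reproduced with NumPy alone. The sketch below assumes that a transposed C-ordered array is a fair stand-in for the block values shown above (F-contiguous, shape (100, 100)) and that taking its first 32 rows mimics the subset block. The helper `out_of_band_buffer_count` is a hypothetical name, not a pandas or NumPy function.

```python
import pickle

import numpy as np


def out_of_band_buffer_count(arr):
    """Return how many buffers protocol 5 exports out-of-band for arr."""
    buffers = []
    pickle.dumps(arr, protocol=5, buffer_callback=buffers.append)
    return len(buffers)


# Stand-in for df._mgr.blocks[0].values: F-contiguous, shape (100, 100).
full_block = np.zeros((100, 100)).T

# Stand-in for subset._mgr.blocks[0].values: shape (32, 100),
# neither C- nor F-contiguous.
subset_block = full_block[:32]

# np.ascontiguousarray copies the data once into a C-contiguous array.
compact = np.ascontiguousarray(subset_block)

for name, block in [("full", full_block), ("subset", subset_block), ("compact", compact)]:
    print(
        name,
        block.flags.c_contiguous,
        block.flags.f_contiguous,
        out_of_band_buffer_count(block),
    )
```

Note that np.ascontiguousarray makes its own copy of the data here, so it trades one upfront copy for out-of-band pickling afterwards. Whether that trade-off is acceptable inside pandas is still an open question in this thread.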

lithomas1 added the IO Pickle label and removed the Needs Triage label on Dec 22, 2023