Ensure HDFStore read gives column-major data with CoW #55743

Merged


jorisvandenbossche
Member

Fixing a failing test that turned up in #55732.
This PR specifically fixes pandas/tests/io/pytables/test_store.py::test_hdfstore_strides, which tests that the data coming back from HDF are column-major. The "idea" of the test was to ensure the layout is preserved, i.e. when starting with a DataFrame that is column-major, you should get back one with the same layout. But in practice we don't do that: we simply always return column-major data because concat makes a copy by default (see #22073 (comment)), and the test only covers the column-major case.

We have run into this issue before with pandas/tests/io/pytables/test_store.py::test_hdfstore_strides: in an earlier PR, @phofl initially just skipped the test for CoW. But in the final version of that PR, we removed that skip in favor of adding copy=True inside the concat call in HDFStore.read.
However, we later changed concat to ignore the copy keyword altogether when CoW is enabled, even when copy=True is passed explicitly (done in #51464). Because of that, this test is failing again. This wasn't caught in our CI because the test was in the meantime moved to "single_cpu", which we don't run on the CoW CI builds.

This PR copies manually inside HDFStore.read when CoW is enabled to achieve the same end result as before. The options I see are:

  1. Copy manually in HDFStore.read to ensure column-major data (i.e. current version of this PR)
  2. Don't copy, and live with HDFStore returning row-major data (and update the test to reflect that for CoW). This gives a faster read, but lower performance afterwards (probably not the best option)
  3. Change pd.concat(..) again to honor the copy=True option, even in the case of CoW
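
To make the layout trade-off between options 1 and 2 concrete, here is a minimal NumPy-only sketch. It is not the code in this PR: the array `row_major` is a hypothetical stand-in for a block of values read back in row-major order, and `np.asfortranarray` stands in for the manual copy. It only shows why the test compares strides of a single column.

```python
import numpy as np

# Hypothetical stand-in for a 4-row, 2-column block read back in row-major (C) order.
row_major = np.arange(8, dtype="int64").reshape(4, 2)

# In row-major layout, consecutive values of one column are a full row apart
# in memory: 2 columns * 8 bytes = 16 bytes.
col_before = row_major[:, 0]
print(row_major.flags["F_CONTIGUOUS"], col_before.strides)

# Copying into column-major (Fortran) order makes each column contiguous,
# so the stride within a column drops to the item size (8 bytes).
column_major = np.asfortranarray(row_major)
col_after = column_major[:, 0]
print(column_major.flags["F_CONTIGUOUS"], col_after.strides)

# The values are unchanged; only the memory layout differs.
same_values = np.array_equal(row_major, column_major)
```

The copy costs time once at read time, while contiguous columns make later column-wise operations cheaper, which is the reasoning behind preferring option 1 over option 2.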

@mroeschke mroeschke added this to the 2.2 milestone Oct 30, 2023
@mroeschke mroeschke merged commit ceee19b into pandas-dev:main Oct 30, 2023
36 of 37 checks passed
@mroeschke
Member

Thanks @jorisvandenbossche

@jorisvandenbossche jorisvandenbossche deleted the cow-hdfstore-read-layout branch November 5, 2023 22:13