Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Read CSV on python engine fails when skiprows and chunk size are specified (#55677, #56323) #56250

Merged
merged 11 commits into from
Dec 5, 2023

Conversation

Flytre
Copy link
Contributor

@Flytre Flytre commented Nov 30, 2023

-Added support for the python parser to handle using skiprows and chunk_size options at the same time to ensure API contract is met.

-Added a regression test to ensure this #55677 can be quickly caught in the future if it reappears.

-Fixed a flawed test case that now screens for #56323 regressions.

Flytre and others added 2 commits November 29, 2023 21:09
Added support for the python parser to handle using skiprows and chunk_size options at the same time to ensure API contract is met.

Added a regression test to ensure this bug can be quickly caught in the future if it reappears.
Added support for the python parser to handle using skiprows and chunk_size options at the same time to ensure API contract is met.

Added a regression test to ensure this bug can be quickly caught in the future if it reappears.
@Flytre
Copy link
Contributor Author

Flytre commented Nov 30, 2023

pre-commit.ci autofix

pre-commit-ci bot and others added 5 commits November 30, 2023 02:34
Added support for the python parser to handle using skiprows and chunk_size options at the same time to ensure API contract is met.

Added a regression test to ensure this bug can be quickly caught in the future if it reappears.
Made changes consistment with mypy checking
Made changes consistment with mypy checking and pre-commit
Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comments but I think this looks pretty reasonable. thanks!

row_index += 1
new_rows.append(new_row)
else:
# Maintain legacy chunking behavior
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We aren't planning on removing this branch - it just serves the non-callable case right? If so I think this comment is a bit confusing so can just be removed

Copy link
Contributor Author

@Flytre Flytre Dec 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's some weird behavior in this branch, and I'd argue to remove it.

Consider this csv file:

col_a
0
1
2
3
4
5
6
7
8
9

If we read this file in via read_csv:

text_file_reader = pd.read_csv("dummy.csv",
                               engine='$ENGINE',
                               skiprows=[1, 2, 3, 7, 10],
                               chunksize=2)

With the python engine we get the following result:

   col_a
0      3
1      4
   col_a
2      5
3      7
4      8

With c engine we get a different result:

   col_a
0      3
1      4
   col_a
2      5
3      7
   col_a
4      8

With the python engine and skiprows=lambda x: x in [1, 2, 3, 7, 10] (this PR adds this behavior, currently with these parameters pandas raises an exception):

   col_a
0      3
1      4
   col_a
2      5
3      7
   col_a
4      8

If we remove the entire else branch and use the logic this PR adds for the 'callable' case, the python engine will match the c engine result in both cases, whether given a list or callable skiprows.

Thoughts? If you'd like to go ahead with the proposed update, should I create a separate issue for it and also link it to this PR?

Copy link
Member

@WillAyd WillAyd Dec 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case I would go ahead and do your proposed change now. If it makes the behavior exactly match the c engine might as well do the fix all at once

df2 = next(reader)

tm.assert_frame_equal(
df1, DataFrame({"col_a": [20, 30, 60, 70]}, index=[0, 1, 2, 3])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit but you don't need to specify index here; will help condense this to one line

@mattharrison
Copy link

Does this support the pyarrow engine as well?

@Flytre
Copy link
Contributor Author

Flytre commented Dec 3, 2023

Does this support the pyarrow engine as well?

pyarrow doesn't support chunksize, so it shouldn't be affected at all. This PR only modifies the python engine.

@Flytre Flytre requested a review from WillAyd December 3, 2023 19:36
@Flytre
Copy link
Contributor Author

Flytre commented Dec 3, 2023

Hey Will! Left a couple comments I want you to check out and then I'll implement the changes the way you'd prefer from there.

This commit:
-Fixes GH 56323 by replacing the python engine chunksize logic
-Fixes formatting on the added test_skiprows test case
-Fixes incorrect test in read_fwf that expected an output chunk of size 3 when chunksize=2 was specified.
@Flytre Flytre changed the title BUG: Read CSV on python engine fails with callable skiprows and chunk size specified (#55677) BUG: Read CSV on python engine fails when skiprows and chunk size are specified (#55677, #56323) Dec 4, 2023
@mroeschke mroeschke added the IO CSV read_csv, to_csv label Dec 4, 2023
@Flytre
Copy link
Contributor Author

Flytre commented Dec 4, 2023

Passes test, looking for feedback / to merge if ready.

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm @mroeschke any thoughts?

@mattharrison
Copy link

mattharrison commented Dec 4, 2023 via email

@mroeschke mroeschke added this to the 2.2 milestone Dec 5, 2023
@mroeschke mroeschke merged commit 593fa85 into pandas-dev:main Dec 5, 2023
48 checks passed
@mroeschke
Copy link
Member

Thanks @Flytre

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO CSV read_csv, to_csv
Projects
None yet
4 participants