BUG: Read CSV on python engine fails when skiprows and chunk size are specified (#55677, #56323) #56250
Added support for the python parser to handle the skiprows and chunksize options together so that the API contract is met. Added a regression test so that this bug is caught quickly if it reappears.
pre-commit.ci autofix
for more information, see https://pre-commit.ci
Made changes consistent with mypy checking
Made changes consistent with mypy checking and pre-commit
minor comments but I think this looks pretty reasonable. thanks!
pandas/io/parsers/python_parser.py
        row_index += 1
        new_rows.append(new_row)
else:
    # Maintain legacy chunking behavior
We aren't planning on removing this branch; it just serves the non-callable case, right? If so, I think this comment is a bit confusing and can just be removed.
There's some weird behavior in this branch, and I'd argue to remove it.
Consider this csv file:
col_a
0
1
2
3
4
5
6
7
8
9
If we read this file in via read_csv:
text_file_reader = pd.read_csv("dummy.csv",
                               engine='$ENGINE',
                               skiprows=[1, 2, 3, 7, 10],
                               chunksize=2)
With the python engine we get the following result:
col_a
0 3
1 4
col_a
2 5
3 7
4 8
With the c engine we get a different result:
col_a
0 3
1 4
col_a
2 5
3 7
col_a
4 8
With the python engine and skiprows=lambda x: x in [1, 2, 3, 7, 10] (this PR adds this behavior; currently pandas raises an exception with these parameters):
col_a
0 3
1 4
col_a
2 5
3 7
col_a
4 8
If we remove the entire else branch and use the logic this PR adds for the 'callable' case, the python engine will match the c engine result in both cases, whether given a list or callable skiprows.
Thoughts? If you'd like to go ahead with the proposed update, should I create a separate issue for it and also link it to this PR?
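A minimal sketch for reproducing the comparison above without a file on disk, assuming pandas is installed. `CSV_TEXT` and `read_chunks` are names made up for this illustration, and the in-memory buffer stands in for dummy.csv. Only the c engine is run here, since the python engine's output depends on whether the installed pandas version includes this change.

```python
import io

import pandas as pd

# Same content as dummy.csv above: a header line followed by the rows 0..9.
CSV_TEXT = "col_a\n" + "\n".join(str(i) for i in range(10)) + "\n"


def read_chunks(engine, skiprows, chunksize=2):
    """Return the col_a values of every chunk as a list of lists (illustrative helper)."""
    reader = pd.read_csv(
        io.StringIO(CSV_TEXT),
        engine=engine,
        skiprows=skiprows,
        chunksize=chunksize,
    )
    return [chunk["col_a"].tolist() for chunk in reader]


# c engine with a list of line numbers to skip (line 0 is the header).
c_chunks = read_chunks("c", [1, 2, 3, 7, 10])
print(c_chunks)
```

Calling `read_chunks("python", ...)` with the same list, or with `lambda x: x in [1, 2, 3, 7, 10]`, gives the other two cases to compare against `c_chunks`.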
In that case I would go ahead and do your proposed change now. If it makes the behavior exactly match the c engine, we might as well do the fix all at once.
df2 = next(reader)

tm.assert_frame_equal(
    df1, DataFrame({"col_a": [20, 30, 60, 70]}, index=[0, 1, 2, 3])
nit but you don't need to specify index here; will help condense this to one line
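A short sketch of why the explicit index is redundant, assuming pandas is installed and using the public `pd.testing.assert_frame_equal` in place of the internal `tm` helper. The names `explicit` and `implicit` are illustrative only.

```python
import pandas as pd

# Frame built the way the test currently does it, with an explicit index.
explicit = pd.DataFrame({"col_a": [20, 30, 60, 70]}, index=[0, 1, 2, 3])

# Frame built without an index; pandas assigns 0..3 by default.
implicit = pd.DataFrame({"col_a": [20, 30, 60, 70]})

# With default settings the comparison treats the two indexes as equivalent,
# so this does not raise.
pd.testing.assert_frame_equal(explicit, implicit)
```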
Does this support the pyarrow engine as well?
pyarrow doesn't support chunksize, so it shouldn't be affected at all. This PR only modifies the python engine.
Hey Will! Left a couple comments I want you to check out and then I'll implement the changes the way you'd prefer from there.
This commit:
-Fixes GH 56323 by replacing the python engine chunksize logic
-Fixes formatting on the added test_skiprows test case
-Fixes incorrect test in read_fwf that expected an output chunk of size 3 when chunksize=2 was specified.
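The chunking behavior this commit aims for can be sketched in plain Python. This is not the code in pandas/io/parsers/python_parser.py; `iter_chunks` and its parameters are hypothetical names, and the sketch only illustrates the idea that skipped lines never count toward a chunk, so every chunk except possibly the last holds exactly chunksize rows.

```python
def iter_chunks(data_rows, skiprows, chunksize, first_row_index=1):
    """Yield lists of at most `chunksize` rows, ignoring skipped rows.

    `skiprows` is either a collection of line numbers or a callable that
    takes a line number and returns True when that line should be skipped.
    `first_row_index` is 1 because line 0 is assumed to be the header.
    """
    if callable(skiprows):
        should_skip = skiprows
    else:
        skip_set = set(skiprows)
        should_skip = lambda index: index in skip_set

    chunk = []
    for row_index, row in enumerate(data_rows, start=first_row_index):
        if should_skip(row_index):
            continue  # skipped lines do not take up a slot in the chunk
        chunk.append(row)
        if len(chunk) == chunksize:
            yield chunk
            chunk = []
    if chunk:
        yield chunk  # final, possibly shorter, chunk


rows = [str(i) for i in range(10)]
list_result = list(iter_chunks(rows, [1, 2, 3, 7, 10], chunksize=2))
callable_result = list(iter_chunks(rows, lambda x: x in [1, 2, 3, 7, 10], chunksize=2))
print(list_result)
```

Both the list and the callable form go through the same loop, which is the property the discussion above asks for: the two forms of skiprows yield identical chunks.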
Sync with main branch of pandas: failing CI/CD due to test updates.
Passes test, looking for feedback / to merge if ready.
lgtm @mroeschke any thoughts?
Would love to see pyarrow support for this
Thanks @Flytre
-Added support for the python parser to handle the skiprows and chunksize options together so that the API contract is met.
-Added a regression test so that #55677 is caught quickly if it reappears.
-Fixed a flawed test case that now screens for #56323 regressions.
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.