BUG: Read CSV on python engine fails when skiprows and chunk size are specified (#55677, #56323) #56250

Flytre · 2023-11-30T02:24:35Z

- Added support in the python parser for using the skiprows and chunksize options together, so that the API contract is met.

- Added a regression test so that #55677 is caught quickly if it reappears.

- Fixed a flawed test case, which now screens for #56323 regressions.

[✅] closes BUG: TextFileReader from read_csv() fails to iterate when using a callable skiprows and engine='python' #55677
[✅] closes BUG: Python Engine does not respect chunksize when skiprows is specified as a nonempty list. #56323
[✅] Tests added and passed if fixing a bug or adding a new feature
[✅] All code checks passed.
[✅] Added type annotations to new arguments/methods/functions.
[✅] Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Added support for the python parser to handle using skiprows and chunk_size options at the same time to ensure API contract is met. Added a regression test to ensure this bug can be quickly caught in the future if it reappears.

Flytre · 2023-11-30T02:32:03Z

pre-commit.ci autofix

for more information, see https://pre-commit.ci

Made changes consistent with mypy checking

Made changes consistent with mypy checking and pre-commit

WillAyd

minor comments but I think this looks pretty reasonable. thanks!

WillAyd · 2023-12-01T05:05:53Z

pandas/io/parsers/python_parser.py

+                                row_index += 1
+                                new_rows.append(new_row)
+                        else:
+                            # Maintain legacy chunking behavior


We aren't planning on removing this branch - it just serves the non-callable case right? If so I think this comment is a bit confusing so can just be removed

There's some weird behavior in this branch, and I'd argue for removing it.

Consider this csv file:

    col_a
    0
    1
    2
    3
    4
    5
    6
    7
    8
    9

If we read this file in via read_csv:

text_file_reader = pd.read_csv("dummy.csv", engine='$ENGINE', skiprows=[1, 2, 3, 7, 10], chunksize=2)

With the python engine we get the following result:

       col_a
    0      3
    1      4

       col_a
    2      5
    3      7
    4      8

With c engine we get a different result:

       col_a
    0      3
    1      4

       col_a
    2      5
    3      7

       col_a
    4      8

With the python engine and skiprows=lambda x: x in [1, 2, 3, 7, 10] (this PR adds this behavior; with these parameters pandas currently raises an exception):

       col_a
    0      3
    1      4

       col_a
    2      5
    3      7

       col_a
    4      8

If we remove the entire else branch and use the logic this PR adds for the 'callable' case, the python engine will match the c engine result in both cases, whether given a list or callable skiprows.

Thoughts? If you'd like to go ahead with the proposed update, should I create a separate issue for it and also link it to this PR?

In that case I would go ahead and do your proposed change now. If it makes the behavior exactly match the c engine might as well do the fix all at once
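
The chunking behavior agreed on in this thread can be modelled outside pandas. The following is a minimal sketch under stated assumptions, not the code in `python_parser.py`: the names `iter_chunks` and `_make_skip_test` are hypothetical, rows are plain strings, and the first kept row is treated as the header. It only illustrates the idea that skipped rows should not count toward `chunksize`, so a list and an equivalent callable `skiprows` yield identical chunks.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Union

# Hypothetical alias for illustration; pandas accepts more forms than this.
SkipRows = Union[Iterable[int], Callable[[int], bool]]


def _make_skip_test(skiprows: SkipRows) -> Callable[[int], bool]:
    """Normalise a list-like or callable skiprows into a single predicate."""
    if callable(skiprows):
        return skiprows
    skip_set = frozenset(skiprows)
    return lambda row_number: row_number in skip_set


def iter_chunks(
    lines: Iterable[str], skiprows: SkipRows, chunksize: int
) -> Iterator[List[str]]:
    """Yield lists of at most `chunksize` kept data rows.

    Row numbers are 0-based positions in the file, so row 0 is the header.
    """
    if chunksize < 1:
        raise ValueError("chunksize must be a positive integer")
    should_skip = _make_skip_test(skiprows)
    kept = (
        line for row_number, line in enumerate(lines) if not should_skip(row_number)
    )
    next(kept, None)  # consume the header (the first row that is not skipped)
    while True:
        chunk = list(islice(kept, chunksize))
        if not chunk:
            return
        yield chunk


# The file from the example above: a header followed by the values 0..9.
lines = ["col_a"] + [str(i) for i in range(10)]
skip = [1, 2, 3, 7, 10]

from_list = list(iter_chunks(lines, skip, chunksize=2))
from_callable = list(iter_chunks(lines, lambda x: x in skip, chunksize=2))

print(from_list)  # [['3', '4'], ['5', '7'], ['8']]
```

Both calls produce chunks of sizes 2, 2 and 1, which corresponds to the c engine output quoted above.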

WillAyd · 2023-12-01T05:08:51Z

pandas/tests/io/parser/test_skiprows.py

+    df2 = next(reader)
+
+    tm.assert_frame_equal(
+        df1, DataFrame({"col_a": [20, 30, 60, 70]}, index=[0, 1, 2, 3])


nit but you don't need to specify index here; will help condense this to one line

mattharrison · 2023-12-01T15:42:48Z

Does this support the pyarrow engine as well?

Flytre · 2023-12-03T19:29:34Z

> Does this support the pyarrow engine as well?

pyarrow doesn't support chunksize, so it shouldn't be affected at all. This PR only modifies the python engine.

Flytre · 2023-12-03T19:37:40Z

Hey Will! Left a couple comments I want you to check out and then I'll implement the changes the way you'd prefer from there.

This commit:

- Fixes GH 56323 by replacing the python engine chunksize logic
- Fixes formatting on the added test_skiprows test case
- Fixes incorrect test in read_fwf that expected an output chunk of size 3 when chunksize=2 was specified.

Sync with main branch of pandas: failing CI/CD due to test updates.

Flytre · 2023-12-04T19:16:52Z

Passes test, looking for feedback / to merge if ready.

WillAyd

lgtm @mroeschke any thoughts?

mattharrison · 2023-12-04T21:30:51Z

Would love to see pyarrow support for this

mroeschke · 2023-12-05T00:27:43Z

Thanks @Flytre

Flytre and others added 2 commits November 29, 2023 21:09

Fix -GH 55677:

6cd0188

Added support for the python parser to handle using skiprows and chunk_size options at the same time to ensure API contract is met. Added a regression test to ensure this bug can be quickly caught in the future if it reappears.

Fix -GH 55677:

9233273

Added support for the python parser to handle using skiprows and chunk_size options at the same time to ensure API contract is met. Added a regression test to ensure this bug can be quickly caught in the future if it reappears.

pre-commit-ci bot and others added 5 commits November 30, 2023 02:34

[pre-commit.ci] auto fixes from pre-commit.com hooks

35a6929

for more information, see https://pre-commit.ci

Fix -GH 55677:

5469a47

Added support for the python parser to handle using skiprows and chunk_size options at the same time to ensure API contract is met. Added a regression test to ensure this bug can be quickly caught in the future if it reappears.

Merge remote-tracking branch 'origin/fix_55677_2' into fix_55677_2

b887383

Fix -GH 55677:

7d0b165

Made changes consistent with mypy checking

Fix -GH 55677:

0808458

Made changes consistent with mypy checking and pre-commit

WillAyd reviewed Dec 1, 2023

Flytre requested a review from WillAyd December 3, 2023 19:36

Fix -GH 55677 & 56323:

c3b330e

This commit:

- Fixes GH 56323 by replacing the python engine chunksize logic
- Fixes formatting on the added test_skiprows test case
- Fixes incorrect test in read_fwf that expected an output chunk of size 3 when chunksize=2 was specified.

Flytre changed the title from "BUG: Read CSV on python engine fails with callable skiprows and chunk size specified (#55677)" to "BUG: Read CSV on python engine fails when skiprows and chunk size are specified (#55677, #56323)" Dec 4, 2023

Flytre and others added 3 commits December 4, 2023 12:51

Trigger CI

265bd1e

Merge remote-tracking branch 'upstream/main' into fix_55677_2

c5a6fdb

Sync with main branch of pandas: failing CI/CD due to test updates.

Merge branch 'main' into fix_55677_2

f870732

mroeschke added the IO CSV read_csv, to_csv label Dec 4, 2023

WillAyd approved these changes Dec 4, 2023

mroeschke approved these changes Dec 5, 2023

mroeschke added this to the 2.2 milestone Dec 5, 2023

mroeschke merged commit 593fa85 into pandas-dev:main Dec 5, 2023
48 checks passed
