Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Add on_bad_lines for pyarrow #54643

Merged
merged 19 commits into from
Sep 22, 2023

Conversation

amithkk
Copy link
Contributor

@amithkk amithkk commented Aug 19, 2023

This adds the on_bad_lines argument to the pyarrow engine for the from_csv parser that closely follows the behaviour of the python engine. Internally utilizes pyarrow's invalid_row_handler. The built-in callable implementation slightly differs for pyarrow, so the difference is appropriately documented by pointing to pyarrow's documentation.

Usage Example:

example.csv:

a,b,c
acol1,bcol1,ccol1
acol2,ccol2

example.py

df_arrow = pd.read_csv(r"example.csv", engine='pyarrow', dtype_backend='pyarrow', on_bad_lines='warn')
print(df_arrow)

Console output

ParserWarning: Expected 3 columns, but found 2: acol2,ccol2
       a      b      c
0  acol1  bcol1  ccol1

@amithkk amithkk force-pushed the feat-pyarrow-bad-lines branch from 8b760eb to 62c873b Compare August 20, 2023 11:56
@lithomas1
Copy link
Member

Does this fix any tests?

There are some xfailed tests for on_bad_lines I think somewhere (sorry that I don't remember where) for the pyarrow engine.

@amithkk
Copy link
Contributor Author

amithkk commented Aug 21, 2023

@lithomas1 The tests have been refactored to not fail for both the "python" engine and the "pyarrow" engine (see: pandas/tests/io/parser/test_unsupported.py)

I didn't find any other tests that were invalidated (as confirmed by the CI).

@mroeschke mroeschke added IO CSV read_csv, to_csv Arrow pyarrow functionality labels Aug 21, 2023
@amithkk
Copy link
Contributor Author

amithkk commented Aug 21, 2023

@lithomas1 btw, is there any specific action I should be taking to request a review?

@amithkk amithkk force-pushed the feat-pyarrow-bad-lines branch from 4ee6141 to bee7351 Compare August 22, 2023 14:28
@amithkk
Copy link
Contributor Author

amithkk commented Aug 22, 2023

Rebased from master to pull in compat workaround so that CI Passes

@amithkk amithkk requested a review from mroeschke August 22, 2023 16:29
@mroeschke
Copy link
Member

Can you add a whatsnew entry to 2.2.0.rst?

@lithomas1
Copy link
Member

lithomas1 commented Aug 22, 2023

@lithomas1 The tests have been refactored to not fail for both the "python" engine and the "pyarrow" engine (see: pandas/tests/io/parser/test_unsupported.py)

I didn't find any other tests that were invalidated (as confirmed by the CI).

Try looking in pandas/tests/io/parser/common/test_read_errors.py

Those tests are currently skippped, but on_bad_lines is being used there. You should be able to make at least some of them pass.

(You can try xfailing the tests there by changing the "pyarrow_skip" to a "pyarrow_xfail")

@amithkk
Copy link
Contributor Author

amithkk commented Aug 23, 2023

@lithomas1 I've gotten all the tests within test_read_errors.py applicable for this change to pass.

@amithkk amithkk requested a review from mroeschke August 23, 2023 21:11
@amithkk
Copy link
Contributor Author

amithkk commented Aug 25, 2023

@lithomas1 Are there any other changes that you'd like me to look at from the test perspective?

@amithkk amithkk requested a review from lithomas1 August 25, 2023 20:51
@amithkk amithkk requested a review from mroeschke August 26, 2023 19:48
@amithkk
Copy link
Contributor Author

amithkk commented Aug 31, 2023

Any updates on this PR?

@mroeschke mroeschke added this to the 2.2 milestone Sep 1, 2023
Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM cc @lithomas1 if you have any comments

@amithkk
Copy link
Contributor Author

amithkk commented Sep 8, 2023

@lithomas1 Anything else that needs to be covered in this pull request?

Copy link
Member

@lithomas1 lithomas1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the ping. I left some comments.

pandas/tests/io/parser/common/test_read_errors.py Outdated Show resolved Hide resolved
pandas/tests/io/parser/common/test_read_errors.py Outdated Show resolved Hide resolved
@amithkk amithkk requested a review from lithomas1 September 11, 2023 20:57
@amithkk
Copy link
Contributor Author

amithkk commented Sep 11, 2023

@lithomas1 I've chained the ArrowInvalid exception with ParserError so that the tests are uniform. Does this match what you had in mind?

@amithkk
Copy link
Contributor Author

amithkk commented Sep 20, 2023

@lithomas1 Awaiting your review

Copy link
Member

@lithomas1 lithomas1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me now. Thanks!

@mroeschke mroeschke merged commit d9a2ad1 into pandas-dev:main Sep 22, 2023
32 of 33 checks passed
@mroeschke
Copy link
Member

Thanks @amithkk

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Arrow pyarrow functionality IO CSV read_csv, to_csv
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: Pandas 2.0 with pyarrow engine add the argument like 'skip_bad_lines=True'
3 participants