TST (string dtype): resolve all xfails in IO parser tests #60321
@@ -260,8 +257,12 @@ def test_warn_if_chunks_have_mismatched_type(all_parsers):
"Specify dtype option on import or set low_memory=False.",
buf,
)

assert df.a.dtype == object
if parser.engine == "c" and parser.low_memory:
Shouldn't `low_memory` still be using the proper data type? Or why would that stick to object?
I am not super familiar with the parser code, but I think that with the low-memory parser, parsing is done in chunks, so if the inference changes later on, you end up with chunks of different types and get object dtype as a result.
In the test here, we have a column with mostly integers and only a few strings in the middle. With the default parser, it decides based on the values in the full column that the dtype should be string. Chunk by chunk, however, you get some chunks as integer and some as string.
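To make the explanation above concrete, here is a minimal sketch of how per-chunk inference can disagree with whole-column inference. It is a toy model only: `infer_dtype`, `infer_chunked`, the chunk size, and the "fall back to object when chunks disagree" rule are hypothetical stand-ins for illustration, not the actual pandas parser code.

```python
# Toy model of per-chunk type inference; this is NOT pandas' parser code.
# All function names and the fallback rule are assumptions for illustration.

def infer_dtype(values):
    """Return "int64" if every value parses as an integer, else "string"."""
    try:
        for v in values:
            int(v)
    except ValueError:
        return "string"
    return "int64"


def infer_chunked(values, chunk_size):
    """Infer each chunk separately; report "object" if the chunks disagree."""
    chunk_dtypes = {
        infer_dtype(values[i : i + chunk_size])
        for i in range(0, len(values), chunk_size)
    }
    if len(chunk_dtypes) == 1:
        return chunk_dtypes.pop()
    return "object"


# Mostly integers, with a few strings in the middle.
column = [str(i) for i in range(10)] + ["a", "b"] + [str(i) for i in range(10)]

whole = infer_dtype(column)                    # one pass over the full column
chunked = infer_chunked(column, chunk_size=5)  # five chunks, inferred separately
```

Under these assumptions, `whole` comes out as `"string"` because the full column contains non-integer values, while `chunked` comes out as `"object"` because only the middle chunk sees the strings and the other chunks are inferred as integers.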
Ah I see...yea that's a weird one
Owee, I'm MrMeeseeks, Look at me. There seem to be a conflict, please backport manually. Here are approximate instructions:
And apply the correct labels and milestones. Congratulations, you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon! Remember to remove the If these instructions are inaccurate, feel free to suggest an improvement.
Will backport this one too
…in IO parser tests (cherry picked from commit ee3c18f)
Manual backport -> #60330
#60330)
* Backport PR #60321: TST (string dtype): resolve all xfails in IO parser tests (cherry picked from commit ee3c18f)
* BUG: Avoid RangeIndex conversion in read_csv if dtype is specified (#59316)
Co-authored-by: Joris Van den Bossche <[email protected]>
Co-authored-by: Matthew Roeschke <[email protected]>
There are two remaining xfails: one related to invalid unicode (which errors when using the pyarrow-backed string dtype, so we should probably fall back to object dtype for that case), and another about specifying the `names` keyword with the pyarrow engine giving object-dtype columns.
xref #54792