BUG: read_csv not respecting object dtype when option is set #56047
Can we get this one in?
@@ -1846,7 +1851,29 @@ def read(self, nrows: int | None = None) -> DataFrame:
    else:
        new_rows = len(index)

    df = DataFrame(col_dict, columns=columns, index=index)
    if hasattr(self, "orig_options"):
        dtype_arg = self.orig_options.get("dtype", None)
Is the dtype option normally applied in _engine.read? Just curious why it needs to be done here.
Yes, but the DataFrame constructor infers object to string again if the option is set, which would discard the original dtype
OK, makes sense.
Could we defer looping over col_dict if dtype isn't specified to be object-like?
Updated, only doing this now if we have a dict or object dtype
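For readers following along, the guard described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper names `is_object_like` and `columns_to_pin` are hypothetical, dtypes are modelled with plain Python values and strings instead of pandas/NumPy dtype objects, and treating `str` as object-like is an assumption based on `read_csv(dtype=str)` producing object columns.

```python
from __future__ import annotations

# Hypothetical stand-ins; the real code would compare pandas/NumPy dtypes.
OBJECT_LIKE_NAMES = {"object", "O", "str"}


def is_object_like(dtype) -> bool:
    """Return True if the user explicitly asked for object (or str) dtype."""
    if dtype is object or dtype is str:
        return True
    return isinstance(dtype, str) and dtype in OBJECT_LIKE_NAMES


def columns_to_pin(dtype_arg, columns) -> dict:
    """Return {column: dtype} for columns whose object dtype must be preserved.

    An empty dict means no per-column work is needed, so the caller can
    skip looping over col_dict entirely.
    """
    if isinstance(dtype_arg, dict):
        # Per-column request: keep only the object-like entries.
        return {
            col: dtype_arg[col]
            for col in columns
            if col in dtype_arg and is_object_like(dtype_arg[col])
        }
    if dtype_arg is not None and is_object_like(dtype_arg):
        # Scalar request: applies to every column.
        return {col: dtype_arg for col in columns}
    return {}
```

With this shape, the common case (`dtype=None` or a non-object scalar) returns an empty dict and pays no per-column cost.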
@@ -295,18 +295,8 @@ def read(self) -> DataFrame:
    dtype_mapping[pa.null()] = pd.Int64Dtype()
    frame = table.to_pandas(types_mapper=dtype_mapping.get)
elif using_pyarrow_string_dtype():
These mappers don't work, arrow supports type -> type not column -> type
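A toy illustration of the mismatch, in case it helps: a `types_mapper` is called with a source type and returns a target type (or `None` for the default), so a dict keyed by column name never matches anything. The function `apply_types_mapper` below is a hypothetical model, not pyarrow's `to_pandas`, and types are plain strings used as stand-ins.

```python
def apply_types_mapper(schema, types_mapper):
    """Toy model of applying a type -> type mapper to a table schema.

    `schema` maps column name -> source type. The mapper only ever sees
    the source type, never the column name.
    """
    result = {}
    for column, source_type in schema.items():
        mapped = types_mapper(source_type)
        # None means "fall back to the default conversion" in this model.
        result[column] = mapped if mapped is not None else source_type
    return result


schema = {"a": "string", "b": "string", "c": "int64"}

by_type = {"string": "object"}   # type -> type: works
by_column = {"a": "object"}      # column -> type: keys never match a type

converted_by_type = apply_types_mapper(schema, by_type.get)
converted_by_column = apply_types_mapper(schema, by_column.get)
```

In this model the type-keyed mapper converts every string column, while the column-keyed dict leaves the schema unchanged, which is why per-column dtypes have to be handled outside the mapper.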
cc @mroeschke gentle ping
Thanks @phofl
@@ -1846,7 +1853,40 @@ def read(self, nrows: int | None = None) -> DataFrame:
    else:
        new_rows = len(index)

    df = DataFrame(col_dict, columns=columns, index=index)
    if hasattr(self, "orig_options"):
can we do something more explicit than a hasattr check?
no, you can subclass the reader, so we don't have any control over it
does anybody actually do this? i judge those people, their ethics, and their hygiene.
That's something I can't answer, we might want to deprecate maybe, but we are stuck with hasattr here until then
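To make the subclassing concern concrete, here is a small sketch of how the attribute can go missing. Both classes are hypothetical toys, not the pandas reader; the point is only that a subclass overriding `__init__` without calling the base initializer never sets `orig_options`, so the lookup has to be defensive.

```python
class BaseReader:
    """Toy reader; `orig_options` is only set in __init__."""

    def __init__(self, **options):
        self.orig_options = options

    def requested_dtype(self):
        # getattr with a default behaves like the hasattr guard:
        # a missing attribute is treated as "no options given".
        options = getattr(self, "orig_options", {})
        return options.get("dtype", None)


class CustomReader(BaseReader):
    """Hypothetical third-party subclass that skips the base __init__."""

    def __init__(self, path):
        self.path = path  # never calls super().__init__()


base_dtype = BaseReader(dtype=object).requested_dtype()
custom_dtype = CustomReader("data.csv").requested_dtype()
```

Here `base_dtype` is `object` and `custom_dtype` is `None`; without the guard, the second call would raise `AttributeError`.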
doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.
We are not honouring object dtype here. Thoughts on performance, @jbrockmendel?