BUG: read_csv not respecting object dtype when option is set #56047
@@ -5,7 +5,10 @@
 """
 from __future__ import annotations

-from collections import abc
+from collections import (
+    abc,
+    defaultdict,
+)
 import csv
 import sys
 from textwrap import fill
@@ -23,6 +26,8 @@

 import numpy as np

+from pandas._config import using_copy_on_write
+
 from pandas._libs import lib
 from pandas._libs.parsers import STR_NA_VALUES
 from pandas.errors import (
@@ -38,8 +43,10 @@
     is_float,
     is_integer,
     is_list_like,
+    pandas_dtype,
 )

+from pandas import Series
 from pandas.core.frame import DataFrame
 from pandas.core.indexes.api import RangeIndex
 from pandas.core.shared_docs import _shared_docs
@@ -1846,7 +1853,40 @@ def read(self, nrows: int | None = None) -> DataFrame:
             else:
                 new_rows = len(index)

-            df = DataFrame(col_dict, columns=columns, index=index)
+            if hasattr(self, "orig_options"):
Review comment: Can we do something more explicit than a hasattr check?
Reply: No, you can subclass the reader, so we don't have any control over it.
Reply: Does anybody actually do this? I judge those people, their ethics, and their hygiene.
Reply: That's something I can't answer. We might want to deprecate this, but we are stuck with hasattr here until then.
+                dtype_arg = self.orig_options.get("dtype", None)
Review comment: Is the
Reply: Yes, but the DataFrame constructor infers object to string again if the option is set, which would discard the original dtype.
Reply: OK, makes sense. Could we defer looping over
Reply: Updated; this is now done only if we have a dict or object dtype.
+            else:
+                dtype_arg = None
+
+            if isinstance(dtype_arg, dict):
+                dtype = defaultdict(lambda: None)
+                dtype.update(dtype_arg)
+            elif dtype_arg is not None and pandas_dtype(dtype_arg) in (
+                np.str_,
+                np.object_,
+            ):
+                dtype = defaultdict(lambda: dtype_arg)
+            else:
+                dtype = None
+
+            if dtype is not None:
+                new_col_dict = {}
+                for k, v in col_dict.items():
+                    d = (
+                        dtype[k]
+                        if pandas_dtype(dtype[k]) in (np.str_, np.object_)
+                        else None
+                    )
+                    new_col_dict[k] = Series(v, index=index, dtype=d, copy=False)
+            else:
+                new_col_dict = col_dict
+
+            df = DataFrame(
+                new_col_dict,
+                columns=columns,
+                index=index,
+                copy=not using_copy_on_write(),
+            )

             self._currow += new_rows
         return df
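The dtype handling in this hunk can be read as a two-step lookup: build a per-column mapping with a default, then keep only the dtypes that must survive the DataFrame constructor's inference. The sketch below illustrates that pattern with the standard library only. It is not the pandas implementation: `resolve_dtypes` and `PRESERVED` are hypothetical names, and plain strings stand in for the `pandas_dtype(...) in (np.str_, np.object_)` check.

```python
from collections import defaultdict

# Hypothetical stand-ins for the dtypes the diff checks via pandas_dtype(...);
# plain strings keep this sketch free of a pandas dependency.
PRESERVED = {"object", "str"}


def resolve_dtypes(dtype_arg, column_names):
    """Return {column: dtype to force, or None to leave inference alone}.

    Returns None when there is nothing to preserve, so the caller can
    skip the per-column pass entirely.
    """
    if isinstance(dtype_arg, dict):
        # Per-column request: columns not listed fall back to None.
        lookup = defaultdict(lambda: None)
        lookup.update(dtype_arg)
    elif dtype_arg in PRESERVED:
        # A single preserved dtype applies to every column.
        lookup = defaultdict(lambda: dtype_arg)
    else:
        return None

    # Only preserved dtypes are forced; anything else becomes None.
    return {
        name: lookup[name] if lookup[name] in PRESERVED else None
        for name in column_names
    }


if __name__ == "__main__":
    print(resolve_dtypes({"a": "object", "b": "int64"}, ["a", "b", "c"]))
    print(resolve_dtypes("object", ["a", "b"]))
    print(resolve_dtypes("int64", ["a", "b"]))
```

Returning `None` early mirrors the request in the review thread to avoid looping over the columns unless a dict or a preserved dtype was passed.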
Review comment: These mappers don't work; arrow supports type -> type, not column -> type.
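On the `hasattr(self, "orig_options")` thread earlier in the diff: the reply says the check is needed because the reader can be subclassed. The sketch below shows one way that situation can arise, assuming a subclass that overrides `__init__` without calling the parent. `BaseReader`, `CustomReader`, and `requested_dtype` are hypothetical names for illustration, not pandas classes, and `getattr` with a default is shown only as an equivalent spelling of the same defensive check, not as what the PR does.

```python
class BaseReader:
    """Hypothetical stand-in for a reader whose __init__ stores its options."""

    def __init__(self, **options):
        self.orig_options = options

    def requested_dtype(self):
        # getattr with a default tolerates instances that never ran
        # BaseReader.__init__, which is what the hasattr check guards against.
        return getattr(self, "orig_options", {}).get("dtype")


class CustomReader(BaseReader):
    def __init__(self):
        # Deliberately skips super().__init__(), so orig_options is never set.
        self.source = "custom"


if __name__ == "__main__":
    print(BaseReader(dtype="object").requested_dtype())
    print(CustomReader().requested_dtype())
```

Either spelling keeps the base class working for such subclasses; neither makes the contract explicit, which is the concern raised in the thread.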