
BUG: Inconsistent dtype behavior between read_csv and convert_dtypes with pyarrow backend for datetime #55422

koizumihiroo opened this issue Oct 6, 2023 · 5 comments
Labels
Arrow pyarrow functionality Bug Datetime Datetime data dtype IO CSV read_csv, to_csv

Comments

koizumihiroo commented Oct 6, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import io

data = """datetime,id
2023-12-31 12:34:56,a
2000-01-01 01:23:45,b
"""

df0 = pd.read_csv(io.StringIO(data), parse_dates=["datetime"])
df0.dtypes
# datetime    datetime64[ns]
# id                  object
# dtype: object

df0.convert_dtypes(dtype_backend="pyarrow").dtypes
# datetime    timestamp[ns][pyarrow]
# id                 string[pyarrow]
# dtype: object

df1 = pd.read_csv(io.StringIO(data), parse_dates=["datetime"], dtype_backend="pyarrow")
df1.dtypes
# datetime     datetime64[ns]
# id          string[pyarrow]
# dtype: object

Issue Description

I'm uncertain whether this is a bug or intentional design.

On pandas 2.1.1, when using read_csv(..., dtype_backend="pyarrow"), a column specified in parse_dates comes back as datetime64[ns], whereas I expect timestamp[ns][pyarrow], as shown by the result of .convert_dtypes(dtype_backend="pyarrow").

Thanks,

Expected Behavior

df1.dtypes
# datetime    timestamp[ns][pyarrow]
# id                 string[pyarrow]
# dtype: object

Installed Versions

$ docker run -it --rm python:3.11.6-slim bash
root@2eff93196ac4:/# pip install -U pip && pip install pandas==2.1.1 && pip install pyarrow==13.0.0
root@2eff93196ac4:/# python
>>> import pandas as pd
>>> pd.show_versions()
/usr/local/lib/python3.11/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils.
warnings.warn("Setuptools is replacing distutils.")

INSTALLED VERSIONS

commit : e86ed37
python : 3.11.6.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.49-linuxkit-pr
Version : #1 SMP Thu May 25 07:17:40 UTC 2023
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.1.1
numpy : 1.26.0
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 65.5.1
pip : 23.2.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader : None
bs4 : None
bottleneck : None
dataframe-api-compat: None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 13.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

@koizumihiroo koizumihiroo added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 6, 2023
@koizumihiroo koizumihiroo changed the title BUG: Inconsistent Dtype Behavior between read_csv and convert_dtypes with PyArrow Backend for datetime BUG: Inconsistent dtype behavior between read_csv and convert_dtypes with pyarrow backend for datetime Oct 6, 2023
paulreece (Contributor) commented

Hey there, thanks for reporting the issue.

As you can see here, convert_dtypes uses the pyarrow function from_numpy_dtype to perform the conversion:

def to_pyarrow_type(
    dtype: ArrowDtype | pa.DataType | Dtype | None,
) -> pa.DataType | None:
    """
    Convert dtype to a pyarrow type instance.
    """
    if isinstance(dtype, ArrowDtype):
        return dtype.pyarrow_dtype
    elif isinstance(dtype, pa.DataType):
        return dtype
    elif isinstance(dtype, DatetimeTZDtype):
        return pa.timestamp(dtype.unit, dtype.tz)
    elif dtype:
        try:
            # Accepts python types too
            # Doesn't handle all numpy types
            return pa.from_numpy_dtype(dtype)
        except pa.ArrowNotImplementedError:
            pass
    return None

See here about read_csv and pyarrow. I'm sure one of the maintainers can elaborate more about this and whether the behavior in io needs to be changed to be more like convert_dtypes.

The 'pyarrow' engine was added as an *experimental* engine, and some features
are unsupported, or may not work correctly, with this engine.

@paulreece paulreece added the IO CSV read_csv, to_csv label Oct 6, 2023
lithomas1 (Member) commented

I think this works correctly with the pyarrow engine (if you add engine="pyarrow", you get timestamp[s][pyarrow] as the dtype).

I think this is an issue with the C and Python engines, though.

@lithomas1 lithomas1 added Datetime Datetime data dtype Arrow pyarrow functionality and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 8, 2023
@lithomas1 lithomas1 self-assigned this Oct 8, 2023
koizumihiroo (Author) commented

Thanks. I was under the impression that read_csv used the pyarrow engine because read_parquet was using it in my environment.

If you add engine="pyarrow" you get timestamp[s][pyarrow] as the dtype.

Confirmed. As far as I'm concerned, this issue can be considered closed.

benjaminbauer commented
I do not agree with the OP: dtype_backend="pyarrow" should produce timestamp[s][pyarrow] regardless of the chosen engine. So this is a bug.

phalvesmbai commented
Using pandas 2.1.4, I'm seeing the same issue with read_sql, where the engine cannot be specified.
