BUG: undeprecate is_datetime64tz_dtype #55514

Closed
3 tasks done
Tikkoneus opened this issue Oct 14, 2023 · 7 comments
Labels
Bug Needs Triage Issue that has not been reviewed by a pandas team member

Comments

@Tikkoneus

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
df = pd.DataFrame([{"date": pd.Timestamp.now()}])
df.date = pd.to_datetime(df.date, utc=True)
pd.api.types.is_datetime64tz_dtype(df.date)  # True (deprecated function)
isinstance(df.date, pd.DatetimeTZDtype)      # False: df.date is a Series, not a dtype
pd.DatetimeTZDtype.is_dtype(df.date)         # True

Issue Description

is_datetime64tz_dtype is being deprecated but the suggested replacement function doesn't behave the same way. Looking at is_datetime64tz_dtype it can return True if isinstance(arr_or_dtype, DatetimeTZDtype) or if DatetimeTZDtype.is_dtype(arr_or_dtype). Thus the suggested use of isinstance(...) is not a complete replacement for the deprecated function.

The code above illustrates a trivial case where is_datetime64tz_dtype returns True as does DatetimeTZDtype.is_dtype() but isinstance(..., pd.DatetimeTZDtype) returns False.

Expected Behavior

The replacement for a deprecated function should behave exactly the same under all inputs.

Installed Versions

pd.show_versions()
c:\Python311\Lib\site-packages\_distutils_hack\__init__.py:33: UserWarning: Setuptools is replacing distutils.
warnings.warn("Setuptools is replacing distutils.")

INSTALLED VERSIONS

commit : e86ed37
python : 3.11.6.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.22621
machine : AMD64
processor : Intel64 Family 6 Model 166 Stepping 0, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United States.1252

pandas : 2.1.1
numpy : 1.26.0
pytz : 2023.3.post1
dateutil : 2.8.2
setuptools : 65.5.0
pip : 23.2.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader : None
bs4 : None
bottleneck : None
dataframe-api-compat: None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

@Tikkoneus Tikkoneus added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 14, 2023
@jbrockmendel
Member

The suggested replacement requires you to pass arr.dtype instead of arr. Among other things this is more performant than the deprecated usage.
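
To make the distinction concrete, here is a minimal sketch contrasting the check on the Series itself with the check on its `.dtype`. The variable names are illustrative and not taken from the thread.

```python
import pandas as pd

# Build a timezone-aware Series (UTC), similar to the reproducible example.
ser = pd.to_datetime(pd.Series([pd.Timestamp("2023-10-14 12:00")]), utc=True)

# A Series is never an instance of a dtype class, so this is False.
on_series = isinstance(ser, pd.DatetimeTZDtype)

# The dtype object attached to the Series is what the check expects.
on_dtype = isinstance(ser.dtype, pd.DatetimeTZDtype)

print(on_series, on_dtype)
```

The only change relative to the deprecated call is passing `ser.dtype` rather than `ser`.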

@Tikkoneus
Author

Tikkoneus commented Oct 14, 2023

Oooh, thanks! I do see that now that I look at the deprecation warning, but since the old function took arr or dtype it wasn't clear to me that I had to change the argument in addition to the function. Can the deprecation warning be changed to something like:

is_datetime64tz_dtype is deprecated and will be removed in a future version.
Instead use `isinstance(dtype, pd.DatetimeTZDtype)` for dtypes or `isinstance(arr.dtype, pd.DatetimeTZDtype)` for arrays.

Alternatively, could we add a new pd type checker which checks both types? This would be a more direct replacement to the older function. As it stands removing the old function will require everyone to put in both checks (arr and dtype) so it seems to me to make more sense to collect both checks in one place. Something like:

def is_arr_or_instance(d, t):  # proposed to live in pandas as pd.is_arr_or_instance
    return isinstance(d, t) or (isinstance(d, pd.Series) and isinstance(d.dtype, t))

@jbrockmendel
Member

will require everyone to put in both checks (arr or dtype)

in most cases you should be able to know which type of object you have and use a single appropriate check. Do you have a particular use case where this is burdensome?

If you really need is_arr_or_instance, you've just written it; why would it need to live in pandas?
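
As a rough illustration of using a single appropriate check per object type, the sketch below handles a `pd.Timestamp` and a datetime-like `pd.Series` in separate branches. `to_utc` is a hypothetical name, and using `.dt.tz_localize` / `.dt.tz_convert` for the Series values is an assumption about the intended behaviour (the Series methods of the same name without `.dt` act on the index).

```python
import pandas as pd

def to_utc(obj):
    """Hypothetical helper: return obj expressed in UTC."""
    if isinstance(obj, pd.Timestamp):
        # A scalar carries its timezone on the object itself.
        aware = obj.tzinfo is not None
        return obj.tz_convert("UTC") if aware else obj.tz_localize("UTC")
    if isinstance(obj, pd.Series):
        # Assumes a datetime-like Series; the timezone lives on the dtype.
        aware = isinstance(obj.dtype, pd.DatetimeTZDtype)
        return obj.dt.tz_convert("UTC") if aware else obj.dt.tz_localize("UTC")
    raise TypeError(f"unsupported type: {type(obj).__name__}")

naive_ts = pd.Timestamp("2023-10-14 12:00")
naive_ser = pd.Series([naive_ts])

utc_ts = to_utc(naive_ts)
utc_ser = to_utc(naive_ser)
```

Each branch knows what kind of object it holds, so no combined "array or dtype" check is needed.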

@Tikkoneus
Author

There are over 4k uses of is_datetime64tz_dtype which don't explicitly call .dtype. That search isn't 100% accurate (there may be cases where .dtype is called on a previous line, etc.), but it also only covers code on GitHub and there's lots of code (my own, for example) which uses pandas but isn't on GitHub. I think this shows that because the previous function took either arrs or dtypes people are, indeed, using it that way. You're absolutely right I can use that little helper function in my own code, but to me it seems to make sense to officially include it in pandas so that it can be correctly/efficiently implemented for everyone.

For example, I don't know pandas well enough to know what types constitute arrays. Would hasattr('dtype') be better or worse than the above isinstance(pd.Series)?

def is_arr_or_instance(d, t):  # proposed to live in pandas as pd.is_arr_or_instance
    return isinstance(d, t) or (hasattr(d, 'dtype') and isinstance(d.dtype, t))

As for my specific example, there are a handful of functions shared between pd.Series and pd.Timestamp, such as .tz_convert() and .tz_localize(). It made sense for me to have a single processing function which accepted both and performed the same operation. Thus I have a function which can accept either an array or a dtype and it passes that to is_datetime64tz_dtype. I can change my own code easily enough, but again it seems to me personally to be a benefit to provide a correctly implemented direct replacement for the original.
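
For code that really does receive either kind of object, a user-side helper along the lines discussed in this thread might look like the following. This is only a sketch: `dtype_is_instance` is a hypothetical name, not a pandas function, and treating anything with a `.dtype` attribute as array-like is an assumption.

```python
import pandas as pd

def dtype_is_instance(obj, dtype_cls):
    """Hypothetical helper: True if obj is, or carries, a dtype of dtype_cls."""
    if isinstance(obj, dtype_cls):
        # obj is already a dtype instance of the requested class.
        return True
    # Otherwise fall back to the dtype attached to obj, if there is one.
    return hasattr(obj, "dtype") and isinstance(obj.dtype, dtype_cls)

aware = pd.to_datetime(pd.Series(["2023-10-14 12:00"]), utc=True)
naive = pd.Series([pd.Timestamp("2023-10-14 12:00")])

print(dtype_is_instance(aware, pd.DatetimeTZDtype))        # Series input
print(dtype_is_instance(aware.dtype, pd.DatetimeTZDtype))  # dtype input
print(dtype_is_instance(naive, pd.DatetimeTZDtype))        # tz-naive Series
```

The sketch only looks at instances and `.dtype` attributes, so it makes no attempt to interpret other kinds of input.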

@jbrockmendel
Member

For example, I don't know pandas well enough to know what types constitute arrays. Would hasattr('dtype') be better or worse than the above isinstance(pd.Series)?

The fact that this is ambiguous is exactly why the non-ambiguous check is better.

correctly/efficiently implemented for everyone.

A big part of the reason it was deprecated was that it was very inefficient. Checks like these happen a lot and their poor performance adds up.

@Tikkoneus
Author

Fair enough! Well, thank you very much for your clarifications, and hopefully if anyone else bumps into this issue they'll at least see this thread.

@mroeschke
Member

Thanks for the suggestion, but per the discussion the original proposal is unlikely to happen, so closing.
