Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: pandas.Timestamp.round infer correct timezone around DST #37592

Open
3 tasks done
constantinevitt opened this issue Nov 2, 2020 · 10 comments
Open
3 tasks done

ENH: pandas.Timestamp.round infer correct timezone around DST #37592

constantinevitt opened this issue Nov 2, 2020 · 10 comments
Assignees
Labels
Enhancement Timezones Timezone data dtype

Comments

@constantinevitt
Copy link

constantinevitt commented Nov 2, 2020

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

import pandas as pd
from datetime import timedelta as td

test_ts = pd.Timestamp('2020-11-02 05:00:01.167', tz='UTC')
lt_ts = test_ts.tz_convert('America/New_York')
lt_ts
# Timestamp('2020-11-02 00:00:01.167000-0500', tz='America/New_York')
lt_ts -= td(days=1)
lt_ts
# Timestamp('2020-11-01 01:00:01.167000-0400', tz='America/New_York')
lt_ts.round(freq='H')



Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "pandas/_libs/tslibs/timestamps.pyx", line 1145, in pandas._libs.tslibs.timestamps.Timestamp.round
  File "pandas/_libs/tslibs/timestamps.pyx", line 1099, in pandas._libs.tslibs.timestamps.Timestamp._round
  File "pandas/_libs/tslibs/timestamps.pyx", line 1317, in pandas._libs.tslibs.timestamps.Timestamp.tz_localize
  File "pandas/_libs/tslibs/tzconversion.pyx", line 63, in pandas._libs.tslibs.tzconversion.tz_localize_to_utc_single
  File "pandas/_libs/tslibs/tzconversion.pyx", line 275, in pandas._libs.tslibs.tzconversion.tz_localize_to_utc
pytz.exceptions.AmbiguousTimeError: Cannot infer dst time from 2020-11-01 01:00:00, try using the 'ambiguous' argument

Problem description

pandas.Timestamp.round method crashes.

Expected Output

Timestamp('2020-11-01 01:00:00.0000-0400', tz='America/New_York')

Output of pd.show_versions()

pd.show_versions()
INSTALLED VERSIONS

commit : 67a3d42
python : 3.8.5.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-52-generic
Version : #57-Ubuntu SMP Thu Oct 15 10:57:00 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8
pandas : 1.1.4
numpy : 1.19.2
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.3
setuptools : 50.3.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.3 (dt dec pq3 ext lo64)
jinja2 : 2.11.2
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.5.2
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

@constantinevitt constantinevitt added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Nov 2, 2020
@constantinevitt constantinevitt changed the title BUG: pandas.Timestamp.round crash BUG: pandas.Timestamp.round DST related crash Nov 2, 2020
@mroeschke
Copy link
Member

This is expected as the ambiguous keyword is 'raise' by default. There are options to determine which DST behavior you would like: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timestamp.round.html

@jorisvandenbossche
Copy link
Member

@mroeschke I don't fully understand why the error is expected (a colleague also ran into this, and I couldn't explain why it would be ambigious).

So I certainly understand that the tz-naive rounded timestamp is ambiguous:

In [39]: ts = pd.Timestamp('2020-11-01 05:00:01.167', tz='UTC').tz_convert("America/New_York")

In [40]: ts
Out[40]: Timestamp('2020-11-01 01:00:01.167000-0400', tz='America/New_York')

In [42]: ts.tz_localize(None).round('H')
Out[42]: Timestamp('2020-11-01 01:00:00')

In [43]: ts.tz_localize(None).round('H').tz_localize("America/New_York")
...
AmbiguousTimeError: Cannot infer dst time from 2020-11-01 01:00:00, try using the 'ambiguous' argument

So the rounded timestamp of "2020-11-01 01:00:00" without timezone is clearly ambiguous.
But in this case our original timestamp had a timezone, and thus "in theory", you can know which hour should be the correct "closest" hour?
To illustrate, the data around that time with hourly frequency:

In [44]: pd.date_range("2020-11-01", freq="1H", periods=5, tz="America/New_York")
Out[44]: 
DatetimeIndex(['2020-11-01 00:00:00-04:00', '2020-11-01 01:00:00-04:00',
               '2020-11-01 01:00:00-05:00', '2020-11-01 02:00:00-05:00',
               '2020-11-01 03:00:00-05:00'],
              dtype='datetime64[ns, America/New_York]', freq='H')

And of course there are two hours with "01:00:00", but since the original timestamp was "01:00:01.167000-0400", we could know that the correct result is "01:00:00-04:00" and not "01:00:00-05:00" ?

But so the reason that round raises that error, is because we first convert to local naive timestamps (wall time), do the rounding, and then localize back to the original timezone (and then giving this error). So the question is then probably: what's the reason that it is needed to convert to wall time to round? Rounding the underlying UTC values gives incorrect results?

@constantinevitt
Copy link
Author

@mroeschke
Why is throwing an error the expected behavior here? There is absolutely no ambiguity about the timestamp I'm supplying to the method. The ambiguity is introduced during the execution of method itself, hence it is a problem. Please re-open this issue.

@mroeschke
Copy link
Member

@jorisvandenbossche So I made an assertion that we probably couldn't convert to UTC first before rounding in #22560 (comment), and I believe the reason is that some UTC offsets are not one full hour.

e.g. for UTC +10:30 (https://en.wikipedia.org/wiki/List_of_UTC_time_offsets#UTC+10:30,_K%E2%80%A0) and a round('H') operation

  • Local Time: 10:29
  • Convert to UTC: 23:59
  • Round: 00:00
  • Convert to local time: 10:30 (I think the expected result would be 10:00)

And generally the recommendation to round via wall time was suggested in #22560 (comment), though that comment does also talk about rounding via absolute time when the frequency is an hour or less (like in this example)

I do agree with both of your sentiments that choosing the -04:00 is the more intuitive result in this case, but I'm not sure the best way to support inferring this if we've chosen to round based on wall time.

@jorisvandenbossche
Copy link
Member

Thanks for those pointers! Certainly understand now that it is indeed not possible to generally use the underlying UTC values but requiring the conversion to wall time.

I have the feeling it would still be nice to support some common cases where it is possible though. Because in general timezones with hourly offsets are more common, and I think rounding to hour / minute / second also seem typical usecases of the round method. And for those common cases, it seems annoying it can't "simply do the right thing".
I don't know how easy it would be to reliably detect those cases, though (specifically for the timezone). Because I think some timezones also have different offsets throughout their lifetime .. ?

@mroeschke
Copy link
Member

Yeah time zones are definitely subject to change so detecting when to correctly round via wall time with hour or less frequencies might get tricky.

I'll reopen to garner more feedback as a new enhancement proposal to round with wall-time-and-timezone in these cases, though I'll state my -0 position as the logic will get more complex and explicit > implicit though at the cost of convenience.

@mroeschke mroeschke reopened this Nov 5, 2020
@mroeschke mroeschke changed the title BUG: pandas.Timestamp.round DST related crash ENH: pandas.Timestamp.round infer correct timezone around DST Nov 5, 2020
@mroeschke mroeschke added Enhancement Timezones Timezone data dtype and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Nov 5, 2020
@lafrech
Copy link

lafrech commented Nov 2, 2022

Stumbled upon this today.

Reading the comments here, I now understand what happens inside (conversion to naive local -> round -> localization to aware fails), but I was indeed surprised that pandas would treat my aware datetime as ambiguous, as said in #37592 (comment).

Looks like I get it working with

timestamp.round(freq, ambiguous=timestamp.fold)

Are there cases where this would fail? If not, is there a downside to doing it internally?

@MarcoGorelli
Copy link
Member

MarcoGorelli commented Mar 18, 2023

I don't think it's possible to do this in such a way as to avoid surprises completely. I can think of two ways to do the truncation:

A. unlocalize, truncate, localize
B. calculate the amount to truncate by using the unlocalized version, then subtract

Let's consider the examples:

  1. '2021-11-07 01:30:00 CST' truncated by '2h'
  2. '2021-11-07 01:30:00 CST' truncated by '1h'

Using logic A. in example 1. would look like:

  • '2021-11-07 01:30:00 CST' -> '2021-11-07 01:30:00' -> '2021-11-07 00:00:00' -> '2021-11-07 00:00:00 CDT'

and in example 2.:

  • '2021-11-07 01:30:00 CST' -> '2021-11-07 01:30:00' -> '2021-11-07 01:00:00' -> "ambiguous error"

Using logic B. in example 1. would look like:

  • '2021-11-07 01:30:00 CST' -> '2021-11-07 01:30:00 CST' - ('2021-11-07 01:30:00' % '2h') -> '2021-11-07 01:00:00 CDT'

and in example 2.:

  • '2021-11-07 01:30:00 CST' -> '2021-11-07 01:30:00 CST' - ('2021-11-07 01:30:00' % '1h') -> '2021-11-07 01:00:00 CST'

To summarise:

  • logic A example 1: looks wrong
  • logic A example 2: looks right
  • logic B example 1: looks right
  • logic B example 2: looks wrong

So, not really sure where to go from here, other than to suggest users to use the 'ambiguous' argument

@lafrech
Copy link

lafrech commented Mar 23, 2023

Looks like I get it working with

timestamp.round(freq, ambiguous=timestamp.fold)

Are there cases where this would fail? If not, is there a downside to doing it internally?

This doesn't work with Pandas 2.0. I changed to

timestamp.round(freq, ambiguous=not bool(timestamp.fold))

Dates are only ambiguous on fall DST where

  • 0 => early => DST => True
  • 1=> late => not DST => False

@MarcoGorelli
Copy link
Member

Revisiting this several months later - I don't think I see downsides to doing

timestamp.round(freq, ambiguous=timestamp.fold)

, as suggested by @lafrech . Maybe we could just do that, and document it?

@MarcoGorelli MarcoGorelli self-assigned this Nov 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Timezones Timezone data dtype
Projects
None yet
Development

No branches or pull requests

5 participants