Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

added option to optionally localize retreived timeseries to UTC #53

Open
wants to merge 1 commit into
base: develop
Choose a base branch
from

Conversation

jkupka
Copy link
Contributor

@jkupka jkupka commented Jul 15, 2024

Timestamps are stored in UTC format in MongoDB. Before this PR timeseries timestamps retreived from the database had no localisation which resulted in shifted timestamps in applications because they had no information about the timezone.
This PR adds an option to localize timestamps as UTC to address this issue.

@jkupka jkupka requested review from jthurner and julffers July 15, 2024 14:06
@jthurner
Copy link
Collaborator

jthurner commented Jul 22, 2024

This change will just optionally give you timezone-aware datetimes with the timezone set to UTC, and only on that endpoint. What mongodb returns will always be UTC, is there any advantage in ever returning a naive datetime?

Would it not be better to 1) always return timezone-aware datetimes (instantiate MongoClient with tz_aware=True) and 2) make the returned timezone configurable, defaulting to UTC (instantiate MongoClient with tzinfo=datetime.timezone.UTC)).

The real trouble then comes from storing naive datetimes which are not in UTC, but that seems to be less easy to control on the database level (pymongo just stores them as UTC).

https://pymongo.readthedocs.io/en/stable/examples/datetimes.html

@lthurner
Copy link
Contributor

lthurner commented Jul 29, 2024

If I understand the pymongo documentation correctly, on the database level:

  • timestamps are saved without timezone by default
  • they can be saved with the timezone information from the pandas datetime object with tz_aware=True
  • but: even with tz_aware=True, the dates are always returned without timezone, but converted to UTC

Therefore, I can see three options:

1. store timezone in database timestamps

  • store timezone aware timestamps in database
  • timestamps are converted to UTC when loading from database (pymongo)
  • convert to any timezone the user wants to when reading timeseries (pandahub)

Examples:

  • a timeseries with timestamp 11:00 / UTC+1 is saved in the database. It is returned as 10:00 / UTC. The user can retrieve the timeseries in any timezone he wants, but the information about which timezone was defined on storage (UTC+1) is lost.
  • a timeseries with timestamp 11:00 that is meant as UTC+1 is saved to the database, but the UTC+1 timezone is not set in the timestamp will be returned as 11:00 UTC. If the user specifies UTC+1 as output, it will be converted to 12:00 UTC+1.

Pro / Con:
+ user can view the timeseries data in any timezone, conversion is handled internally
- timezone is not stored in database, it is up to the user to define in which timezone the data is supposed to be given.
- timezone data has to be correctly set when saving the timeseries, otherwise the values will be interpreted wrong

2. set timezone on loading

  • store timestamps without timezone in database
  • interpret the timezone the user wants to when reading timeseries (pandahub)

Example:
A timeseries with timestamp 11:00 is saved in the database (timezone can not be defined on storage). It is always returned as 11:00. If the user defines the timezone as UTC+1, it will be returned as 11:00 / UTC+1, if he defines UTC it will be returned as 11:00 / UTC.

Pro / Con:
+ user can view the timeseries data in any timezone
- there is no conversion of timezones, only interpreting existing data as timezone A or B
- timezone is not stored in database, it is up to the user to define in which timezone the data is supposed to be interpreted as

3. store timezone in metadata

  • store timestamps without timezone in database
  • add timezone in metadata of timeseries
  • set the timezon of the pandas dataframe to the timezone saved in the metadata (if the parameter does not exist, default to UTC+1)

Example:
A timeseries with timestamp 11:00 is saved in the database (timezone of the timestamp itself can not be defined on storage). The timeseries receives UTC+1 as a timezone, which is saved in the timeseries metadata (together with, min_value, max_value, ts_format etc.). On loading the timeseries, the timezone in pandas is set to the timeseries, so that the timestamp is returned as 11:00 / UTC +1

Pro / Con:
- user can view the timeseries data only in the timezone it is stored in
+ timezone information is stored in the database
+ most "what you see is what you get" solution without any magical timezone conversions in the database
+ backwards compatible, since timestamps do not have to be timezone aware

After writing it down like that, Option 3. would be my preferred solution.

@jthurner
Copy link
Collaborator

These 3 options don't feel quite right to be honest.

Conceptually:

  • a timestamp is always tied to a timezone, implicitly or explicitly
  • UTC[-0] is a timezone, not the absence of a timezone
  • mongodb assumes time objects are always in UTC
  • pymongo converts datetimes to UTC before sending them to mongodb - if the timezone information is missing it assumes UTC
  • the tz_aware option only affects retrieval, not entry - you will get a timezone-aware datetime as UTC (no conversion involved)
  • the tzinfo option will additionally convert to the given timezone automatically

Therefore, I would argue that:

  • saving a naive datetime object meant to be in local time through pymongo is a user error, and not recoverable by logic
  • the only sane way to attach timezone information is by using tz aware datetime objects
  • tz_aware should be the pandahub-wide default as it communicates the timezone and allows for conversion if required
  • the timezone in which the data originated in represents metadata which may be stored separately

Option 3 breaks with the pymongo convention on how timestamps are saved, and makes a simple timestamp ambigious if you do not know where the metadata is found. You would need to scrub timezone information from incoming timestamps without converting them (otherwise pymongo would convert to UTC on insert). It also adds code to pandahub doing things already implemented in pymongo.

I do not quite understand the usecase for retrieving timeseries in the timezone they originated in - if you need to work with them you probably want UTC, if you just need to display them you probably want the local time zone of the user. But here is a POC implementing that requirement:

from datetime import datetime
from zoneinfo import ZoneInfo

ph = PandaHub()
ts = datetime.now(ZoneInfo("Europe/Berlin"))
ts_id = ph.test_save_timestamp(ts)

## default timezone (UTC)
ph.test_get_timestamp(ts_id)
#>  datetime.datetime(2024, 7, 31, 12, 32, 15, 127000, tzinfo=zoneinfo.ZoneInfo(key='UTC'))

## timezone from metadata (when saved)
ph.test_get_timestamp(ts_id, "ORIGIN")
#>  datetime.datetime(2024, 7, 31, 14, 32, 15, 127000, tzinfo=zoneinfo.ZoneInfo(key='Europe/Berlin'))

## specific timezone
ph.test_get_timestamp(ts_id, "America/New_York")
#> datetime.datetime(2024, 7, 31, 8, 32, 15, 127000, tzinfo=zoneinfo.ZoneInfo(key='America/New_York'))

## change default timezone on PandaHub initialization
ph = PandaHub(tzinfo="Europe/Berlin")

## default timezone (Europe/Berlin)
ph.test_get_timestamp(ts_id)
#> datetime.datetime(2024, 7, 31, 14, 32, 15, 127000, tzinfo=zoneinfo.ZoneInfo(key='Europe/Berlin'))

The changes also return timezone-aware datetimes everywhere (defaulting to UTC) , which should be the correct thing to do but may require some migrations for existing collections if they contain timestamps saved as naive objects meant to be in local time.

@lthurner
Copy link
Contributor

lthurner commented Jul 31, 2024

Yes I think there where still some misunderstandings in my logic about the way pymongo handles the timestamps. I think I understand now and I agree with your points. There probably is a good reasons why it is recommended to always store in UTC and we should keep it that way as well.

The only thing that is still unclear to me: what is the value of tz_aware? If all timestamps are stored in the MongoDB as UTC, then the status quo is that they are returned as naive timestamps. pandas seems to implicitly assume naive timestamps are UTC timestamps by default. So getting a timeseries with naive timestamps or UTC timestamps seems to be kind of the same? That would be the only difference with tz_aware, right?

@jthurner
Copy link
Collaborator

pandas seems to implicitly assume naive timestamps are UTC timestamps by default. So getting a timeseries with naive timestamps or UTC timestamps seems to be kind of the same? That would be the only difference with tz_aware, right?

Yes, if pandas defaults to UTC for naive datetimes, it should make no difference in this case. But you need tz_aware=True to be able to use tzinfo, and it removes ambiguity, for example when accessing a datetime directly from the database or when not converting it to a pandas series (e.g. saving last_accessed timestamps on a project).

jthurner
jthurner previously approved these changes Nov 6, 2024
Copy link
Collaborator

@jthurner jthurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

testing PR approval

@jthurner jthurner dismissed their stale review November 6, 2024 15:09

just testing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants