Implement default time resolution for CF time encoding/decoding #9580
Conversation
There are some issues with pandas.Datetime having the … Another minor issue is for the older py3.10-based envs, where in one case an error isn't raised any more and in the other cftime processing is requested (by raising OutOfBoundsError) but cftime isn't installed.
Generally, I've avoided this kind of "global option setting" in my code; the results are too surprising.
@hmaarrfk Currently "ns" is the default. This PR is meant to allow enabling other resolutions, without breaking backwards compatibility for the default "ns" resolution.
But this isn't quite correct. The "with" context managers are designed to make you feel "safe" when using them: they set, and then unset, variables. Consider workers using it to "temporarily change the resolution" during their computation. Now you spawn them. Each will enter a context manager. Will you return to the original global state when they all exit?
I see, thanks for highlighting this. For that use case you would need to set the options globally (without the context manager). See https://docs.xarray.dev/en/stable/generated/xarray.set_options.html.
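For illustration, a minimal sketch of the two usage patterns discussed here; note the `time_resolution` option only exists with this PR, it is not part of released xarray:

```python
import xarray as xr

# Context-manager form: the option is restored when the block exits.
# Note: while active, this is process-global state, not thread-local.
with xr.set_options(time_resolution="us"):
    ...  # decoding in here would use the microsecond default resolution

# Global form: the option stays set for the rest of the session / in workers.
xr.set_options(time_resolution="us")
```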
It just gets kinda annoying to explain why libraries should not be using this new option. In either case, I typically avoid it altogether and specify things via kwargs.
I think the ability to set defaults is "interesting", but I don't think I'll be a user of it. That's all I am trying to say.
Thanks for sharing your concerns here, @hmaarrfk. The proposed changes are more or less just for testing out other resolutions, to have a playground for use cases and workflows to finally resolve #7493. So, finally, after resolving all issues, adapting tests, etc., this set_options should not be needed any more. Does that make sense?
This might in fact be helpful in a test suite.
Wow thanks for digging into this Kai! This approach does seem like it could be a helpful bridge toward a more fully-formed solution. I haven't had a chance to think too deeply about the bigger picture here yet, but I left a couple initial questions.
> This already works for encoding/decoding with different default resolutions (but, note the datetime64 issue for "0001-01-01T00:00:00.000000000")
Right, in the "ns" case, since "0001-01-01T00:00:00.000000000" cannot be represented with a datetime64[ns] value, it has already experienced overflow before xarray sees it.
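A quick illustration of that overflow (a sketch, not code from the PR; exact behaviour may vary by numpy version):

```python
import numpy as np

# datetime64[ns] only spans roughly 1678-09-21 .. 2262-04-11.
t = np.datetime64("0001-01-01", "s")   # representable at second resolution
print(t)                               # 0001-01-01T00:00:00

# Casting to nanoseconds overflows; on typical numpy versions the cast
# wraps silently rather than raising, yielding a bogus date.
print(t.astype("datetime64[ns]"))
```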
```python
else np.datetime_data(ref_date.asm8)[0]
)
ref_date_delta = np.timedelta64(1, ref_date_unit).astype("timedelta64[ns]")
time_delta = np.timedelta64(1, time_units).astype("timedelta64[ns]")
```
Have you considered the case where flat_num_dates has a floating point dtype? In that situation one could end up needing finer precision than the time_units specified in the encoding.

But maybe another question at this stage is whether we would want the encoding information (reference date or time units) to affect the datetime precision we decode to—currently it appears one could end up with non-default precision if the time units in the file specified something finer than the user-specified precision via set_options, which could be surprising.

In the long term, however, I agree that it would be nice to have some way of inferring this automatically.
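To illustrate the floating-point point (a sketch with made-up values, not from the PR):

```python
import numpy as np

# Offsets [0.5, 1.0] with "days since 1970-01-01": the 0.5-day offset needs
# finer-than-day precision even though the encoded unit is days; decoding
# at second resolution keeps it exact.
num_dates = np.array([0.5, 1.0])
ref = np.datetime64("1970-01-01", "s")
decoded = ref + (num_dates * 86400).astype("timedelta64[s]")
print(decoded)  # ['1970-01-01T12:00:00' '1970-01-02T00:00:00']
```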
From my perspective the decoding process should just do what it is supposed to do, which is decode into the designated datetimes as given by time_unit since reference_date. As our final dtype (np.datetime64) is based on the unix epoch (1970-01-01, with different possible resolutions), we need to find the fitting resolution; a small sketch of this rule follows the list:

- days since 1970-01-01 00:00:00 -> "s" (reference_date takes precedence over time_unit since it is in second resolution)
- seconds since 1970-01-01 -> "s" (time_unit takes precedence over reference_date)
- milliseconds since 1970-01-01 00:00:00.000000001 -> "ns" (reference_date takes precedence over time_unit)
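A minimal sketch of that rule (hypothetical helper, not xarray code): take the finer of the time unit's and the reference date's resolutions, with anything coarser than seconds falling back to "s":

```python
# Hypothetical helper (illustration only): choose the numpy resolution that
# can represent both the CF time unit and the reference date exactly.
# Units coarser than "s" (days, hours, minutes) fall back to "s", the
# coarsest resolution considered here.
_SUPPORTED = ["s", "ms", "us", "ns"]  # coarse -> fine

def fitting_resolution(time_unit: str, ref_date_unit: str) -> str:
    def rank(unit: str) -> int:
        return _SUPPORTED.index(unit) if unit in _SUPPORTED else 0
    return _SUPPORTED[max(rank(time_unit), rank(ref_date_unit))]

print(fitting_resolution("D", "s"))    # days since ... 00:00:00          -> "s"
print(fitting_resolution("s", "D"))    # seconds since 1970-01-01         -> "s"
print(fitting_resolution("ms", "ns"))  # milliseconds since ....000000001 -> "ns"
```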
For now we have a third possible element to consider, the default resolution, which currently is "ns" (but might now be changed to something else). The currently proposed workflow with the default resolution should finally not be needed any more, once all issues have been ironed out. Or we use something like None to specify automatic inference.
> But maybe another question at this stage is whether we would want the encoding information (reference date or time units) to affect the datetime precision we decode to—currently it appears one could end up with non-default precision if the time units in the file specified something finer than the user-specified precision via set_options, which could be surprising.
Maybe the idea of a default resolution should only come into consideration in the encoding process, when the user does not specify any units?
Or no default resolution at all and try to infer everything (if that's possible at all)?
A good example showing the ambiguities between on-disk and in-memory is something like:

- on-disk: [0, 1, 2, 3] with days since 1970-01-01 12:00:00 will translate to:
  - in-memory: [12, 36, 60, 84] with datetime64[h]
  - in-memory default "s": [43200, 129600, 216000, 302400] with datetime64[s]

A default resolution of seconds ("s") would be no problem for the above, but for this example:

- on-disk: [0, 1, 2, 3] with seconds since 1970-01-01 00:00:00.001 will translate to:
  - in-memory: [1, 1001, 2001, 3001] with datetime64[ms]
  - in-memory default "s": [0, 1, 2, 3] with datetime64[s]

we would run into issues. The more I think about this, the more I think inferring as much as possible is the way to go, so as not to round/cut off or otherwise get time mismatches between on-disk and in-memory.
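To make the second example concrete, a small numpy sketch with the same numbers (illustration only, not code from the PR):

```python
import numpy as np

# "seconds since 1970-01-01 00:00:00.001" needs millisecond resolution;
# casting the decoded values to datetime64[s] silently drops the 1 ms
# reference offset.
num = np.array([0, 1, 2, 3])
ref = np.datetime64("1970-01-01T00:00:00.001", "ms")

decoded_ms = ref + num * np.timedelta64(1, "s")
print(decoded_ms.astype("int64"))   # [   1 1001 2001 3001]  (ms since epoch)

decoded_s = decoded_ms.astype("datetime64[s]")
print(decoded_s.astype("int64"))    # [0 1 2 3]  (s since epoch, offset lost)
```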
Indeed I agree long-term some kind of inference would be best—ideally we do not lose precision based on what was written on disk. It is just somewhat complicated (in the case of floating point times, one would need to take into account the values themselves in addition to the metadata), but we can try to hash out now what that could look like (or punt and just say floats are always decoded to a user-configurable precision, defaulting to nanosecond).
For context, I interpreted this PR as temporarily aiming to maintain a consistent NumPy datetime precision throughout xarray, while at least allowing for some flexibility (hence the modifications to the casting logic in variable.py to also respect the time_resolution set in set_options), so it was surprising to see one place where datetime precisions other than that specified in set_options could appear.

In your eyes, at what point do we remove the casting logic and fully allow mixed-precision datetime values? Is this merely an issue with the tests at this stage?
@spencerkclark Sorry, went down that rabbit hole the last few days. I think I'm well on the way to having things sorted out. There are different locations in the code and tests where the gears mesh together.
The overall decoding and encoding logic seems to work in my local feature branch. I'll update this PR when I've fixed all the bits and pieces. Need to dig further now, back into the hole 🐰
No worries—if at any point you feel it would be helpful to discuss anything in a more interactive way we could try to find a time to meet. Happy to keep things asynchronous / open though too. Thanks again for your work on this!
xarray/coding/times.py
```python
fnd_min, fnd_max = flat_num_dates.min(), flat_num_dates.max()
min_delta = fnd_min * np.timedelta64(1, time_units)
max_delta = fnd_max * np.timedelta64(1, time_units)
```
A nice thing about pd.to_timedelta is that it will not truncate floating point inputs. Is the issue that prompted this change that there is no way currently to specify the precision of the timedelta that pd.to_timedelta returns (pandas-dev/pandas#49060), i.e. it will still always return a nanosecond-precision timedelta?
Yes, pd.to_timedelta will convert straight to ns-resolution (and this subsequently raises).
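For illustration (a small sketch, not from the PR; behaviour of current pandas releases):

```python
import pandas as pd

# pd.to_timedelta keeps fractional inputs exactly, but it always returns
# nanosecond-resolution timedeltas; there is currently no argument to
# request a coarser unit (pandas-dev/pandas#49060), so values outside the
# ~292-year nanosecond range overflow and raise.
td = pd.to_timedelta([0.5, 1.5], unit="s")
print(td)        # TimedeltaIndex(['0 days 00:00:00.500000', ...])
print(td.dtype)  # timedelta64[ns]
```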
Thanks for this -- I haven't thought through all the details, but in my case, I'd tend to set the default to ms upfront and be done with it -- I never need nanosecond precision, and I often need datetime-compatible encoding :-) A few thoughts:

r.e. floating point time encoding -- I really wish so many people wouldn't do this. But as a rule, if folks are using float, they are getting less precision than they need, and if they are using double they are getting more, and I seriously doubt ANYONE has carefully thought out what precision to use, so I wouldn't worry about preserving the precision. If anyone actually needs better than millisecond precision and is encoding with a floating point type, they are doing something wrong already.

r.e. threading -- using a context manager for this in multithreading is HARD. I would tend to recommend that folks set it globally. And in tests, if the tests are multi-threaded, that could get really ugly :-( However, it looks like the Python Decimal module may have solved this issue: https://docs.python.org/3/library/decimal.html#decimal.localcontext (if I understand those docs correctly, which I may not), so maybe we could make a thread-safe context manager?
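A minimal sketch of what such a thread/task-safe option could look like, using contextvars and in the spirit of decimal.localcontext; this is an illustration only, not xarray's implementation:

```python
import contextvars
from contextlib import contextmanager

# ContextVar values are isolated between threads and asyncio tasks, so
# concurrent workers entering and leaving this context manager do not
# stomp on each other's setting.
_time_resolution = contextvars.ContextVar("time_resolution", default="ns")

@contextmanager
def local_time_resolution(value: str):
    token = _time_resolution.set(value)
    try:
        yield
    finally:
        _time_resolution.reset(token)

with local_time_resolution("ms"):
    print(_time_resolution.get())  # "ms" inside the block
print(_time_resolution.get())      # back to "ns" afterwards
```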
I think we are in agreement—I'm mainly pointing out that it does happen in the wild, and in some circumstances it can be relatively sane (e.g. encoding in multiples of quarters of days, or tenths of seconds, etc.). For this PR, what I'm suggesting we consider is that we keep things simple and punt on any precision inference—e.g. if one uses …
Thanks! I think this is going in a great direction.
- whats-new.rst
- api.rst
This is a first attempt to get behind #7493 and #9154.
- add a time_resolution option to OPTIONS, to be set with xr.set_options(time_resolution="us") (defaulting to ns)
- rename nanosecond_precision_timestamp to default_precision_timestamp; this now returns the lowest needed resolution or the default resolution (if that is higher)
- adapt decode_datetime_with_pandas to accommodate non-nanosecond treatment; as pd.to_timedelta returns nanosecond resolution in any case, the processing had to be adapted for non-nanosecond timedeltas
Besides enabling the changed warnings, no tests have been changed so far; everything should work as expected with the default resolution of ns.

This already works for encoding/decoding with different default resolutions (but note the datetime64 issue for "0001-01-01T00:00:00.000000000").
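A hedged sketch of the kind of usage this enables (the time_resolution option exists only with this PR, not in released xarray; the commented dtypes are what the PR aims for):

```python
import numpy as np
import xarray as xr

# A minimal CF-encoded time variable with made-up values.
ds = xr.Dataset({"time": ("time", np.arange(4), {"units": "days since 2000-01-01"})})

with xr.set_options(time_resolution="us"):
    decoded = xr.decode_cf(ds)

print(decoded.time.dtype)  # datetime64[us] with this PR (datetime64[ns] today)
```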
Enabling the proposed changes would allow us to (at least) get feedback on which workflows already work when users choose another default resolution and which do not. It would also help us in investigating solutions to fully resolve #7493. I've already tried to come up with a fully automagic version, but that didn't work out (yet), more or less because the current tests are not easy to transform to a different default resolution.
Anyway, it would be great to get any feedback here. Pinging some folks who might have interest, @spencerkclark, @ChrisBarker-NOAA, @rabernat, @khider, please include other interested parties, too.