
Implement default time resolution for CF time encoding/decoding #9580

Closed (wants to merge 6 commits)

Conversation

@kmuehlbauer (Contributor) commented Oct 4, 2024:

  • Closes #xxxx
  • Tests added
  • User visible changes (including notable bug fixes) are documented in whats-new.rst
  • New functions/methods are listed in api.rst

This is a first attempt to make progress on #7493 and #9154.

  • added a time_resolution option to OPTIONS, set via xr.set_options(time_resolution="us") (defaulting to "ns")
  • refactored nanosecond_precision_timestamp into default_precision_timestamp, which now returns the lowest needed resolution, or the default resolution if that is finer (see the sketch after this list)
  • refactored decode_datetime_with_pandas to accommodate non-nanosecond handling; since pd.to_timedelta returns nanosecond resolution in any case, the processing had to be adapted for non-nanosecond timedeltas
  • adapted some warnings
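For illustration, a rough sketch of what that promotion could look like (a hypothetical helper, not the actual implementation; assumes pandas >= 2.0 for Timestamp.as_unit):

```python
import numpy as np
import pandas as pd

# Resolution ordering: later entries are finer.
_ORDER = {"s": 0, "ms": 1, "us": 2, "ns": 3}

def default_precision_timestamp(value, default="ns") -> pd.Timestamp:
    """Return a Timestamp at the lowest needed resolution, promoted to
    the default resolution if the default is finer."""
    ts = pd.Timestamp(value)
    needed = np.datetime_data(ts.asm8)[0]  # e.g. "s", "us", "ns"
    unit = needed if _ORDER[needed] > _ORDER[default] else default
    return ts.as_unit(unit)
```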

Besides enabling the changed warnings, no tests have been changed so far; everything should work as expected with the default resolution of "ns".

This already works for encoding/decoding with different default resolutions (but note the datetime64 overflow issue for "0001-01-01T00:00:00.000000000" in the "ns" case):

```python
import numpy as np
import xarray as xr

for tu in ["s", "ms", "us", "ns"]:
    print("-----------------------------------------------------------------------------------")
    times = [
        np.datetime64("0001-01-01T00:00:00.000000000", tu),
        np.datetime64("1677-09-21T00:12:43.145224193", tu),
        np.datetime64("1970-01-01T00:00:00", tu),
        np.datetime64("NaT"),
    ]
    encoding = dict(dtype="int64", _FillValue=20)
    with xr.set_options(time_resolution=tu):
        var = xr.Variable(["time"], times, encoding=encoding)
        assert var.dtype == np.dtype(f"=M8[{tu}]")
        print(var)
        encoded_var = xr.conventions.encode_cf_variable(var)
        print(encoded_var)
        decoded_var = xr.conventions.decode_cf_variable("foo", encoded_var)
        assert decoded_var.dtype == np.dtype(f"=M8[{tu}]")
        print(decoded_var.load())
        print(var.encoding, decoded_var.encoding)
```
Output:

```text
-----------------------------------------------------------------------------------
<xarray.Variable (time: 4)> Size: 32B
array(['0001-01-01T00:00:00', '1677-09-21T00:12:43',
       '1970-01-01T00:00:00',                 'NaT'],
      dtype='datetime64[s]')
<xarray.Variable (time: 4)> Size: 32B
array([          0, 52912224763, 62135596800,          20])
Attributes:
    units:       seconds since 0001-01-01 00:00:00
    calendar:    proleptic_gregorian
    _FillValue:  20
<xarray.Variable (time: 4)> Size: 32B
array(['0001-01-01T00:00:00', '1677-09-21T00:12:43',
       '1970-01-01T00:00:00',                 'NaT'],
      dtype='datetime64[s]')
{'dtype': 'int64', '_FillValue': 20} {'_FillValue': np.int64(20), 'units': 'seconds since 0001-01-01 00:00:00', 'calendar': 'proleptic_gregorian', 'dtype': dtype('int64')}
-----------------------------------------------------------------------------------
<xarray.Variable (time: 4)> Size: 32B
array(['0001-01-01T00:00:00.000', '1677-09-21T00:12:43.145',
       '1970-01-01T00:00:00.000',                     'NaT'],
      dtype='datetime64[ms]')
<xarray.Variable (time: 4)> Size: 32B
array([             0, 52912224763145, 62135596800000,             20])
Attributes:
    units:       milliseconds since 0001-01-01 00:00:00
    calendar:    proleptic_gregorian
    _FillValue:  20
<xarray.Variable (time: 4)> Size: 32B
array(['0001-01-01T00:00:00.000', '1677-09-21T00:12:43.145',
       '1970-01-01T00:00:00.000',                     'NaT'],
      dtype='datetime64[ms]')
{'dtype': 'int64', '_FillValue': 20} {'_FillValue': np.int64(20), 'units': 'milliseconds since 0001-01-01 00:00:00', 'calendar': 'proleptic_gregorian', 'dtype': dtype('int64')}
-----------------------------------------------------------------------------------
<xarray.Variable (time: 4)> Size: 32B
array(['0001-01-01T00:00:00.000000', '1677-09-21T00:12:43.145224',
       '1970-01-01T00:00:00.000000',                        'NaT'],
      dtype='datetime64[us]')
<xarray.Variable (time: 4)> Size: 32B
array([                0, 52912224763145224, 62135596800000000,
                      20])
Attributes:
    units:       microseconds since 0001-01-01 00:00:00
    calendar:    proleptic_gregorian
    _FillValue:  20
<xarray.Variable (time: 4)> Size: 32B
array(['0001-01-01T00:00:00.000000', '1677-09-21T00:12:43.145224',
       '1970-01-01T00:00:00.000000',                        'NaT'],
      dtype='datetime64[us]')
{'dtype': 'int64', '_FillValue': 20} {'_FillValue': np.int64(20), 'units': 'microseconds since 0001-01-01 00:00:00', 'calendar': 'proleptic_gregorian', 'dtype': dtype('int64')}
-----------------------------------------------------------------------------------
<xarray.Variable (time: 4)> Size: 32B
array(['1754-08-30T22:43:41.128654848', '1677-09-21T00:12:43.145224193',
       '1970-01-01T00:00:00.000000000',                           'NaT'],
      dtype='datetime64[ns]')
<xarray.Variable (time: 4)> Size: 32B
array([                   0, -2428007457983430655,  6795364578871345152,
                         20])
Attributes:
    units:       nanoseconds since 1754-08-30 22:43:41.128654848
    calendar:    proleptic_gregorian
    _FillValue:  20
<xarray.Variable (time: 4)> Size: 32B
array(['1754-08-30T22:43:41.128654848', '1677-09-21T00:12:43.145224193',
       '1970-01-01T00:00:00.000000000',                           'NaT'],
      dtype='datetime64[ns]')
{'dtype': 'int64', '_FillValue': 20} {'_FillValue': np.int64(20), 'units': 'nanoseconds since 1754-08-30 22:43:41.128654848', 'calendar': 'proleptic_gregorian', 'dtype': dtype('int64')}
```

Enabling the proposed changes would allow us to (at least) get feedback on which workflows already work with another default resolution and which do not. It would also help us investigate solutions to fully resolve #7493. I've already tried to come up with a fully automagic version, but that didn't work out (yet), more or less because the current tests are not easy to adapt to a different default resolution.

Anyway, it would be great to get feedback here. Pinging some folks who might be interested: @spencerkclark, @ChrisBarker-NOAA, @rabernat, @khider; please include other interested parties, too.

@kmuehlbauer (Contributor, Author) commented:

There are some issues with pandas datetimes missing unit/as_unit() for pandas 2.0 until 2.2. Not sure how to handle the mypy errors for that.

Another minor issue is with the older py3.10-based envs, where one error isn't raised any more, and in another case cftime processing is requested (by raising OutOfBoundsError) but cftime isn't installed.

@hmaarrfk (Contributor) commented Oct 6, 2024:

Is set_options going to be compatible with free-threaded python?

Generally, I've avoided this kind of "global option setting" in my code; the results are too surprising. However, I understand why you might be tempted to do it here.

@kmuehlbauer (Contributor, Author) replied:

@hmaarrfk Currently "ns" is the default. This PR is meant to allow enabling other resolutions, without breaking backwards compatibility for the default "ns" resolution.

@hmaarrfk (Contributor) commented Oct 6, 2024:

But this isn't quite correct.

The "with" context managers are designed to make you feel "safe" when using them: they set and unset variables.

Consider workers using it to "temporarily change the resolution" during their computation.

Now you spawn them. Each will enter a context manager. Will you return to the original global state when they all exit?

@kmuehlbauer (Contributor, Author) replied:

I see, thanks for highlighting this.

For that use case you would need to set the options globally (without the context manager). See https://docs.xarray.dev/en/stable/generated/xarray.set_options.html.
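For reference, xr.set_options works both ways today: calling it without a with block changes the option globally (process-wide), while the with form restores the previous value on exit. A small sketch (time_resolution is the option proposed in this PR, not yet part of released xarray):

```python
import xarray as xr

# Global, process-wide setting: stays in effect until changed again,
# so spawned workers all observe the same value.
xr.set_options(time_resolution="us")

# Scoped setting: restored on exit, but racy when several threads
# enter/exit overlapping contexts, as discussed above.
with xr.set_options(time_resolution="us"):
    ...
```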

@hmaarrfk (Contributor) commented Oct 6, 2024:

It just gets kinda annoying to explain why libraries should not be using this new with mechanism.

In either case, I typically avoid it altogether and specify things via kwargs.

@hmaarrfk (Contributor) commented Oct 6, 2024:

I think the ability to set defaults is "interesting", but I don't think I'll be a user of it. That's all I am trying to say.

@kmuehlbauer (Contributor, Author) replied:

Thanks for sharing your concerns here, @hmaarrfk.

The proposed changes are more or less just for testing out other resolutions, to have a playground for use cases and workflows to finally resolve #7493.

So, finally, after resolving all issues, adapting tests, etc., this set_options should not be needed any more. Does that make sense?

@hmaarrfk (Contributor) commented Oct 6, 2024:

> The proposed changes are more or less just for testing out other resolutions

this might in fact be helpful in a test suite.

@spencerkclark (Member) left a comment:

Wow thanks for digging into this Kai! This approach does seem like it could be a helpful bridge toward a more fully-formed solution. I haven't had a chance to think too deeply about the bigger picture here yet, but I left a couple initial questions.

> This already works for encoding/decoding with different default resolutions (but, note the datetime64 issue for "0001-01-01T00:00:00.000000000")

Right, in the "ns" case since "0001-01-01T00:00:00.000000000" cannot be represented with a datetime64[ns] value, it has already experienced overflow before xarray sees it.

Review thread on this excerpt from the diff:

```python
    else np.datetime_data(ref_date.asm8)[0]
)
ref_date_delta = np.timedelta64(1, ref_date_unit).astype("timedelta64[ns]")
time_delta = np.timedelta64(1, time_units).astype("timedelta64[ns]")
```
@spencerkclark (Member) commented:

Have you considered the case where flat_num_dates has a floating point dtype? In that situation one could end up needing finer precision than the time_units specified in the encoding.

But maybe another question at this stage is whether we would want the encoding information (reference date or time units) to affect the datetime precision we decode to—currently it appears one could end up with non-default precision if the time units in the file specified something finer than the user-specified precision via set_options, which could be surprising.

In the long term, however, I agree that it would be nice to have some way of inferring this automatically.
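A numpy-only illustration of the floating point concern (hypothetical values, not xarray code): even a coarse time unit can demand finer precision than the unit itself suggests.

```python
import numpy as np

# num_dates = [0.5] with "days since 2000-01-01": the metadata says
# "days", but decoding the value exactly needs at least hour resolution.
num_dates = np.array([0.5])
ref_date = np.datetime64("2000-01-01", "s")
decoded = ref_date + (num_dates * 86400).astype("timedelta64[s]")
print(decoded)  # ['2000-01-01T12:00:00']
```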

@kmuehlbauer (Contributor, Author) replied:

From my perspective the decoding process should just do what it is supposed to do, which is decode into the designated datetimes as given by time_unit since reference_date. As our final dtype (np.datetime64) is based on the unix epoch (1970-01-01, with different possible resolutions), we need to find the fitting resolution (a sketch follows this list):

  • days since 1970-01-01 00:00:00 -> "s" (reference_date takes precedence over time_unit, since it is given in second resolution)
  • seconds since 1970-01-01 -> "s" (time_unit takes precedence over reference_date)
  • milliseconds since 1970-01-01 00:00:00.000000001 -> "ns" (reference_date takes precedence over time_unit)
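A minimal sketch of that precedence rule (illustrative names, not xarray's internals; assumes well-formed units strings):

```python
import numpy as np

_UNIT_RES = {"days": "s", "hours": "s", "minutes": "s", "seconds": "s",
             "milliseconds": "ms", "microseconds": "us", "nanoseconds": "ns"}
_ORDER = ["s", "ms", "us", "ns"]

def infer_decode_resolution(units: str) -> str:
    # Split e.g. "milliseconds since 1970-01-01 ..." into its two parts.
    time_unit, ref_date = (part.strip() for part in units.split("since"))
    time_res = _UNIT_RES[time_unit]
    # np.datetime64 parses the reference date at its natural resolution,
    # e.g. "...00.000000001" -> "ns"; coarser-than-second dates map to "s".
    ref_res = np.datetime_data(np.datetime64(ref_date))[0]
    if ref_res not in _ORDER:  # e.g. "D", "h", "m"
        ref_res = "s"
    # The finer of the two resolutions wins.
    return max(time_res, ref_res, key=_ORDER.index)

print(infer_decode_resolution("days since 1970-01-01 00:00:00"))  # "s"
print(infer_decode_resolution("milliseconds since 1970-01-01 00:00:00.000000001"))  # "ns"
```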

For now we have a third possible element to consider, the default resolution, which currently is "ns" (or might be changed to something else now).

The currently proposed workflow with the default resolution should finally not be needed any more, once all issues have been ironed out. Or we could use something like None to specify automatic inference.

> But maybe another question at this stage is whether we would want the encoding information (reference date or time units) to affect the datetime precision we decode to—currently it appears one could end up with non-default precision if the time units in the file specified something finer than the user-specified precision via set_options, which could be surprising.

Maybe the idea of a default resolution should only come into consideration in the encoding process, when the user does not specify any units?

@kmuehlbauer (Contributor, Author) replied:

Or no default resolution at all and try to infer everything (if that's possible at all)?

A good example showing the ambiguities between on-disk and in-memory is something like

  • on-disk: [0, 1, 2, 3] with days since 1970-01-01 12:00:00
    will translate to:
  • in-memory: [12, 36, 60, 84] with datetime64[h]
  • in-memory with default "s": [43200, 129600, 216000, 302400] with datetime64[s]

A default resolution of seconds ("s") would be no problem for the above, but for this example:

  • on-disk: [0, 1, 2, 3] with seconds since 1970-01-01 00:00:00.001
    will translate to:
  • in-memory: [1, 1001, 2001, 3001] with datetime64[ms]
  • in-memory with default "s": [0, 1, 2, 3] with datetime64[s]

we would run into issues. The more I think about this, the more I think inferring as much as possible is the way to go, so that we do not round, cut off, or otherwise create time mismatches between on-disk and in-memory.
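A quick numpy check of the second example (a sketch, not xarray code) makes the truncation visible:

```python
import numpy as np

num = np.array([0, 1, 2, 3])
# on-disk: "seconds since 1970-01-01 00:00:00.001"
ref = np.datetime64("1970-01-01 00:00:00.001", "ms")

decoded_ms = ref + (num * 1000).astype("timedelta64[ms]")
print(decoded_ms.astype("int64"))  # [   1 1001 2001 3001] -> exact

decoded_s = decoded_ms.astype("datetime64[s]")
print(decoded_s.astype("int64"))   # [0 1 2 3] -> 1 ms offset silently lost
```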

@spencerkclark (Member) commented Oct 7, 2024:

Indeed I agree long-term some kind of inference would be best—ideally we do not lose precision based on what was written on disk. It is just somewhat complicated (in the case of floating point times, one would need to take into account the values themselves in addition to the metadata), but we can try to hash out now what that could look like (or punt and just say floats are always decoded to a user-configurable precision, defaulting to nanosecond).

For context, I interpreted this PR as temporarily aiming to maintain a consistent NumPy datetime precision throughout xarray, while at least allowing for some flexibility (hence the modifications to the casting logic in variable.py to also respect the time_resolution set in set_options), so it was surprising to see one place where datetime precisions other than that specified in set_options could appear.

In your eyes, at what point do we remove the casting logic and fully allow mixed-precision datetime values? Is this merely an issue with the tests at this stage?

@kmuehlbauer (Contributor, Author) replied:

@spencerkclark Sorry, I went down that rabbit hole the last few days. I think I'm on a good way to having things sorted out. There are different locations in the code and tests where the gears mesh together.

The overall decoding and encoding logic seem to work in my local feature branch. I'll update this PR when I've fixed all the bits and pieces. Need to dig further now, back into the hole 🐰

@spencerkclark (Member) replied:

No worries—if at any point you feel it would be helpful to discuss anything in a more interactive way we could try to find a time to meet. Happy to keep things asynchronous / open though too. Thanks again for your work on this!

Comment on lines 312 to 314
```python
fnd_min, fnd_max = flat_num_dates.min(), flat_num_dates.max()
min_delta = fnd_min * np.timedelta64(1, time_units)
max_delta = fnd_max * np.timedelta64(1, time_units)
```
@spencerkclark (Member) commented:

A nice thing about pd.to_timedelta is that it will not truncate floating point inputs. Is the issue that prompted this change that there is no way currently to specify the precision of the timedelta that pd.to_timedelta returns (pandas-dev/pandas#49060), i.e. it will still always return a nanosecond-precision timedelta?

@kmuehlbauer (Contributor, Author) replied:

Yes, pd.to_timedelta converts straight to ns resolution (and this subsequently raises).
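For illustration (assuming pandas 2.x): pd.to_timedelta has no resolution argument, so the result is always nanosecond precision and values outside the ns range overflow.

```python
import pandas as pd

td = pd.to_timedelta([1, 2], unit="s")
print(td.dtype)  # timedelta64[ns] -- no way to request a coarser unit

# ~317,000 years in seconds fits in timedelta64[s] but not in ns,
# so this raises pandas.errors.OutOfBoundsTimedelta.
pd.to_timedelta([10**13], unit="s")
```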

@ChrisBarker-NOAA commented:

Thanks for this -- I haven't thought through all the details, but in my case I'd tend to set the default to ms upfront and be done with it -- I never need nanosecond precision, and I often need datetime-compatible encoding :-)

A few thoughts re: floating point time encoding -- I really wish so many people wouldn't do this. But as a rule, if folks are using float they are getting less precision than they need, and if they are using double they are getting more; I seriously doubt anyone has carefully thought out what precision to use, so I wouldn't worry about preserving the precision. If anyone actually needs better than millisecond precision and is encoding with a floating point type, they are doing something wrong already.

Re: threading -- using a context manager for this in multithreading is HARD. I would tend to recommend that folks set it globally. And if the tests are multi-threaded, that could get really ugly :-(

However, it looks like the Python decimal module may have solved this issue:

https://docs.python.org/3/library/decimal.html#decimal.localcontext

(If I understand those docs correctly, which I may not.) So maybe we could make a thread-safe context manager?
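If that reading is right, decimal.localcontext gets its thread-safety from storing the active context in thread-local/context-local storage rather than a module-level global. A hedged sketch of how an option could do the same with contextvars (a hypothetical helper, not part of this PR):

```python
import contextvars
from contextlib import contextmanager

# Each thread (or asyncio task) sees its own value, so overlapping
# "with" blocks in different workers don't clobber one another the
# way a plain module-level dict would.
_time_resolution = contextvars.ContextVar("time_resolution", default="ns")

@contextmanager
def local_time_resolution(value: str):
    token = _time_resolution.set(value)
    try:
        yield
    finally:
        _time_resolution.reset(token)

# usage: with local_time_resolution("ms"): ...
```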

@spencerkclark (Member) commented Oct 7, 2024:

> r.e. floating point time encoding -- I really wish so many people wouldn't do this.

I think we are in agreement—I'm mainly pointing out that it does happen in the wild, and in some circumstances it can be relatively sane (e.g. encoding in multiples of quarters of days, or tenths of seconds, etc.).

For this PR, what I'm suggesting we consider is that we keep things simple and punt on any precision inference—e.g. if one uses xarray.set_options(time_resolution="s") then xarray should always produce second-precision datetime values when decoding, even if that means precision loss (right now that is not necessarily the case).

@ChrisBarker-NOAA commented:

Thanks! I think this is going in a great direction.

@kmuehlbauer (Contributor, Author) commented:

This PR will be superseded by #9618. While #9618 is based on this PR, it goes a bit further and implements a kind of automatic resolution inference, where the default time resolution is the resolution to which dates should at least be decoded.

Successfully merging this pull request may close these issues.

Interoperability with Pandas 2.0 non-nanosecond datetime