
Set ZArray default fill_value as NaT for datetime64 #206

Merged · 4 commits · Aug 5, 2024

Conversation

@thodson-usgs (Contributor) commented Jul 31, 2024

This PR fixes an error introduced in 10bd53d, which broke the default fill_value for datetime64 fields. Closes #201.

To replicate the error:

aws s3api get-object --region us-west-2 --no-sign-request --bucket wrf-se-ak-ar5 --key ccsm/rcp85/daily/2060/WRFDS_2060-01-01.nc WRFDS_2060-01-01.nc
aws s3api get-object --region us-west-2 --no-sign-request --bucket wrf-se-ak-ar5 --key ccsm/rcp85/daily/2060/WRFDS_2060-01-02.nc WRFDS_2060-01-02.nc

then run this example.py:

import xarray as xr
import glob
from virtualizarr import open_virtual_dataset

virtual_datasets = [
    open_virtual_dataset(filepath,
                         indexes={},
                         loadable_variables=['Time'],
                         cftime_variables=['Time'])
    for filepath in glob.glob('*.nc')
]

virtual_ds = xr.combine_nested(virtual_datasets, concat_dim=["Time"], coords="minimal", compat="override")
virtual_ds.virtualize.to_kerchunk('combined.json', format='json')

ds = xr.open_dataset('combined.json', engine="kerchunk")

The PR implements Tom's suggestion below: #206 (comment)

@thodson-usgs (Contributor, Author) commented Jul 31, 2024

@TomAugspurger, referencing back to #201. Pulling you over here, because we might be crossing issues.

@TomAugspurger (Contributor) commented:

Thanks. I had to modify that slightly to work with my env:

aws s3api get-object --region us-west-2 --no-sign-request --bucket wrf-se-ak-ar5 --key ccsm/rcp85/daily/2060/WRFDS_2060-01-01.nc WRFDS_2060-01-01.nc
aws s3api get-object --region us-west-2 --no-sign-request --bucket wrf-se-ak-ar5 --key ccsm/rcp85/daily/2060/WRFDS_2060-01-01.nc WRFDS_2060-01-02.nc

Taking a look.

@TomAugspurger (Contributor) commented:

What version of Zarr do you have?

@thodson-usgs (Contributor, Author) commented Jul 31, 2024

> What version of Zarr do you have?

2.18.2

Also note the typo in the get-object command: the second file's key should end in -02.nc. I've corrected it in the original example.

@TomAugspurger (Contributor) commented:

Thanks, I was able to reproduce it.

I would suggest something like this:

diff --git a/virtualizarr/zarr.py b/virtualizarr/zarr.py
index e5015b3..3647db5 100644
--- a/virtualizarr/zarr.py
+++ b/virtualizarr/zarr.py
@@ -32,13 +32,14 @@ ZAttrs = NewType(
 )  # just the .zattrs (for one array or for the whole store/group)
 FillValueT = bool | str | float | int | list | None
 
-ZARR_DEFAULT_FILL_VALUE: dict[np.dtype, FillValueT] = {
+ZARR_DEFAULT_FILL_VALUE: dict[str, FillValueT] = {
     # numpy dtypes's hierarchy lets us avoid checking for all the widths
     # https://numpy.org/doc/stable/reference/arrays.scalars.html
-    np.dtype("bool"): False,
-    np.dtype("int"): 0,
-    np.dtype("float"): 0.0,
-    np.dtype("complex"): [0.0, 0.0],
+    np.dtype("bool").kind: False,
+    np.dtype("int").kind: 0,
+    np.dtype("float").kind: 0.0,
+    np.dtype("complex").kind: [0.0, 0.0],
+    np.dtype("datetime64").kind: np.datetime64("NaT").view("i8").item(),
 }
 """
 The value and format of the fill_value depend on the `data_type` of the array.
@@ -67,7 +68,7 @@ class ZArray(BaseModel):
     chunks: tuple[int, ...]
     compressor: dict | None = None
     dtype: np.dtype
-    fill_value: FillValueT = Field(default=0.0, validate_default=True)
+    fill_value: FillValueT = Field(None, validate_default=True)
     filters: list[dict] | None = None
     order: Literal["C", "F"]
     shape: tuple[int, ...]
@@ -90,7 +91,7 @@ class ZArray(BaseModel):
     @model_validator(mode="after")
     def _check_fill_value(self) -> Self:
         if self.fill_value is None:
-            self.fill_value = ZARR_DEFAULT_FILL_VALUE.get(self.dtype, 0.0)
+            self.fill_value = ZARR_DEFAULT_FILL_VALUE.get(self.dtype.kind, 0.0)
         return self
 
     @property

That adds a default fill value for datetimes that ends up as NaT. Unfortunately, something strange was going on with the lookup of that dtype, so I had to change that mapping to use .dtype.kind instead. Dunno if that'll cause other issues.
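For context on why keying the mapping on dtype.kind works where keying on the dtype itself did not (a sketch for illustration, not part of the PR): a concrete on-disk dtype such as datetime64[ns] carries a time unit, so it does not compare equal to the generic np.dtype("datetime64") used as a dictionary key, while dtype.kind is unit-independent:

```python
import numpy as np

# The generic datetime dtype is not equal to a unit-qualified one,
# so a dict keyed on np.dtype("datetime64") misses real data's dtype:
assert np.dtype("datetime64") != np.dtype("datetime64[ns]")

# dtype.kind is unit-independent ("M" for any datetime64 width/unit),
# so it works as a stable dictionary key:
assert np.dtype("datetime64[ns]").kind == "M"
assert np.dtype("datetime64[s]").kind == "M"

# NaT reinterpreted as a raw int64 is the INT64_MIN sentinel, which is
# the value the suggested default fill stores:
assert np.datetime64("NaT").view("i8").item() == -(2**63)
```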

I think that once virtualizarr is able to use zarr.Array internally then things will be smoother.

@thodson-usgs (Contributor, Author) commented:

Great suggestion, @TomAugspurger!
I revised the PR, and it passed my example, as well as the test suite. However, I have no sense of who might be affected by this change and whether it's worth the risk.

@TomNicholas (Member) commented Jul 31, 2024

Thanks for rooting this out guys! Let's add a regression test to be sure the fix is working.

@thodson-usgs thodson-usgs changed the title Set ZArray fill_value back to nan Set ZArray default fill_value as NaT for datetime64 Aug 1, 2024
@TomNicholas TomNicholas added this to the v1.1.0 milestone Aug 1, 2024
@sharkinsspatial (Collaborator) commented:

Just a note that this regression appears to affect one of the roundtrip tests in my HDF reader PR. It also seems that this PR has the same result. I'll need to dig in a bit more to see exactly what is occurring in the date decoding chain there.

@TomAugspurger (Contributor) left a comment:

Here's an integration test:

diff --git a/virtualizarr/tests/test_integration.py b/virtualizarr/tests/test_integration.py
index 239316a..7210b3f 100644
--- a/virtualizarr/tests/test_integration.py
+++ b/virtualizarr/tests/test_integration.py
@@ -4,6 +4,9 @@ import xarray as xr
 import xarray.testing as xrt
 
 from virtualizarr import open_virtual_dataset
+from virtualizarr.manifests.array import ManifestArray
+from virtualizarr.manifests.manifest import ChunkManifest
+from virtualizarr.zarr import ZArray
 
 
 @pytest.mark.parametrize(
@@ -166,6 +169,50 @@ class TestKerchunkRoundtrip:
         # assert equal to original dataset
         xrt.assert_identical(roundtrip, ds)
 
+    def test_datetime64_dtype_fill_value(self, tmpdir, format):
+        chunks_dict = {
+            "0.0.0": {"path": "foo.nc", "offset": 100, "length": 100},
+        }
+        manifest = ChunkManifest(entries=chunks_dict)
+        chunks = (1, 1, 1)
+        shape = (1, 1, 1)
+        zarray = ZArray(
+            chunks=chunks,
+            compressor={"id": "zlib", "level": 1},
+            dtype=np.dtype("<M8[ns]"),
+            # fill_value=0.0,
+            filters=None,
+            order="C",
+            shape=shape,
+            zarr_format=2,
+        )
+        marr1 = ManifestArray(zarray=zarray, chunkmanifest=manifest)
+        ds = xr.Dataset(
+            {
+                "a": xr.DataArray(
+                    marr1,
+                    attrs={
+                        "_FillValue": np.datetime64("1970-01-01T00:00:00.000000000")
+                    },
+                )
+            }
+        )
+
+        if format == "dict":
+            # write those references to an in-memory kerchunk-formatted references dictionary
+            ds_refs = ds.virtualize.to_kerchunk(format=format)
+
+            # use fsspec to read the dataset from the kerchunk references dict
+            roundtrip = xr.open_dataset(ds_refs, engine="kerchunk")
+        else:
+            # write those references to disk as kerchunk references format
+            ds.virtualize.to_kerchunk(f"{tmpdir}/refs.{format}", format=format)
+
+            # use fsspec to read the dataset from disk via the kerchunk references
+            roundtrip = xr.open_dataset(f"{tmpdir}/refs.{format}", engine="kerchunk")
+
+        assert roundtrip.a.attrs == ds.a.attrs
+
 
 def test_open_scalar_variable(tmpdir):
     # regression test for GH issue #100

This test fails on main with a failure to decode metadata.

(Review comment on these lines of the diff:)

    np.dtype("int").kind: 0,
    np.dtype("float").kind: 0.0,
    np.dtype("complex").kind: [0.0, 0.0],
    np.dtype("datetime64").kind: np.datetime64("NaT").view("i8").item(),
@TomAugspurger (Contributor) replied:

Sorry, I was mistaken about what zarr-python does here :/ Seems like 0 is the appropriate fill value for datetime arrays:

In [43]: zarr.array([np.datetime64(100000000000000000, "ns")]).fill_value
Out[43]: np.datetime64('1970-01-01T00:00:00.000000000')
In [44]: zarr.array([np.datetime64(100000000000000000, "ns")]).fill_value.item()
Out[44]: 0

my apologies!
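To illustrate why 0 is a workable fill value here (a sketch for illustration, not part of the PR): a raw integer fill of 0 for a datetime64 array decodes to the Unix epoch rather than failing, matching the zarr-python behavior shown above:

```python
import numpy as np

# A raw fill value of 0 for a datetime64[ns] array means
# "0 nanoseconds since the Unix epoch", i.e. 1970-01-01.
fill = np.datetime64(0, "ns")
assert fill == np.datetime64("1970-01-01T00:00:00.000000000")
assert fill.astype("i8") == 0
```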

@TomNicholas (Member) commented:

> Here's an integration test:

@TomAugspurger I've just sent an invite to give you write access to this repo - if you want to add commits to this branch with your integration test then go for it.

Commits pushed to this branch:

* Change back to 0
* Added integration test
@TomAugspurger (Contributor) commented:

Pushed the test and the change back to 0 for datetime types. With that, the original example in #206 (comment) is working for me.

@TomNicholas (Member) commented:

Okay! Let's add a note about this regression to the release notes then feel free to merge :)

@TomAugspurger TomAugspurger merged commit 04a566c into zarr-developers:main Aug 5, 2024
7 checks passed
@TomAugspurger (Contributor) commented:

Thanks @thodson-usgs!

Linked issue closed by this pull request: MetadataError from ValueError: Could not convert object to NumPy datetime