
Set ZArray default fill_value as NaT for datetime64 #206

Merged · 4 commits · Aug 5, 2024

Conversation

@thodson-usgs (Contributor) commented Jul 31, 2024

This PR fixes an error introduced in 10bd53d, which broke the default fill_value for datetime64 fields. Closes #201.

To replicate the error:

aws s3api get-object --region us-west-2 --no-sign-request --bucket wrf-se-ak-ar5 --key ccsm/rcp85/daily/2060/WRFDS_2060-01-01.nc WRFDS_2060-01-01.nc
aws s3api get-object --region us-west-2 --no-sign-request --bucket wrf-se-ak-ar5 --key ccsm/rcp85/daily/2060/WRFDS_2060-01-02.nc WRFDS_2060-01-02.nc

then run this example.py:

import xarray as xr
import glob
from virtualizarr import open_virtual_dataset

virtual_datasets = [
    open_virtual_dataset(filepath,
                         indexes={},
                         loadable_variables=['Time'],
                         cftime_variables=['Time'])
    for filepath in glob.glob('*.nc')
]

virtual_ds = xr.combine_nested(virtual_datasets, concat_dim=["Time"], coords="minimal", compat="override")
virtual_ds.virtualize.to_kerchunk('combined.json', format='json')

ds = xr.open_dataset('combined.json', engine="kerchunk")

The PR implements Tom's suggestion below: #206 (comment)

@thodson-usgs (Contributor, Author) commented Jul 31, 2024

@TomAugspurger, referencing back to #201. Pulling you over here, because we might be crossing issues.

@TomAugspurger (Contributor) commented:

Thanks. I had to modify that slightly to work with my env:

aws s3api get-object --region us-west-2 --no-sign-request --bucket wrf-se-ak-ar5 --key ccsm/rcp85/daily/2060/WRFDS_2060-01-01.nc WRFDS_2060-01-01.nc
aws s3api get-object --region us-west-2 --no-sign-request --bucket wrf-se-ak-ar5 --key ccsm/rcp85/daily/2060/WRFDS_2060-01-01.nc WRFDS_2060-01-02.nc

Taking a look.

@TomAugspurger (Contributor) commented:

What version of Zarr do you have?

@thodson-usgs (Contributor, Author) commented Jul 31, 2024

> What version of Zarr do you have?

2.18.2

Also note the typo in the get-object command: the second file's key should end in -02.nc. I've corrected it in the original example.

@TomAugspurger (Contributor) commented:

Thanks, I was able to reproduce it.

I would suggest something like this:

diff --git a/virtualizarr/zarr.py b/virtualizarr/zarr.py
index e5015b3..3647db5 100644
--- a/virtualizarr/zarr.py
+++ b/virtualizarr/zarr.py
@@ -32,13 +32,14 @@ ZAttrs = NewType(
 )  # just the .zattrs (for one array or for the whole store/group)
 FillValueT = bool | str | float | int | list | None
 
-ZARR_DEFAULT_FILL_VALUE: dict[np.dtype, FillValueT] = {
+ZARR_DEFAULT_FILL_VALUE: dict[str, FillValueT] = {
     # numpy dtypes's hierarchy lets us avoid checking for all the widths
     # https://numpy.org/doc/stable/reference/arrays.scalars.html
-    np.dtype("bool"): False,
-    np.dtype("int"): 0,
-    np.dtype("float"): 0.0,
-    np.dtype("complex"): [0.0, 0.0],
+    np.dtype("bool").kind: False,
+    np.dtype("int").kind: 0,
+    np.dtype("float").kind: 0.0,
+    np.dtype("complex").kind: [0.0, 0.0],
+    np.dtype("datetime64").kind: np.datetime64("NaT").view("i8").item(),
 }
 """
 The value and format of the fill_value depend on the `data_type` of the array.
@@ -67,7 +68,7 @@ class ZArray(BaseModel):
     chunks: tuple[int, ...]
     compressor: dict | None = None
     dtype: np.dtype
-    fill_value: FillValueT = Field(default=0.0, validate_default=True)
+    fill_value: FillValueT = Field(None, validate_default=True)
     filters: list[dict] | None = None
     order: Literal["C", "F"]
     shape: tuple[int, ...]
@@ -90,7 +91,7 @@ class ZArray(BaseModel):
     @model_validator(mode="after")
     def _check_fill_value(self) -> Self:
         if self.fill_value is None:
-            self.fill_value = ZARR_DEFAULT_FILL_VALUE.get(self.dtype, 0.0)
+            self.fill_value = ZARR_DEFAULT_FILL_VALUE.get(self.dtype.kind, 0.0)
         return self
 
     @property

That adds a default fill value for datetimes that ends up as NaT. Unfortunately, something strange was going on with the lookup of that dtype, so I had to change that mapping to use .dtype.kind instead. Dunno if that'll cause other issues.
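For context on why keying the mapping on dtype.kind works where keying on the dtype itself did not (a sketch for illustration, not part of the PR): a concrete on-disk dtype such as datetime64[ns] carries a time unit, so it does not compare equal to the generic np.dtype("datetime64") used as a dictionary key, while dtype.kind is unit-independent:

```python
import numpy as np

# The generic datetime dtype is not equal to a unit-qualified one,
# so a dict keyed on np.dtype("datetime64") misses real data's dtype:
assert np.dtype("datetime64") != np.dtype("datetime64[ns]")

# dtype.kind is unit-independent ("M" for any datetime64 width/unit),
# so it works as a stable dictionary key:
assert np.dtype("datetime64[ns]").kind == "M"
assert np.dtype("datetime64[s]").kind == "M"

# NaT reinterpreted as a raw int64 is the INT64_MIN sentinel, which is
# the value the suggested default fill stores:
assert np.datetime64("NaT").view("i8").item() == -(2**63)
```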

I think that once virtualizarr is able to use zarr.Array internally then things will be smoother.

@thodson-usgs (Contributor, Author) commented:

Great suggestion, @TomAugspurger!
I revised the PR, and it passed my example, as well as the test suite. However, I have no sense of who might be affected by this change and whether it's worth the risk.

@TomNicholas (Member) commented Jul 31, 2024

Thanks for rooting this out guys! Let's add a regression test to be sure the fix is working.

@thodson-usgs thodson-usgs changed the title Set ZArray fill_value back to nan Set ZArray default fill_value as NaT for datetime64 Aug 1, 2024
@TomNicholas TomNicholas added this to the v1.1.0 milestone Aug 1, 2024
@sharkinsspatial (Collaborator) commented:

Just a note that this regression appears to affect one of the roundtrip tests in my HDF reader PR. It also seems that this PR has the same result. I'll need to dig in a bit more to see exactly what is occurring in the date decoding chain there.

@TomAugspurger (Contributor) left a comment:

Here's an integration test:

diff --git a/virtualizarr/tests/test_integration.py b/virtualizarr/tests/test_integration.py
index 239316a..7210b3f 100644
--- a/virtualizarr/tests/test_integration.py
+++ b/virtualizarr/tests/test_integration.py
@@ -4,6 +4,9 @@ import xarray as xr
 import xarray.testing as xrt
 
 from virtualizarr import open_virtual_dataset
+from virtualizarr.manifests.array import ManifestArray
+from virtualizarr.manifests.manifest import ChunkManifest
+from virtualizarr.zarr import ZArray
 
 
 @pytest.mark.parametrize(
@@ -166,6 +169,50 @@ class TestKerchunkRoundtrip:
         # assert equal to original dataset
         xrt.assert_identical(roundtrip, ds)
 
+    def test_datetime64_dtype_fill_value(self, tmpdir, format):
+        chunks_dict = {
+            "0.0.0": {"path": "foo.nc", "offset": 100, "length": 100},
+        }
+        manifest = ChunkManifest(entries=chunks_dict)
+        chunks = (1, 1, 1)
+        shape = (1, 1, 1)
+        zarray = ZArray(
+            chunks=chunks,
+            compressor={"id": "zlib", "level": 1},
+            dtype=np.dtype("<M8[ns]"),
+            # fill_value=0.0,
+            filters=None,
+            order="C",
+            shape=shape,
+            zarr_format=2,
+        )
+        marr1 = ManifestArray(zarray=zarray, chunkmanifest=manifest)
+        ds = xr.Dataset(
+            {
+                "a": xr.DataArray(
+                    marr1,
+                    attrs={
+                        "_FillValue": np.datetime64("1970-01-01T00:00:00.000000000")
+                    },
+                )
+            }
+        )
+
+        if format == "dict":
+            # write those references to an in-memory kerchunk-formatted references dictionary
+            ds_refs = ds.virtualize.to_kerchunk(format=format)
+
+            # use fsspec to read the dataset from the kerchunk references dict
+            roundtrip = xr.open_dataset(ds_refs, engine="kerchunk")
+        else:
+            # write those references to disk as kerchunk references format
+            ds.virtualize.to_kerchunk(f"{tmpdir}/refs.{format}", format=format)
+
+            # use fsspec to read the dataset from disk via the kerchunk references
+            roundtrip = xr.open_dataset(f"{tmpdir}/refs.{format}", engine="kerchunk")
+
+        assert roundtrip.a.attrs == ds.a.attrs
+
 
 def test_open_scalar_variable(tmpdir):
     # regression test for GH issue #100

This test fails on main with a failure to decode metadata.

(Review comment on these lines of the diff:)

    np.dtype("int").kind: 0,
    np.dtype("float").kind: 0.0,
    np.dtype("complex").kind: [0.0, 0.0],
    np.dtype("datetime64").kind: np.datetime64("NaT").view("i8").item(),
@TomAugspurger (Contributor) replied:

Sorry, I was mistaken about what zarr-python does here :/ Seems like 0 is the appropriate fill value for datetime arrays:

In [43]: zarr.array([np.datetime64(100000000000000000, "ns")]).fill_value
Out[43]: np.datetime64('1970-01-01T00:00:00.000000000')
In [44]: zarr.array([np.datetime64(100000000000000000, "ns")]).fill_value.item()
Out[44]: 0

my apologies!
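To illustrate why 0 is a workable fill value here (a sketch for illustration, not part of the PR): a raw integer fill of 0 for a datetime64 array decodes to the Unix epoch rather than failing, matching the zarr-python behavior shown above:

```python
import numpy as np

# A raw fill value of 0 for a datetime64[ns] array means
# "0 nanoseconds since the Unix epoch", i.e. 1970-01-01.
fill = np.datetime64(0, "ns")
assert fill == np.datetime64("1970-01-01T00:00:00.000000000")
assert fill.astype("i8") == 0
```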

@TomNicholas (Member) commented:

> Here's an integration test:

@TomAugspurger I've just sent an invite to give you write access to this repo - if you want to add commits to this branch with your integration test then go for it.

Commits pushed to this branch:

* Change back to 0
* Added integration test
@TomAugspurger (Contributor) commented:

Pushed the test and the change back to 0 for datetime types. With that, the original example in #206 (comment) is working for me.

@TomNicholas (Member) commented:

Okay! Let's add a note about this regression to the release notes then feel free to merge :)

@TomAugspurger TomAugspurger merged commit 04a566c into zarr-developers:main Aug 5, 2024
7 checks passed
@TomAugspurger (Contributor) commented:

Thanks @thodson-usgs!

Linked issue closed by this pull request: MetadataError from ValueError: Could not convert object to NumPy datetime