Set ZArray default fill_value as NaT for datetime64 #206
@TomAugspurger, referencing back to #201. Pulling you over here, because we might be crossing issues.
Thanks. I had to modify that slightly to work with my env:
Taking a look.
What version of Zarr do you have?
Also note the typo in get-object, the second file should be
Thanks, I was able to reproduce it. I would suggest something like this:
diff --git a/virtualizarr/zarr.py b/virtualizarr/zarr.py
index e5015b3..3647db5 100644
--- a/virtualizarr/zarr.py
+++ b/virtualizarr/zarr.py
@@ -32,13 +32,14 @@ ZAttrs = NewType(
 )  # just the .zattrs (for one array or for the whole store/group)
 FillValueT = bool | str | float | int | list | None
-ZARR_DEFAULT_FILL_VALUE: dict[np.dtype, FillValueT] = {
+ZARR_DEFAULT_FILL_VALUE: dict[str, FillValueT] = {
     # numpy dtypes's hierarchy lets us avoid checking for all the widths
     # https://numpy.org/doc/stable/reference/arrays.scalars.html
-    np.dtype("bool"): False,
-    np.dtype("int"): 0,
-    np.dtype("float"): 0.0,
-    np.dtype("complex"): [0.0, 0.0],
+    np.dtype("bool").kind: False,
+    np.dtype("int").kind: 0,
+    np.dtype("float").kind: 0.0,
+    np.dtype("complex").kind: [0.0, 0.0],
+    np.dtype("datetime64").kind: np.datetime64("NaT").view("i8").item(),
 }
 """
 The value and format of the fill_value depend on the `data_type` of the array.
@@ -67,7 +68,7 @@ class ZArray(BaseModel):
     chunks: tuple[int, ...]
     compressor: dict | None = None
     dtype: np.dtype
-    fill_value: FillValueT = Field(default=0.0, validate_default=True)
+    fill_value: FillValueT = Field(None, validate_default=True)
     filters: list[dict] | None = None
     order: Literal["C", "F"]
     shape: tuple[int, ...]
@@ -90,7 +91,7 @@ class ZArray(BaseModel):
     @model_validator(mode="after")
     def _check_fill_value(self) -> Self:
         if self.fill_value is None:
-            self.fill_value = ZARR_DEFAULT_FILL_VALUE.get(self.dtype, 0.0)
+            self.fill_value = ZARR_DEFAULT_FILL_VALUE.get(self.dtype.kind, 0.0)
         return self
     @property
That adds a default fill value for datetimes that ends up as I think that once virtualizarr is able to use
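To illustrate why the suggestion keys the defaults on `dtype.kind` rather than on `np.dtype` objects, here is a minimal standalone sketch. The names `DEFAULT_FILL_BY_DTYPE`, `DEFAULT_FILL_BY_KIND`, and `default_fill_value` are hypothetical and are not part of virtualizarr; the datetime entry simply mirrors the NaT-based value proposed in the diff above.

```python
import numpy as np

# Keyed by concrete dtype objects: only exact matches are found.
DEFAULT_FILL_BY_DTYPE = {
    np.dtype("int"): 0,
    np.dtype("float"): 0.0,
}

# Keyed by the one-character dtype kind: one entry covers every width and unit.
DEFAULT_FILL_BY_KIND = {
    np.dtype("bool").kind: False,  # "b"
    np.dtype("int").kind: 0,  # "i"
    np.dtype("float").kind: 0.0,  # "f"
    np.dtype("complex").kind: [0.0, 0.0],  # "c"
    # "M"; mirrors the value proposed in the diff (NaT viewed as int64)
    np.dtype("datetime64").kind: np.datetime64("NaT").view("i8").item(),
}


def default_fill_value(dtype):
    """Hypothetical helper: look up a default by dtype kind, else fall back to 0.0."""
    return DEFAULT_FILL_BY_KIND.get(np.dtype(dtype).kind, 0.0)


# An int16 array misses the dtype-keyed table but hits the kind-keyed one.
print(np.dtype("int16") in DEFAULT_FILL_BY_DTYPE)
print(default_fill_value("int16"))

# Any datetime64 unit shares the kind "M".
print(np.dtype("<M8[ns]").kind, default_fill_value("<M8[ns]"))
```

Kinds not listed in the table (for example unsigned integers, kind `"u"`) still fall through to the `0.0` fallback in this sketch, as they would with the `.get(..., 0.0)` call in the diff.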
Great suggestion, @TomAugspurger!
Thanks for rooting this out guys! Let's add a regression test to be sure the fix is working.
Just a note that this regression appears to affect one of the roundtrip tests in my HDF reader PR. It also seems that this PR has the same result. I'll need to dig in a bit more to see exactly what is occurring in the date decoding chain there.
Here's an integration test:
diff --git a/virtualizarr/tests/test_integration.py b/virtualizarr/tests/test_integration.py
index 239316a..7210b3f 100644
--- a/virtualizarr/tests/test_integration.py
+++ b/virtualizarr/tests/test_integration.py
@@ -4,6 +4,9 @@ import xarray as xr
 import xarray.testing as xrt
 from virtualizarr import open_virtual_dataset
+from virtualizarr.manifests.array import ManifestArray
+from virtualizarr.manifests.manifest import ChunkManifest
+from virtualizarr.zarr import ZArray
 @pytest.mark.parametrize(
@@ -166,6 +169,50 @@ class TestKerchunkRoundtrip:
         # assert equal to original dataset
         xrt.assert_identical(roundtrip, ds)
+    def test_datetime64_dtype_fill_value(self, tmpdir, format):
+        chunks_dict = {
+            "0.0.0": {"path": "foo.nc", "offset": 100, "length": 100},
+        }
+        manifest = ChunkManifest(entries=chunks_dict)
+        chunks = (1, 1, 1)
+        shape = (1, 1, 1)
+        zarray = ZArray(
+            chunks=chunks,
+            compressor={"id": "zlib", "level": 1},
+            dtype=np.dtype("<M8[ns]"),
+            # fill_value=0.0,
+            filters=None,
+            order="C",
+            shape=shape,
+            zarr_format=2,
+        )
+        marr1 = ManifestArray(zarray=zarray, chunkmanifest=manifest)
+        ds = xr.Dataset(
+            {
+                "a": xr.DataArray(
+                    marr1,
+                    attrs={
+                        "_FillValue": np.datetime64("1970-01-01T00:00:00.000000000")
+                    },
+                )
+            }
+        )
+
+        if format == "dict":
+            # write those references to an in-memory kerchunk-formatted references dictionary
+            ds_refs = ds.virtualize.to_kerchunk(format=format)
+
+            # use fsspec to read the dataset from the kerchunk references dict
+            roundtrip = xr.open_dataset(ds_refs, engine="kerchunk")
+        else:
+            # write those references to disk as kerchunk references format
+            ds.virtualize.to_kerchunk(f"{tmpdir}/refs.{format}", format=format)
+
+            # use fsspec to read the dataset from disk via the kerchunk references
+            roundtrip = xr.open_dataset(f"{tmpdir}/refs.{format}", engine="kerchunk")
+
+        assert roundtrip.a.attrs == ds.a.attrs
+
 def test_open_scalar_variable(tmpdir):
     # regression test for GH issue #100
This fails on main with failure to decode metadata.
virtualizarr/zarr.py
Outdated
    np.dtype("int").kind: 0,
    np.dtype("float").kind: 0.0,
    np.dtype("complex").kind: [0.0, 0.0],
    np.dtype("datetime64").kind: np.datetime64("NaT").view("i8").item(),
Sorry, I was mistaken about what zarr-python does here :/ Seems like 0
is the appropriate fill value for datetime arrays:
In [43]: zarr.array([np.datetime64(100000000000000000, "ns")]).fill_value
Out[43]: np.datetime64('1970-01-01T00:00:00.000000000')
In [44]: zarr.array([np.datetime64(100000000000000000, "ns")]).fill_value.item()
Out[44]: 0
my apologies!
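For reference, the two candidate defaults discussed in this thread differ only in their underlying int64 value. The following is a small NumPy-only sketch (it does not call zarr-python, and the variable names are illustrative) showing what each integer means once it is reinterpreted as `datetime64[ns]`:

```python
import numpy as np

# NaT is stored as the minimum int64 value.
nat_as_int = np.datetime64("NaT").view("i8").item()

# The Unix epoch is stored as 0.
epoch_as_int = np.datetime64(0, "ns").view("i8").item()

print(nat_as_int == np.iinfo(np.int64).min)
print(epoch_as_int)

# Reinterpreting those integers as datetime64[ns] recovers NaT and the epoch.
as_nat = np.array([nat_as_int], dtype="i8").view("M8[ns]")
as_epoch = np.array([epoch_as_int], dtype="i8").view("M8[ns]")
print(np.isnat(as_nat)[0])
print(as_epoch[0] == np.datetime64("1970-01-01T00:00:00.000000000"))
```

So a fill value of 0 corresponds to the `1970-01-01T00:00:00.000000000` value that the zarr-python session above reports, whereas the NaT-based default would serialize as the int64 minimum.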
@TomAugspurger I've just sent an invite to give you write access to this repo - if you want to add commits to this branch with your integration test then go for it.
Pushed that test and the change back to
Okay! Let's add a note about this regression to the release notes then feel free to merge :)
Thanks @thodson-usgs!
This PR fixes an error introduced in 10bd53d, which broke the default fill_value for datetime64 fields. Closes #201.
To replicate the error:
then run this
example.py
The PR implements Tom's suggestion below: #206 (comment)