Conformant ZarrV3 codecs and fill values #193

Merged
Changes from 2 commits
79 commits
6b7abe2
Generate chunk manifest backed variable from HDF5 dataset.
sharkinsspatial Apr 19, 2024
bca0aab
Transfer dataset attrs to variable.
sharkinsspatial Apr 19, 2024
384ff6b
Get virtual variables dict from HDF5 file.
sharkinsspatial Apr 19, 2024
4c5f9bd
Update virtual_vars_from_hdf to use fsspec and drop_variables arg.
sharkinsspatial Apr 22, 2024
1dd3370
mypy fix to use ChunkKey and empty dimensions list.
sharkinsspatial Apr 22, 2024
d92c75c
Extract attributes from hdf5 root group.
sharkinsspatial Apr 22, 2024
0ed8362
Use hdf reader for netcdf4 files.
sharkinsspatial Apr 22, 2024
f4485fa
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Apr 22, 2024
3cc1254
Merge branch 'main' into hdf5_reader
sharkinsspatial May 8, 2024
0123df7
Fix ruff complaints.
sharkinsspatial May 9, 2024
332bcaa
First steps for handling HDF5 filters.
sharkinsspatial May 10, 2024
c51e615
Initial step for hdf5plugin supported codecs.
sharkinsspatial May 13, 2024
0083f77
Small commit to check compression support in CI environment.
sharkinsspatial May 16, 2024
3c00071
Merge branch 'main' into hdf5_reader
sharkinsspatial May 18, 2024
207c4b5
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 19, 2024
c573800
Fix mypy complaints for hdf_filters.
sharkinsspatial May 19, 2024
ef0d7a8
Merge branch 'hdf5_reader' of https://github.com/TomNicholas/Virtuali…
sharkinsspatial May 19, 2024
588e06b
Local pre-commit fix for hdf_filters.
sharkinsspatial May 19, 2024
725333e
Use fsspec reader_options introduced in #37.
sharkinsspatial May 21, 2024
72df108
Fix incorrect zarr_v3 if block position from merge commit ef0d7a8.
sharkinsspatial May 21, 2024
d1e85cb
Fix early return from hdf _extract_attrs.
sharkinsspatial May 21, 2024
1e2b343
Test that _extract_attrs correctly handles multiple attributes.
sharkinsspatial May 21, 2024
7f1c189
Initial attempt at scale and offset via numcodecs.
sharkinsspatial May 22, 2024
908e332
Tests for cfcodec_from_dataset.
sharkinsspatial May 23, 2024
0df332d
Temporarily relax integration tests to assert_allclose.
sharkinsspatial May 24, 2024
ca6b236
Add blosc_lz4 fixture parameterization to confirm libnetcdf environment.
sharkinsspatial May 24, 2024
b7426c5
Check for compatability with netcdf4 engine.
sharkinsspatial May 24, 2024
dac21dd
Use separate fixtures for h5netcdf and netcdf4 compression styles.
sharkinsspatial May 27, 2024
e968772
Print libhdf5 and libnetcdf4 versions to confirm compiled environment.
sharkinsspatial May 27, 2024
9a98e57
Skip netcdf4 style compression tests when libhdf5 < 1.14.
sharkinsspatial May 27, 2024
7590b87
Include imagecodecs.numcodecs to support HDF5 lzf filters.
sharkinsspatial Jun 11, 2024
e9fbc8a
Merge branch 'main' into hdf5_reader
sharkinsspatial Jun 11, 2024
14bd709
Remove test that verifies call to read_kerchunk_references_from_file.
sharkinsspatial Jun 11, 2024
acdf0d7
Add additional codec support structures for imagecodecs and numcodecs.
sharkinsspatial Jun 12, 2024
4ba323a
Add codec config test for Zstd.
sharkinsspatial Jun 12, 2024
e14e53b
Include initial cf decoding tests.
sharkinsspatial Jun 21, 2024
b808ded
Merge branch 'main' into hdf5_reader
sharkinsspatial Jun 21, 2024
b052f8c
Revert typo for scale_factor retrieval.
sharkinsspatial Jun 21, 2024
01a3980
Update reader to use new numpy manifest representation.
sharkinsspatial Jun 21, 2024
c37d9e5
Temporarily skip test until blosc netcdf4 issue is solved.
sharkinsspatial Jun 22, 2024
17b30d4
Fix Pydantic 2 migration warnings.
sharkinsspatial Jun 22, 2024
f6b596a
Include hdf5plugin and imagecodecs-numcodecs in mamba test environment.
sharkinsspatial Jun 22, 2024
eb6e24d
Mamba attempt with imagecodecs rather than imagecodecs-numcodecs.
sharkinsspatial Jun 22, 2024
c85bd16
Mamba attempt with latest imagecodecs release.
sharkinsspatial Jun 22, 2024
ca435da
Use correct iter_chunks callback function signtature.
sharkinsspatial Jun 26, 2024
3017951
Include pip based imagecodecs-numcodecs until conda-forge availability.
sharkinsspatial Jun 26, 2024
ccf0b73
Merge branch 'main' into hdf5_reader
sharkinsspatial Jun 26, 2024
32ba135
Handle non-coordinate dims which are serialized to hdf as empty dataset.
sharkinsspatial Jun 27, 2024
64f446c
Use reader_options for filetype check and update failing kerchunk call.
sharkinsspatial Jun 27, 2024
1c590bb
Merge branch 'main' into hdf5_reader
sharkinsspatial Jun 27, 2024
9797346
Fix chunkmanifest shaping for chunked datasets.
sharkinsspatial Jun 30, 2024
c833e19
Handle scale_factor attribute serialization for compressed files.
sharkinsspatial Jun 30, 2024
701bcfa
Include chunked roundtrip fixture.
sharkinsspatial Jun 30, 2024
08c988e
Standardize xarray integration tests for hdf filters.
sharkinsspatial Jun 30, 2024
e6076bd
Merge branch 'hdf5_reader' of https://github.com/TomNicholas/Virtuali…
sharkinsspatial Jun 30, 2024
d684a84
Merge branch 'main' into hdf5_reader
sharkinsspatial Jun 30, 2024
4cb4bac
Update reader selection logic for new filetype determination.
sharkinsspatial Jun 30, 2024
d352104
Use decode_times for integration test.
sharkinsspatial Jun 30, 2024
3d89ea4
Standardize fixture names for hdf5 vs netcdf4 file types.
sharkinsspatial Jun 30, 2024
c9dd0d9
Handle array add_offset property for compressed data.
sharkinsspatial Jul 1, 2024
db5b421
Include h5py shuffle filter.
sharkinsspatial Jul 1, 2024
9a1da32
Make ScaleAndOffset codec last in filters list.
sharkinsspatial Jul 1, 2024
9b2b0f8
Apply ScaleAndOffset codec to _FillValue since it's value is now down…
sharkinsspatial Jul 2, 2024
9ef1362
Coerce scale and add_offset values to native float for JSON serializa…
sharkinsspatial Jul 2, 2024
eb16bc1
Conformant ZarrV3 codecs
ghidalgo3 Jul 17, 2024
5f1b7f9
Update docs
ghidalgo3 Jul 17, 2024
519d45d
Update virtualizarr/zarr.py
ghidalgo3 Jul 18, 2024
76e9c8e
Update virtualizarr/zarr.py
ghidalgo3 Jul 18, 2024
000c520
Change default_fill to 0s
ghidalgo3 Jul 18, 2024
25d04b9
Merge branch 'guhidalgo/fixmetadatacodecs' of https://github.com/ghid…
ghidalgo3 Jul 18, 2024
c2e7279
Generate permutation
ghidalgo3 Jul 18, 2024
145960a
Pythonic isinstance check
ghidalgo3 Jul 18, 2024
c051f04
Add return type to isconfigurable
ghidalgo3 Jul 18, 2024
7a65fbd
Merge remote-tracking branch 'upstream/hdf5_reader' into codecs
Jul 18, 2024
7b09324
Changes from pair programming for zarrv3 to kerchunk file reading
Jul 19, 2024
2c59256
Revert "Merge remote-tracking branch 'upstream/hdf5_reader' into codecs"
Jul 19, 2024
50c3dcd
Fix unit tests
ghidalgo3 Jul 19, 2024
ab97e63
PR comments
ghidalgo3 Jul 22, 2024
0be0728
Remove kwarg in dict default
ghidalgo3 Jul 22, 2024
3 changes: 3 additions & 0 deletions docs/releases.rst
@@ -12,6 +12,9 @@ New Features
Breaking changes
~~~~~~~~~~~~~~~~

- Serialize valid ZarrV3 metadata (for :pull:`193`).
By `Gustavo Hidalgo <https://github.com/ghidalgo3>`_.

Deprecations
~~~~~~~~~~~~

2 changes: 1 addition & 1 deletion virtualizarr/tests/test_integration.py
@@ -138,7 +138,7 @@ def test_non_dimension_coordinates(self, tmpdir, format):
# regression test for GH issue #105

# set up example xarray dataset containing non-dimension coordinate variables
ds = xr.Dataset(coords={"lat": (["x", "y"], np.arange(6).reshape(2, 3))})
ds = xr.Dataset(coords={"lat": (["x", "y"], np.arange(6.0).reshape(2, 3))})

# save it to disk as netCDF (in temporary directory)
ds.to_netcdf(f"{tmpdir}/non_dim_coords.nc")
62 changes: 56 additions & 6 deletions virtualizarr/tests/test_zarr.py
@@ -1,12 +1,17 @@
import json

import numpy as np
import pytest
import xarray as xr
import xarray.testing as xrt

from virtualizarr import ManifestArray, open_virtual_dataset
from virtualizarr.manifests.manifest import ChunkManifest
from virtualizarr.zarr import dataset_to_zarr, metadata_from_zarr_json


def test_zarr_v3_roundtrip(tmpdir):
@pytest.fixture
def vds_with_manifest_arrays() -> xr.Dataset:
arr = ManifestArray(
chunkmanifest=ChunkManifest(
entries={"0.0": dict(path="test.nc", offset=6144, length=48)}
@@ -15,18 +20,63 @@ def test_zarr_v3_roundtrip(tmpdir):
shape=(2, 3),
dtype=np.dtype("<i8"),
chunks=(2, 3),
compressor=None,
compressor="gzip",
filters=None,
fill_value=np.nan,
fill_value=0,
order="C",
zarr_format=3,
),
)
original = xr.Dataset({"a": (["x", "y"], arr)}, attrs={"something": 0})
return xr.Dataset({"a": (["x", "y"], arr)}, attrs={"something": 0})


def isconfigurable(value: dict):
"""
Several metadata attributes in ZarrV3 use a dictionary with keys "name" : str and "configuration" : dict
"""
return "name" in value and "configuration" in value

original.virtualize.to_zarr(tmpdir / "store.zarr")

def test_zarr_v3_roundtrip(tmpdir, vds_with_manifest_arrays: xr.Dataset):
vds_with_manifest_arrays.virtualize.to_zarr(tmpdir / "store.zarr")
roundtrip = open_virtual_dataset(
tmpdir / "store.zarr", filetype="zarr_v3", indexes={}
)

xrt.assert_identical(roundtrip, original)
xrt.assert_identical(roundtrip, vds_with_manifest_arrays)


def test_metadata_roundtrip(tmpdir, vds_with_manifest_arrays: xr.Dataset):
dataset_to_zarr(vds_with_manifest_arrays, tmpdir / "store.zarr")
zarray, _, _ = metadata_from_zarr_json(tmpdir / "store.zarr/a/zarr.json")
assert zarray == vds_with_manifest_arrays.a.data.zarray


def test_zarr_v3_metadata_conformance(tmpdir, vds_with_manifest_arrays: xr.Dataset):
"""
Checks that the output metadata of an array variable conforms to this spec
for the required attributes:
https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#metadata
"""
dataset_to_zarr(vds_with_manifest_arrays, tmpdir / "store.zarr")
# read the a variable's metadata
with open(tmpdir / "store.zarr/a/zarr.json", mode="r") as f:
metadata = json.loads(f.read())
assert metadata["zarr_format"] == 3
assert metadata["node_type"] == "array"
assert isinstance(metadata["shape"], list) and all(
isinstance(dim, int) for dim in metadata["shape"]
)
assert isinstance(metadata["data_type"], str) or isconfigurable(
metadata["data_type"]
)
assert isconfigurable(metadata["chunk_grid"])
assert isconfigurable(metadata["chunk_key_encoding"])
assert any(
isinstance(metadata["fill_value"], t) for t in (bool, int, float, str, list)
)
assert (
isinstance(metadata["codecs"], list)
and len(metadata["codecs"]) > 1
and all(isconfigurable(codec) for codec in metadata["codecs"])
)
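
The `isconfigurable` helper in this test file relies on the `in` operator, which performs substring matching when given a string, so a plain string containing both words would pass. A minimal sketch of a stricter check follows, assuming the helper should accept only dictionaries that carry both keys, as its docstring describes. The commit list mentions a "Pythonic isinstance check" and an added return type, but the final form is not shown in this view, so this version is illustrative only:

```python
from typing import Any


def isconfigurable(value: Any) -> bool:
    """Return True if `value` looks like a Zarr v3 name/configuration object."""
    # Reject non-dict values first: `"name" in "some name"` is True for
    # strings, because `in` performs substring matching on str.
    if not isinstance(value, dict):
        return False
    # Assumption: both keys are required and must have the documented types.
    return isinstance(value.get("name"), str) and isinstance(
        value.get("configuration"), dict
    )
```

With this version, `isconfigurable(metadata["data_type"])` returns `False` for a string data type instead of depending on short-circuit evaluation in the calling assertion.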
97 changes: 91 additions & 6 deletions virtualizarr/zarr.py
@@ -6,8 +6,10 @@
Literal,
NewType,
Optional,
Union,
)

import numcodecs
import numpy as np
import ujson # type: ignore
import xarray as xr
@@ -103,6 +105,8 @@ def dict(self) -> dict[str, Any]:

if zarray_dict["fill_value"] is np.nan:
zarray_dict["fill_value"] = None
else:
zarray_dict["fill_value"] = self._default_fill_value()

return zarray_dict
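
The `is np.nan` test in the hunk above matches only the `np.nan` singleton; other NaN objects, such as `float("nan")`, are not identical to it. A sketch of a value-based alternative follows, using the string encodings that the Zarr v3 spec defines for non-finite floats. The helper name `encode_float_fill_value` is hypothetical and not part of VirtualiZarr:

```python
import json
import math
from typing import Union


def encode_float_fill_value(value: float) -> Union[float, str]:
    """Encode a float fill value so that it serializes to valid JSON."""
    if math.isnan(value):
        return "NaN"
    if math.isinf(value):
        return "Infinity" if value > 0 else "-Infinity"
    return value


# Standard JSON has no NaN literal, so the string form is what gets written.
encoded = json.dumps({"fill_value": encode_float_fill_value(float("nan"))})
```

Because `math.isnan` compares by value, the same branch is taken for `np.nan`, `float("nan")`, and NumPy float scalars holding NaN.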

@@ -134,6 +138,80 @@ def replace(
zarr_format=zarr_format if zarr_format is not None else self.zarr_format,
)

def _default_fill_value(self) -> Union[bool, int, float, str, list]:
"""
The value and format of the fill_value depend on the data_type of the array.
See here for spec:
https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#fill-value
"""
# numpy's dtype hierarchy lets us avoid checking for all the widths
# https://numpy.org/doc/stable/reference/arrays.scalars.html
if self.dtype is np.dtype("bool"):
return False
elif self.dtype is np.dtype("int"):
return 0
elif self.dtype is np.dtype("float"):
return "NaN"
Review comment (Contributor):

It seems like the default fill value for float is 0.0:

import zarr
import json
store = zarr.store.MemoryStore(mode="w")
z = zarr.empty((1, 1), store=store)
z[:]

array([[0.]])

(I'm not sure where on the Array / Store / Other object that information lives.)

It'd be nice if zarr-python had this as a constant that we could reuse. Would that make sense, or is there some reason not to I'm missing?

Reply (Contributor Author):

I think the specification doesn't require a specific number, just that it not be null (see the note at the bottom of the fill_value section: https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#fill-value ). However, zarr-python eventually defaults to 0, not NaN (https://github.com/zarr-developers/zarr-python/blob/37a8441c20dae3b284803bb1b0d2e6c8f040fb3e/src/zarr/array.py#L231C9-L235C31 ). I may have had some trouble with the unit tests, but I think it's better to be as similar as possible to zarr-python, so I'll change the defaults to 0s.

Reply (Member):

I think this could also be deferred to a later PR, especially if the true solution is to make it clearer what the default is upstream.
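
Following the thread above, one way to express zero-valued defaults is to key them on the Zarr v3 data type name. This is a sketch under that assumption, not the code merged in the PR: `default_fill_value` is a hypothetical helper, and only the core bool, integer, float, and complex types are handled.

```python
from typing import Union

FillValue = Union[bool, int, float, list]


def default_fill_value(data_type: str) -> FillValue:
    """Return a zero-like, JSON-serializable default for a Zarr v3 data type name."""
    if data_type == "bool":
        return False
    if data_type.startswith(("int", "uint")):
        return 0
    if data_type.startswith("float"):
        return 0.0
    if data_type.startswith("complex"):
        # Complex fill values are encoded as a [real, imaginary] pair.
        return [0.0, 0.0]
    # Assumption: anything else (for example raw "r*" types) is out of scope here.
    raise ValueError(f"no default fill value defined for data type {data_type!r}")
```

Matching on the type-name prefix covers every bit width without enumerating them, which is the intent stated in the code comment about avoiding a check for each width.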

elif self.dtype is np.dtype("complex"):
return ["NaN", "NaN"]
else:
return "NaN"

def _v3_codec_pipeline(self) -> list:
"""
VirtualiZarr internally uses the `filters`, `compressor`, and `order` attributes
from zarr v2, but to create conformant zarr v3 metadata those 3 must be turned into `codecs` objects.
Not all codecs are created equal though: https://github.com/zarr-developers/zarr-python/issues/1943
An array _must_ declare a single ArrayBytes codec, and 0 or more ArrayArray, BytesBytes codecs.
Roughly, this is the mapping:
```
filters: Iterable[ArrayArrayCodec] #optional
compressor: ArrayBytesCodec #mandatory
post_compressor: Iterable[BytesBytesCodec] #optional
```
"""
if self.filters:
filter_codecs_configs = [
numcodecs.get_codec(filter).get_config() for filter in self.filters
]
filters = [
dict(name=codec.pop("id"), configuration=codec)
for codec in filter_codecs_configs
]
else:
filters = []

# Noting here that zarr v3 has very few codecs specified in the official spec,
# and that there are far more codecs in `numcodecs`. We take a gamble and assume
# that the codec names and configuration are simply mapped into zarrv3 "configurables".
compressor_codec = numcodecs.get_codec(
# default to gzip because it is officially specified in the zarr v3 spec
dict(id=self.compressor or "gzip")
).get_config()
compressor_id = compressor_codec.pop("id")
compressor = dict(name=compressor_id, configuration=compressor_codec)

# https://zarr-specs.readthedocs.io/en/latest/v3/codecs/transpose/v1.0.html#transpose-codec-v1
# Either "C" or "F", defining the layout of bytes within each chunk of the array.
# "C" means row-major order, i.e., the last dimension varies fastest;
# "F" means column-major order, i.e., the first dimension varies fastest.
if self.order == "C":
order = tuple(range(len(self.shape)))
elif self.order == "F":
order = tuple(reversed(range(len(self.shape))))

transpose = dict(name="transpose", configuration=dict(order=order))
# https://github.com/zarr-developers/zarr-python/pull/1944#issuecomment-2151994097
# "If no ArrayBytesCodec is supplied, we can auto-add a BytesCodec"
bytes = dict(
name="bytes", configuration={}
) # TODO need to handle endianness configuration

# The order here is significant!
# [ArrayArray] -> ArrayBytes -> [BytesBytes]
codec_pipeline = [transpose, bytes] + [compressor] + filters
return codec_pipeline
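
The ordering stated in the comment above (`[ArrayArray] -> ArrayBytes -> [BytesBytes]`) can be sketched with plain dictionaries. This is an illustration, not the PR's implementation: `transpose_order`, `to_v3_codec`, and `build_codec_pipeline` are hypothetical names. It assumes filters are array-to-array codecs and the compressor is a bytes-to-bytes codec, and `endian` is hard-coded to little-endian for brevity. Placing the transpose before the filters is also a choice made here, not something the diff specifies.

```python
from __future__ import annotations

from typing import Any, Optional


def transpose_order(order: str, ndim: int) -> list[int]:
    """Map a Zarr v2 memory order to a transpose-codec permutation."""
    if order == "C":
        return list(range(ndim))
    if order == "F":
        return list(reversed(range(ndim)))
    raise ValueError(f"order must be 'C' or 'F', got {order!r}")


def to_v3_codec(config: dict[str, Any]) -> dict[str, Any]:
    """Turn a numcodecs-style config ({"id": ..., **params}) into name/configuration form."""
    # Build a new dict instead of popping "id", so the input is not mutated.
    params = {key: val for key, val in config.items() if key != "id"}
    return {"name": config["id"], "configuration": params}


def build_codec_pipeline(
    order: str,
    ndim: int,
    filters: Optional[list[dict[str, Any]]] = None,
    compressor: Optional[dict[str, Any]] = None,
) -> list[dict[str, Any]]:
    """Assemble codecs as array-to-array, then array-to-bytes, then bytes-to-bytes."""
    transpose = {
        "name": "transpose",
        "configuration": {"order": transpose_order(order, ndim)},
    }
    array_to_bytes = {"name": "bytes", "configuration": {"endian": "little"}}
    array_to_array = [transpose] + [to_v3_codec(f) for f in (filters or [])]
    bytes_to_bytes = [to_v3_codec(compressor)] if compressor else []
    return array_to_array + [array_to_bytes] + bytes_to_bytes
```

Raising on an unknown `order` avoids the unbound-variable case that arises when neither the `"C"` nor the `"F"` branch is taken.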


def encode_dtype(dtype: np.dtype) -> str:
# TODO not sure if there is a better way to get the '<i4' style representation of the dtype out
@@ -234,9 +312,10 @@ def zarr_v3_array_metadata(zarray: ZArray, dim_names: list[str], attrs: dict) ->
"name": "default",
"configuration": {"separator": "/"},
}
metadata["codecs"] = metadata.pop("filters")
metadata.pop("compressor") # TODO this should be entered in codecs somehow
metadata.pop("order") # TODO this should be replaced by a transpose codec
metadata["codecs"] = zarray._v3_codec_pipeline()
metadata.pop("filters")
metadata.pop("compressor")
metadata.pop("order")

# indicate that we're using the manifest storage transformer ZEP
metadata["storage_transformers"] = [
@@ -282,13 +361,19 @@ def metadata_from_zarr_json(filepath: Path) -> tuple[ZArray, list[str], dict]:
fill_value = np.nan
else:
fill_value = metadata["fill_value"]

all_codecs = [
codec
for codec in metadata["codecs"]
if codec["name"] not in ("transpose", "bytes")
]
compressor = all_codecs[0]
filters = [dict(id=f.pop("name"), **f) for f in all_codecs[1:]] or None
zarray = ZArray(
chunks=metadata["chunk_grid"]["configuration"]["chunk_shape"],
compressor=metadata["codecs"],
compressor=compressor["name"],
dtype=np.dtype(metadata["data_type"]),
fill_value=fill_value,
filters=metadata.get("filters", None),
filters=filters,
order="C",
shape=chunk_shape,
zarr_format=3,
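
In the `metadata_from_zarr_json` hunk above, `dict(id=f.pop("name"), **f)` mutates each codec entry and appears to leave the parameters nested under a `configuration` key. A non-mutating sketch that flattens them into numcodecs-style configs follows. It assumes the layout written by `_v3_codec_pipeline` in this diff, where the first codec after `transpose` and `bytes` is the compressor and the rest are filters. `from_v3_codec` and `split_codecs` are hypothetical names:

```python
from __future__ import annotations

from typing import Any, Optional


def from_v3_codec(codec: dict[str, Any]) -> dict[str, Any]:
    """Flatten a name/configuration dict into a numcodecs-style config."""
    return {"id": codec["name"], **codec.get("configuration", {})}


def split_codecs(
    codecs: list[dict[str, Any]],
) -> tuple[Optional[dict[str, Any]], Optional[list[dict[str, Any]]]]:
    """Return (compressor, filters) from a v3 codecs list, skipping transpose and bytes."""
    remaining = [c for c in codecs if c["name"] not in ("transpose", "bytes")]
    if not remaining:
        # No compressor or filters were declared.
        return None, None
    compressor = from_v3_codec(remaining[0])
    filters = [from_v3_codec(c) for c in remaining[1:]] or None
    return compressor, filters
```

Returning `(None, None)` for an empty remainder also avoids the `IndexError` that indexing an empty list with `[0]` would raise.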