1. The chunk manifest storage transformer proposed in "Manifest storage transformer" (#287), which would allow zarr stores to redirect zarr readers to read byte ranges from inside arbitrary files, including legacy formats such as netCDF. We (particularly @abarciauskas-bgse, Sean, and I) are already working on making this happen, so that we can open netCDF data via zarr using xarray, effectively upstreaming kerchunk's references format as a zarr extension.
2. Decoding according to CF conventions via new zarr codecs. This is currently done automatically and somewhat opaquely by xarray when reading a netCDF file directly, and it is still done by xarray even when we read a netCDF file via kerchunk/VirtualiZarr byte range references. This decoding step is well factored out internally inside xarray but not really publicly exposed (at least not without the rest of xarray as a dependency). The suggestion (originally from @rabernat in "How to handle encoding", VirtualiZarr#68 (comment)) is to lift that code out of xarray as a set of CF-specific zarr codecs that get called when a zarr reader opens a store with a manifest pointing to a netCDF file.
Thanks for the writeup Tom, a big +1 from me on this effort.
One question is how well does xarray's internal concept of a VariableCoder map onto a zarr codec?
From glancing at the signature and a few implementations, it looks like the VariableCoder is totally compatible with the v3 codecs. If I understand correctly, CF Variables are n-dimensional arrays, so we might be looking at translating these to ArrayArrayCodecs
Does an ArrayArrayCodec know about the names of dimensions? Or metadata attributes (i.e. .zmetadata)? Because the VariableCoder has access to that information, as it is stored on the xarray.Variable object passed in.
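To make the two questions above concrete, here is a minimal Python sketch of CF scale/offset unpacking written as an array-to-array transform. It assumes one possible answer to the metadata question: the codec never sees the variable's attributes, so whoever writes the store copies the relevant CF attributes (scale_factor, add_offset, _FillValue) into the codec's own configuration. Dimension names are not needed for this particular transform. The class name CFScaleOffsetCodec, the codec name "cf_scale_offset", and the from_attrs helper are all hypothetical, and the class deliberately does not subclass zarr-python's codec base classes, whose exact interface is not assumed here.

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np


@dataclass(frozen=True)
class CFScaleOffsetCodec:
    """Hypothetical array -> array codec for CF packed data (not a zarr-python class)."""

    scale_factor: float = 1.0
    add_offset: float = 0.0
    fill_value: Optional[int] = None

    @classmethod
    def from_attrs(cls, attrs: dict) -> "CFScaleOffsetCodec":
        # Assumption: CF attributes are copied into the codec config when the
        # store is written, because the codec cannot read them at decode time.
        return cls(
            scale_factor=float(attrs.get("scale_factor", 1.0)),
            add_offset=float(attrs.get("add_offset", 0.0)),
            fill_value=attrs.get("_FillValue"),
        )

    def to_config(self) -> dict:
        # Shaped like a zarr v3 codec entry: a name plus a configuration dict.
        return {
            "name": "cf_scale_offset",  # hypothetical, unregistered codec name
            "configuration": {
                "scale_factor": self.scale_factor,
                "add_offset": self.add_offset,
                "fill_value": self.fill_value,
            },
        }

    def decode(self, packed: np.ndarray) -> np.ndarray:
        # CF unpacking: unpacked = packed * scale_factor + add_offset,
        # with fill values mapped to NaN.
        out = packed.astype("float64") * self.scale_factor + self.add_offset
        if self.fill_value is not None:
            out[packed == self.fill_value] = np.nan
        return out

    def encode(self, unpacked: np.ndarray, dtype: str = "int16") -> np.ndarray:
        # Inverse of decode: NaN goes back to the fill value.
        packed = np.round((unpacked - self.add_offset) / self.scale_factor)
        if self.fill_value is not None:
            packed = np.where(np.isnan(unpacked), self.fill_value, packed)
        return packed.astype(dtype)


# Usage: attributes as they might appear on a CF-packed temperature variable.
attrs = {"scale_factor": 0.01, "add_offset": 273.15, "_FillValue": -32768}
codec = CFScaleOffsetCodec.from_attrs(attrs)
packed = np.array([[0, 100], [-32768, 250]], dtype="int16")
decoded = codec.decode(packed)
repacked = codec.encode(decoded)
```

Whether copying attributes into the codec configuration is acceptable, rather than giving codecs access to array attributes, is exactly the open design question; the sketch only shows that the transform itself needs nothing beyond the array and a few scalars.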
Idea: Use zarr readers to open and decode netCDF/HDF/etc. data without xarray by lifting xarray's decoding machinery out as new zarr codecs.
This was suggested by @sharkinsspatial in zarr-developers/VirtualiZarr#68 (comment) and requires two components:
To be really useful this probably also requires variable-length chunking in zarr (i.e. ZEP003).
The advantages of this are:
a) a clearer separation of concerns, with fewer "magic" steps hidden inside xarray,
b) applications that can read zarr but don't want to use xarray could also read and fully decode netCDF data (i.e. pure-zarr users see the same data as xarray users),
c) clearer steps towards generalizing to non-CF encoding conventions used in other domains of science,
d) opening the door to zarr becoming a "universal reader" of any file format whose data can be expressed as a manifest of byte ranges and decoding steps can be expressed as zarr codecs.
Most of the work here would be on the xarray end - there is an ancient issue suggesting something similar in pydata/xarray#155, and a nice explanation of how xarray currently does this step in pydata/xarray#8548. Currently it looks essentially like this
where one of xarray's options for datastore is for zarr, and another is for netCDF (these are xarray's "backends"). I'm proposing something more like this, where non-xarray users can still get all of
zarr.Array < CF decoding (using new zarr codecs) < open via "universal" zarr reader < chunk manifest < file
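Read right to left, the chain above can be sketched end to end in plain Python. This is only an illustration under stated assumptions: the manifest is a plain dict mapping a chunk key to a hypothetical (path, offset, length) triple, the chunk is assumed to be uncompressed little-endian int16, and read_chunk and decode_cf are made-up helpers rather than zarr or xarray APIs.

```python
import os
import tempfile

import numpy as np


def read_chunk(manifest: dict, key: str) -> bytes:
    # chunk manifest < file: fetch one byte range from inside an arbitrary file.
    path, offset, length = manifest[key]
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


def decode_cf(raw: bytes, shape: tuple, scale_factor: float, add_offset: float) -> np.ndarray:
    # CF decoding: bytes -> packed integers -> unpacked floats.
    packed = np.frombuffer(raw, dtype="<i2").reshape(shape)
    return packed * scale_factor + add_offset


# Stand-in for a legacy file: a 16-byte header followed by one 2x3 chunk.
packed = np.arange(6, dtype="<i2").reshape(2, 3)
with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as f:
    f.write(b"\x00" * 16 + packed.tobytes())
    path = f.name

manifest = {"0.0": (path, 16, packed.nbytes)}  # hypothetical manifest layout
raw = read_chunk(manifest, "0.0")
array = decode_cf(raw, shape=(2, 3), scale_factor=0.5, add_offset=10.0)
os.remove(path)  # clean up the stand-in file
```

Nothing in this sketch depends on xarray, which is the point of the proposal: the byte-range lookup and the CF decoding are both small, separable steps.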
One question is how well does xarray's internal concept of a VariableCoder map onto a zarr codec?