Loading data from ManifestArrays without saving references to disk first #124
Comments
Thinking about this more, once zarr-python
def to_zarr_array(self: ManifestArray) -> zarr.Array:
    ...
This opens up some interesting possibilities. Currently when you call
The result would be that a user could actually treat a "virtual" xarray Dataset as a normal xarray Dataset, because if they tried to
Then you could open any data format that virtualizarr understands via
I still need to think through some of the details, but this could potentially be a neat alternative approach to pydata/xarray#9281, and not actually require any upstream changes to xarray! cc @d-v-b

(One subtlety I'm not sure about here would be around indexes. I think you would probably want to have a solution for loading indexes as laid out in #18, and then have the indexes understand how they can be loaded.)

Another subtlety to consider is when the CF decoding should happen.
You would then have effectively done
Whilst I've gone off the idea of an xarray backend that loads data by default (#221), there's another potential use case for this idea: Currently there are xarray backends and virtualizarr virtual backends. But the latter has to call the former if it wants to support the
which calls
which for a netCDF4 file would call xarray's
So to support virtualizing one filetype, we need both a virtualizarr backend and a corresponding xarray backend. For filetypes that already have xarray backends this isn't a problem, but if we want 3rd parties to add their own virtual backends for custom filetypes, it doubles the work they have to do.
@ayushnag I feel like there might be some sneaky trick we could do to load the
Like what if we create some kind of special

Perhaps a
Yes exactly! And instead of finding the byte range info from files on disk, it just gets them from the ManifestArray that's in memory. The advantage of this approach is that we should be able to then just call
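To make the "byte ranges come from memory, not from a reference file" idea concrete, here is a minimal sketch using only the standard library. The class name `InMemoryManifestStore` and the manifest layout (`path`, `offset`, `length` per chunk key) are assumptions for illustration. This is not virtualizarr's or zarr-python's actual API.

```python
import os
import tempfile
from collections.abc import Mapping


class InMemoryManifestStore(Mapping):
    """Read-only mapping from chunk keys to chunk bytes (hypothetical name).

    Byte ranges are looked up in a manifest dict held in memory instead of
    being parsed from a reference file on disk.
    """

    def __init__(self, manifest):
        # Assumed layout: {chunk_key: {"path": str, "offset": int, "length": int}}
        self._manifest = dict(manifest)

    def __getitem__(self, key):
        entry = self._manifest[key]  # KeyError for unknown chunk keys
        with open(entry["path"], "rb") as f:
            f.seek(entry["offset"])
            return f.read(entry["length"])

    def __iter__(self):
        return iter(self._manifest)

    def __len__(self):
        return len(self._manifest)


# Demo: a fake data file with an 8-byte header followed by two 4-byte chunks.
with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as tmp:
    tmp.write(b"HEADER__" + b"AAAA" + b"BBBB")
    data_path = tmp.name

manifest = {
    "0": {"path": data_path, "offset": 8, "length": 4},
    "1": {"path": data_path, "offset": 12, "length": 4},
}
store = InMemoryManifestStore(manifest)
chunk0 = store["0"]
chunk1 = store["1"]
os.remove(data_path)
```

A real version would presumably also need to serve the array metadata and handle remote paths. This sketch only shows the chunk lookup.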
@ayushnag following your suggestion on slack, maybe what we need is to temporarily create an icechunkstore then read from it? Though that would introduce a dependency on icechunk outside of the actual
Alternatively maybe creating an
It would be cool to find a way to do this using just zarr-python. cc @mpiannucci and @jhamman for feedback on this idea
I am working on a feature in `virtualizarr` to read dmrpp metadata files and create a virtual `xr.Dataset` containing ManifestArrays that can then be virtualized. This is the current workflow:

However, the chunk manifest, encoding, attrs, etc. are already in `mds`, so is it possible to read data directly from this dataset? My understanding is that once the "chunk manifest" ZEP is approved and the `zarr-python` reader in `xarray` is updated, this should be possible. The `xarray` reader for `kerchunk` can accept a file or the reference JSON object directly from `kerchunk`'s `SingleHdf5ToZarr` and `MultiZarrToZarr`. So, similarly, can we extract the refs from `mds` and pass them to `xr.open_dataset()` directly?

There probably still needs to be a function that extracts the refs so that xarray can make a new `Dataset` object with all the indexes, cf_time handling, and `open_dataset` checks.

Even reading directly from the ManifestArray dataset is possible, but I'm not sure how the new dataset object with numpy arrays and indexes would be separate from the original dataset.
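As a rough sketch of what "a function that extracts the refs" might look like: the helper below flattens per-variable chunk manifests into a kerchunk-style version 1 references dict. The function name `manifests_to_refs`, the input layout, and the example path are all assumptions for illustration, not something virtualizarr provides under this name.

```python
import json


def manifests_to_refs(manifests, zarr_metadata=None):
    """Flatten chunk manifests into a kerchunk-style refs dict (hypothetical helper).

    manifests: {var_name: {chunk_key: {"path": str, "offset": int, "length": int}}}
    zarr_metadata: optional {store_key: dict}, e.g. ".zgroup" or "air/.zarray",
        stored as JSON strings in the refs.
    """
    refs = {}
    for key, meta in (zarr_metadata or {}).items():
        refs[key] = json.dumps(meta)
    for var, manifest in manifests.items():
        for chunk_key, entry in manifest.items():
            # A chunk reference is [url, offset, length].
            refs[f"{var}/{chunk_key}"] = [
                entry["path"],
                entry["offset"],
                entry["length"],
            ]
    return {"version": 1, "refs": refs}


# Placeholder path and byte range, purely illustrative.
example = manifests_to_refs(
    {"air": {"0.0": {"path": "s3://bucket/file.nc", "offset": 100, "length": 50}}},
    zarr_metadata={".zgroup": {"zarr_format": 2}},
)
```

If the kerchunk reader accepts this dict the same way it accepts the output of `SingleHdf5ToZarr`, the refs would never need to touch disk. I have not verified that path end to end.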