
Loading data from ManifestArrays without saving references to disk first #124

Open · ayushnag opened this issue May 23, 2024 · 8 comments

@ayushnag (Contributor) commented May 23, 2024

I am working on a feature in virtualizarr to read dmrpp metadata files and create a virtual xr.Dataset containing ManifestArrays, which can then be virtualized. This is the current workflow:

vdatasets = parser.parse(dmrs)
# vdatasets are xr.Datasets containing ManifestArrays
mds = xr.combine_nested(list(vdatasets), **xr_combine_kwargs)
# the references are written to disk, then immediately read back in
mds.virtualize.to_kerchunk(filepath=outfile, format=outformat)
ds = xr.open_dataset(outfile, engine="virtualizarr", ...)
ds.time.values

However, the chunk manifest, encoding, attrs, etc. are already in mds, so is it possible to read data directly from this dataset? My understanding is that once the "chunk manifest" ZEP is approved and the zarr-python reader in xarray is updated, this should be possible. The xarray reader for kerchunk can accept a file or the reference JSON object directly from kerchunk's SingleHdf5ToZarr and MultiZarrToZarr. So, similarly, can we extract the refs from mds and pass them to xr.open_dataset() directly?

There probably still needs to be a function that extracts the refs, so that xarray can build a new Dataset object with all the indexes, CF time handling, and open_dataset checks:

mds = xr.combine_nested(list(vdatasets), **xr_combine_kwargs)
refs = mds.virtualize()  # hypothetical: return the refs in memory
ds = xr.open_dataset(refs, engine="virtualizarr", ...)
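[Editor's note: for comparison, kerchunk-style refs can already be consumed in memory today via fsspec's reference filesystem, without a round-trip to disk. A sketch, assuming refs is a kerchunk reference dict like the hypothetical .virtualize() above would return:]

import fsspec
import xarray as xr

# build a filesystem over the in-memory reference dict, then hand the
# resulting mapper to xarray's zarr backend
fs = fsspec.filesystem("reference", fo=refs)
ds = xr.open_dataset(fs.get_mapper(""), engine="zarr", consolidated=False)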

Reading directly from the ManifestArray dataset might even be possible, but I'm not sure how the new dataset object with numpy arrays and indexes would be kept separate from the original dataset:

mds = xr.combine_nested(list(vdatasets), **xr_combine_kwargs)
mds.time.values
@ayushnag ayushnag reopened this May 23, 2024
@TomNicholas TomNicholas changed the title Reading data from ManifestArray's Loading data from ManifestArrays without saving references to disk first May 23, 2024
@TomNicholas (Member) commented Aug 7, 2024

Thinking about this more, once zarr-python Array objects support the manifest storage transformer, we should be able to write a new method on ManifestArray objects which constructs the zarr.Array directly, i.e.

def to_zarr_array(self: ManifestArray) -> zarr.Array:
    # construct a zarr.Array backed by this array's chunk manifest
    ...

This opens up some interesting possibilities. Currently when you call .compute on a virtual dataset you get a NotImplementedError, but with this we could change the behaviour to instead:

  1. turn the ManifestArray into a zarr.Array,
  2. use xarray's zarr backend machinery to open that zarr array the same way it normally happens when you do xr.open_zarr,
  3. which includes wrapping with xarray's lazy indexing classes,
  4. then call the .compute behaviour that xarray would normally use.

The result would be that a user could actually treat a "virtual" xarray Dataset as a normal xarray Dataset, because if they tried to .compute it it should transform itself into one under the hood!
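[Editor's note: a rough sketch of steps 1–4 under stated assumptions. to_zarr_array, manifest_to_store, and the eager variable rebuild are hypothetical, not existing APIs; a real implementation would go through xarray's lazy indexing adapters rather than rebuilding variables eagerly.]

import xarray as xr
import zarr

def to_zarr_array(marr) -> zarr.Array:
    # step 1 (hypothetical): back a zarr.Array with the chunk manifest,
    # via whatever mechanism the manifest storage transformer ZEP provides
    store = manifest_to_store(marr.manifest)  # hypothetical helper
    return zarr.open_array(store=store, mode="r")

def compute_virtual(vds: xr.Dataset) -> xr.Dataset:
    # steps 2-4: rebuild each variable on top of a zarr.Array and let
    # xarray load the data as it would for any zarr-backed dataset
    variables = {
        name: xr.Variable(var.dims, to_zarr_array(var.data), var.attrs)
        for name, var in vds.variables.items()
    }
    return xr.Dataset(variables, attrs=vds.attrs).compute()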

Then you could open any data format that virtualizarr understands via vz.open_virtual_dataset (or maybe eventually xr.open_dataset(engine='virtualizarr')), and if you want to treat it like an in-memory xarray Dataset from that point on then you can, but if you prefer to manipulate it and save it out as a virtual zarr store on disk you can also do that!

I still need to think through some of the details, but this could potentially be a neat alternative approach to pydata/xarray#9281, and not actually require any upstream changes to xarray!

cc @d-v-b

@TomNicholas (Member)

(One subtlety I'm not sure about here would be around indexes. I think you would probably want to have a solution for loading indexes as laid out in #18, and then have the indexes understand how they can be loaded.)

@TomNicholas (Member)

Another subtlety to consider is when the CF decoding should happen. Following this path, you would effectively have done open_dataset in a very roundabout way, so we need to make sure the CF decoding step isn't forgotten somewhere along the way.
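[Editor's note: the step that must not get lost is essentially what xr.decode_cf does; raw_ds here stands in for a dataset opened straight from the zarr machinery without decoding.]

import xarray as xr

# decode_cf applies the CF conventions that open_dataset normally handles:
# time units, scale_factor/add_offset, _FillValue masking, etc.
ds = xr.decode_cf(raw_ds)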

@TomNicholas (Member)

Whilst I've gone off the idea of an xarray backend that loads data by default (#221), there's another potential use case for this idea: loadable_variables.

Currently there are xarray backends and virtualizarr virtual backends, but the latter has to call the former if it wants to support the loadable_variables kwarg. For example, @sharkinsspatial's new HDF virtual reader internally calls

loadable_vars, indexes = open_loadable_vars_and_indexes(

which calls xr.open_dataset, which for a netCDF4 file would call xarray's netCDF4 backend to load the loadable_variables.

So to support virtualizing one filetype, we need both a virtualizarr backend and a corresponding xarray backend. For filetypes that already have xarray backends this isn't a problem, but if we want third parties to add their own virtual backends for custom filetypes, it doubles the work they have to do.

@TomNicholas (Member)

@ayushnag I feel like there might be some sneaky trick we could do to load the ManifestArrays into memory using xr.open_zarr, using either xpublish or some kind of special Python zarr store implementation...

Like what if we create some kind of special ManifestStore class along the lines of all the zarr stores listed here, then use the zarr-python library to read from that? Is that possible?

https://zarr.readthedocs.io/en/stable/api/storage.html
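[Editor's note: a minimal sketch of what such a ManifestStore might look like, written in the style of a zarr v2 mapping store; zarr-python 3's Store ABC is async and differs, and the manifest layout shown here is an assumption.]

from collections.abc import Mapping
import fsspec

class ManifestStore(Mapping):
    """Read-only store resolving zarr chunk keys through a chunk manifest."""

    def __init__(self, manifest, zarray_metadata: bytes):
        self._manifest = manifest  # assumed: maps "0.0" -> (path, offset, length)
        self._metadata = {".zarray": zarray_metadata}

    def __getitem__(self, key: str) -> bytes:
        if key in self._metadata:
            return self._metadata[key]
        # look up the chunk's byte range in the manifest and fetch it
        path, offset, length = self._manifest[key]
        with fsspec.open(path, "rb") as f:  # works for s3://, https://, local paths
            f.seek(offset)
            return f.read(length)

    def __iter__(self):
        yield from self._metadata
        yield from self._manifest

    def __len__(self):
        return len(self._metadata) + len(self._manifest)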

@ayushnag (Contributor, Author)

Perhaps a Store very similar to an IcechunkStore, except with a subset of the capabilities and requirements? It could contain just the portion that provides the abstraction to read from a remote store given only the manifest.

@TomNicholas (Member)

Yes, exactly! And instead of finding the byte-range info from files on disk, it just gets it from the ManifestArray that's in memory.

The advantage of this approach is that we should be able to then just call open_zarr internally to get at the loadable variables, guaranteeing that Xarray will handle CF decoding of those variables in the way it currently does when we use the netCDF backend.
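[Editor's note: hypothetical usage building on the ManifestStore sketch above, assuming zarr accepts the read-only mapping; manifest and zarray_metadata are placeholders.]

import xarray as xr

# open the in-memory store with xarray's zarr backend, so the loadable
# variables get CF-decoded exactly as they would via the netCDF backend
store = ManifestStore(manifest, zarray_metadata)
ds = xr.open_zarr(store, consolidated=False)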

@TomNicholas (Member)

@ayushnag following your suggestion on Slack, maybe what we need is to temporarily create an IcechunkStore and then read from it? Though that would introduce a dependency on icechunk outside of the actual .to_icechunk method.

Alternatively, maybe creating an FsspecStore (or whatever it's called exactly) that can read bytes from S3 ends up being functionally equivalent to your suggestion above to use kerchunk.

It would be cool to find a way to do this using just zarr-python.

cc @mpiannucci and @jhamman for feedback on this idea
