
Loading data from ManifestArrays without saving references to disk first #124

Open · ayushnag opened this issue May 23, 2024 · 8 comments

@ayushnag (Contributor) commented May 23, 2024

I am working on a feature in virtualizarr to read dmrpp metadata files and create a virtual xr.Dataset containing ManifestArrays, which can then be virtualized. This is the current workflow:

vdatasets = parser.parse(dmrs)
# vdatasets are xr.Datasets containing ManifestArrays
mds = xr.combine_nested(list(vdatasets), **xr_combine_kwargs)
# the references are written to disk, then immediately read back in
mds.virtualize.to_kerchunk(filepath=outfile, format=outformat)
ds = xr.open_dataset(outfile, engine="virtualizarr", ...)
ds.time.values

However, the chunk manifest, encoding, attrs, etc. are already in mds, so is it possible to read data directly from this dataset? My understanding is that once the "chunk manifest" ZEP is approved and the zarr-python reader in xarray is updated, this should be possible. The xarray reader for kerchunk can accept a file or the reference JSON object directly from kerchunk's SingleHdf5ToZarr and MultiZarrToZarr. So, similarly, can we extract the refs from mds and pass them to xr.open_dataset() directly?

There probably still needs to be a function that extracts the refs, so that xarray can build a new Dataset object with all the indexes, CF time handling, and open_dataset checks:

mds = xr.combine_nested(list(vdatasets), **xr_combine_kwargs)
refs = mds.virtualize()  # hypothetical: return the refs in memory
ds = xr.open_dataset(refs, engine="virtualizarr", ...)
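[Editor's note: for comparison, kerchunk-style refs can already be consumed in memory today via fsspec's reference filesystem, without a round-trip to disk. A sketch, assuming refs is a kerchunk reference dict like the hypothetical .virtualize() above would return:]

import fsspec
import xarray as xr

# build a filesystem over the in-memory reference dict, then hand the
# resulting mapper to xarray's zarr backend
fs = fsspec.filesystem("reference", fo=refs)
ds = xr.open_dataset(fs.get_mapper(""), engine="zarr", consolidated=False)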

Reading directly from the ManifestArray dataset might even be possible, but I'm not sure how the new dataset object with numpy arrays and indexes would be kept separate from the original dataset:

mds = xr.combine_nested(list(vdatasets), **xr_combine_kwargs)
mds.time.values
@ayushnag ayushnag reopened this May 23, 2024
@TomNicholas TomNicholas changed the title Reading data from ManifestArray's Loading data from ManifestArrays without saving references to disk first May 23, 2024
@TomNicholas (Member) commented Aug 7, 2024

Thinking about this more, once zarr-python Array objects support the manifest storage transformer, we should be able to write a new method on ManifestArray objects which constructs the zarr.Array directly, i.e.

def to_zarr_array(self: ManifestArray) -> zarr.Array:
    # construct a zarr.Array backed by this array's chunk manifest
    ...

This opens up some interesting possibilities. Currently when you call .compute on a virtual dataset you get a NotImplementedError, but with this we could change the behaviour to instead:

  1. turn the ManifestArray into a zarr.Array,
  2. use xarray's zarr backend machinery to open that zarr array the same way it normally happens when you do xr.open_zarr,
  3. which includes wrapping with xarray's lazy indexing classes,
  4. then call the .compute behaviour that xarray would normally use.

The result would be that a user could actually treat a "virtual" xarray Dataset as a normal xarray Dataset, because if they tried to .compute it it should transform itself into one under the hood!
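[Editor's note: a rough sketch of steps 1–4 under stated assumptions. to_zarr_array, manifest_to_store, and the eager variable rebuild are hypothetical, not existing APIs; a real implementation would go through xarray's lazy indexing adapters rather than rebuilding variables eagerly.]

import xarray as xr
import zarr

def to_zarr_array(marr) -> zarr.Array:
    # step 1 (hypothetical): back a zarr.Array with the chunk manifest,
    # via whatever mechanism the manifest storage transformer ZEP provides
    store = manifest_to_store(marr.manifest)  # hypothetical helper
    return zarr.open_array(store=store, mode="r")

def compute_virtual(vds: xr.Dataset) -> xr.Dataset:
    # steps 2-4: rebuild each variable on top of a zarr.Array and let
    # xarray load the data as it would for any zarr-backed dataset
    variables = {
        name: xr.Variable(var.dims, to_zarr_array(var.data), var.attrs)
        for name, var in vds.variables.items()
    }
    return xr.Dataset(variables, attrs=vds.attrs).compute()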

Then you could open any data format that virtualizarr understands via vz.open_virtual_dataset (or maybe eventually xr.open_dataset(engine='virtualizarr')), and if you want to treat it like an in-memory xarray Dataset from that point on then you can, but if you prefer to manipulate it and save it out as a virtual zarr store on disk you can also do that!

I still need to think through some of the details, but this could potentially be a neat alternative approach to pydata/xarray#9281, and not actually require any upstream changes to xarray!

cc @d-v-b

@TomNicholas (Member)

(One subtlety I'm not sure about here would be around indexes. I think you would probably want to have a solution for loading indexes as laid out in #18, and then have the indexes understand how they can be loaded.)

@TomNicholas (Member)

Another subtlety to consider is when the CF decoding should happen. Following this path, you would effectively have done open_dataset in a very roundabout way, so we need to make sure the CF decoding step isn't forgotten somewhere along the way.
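[Editor's note: the step that must not get lost is essentially what xr.decode_cf does; raw_ds here stands in for a dataset opened straight from the zarr machinery without decoding.]

import xarray as xr

# decode_cf applies the CF conventions that open_dataset normally handles:
# time units, scale_factor/add_offset, _FillValue masking, etc.
ds = xr.decode_cf(raw_ds)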

@TomNicholas (Member)

Whilst I've gone off the idea of an xarray backend that loads data by default (#221), there's another potential use case for this idea: loadable_variables.

Currently there are xarray backends and virtualizarr virtual backends, but the latter has to call the former if it wants to support the loadable_variables kwarg. For example, @sharkinsspatial's new HDF virtual reader internally calls

loadable_vars, indexes = open_loadable_vars_and_indexes(

which calls xr.open_dataset, which for a netCDF4 file would call xarray's netCDF4 backend to load the loadable_variables.

So to support virtualizing one filetype, we need both a virtualizarr backend and a corresponding xarray backend. For filetypes that already have xarray backends this isn't a problem, but if we want third parties to add their own virtual backends for custom filetypes, it doubles the work they have to do.

@TomNicholas (Member)

@ayushnag I feel like there might be some sneaky trick we could do to load the ManifestArrays into memory using xr.open_zarr, using either xpublish or some kind of special Python zarr store implementation...

Like what if we create some kind of special ManifestStore class along the lines of all the zarr stores listed here, then use the zarr-python library to read from that? Is that possible?

https://zarr.readthedocs.io/en/stable/api/storage.html
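[Editor's note: a minimal sketch of what such a ManifestStore might look like, written in the style of a zarr v2 mapping store; zarr-python 3's Store ABC is async and differs, and the manifest layout shown here is an assumption.]

from collections.abc import Mapping
import fsspec

class ManifestStore(Mapping):
    """Read-only store resolving zarr chunk keys through a chunk manifest."""

    def __init__(self, manifest, zarray_metadata: bytes):
        self._manifest = manifest  # assumed: maps "0.0" -> (path, offset, length)
        self._metadata = {".zarray": zarray_metadata}

    def __getitem__(self, key: str) -> bytes:
        if key in self._metadata:
            return self._metadata[key]
        # look up the chunk's byte range in the manifest and fetch it
        path, offset, length = self._manifest[key]
        with fsspec.open(path, "rb") as f:  # works for s3://, https://, local paths
            f.seek(offset)
            return f.read(length)

    def __iter__(self):
        yield from self._metadata
        yield from self._manifest

    def __len__(self):
        return len(self._metadata) + len(self._manifest)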

@ayushnag (Contributor, Author)

Perhaps a Store very similar to an IcechunkStore, except with a subset of the capabilities and requirements? It could contain just the portion that provides the abstraction to read from a remote store given only the manifest.

@TomNicholas (Member)

Yes, exactly! And instead of finding the byte-range info from files on disk, it just gets it from the ManifestArray that's in memory.

The advantage of this approach is that we should be able to then just call open_zarr internally to get at the loadable variables, guaranteeing that Xarray will handle CF decoding of those variables in the way it currently does when we use the netCDF backend.
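[Editor's note: hypothetical usage building on the ManifestStore sketch above, assuming zarr accepts the read-only mapping; manifest and zarray_metadata are placeholders.]

import xarray as xr

# open the in-memory store with xarray's zarr backend, so the loadable
# variables get CF-decoded exactly as they would via the netCDF backend
store = ManifestStore(manifest, zarray_metadata)
ds = xr.open_zarr(store, consolidated=False)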

@TomNicholas (Member)

@ayushnag following your suggestion on Slack, maybe what we need is to temporarily create an IcechunkStore and then read from it? Though that would introduce a dependency on icechunk outside of the actual .to_icechunk method.

Alternatively, maybe creating an FsspecStore (or whatever it's called exactly) that can read bytes from S3 ends up being functionally equivalent to your suggestion above to use kerchunk.

It would be cool to find a way to do this using just zarr-python.

cc @mpiannucci and @jhamman for feedback on this idea
