diff --git a/README.md b/README.md index 3e0bc4b6..00737977 100644 --- a/README.md +++ b/README.md @@ -4,91 +4,7 @@ VirtualiZarr (pronounced "virtualize-arr") grew out of [discussions](https://github.com/fsspec/kerchunk/issues/377) on the [kerchunk repository](https://github.com/fsspec/kerchunk), and is an attempt to provide the game-changing power of kerchunk in a zarr-native way, and with a familiar array-like API. -### What's the difference between VirtualiZarr and Kerchunk? - -The Kerchunk idea solves an incredibly important problem: accessing big archival datasets via a cloud-optimized pattern, but without copying or modifying the original data in any way. This is a win-win-win for users, data engineers, and data providers. Users see fast-opening zarr-compliant stores that work performantly with libraries like xarray and dask, data engineers can provide this speed by adding a lightweight virtualization layer on top of existing data (without having to ask anyone's permission), and data providers don't have to change anything about their legacy files for them to be used in a cloud-optimized way. - -However, kerchunk's current design is limited: -- Store-level abstractions make combining datasets complicated, idiosyncratic, and requires duplicating logic that already exists in libraries like xarray, -- The kerchunk format for storing on-disk references requires the caller to understand it, usually via [`fsspec`](https://github.com/fsspec/filesystem_spec) (which is currently only implemented in python). - -VirtualiZarr aims to build on the excellent ideas of kerchunk whilst solving the above problems: -- Using array-level abstractions instead is more modular, easier to reason about, allows convenient wrapping by high-level tools like xarray, and is simpler to parallelize, -- Writing the virtualized arrays out as a valid Zarr store directly (through new Zarr Extensions) will allow for Zarr implementations in any language to read the archival data. - -### Installation - -Currently you need to clone VirtualiZarr and install it locally: -```shell -git clone virtualizarr -pip install -e . -``` -You will also need a specific branch of xarray in order for concatenation without indexes to work. (See [this comment](https://github.com/TomNicholas/VirtualiZarr/issues/14#issuecomment-2018369470).) You may want to install the dependencies using the `virtualizarr/ci/environment.yml` conda file, which includes the specific branch of xarray required. - -### Usage - -**NOTE: This package is in development. The usage examples in this section are currently aspirational. Progress towards making these examples work is tracked in [issue #2](https://github.com/TomNicholas/VirtualiZarr/issues/2).** - -Let's say you have a bunch of legacy files (e.g. netCDF) which together tile to form a large dataset. Let's imagine you already know how to use xarray to open these files and combine the opened dataset objects into one complete dataset. (If you don't then read the [xarray docs page on combining data](https://docs.xarray.dev/en/stable/user-guide/combining.html).) - -```python -ds = xr.open_mfdataset( - '/my/files*.nc', - engine='h5netcdf', - combine='by_coords', # 'by_coords' requires reading coord data to determine concatenation order -) -ds # the complete lazy xarray dataset -``` - -However, you don't want to run this set of xarray operations every time you open this dataset, as running commands like `xr.open_mfdataset` can be expensive. 
Instead you would prefer to just be able to open a virtualized Zarr store (i.e. `xr.open_dataset('my_virtual_store.zarr')`), as that would open instantly, but still give access to the same data underneath. - -**`VirtualiZarr` allows you to use the same xarray incantation you would normally use to open and combine all your files, but cache that result as a virtual Zarr store.** - -What's being cached here, you ask? We're effectively caching the result of performing all the various consistency checks that xarray performs when it combines newly-encountered datasets together. Once you have the new virtual Zarr store xarray is able to assume that this checking has already been done, and trusts your Zarr store enough to just open it instantly. - -Creating the virtual store looks very similar to how we normally open data with xarray: - -```python -import virtualizarr # required for the xarray backend and accessor to be present - -virtual_ds = xr.open_mfdataset( - '/my/files*.nc', - engine='virtualizarr', # virtualizarr registers an xarray IO backend that returns ManifestArray objects - combine='by_coords', # 'by_coords' stills requires actually reading coordinate data -) - -virtual_ds # now wraps a bunch of virtual ManifestArray objects directly - -# cache the combined dataset pattern to disk, in this case using the existing kerchunk specification for reference files -virtual_ds.virtualize.to_kerchunk('combined.json', format='json') -``` - -Now you can open your shiny new Zarr store instantly: - -```python -fs = fsspec.filesystem('reference', fo='combined.json') -m = fs.get_mapper('') - -ds = xr.open_dataset(m, engine='kerchunk', chunks={}) # normal xarray.Dataset object, wrapping dask/numpy arrays etc. -``` - -(Since we serialized the cached results using the kerchunk specification then opening this zarr store still requires using fsspec via the kerchunk xarray backend.) - -No data has been loaded or copied in this process, we have merely created an on-disk lookup table that points xarray into the specific parts of the original netCDF files when it needs to read each chunk. - -### How does this work? - -I'm glad you asked! We can think of the problem of providing virtualized zarr-like access to a set of legacy files in some other format as a series of steps: - -1) **Read byte ranges** - We use the various [kerchunk file format backends](https://fsspec.github.io/kerchunk/reference.html#file-format-backends) to determine which byte ranges within a given legacy file would have to be read in order to get a specific chunk of data we want. -2) **Construct a representation of a single file (or array within a file)** - Kerchunk's backends return a nested dictionary representing an entire file, but we instead immediately parse this dict and wrap it up into a set of `ManifestArray` objects. The record of where to look to find the file and the byte ranges is stored under the `ManifestArray.manifest` attribute, in a `ChunkManifest` object. Both steps (1) and (2) are handled by the `'virtualizarr'` xarray backend, which returns one `xarray.Dataset` object per file, each wrapping multiple `ManifestArray` instances (as opposed to e.g. numpy/dask arrays). -3) **Deduce the concatenation order** - The desired order of concatenation can either be inferred from the order in which the datasets are supplied (which is what `xr.combined_nested` assumes), or it can be read from the coordinate data in the files (which is what `xr.combine_by_coords` does). If the ordering information is not present as a coordinate (e.g. 
because it's in the filename), a pre-processing step might be required. -4) **Check that the desired concatenation is valid** - Whether called explicitly by the user or implicitly via `xr.combine_nested/combine_by_coords/open_mfdataset`, `xr.concat` is used to concatenate/stack the wrapped `ManifestArray` objects. When doing this xarray will spend time checking that the array objects and any coordinate indexes can be safely aligned and concatenated. Along with opening files, and loading coordinates in step (3), this is the main reason why `xr.open_mfdataset` can take a long time to return a dataset created from a large number of files. -5) **Combine into one big dataset** - `xr.concat` dispatches to the `concat/stack` methods of the underlying `ManifestArray` objects. These perform concatenation by merging their respective Chunk Manifests. Using xarray's `combine_*` methods means that we can handle multi-dimensional concatenations as well as merging many different variables. -6) **Serialize the combined result to disk** - The resultant `xr.Dataset` object wraps `ManifestArray` objects which contain the complete list of byte ranges for every chunk we might want to read. We now serialize this information to disk, either using the [kerchunk specification](https://fsspec.github.io/kerchunk/spec.html#version-1), or in future we plan to use [new Zarr extensions](https://github.com/zarr-developers/zarr-specs/issues/287) to write valid Zarr stores directly. -7) **Open the virtualized dataset from disk** - The virtualized zarr store can now be read from disk, skipping all the work we did above. Chunk reads from this store will be redirected to read the corresponding bytes in the original legacy files. - -**Note:** Using the `kerchunk` library alone will perform a similar set of steps overall, but because (3), (4), (5), and (6) are all performed by the `kerchunk.combine.MultiZarrToZarr` function, and no internal abstractions are exposed, the design is much less modular, and the use cases are limited by kerchunk's API surface. +_Please see the [documentation](https://virtualizarr.readthedocs.io/en/latest/)_ ### Development Status and Roadmap @@ -98,6 +14,10 @@ VirtualiZarr is therefore evolving in tandem with developments in the Zarr Speci Whilst we wait for these upstream changes, in the meantime VirtualiZarr aims to provide utility in a significant subset of cases, for example by enabling writing virtualized zarr stores out to the existing kerchunk references format, so that they can be read by fsspec today. +### Credits + +This package was originally developed by [Tom Nicholas](https://github.com/TomNicholas) whilst working at [[C]Worthy](cworthy.org), who deserve credit for allowing him to prioritise a generalizable open-source solution to the dataset virtualization problem. VirtualiZarr is now a community-owned multi-stakeholder project. + ### Licence Apache 2.0 diff --git a/docs/index.md b/docs/index.md index a5300f3e..a768f6da 100644 --- a/docs/index.md +++ b/docs/index.md @@ -16,6 +16,58 @@ VirtualiZarr aims to build on the excellent ideas of kerchunk whilst solving the - Using array-level abstractions instead is more modular, easier to reason about, allows convenient wrapping by high-level tools like xarray, and is simpler to parallelize, - Writing the virtualized arrays out as a valid Zarr store directly (through new Zarr Extensions) will allow for Zarr implementations in any language to read the archival data. +## Aim + +**NOTE: This package is in development. 
The usage examples in this section are currently aspirational.
+See the [Usage docs page](#usage) to see which parts of the API work today. Progress towards making all of these examples work is tracked in [issue #2](https://github.com/TomNicholas/VirtualiZarr/issues/2).**
+
+Let's say you have a bunch of legacy files (e.g. netCDF) which together tile to form a large dataset. Let's imagine you already know how to use xarray to open these files and combine the opened dataset objects into one complete dataset. (If you don't then read the [xarray docs page on combining data](https://docs.xarray.dev/en/stable/user-guide/combining.html).)
+
+```python
+ds = xr.open_mfdataset(
+    '/my/files*.nc',
+    engine='h5netcdf',
+    combine='by_coords',  # 'by_coords' requires reading coord data to determine concatenation order
+)
+ds  # the complete lazy xarray dataset
+```
+
+However, you don't want to run this set of xarray operations every time you open this dataset, as running commands like `xr.open_mfdataset` can be expensive. Instead you would prefer to just be able to open a virtualized Zarr store (i.e. `xr.open_dataset('my_virtual_store.zarr')`), as that would open instantly, but still give access to the same data underneath.
+
+**`VirtualiZarr` aims to allow you to use the same xarray incantation you would normally use to open and combine all your files, but cache that result as a virtual Zarr store.**
+
+What's being cached here, you ask? We're effectively caching the result of performing all the various consistency checks that xarray performs when it combines newly-encountered datasets together. Once you have the new virtual Zarr store, xarray is able to assume that this checking has already been done, and trusts your Zarr store enough to just open it instantly.
+
+Creating the virtual store looks very similar to how we normally open data with xarray:
+
+```python
+import virtualizarr  # required for the xarray backend and accessor to be present
+
+virtual_ds = xr.open_mfdataset(
+    '/my/files*.nc',
+    engine='virtualizarr',  # virtualizarr registers an xarray IO backend that returns ManifestArray objects
+    combine='by_coords',  # 'by_coords' still requires actually reading coordinate data
+)
+
+virtual_ds  # now wraps a bunch of virtual ManifestArray objects directly
+
+# cache the combined dataset pattern to disk, in this case using the existing kerchunk specification for reference files
+virtual_ds.virtualize.to_kerchunk('combined.json', format='json')
+```
+
+Now you can open your shiny new Zarr store instantly:
+
+```python
+fs = fsspec.filesystem('reference', fo='combined.json')
+m = fs.get_mapper('')
+
+ds = xr.open_dataset(m, engine='kerchunk', chunks={})  # normal xarray.Dataset object, wrapping dask/numpy arrays etc.
+```
+
+(Since we serialized the cached references using the kerchunk specification, opening this zarr store still requires fsspec via the kerchunk xarray backend.)
+
+No data has been loaded or copied in this process; we have merely created an on-disk lookup table that points xarray to the specific parts of the original netCDF files it needs to read for each chunk.
+
 ## Licence
 
 Apache 2.0
diff --git a/docs/installation.md b/docs/installation.md
index a4297a97..309c34a3 100644
--- a/docs/installation.md
+++ b/docs/installation.md
@@ -1,5 +1,6 @@
 # Installation
 
+Currently you need to clone VirtualiZarr and install it locally:
 ```shell
 git clone https://github.com/TomNicholas/VirtualiZarr
 cd VirtualiZarr
 pip install -e .
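+
+# Optionally, create a conda environment containing the development dependencies,
+# including the pinned xarray branch mentioned below (the environment file path
+# is taken from these docs and assumed to be relative to the repository root):
+#   conda env create -f ci/environment.yml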
``` +You will also need a specific branch of xarray in order for concatenation without indexes to work. (See [this comment](https://github.com/TomNicholas/VirtualiZarr/issues/14#issuecomment-2018369470).) You may want to install the dependencies using the `virtualizarr/ci/environment.yml` conda file, which includes the specific branch of xarray required. + ## Install Test Dependencies diff --git a/docs/usage.md b/docs/usage.md index 29dcfcd7..0806fa5f 100644 --- a/docs/usage.md +++ b/docs/usage.md @@ -1,49 +1,176 @@ +(usage)= # Usage +This page explains how to use VirtualiZarr today, by introducing the key concepts one-by-one. -Let's say you have a bunch of legacy files (e.g. netCDF) which together tile to form a large dataset. Let's imagine you already know how to use xarray to open these files and combine the opened dataset objects into one complete dataset. (If you don't then read the [xarray docs page on combining data](https://docs.xarray.dev/en/stable/user-guide/combining.html).) +## Opening files as virtual datasets + +VirtualiZarr is for manipulating "virtual" references to pre-existing data stored on disk in a variety of formats, by representing it in terms of the [Zarr data model](https://zarr-specs.readthedocs.io/en/latest/specs.html) of chunked N-dimensional arrays. + +If we have a pre-existing netCDF file on disk, ```python -ds = xr.open_mfdataset( - '/my/files*.nc', - engine='h5netcdf', - combine='by_coords', # 'by_coords' requires reading coord data to determine concatenation order -) -ds # the complete lazy xarray dataset +import xarray as xr + +# create an example pre-existing netCDF4 file +ds = xr.tutorial.open_dataset('air_temperature') +ds.to_netcdf('air.nc') ``` -However, you don't want to run this set of xarray operations every time you open this dataset, as running commands like `xr.open_mfdataset` can be expensive. Instead you would prefer to just be able to open a virtualized Zarr store (i.e. `xr.open_dataset('my_virtual_store.zarr')`), as that would open instantly, but still give access to the same data underneath. +We can open a virtual representation of this file using {py:func}`open_virtual_dataset `. -**`VirtualiZarr` allows you to use the same xarray incantation you would normally use to open and combine all your files, but cache that result as a virtual Zarr store.** +```python +from virtualizarr import open_virtual_dataset -What's being cached here, you ask? We're effectively caching the result of performing all the various consistency checks that xarray performs when it combines newly-encountered datasets together. Once you have the new virtual Zarr store xarray is able to assume that this checking has already been done, and trusts your Zarr store enough to just open it instantly. +vds = open_virtual_dataset('air.nc') +``` -Creating the virtual store looks very similar to how we normally open data with xarray: +(Notice we did not have to explicitly indicate the file format, as {py:func}`open_virtual_dataset ` will attempt to automatically infer it.) + +```{note} +In future we would like for it to be possible to just use `xr.open_dataset`, e.g. + + import virtualizarr + + vds = xr.open_dataset('air.nc', engine='virtualizarr') + +but this requires some [upstream changes](https://github.com/TomNicholas/VirtualiZarr/issues/35) in xarray. 
+```
+
+Printing this "virtual dataset" shows that although it is an instance of `xarray.Dataset`, unlike a typical xarray dataset it does not contain numpy or dask arrays, but instead wraps {py:class}`ManifestArray` objects.
 
 ```python
-import virtualizarr # required for the xarray backend and accessor to be present
+vds
+```
+```
+<xarray.Dataset> Size: 8MB
+Dimensions:  (time: 2920, lat: 25, lon: 53)
+Coordinates:
+    lat      (lat) float32 100B ManifestArray<shape=(25,), dtype=float32, chunks=(25,)>
+    lon      (lon) float32 212B ManifestArray<shape=(53,), dtype=float32, chunks=(53,)>
+    time     (time) float32 12kB ManifestArray<shape=(2920,), dtype=float32, chunks=(2920,)>
+Data variables:
+    air      (time, lat, lon) int16 8MB ManifestArray<shape=(2920, 25, 53), dtype=int16, chunks=(2920, 25, 53)>
+Attributes:
+    Conventions:  COARDS
+    title:        4x daily NMC reanalysis (1948)
+    description:  Data is from NMC initialized reanalysis (4x/day). It consists...
+    platform:     Model
+    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanalysis.html
+```
+
+These {py:class}`ManifestArray` objects are each a virtual reference to some data in the `air.nc` netCDF file, with the references stored in the form of "Chunk Manifests".
 
-virtual_ds = xr.open_mfdataset(
-    '/my/files*.nc',
-    engine='virtualizarr', # virtualizarr registers an xarray IO backend that returns ManifestArray objects
-    combine='by_coords', # 'by_coords' stills requires actually reading coordinate data
-)
+## Chunk Manifests
 
-virtual_ds # now wraps a bunch of virtual ManifestArray objects directly
+In the Zarr model N-dimensional arrays are stored as a series of compressed chunks, each labelled by a chunk key which indicates its position in the array. Whilst conventionally each of these Zarr chunks is a separate compressed binary file stored within a Zarr Store, there is no reason why these chunks could not actually already exist as part of another file (e.g. a netCDF file), and be loaded by reading a specific byte range from this pre-existing file.
 
-# cache the combined dataset pattern to disk, in this case using the existing kerchunk specification for reference files
-virtual_ds.virtualize.to_kerchunk('combined.json', format='json')
+A "Chunk Manifest" is a list of chunk keys and their corresponding byte ranges in specific files, grouped together such that all the chunks form part of one Zarr-like array. For example, a chunk manifest for a 3-dimensional array made up of 4 chunks might look like this:
+
+```python
+{
+    "0.0.0": {"path": "s3://bucket/foo.nc", "offset": 100, "length": 100},
+    "0.0.1": {"path": "s3://bucket/foo.nc", "offset": 200, "length": 100},
+    "0.1.0": {"path": "s3://bucket/foo.nc", "offset": 300, "length": 100},
+    "0.1.1": {"path": "s3://bucket/foo.nc", "offset": 400, "length": 100},
+}
 ```
 
-Now you can open your shiny new Zarr store instantly:
+Notice that the `"path"` attribute points to a netCDF file `"foo.nc"` stored in a remote S3 bucket. There is no need for the files referred to by the chunk manifest to be local.
 
+The virtual dataset we opened above contains multiple chunk manifests stored in-memory, which we can see by pulling one out as a python dictionary.
+
+```python
+marr = vds['air'].data
+manifest = marr.manifest
+manifest.dict()
+```
 ```python
-fs = fsspec.filesystem('reference', fo='combined.json')
-m = fs.get_mapper('')
+{'0.0.0': {'path': 'air.nc', 'offset': 15419, 'length': 7738000}}
+```
+
+In this case we can see that the `"air"` variable contains only one chunk, the bytes for which live in the `air.nc` file at the location given by the `'offset'` and `'length'` attributes.
+
+The {py:class}`ChunkManifest` class is virtualizarr's internal in-memory representation of this manifest.
+
+## `ManifestArray` class
+
+A Zarr array is defined not just by the location of its constituent chunk data, but by its array-level attributes such as `shape` and `dtype`. The {py:class}`ManifestArray` class stores both the array-level attributes and the corresponding chunk manifest.
 
-ds = xr.open_dataset(m, engine='kerchunk', chunks={}) # normal xarray.Dataset object, wrapping dask/numpy arrays etc.
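+
+Because it carries this array-level metadata, the usual duck-array attributes are available on the `marr` object we pulled out above (a quick illustrative sketch; the values shown correspond to the `air.nc` example file):
+
+```python
+# array-level metadata, as for any other duck array
+print(marr.shape)          # (2920, 25, 53)
+print(marr.dtype)          # int16
+print(marr.zarray.chunks)  # (2920, 25, 53)
+```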
+
+```python
+marr
+```
+```
+ManifestArray<shape=(2920, 25, 53), dtype=int16, chunks=(2920, 25, 53)>
+```
+```python
+marr.manifest
+```
+```
+ChunkManifest<shape=(1, 1, 1)>
+```
+```python
+marr.zarray
+```
+```
+ZArray(shape=(2920, 25, 53), chunks=(2920, 25, 53), dtype=int16, compressor=None, filters=None, fill_value=None)
+```
+
+A `ManifestArray` can therefore be thought of as a virtualized representation of a single Zarr array.
+
+As it defines various array-like methods, a `ManifestArray` can often be treated like a ["duck array"](https://docs.xarray.dev/en/stable/user-guide/duckarrays.html). In particular, concatenation of multiple `ManifestArray` objects can be done via merging their chunk manifests into one (and re-labelling the chunk keys).
+
+```python
+import numpy as np
+
+concatenated = np.concatenate([marr, marr], axis=0)
+concatenated
+```
+```
+ManifestArray<shape=(5840, 25, 53), dtype=int16, chunks=(2920, 25, 53)>
+```
+```python
+concatenated.manifest.dict()
+```
+```
+{'0.0.0': {'path': 'air.nc', 'offset': 15419, 'length': 7738000},
+ '1.0.0': {'path': 'air.nc', 'offset': 15419, 'length': 7738000}}
+```
+
+This concatenation property is what will allow us to combine the data from multiple netCDF files on disk into a single Zarr store containing arrays of many chunks.
+
+```{note}
+As a single Zarr array has only one array-level set of compression codecs by definition, concatenation of arrays from files saved to disk with differing codecs cannot be achieved through concatenation of `ManifestArray` objects. Implementing this feature will require a more abstract and general notion of concatenation, see [GH issue #5](https://github.com/TomNicholas/VirtualiZarr/issues/5).
+```
+
+## Virtual Xarray Datasets as Zarr Groups
+
+The full Zarr model (for a single group) includes multiple arrays, array names, named dimensions, and arbitrary dictionary-like attrs on each array. Whilst the duck-typed `ManifestArray` cannot store all of this information, an `xarray.Dataset` wrapping multiple `ManifestArray`s maps really nicely to the Zarr model. This is what the virtual dataset we opened represents - all the information in one entire Zarr group, but held as references to on-disk chunks instead of in-memory arrays.
+
+The problem of combining many legacy-formatted files (e.g. netCDF) into one virtual Zarr store therefore becomes just a matter of opening each file using `open_virtual_dataset` and using [xarray's various combining functions](https://docs.xarray.dev/en/stable/user-guide/combining.html) to combine them into one aggregate virtual dataset.
+
+## Concatenation via xarray using given order (i.e. without indexes)
+
+TODO: How concatenating in given order works
+
+TODO: Note on how this will only work if you have the correct fork of xarray
+
+TODO: Note on how this could be done using `open_mfdataset(..., combine='nested')` in future
+
+## Concatenation via xarray using order inferred from indexes
+
+TODO: How to concatenate with order inferred from indexes automatically
+
+TODO: Note on how this could be done using `open_mfdataset(..., combine='by_coords')` in future
+
+## Writing virtual stores to disk
+
+### Writing as kerchunk format and reading via fsspec
+
+TODO: Explanation of how this uses kerchunk's format
+
+TODO: Reading using fsspec
 
-(Since we serialized the cached results using the kerchunk specification then opening this zarr store still requires using fsspec via the kerchunk xarray backend.)
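+
+In the meantime, here is a minimal sketch of the current pattern, adapted from the example on the project index page; the `combined.json` filename is arbitrary, and reading the references back assumes fsspec and the kerchunk xarray backend are installed:
+
+```python
+import fsspec
+import xarray as xr
+
+# write the virtual dataset's chunk references out using the kerchunk spec
+vds.virtualize.to_kerchunk('combined.json', format='json')
+
+# fsspec's "reference" filesystem presents those references as a mapping
+fs = fsspec.filesystem('reference', fo='combined.json')
+mapper = fs.get_mapper('')
+
+# this yields a normal xarray.Dataset wrapping dask/numpy arrays, not ManifestArrays
+ds = xr.open_dataset(mapper, engine='kerchunk', chunks={})
+```
+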
+### Writing as Zarr -No data has been loaded or copied in this process, we have merely created an on-disk lookup table that points xarray into the specific parts of the original netCDF files when it needs to read each chunk. +TODO: Explanation of how this requires changes in zarr upstream to be able to read it