Merge branch 'main' into fix_group_kwarg
TomNicholas authored Dec 9, 2024
2 parents dcbf9ff + af9c374 commit 1991c81
Showing 5 changed files with 143 additions and 69 deletions.
30 changes: 25 additions & 5 deletions README.md
@@ -14,13 +14,27 @@
[![Conda - Downloads](https://img.shields.io/conda/d/conda-forge/virtualizarr)](https://anaconda.org/conda-forge/virtualizarr)

**VirtualiZarr creates virtual Zarr stores for cloud-friendly access to archival data, using familiar xarray syntax.**
## Cloud-Optimize your Scientific Data as Virtual Zarr stores, using xarray syntax.

VirtualiZarr (pronounced like "virtualizer" but more piratey) grew out of [discussions](https://github.com/fsspec/kerchunk/issues/377) on the [kerchunk repository](https://github.com/fsspec/kerchunk), and is an attempt to provide the game-changing power of kerchunk in a zarr-native way, and with a familiar array-like API.
The best way to distribute large scientific datasets is via the Cloud, in [Cloud-Optimized formats](https://guide.cloudnativegeo.org/) [^1]. But often this data is stuck in legacy pre-Cloud file formats such as netCDF.

You now have a choice between using VirtualiZarr and Kerchunk: VirtualiZarr provides [almost all the same features](https://virtualizarr.readthedocs.io/en/latest/faq.html#how-do-virtualizarr-and-kerchunk-compare) as Kerchunk.
**VirtualiZarr[^2] makes it easy to create "Virtual" Zarr stores, allowing performant access to legacy data as if it were in the Cloud-Optimized [Zarr format](https://zarr.dev/), _without duplicating any data_.**

Please see the [documentation](https://virtualizarr.readthedocs.io/en/stable/index.html).

### Features

* Create virtual references pointing to bytes inside a legacy file with [`open_virtual_dataset`](https://virtualizarr.readthedocs.io/en/latest/usage.html#opening-files-as-virtual-datasets),
* Supports a [range of legacy file formats](https://virtualizarr.readthedocs.io/en/latest/faq.html#how-do-virtualizarr-and-kerchunk-compare), including netCDF4 and HDF5,
* [Combine data from multiple files](https://virtualizarr.readthedocs.io/en/latest/usage.html#combining-virtual-datasets) into one larger store using [xarray's combining functions](https://docs.xarray.dev/en/stable/user-guide/combining.html), such as [`xarray.concat`](https://docs.xarray.dev/en/stable/generated/xarray.concat.html),
* Commit the virtual references to storage either using the [Kerchunk references](https://fsspec.github.io/kerchunk/spec.html) specification or the [Icechunk](https://icechunk.io/) transactional storage engine.
* Users access the virtual dataset using [`xarray.open_dataset`](https://docs.xarray.dev/en/stable/generated/xarray.open_dataset.html#xarray.open_dataset).
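
To make the idea of a "virtual reference" concrete, here is a toy sketch using only the Python standard library. It is not VirtualiZarr's implementation: the `ChunkRef` class, the `read_chunk` helper, and the `temperature/0` keys are hypothetical names chosen for illustration. It only shows the underlying idea that a reference records where a chunk's bytes live (path, offset, length), so the data itself is never copied.

```python
import os
import tempfile
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkRef:
    """Hypothetical stand-in for one virtual reference: where a chunk's bytes live."""

    path: str
    offset: int
    length: int


def read_chunk(ref: ChunkRef) -> bytes:
    """Fetch only the referenced byte range, leaving the rest of the file unread."""
    with open(ref.path, "rb") as f:
        f.seek(ref.offset)
        return f.read(ref.length)


# Pretend this is a legacy file: a 6-byte header followed by two 4-byte chunks.
with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as tmp:
    tmp.write(b"HEADER" + b"AAAA" + b"BBBB")
    legacy_path = tmp.name

# The "virtual store" holds only small references, never a copy of the data.
virtual_store = {
    "temperature/0": ChunkRef(legacy_path, offset=6, length=4),
    "temperature/1": ChunkRef(legacy_path, offset=10, length=4),
}

first_chunk = read_chunk(virtual_store["temperature/0"])
second_chunk = read_chunk(virtual_store["temperature/1"])
os.remove(legacy_path)  # clean up the temporary "legacy" file
```

In a real workflow the references would be generated by `open_virtual_dataset` rather than written by hand.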

### Inspired by Kerchunk

_Please see the [documentation](https://virtualizarr.readthedocs.io/en/stable/index.html)_
VirtualiZarr grew out of [discussions](https://github.com/fsspec/kerchunk/issues/377) on the [Kerchunk repository](https://github.com/fsspec/kerchunk), and is an attempt to provide the game-changing power of kerchunk but in a zarr-native way, and with a familiar array-like API.

You now have a choice between using VirtualiZarr and Kerchunk: VirtualiZarr provides [almost all the same features](https://virtualizarr.readthedocs.io/en/latest/faq.html#how-do-virtualizarr-and-kerchunk-compare) as Kerchunk.

### Development Status and Roadmap

@@ -38,7 +52,7 @@ We have a lot of ideas, including:

If you see other opportunities then we would love to hear your ideas!

### Presentations
### Talks and Presentations

- 2024/11/21 - MET Office Architecture Guild - Tom Nicholas - [Slides](https://speakerdeck.com/tomnicholas/virtualizarr-talk-at-met-office)
- 2024/11/13 - Cloud-Native Geospatial conference - Raphael Hagen - [Slides](https://decks.carbonplan.org/cloud-native-geo/11-13-24)
@@ -52,3 +66,9 @@ This package was originally developed by [Tom Nicholas](https://github.com/TomNi
### Licence

Apache 2.0

### References

[^1]: [_Cloud-Native Repositories for Big Scientific Data_, Abernathey et al., _Computing in Science & Engineering_.](https://ieeexplore.ieee.org/abstract/document/9354557)

[^2]: (Pronounced like "virtualizer" but more piratey 🦜)
117 changes: 81 additions & 36 deletions docs/faq.md
@@ -1,6 +1,77 @@
# FAQ

## How does this work?
## Usage questions

### I'm an Xarray user but unfamiliar with Zarr/Cloud - might I still want this?

Potentially yes.

Let's say you have a bunch of legacy files (e.g. netCDF) which together tile along one or more dimensions to form a large dataset.
Let's also imagine you already know how to use xarray to open these files and combine the opened dataset objects into one complete dataset.
(If you don't then read the [xarray docs page on combining data](https://docs.xarray.dev/en/stable/user-guide/combining.html).)

```python
import xarray as xr

# open_mfdataset does a lot of checks, so can take a while
ds = xr.open_mfdataset(
    '/my/files*.nc',
    engine='h5netcdf',
    combine='nested',
)
ds  # the complete lazy xarray dataset
```

However, you don't want to run this set of xarray operations every time you open this dataset, as running commands like `xr.open_mfdataset` can be expensive.
Instead you would prefer to just be able to open a single pre-saved virtual store that points to all your data, as that would open instantly (using `xr.open_dataset('my_virtual_store.zarr')`), but still give access to the same data underneath.

**`VirtualiZarr` aims to allow you to use the same xarray incantation you would normally use to open and combine all your files, but cache that result as a virtual Zarr store.**

You can think of this as effectively caching the result of performing all the various consistency checks that xarray performs when it combines newly-encountered datasets together.
Once you have the new virtual Zarr store, xarray can assume that this checking has already been done, and trusts your Zarr store enough to just open it instantly.

```{note}
This means you should not change or add to any of the files comprising the store once created. If you want to make changes or add new data, you should look into using [Icechunk](https://icechunk.io/) instead.
```

As Zarr can read data that lives on filesystems too, this can be useful even if you don't plan to put your data in the cloud.
You can create the virtual store once (e.g. as soon as your HPC simulation finishes) and then opening that dataset will be much faster than using `open_mfdataset` each time.
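
The caching idea can be illustrated with a toy model in plain Python. This is a sketch under stated assumptions, not how VirtualiZarr stores references internally: the manifests, file paths, byte offsets, and the `concat_manifests` helper are all hypothetical. It shows that combining files along a dimension amounts to re-indexing their chunk references, and that the combined result is small enough to save once and reload cheaply.

```python
import json

# Hypothetical per-file manifests: chunk index along "time" -> (path, offset, length).
manifest_a = {0: ("/my/files0.nc", 8000, 400), 1: ("/my/files0.nc", 8400, 400)}
manifest_b = {0: ("/my/files1.nc", 8000, 400), 1: ("/my/files1.nc", 8400, 400)}


def concat_manifests(manifests):
    """Concatenate along one dimension by shifting each file's chunk indices."""
    combined = {}
    next_index = 0
    for manifest in manifests:
        for local_index in sorted(manifest):
            combined[next_index] = manifest[local_index]
            next_index += 1
    return combined


combined = concat_manifests([manifest_a, manifest_b])

# Serialise once; later sessions load this small document instead of re-scanning inputs.
cached = json.dumps({str(k): list(v) for k, v in combined.items()})
reloaded = {int(k): tuple(v) for k, v in json.loads(cached).items()}
```

Note that no array data is touched at any point: only the references are rearranged and saved.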

### Is this compatible with Icechunk?

Very much so! VirtualiZarr allows you to ingest data as virtual references and write those references into an [Icechunk](https://icechunk.io/) Store. See the [Icechunk documentation on creating virtual datasets](https://icechunk.io/icechunk-python/virtual/#creating-a-virtual-dataset-with-virtualizarr).

In general once the Icechunk specification reaches a stable v1.0, we would recommend using that over Kerchunk's references format, in order to take advantage of transactional updates, version controlled history, and faster access speeds.

### I have already Kerchunked my data, do I have to redo that?

No - you can simply open the Kerchunk-formatted references you already have into VirtualiZarr directly. Then you can manipulate them, or re-save them into a new format, such as [Icechunk](https://icechunk.io/):

```python
from virtualizarr import open_virtual_dataset

vds = open_virtual_dataset('refs.json')
# vds = open_virtual_dataset('refs.parq') # kerchunk parquet files are supported too

# `icechunkstore` is assumed to be a writable Icechunk store that you have already created
vds.virtualize.to_icechunk(icechunkstore)
```
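
For readers who have not looked inside a Kerchunk references file, the sketch below builds a minimal version 1 references document by hand, following the layout described in the Kerchunk references specification: string values are inlined content, and list values point at a URL, byte offset, and length. The bucket URL, variable name, byte numbers, and the `summarise_refs` helper are made up for illustration, and array metadata is omitted for brevity.

```python
import json

# A hand-written, minimal Kerchunk-style (version 1) references document.
refs = {
    "version": 1,
    "refs": {
        # Inlined entry: Zarr group metadata stored directly as a JSON string.
        ".zgroup": json.dumps({"zarr_format": 2}),
        # Pointer entry: [url, offset, length] for one chunk (hypothetical values).
        "air/0.0": ["s3://hypothetical-bucket/air.nc", 15419, 7738],
    },
}


def summarise_refs(document):
    """Split reference keys into inlined entries and byte-range pointers."""
    inlined, pointers = [], []
    for key, value in document["refs"].items():
        if isinstance(value, list):
            pointers.append(key)
        else:
            inlined.append(key)
    return inlined, pointers


# Round-trip through JSON text, as would happen when saving and loading 'refs.json'.
round_tripped = json.loads(json.dumps(refs))
inlined_keys, pointer_keys = summarise_refs(round_tripped)
```

A file of this shape is what `open_virtual_dataset('refs.json')` is reading in the example above.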

### I already have some data in Zarr, do I have to resave it?

No! VirtualiZarr can (well, [soon will be able to](https://github.com/zarr-developers/VirtualiZarr/issues/262)) create virtual references pointing to existing Zarr stores in the same way as for other file formats.

### Can I add a new reader for my custom file format?

There are a lot of legacy file formats which could potentially be represented as virtual zarr references (see [this issue](https://github.com/zarr-developers/VirtualiZarr/issues/218) listing some examples).
VirtualiZarr ships with some readers for common formats (e.g. netCDF/HDF5), but you may want to write your own reader for some other file format.

VirtualiZarr is designed in a way to make this as straightforward as possible.
If you want to do this then [this comment](https://github.com/zarr-developers/VirtualiZarr/issues/262#issuecomment-2429968244) will be helpful.

You can also use this approach to write a reader that starts from a kerchunk-formatted virtual references dict.

Currently if you want to call your new reader from `virtualizarr.open_virtual_dataset` you would need to open a PR to this repository, but we plan to generalize this system to allow 3rd party libraries to plug in via an entrypoint (see [issue #245](https://github.com/zarr-developers/VirtualiZarr/issues/245)).

## How does this actually work?

I'm glad you asked! We can think of the problem of providing virtualized zarr-like access to a set of legacy files in some other format as a series of steps:

@@ -16,9 +87,9 @@ The above steps could also be performed using the `kerchunk` library alone, but

## How do VirtualiZarr and Kerchunk compare?

You now have a choice between using VirtualiZarr and Kerchunk: VirtualiZarr provides almost all the same features as Kerchunk.
You have a choice between using VirtualiZarr and Kerchunk: VirtualiZarr provides almost all the same features as Kerchunk.

Users of kerchunk may find the following comparison table useful, which shows which features of kerchunk map on to which features of VirtualiZarr.
Users of Kerchunk may find the following comparison table useful, which shows which features of Kerchunk map on to which features of VirtualiZarr.

| Component / Feature | Kerchunk | VirtualiZarr |
| ------------------------------------------------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------ |
@@ -57,15 +128,17 @@ Users of kerchunk may find the following comparison table useful, which shows wh
| Zarr v3 store with `manifest.json` files || `ds.virtualize.to_zarr()`, then read via any Zarr v3 reader which implements the manifest storage transformer ZEP |
| [Icechunk](https://icechunk.io/) store || `ds.virtualize.to_icechunk()`, then read back via xarray (requires zarr-python v3). |

## Why a new project?
## Development

### Why a new project?

The reasons why VirtualiZarr has been developed as a separate project rather than by contributing to the Kerchunk library upstream are:
- Kerchunk aims to support non-Zarr-like formats too [(1)](https://github.com/fsspec/kerchunk/issues/386#issuecomment-1795379571) [(2)](https://github.com/zarr-developers/zarr-specs/issues/287#issuecomment-1944439368), whereas VirtualiZarr is more strictly scoped, and may eventually be very tightly integrated with the Zarr-Python library itself,
- Once the VirtualiZarr feature list above is complete, it will likely not share any code with the Kerchunk library, nor import it,
- The API design of VirtualiZarr is deliberately [completely different](https://github.com/fsspec/kerchunk/issues/377#issuecomment-1922688615) to Kerchunk's API, so integration into Kerchunk would have meant duplicated functionality,
- Kerchunk aims to support non-Zarr-like formats too [(1)](https://github.com/fsspec/kerchunk/issues/386#issuecomment-1795379571) [(2)](https://github.com/zarr-developers/zarr-specs/issues/287#issuecomment-1944439368), whereas VirtualiZarr is more strictly scoped, and may eventually be very tightly integrated with the Zarr-Python library itself.
- Whilst some features of VirtualiZarr currently require importing Kerchunk, Kerchunk is an optional dependency, and the VirtualiZarr roadmap aims to eventually share no code with the Kerchunk library and never require importing it. (You would still be able to write out references in the Kerchunk format, though!)
- The API design of VirtualiZarr is deliberately [completely different](https://github.com/fsspec/kerchunk/issues/377#issuecomment-1922688615) to Kerchunk's API, so integration into Kerchunk would have meant duplicated functionality.
- Refactoring Kerchunk's existing API to maintain backwards compatibility would have been [challenging](https://github.com/fsspec/kerchunk/issues/434).

## What is the Development Status and Roadmap?
### What is the Development Status and Roadmap?

VirtualiZarr version 1 (mostly) achieves [feature parity](#how-do-virtualizarr-and-kerchunk-compare) with kerchunk's logic for combining datasets, providing an easier way to manipulate kerchunk references in memory and generate kerchunk reference files on disk.

@@ -80,31 +153,3 @@ We have a lot of ideas, including:
- [Generating references without kerchunk](https://github.com/zarr-developers/VirtualiZarr/issues/78)

If you see other opportunities then we would love to hear your ideas!

## Is this compatible with Icechunk?

Yes! VirtualiZarr allows you to ingest data as virtual references and write those references into an [Icechunk](https://icechunk.io/) Store. See the [Icechunk documentation on creating virtual datasets.](https://icechunk.io/icechunk-python/virtual/#creating-a-virtual-dataset-with-virtualizarr)

## I already have Kerchunked data, do I have to redo that work?

No - you can simply open the Kerchunk-formatted references you already have into VirtualiZarr directly. Then you can re-save them into a new format, e.g. [Icechunk](https://icechunk.io/) like so:

```python
from virtualizarr import open_virtual_dataset

vds = open_virtual_dataset('refs.json')
# vds = open_virtual_dataset('refs.parq') # kerchunk parquet files are supported too

vds.virtualize.to_icechunk(icechunkstore)
```

## Can I add a new reader for my custom file format?

There are a lot of legacy file formats which could potentially be represented as virtual zarr references (see [this issue](https://github.com/zarr-developers/VirtualiZarr/issues/218) for some examples). VirtualiZarr ships with some readers for common formats (e.g. netCDF/HDF5), but you may want to write your own reader for some other file format.

VirtualiZarr is designed in a way to make this as straightforward as possible. If you want to do this then [this comment](https://github.com/zarr-developers/VirtualiZarr/issues/262#issuecomment-2429968244) will be helpful.

You can also use this approach to write a reader that starts from a kerchunk-formatted virtual references dict.

Currently if you want to call your new reader from `virtualizarr.open_virtual_dataset` you would need to open a PR to this repository, but we plan to generalize this system to allow 3rd party libraries to plug in via an entrypoint (see [issue #245](https://github.com/zarr-developers/VirtualiZarr/issues/245)).