Skip to content

Commit

Permalink
Merge branch 'main' into icechunk-append
Browse files Browse the repository at this point in the history
  • Loading branch information
abarciauskas-bgse authored Dec 4, 2024
2 parents 64e5277 + 13e9077 commit d6ef97f
Show file tree
Hide file tree
Showing 31 changed files with 1,056 additions and 463 deletions.
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,7 @@ repos:

- repo: https://github.com/astral-sh/ruff-pre-commit
# Ruff version.
rev: "v0.7.2"
rev: "v0.8.1"
hooks:
# Run the linter.
- id: ruff
Expand Down
9 changes: 8 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -20,7 +20,7 @@ VirtualiZarr (pronounced like "virtualizer" but more piratey) grew out of [discu

You now have a choice between using VirtualiZarr and Kerchunk: VirtualiZarr provides [almost all the same features](https://virtualizarr.readthedocs.io/en/latest/faq.html#how-do-virtualizarr-and-kerchunk-compare) as Kerchunk.

_Please see the [documentation](https://virtualizarr.readthedocs.io/en/stable/api.html)_
_Please see the [documentation](https://virtualizarr.readthedocs.io/en/stable/index.html)_

### Development Status and Roadmap

Expand All @@ -38,6 +38,13 @@ We have a lot of ideas, including:

If you see other opportunities then we would love to hear your ideas!

### Presentations

- 2024/11/21 - MET Office Architecture Guild - Tom Nicholas - [Slides](https://speakerdeck.com/tomnicholas/virtualizarr-talk-at-met-office)
- 2024/?/? - Cloud-Native Geospatial conference - Raphael Hagen - [Slides](https://decks.carbonplan.org/cloud-native-geo/11-13-24)
- 2024/07/24 - ESIP Meeting - Sean Harkins - [Event](https://2024julyesipmeeting.sched.com/event/1eVP6) / [Recording](https://youtu.be/T6QAwJIwI3Q?t=3689)
- 2024/05/15 - Pangeo showcase - Tom Nicholas - [Event](https://discourse.pangeo.io/t/pangeo-showcase-virtualizarr-create-virtual-zarr-stores-using-xarray-syntax/4127/2) / [Recording](https://youtu.be/ioxgzhDaYiE) / [Slides](https://speakerdeck.com/tomnicholas/virtualizarr-create-virtual-zarr-stores-using-xarray-syntax)

### Credits

This package was originally developed by [Tom Nicholas](https://github.com/TomNicholas) whilst working at [[C]Worthy](cworthy.org), who deserve credit for allowing him to prioritise a generalizable open-source solution to the dataset virtualization problem. VirtualiZarr is now a community-owned multi-stakeholder project.
Expand Down
11 changes: 11 additions & 0 deletions docs/releases.rst
Original file line number Diff line number Diff line change
Expand Up @@ -9,11 +9,16 @@ v1.1.1 (unreleased)
New Features
~~~~~~~~~~~~

- Add a ``virtual_backend_kwargs`` keyword argument to file readers and to ``open_virtual_dataset``, to allow reader-specific options to be passed down.
(:pull:`315`) By `Tom Nicholas <https://github.com/TomNicholas>`_.

Breaking changes
~~~~~~~~~~~~~~~~

- Minimum required version of Xarray is now v2024.10.0.
(:pull:`284`) By `Tom Nicholas <https://github.com/TomNicholas>`_.
- Opening kerchunk-formatted references from disk which contain relative paths now requires passing the ``fs_root`` keyword argument via ``virtual_backend_kwargs``.
(:pull:`243`) By `Tom Nicholas <https://github.com/TomNicholas>`_.

Deprecations
~~~~~~~~~~~~
Expand All @@ -39,6 +44,10 @@ Documentation
(:pull:`298`) By `Tom Nicholas <https://github.com/TomNicholas>`_.
- More minor improvements to the Contributing Guide.
(:pull:`304`) By `Doug Latornell <https://github.com/DougLatornell>`_.
- Correct some links to the API.
(:pull:`325`) By `Tom Nicholas <https://github.com/TomNicholas>`_.
- Added links to recorded presentations on VirtualiZarr.
(:pull:`313`) By `Tom Nicholas <https://github.com/TomNicholas>`_.

Internal Changes
~~~~~~~~~~~~~~~~
Expand All @@ -47,6 +56,8 @@ Internal Changes
(:pull:`87`) By `Sean Harkins <https://github.com/sharkinsspatial>`_.
- Support downstream type checking by adding py.typed marker file.
(:pull:`306`) By `Max Jones <https://github.com/maxrjones>`_.
- File paths in chunk manifests are now always stored as abolute URIs.
(:pull:`243`) By `Tom Nicholas <https://github.com/TomNicholas>`_.

.. _v1.1.0:

Expand Down
49 changes: 31 additions & 18 deletions docs/usage.md
Original file line number Diff line number Diff line change
Expand Up @@ -17,15 +17,15 @@ ds = xr.tutorial.open_dataset('air_temperature')
ds.to_netcdf('air.nc')
```

We can open a virtual representation of this file using {py:func}`open_virtual_dataset <virtualizarr.xarray.open_virtual_dataset>`.
We can open a virtual representation of this file using {py:func}`open_virtual_dataset <virtualizarr.open_virtual_dataset>`.

```python
from virtualizarr import open_virtual_dataset

vds = open_virtual_dataset('air.nc')
```

(Notice we did not have to explicitly indicate the file format, as {py:func}`open_virtual_dataset <virtualizarr.xarray.open_virtual_dataset>` will attempt to automatically infer it.)
(Notice we did not have to explicitly indicate the file format, as {py:func}`open_virtual_dataset <virtualizarr.open_virtual_dataset>` will attempt to automatically infer it.)


```{note}
Expand Down Expand Up @@ -106,10 +106,10 @@ manifest = marr.manifest
manifest.dict()
```
```python
{'0.0.0': {'path': 'air.nc', 'offset': 15419, 'length': 7738000}}
{'0.0.0': {'path': 'file:///work/data/air.nc', 'offset': 15419, 'length': 7738000}}
```

In this case we can see that the `"air"` variable contains only one chunk, the bytes for which live in the `air.nc` file at the location given by the `'offset'` and `'length'` attributes.
In this case we can see that the `"air"` variable contains only one chunk, the bytes for which live in the `file:///work/data/air.nc` file, at the location given by the `'offset'` and `'length'` attributes.

The {py:class}`ChunkManifest <virtualizarr.manifests.ChunkManifest>` class is virtualizarr's internal in-memory representation of this manifest.

Expand Down Expand Up @@ -153,8 +153,8 @@ ManifestArray<shape=(5840, 25, 53), dtype=int16, chunks=(2920, 25, 53)>
concatenated.manifest.dict()
```
```
{'0.0.0': {'path': 'air.nc', 'offset': 15419, 'length': 7738000},
'1.0.0': {'path': 'air.nc', 'offset': 15419, 'length': 7738000}}
{'0.0.0': {'path': 'file:///work/data/air.nc', 'offset': 15419, 'length': 7738000},
'1.0.0': {'path': 'file:///work/data/air.nc', 'offset': 15419, 'length': 7738000}}
```

This concatenation property is what will allow us to combine the data from multiple netCDF files on disk into a single Zarr store containing arrays of many chunks.
Expand Down Expand Up @@ -254,8 +254,8 @@ We can see that the resulting combined manifest has two chunks, as expected.
combined_vds['air'].data.manifest.dict()
```
```
{'0.0.0': {'path': 'air1.nc', 'offset': 15419, 'length': 3869000},
'1.0.0': {'path': 'air2.nc', 'offset': 15419, 'length': 3869000}}
{'0.0.0': {'path': 'file:///work/data/air1.nc', 'offset': 15419, 'length': 3869000},
'1.0.0': {'path': 'file:///work/data/air2.nc', 'offset': 15419, 'length': 3869000}}
```

```{note}
Expand Down Expand Up @@ -327,8 +327,6 @@ Loading variables can be useful in a few scenarios:

To correctly decode time variables according to the CF conventions, you need to pass `time` to `loadable_variables` and ensure the `decode_times` argument of `open_virtual_dataset` is set to True (`decode_times` defaults to None).



```python
vds = open_virtual_dataset(
'air.nc',
Expand All @@ -354,8 +352,6 @@ Attributes:
title: 4x daily NMC reanalysis (1948)
```



## Writing virtual stores to disk

Once we've combined references to all the chunks of all our legacy files into one virtual xarray dataset, we still need to write these references out to disk so that they can be read by our analysis code later.
Expand All @@ -364,7 +360,7 @@ Once we've combined references to all the chunks of all our legacy files into on

The [kerchunk library](https://github.com/fsspec/kerchunk) has its own [specification](https://fsspec.github.io/kerchunk/spec.html) for how byte range references should be serialized (either as a JSON or parquet file).

To write out all the references in the virtual dataset as a single kerchunk-compliant JSON or parquet file, you can use the {py:meth}`ds.virtualize.to_kerchunk <virtualizarr.xarray.VirtualiZarrDatasetAccessor.to_kerchunk>` accessor method.
To write out all the references in the virtual dataset as a single kerchunk-compliant JSON or parquet file, you can use the {py:meth}`vds.virtualize.to_kerchunk <virtualizarr.VirtualiZarrDatasetAccessor.to_kerchunk>` accessor method.

```python
combined_vds.virtualize.to_kerchunk('combined.json', format='json')
Expand Down Expand Up @@ -398,7 +394,7 @@ By default references are placed in separate parquet file when the total number

### Writing to an Icechunk Store

We can also write these references out as an [IcechunkStore](https://icechunk.io/). `Icechunk` is a Open-source, cloud-native transactional tensor storage engine that is compatible with zarr version 3. To export our virtual dataset to an `Icechunk` Store, we simply use the {py:meth}`ds.virtualize.to_icechunk <virtualizarr.xarray.VirtualiZarrDatasetAccessor.to_icechunk>` accessor method.
We can also write these references out as an [IcechunkStore](https://icechunk.io/). `Icechunk` is a Open-source, cloud-native transactional tensor storage engine that is compatible with zarr version 3. To export our virtual dataset to an `Icechunk` Store, we simply use the {py:meth}`vds.virtualize.to_icechunk <virtualizarr.VirtualiZarrDatasetAccessor.to_icechunk>` accessor method.

```python
# create an icechunk store
Expand All @@ -415,7 +411,7 @@ See the [Icechunk documentation](https://icechunk.io/icechunk-python/virtual/#cr

### Writing as Zarr

Alternatively, we can write these references out as an actual Zarr store, at least one that is compliant with the [proposed "Chunk Manifest" ZEP](https://github.com/zarr-developers/zarr-specs/issues/287). To do this we simply use the {py:meth}`ds.virtualize.to_zarr <virtualizarr.xarray.VirtualiZarrDatasetAccessor.to_zarr>` accessor method.
Alternatively, we can write these references out as an actual Zarr store, at least one that is compliant with the [proposed "Chunk Manifest" ZEP](https://github.com/zarr-developers/zarr-specs/issues/287). To do this we simply use the {py:meth}`vds.virtualize.to_zarr <virtualizarr.VirtualiZarrDatasetAccessor.to_zarr>` accessor method.

```python
combined_vds.virtualize.to_zarr('combined.zarr')
Expand All @@ -435,28 +431,45 @@ The advantage of this format is that any zarr v3 reader that understands the chu
```{note}
Currently there are not yet any zarr v3 readers which understand the chunk manifest ZEP, so until then this feature cannot be used for data processing.
This store can however be read by {py:func}`~virtualizarr.xarray.open_virtual_dataset`, by passing `filetype="zarr_v3"`.
This store can however be read by {py:func}`~virtualizarr.open_virtual_dataset`, by passing `filetype="zarr_v3"`.
```

## Opening Kerchunk references as virtual datasets

You can open existing Kerchunk `json` or `parquet` references as Virtualizarr virtual datasets. This may be useful for converting existing Kerchunk formatted references to storage formats like [Icechunk](https://icechunk.io/).

```python

vds = open_virtual_dataset('combined.json', filetype='kerchunk', indexes={})
# or
vds = open_virtual_dataset('combined.parquet', filetype='kerchunk', indexes={})
```

One difference between the kerchunk references format and virtualizarr's internal manifest representation (as well as icechunk's format) is that paths in kerchunk references can be relative paths. Opening kerchunk references that contain relative local filepaths therefore requires supplying another piece of information: the directory of the ``fsspec`` filesystem which the filepath was defined relative to.

You can dis-ambuiguate kerchunk references containing relative paths by passing the ``fs_root`` kwarg to ``virtual_backend_kwargs``.

```python
# file `relative_refs.json` contains a path like './file.nc'

vds = open_virtual_dataset(
'relative_refs.json',
filetype='kerchunk',
indexes={},
virtual_backend_kwargs={'fs_root': 'file:///some_directory/'}
)

# the path in the virtual dataset will now be 'file:///some_directory/file.nc'
```

Note that as the virtualizarr {py:meth}`vds.virtualize.to_kerchunk <virtualizarr.VirtualiZarrDatasetAccessor.to_kerchunk>` method only writes absolute paths, the only scenario in which you might come across references containing relative paths is if you are opening references that were previously created using the ``kerchunk`` library alone.

## Rewriting existing manifests

Sometimes it can be useful to rewrite the contents of an already-generated manifest or virtual dataset.

### Rewriting file paths

You can rewrite the file paths stored in a manifest or virtual dataset without changing the byte range information using the {py:meth}`ds.virtualize.rename_paths <virtualizarr.xarray.VirtualiZarrDatasetAccessor.rename_paths>` accessor method.
You can rewrite the file paths stored in a manifest or virtual dataset without changing the byte range information using the {py:meth}`vds.virtualize.rename_paths <virtualizarr.VirtualiZarrDatasetAccessor.rename_paths>` accessor method.

For example, you may want to rename file paths according to a function to reflect having moved the location of the referenced files from local storage to an S3 bucket.

Expand Down
1 change: 0 additions & 1 deletion pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -62,7 +62,6 @@ test = [
"virtualizarr[hdf_reader]"
]


[project.urls]
Home = "https://github.com/TomNicholas/VirtualiZarr"
Documentation = "https://github.com/TomNicholas/VirtualiZarr/blob/main/README.md"
Expand Down
6 changes: 5 additions & 1 deletion virtualizarr/backend.py
Original file line number Diff line number Diff line change
Expand Up @@ -105,14 +105,15 @@ def automatically_determine_filetype(
def open_virtual_dataset(
filepath: str,
*,
filetype: FileType | None = None,
filetype: FileType | str | None = None,
group: str | None = None,
drop_variables: Iterable[str] | None = None,
loadable_variables: Iterable[str] | None = None,
decode_times: bool | None = None,
cftime_variables: Iterable[str] | None = None,
indexes: Mapping[str, Index] | None = None,
virtual_array_class=ManifestArray,
virtual_backend_kwargs: Optional[dict] = None,
reader_options: Optional[dict] = None,
backend: Optional[VirtualBackend] = None,
) -> Dataset:
Expand Down Expand Up @@ -147,6 +148,8 @@ def open_virtual_dataset(
virtual_array_class
Virtual array class to use to represent the references to the chunks in each on-disk array.
Currently can only be ManifestArray, but once VirtualZarrArray is implemented the default should be changed to that.
virtual_backend_kwargs: dict, default is None
Dictionary of keyword arguments passed down to this reader. Allows passing arguments specific to certain readers.
reader_options: dict, default {}
Dict passed into Kerchunk file readers, to allow reading from remote filesystems.
Note: Each Kerchunk file reader has distinct arguments, so ensure reader_options match selected Kerchunk reader arguments.
Expand Down Expand Up @@ -201,6 +204,7 @@ def open_virtual_dataset(
loadable_variables=loadable_variables,
decode_times=decode_times,
indexes=indexes,
virtual_backend_kwargs=virtual_backend_kwargs,
reader_options=reader_options,
)

Expand Down
19 changes: 0 additions & 19 deletions virtualizarr/manifests/array.py
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,6 @@
_isnan,
)
from virtualizarr.manifests.manifest import ChunkManifest
from virtualizarr.types.kerchunk import KerchunkArrRefs
from virtualizarr.zarr import ZArray


Expand Down Expand Up @@ -62,24 +61,6 @@ def __init__(
self._zarray = _zarray
self._manifest = _chunkmanifest

@classmethod
def _from_kerchunk_refs(cls, arr_refs: KerchunkArrRefs) -> "ManifestArray":
from virtualizarr.translators.kerchunk import (
fully_decode_arr_refs,
parse_array_refs,
)

decoded_arr_refs = fully_decode_arr_refs(arr_refs)

chunk_dict, zarray, _zattrs = parse_array_refs(decoded_arr_refs)
manifest = ChunkManifest._from_kerchunk_chunk_dict(chunk_dict)

obj = object.__new__(cls)
obj._manifest = manifest
obj._zarray = zarray

return obj

@property
def manifest(self) -> ChunkManifest:
return self._manifest
Expand Down
Loading

0 comments on commit d6ef97f

Please sign in to comment.