Merge branch 'main' into icechunk-append

zarr-developers · Dec 4, 2024 · d6ef97f · d6ef97f
2 parents 64e5277 + 13e9077
commit d6ef97f
Show file tree

Hide file tree

Showing 31 changed files with 1,056 additions and 463 deletions.
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
@@ -11,7 +11,7 @@ repos:
 
   - repo: https://github.com/astral-sh/ruff-pre-commit
     # Ruff version.
-    rev: "v0.7.2"
+    rev: "v0.8.1"
     hooks:
       # Run the linter.
       - id: ruff

diff --git a/README.md b/README.md
@@ -20,7 +20,7 @@ VirtualiZarr (pronounced like "virtualizer" but more piratey) grew out of [discu
 
 You now have a choice between using VirtualiZarr and Kerchunk: VirtualiZarr provides [almost all the same features](https://virtualizarr.readthedocs.io/en/latest/faq.html#how-do-virtualizarr-and-kerchunk-compare) as Kerchunk.
 
-_Please see the [documentation](https://virtualizarr.readthedocs.io/en/stable/api.html)_
+_Please see the [documentation](https://virtualizarr.readthedocs.io/en/stable/index.html)_
 
 ### Development Status and Roadmap
 
@@ -38,6 +38,13 @@ We have a lot of ideas, including:
 
 If you see other opportunities then we would love to hear your ideas!
 
+### Presentations
+
+- 2024/11/21 - MET Office Architecture Guild - Tom Nicholas - [Slides](https://speakerdeck.com/tomnicholas/virtualizarr-talk-at-met-office)
+- 2024/?/? - Cloud-Native Geospatial conference - Raphael Hagen - [Slides](https://decks.carbonplan.org/cloud-native-geo/11-13-24)
+- 2024/07/24 - ESIP Meeting - Sean Harkins - [Event](https://2024julyesipmeeting.sched.com/event/1eVP6) / [Recording](https://youtu.be/T6QAwJIwI3Q?t=3689)
+- 2024/05/15 - Pangeo showcase - Tom Nicholas - [Event](https://discourse.pangeo.io/t/pangeo-showcase-virtualizarr-create-virtual-zarr-stores-using-xarray-syntax/4127/2) / [Recording](https://youtu.be/ioxgzhDaYiE) / [Slides](https://speakerdeck.com/tomnicholas/virtualizarr-create-virtual-zarr-stores-using-xarray-syntax)
+
 ### Credits
 
 This package was originally developed by [Tom Nicholas](https://github.com/TomNicholas) whilst working at [[C]Worthy](cworthy.org), who deserve credit for allowing him to prioritise a generalizable open-source solution to the dataset virtualization problem. VirtualiZarr is now a community-owned multi-stakeholder project.

diff --git a/docs/releases.rst b/docs/releases.rst
@@ -9,11 +9,16 @@ v1.1.1 (unreleased)
 New Features
 ~~~~~~~~~~~~
 
+- Add a ``virtual_backend_kwargs`` keyword argument to file readers and to ``open_virtual_dataset``, to allow reader-specific options to be passed down.
+  (:pull:`315`) By `Tom Nicholas <https://github.com/TomNicholas>`_.
+
 Breaking changes
 ~~~~~~~~~~~~~~~~
 
 - Minimum required version of Xarray is now v2024.10.0.
   (:pull:`284`) By `Tom Nicholas <https://github.com/TomNicholas>`_.
+- Opening kerchunk-formatted references from disk which contain relative paths now requires passing the ``fs_root`` keyword argument via ``virtual_backend_kwargs``.
+  (:pull:`243`) By `Tom Nicholas <https://github.com/TomNicholas>`_.
 
 Deprecations
 ~~~~~~~~~~~~
@@ -39,6 +44,10 @@ Documentation
   (:pull:`298`) By `Tom Nicholas <https://github.com/TomNicholas>`_.
 - More minor improvements to the Contributing Guide.
   (:pull:`304`) By `Doug Latornell <https://github.com/DougLatornell>`_.
+- Correct some links to the API.
+  (:pull:`325`) By `Tom Nicholas <https://github.com/TomNicholas>`_.
+- Added links to recorded presentations on VirtualiZarr.
+  (:pull:`313`) By `Tom Nicholas <https://github.com/TomNicholas>`_.
 
 Internal Changes
 ~~~~~~~~~~~~~~~~
@@ -47,6 +56,8 @@ Internal Changes
   (:pull:`87`) By `Sean Harkins <https://github.com/sharkinsspatial>`_.
 - Support downstream type checking by adding py.typed marker file.
   (:pull:`306`) By `Max Jones <https://github.com/maxrjones>`_.
+- File paths in chunk manifests are now always stored as abolute URIs.
+  (:pull:`243`) By `Tom Nicholas <https://github.com/TomNicholas>`_.
 
 .. _v1.1.0:
 

diff --git a/docs/usage.md b/docs/usage.md
@@ -17,15 +17,15 @@ ds = xr.tutorial.open_dataset('air_temperature')
 ds.to_netcdf('air.nc')
 ```
 
-We can open a virtual representation of this file using {py:func}`open_virtual_dataset <virtualizarr.xarray.open_virtual_dataset>`.
+We can open a virtual representation of this file using {py:func}`open_virtual_dataset <virtualizarr.open_virtual_dataset>`.
 
 ```python
 from virtualizarr import open_virtual_dataset
 
 vds = open_virtual_dataset('air.nc')
 ```
 
-(Notice we did not have to explicitly indicate the file format, as {py:func}`open_virtual_dataset <virtualizarr.xarray.open_virtual_dataset>` will attempt to automatically infer it.)
+(Notice we did not have to explicitly indicate the file format, as {py:func}`open_virtual_dataset <virtualizarr.open_virtual_dataset>` will attempt to automatically infer it.)
 
 
 ```{note}
@@ -106,10 +106,10 @@ manifest = marr.manifest
 manifest.dict()
 ```
 ```python
-{'0.0.0': {'path': 'air.nc', 'offset': 15419, 'length': 7738000}}
+{'0.0.0': {'path': 'file:///work/data/air.nc', 'offset': 15419, 'length': 7738000}}
 ```
 
-In this case we can see that the `"air"` variable contains only one chunk, the bytes for which live in the `air.nc` file at the location given by the `'offset'` and `'length'` attributes.
+In this case we can see that the `"air"` variable contains only one chunk, the bytes for which live in the `file:///work/data/air.nc` file, at the location given by the `'offset'` and `'length'` attributes.
 
 The {py:class}`ChunkManifest <virtualizarr.manifests.ChunkManifest>` class is virtualizarr's internal in-memory representation of this manifest.
 
@@ -153,8 +153,8 @@ ManifestArray<shape=(5840, 25, 53), dtype=int16, chunks=(2920, 25, 53)>
 concatenated.manifest.dict()
 ```
 ```
-{'0.0.0': {'path': 'air.nc', 'offset': 15419, 'length': 7738000},
- '1.0.0': {'path': 'air.nc', 'offset': 15419, 'length': 7738000}}
+{'0.0.0': {'path': 'file:///work/data/air.nc', 'offset': 15419, 'length': 7738000},
+ '1.0.0': {'path': 'file:///work/data/air.nc', 'offset': 15419, 'length': 7738000}}
 ```
 
 This concatenation property is what will allow us to combine the data from multiple netCDF files on disk into a single Zarr store containing arrays of many chunks.
@@ -254,8 +254,8 @@ We can see that the resulting combined manifest has two chunks, as expected.
 combined_vds['air'].data.manifest.dict()
 ```
 ```
-{'0.0.0': {'path': 'air1.nc', 'offset': 15419, 'length': 3869000},
- '1.0.0': {'path': 'air2.nc', 'offset': 15419, 'length': 3869000}}
+{'0.0.0': {'path': 'file:///work/data/air1.nc', 'offset': 15419, 'length': 3869000},
+ '1.0.0': {'path': 'file:///work/data/air2.nc', 'offset': 15419, 'length': 3869000}}
 ```
 
 ```{note}
@@ -327,8 +327,6 @@ Loading variables can be useful in a few scenarios:
 
 To correctly decode time variables according to the CF conventions, you need to pass `time` to `loadable_variables` and ensure the `decode_times` argument of `open_virtual_dataset` is set to True (`decode_times` defaults to None).
 
-
-
 ```python
 vds = open_virtual_dataset(
     'air.nc',
@@ -354,8 +352,6 @@ Attributes:
     title:        4x daily NMC reanalysis (1948)
 ```
 
-
-
 ## Writing virtual stores to disk
 
 Once we've combined references to all the chunks of all our legacy files into one virtual xarray dataset, we still need to write these references out to disk so that they can be read by our analysis code later.
@@ -364,7 +360,7 @@ Once we've combined references to all the chunks of all our legacy files into on
 
 The [kerchunk library](https://github.com/fsspec/kerchunk) has its own [specification](https://fsspec.github.io/kerchunk/spec.html) for how byte range references should be serialized (either as a JSON or parquet file).
 
-To write out all the references in the virtual dataset as a single kerchunk-compliant JSON or parquet file, you can use the {py:meth}`ds.virtualize.to_kerchunk <virtualizarr.xarray.VirtualiZarrDatasetAccessor.to_kerchunk>` accessor method.
+To write out all the references in the virtual dataset as a single kerchunk-compliant JSON or parquet file, you can use the {py:meth}`vds.virtualize.to_kerchunk <virtualizarr.VirtualiZarrDatasetAccessor.to_kerchunk>` accessor method.
 
 ```python
 combined_vds.virtualize.to_kerchunk('combined.json', format='json')
@@ -398,7 +394,7 @@ By default references are placed in separate parquet file when the total number
 
 ### Writing to an Icechunk Store
 
-We can also write these references out as an [IcechunkStore](https://icechunk.io/). `Icechunk` is a Open-source, cloud-native transactional tensor storage engine that is compatible with zarr version 3. To export our virtual dataset to an `Icechunk` Store, we simply use the {py:meth}`ds.virtualize.to_icechunk <virtualizarr.xarray.VirtualiZarrDatasetAccessor.to_icechunk>` accessor method.
+We can also write these references out as an [IcechunkStore](https://icechunk.io/). `Icechunk` is a Open-source, cloud-native transactional tensor storage engine that is compatible with zarr version 3. To export our virtual dataset to an `Icechunk` Store, we simply use the {py:meth}`vds.virtualize.to_icechunk <virtualizarr.VirtualiZarrDatasetAccessor.to_icechunk>` accessor method.
 
 ```python
 # create an icechunk store
@@ -415,7 +411,7 @@ See the [Icechunk documentation](https://icechunk.io/icechunk-python/virtual/#cr
 
 ### Writing as Zarr
 
-Alternatively, we can write these references out as an actual Zarr store, at least one that is compliant with the [proposed "Chunk Manifest" ZEP](https://github.com/zarr-developers/zarr-specs/issues/287). To do this we simply use the {py:meth}`ds.virtualize.to_zarr <virtualizarr.xarray.VirtualiZarrDatasetAccessor.to_zarr>` accessor method.
+Alternatively, we can write these references out as an actual Zarr store, at least one that is compliant with the [proposed "Chunk Manifest" ZEP](https://github.com/zarr-developers/zarr-specs/issues/287). To do this we simply use the {py:meth}`vds.virtualize.to_zarr <virtualizarr.VirtualiZarrDatasetAccessor.to_zarr>` accessor method.
 
 ```python
 combined_vds.virtualize.to_zarr('combined.zarr')
@@ -435,28 +431,45 @@ The advantage of this format is that any zarr v3 reader that understands the chu
 ```{note}
 Currently there are not yet any zarr v3 readers which understand the chunk manifest ZEP, so until then this feature cannot be used for data processing.
 
-This store can however be read by {py:func}`~virtualizarr.xarray.open_virtual_dataset`, by passing `filetype="zarr_v3"`.
+This store can however be read by {py:func}`~virtualizarr.open_virtual_dataset`, by passing `filetype="zarr_v3"`.
 ```
 
 ## Opening Kerchunk references as virtual datasets
 
 You can open existing Kerchunk `json` or `parquet` references as Virtualizarr virtual datasets. This may be useful for converting existing Kerchunk formatted references to storage formats like [Icechunk](https://icechunk.io/).
 
 ```python
-
 vds = open_virtual_dataset('combined.json', filetype='kerchunk', indexes={})
 # or
 vds = open_virtual_dataset('combined.parquet', filetype='kerchunk', indexes={})
+```
+
+One difference between the kerchunk references format and virtualizarr's internal manifest representation (as well as icechunk's format) is that paths in kerchunk references can be relative paths. Opening kerchunk references that contain relative local filepaths therefore requires supplying another piece of information: the directory of the ``fsspec`` filesystem which the filepath was defined relative to.
+
+You can dis-ambuiguate kerchunk references containing relative paths by passing the ``fs_root`` kwarg to ``virtual_backend_kwargs``.
 
+```python
+# file `relative_refs.json` contains a path like './file.nc'
+
+vds = open_virtual_dataset(
+    'relative_refs.json',
+    filetype='kerchunk',
+    indexes={},
+    virtual_backend_kwargs={'fs_root': 'file:///some_directory/'}
+)
+
+# the path in the virtual dataset will now be 'file:///some_directory/file.nc'
 ```
 
+Note that as the virtualizarr {py:meth}`vds.virtualize.to_kerchunk <virtualizarr.VirtualiZarrDatasetAccessor.to_kerchunk>` method only writes absolute paths, the only scenario in which you might come across references containing relative paths is if you are opening references that were previously created using the ``kerchunk`` library alone.
+
 ## Rewriting existing manifests
 
 Sometimes it can be useful to rewrite the contents of an already-generated manifest or virtual dataset.
 
 ### Rewriting file paths
 
-You can rewrite the file paths stored in a manifest or virtual dataset without changing the byte range information using the {py:meth}`ds.virtualize.rename_paths <virtualizarr.xarray.VirtualiZarrDatasetAccessor.rename_paths>` accessor method.
+You can rewrite the file paths stored in a manifest or virtual dataset without changing the byte range information using the {py:meth}`vds.virtualize.rename_paths <virtualizarr.VirtualiZarrDatasetAccessor.rename_paths>` accessor method.
 
 For example, you may want to rename file paths according to a function to reflect having moved the location of the referenced files from local storage to an S3 bucket.
 

diff --git a/pyproject.toml b/pyproject.toml
@@ -62,7 +62,6 @@ test = [
     "virtualizarr[hdf_reader]"
 ]
 
-
 [project.urls]
 Home = "https://github.com/TomNicholas/VirtualiZarr"
 Documentation = "https://github.com/TomNicholas/VirtualiZarr/blob/main/README.md"

diff --git a/virtualizarr/backend.py b/virtualizarr/backend.py
@@ -105,14 +105,15 @@ def automatically_determine_filetype(
 def open_virtual_dataset(
     filepath: str,
     *,
-    filetype: FileType | None = None,
+    filetype: FileType | str | None = None,
     group: str | None = None,
     drop_variables: Iterable[str] | None = None,
     loadable_variables: Iterable[str] | None = None,
     decode_times: bool | None = None,
     cftime_variables: Iterable[str] | None = None,
     indexes: Mapping[str, Index] | None = None,
     virtual_array_class=ManifestArray,
+    virtual_backend_kwargs: Optional[dict] = None,
     reader_options: Optional[dict] = None,
     backend: Optional[VirtualBackend] = None,
 ) -> Dataset:
@@ -147,6 +148,8 @@ def open_virtual_dataset(
     virtual_array_class
         Virtual array class to use to represent the references to the chunks in each on-disk array.
         Currently can only be ManifestArray, but once VirtualZarrArray is implemented the default should be changed to that.
+    virtual_backend_kwargs: dict, default is None
+        Dictionary of keyword arguments passed down to this reader. Allows passing arguments specific to certain readers.
     reader_options: dict, default {}
         Dict passed into Kerchunk file readers, to allow reading from remote filesystems.
         Note: Each Kerchunk file reader has distinct arguments, so ensure reader_options match selected Kerchunk reader arguments.
@@ -201,6 +204,7 @@ def open_virtual_dataset(
         loadable_variables=loadable_variables,
         decode_times=decode_times,
         indexes=indexes,
+        virtual_backend_kwargs=virtual_backend_kwargs,
         reader_options=reader_options,
     )
 

diff --git a/virtualizarr/manifests/array.py b/virtualizarr/manifests/array.py
@@ -8,7 +8,6 @@
     _isnan,
 )
 from virtualizarr.manifests.manifest import ChunkManifest
-from virtualizarr.types.kerchunk import KerchunkArrRefs
 from virtualizarr.zarr import ZArray
 
 
@@ -62,24 +61,6 @@ def __init__(
         self._zarray = _zarray
         self._manifest = _chunkmanifest
 
-    @classmethod
-    def _from_kerchunk_refs(cls, arr_refs: KerchunkArrRefs) -> "ManifestArray":
-        from virtualizarr.translators.kerchunk import (
-            fully_decode_arr_refs,
-            parse_array_refs,
-        )
-
-        decoded_arr_refs = fully_decode_arr_refs(arr_refs)
-
-        chunk_dict, zarray, _zattrs = parse_array_refs(decoded_arr_refs)
-        manifest = ChunkManifest._from_kerchunk_chunk_dict(chunk_dict)
-
-        obj = object.__new__(cls)
-        obj._manifest = manifest
-        obj._zarray = zarray
-
-        return obj
-
     @property
     def manifest(self) -> ChunkManifest:
         return self._manifest