Paths as URIs (#243)
* validate that paths can be coerced to valid URIs

* add a test that paths are converted to URIs

* added test and better error if local path is not absolute

* raise more informative error if path is not absolute

* test that empty paths are allowed

* add failing test for raising on malformed paths

* fix paths in tests

* fix more tests

* remove the file:/// prefix when writing to kerchunk format

* absolute paths in recently-added tests

* absolute paths in recently-added tests

* fix one more test

* stop wrapping specific error in less useful one

* moved remaining kerchunk parsing logic out into translator file

* add fs_root parameter to validation fn

* demote ChunkEntry to a TypedDict to separate validation fn

* actually instead add new constructor method to TypedDict

* test kerchunk writer with absolute paths

* make kerchunk reader tests pass

* try to implement the fs_root concatenation

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* implementation using cloudpathlib working

* check that fs_root doesn't have a filetype suffix

* add cloudpathlib to dependencies

* allow http paths, and require file suffixes

* unit tests for path validation

* test whether we can catch malformed paths

* test fs_root

* add (unimplemented) validate_entries kwarg to .from_arrays

* add .keys(), .values(), .items()

* test validate_paths during .from_arrays

* ensure validation actually normalizes paths to uris

* test .rename_paths correctly validates paths

* some release notes

* remove now-superfluous coercion to URI in icechunk writer

* added link to icechunk writer performance benchmark

* add reader_kwargs argument to open_virtual_dataset, and pass it down to every reader

* ensure relative paths containing .. can be normalized

* ensure HDF5 reader always returns absolute URIs

* ensure HDF reader always returns absolute URIs

* add relative path handling to other kerchunk-based readers

* add dmrpp relative path integration test

* fix kerchunk relative paths test by plugging through fs_root kwarg

* fix dmrpp tests by using absolute filepaths in DMR++ contents

* clarify new dmrpp test

* test handling of relative filepaths to dmrpp files

* group related tests

* removed cloudpathlib from validation code

* fix bug but restrict fs_root to only handle filesystem paths, not bucket url prefixes

* global list of recognized URI prefixes

* cleanup

* remove cloudpathlib from dependencies

* fix/ignore some typing errors

* rewrite tests to use a new dmrparser_factory

* rewrite using global dict of XML strings

* fix final test by explicitly passing in tmp_path instead of using a fixture which requests tmp_path

* fix bug with not converting Path objects to strings

* dmrpp relative paths tests passing

* fix type hint for filetype kwarg

* user documentation on fs_root

* change example manifests to use URIs

* reminder that rename_paths exists

* update release notes

* remove note about .rename_paths

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: ayushnag <[email protected]>
3 people authored Dec 3, 2024
1 parent 2bfeee3 commit a08b495
Showing 25 changed files with 711 additions and 252 deletions.
4 changes: 4 additions & 0 deletions docs/releases.rst
@@ -17,6 +17,8 @@ Breaking changes

- Minimum required version of Xarray is now v2024.10.0.
(:pull:`284`) By `Tom Nicholas <https://github.com/TomNicholas>`_.
- Opening kerchunk-formatted references from disk which contain relative paths now requires passing the ``fs_root`` keyword argument via ``virtual_backend_kwargs``.
(:pull:`243`) By `Tom Nicholas <https://github.com/TomNicholas>`_.

Deprecations
~~~~~~~~~~~~
@@ -50,6 +52,8 @@ Internal Changes
(:pull:`87`) By `Sean Harkins <https://github.com/sharkinsspatial>`_.
- Support downstream type checking by adding py.typed marker file.
(:pull:`306`) By `Max Jones <https://github.com/maxrjones>`_.
- File paths in chunk manifests are now always stored as absolute URIs.
(:pull:`243`) By `Tom Nicholas <https://github.com/TomNicholas>`_.

.. _v1.1.0:

37 changes: 25 additions & 12 deletions docs/usage.md
@@ -106,10 +106,10 @@ manifest = marr.manifest
manifest.dict()
```
```python
{'0.0.0': {'path': 'air.nc', 'offset': 15419, 'length': 7738000}}
{'0.0.0': {'path': 'file:///work/data/air.nc', 'offset': 15419, 'length': 7738000}}
```

In this case we can see that the `"air"` variable contains only one chunk, the bytes for which live in the `air.nc` file at the location given by the `'offset'` and `'length'` attributes.
In this case we can see that the `"air"` variable contains only one chunk, the bytes for which live in the `file:///work/data/air.nc` file, at the location given by the `'offset'` and `'length'` attributes.

The {py:class}`ChunkManifest <virtualizarr.manifests.ChunkManifest>` class is virtualizarr's internal in-memory representation of this manifest.

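As a sketch (assuming the `ChunkManifest` constructor accepts an `entries` mapping mirroring the `manifest.dict()` output above), such a manifest can be built by hand:

```python
from virtualizarr.manifests import ChunkManifest

# one entry per chunk key, each recording the file URI plus the byte
# range within that file where the chunk's data lives
manifest = ChunkManifest(
    entries={
        "0.0.0": {
            "path": "file:///work/data/air.nc",
            "offset": 15419,
            "length": 7738000,
        }
    }
)
```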
@@ -153,8 +153,8 @@ ManifestArray<shape=(5840, 25, 53), dtype=int16, chunks=(2920, 25, 53)>
concatenated.manifest.dict()
```
```
{'0.0.0': {'path': 'air.nc', 'offset': 15419, 'length': 7738000},
'1.0.0': {'path': 'air.nc', 'offset': 15419, 'length': 7738000}}
{'0.0.0': {'path': 'file:///work/data/air.nc', 'offset': 15419, 'length': 7738000},
'1.0.0': {'path': 'file:///work/data/air.nc', 'offset': 15419, 'length': 7738000}}
```

This concatenation property is what will allow us to combine the data from multiple netCDF files on disk into a single Zarr store containing arrays of many chunks.
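As a sketch of how `concatenated` above might have been produced (assuming the `marr` object from earlier; `ManifestArray` implements enough of the numpy array API for this to work):

```python
import numpy as np

# concatenating ManifestArrays merges their chunk manifests only;
# no chunk data is read from air.nc
concatenated = np.concatenate([marr, marr], axis=0)
print(concatenated.manifest.dict())
```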
@@ -254,8 +254,8 @@ We can see that the resulting combined manifest has two chunks, as expected.
combined_vds['air'].data.manifest.dict()
```
```
{'0.0.0': {'path': 'air1.nc', 'offset': 15419, 'length': 3869000},
'1.0.0': {'path': 'air2.nc', 'offset': 15419, 'length': 3869000}}
{'0.0.0': {'path': 'file:///work/data/air1.nc', 'offset': 15419, 'length': 3869000},
'1.0.0': {'path': 'file:///work/data/air2.nc', 'offset': 15419, 'length': 3869000}}
```

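For context, a hedged sketch of how a combined virtual dataset like `combined_vds` is typically produced (assuming `vds1` and `vds2` were opened from `air1.nc` and `air2.nc` with `open_virtual_dataset`):

```python
import xarray as xr

# concatenation happens purely at the manifest level; these kwargs
# avoid eagerly comparing or loading coordinate values
combined_vds = xr.concat(
    [vds1, vds2],
    dim="time",
    coords="minimal",
    compat="override",
)
```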
```{note}
@@ -327,8 +327,6 @@ Loading variables can be useful in a few scenarios:

To correctly decode time variables according to the CF conventions, you need to pass `time` to `loadable_variables` and ensure the `decode_times` argument of `open_virtual_dataset` is set to True (`decode_times` defaults to None).



```python
vds = open_virtual_dataset(
'air.nc',
@@ -354,8 +352,6 @@ Attributes:
title: 4x daily NMC reanalysis (1948)
```



## Writing virtual stores to disk

Once we've combined references to all the chunks of all our legacy files into one virtual xarray dataset, we still need to write these references out to disk so that they can be read by our analysis code later.
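For example, a minimal sketch using the kerchunk writer discussed below:

```python
# serialize the combined references as a kerchunk-formatted JSON file
combined_vds.virtualize.to_kerchunk("combined.json", format="json")
```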
@@ -443,20 +439,37 @@ This store can however be read by {py:func}`~virtualizarr.xarray.open_virtual_dataset`
You can open existing Kerchunk `json` or `parquet` references as Virtualizarr virtual datasets. This may be useful for converting existing Kerchunk formatted references to storage formats like [Icechunk](https://icechunk.io/).

```python
vds = open_virtual_dataset('combined.json', filetype='kerchunk', indexes={})
# or
vds = open_virtual_dataset('combined.parquet', filetype='kerchunk', indexes={})
```

One difference between the kerchunk references format and virtualizarr's internal manifest representation (as well as icechunk's format) is that paths in kerchunk references can be relative. Opening kerchunk references that contain relative local filepaths therefore requires supplying one extra piece of information: the root directory of the ``fsspec`` filesystem relative to which those paths were defined.

You can disambiguate kerchunk references containing relative paths by passing the ``fs_root`` kwarg via ``virtual_backend_kwargs``.

```python
# file `relative_refs.json` contains a path like './file.nc'

vds = open_virtual_dataset(
'relative_refs.json',
filetype='kerchunk',
indexes={},
virtual_backend_kwargs={'fs_root': 'file:///some_directory/'}
)

# the path in the virtual dataset will now be 'file:///some_directory/file.nc'
```

Note that because the virtualizarr {py:meth}`vds.virtualize.to_kerchunk <virtualizarr.xarray.VirtualiZarrDatasetAccessor.to_kerchunk>` method writes only absolute paths, the only scenario in which you might encounter references containing relative paths is when opening references that were created with the ``kerchunk`` library alone.

## Rewriting existing manifests

Sometimes it can be useful to rewrite the contents of an already-generated manifest or virtual dataset.

### Rewriting file paths

You can rewrite the file paths stored in a manifest or virtual dataset without changing the byte range information using the {py:meth}`ds.virtualize.rename_paths <virtualizarr.xarray.VirtualiZarrDatasetAccessor.rename_paths>` accessor method.
You can rewrite the file paths stored in a manifest or virtual dataset without changing the byte range information using the {py:meth}`vds.virtualize.rename_paths <virtualizarr.xarray.VirtualiZarrDatasetAccessor.rename_paths>` accessor method.

For example, you may want to rename file paths according to a function to reflect having moved the location of the referenced files from local storage to an S3 bucket.

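A minimal sketch of such a renaming function (the bucket name is hypothetical):

```python
from pathlib import Path


def local_to_s3_url(old_local_path: str) -> str:
    # keep the filename, swap the local directory for an S3 bucket URL
    filename = Path(old_local_path).name
    return f"s3://my-bucket/{filename}"


renamed_vds = vds.virtualize.rename_paths(local_to_s3_url)
```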
1 change: 0 additions & 1 deletion pyproject.toml
@@ -59,7 +59,6 @@ test = [
"virtualizarr[hdf_reader]"
]


[project.urls]
Home = "https://github.com/TomNicholas/VirtualiZarr"
Documentation = "https://github.com/TomNicholas/VirtualiZarr/blob/main/README.md"
2 changes: 1 addition & 1 deletion virtualizarr/backend.py
@@ -105,7 +105,7 @@ def automatically_determine_filetype(
def open_virtual_dataset(
filepath: str,
*,
filetype: FileType | None = None,
filetype: FileType | str | None = None,
group: str | None = None,
drop_variables: Iterable[str] | None = None,
loadable_variables: Iterable[str] | None = None,
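The widened `filetype` annotation reflects that a plain string is accepted as well as a `FileType` member, as the docs above already do; a sketch, assuming both spellings resolve to the same reader:

```python
from virtualizarr import open_virtual_dataset

# a bare string now satisfies the type checker too
vds = open_virtual_dataset("combined.json", filetype="kerchunk", indexes={})
```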
19 changes: 0 additions & 19 deletions virtualizarr/manifests/array.py
@@ -8,7 +8,6 @@
_isnan,
)
from virtualizarr.manifests.manifest import ChunkManifest
from virtualizarr.types.kerchunk import KerchunkArrRefs
from virtualizarr.zarr import ZArray


@@ -62,24 +61,6 @@ def __init__(
self._zarray = _zarray
self._manifest = _chunkmanifest

@classmethod
def _from_kerchunk_refs(cls, arr_refs: KerchunkArrRefs) -> "ManifestArray":
from virtualizarr.translators.kerchunk import (
fully_decode_arr_refs,
parse_array_refs,
)

decoded_arr_refs = fully_decode_arr_refs(arr_refs)

chunk_dict, zarray, _zattrs = parse_array_refs(decoded_arr_refs)
manifest = ChunkManifest._from_kerchunk_chunk_dict(chunk_dict)

obj = object.__new__(cls)
obj._manifest = manifest
obj._zarray = zarray

return obj

@property
def manifest(self) -> ChunkManifest:
return self._manifest
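Per the commit message, the kerchunk parsing steps deleted above now live in the translator layer; a sketch reconstructing the same flow using the names from the removed code (given a kerchunk `arr_refs` dict; the exact post-refactor call sites may differ):

```python
from virtualizarr.manifests import ChunkManifest
from virtualizarr.translators.kerchunk import (
    fully_decode_arr_refs,
    parse_array_refs,
)

# the same three steps the removed classmethod performed:
decoded_arr_refs = fully_decode_arr_refs(arr_refs)  # decode b64/json refs
chunk_dict, zarray, _zattrs = parse_array_refs(decoded_arr_refs)
manifest = ChunkManifest._from_kerchunk_chunk_dict(chunk_dict)
```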
