Paths as URIs (#243)
* validate that paths can be coerced to valid URIs

* add a test that paths are converted to URIs

* added test and better error if local path is not absolute

* raise more informative error if path is not absolute

* test that empty paths are allowed

* add failing test for raising on malformed paths

* fix paths in tests

* fix more tests

* remove the file:/// prefix when writing to kerchunk format

* absolute paths in recently-added tests

* absolute paths in recently-added tests

* fix one more test

* stop wrapping specific error in less useful one

* moved remaining kerchunk parsing logic out into translator file

* add fs_root parameter to validation fn

* demote ChunkEntry to a TypedDict to separate validation fn

* actually instead add new constructor method to TypedDict

* test kerchunk writer with absolute paths

* make kerchunk reader tests pass

* try to implement the fs_root concatenation

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* implementation using cloudpathlib working

* check that fs_root doesn't have a filetype suffix

* add cloudpathlib to dependencies

* allow http paths, and require file suffixes

* unit tests for path validation

* test whether we can catch malformed paths

* test fs_root

* add (unimplemented) validate_entries kwarg to .from_arrays

* add .keys(), .values(), .items()

* test validate_paths during .from_arrays

* ensure validation actually normalizes paths to uris

* test .rename_paths correctly validates paths

* some release notes

* remove now-superfluous coercion to URI in icechunk writer

* added link to icechunk writer performance benchmark

* add reader_kwargs argument to open_virtual_dataset, and pass it down to every reader

* ensure relative paths containing .. can be normalized

* ensure HDF5 reader always returns absolute URIs

* ensure HDF reader always returns absolute URIs

* add relative path handling to other kerchunk-based readers

* add dmrpp relative path integration test

* fix kerchunk relative paths test by plugging through fs_root kwarg

* fix dmrpp tests by using absolute filepaths in DMR++ contents

* clarify new dmrpp test

* test handling of relative filepaths to dmrpp files

* group related tests

* removed cloudpathlib from validation code

* fix bug but restrict fs_root to only handle filesystem paths, not bucket url prefixes

* global list of recognized URI prefixes

* cleanup

* remove cloudpathlib from dependencies

* fix/ignore some typing errors

* rewrite tests to use a new dmrparser_factory

* rewrite using global dict of XML strings

* fix final test by explicitly passing in tmp_path instead of using a fixture which requests tmp_path

* fix bug with not converting Path objects to strings

* dmrpp relative paths tests passing

* fix type hint for filetype kwarg

* user documentation on fs_root

* change example manifests to use URIs

* reminder that rename_paths exists

* update release notes

* remove note about .rename_paths

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: ayushnag <[email protected]>
3 people authored Dec 3, 2024
1 parent 2bfeee3 commit a08b495
Showing 25 changed files with 711 additions and 252 deletions.
4 changes: 4 additions & 0 deletions docs/releases.rst
@@ -17,6 +17,8 @@ Breaking changes

- Minimum required version of Xarray is now v2024.10.0.
(:pull:`284`) By `Tom Nicholas <https://github.com/TomNicholas>`_.
- Opening kerchunk-formatted references from disk which contain relative paths now requires passing the ``fs_root`` keyword argument via ``virtual_backend_kwargs``.
(:pull:`243`) By `Tom Nicholas <https://github.com/TomNicholas>`_.

Deprecations
~~~~~~~~~~~~
@@ -50,6 +52,8 @@ Internal Changes
(:pull:`87`) By `Sean Harkins <https://github.com/sharkinsspatial>`_.
- Support downstream type checking by adding py.typed marker file.
(:pull:`306`) By `Max Jones <https://github.com/maxrjones>`_.
- File paths in chunk manifests are now always stored as absolute URIs.
(:pull:`243`) By `Tom Nicholas <https://github.com/TomNicholas>`_.

.. _v1.1.0:

37 changes: 25 additions & 12 deletions docs/usage.md
@@ -106,10 +106,10 @@ manifest = marr.manifest
manifest.dict()
```
```python
{'0.0.0': {'path': 'air.nc', 'offset': 15419, 'length': 7738000}}
{'0.0.0': {'path': 'file:///work/data/air.nc', 'offset': 15419, 'length': 7738000}}
```

In this case we can see that the `"air"` variable contains only one chunk, the bytes for which live in the `air.nc` file at the location given by the `'offset'` and `'length'` attributes.
In this case we can see that the `"air"` variable contains only one chunk, the bytes for which live in the `file:///work/data/air.nc` file, at the location given by the `'offset'` and `'length'` attributes.

The {py:class}`ChunkManifest <virtualizarr.manifests.ChunkManifest>` class is virtualizarr's internal in-memory representation of this manifest.

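As a sketch (assuming the `ChunkManifest` constructor accepts an `entries` mapping mirroring the `manifest.dict()` output above), such a manifest can be built by hand:

```python
from virtualizarr.manifests import ChunkManifest

# one entry per chunk key, each recording the file URI plus the byte
# range within that file where the chunk's data lives
manifest = ChunkManifest(
    entries={
        "0.0.0": {
            "path": "file:///work/data/air.nc",
            "offset": 15419,
            "length": 7738000,
        }
    }
)
```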
@@ -153,8 +153,8 @@ ManifestArray<shape=(5840, 25, 53), dtype=int16, chunks=(2920, 25, 53)>
concatenated.manifest.dict()
```
```
{'0.0.0': {'path': 'air.nc', 'offset': 15419, 'length': 7738000},
'1.0.0': {'path': 'air.nc', 'offset': 15419, 'length': 7738000}}
{'0.0.0': {'path': 'file:///work/data/air.nc', 'offset': 15419, 'length': 7738000},
'1.0.0': {'path': 'file:///work/data/air.nc', 'offset': 15419, 'length': 7738000}}
```

This concatenation property is what will allow us to combine the data from multiple netCDF files on disk into a single Zarr store containing arrays of many chunks.
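As a sketch of how `concatenated` above might have been produced (assuming the `marr` object from earlier; `ManifestArray` implements enough of the numpy array API for this to work):

```python
import numpy as np

# concatenating ManifestArrays merges their chunk manifests only;
# no chunk data is read from air.nc
concatenated = np.concatenate([marr, marr], axis=0)
print(concatenated.manifest.dict())
```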
@@ -254,8 +254,8 @@ We can see that the resulting combined manifest has two chunks, as expected.
combined_vds['air'].data.manifest.dict()
```
```
{'0.0.0': {'path': 'air1.nc', 'offset': 15419, 'length': 3869000},
'1.0.0': {'path': 'air2.nc', 'offset': 15419, 'length': 3869000}}
{'0.0.0': {'path': 'file:///work/data/air1.nc', 'offset': 15419, 'length': 3869000},
'1.0.0': {'path': 'file:///work/data/air2.nc', 'offset': 15419, 'length': 3869000}}
```

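For context, a hedged sketch of how a combined virtual dataset like `combined_vds` is typically produced (assuming `vds1` and `vds2` were opened from `air1.nc` and `air2.nc` with `open_virtual_dataset`):

```python
import xarray as xr

# concatenation happens purely at the manifest level; these kwargs
# avoid eagerly comparing or loading coordinate values
combined_vds = xr.concat(
    [vds1, vds2],
    dim="time",
    coords="minimal",
    compat="override",
)
```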
```{note}
@@ -327,8 +327,6 @@ Loading variables can be useful in a few scenarios:

To correctly decode time variables according to the CF conventions, you need to pass `time` to `loadable_variables` and ensure the `decode_times` argument of `open_virtual_dataset` is set to True (`decode_times` defaults to None).



```python
vds = open_virtual_dataset(
'air.nc',
@@ -354,8 +352,6 @@ Attributes:
title: 4x daily NMC reanalysis (1948)
```



## Writing virtual stores to disk

Once we've combined references to all the chunks of all our legacy files into one virtual xarray dataset, we still need to write these references out to disk so that they can be read by our analysis code later.
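For example, a minimal sketch using the kerchunk writer discussed below:

```python
# serialize the combined references as a kerchunk-formatted JSON file
combined_vds.virtualize.to_kerchunk("combined.json", format="json")
```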
@@ -443,20 +439,37 @@ This store can however be read by {py:func}`~virtualizarr.xarray.open_virtual_dataset`
You can open existing Kerchunk `json` or `parquet` references as Virtualizarr virtual datasets. This may be useful for converting existing Kerchunk formatted references to storage formats like [Icechunk](https://icechunk.io/).

```python
vds = open_virtual_dataset('combined.json', filetype='kerchunk', indexes={})
# or
vds = open_virtual_dataset('combined.parquet', filetype='kerchunk', indexes={})
```

One difference between the kerchunk references format and virtualizarr's internal manifest representation (as well as icechunk's format) is that paths in kerchunk references can be relative. Opening kerchunk references that contain relative local filepaths therefore requires supplying one extra piece of information: the root directory of the ``fsspec`` filesystem relative to which those paths were defined.

You can disambiguate kerchunk references containing relative paths by passing the ``fs_root`` kwarg via ``virtual_backend_kwargs``.

```python
# file `relative_refs.json` contains a path like './file.nc'

vds = open_virtual_dataset(
'relative_refs.json',
filetype='kerchunk',
indexes={},
virtual_backend_kwargs={'fs_root': 'file:///some_directory/'}
)

# the path in the virtual dataset will now be 'file:///some_directory/file.nc'
```

Note that because the virtualizarr {py:meth}`vds.virtualize.to_kerchunk <virtualizarr.xarray.VirtualiZarrDatasetAccessor.to_kerchunk>` method writes only absolute paths, the only scenario in which you might encounter references containing relative paths is when opening references that were created with the ``kerchunk`` library alone.

## Rewriting existing manifests

Sometimes it can be useful to rewrite the contents of an already-generated manifest or virtual dataset.

### Rewriting file paths

You can rewrite the file paths stored in a manifest or virtual dataset without changing the byte range information using the {py:meth}`ds.virtualize.rename_paths <virtualizarr.xarray.VirtualiZarrDatasetAccessor.rename_paths>` accessor method.
You can rewrite the file paths stored in a manifest or virtual dataset without changing the byte range information using the {py:meth}`vds.virtualize.rename_paths <virtualizarr.xarray.VirtualiZarrDatasetAccessor.rename_paths>` accessor method.

For example, you may want to rename file paths according to a function to reflect having moved the location of the referenced files from local storage to an S3 bucket.

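A minimal sketch of such a renaming function (the bucket name is hypothetical):

```python
from pathlib import Path


def local_to_s3_url(old_local_path: str) -> str:
    # keep the filename, swap the local directory for an S3 bucket URL
    filename = Path(old_local_path).name
    return f"s3://my-bucket/{filename}"


renamed_vds = vds.virtualize.rename_paths(local_to_s3_url)
```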
1 change: 0 additions & 1 deletion pyproject.toml
@@ -59,7 +59,6 @@ test = [
"virtualizarr[hdf_reader]"
]


[project.urls]
Home = "https://github.com/TomNicholas/VirtualiZarr"
Documentation = "https://github.com/TomNicholas/VirtualiZarr/blob/main/README.md"
2 changes: 1 addition & 1 deletion virtualizarr/backend.py
@@ -105,7 +105,7 @@ def automatically_determine_filetype(
def open_virtual_dataset(
filepath: str,
*,
filetype: FileType | None = None,
filetype: FileType | str | None = None,
group: str | None = None,
drop_variables: Iterable[str] | None = None,
loadable_variables: Iterable[str] | None = None,
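The widened `filetype` annotation reflects that a plain string is accepted as well as a `FileType` member, as the docs above already do; a sketch, assuming both spellings resolve to the same reader:

```python
from virtualizarr import open_virtual_dataset

# a bare string now satisfies the type checker too
vds = open_virtual_dataset("combined.json", filetype="kerchunk", indexes={})
```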
19 changes: 0 additions & 19 deletions virtualizarr/manifests/array.py
@@ -8,7 +8,6 @@
_isnan,
)
from virtualizarr.manifests.manifest import ChunkManifest
from virtualizarr.types.kerchunk import KerchunkArrRefs
from virtualizarr.zarr import ZArray


@@ -62,24 +61,6 @@ def __init__(
self._zarray = _zarray
self._manifest = _chunkmanifest

@classmethod
def _from_kerchunk_refs(cls, arr_refs: KerchunkArrRefs) -> "ManifestArray":
from virtualizarr.translators.kerchunk import (
fully_decode_arr_refs,
parse_array_refs,
)

decoded_arr_refs = fully_decode_arr_refs(arr_refs)

chunk_dict, zarray, _zattrs = parse_array_refs(decoded_arr_refs)
manifest = ChunkManifest._from_kerchunk_chunk_dict(chunk_dict)

obj = object.__new__(cls)
obj._manifest = manifest
obj._zarray = zarray

return obj

@property
def manifest(self) -> ChunkManifest:
return self._manifest
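Per the commit message, the kerchunk parsing steps deleted above now live in the translator layer; a sketch reconstructing the same flow using the names from the removed code (given a kerchunk `arr_refs` dict; the exact post-refactor call sites may differ):

```python
from virtualizarr.manifests import ChunkManifest
from virtualizarr.translators.kerchunk import (
    fully_decode_arr_refs,
    parse_array_refs,
)

# the same three steps the removed classmethod performed:
decoded_arr_refs = fully_decode_arr_refs(arr_refs)  # decode b64/json refs
chunk_dict, zarray, _zattrs = parse_array_refs(decoded_arr_refs)
manifest = ChunkManifest._from_kerchunk_chunk_dict(chunk_dict)
```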
