Commit

Merge branch 'main' into test_fsspec_roundtrip
TomNicholas committed May 15, 2024
2 parents c2a20ce + 8923b8c commit 0460888
Showing 34 changed files with 905 additions and 307 deletions.
6 changes: 6 additions & 0 deletions .github/PULL_REQUEST_TEMPLATE.md
@@ -0,0 +1,6 @@
<!-- Feel free to remove check-list items that aren't relevant to your change -->

- [ ] Closes #xxxx
- [ ] Tests added
- [ ] User visible changes (including notable bug fixes) are documented in `changelog.md`
- [ ] New functions/methods are listed in `api.rst`
20 changes: 7 additions & 13 deletions .github/workflows/main.yml
@@ -22,28 +22,22 @@ jobs:
shell: bash -l {0}
strategy:
matrix:
python-version: ["3.9", "3.10", "3.11", "3.12"]
python-version: ["3.10", "3.11", "3.12"]
steps:
- uses: actions/checkout@v4

- name: Create conda environment
uses: mamba-org/provision-with-micromamba@main
- name: Setup Python
id: setup-python
uses: actions/setup-python@v5
with:
cache-downloads: true
micromamba-version: 'latest'
environment-file: ci/environment.yml
extra-specs: |
python=${{ matrix.python-version }}
- name: Conda info
run: conda info
python-version: ${{ matrix.python-version }}
cache: pip
cache-dependency-path: pyproject.toml

- name: Install virtualizarr
run: |
python -m pip install -e ".[test]"
- name: Conda list
run: conda list
- name: Running Tests
run: |
24 changes: 11 additions & 13 deletions .pre-commit-config.yaml
@@ -3,24 +3,24 @@ ci:
autoupdate_schedule: monthly
repos:
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v4.5.0
rev: v4.6.0
hooks:
- id: trailing-whitespace
- id: end-of-file-fixer
- id: check-yaml

- repo: https://github.com/astral-sh/ruff-pre-commit
rev: "v0.3.5"
# Ruff version.
rev: "v0.4.3"
hooks:
# Run the linter.
- id: ruff
args: ["--fix"]
# - repo: https://github.com/Carreau/velin
# rev: 0.0.8
# hooks:
# - id: velin
# args: ["--write", "--compact"]
args: [ --fix ]
# Run the formatter.
- id: ruff-format

- repo: https://github.com/pre-commit/mirrors-mypy
rev: v1.9.0
rev: v1.10.0
hooks:
- id: mypy
# Copied from setup.cfg
@@ -37,10 +37,8 @@ repos:
]
# run this occasionally, ref discussion https://github.com/pydata/xarray/pull/3194
# - repo: https://github.com/asottile/pyupgrade
# rev: v1.22.1
# rev: v3.15.2
# hooks:
# - id: pyupgrade
# args:
# - "--py3-only"
# # remove on f-strings in Py3.7
# - "--keep-percent-format"
# - "--py310-plus"
4 changes: 2 additions & 2 deletions ci/doc.yml
@@ -3,7 +3,7 @@ channels:
- conda-forge
- nodefaults
dependencies:
- python>=3.9
- python>=3.10
- "sphinx"
- pip
- pip:
@@ -13,4 +13,4 @@ dependencies:
- "sphinx_design"
- "sphinx_togglebutton"
- "sphinx-autodoc-typehints"
- -e ..
- -e "..[test]"
19 changes: 0 additions & 19 deletions ci/environment.yml

This file was deleted.

3 changes: 3 additions & 0 deletions docs/_static/custom.css
@@ -0,0 +1,3 @@
.bd-sidebar-primary {
display: none !important;
}
5 changes: 3 additions & 2 deletions docs/api.rst
@@ -4,6 +4,7 @@ API Reference

.. currentmodule:: virtualizarr

VirtualiZarr has a small API surface, because most of the complexity is handled by xarray functions like ``xarray.concat`` and ``xarray.merge``.

Manifests
=========
@@ -17,8 +18,8 @@ Manifests
ManifestArray


Xarray
======
Reading
=======

.. currentmodule:: virtualizarr.xarray
.. autosummary::
9 changes: 7 additions & 2 deletions docs/conf.py
@@ -28,6 +28,8 @@
# Add any paths that contain templates here, relative to this directory.
templates_path = ["_templates"]

# The master toctree document.
master_doc = "index"

# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
@@ -59,12 +61,15 @@
}
html_title = "VirtualiZarr"

# remove sidebar, see GH issue #82
html_css_files = [
'custom.css',
]

html_logo = "_static/_future_logo.png"

html_static_path = ["_static"]


# issues
# pangeo logo
# dark mode/lm switch
# needs to add api ref
7 changes: 0 additions & 7 deletions docs/dev_status_roadmap.md

This file was deleted.

60 changes: 60 additions & 0 deletions docs/faq.md
@@ -0,0 +1,60 @@
# FAQ

## How does this work?

I'm glad you asked! We can think of the problem of providing virtualized zarr-like access to a set of legacy files in some other format as a series of steps:

1) **Read byte ranges** - We use the various [kerchunk file format backends](https://fsspec.github.io/kerchunk/reference.html#file-format-backends) to determine which byte ranges within a given legacy file would have to be read in order to get a specific chunk of data we want.
2) **Construct a representation of a single file (or array within a file)** - Kerchunk's backends return a nested dictionary representing an entire file, but we instead immediately parse this dict and wrap it up into a set of `ManifestArray` objects. The record of where to look to find the file and the byte ranges is stored under the `ManifestArray.manifest` attribute, in a `ChunkManifest` object. Both steps (1) and (2) are handled by the `'virtualizarr'` xarray backend, which returns one `xarray.Dataset` object per file, each wrapping multiple `ManifestArray` instances (as opposed to e.g. numpy/dask arrays).
3) **Deduce the concatenation order** - The desired order of concatenation can either be inferred from the order in which the datasets are supplied (which is what `xr.combine_nested` assumes), or it can be read from the coordinate data in the files (which is what `xr.combine_by_coords` does). If the ordering information is not present as a coordinate (e.g. because it's in the filename), a pre-processing step might be required.
4) **Check that the desired concatenation is valid** - Whether called explicitly by the user or implicitly via `xr.combine_nested/combine_by_coords/open_mfdataset`, `xr.concat` is used to concatenate/stack the wrapped `ManifestArray` objects. When doing this, xarray will spend time checking that the array objects and any coordinate indexes can be safely aligned and concatenated. Along with opening files and loading coordinates in step (3), this is the main reason why `xr.open_mfdataset` can take a long time to return a dataset created from a large number of files.
5) **Combine into one big dataset** - `xr.concat` dispatches to the `concat/stack` methods of the underlying `ManifestArray` objects. These perform concatenation by merging their respective Chunk Manifests. Using xarray's `combine_*` methods means that we can handle multi-dimensional concatenations as well as merging many different variables.
6) **Serialize the combined result to disk** - The resultant `xr.Dataset` object wraps `ManifestArray` objects which contain the complete list of byte ranges for every chunk we might want to read. We now serialize this information to disk, either using the [kerchunk specification](https://fsspec.github.io/kerchunk/spec.html#version-1) or, in future, using [new Zarr extensions](https://github.com/zarr-developers/zarr-specs/issues/287) to write valid Zarr stores directly.
7) **Open the virtualized dataset from disk** - The virtualized zarr store can now be read from disk, skipping all the work we did above. Chunk reads from this store will be redirected to read the corresponding bytes in the original legacy files.

The above steps would also be performed when using the `kerchunk` library alone, but because (3), (4), (5), and (6) are all performed by the single `kerchunk.combine.MultiZarrToZarr` function, which exposes no internal abstractions, kerchunk's design is much less modular, and its use cases are limited by kerchunk's API surface. A minimal end-to-end sketch of the steps is shown below.
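To make the steps concrete, here is a minimal sketch of the whole pipeline in VirtualiZarr. The filenames and the `time` concatenation dimension are hypothetical, and exact keyword arguments (e.g. for index handling) may vary between versions; `coords='minimal'` and `compat='override'` are passed so that xarray does not try to compare the values of the virtual variables, which cannot be loaded.

```python
import fsspec
import xarray as xr
from virtualizarr import open_virtual_dataset

# Steps (1) & (2): kerchunk's backends locate the byte ranges, which are
# immediately wrapped as ManifestArrays inside ordinary xarray Datasets
vds1 = open_virtual_dataset("air_2021.nc")
vds2 = open_virtual_dataset("air_2022.nc")

# Steps (3)-(5): xarray checks that the concatenation is valid, then the
# wrapped ManifestArrays concatenate by merging their chunk manifests
combined_vds = xr.concat(
    [vds1, vds2], dim="time", coords="minimal", compat="override"
)

# Step (6): serialize the complete set of byte-range references to disk
combined_vds.virtualize.to_kerchunk("combined.json", format="json")

# Step (7): open the virtualized store; chunk reads are redirected to the
# corresponding byte ranges in the original files
fs = fsspec.filesystem("reference", fo="combined.json")
ds = xr.open_dataset(
    fs.get_mapper(""), engine="zarr", backend_kwargs={"consolidated": False}
)
```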

## How do VirtualiZarr and Kerchunk compare?

Users of kerchunk may find the following comparison table useful; it shows which features of kerchunk map onto which features of VirtualiZarr.

| Component / Feature | Kerchunk | VirtualiZarr |
| ------------------------------------------------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------ |
| **Generation of references from archival files (1)** | | |
| From a netCDF4/HDF5 file | `kerchunk.hdf.SingleHdf5ToZarr` | `open_virtual_dataset`, via `kerchunk.hdf.SingleHdf5ToZarr` or potentially `hidefix` |
| From a netCDF3 file | `kerchunk.netCDF3.NetCDF3ToZarr` | `open_virtual_dataset`, via `kerchunk.netCDF3.NetCDF3ToZarr` |
| From a COG / tiff file | `kerchunk.tiff.tiff_to_zarr` | `open_virtual_dataset`, via `kerchunk.tiff.tiff_to_zarr` or potentially `cog3pio` |
| From a Zarr v2 store | `kerchunk.zarr.ZarrToZarr` | `open_virtual_dataset`, via `kerchunk.zarr.ZarrToZarr` ? |
| From a GRIB2 file | `kerchunk.grib2.scan_grib` | `open_virtual_datatree`, via `kerchunk.grib2.scan_grib` ? |
| From a FITS file | `kerchunk.fits.process_file` | `open_virtual_dataset`, via `kerchunk.fits.process_file` ? |
| **In-memory representation (2)** | | |
| In-memory representation of byte ranges for single array | Part of a "reference `dict`" with keys for each chunk in array | `ManifestArray` instance (wrapping a `ChunkManifest` instance) |
| In-memory representation of actual data values | Encoded bytes directly serialized into the "reference `dict`", created on a per-chunk basis using the `inline_threshold` kwarg | `numpy.ndarray` instances, created on a per-variable basis using the `loadable_variables` kwarg |
| In-memory representation of entire file / store | Nested "reference `dict`" with keys for each array in file | `xarray.Dataset` with variables wrapping `ManifestArray` instances (or `numpy.ndarray` instances) |
| **Manipulation of in-memory references (3, 4 & 5)** | | |
| Combining references to multiple arrays representing different variables | `kerchunk.combine.MultiZarrToZarr` | `xarray.merge` |
| Combining references to multiple arrays representing the same variable | `kerchunk.combine.MultiZarrToZarr` using the `concat_dims` kwarg | `xarray.concat` |
| Combining references in coordinate order | `kerchunk.combine.MultiZarrToZarr` using the `coo_map` kwarg | `xarray.combine_by_coords` with in-memory xarray indexes created by loading coordinate variables first |
| Combining along multiple dimensions without coordinate data | n/a | `xarray.combine_nested` |
| **Parallelization** | | |
| Parallelized generation of references | Wrapping kerchunk's opener inside `dask.delayed` | Wrapping `open_virtual_dataset` inside `dask.delayed` but eventually instead using `xarray.open_mfdataset(..., parallel=True)` |
| Parallelized combining of references (tree-reduce) | `kerchunk.combine.auto_dask` | Wrapping `ManifestArray` objects within `dask.array.Array` objects inside `xarray.Dataset` to use dask's `concatenate` |
| **On-disk serialization (6) and reading (7)** | | |
| Kerchunk reference format as JSON | `ujson.dumps(h5chunks.translate())`, then read using an `fsspec.filesystem` mapper | `ds.virtualize.to_kerchunk('combined.json', format='JSON')`, then read using an `fsspec.filesystem` mapper |
| Kerchunk reference format as parquet | `df.refs_to_dataframe(out_dict, "combined.parq")`, then read using an `fsspec` `ReferenceFileSystem` mapper | `ds.virtualize.to_kerchunk('combined.parq', format='parquet')`, then read using an `fsspec` `ReferenceFileSystem` mapper |
| Zarr v3 store with `manifest.json` files | n/a | `ds.virtualize.to_zarr()`, then read via any Zarr v3 reader which implements the manifest storage transformer ZEP |
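As a concrete illustration of the combining rows above, here is a sketch of the same multi-file combine in each library. The filenames are hypothetical; the kerchunk side follows its documented `SingleHdf5ToZarr`/`MultiZarrToZarr` pattern, and the VirtualiZarr side assumes that loading the `time` coordinate via `loadable_variables` gives xarray the in-memory index it needs to order the datasets.

```python
import fsspec
import ujson
import xarray as xr
from kerchunk.combine import MultiZarrToZarr
from kerchunk.hdf import SingleHdf5ToZarr
from virtualizarr import open_virtual_dataset

paths = ["air_2021.nc", "air_2022.nc"]

# Kerchunk: references are nested dicts, combined by MultiZarrToZarr
def refs_for(path):
    with fsspec.open(path) as f:
        return SingleHdf5ToZarr(f, path).translate()

combined_refs = MultiZarrToZarr(
    [refs_for(p) for p in paths], concat_dims=["time"]
).translate()
with open("combined.json", "w") as f:
    f.write(ujson.dumps(combined_refs))

# VirtualiZarr: references are ManifestArrays inside xarray Datasets,
# combined in coordinate order by xarray itself
vds = [open_virtual_dataset(p, loadable_variables=["time"]) for p in paths]
combined_vds = xr.combine_by_coords(vds, coords="minimal", compat="override")
combined_vds.virtualize.to_kerchunk("combined.json", format="json")
```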

## Why a new project?

The reasons why VirtualiZarr has been developed as a separate project, rather than by contributing to the Kerchunk library upstream, are:
- Kerchunk aims to support non-Zarr-like formats too [(1)](https://github.com/fsspec/kerchunk/issues/386#issuecomment-1795379571) [(2)](https://github.com/zarr-developers/zarr-specs/issues/287#issuecomment-1944439368), whereas VirtualiZarr is more strictly scoped, and may eventually be very tightly integrated with the Zarr-Python library itself,
- Once the VirtualiZarr feature list above is complete, it will likely not share any code with the Kerchunk library, nor import it,
- The API design of VirtualiZarr is deliberately [completely different](https://github.com/fsspec/kerchunk/issues/377#issuecomment-1922688615) to Kerchunk's API, so integration into Kerchunk would have meant duplicated functionality,
- Refactoring Kerchunk's existing API to maintain backwards compatibility would have been [challenging](https://github.com/fsspec/kerchunk/issues/434).

## What is the Development Status and Roadmap?

VirtualiZarr is ready to use for many of the tasks that we are used to using kerchunk for, but the most general and powerful vision of this library can only be implemented once certain changes upstream in Zarr have occurred.

VirtualiZarr is therefore evolving in tandem with developments in the Zarr Specification, which then need to be implemented in specific Zarr reader implementations (especially the Zarr-Python V3 implementation). There is an [overall roadmap for this integration with Zarr](https://hackmd.io/t9Myqt0HR7O0nq6wiHWCDA), whose final completion requires acceptance of at least two new Zarr Enhancement Proposals (the ["Chunk Manifest"](https://github.com/zarr-developers/zarr-specs/issues/287) and ["Virtual Concatenation"](https://github.com/zarr-developers/zarr-specs/issues/288) ZEPs).

While we wait for these upstream changes, VirtualiZarr aims to provide utility for a significant subset of use cases, for example by enabling virtualized zarr stores to be written out to the existing kerchunk references format, so that they can be read by fsspec today.
