
Commit 24f7131

Merge branch 'main' into numpy_arrays_manifest

TomNicholas committed Jun 10, 2024
2 parents: 6079198 + cc97112

Showing 22 changed files with 573 additions and 130 deletions.
6 changes: 6 additions & 0 deletions .github/PULL_REQUEST_TEMPLATE.md
@@ -0,0 +1,6 @@
<!-- Feel free to remove check-list items that aren't relevant to your change -->

- [ ] Closes #xxxx
- [ ] Tests added
- [ ] Changes are documented in `docs/releases.rst`
- [ ] New functions/methods are listed in `api.rst`
2 changes: 1 addition & 1 deletion .github/workflows/main.yml
@@ -22,7 +22,7 @@ jobs:
shell: bash -l {0}
strategy:
matrix:
python-version: ["3.9", "3.10", "3.11", "3.12"]
python-version: ["3.10", "3.11", "3.12"]
steps:
- uses: actions/checkout@v4

8 changes: 3 additions & 5 deletions .pre-commit-config.yaml
@@ -11,7 +11,7 @@ repos:

- repo: https://github.com/astral-sh/ruff-pre-commit
# Ruff version.
rev: "v0.4.3"
rev: "v0.4.7"
hooks:
# Run the linter.
- id: ruff
@@ -37,10 +37,8 @@ repos:
]
# run this occasionally, ref discussion https://github.com/pydata/xarray/pull/3194
# - repo: https://github.com/asottile/pyupgrade
-# rev: v1.22.1
+# rev: v3.15.2
# hooks:
# - id: pyupgrade
# args:
# - "--py3-only"
# # remove on f-strings in Py3.7
# - "--keep-percent-format"
# - "--py310-plus"
4 changes: 2 additions & 2 deletions ci/doc.yml
@@ -3,7 +3,7 @@ channels:
- conda-forge
- nodefaults
dependencies:
-- python>=3.9
+- python>=3.10
- "sphinx"
- pip
- pip:
@@ -13,4 +13,4 @@ dependencies:
- "sphinx_design"
- "sphinx_togglebutton"
- "sphinx-autodoc-typehints"
-- -e ..
+- -e "..[test]"
2 changes: 1 addition & 1 deletion docs/index.md
@@ -82,5 +82,5 @@ installation
usage
faq
api
+releases
```
33 changes: 33 additions & 0 deletions docs/releases.rst
@@ -0,0 +1,33 @@
Release notes
=============

.. _v0.1:

v0.1 (unreleased)
-----------------

v0.1 is the first release of VirtualiZarr! It contains functionality for using kerchunk to find byte ranges in netCDF files,
for constructing an xarray.Dataset containing ManifestArray objects, and for writing such a dataset out to kerchunk references as either JSON or parquet.

New Features
~~~~~~~~~~~~


Breaking changes
~~~~~~~~~~~~~~~~


Deprecations
~~~~~~~~~~~~


Bug fixes
~~~~~~~~~


Documentation
~~~~~~~~~~~~~


Internal Changes
~~~~~~~~~~~~~~~~
37 changes: 29 additions & 8 deletions docs/usage.md
@@ -27,6 +27,7 @@ vds = open_virtual_dataset('air.nc')

(Notice we did not have to explicitly indicate the file format, as {py:func}`open_virtual_dataset <virtualizarr.xarray.open_virtual_dataset>` will attempt to automatically infer it.)

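If automatic inference ever fails, a format hint can be given explicitly. A minimal sketch follows; the `filetype` keyword and its accepted values are assumptions, not confirmed by this page:

```python
# Hypothetical: bypass format inference with an explicit hint
# (assumes a `filetype` keyword exists and accepts 'netcdf4').
vds = open_virtual_dataset('air.nc', filetype='netcdf4')
```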

```{note}
In future we would like it to be possible to just use `xr.open_dataset`, e.g.
@@ -61,6 +62,15 @@ Attributes:

These {py:class}`ManifestArray <virtualizarr.manifests.ManifestArray>` objects are each a virtual reference to some data in the `air.nc` netCDF file, with the references stored in the form of "Chunk Manifests".

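To peek at those references directly, you can inspect the wrapped array. A sketch, assuming `ManifestArray` exposes its chunk manifest via a `.manifest` attribute with a `.dict()` method (the chunk keys and byte ranges depend on the file):

```python
marr = vds['air'].data  # a ManifestArray wrapping a chunk manifest
marr.manifest.dict()    # e.g. {'0.0.0': {'path': 'air.nc', 'offset': ..., 'length': ...}}
```
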
### Opening remote files

To open remote files as virtual datasets, pass the `reader_options` argument, e.g.

```python
aws_credentials = {"key": ..., "secret": ...}
vds = open_virtual_dataset("s3://some-bucket/file.nc", reader_options={'storage_options': aws_credentials})
```

## Chunk Manifests

In the Zarr model N-dimensional arrays are stored as a series of compressed chunks, each labelled by a chunk key which indicates its position in the array. Whilst conventionally each of these Zarr chunks are a separate compressed binary file stored within a Zarr Store, there is no reason why these chunks could not actually already exist as part of another file (e.g. a netCDF file), and be loaded by reading a specific byte range from this pre-existing file.
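
For illustration, the mapping a chunk manifest stores might look like the following sketch (written as a Python literal; the paths and byte ranges are made up):

```python
# Each chunk key maps to a byte range inside a pre-existing file.
manifest = {
    "0.0": {"path": "air.nc", "offset": 6144, "length": 48000},
    "1.0": {"path": "air.nc", "offset": 54144, "length": 48000},
}
```
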
@@ -311,27 +321,38 @@ Once we've combined references to all the chunks of all our legacy files into one

The [kerchunk library](https://github.com/fsspec/kerchunk) has its own [specification](https://fsspec.github.io/kerchunk/spec.html) for how byte range references should be serialized (either as a JSON or parquet file).

-To write out all the references in the virtual dataset as a single kerchunk-compliant JSON file, you can use the {py:meth}`ds.virtualize.to_kerchunk <virtualizarr.xarray.VirtualiZarrDatasetAccessor.to_kerchunk>` accessor method.
+To write out all the references in the virtual dataset as a single kerchunk-compliant JSON or parquet file, you can use the {py:meth}`ds.virtualize.to_kerchunk <virtualizarr.xarray.VirtualiZarrDatasetAccessor.to_kerchunk>` accessor method.

```python
combined_vds.virtualize.to_kerchunk('combined.json', format='json')
```

-These references can now be interpreted like they were a Zarr store by [fsspec](https://github.com/fsspec/filesystem_spec), using kerchunk's built-in xarray backend (so you need kerchunk to be installed to use `engine='kerchunk'`).
+These references can now be interpreted like they were a Zarr store by [fsspec](https://github.com/fsspec/filesystem_spec), using kerchunk's built-in xarray backend (kerchunk must be installed to use `engine='kerchunk'`).

```python
-import fsspec
-
-fs = fsspec.filesystem("reference", fo=f"combined.json")
-mapper = fs.get_mapper("")
-
-combined_ds = xr.open_dataset(mapper, engine="kerchunk")
+combined_ds = xr.open_dataset('combined.json', engine="kerchunk")
```

+In-memory ("loadable") variables backed by numpy arrays can also be written out to kerchunk reference files, with the values serialized as bytes. This is equivalent to kerchunk's concept of "inlining", but done on a per-array basis using the `loadable_variables` kwarg rather than a per-chunk basis using kerchunk's `inline_threshold` kwarg.
+
+```{note}
+Currently you can only serialize in-memory variables to kerchunk references if they do not have any encoding.
+```
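
A sketch of that kwarg in use (the variable names are illustrative and depend on the file):

```python
# Load small coordinate variables into memory instead of virtualizing
# them, so their actual values can be inlined in the references.
vds = open_virtual_dataset('air.nc', loadable_variables=['time', 'lat', 'lon'])
```
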
+When you have many chunks, the reference file can get large enough to be unwieldy as JSON. In that case the references can instead be stored as parquet. Again this uses kerchunk internally.
+
+```python
+combined_vds.virtualize.to_kerchunk('combined.parq', format='parquet')
+```
+
-```{note}
-Currently you can only serialize virtual variables backed by `ManifestArray` objects to kerchunk reference files, not real in-memory numpy-backed variables.
-```
+And again we can read these references using the "kerchunk" backend as if it were a regular Zarr store:
+
+```python
+combined_ds = xr.open_dataset('combined.parq', engine="kerchunk")
+```
+
+By default references are placed in separate parquet files when the total number of references exceeds `record_size`. If there are fewer than `categorical_threshold` unique urls referenced by a particular variable, the url will be stored as a categorical variable.
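
As a sketch of tuning that behaviour (assuming `to_kerchunk` accepts these two options directly; the values shown are illustrative, not the actual defaults):

```python
combined_vds.virtualize.to_kerchunk(
    'combined.parq',
    format='parquet',
    record_size=100_000,       # start a new parquet file past this many references (illustrative)
    categorical_threshold=10,  # store urls categorically below this many unique urls (illustrative)
)
```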

### Writing as Zarr

Alternatively, we can write these references out as an actual Zarr store, at least one that is compliant with the [proposed "Chunk Manifest" ZEP](https://github.com/zarr-developers/zarr-specs/issues/287). To do this we simply use the {py:meth}`ds.virtualize.to_zarr <virtualizarr.xarray.VirtualiZarrDatasetAccessor.to_zarr>` accessor method.
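
A minimal sketch of that call (the store name is illustrative):

```python
# Write the virtual dataset as an on-disk Zarr store whose arrays
# contain chunk manifests rather than copies of the chunk data.
combined_vds.virtualize.to_zarr('combined.zarr')
```
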
13 changes: 8 additions & 5 deletions pyproject.toml
@@ -14,12 +14,11 @@ classifiers = [
"License :: OSI Approved :: Apache Software License",
"Operating System :: OS Independent",
"Programming Language :: Python",
"Programming Language :: Python :: 3.9",
"Programming Language :: Python :: 3.10",
"Programming Language :: Python :: 3.11",
"Programming Language :: Python :: 3.12",
]
requires-python = ">=3.9"
requires-python = ">=3.10"
dynamic = ["version"]
dependencies = [
"xarray>=2024.5.0",
@@ -29,19 +28,23 @@ dependencies = [
"numpy>=2.0.0rc1",
"ujson",
"packaging",
"universal-pathlib",
]

[project.optional-dependencies]
test = [
"codecov",
"pre-commit",
-"ruff",
"pytest-mypy",
"pytest-cov",
"pytest",
-"scipy",
"pooch",
+"ruff",
+"scipy",
"netcdf4",
"fsspec",
+"s3fs",
+"fastparquet",
]

