
Commit 24f7131

Merge branch 'main' into numpy_arrays_manifest

TomNicholas committed Jun 10, 2024
2 parents: 6079198 + cc97112

Showing 22 changed files with 573 additions and 130 deletions.
6 changes: 6 additions & 0 deletions .github/PULL_REQUEST_TEMPLATE.md
@@ -0,0 +1,6 @@
<!-- Feel free to remove check-list items that aren't relevant to your change -->

- [ ] Closes #xxxx
- [ ] Tests added
- [ ] Changes are documented in `docs/releases.rst`
- [ ] New functions/methods are listed in `api.rst`
2 changes: 1 addition & 1 deletion .github/workflows/main.yml
@@ -22,7 +22,7 @@ jobs:
shell: bash -l {0}
strategy:
matrix:
python-version: ["3.9", "3.10", "3.11", "3.12"]
python-version: ["3.10", "3.11", "3.12"]
steps:
- uses: actions/checkout@v4

8 changes: 3 additions & 5 deletions .pre-commit-config.yaml
@@ -11,7 +11,7 @@ repos:

- repo: https://github.com/astral-sh/ruff-pre-commit
# Ruff version.
rev: "v0.4.3"
rev: "v0.4.7"
hooks:
# Run the linter.
- id: ruff
@@ -37,10 +37,8 @@ repos:
]
# run this occasionally, ref discussion https://github.com/pydata/xarray/pull/3194
# - repo: https://github.com/asottile/pyupgrade
-# rev: v1.22.1
+# rev: v3.15.2
# hooks:
# - id: pyupgrade
# args:
# - "--py3-only"
# # remove on f-strings in Py3.7
# - "--keep-percent-format"
# - "--py310-plus"
4 changes: 2 additions & 2 deletions ci/doc.yml
@@ -3,7 +3,7 @@ channels:
- conda-forge
- nodefaults
dependencies:
-- python>=3.9
+- python>=3.10
- "sphinx"
- pip
- pip:
@@ -13,4 +13,4 @@ dependencies:
- "sphinx_design"
- "sphinx_togglebutton"
- "sphinx-autodoc-typehints"
-- -e ..
+- -e "..[test]"
2 changes: 1 addition & 1 deletion docs/index.md
@@ -82,5 +82,5 @@ installation
usage
faq
api
+releases
```
33 changes: 33 additions & 0 deletions docs/releases.rst
@@ -0,0 +1,33 @@
Release notes
=============

.. _v0.1:

v0.1 (unreleased)
-----------------

v0.1 is the first release of VirtualiZarr! It contains functionality for using kerchunk to find byte ranges in netCDF files,
for constructing an xarray.Dataset containing ManifestArray objects, and for writing such a dataset out to kerchunk references as either JSON or parquet.

New Features
~~~~~~~~~~~~


Breaking changes
~~~~~~~~~~~~~~~~


Deprecations
~~~~~~~~~~~~


Bug fixes
~~~~~~~~~


Documentation
~~~~~~~~~~~~~


Internal Changes
~~~~~~~~~~~~~~~~
37 changes: 29 additions & 8 deletions docs/usage.md
@@ -27,6 +27,7 @@ vds = open_virtual_dataset('air.nc')

(Notice we did not have to explicitly indicate the file format, as {py:func}`open_virtual_dataset <virtualizarr.xarray.open_virtual_dataset>` will attempt to automatically infer it.)

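If automatic inference ever fails, a format hint can be given explicitly. A minimal sketch follows; the `filetype` keyword and its accepted values are assumptions, not confirmed by this page:

```python
# Hypothetical: bypass format inference with an explicit hint
# (assumes a `filetype` keyword exists and accepts 'netcdf4').
vds = open_virtual_dataset('air.nc', filetype='netcdf4')
```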

```{note}
In future we would like it to be possible to just use `xr.open_dataset`, e.g.
@@ -61,6 +62,15 @@ Attributes:

These {py:class}`ManifestArray <virtualizarr.manifests.ManifestArray>` objects are each a virtual reference to some data in the `air.nc` netCDF file, with the references stored in the form of "Chunk Manifests".

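To peek at those references directly, you can inspect the wrapped array. A sketch, assuming `ManifestArray` exposes its chunk manifest via a `.manifest` attribute with a `.dict()` method (the chunk keys and byte ranges depend on the file):

```python
marr = vds['air'].data  # a ManifestArray wrapping a chunk manifest
marr.manifest.dict()    # e.g. {'0.0.0': {'path': 'air.nc', 'offset': ..., 'length': ...}}
```
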
### Opening remote files

To open remote files as virtual datasets, pass the `reader_options` argument, e.g.

```python
aws_credentials = {"key": ..., "secret": ...}
vds = open_virtual_dataset("s3://some-bucket/file.nc", reader_options={'storage_options': aws_credentials})
```

## Chunk Manifests

In the Zarr model N-dimensional arrays are stored as a series of compressed chunks, each labelled by a chunk key which indicates its position in the array. Whilst conventionally each of these Zarr chunks are a separate compressed binary file stored within a Zarr Store, there is no reason why these chunks could not actually already exist as part of another file (e.g. a netCDF file), and be loaded by reading a specific byte range from this pre-existing file.
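
For illustration, the mapping a chunk manifest stores might look like the following sketch (written as a Python literal; the paths and byte ranges are made up):

```python
# Each chunk key maps to a byte range inside a pre-existing file.
manifest = {
    "0.0": {"path": "air.nc", "offset": 6144, "length": 48000},
    "1.0": {"path": "air.nc", "offset": 54144, "length": 48000},
}
```
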
@@ -311,27 +321,38 @@ Once we've combined references to all the chunks of all our legacy files into one

The [kerchunk library](https://github.com/fsspec/kerchunk) has its own [specification](https://fsspec.github.io/kerchunk/spec.html) for how byte range references should be serialized (either as a JSON or parquet file).

-To write out all the references in the virtual dataset as a single kerchunk-compliant JSON file, you can use the {py:meth}`ds.virtualize.to_kerchunk <virtualizarr.xarray.VirtualiZarrDatasetAccessor.to_kerchunk>` accessor method.
+To write out all the references in the virtual dataset as a single kerchunk-compliant JSON or parquet file, you can use the {py:meth}`ds.virtualize.to_kerchunk <virtualizarr.xarray.VirtualiZarrDatasetAccessor.to_kerchunk>` accessor method.

```python
combined_vds.virtualize.to_kerchunk('combined.json', format='json')
```

-These references can now be interpreted like they were a Zarr store by [fsspec](https://github.com/fsspec/filesystem_spec), using kerchunk's built-in xarray backend (so you need kerchunk to be installed to use `engine='kerchunk'`).
+These references can now be interpreted like they were a Zarr store by [fsspec](https://github.com/fsspec/filesystem_spec), using kerchunk's built-in xarray backend (kerchunk must be installed to use `engine='kerchunk'`).

```python
-import fsspec
-
-fs = fsspec.filesystem("reference", fo=f"combined.json")
-mapper = fs.get_mapper("")
-
-combined_ds = xr.open_dataset(mapper, engine="kerchunk")
+combined_ds = xr.open_dataset('combined.json', engine="kerchunk")
```

+In-memory ("loadable") variables backed by numpy arrays can also be written out to kerchunk reference files, with the values serialized as bytes. This is equivalent to kerchunk's concept of "inlining", but done on a per-array basis using the `loadable_variables` kwarg rather than a per-chunk basis using kerchunk's `inline_threshold` kwarg.
+
+```{note}
+Currently you can only serialize in-memory variables to kerchunk references if they do not have any encoding.
+```
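
A sketch of that kwarg in use (the variable names are illustrative and depend on the file):

```python
# Load small coordinate variables into memory instead of virtualizing
# them, so their actual values can be inlined in the references.
vds = open_virtual_dataset('air.nc', loadable_variables=['time', 'lat', 'lon'])
```
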
+When you have many chunks, the reference file can get large enough to be unwieldy as JSON. In that case the references can instead be stored as parquet. Again this uses kerchunk internally.
+
+```python
+combined_vds.virtualize.to_kerchunk('combined.parq', format='parquet')
+```
+
-```{note}
-Currently you can only serialize virtual variables backed by `ManifestArray` objects to kerchunk reference files, not real in-memory numpy-backed variables.
-```
+And again we can read these references using the "kerchunk" backend as if it were a regular Zarr store:
+
+```python
+combined_ds = xr.open_dataset('combined.parq', engine="kerchunk")
+```
+
+By default references are placed in separate parquet files when the total number of references exceeds `record_size`. If there are fewer than `categorical_threshold` unique urls referenced by a particular variable, the url will be stored as a categorical variable.
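
As a sketch of tuning that behaviour (assuming `to_kerchunk` accepts these two options directly; the values shown are illustrative, not the actual defaults):

```python
combined_vds.virtualize.to_kerchunk(
    'combined.parq',
    format='parquet',
    record_size=100_000,       # start a new parquet file past this many references (illustrative)
    categorical_threshold=10,  # store urls categorically below this many unique urls (illustrative)
)
```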

### Writing as Zarr

Alternatively, we can write these references out as an actual Zarr store, at least one that is compliant with the [proposed "Chunk Manifest" ZEP](https://github.com/zarr-developers/zarr-specs/issues/287). To do this we simply use the {py:meth}`ds.virtualize.to_zarr <virtualizarr.xarray.VirtualiZarrDatasetAccessor.to_zarr>` accessor method.
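
A minimal sketch of that call (the store name is illustrative):

```python
# Write the virtual dataset as an on-disk Zarr store whose arrays
# contain chunk manifests rather than copies of the chunk data.
combined_vds.virtualize.to_zarr('combined.zarr')
```
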
13 changes: 8 additions & 5 deletions pyproject.toml
@@ -14,12 +14,11 @@ classifiers = [
"License :: OSI Approved :: Apache Software License",
"Operating System :: OS Independent",
"Programming Language :: Python",
"Programming Language :: Python :: 3.9",
"Programming Language :: Python :: 3.10",
"Programming Language :: Python :: 3.11",
"Programming Language :: Python :: 3.12",
]
requires-python = ">=3.9"
requires-python = ">=3.10"
dynamic = ["version"]
dependencies = [
"xarray>=2024.5.0",
@@ -29,19 +28,23 @@ dependencies = [
"numpy>=2.0.0rc1",
"ujson",
"packaging",
"universal-pathlib",
]

[project.optional-dependencies]
test = [
"codecov",
"pre-commit",
-"ruff",
"pytest-mypy",
"pytest-cov",
"pytest",
-"scipy",
"pooch",
+"ruff",
+"scipy",
"netcdf4",
"fsspec",
+"s3fs",
+"fastparquet",
]

