From 534ae0174474b2948b802e2411ab9e2456ecdbb4 Mon Sep 17 00:00:00 2001
From: Tom Nicholas
Date: Tue, 22 Oct 2024 14:39:49 -0600
Subject: [PATCH] FAQ updates (#266)

* faq question about already having kerchunked data
* note compatibility with icechunk
* move more basic usage questions to the bottom
* q about custom readers
* split API into User API and Developer API
* note about manifest classes
---
 docs/api.rst | 41 +++++++++++++++++++++++++----------------
 docs/faq.md  | 28 ++++++++++++++++++++++++++++
 2 files changed, 53 insertions(+), 16 deletions(-)

diff --git a/docs/api.rst b/docs/api.rst
index 755713d0..fef8f2f0 100644
--- a/docs/api.rst
+++ b/docs/api.rst
@@ -5,21 +5,13 @@ API Reference
 .. currentmodule:: virtualizarr
 
 VirtualiZarr has a small API surface, because most of the complexity is handled by xarray functions like ``xarray.concat`` and ``xarray.merge``.
+Users can use xarray for every step apart from reading and serializing virtual references.
 
-Manifests
-=========
-
-.. currentmodule:: virtualizarr.manifests
-.. autosummary::
-    :nosignatures:
-    :toctree: generated/
-
-    ChunkManifest
-    ManifestArray
-
+User API
+========
 
 Reading
-=======
+-------
 
 .. currentmodule:: virtualizarr.backend
 .. autosummary::
@@ -30,7 +22,7 @@ Reading
 
 
 Serialization
-=============
+-------------
 
 .. currentmodule:: virtualizarr.accessor
 .. autosummary::
@@ -41,9 +33,8 @@ Serialization
     VirtualiZarrDatasetAccessor.to_zarr
     VirtualiZarrDatasetAccessor.to_icechunk
 
-
 Rewriting
-=============
+---------
 
 .. currentmodule:: virtualizarr.accessor
 .. autosummary::
@@ -52,9 +43,27 @@ Rewriting
 
     VirtualiZarrDatasetAccessor.rename_paths
 
+Developer API
+=============
+
+If you want to write a new reader to create virtual references pointing to a custom file format, you will need to use VirtualiZarr's internal classes.
+
+Manifests
+---------
+
+VirtualiZarr uses these classes to store virtual references internally.
+
+.. currentmodule:: virtualizarr.manifests
+.. autosummary::
+    :nosignatures:
+    :toctree: generated/
+
+    ChunkManifest
+    ManifestArray
+
 
 Array API
-=========
+---------
 
 VirtualiZarr's :py:class:`~virtualizarr.ManifestArray` objects support a limited subset of the Python Array API standard in :py:mod:`virtualizarr.manifests.array_api`.
 
diff --git a/docs/faq.md b/docs/faq.md
index d273a529..81f55aa3 100644
--- a/docs/faq.md
+++ b/docs/faq.md
@@ -68,3 +68,31 @@ We have a lot of ideas, including:
 - [Generating references without kerchunk](https://github.com/zarr-developers/VirtualiZarr/issues/78)
 
 If you see other opportunities then we would love to hear your ideas!
+
+## Is this compatible with Icechunk?
+
+Yes! VirtualiZarr allows you to ingest data as virtual references and write those references into an Icechunk Store. See the [Icechunk documentation on creating virtual datasets.](https://icechunk.io/icechunk-python/virtual/#creating-a-virtual-dataset-with-virtualizarr)
+
+## I already have Kerchunked data, do I have to redo that work?
+
+No - you can simply open the Kerchunk-formatted references you already have into VirtualiZarr directly. Then you can re-save them into a new format, e.g. [Icechunk](https://icechunk.io/) like so:
+
+```python
+from virtualizarr import open_virtual_dataset
+
+vds = open_virtual_dataset('refs.json')
+# vds = open_virtual_dataset('refs.parq')  # kerchunk parquet files are supported too
+
+vds.virtualize.to_icechunk(icechunkstore)
+```
+
+## Can I add a new reader for my custom file format?
+
+There are a lot of legacy file formats which could potentially be represented as virtual zarr references (see [this issue](https://github.com/zarr-developers/VirtualiZarr/issues/218) for some examples). VirtualiZarr ships with some readers for common formats (e.g. netCDF/HDF5), but you may want to write your own reader for some other file format.
+
+VirtualiZarr is designed in a way to make this as straightforward as possible. If you want to do this then [this comment](https://github.com/zarr-developers/VirtualiZarr/issues/262#issuecomment-2429968244
+) will be helpful.
+
+You can also use this approach to write a reader that starts from a kerchunk-formatted virtual references dict.
+
+Currently if you want to call your new reader from `virtualizarr.open_virtual_dataset` you would need to open a PR to this repository, but we plan to generalize this system to allow 3rd party libraries to plug in via an entrypoint (see [issue #245](https://github.com/zarr-developers/VirtualiZarr/issues/245)).
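
As a rough illustration of the "kerchunk-formatted virtual references dict" mentioned in the custom-reader FAQ entry, the sketch below groups the byte-range entries of such a dict by array, using only the standard library. It assumes the Kerchunk version 1 layout (a `refs` mapping whose chunk entries are `[path, offset, length]` lists). The sample paths, offsets, lengths, and the helper name `refs_to_manifests` are hypothetical, and the output shape only loosely resembles the chunk-to-byte-range mapping a manifest holds. It is not VirtualiZarr's actual API.

```python
import json
from collections import defaultdict

# Hypothetical Kerchunk-style references (version 1 layout assumed);
# the paths, offsets, and lengths are made up for illustration.
kerchunk_refs = {
    "version": 1,
    "refs": {
        ".zgroup": json.dumps({"zarr_format": 2}),
        "air/.zarray": json.dumps({"shape": [4, 4], "chunks": [2, 4], "dtype": "<f4"}),
        "air/0.0": ["s3://bucket/air.nc", 8000, 32],
        "air/1.0": ["s3://bucket/air.nc", 8032, 32],
    },
}


def refs_to_manifests(refs: dict) -> dict:
    """Group byte-range references by array name (hypothetical helper).

    Returns {array_name: {chunk_key: {"path", "offset", "length"}}}.
    Zarr metadata keys (.zgroup, .zarray, .zattrs), inlined data, and
    whole-file references without a byte range are skipped.
    """
    manifests = defaultdict(dict)
    for key, value in refs["refs"].items():
        # "air/0.0" -> array "air", chunk key "0.0"
        array_name, _, chunk_key = key.rpartition("/")
        if chunk_key.startswith(".z"):
            continue  # metadata document, not a chunk
        if not isinstance(value, list) or len(value) != 3:
            continue  # inlined data or a reference to a whole file
        path, offset, length = value
        manifests[array_name][chunk_key] = {
            "path": path,
            "offset": offset,
            "length": length,
        }
    return dict(manifests)


manifests = refs_to_manifests(kerchunk_refs)
```

A real reader would go on to wrap each per-array mapping, together with the array metadata from `.zarray`, in VirtualiZarr's manifest classes; the linked issue comment describes that step.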