diff --git a/docs/releases.rst b/docs/releases.rst
index dfe5a31d..ec057807 100644
--- a/docs/releases.rst
+++ b/docs/releases.rst
@@ -25,6 +25,7 @@ New Features
 
 Breaking changes
 ~~~~~~~~~~~~~~~~
+
 - Serialize valid ZarrV3 metadata and require full compressor numcodec config
   (for :pull:`193`) By `Gustavo Hidalgo `_.
 - VirtualiZarr's `ZArray`, `ChunkEntry`, and `Codec` no longer subclass
@@ -49,6 +50,10 @@ Bug fixes
 
 Documentation
 ~~~~~~~~~~~~~
 
+- Adds VirtualiZarr + Coiled serverless example notebook (:pull:`223`)
+  By `Raphael Hagen `_.
+
+
 Internal Changes
 ~~~~~~~~~~~~~~~~
diff --git a/examples/coiled/terraclimate.ipynb b/examples/coiled/terraclimate.ipynb
new file mode 100644
index 00000000..205f094b
--- /dev/null
+++ b/examples/coiled/terraclimate.ipynb
@@ -0,0 +1,248 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# VirtualiZarr and Coiled - Building a virtual dataset of TerraClimate\n",
+    "\n",
+    "This notebook is an example of using VirtualiZarr together with [Coiled](https://www.coiled.io/), a Python distributed computing platform, to generate references using [serverless functions](https://docs.coiled.io/user_guide/functions.html).\n",
+    "- **Note:** running this notebook requires a Coiled account.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## The dataset\n",
+    "For this example, we are going to create a virtual Zarr store from the [TerraClimate](https://www.climatologylab.org/terraclimate.html) dataset. TerraClimate is a monthly dataset spanning 66 years and containing 14 climate and water balance variables. It is made up of 924 individual NetCDF4 files and, when represented as an Xarray dataset, is over 1 TB in size."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Parallelizing `virtualizarr` reference generation with Coiled serverless functions\n",
+    "Coiled serverless functions make it easy to spin up hundreds of small compute instances, which are well suited to generating references one file at a time. We were able to process 924 NetCDF files into a 1 TB virtual Xarray dataset in 9 minutes for ~$0.24."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Installation and environment\n",
+    "\n",
+    "You should install the Python requirements in a clean virtual environment of your choice. Each Coiled serverless function will reuse this environment, so it's best to start with a clean slate.\n",
+    "\n",
+    "```bash\n",
+    "pip install virtualizarr coiled xarray fastparquet ipykernel\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Imports\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import coiled\n",
+    "import numpy as np\n",
+    "import xarray as xr\n",
+    "\n",
+    "from virtualizarr import open_virtual_dataset"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Create the TerraClimate variable and year URL combinations\n",
+    "`14 variables * 66 years = 924 NetCDF files`\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "tvars = [\n",
+    "    \"aet\",\n",
+    "    \"def\",\n",
+    "    \"pet\",\n",
+    "    \"ppt\",\n",
+    "    \"q\",\n",
+    "    \"soil\",\n",
+    "    \"srad\",\n",
+    "    \"swe\",\n",
+    "    \"tmax\",\n",
+    "    \"tmin\",\n",
+    "    \"vap\",\n",
+    "    \"ws\",\n",
+    "    \"vpd\",\n",
+    "    \"PDSI\",\n",
+    "]\n",
+    "min_year = 1958\n",
+    "max_year = 2023\n",
+    "time_list = np.arange(min_year, max_year + 1, 1)\n",
+    "\n",
+    "combinations = [\n",
+    "    f\"https://climate.northwestknowledge.net/TERRACLIMATE-DATA/TerraClimate_{var}_{year}.nc\"\n",
+    "    for year in time_list\n",
+    "    for var in tvars\n",
+    "]"
+   ]
+  },
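+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Optional: test reference generation on a single file\n",
+    "\n",
+    "Before scaling out, it can be worth sanity-checking reference generation on one file locally (assuming your environment can reach the TerraClimate HTTP server). The call below mirrors the serverless function defined in the next section; `combinations[0]` is simply the first URL in the list above.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Optional sanity check: build references for a single NetCDF file locally,\n",
+    "# using the same arguments as the serverless function defined below.\n",
+    "vds_test = open_virtual_dataset(\n",
+    "    combinations[0],\n",
+    "    decode_times=True,\n",
+    "    loadable_variables=[\"time\", \"lat\", \"lon\", \"crs\"],\n",
+    "    filetype=\"netcdf4\",\n",
+    "    indexes={},\n",
+    ")\n",
+    "vds_test"
+   ]
+  },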
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Define the Coiled serverless function\n",
+    "\n",
+    "### Serverless function setup notes\n",
+    "- This Coiled function is tailored to AWS\n",
+    "- `vm_type=[\"t4g.small\"]` - a small instance; you shouldn't need large machines for reference generation\n",
+    "- `spot_policy=\"spot_with_fallback\"` - uses cheaper spot instances when available, falling back to on-demand; note that spot instances can be interrupted mid-run\n",
+    "- `arm=True` - uses VMs with ARM architecture, which are cheaper\n",
+    "- `idle_timeout=\"10 minutes\"` - workers shut down after 10 minutes of inactivity\n",
+    "- `n_workers=[100, 300]` - adaptively scales between 100 and 300 workers\n",
+    "- `name` (optional) - labels the cluster so you can keep track of it in the Coiled dashboard\n",
+    "\n",
+    "More details can be found in the [serverless function API](https://docs.coiled.io/user_guide/functions.html#api)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "@coiled.function(\n",
+    "    region=\"us-west-2\",\n",
+    "    vm_type=[\"t4g.small\"],\n",
+    "    spot_policy=\"spot_with_fallback\",\n",
+    "    arm=True,\n",
+    "    idle_timeout=\"10 minutes\",\n",
+    "    n_workers=[100, 300],\n",
+    "    name=\"parallel_reference_generation\",\n",
+    ")\n",
+    "def process(filename):\n",
+    "    vds = open_virtual_dataset(\n",
+    "        filename,\n",
+    "        decode_times=True,\n",
+    "        loadable_variables=[\"time\", \"lat\", \"lon\", \"crs\"],\n",
+    "        filetype=\"netcdf4\",\n",
+    "        indexes={},\n",
+    "    )\n",
+    "    return vds\n",
+    "\n",
+    "\n",
+    "# process.map distributes the input file URLs across the serverless functions\n",
+    "# retries=10 allows individual tasks to be retried, which can be useful for inconsistent server behavior\n",
+    "results = process.map(combinations, retries=10)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Combine references into a virtual dataset"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# extract the results from the generator into a list\n",
+    "vds_list = list(results)\n",
+    "\n",
+    "# combine the individual references into a virtual Xarray dataset\n",
+    "mds = xr.combine_by_coords(\n",
+    "    vds_list, coords=\"minimal\", compat=\"override\", combine_attrs=\"drop_conflicts\"\n",
+    ")\n",
+    "mds"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(f\"{mds.nbytes / 1e12:.2f} TB\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Save the reference to disk\n",
+    "\n",
+    "Now that we have this virtual dataset, we can save the combined reference file for future use. The resulting reference Parquet file is only 2.6 MB!\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "mds.virtualize.to_kerchunk(\"terraclimate.parquet\", format=\"parquet\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Open the reference file and load into Xarray\n",
+    "You can now open the reference file with Xarray and Kerchunk. It behaves much like a normal Xarray dataset.\n",
+    "\n",
+    "**Warning:** Calling `to_zarr` on this dataset will try to write out 1 TB of data.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "combined_ds = xr.open_dataset(\"terraclimate.parquet\", engine=\"kerchunk\", chunks={})\n",
+    "combined_ds"
+   ]
+  },
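+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Optional: lazily read a small subset\n",
+    "\n",
+    "As a quick check that opening the reference file does not load the full 1 TB, you can select a single time step of one variable (`tmax` is one of the variables defined earlier). The selection below stays lazy; computing it, for example with `.compute()` or `.values`, should fetch only the chunks covering that slice.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Optional: select one time step of a single variable. This stays lazy;\n",
+    "# computing it would fetch only the chunks needed for the selection.\n",
+    "tmax_slice = combined_ds[\"tmax\"].isel(time=0)\n",
+    "tmax_slice"
+   ]
+  }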
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.12.0"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}