Virtualizarr + Coiled Serverless Example Notebook (#233)
* terraclimate_coiled ex
* adds example to releases.rst
* adds fastparquet to ex deps for writing refs
* removed FileType
1 parent 18c5a10, commit 708d168
Showing 2 changed files with 253 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,248 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# VirtualiZarr and Coiled - Building a virtual dataset of TerraClimate\n", | ||
"\n", | ||
"This notebook shows how to use VirtualiZarr with [Coiled](https://www.coiled.io/), a Python distributed-computing platform, to generate references in parallel using [serverless functions](https://docs.coiled.io/user_guide/functions.html).\n", | ||
"- **Note:** running this notebook requires a Coiled account.\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## The dataset\n", | ||
"For this example, we will create a virtual Zarr store from the [TerraClimate](https://www.climatologylab.org/terraclimate.html) dataset. TerraClimate is a monthly dataset spanning 66 years (1958-2023) and containing 14 climate and water balance variables. It is distributed as 924 individual NetCDF4 files, one per variable per year. When represented as an Xarray dataset, it is over 1 TB in size." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Parallelizing `virtualizarr` reference generation with Coiled serverless functions\n", | ||
"Coiled serverless functions make it easy to spin up hundreds of small compute instances, which are well suited to generating references one file at a time. We were able to process 924 NetCDF files into a 1 TB virtual Xarray dataset in 9 minutes for ~$0.24." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Installation and environment\n", | ||
"\n", | ||
"Install the Python requirements in a clean virtual environment of your choice. Each Coiled serverless function will re-use this environment, so it's best to start with a clean slate.\n", | ||
"\n", | ||
"```bash\n", | ||
"pip install virtualizarr coiled xarray fastparquet ipykernel\n", | ||
"```" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Imports\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 1, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"import coiled\n", | ||
"import numpy as np\n", | ||
"import xarray as xr\n", | ||
"\n", | ||
"from virtualizarr import open_virtual_dataset" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Create the TerraClimate variable and year URL combinations\n", | ||
"`14 variables * 66 years = 924 NetCDF files`\n", | ||
"\n", | ||
"\n", | ||
"\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 2, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"tvars = [\n", | ||
"    \"aet\",\n", | ||
"    \"def\",\n", | ||
"    \"pet\",\n", | ||
"    \"ppt\",\n", | ||
"    \"q\",\n", | ||
"    \"soil\",\n", | ||
"    \"srad\",\n", | ||
"    \"swe\",\n", | ||
"    \"tmax\",\n", | ||
"    \"tmin\",\n", | ||
"    \"vap\",\n", | ||
"    \"ws\",\n", | ||
"    \"vpd\",\n", | ||
"    \"PDSI\",\n", | ||
"]\n", | ||
"min_year = 1958\n", | ||
"max_year = 2023\n", | ||
"time_list = np.arange(min_year, max_year + 1, 1)\n", | ||
"\n", | ||
"combinations = [\n", | ||
"    f\"https://climate.northwestknowledge.net/TERRACLIMATE-DATA/TerraClimate_{var}_{year}.nc\"\n", | ||
"    for year in time_list\n", | ||
"    for var in tvars\n", | ||
"]" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Define the Coiled serverless function\n", | ||
"\n", | ||
"### Serverless function setup notes:\n", | ||
"- This Coiled function is tailored to AWS.\n", | ||
"- `vm_type=[\"t4g.small\"]` selects a small instance; reference generation should not need large machines.\n", | ||
"- `spot_policy=\"spot_with_fallback\"` uses cheaper spot instances where available and falls back to on-demand instances otherwise. Spot instances can be interrupted by AWS while a task is running.\n", | ||
"- `arm=True` uses VMs with ARM architecture, which are typically cheaper.\n", | ||
"- `idle_timeout=\"10 minutes\"` shuts the workers down after 10 minutes of inactivity.\n", | ||
"- `n_workers=[100, 300]` enables adaptive scaling between 100 and 300 workers.\n", | ||
"- `name` (optional) lets you keep track of your cluster in the Coiled dashboard.\n", | ||
"\n", | ||
"More details can be found in the [serverless function API](https://docs.coiled.io/user_guide/functions.html#api)." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"@coiled.function(\n", | ||
"    region=\"us-west-2\",\n", | ||
"    vm_type=[\"t4g.small\"],\n", | ||
"    spot_policy=\"spot_with_fallback\",\n", | ||
"    arm=True,\n", | ||
"    idle_timeout=\"10 minutes\",\n", | ||
"    n_workers=[100, 300],\n", | ||
"    name=\"parallel_reference_generation\",\n", | ||
")\n", | ||
"def process(filename):\n", | ||
"    vds = open_virtual_dataset(\n", | ||
"        filename,\n", | ||
"        decode_times=True,\n", | ||
"        loadable_variables=[\"time\", \"lat\", \"lon\", \"crs\"],\n", | ||
"        filetype=\"netcdf4\",\n", | ||
"        indexes={},\n", | ||
"    )\n", | ||
"    return vds\n", | ||
"\n", | ||
"\n", | ||
"# process.map distributes the input file URLs across Coiled functions\n", | ||
"# retries=10 allows individual tasks to be retried, which is useful when the data server behaves inconsistently\n", | ||
"results = process.map(combinations, retries=10)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"\n", | ||
"## Combine references into virtual dataset" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# extract generator values into a list\n", | ||
"vds_list = list(results)\n", | ||
"\n", | ||
"# combine individual refs into a virtual Xarray dataset\n", | ||
"mds = xr.combine_by_coords(\n", | ||
"    vds_list, coords=\"minimal\", compat=\"override\", combine_attrs=\"drop_conflicts\"\n", | ||
")\n", | ||
"mds" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"print(f\"{mds.nbytes / 1e12:.2f} TB\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Save the reference to disk\n", | ||
"\n", | ||
"Now that we have this virtual dataset, we can save the combined references to disk for future use. The resulting Parquet reference file is only 2.6 MB!\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"mds.virtualize.to_kerchunk(\"terraclimate.parquet\", format=\"parquet\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Open the reference file and load into Xarray\n", | ||
"You can now open the reference file with Xarray and Kerchunk. The result behaves much like a regular Xarray dataset.\n", | ||
"\n", | ||
"**Warning:** Calling `to_zarr` on this dataset will try to write out 1 TB of data.\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"combined_ds = xr.open_dataset(\"terraclimate.parquet\", engine=\"kerchunk\", chunks={})\n", | ||
"combined_ds" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.12.0" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 2 | ||
} |
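
The notebook's URL list can be checked locally before any Coiled workers are started. The following is a minimal sketch, not part of the commit: it rebuilds the same year-major list with the standard library only, and the names `TVARS`, `BASE_URL`, and `build_urls` are hypothetical helpers introduced here for illustration. The variable names, year range, and URL pattern are copied from the notebook.

```python
from itertools import product

# Variable names, year range, and URL pattern copied from the notebook.
# TVARS, BASE_URL, and build_urls are illustrative names, not notebook code.
TVARS = [
    "aet", "def", "pet", "ppt", "q", "soil", "srad",
    "swe", "tmax", "tmin", "vap", "ws", "vpd", "PDSI",
]
MIN_YEAR, MAX_YEAR = 1958, 2023
BASE_URL = "https://climate.northwestknowledge.net/TERRACLIMATE-DATA"


def build_urls(tvars=TVARS, min_year=MIN_YEAR, max_year=MAX_YEAR):
    """Return one URL per (year, variable) pair, with year as the outer loop."""
    years = range(min_year, max_year + 1)
    return [
        f"{BASE_URL}/TerraClimate_{var}_{year}.nc"
        for year, var in product(years, tvars)
    ]


urls = build_urls()

# Sanity checks: 14 variables * 66 years, with no duplicate URLs.
assert len(urls) == len(TVARS) * (MAX_YEAR - MIN_YEAR + 1)
assert len(set(urls)) == len(urls)
print(len(urls), urls[0])
```

A smaller call such as `build_urls(min_year=2022, max_year=2023)` yields a short list that may be useful for a trial run before mapping over the full list.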