Archive the project (#539)

This PR prepares the project for archiving. The change adds a number of data files to git LFS for posterity. If the project is restarted, they should ideally be removed from the project and the `.gitattributes` should be deleted.
NOAA-GSL · Sep 30, 2024 · cbd1785
2 parents 61f350a + 4e19e87
Showing 201 changed files with 824 additions and 1 deletion.
diff --git a/.gitattributes b/.gitattributes
@@ -0,0 +1,2 @@
+*.nc4.gz filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
diff --git a/.prettierignore b/.prettierignore
@@ -1,6 +1,9 @@
 tmp/
 .github/
 
+# Ignore the data_samples
+data_samples/
+
 # FIXME: Maybe we should lint/format the k8s files?
 kubernetes/
 

diff --git a/README.md b/README.md
@@ -1,6 +1,8 @@
 # Unified Graphics
 
-An experimental visualization system for 3D-RTMA &amp; RRFS model output.
+An experimental visualization system for 3D-RTMA & RRFS model output. This project ended Sept 2024.
+
+An example of the input data and internal application data can be found in the `data_samples` directory.
 
 ## Get in Touch
 

diff --git a/data_samples/README.md b/data_samples/README.md
@@ -0,0 +1,106 @@
+# Data Sample
+
+## Input Data
+
+The input data directory mimics the structure of our input S3 bucket. The application was designed to consume NetCDF file output from [Gridpoint Statistical Interpolation (GSI) diagnostic files](https://web.archive.org/web/20240521023824/https://dtcenter.org/sites/default/files/community-code/gsi/docs/users-guide/html_v3.7/gsi_ch3.html#gsi-analysis-result-files-in-run-directory).
+
+## App Data
+
+This directory contains outlines of the Zarr data structure, a dump from our SQL database, and the outlines of the Parquet data structure.
+
+- `app_data/s3_diag_zarr_inventory.*.txt` - a few files showing the bucket layout of the Zarr array in S3. One focuses on the general group structure; the other focuses on the data for a particular day. The full array was too big to preserve, and extracting a subset was difficult.
+- `app_data/*.sql` - the SQL DB dump
+- `app_data/RTMA_HRRR_.../...` - the Parquet store with sample data
+
+## Parquet Data Structure
+
+We attempted to use Parquet tables to get around some limitations we were running into with Zarr arrays. The jury is still out on whether a tabular or an array form is better for accessing the data. It is possible that our Zarr array was simply too large (we had only one for all data) and that we would have benefited from creating numerous Zarr arrays with consolidated metadata.
+
+```python
+>>> import pyarrow.parquet as pq
+>>> import pandas as pd
+>>> parquet_table = pq.read_table('RTMA_HRRR_WCOSS_CONUS_REALTIME/t/loop=anl/0004ee89f24c4f4784f76d1384e23588-0.parquet')
+>>> df = parquet_table.to_pandas()
+>>> print(df)
+nobs    observation  forecast_unadjusted  forecast_adjusted  obs_minus_forecast_unadjusted  obs_minus_forecast_adjusted   latitude   longitude  is_used initialization_time
+0       226.350006           225.855835         225.855835                       0.494174                     0.494174  32.733002 -125.932999    False 2024-03-07 09:00:00
+1       225.949997           226.027313         226.027313                      -0.077320                    -0.077320  32.910000 -125.223007    False 2024-03-07 09:00:00
+2       226.149994           225.890976         225.890976                       0.259025                     0.259025  32.813000 -125.639999    False 2024-03-07 09:00:00
+3       226.350006           225.801147         225.801147                       0.548863                     0.548863  32.715000 -126.057999    False 2024-03-07 09:00:00
+4       226.550003           226.019058         226.019058                       0.530942                     0.530942  32.615002 -126.477005    False 2024-03-07 09:00:00
+...            ...                  ...                ...                            ...                          ...        ...         ...      ...                 ...
+73958   273.750000           273.555115         273.586578                       0.194870                     0.163422  46.250832  -63.334167     True 2024-03-07 09:00:00
+73959   273.750000           273.594238         273.586578                       0.155760                     0.163422  46.250832  -63.334167     True 2024-03-07 09:00:00
+73960   272.549988           273.157928         273.146973                      -0.607946                    -0.596997  46.299500  -63.176331     True 2024-03-07 09:00:00
+73961   273.149994           273.147095         273.147003                       0.002895                     0.003003  46.299500  -63.176331     True 2024-03-07 09:00:00
+73962   273.149994           273.060333         273.147003                       0.089654                     0.003003  46.299500  -63.176331     True 2024-03-07 09:00:00
+
+[73963 rows x 9 columns]
+>>> parquet_table = pq.read_table('RTMA_HRRR_WCOSS_CONUS_REALTIME/t/loop=anl/0019cd0014f94635aad5b82c4c4b7cb4-0.parquet')
+>>> df = parquet_table.to_pandas()
+>>> print(df)
+nobs   observation  forecast_unadjusted  forecast_adjusted  obs_minus_forecast_unadjusted  obs_minus_forecast_adjusted   latitude   longitude  is_used initialization_time
+0       226.449997           226.298874         226.298874                       0.151119                     0.151119  31.202000 -123.207993     True 2023-11-13 20:00:00
+1       226.149994           226.220901         226.220901                      -0.070908                    -0.070908  31.150000 -123.460999     True 2023-11-13 20:00:00
+2       225.949997           226.098465         226.098465                      -0.148462                    -0.148462  31.118999 -123.718002    False 2023-11-13 20:00:00
+3       225.949997           226.062836         226.062836                      -0.112842                    -0.112842  31.118000 -123.742004     True 2023-11-13 20:00:00
+4       226.149994           225.935959         225.935959                       0.214042                     0.214042  31.098000 -124.001999     True 2023-11-13 20:00:00
+...            ...                  ...                ...                            ...                          ...        ...         ...      ...                 ...
+92779   275.350006           275.182159         275.144073                       0.167840                     0.205931  46.250832  -63.334167     True 2023-11-13 20:00:00
+92780   275.350006           275.256195         275.408539                       0.093802                    -0.058537  46.299500  -63.176331    False 2023-11-13 20:00:00
+92781   275.350006           275.271393         275.408539                       0.078619                    -0.058537  46.299500  -63.176331    False 2023-11-13 20:00:00
+92782   275.350006           275.271393         275.408539                       0.078619                    -0.058537  46.299500  -63.176331     True 2023-11-13 20:00:00
+92783   274.850006           275.256195         275.408539                      -0.406198                    -0.558537  46.299500  -63.176331     True 2023-11-13 20:00:00
+>>> parquet_table = pq.read_table('RTMA_HRRR_WCOSS_CONUS_REALTIME/t/loop=ges/001760e535744f958328694b4b0a68b3-0.parquet')
+>>> df = parquet_table.to_pandas()
+>>> print(df)
+nobs   observation  forecast_unadjusted  forecast_adjusted  obs_minus_forecast_unadjusted  obs_minus_forecast_adjusted   latitude   longitude  is_used initialization_time
+0       221.149994           220.868866         220.868866                       0.281134                     0.281134  35.980000 -125.253006    False 2024-04-23 14:00:00
+1       220.149994           220.890320         220.890320                      -0.740329                    -0.740329  36.037998 -125.147003    False 2024-04-23 14:00:00
+2       282.850006           282.224060         282.342255                       0.625944                     0.507754  46.143330 -131.089996    False 2024-04-23 14:00:00
+3       282.950012           282.224060         282.342255                       0.725944                     0.607754  46.143330 -131.089996    False 2024-04-23 14:00:00
+4       282.950012           282.224060         282.342255                       0.725944                     0.607754  46.143330 -131.089996    False 2024-04-23 14:00:00
+...            ...                  ...                ...                            ...                          ...        ...         ...      ...                 ...
+96622   278.750000           278.530273         279.071014                       0.219722                    -0.321001  46.250832  -63.334167    False 2024-04-23 14:00:00
+96623   279.250000           278.215424         278.905365                       1.034572                     0.344643  46.299500  -63.176331    False 2024-04-23 14:00:00
+96624   279.250000           278.215424         278.905365                       1.034572                     0.344643  46.299500  -63.176331    False 2024-04-23 14:00:00
+96625   279.250000           278.197266         278.905365                       1.052734                     0.344643  46.299500  -63.176331    False 2024-04-23 14:00:00
+96626   279.250000           278.215424         278.905365                       1.034572                     0.344643  46.299500  -63.176331    False 2024-04-23 14:00:00
+
+[96627 rows x 9 columns]
+```
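The column names in the output above suggest that each `obs_minus_forecast_*` column is simply `observation` minus the matching `forecast_*` column. The sketch below checks that reading against a few rows copied from the sample output. The relationship is inferred from the names rather than documented here, and `residuals_match` and the tolerance are hypothetical choices made for illustration.

```python
import math

# Rows copied from the sample output above, in this column order:
# (observation, forecast_unadjusted, forecast_adjusted,
#  obs_minus_forecast_unadjusted, obs_minus_forecast_adjusted)
SAMPLE_ROWS = [
    (226.350006, 225.855835, 225.855835, 0.494174, 0.494174),
    (225.949997, 226.027313, 226.027313, -0.077320, -0.077320),
    (273.750000, 273.555115, 273.586578, 0.194870, 0.163422),
    (274.850006, 275.256195, 275.408539, -0.406198, -0.558537),
]

# Assumed tolerance: loose enough to absorb rounding in the
# printed float32 values.
TOLERANCE = 1e-3


def residuals_match(row, tol=TOLERANCE):
    """Return True if both residual columns equal observation - forecast."""
    obs, f_unadj, f_adj, omf_unadj, omf_adj = row
    return (
        math.isclose(obs - f_unadj, omf_unadj, abs_tol=tol)
        and math.isclose(obs - f_adj, omf_adj, abs_tol=tol)
    )


all_consistent = all(residuals_match(row) for row in SAMPLE_ROWS)
print(all_consistent)
```

If the assumption holds for the full files, the two residual columns are redundant with the observation and forecast columns and could be recomputed on read.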
+
+## Zarr Data Structure
+
+To give an idea of the shape of the data in a Zarr array:
+
+```python
+>>> import xarray
+>>> import zarr
+>>> ds = xarray.open_dataset("s3://osti-modeling-dev-rtma-vis-prod/diagnostics.zarr/RTMA/WCOSS/CONUS/HRRR/REALTIME/t/2024-09-18T00:00/anl/", engine="zarr")
+>>> print(ds)
+<xarray.Dataset> Size: 3MB
+Dimensions:                        (nobs: 116004)
+Coordinates:
+    is_used                        (nobs) bool 116kB ...
+    latitude                       (nobs) float32 464kB ...
+    longitude                      (nobs) float32 464kB ...
+Dimensions without coordinates: nobs
+Data variables:
+    forecast_adjusted              (nobs) float32 464kB ...
+    forecast_unadjusted            (nobs) float32 464kB ...
+    obs_minus_forecast_adjusted    (nobs) float32 464kB ...
+    obs_minus_forecast_unadjusted  (nobs) float32 464kB ...
+    observation                    (nobs) float32 464kB ...
+Attributes:
+    background:           HRRR
+    domain:               CONUS
+    frequency:            REALTIME
+    initialization_time:  2024-09-18T00:00
+    loop:                 anl
+    model:                RTMA
+    name:                 t
+    system:               WCOSS
+```
+
+The "first guess" (ges) file has a similar structure.
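The Parquet paths shown earlier appear to encode the same metadata that the Zarr store keeps in its attributes: the top-level directory name lines up with `model`, `background`, `system`, `domain`, and `frequency`, the next level with `name`, and the `loop=...` directory is a Hive-style partition key. The sketch below splits a path along those lines. The field order is an assumption made by matching the directory name against the Zarr attributes, and `parse_parquet_path` is a hypothetical helper, not part of the application.

```python
from pathlib import PurePosixPath

# Assumed field order, inferred by comparing the directory name
# RTMA_HRRR_WCOSS_CONUS_REALTIME with the Zarr attributes above.
DATASET_FIELDS = ("model", "background", "system", "domain", "frequency")


def parse_parquet_path(path):
    """Split a sample Parquet path into Zarr-style attribute names."""
    parts = PurePosixPath(path).parts
    # The last four components are: dataset dir, variable, partition, file.
    dataset, variable, partition, filename = parts[-4:]
    info = dict(zip(DATASET_FIELDS, dataset.split("_")))
    info["name"] = variable
    # Hive-style partition directory, e.g. "loop=anl".
    key, _, value = partition.partition("=")
    info[key] = value
    info["file"] = filename
    return info


info = parse_parquet_path(
    "RTMA_HRRR_WCOSS_CONUS_REALTIME/t/loop=anl/"
    "0004ee89f24c4f4784f76d1384e23588-0.parquet"
)
print(info)
```

For the path above, the parsed values agree with the attributes printed for the Zarr group, apart from `initialization_time`, which the Parquet files carry as a column instead.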
diff --git a/...data/RTMA_HRRR_WCOSS_CONUS_REALTIME/t/loop=anl/0019cd0014f94635aad5b82c4c4b7cb4-0.parquet b/...data/RTMA_HRRR_WCOSS_CONUS_REALTIME/t/loop=anl/0019cd0014f94635aad5b82c4c4b7cb4-0.parquet
diff --git a/...data/RTMA_HRRR_WCOSS_CONUS_REALTIME/t/loop=ges/001760e535744f958328694b4b0a68b3-0.parquet b/...data/RTMA_HRRR_WCOSS_CONUS_REALTIME/t/loop=ges/001760e535744f958328694b4b0a68b3-0.parquet
diff --git a/data_samples/app_data/rtma-vis.sql b/data_samples/app_data/rtma-vis.sql