Parallel Computations (#39)

* improve assignment table functions (#38)

* update assign logging and force dtypes before merging

* new parallel assignment by propagation functions

* map propagate and resolve propagate functions

* cache prop tables, add docstrings and todos

* copy rows in prop for speed and lower memory req

* correct false checks for empty rows and naming in assign by clusters

* correct bugs in df filters, map assign ungauged

* revise gis generating functions for new column names, logging

* increment version number
rileyhales authored Sep 22, 2022
1 parent af0245d commit e5c988c
Showing 22 changed files with 453 additions and 438 deletions.
3 changes: 3 additions & 0 deletions docs/api/assign.md
@@ -0,0 +1,3 @@
# `saber.assign`

::: saber.assign
3 changes: 3 additions & 0 deletions docs/api/cluster.md
@@ -0,0 +1,3 @@
# `saber.cluster`

::: saber.cluster
3 changes: 3 additions & 0 deletions docs/api/gis.md
@@ -0,0 +1,3 @@
# `saber.gis`

::: saber.gis
8 changes: 7 additions & 1 deletion docs/api/index.md
@@ -1 +1,7 @@
# API Documentation
# `saber-hbc` API

* [`saber.assign`](assign.md)
* [`saber.cluster`](cluster.md)
* [`saber.gis`](gis.md)
* [`saber.prep`](prep.md)
* [`saber.validate`](validate.md)
3 changes: 3 additions & 0 deletions docs/api/prep.md
@@ -0,0 +1,3 @@
# `saber.prep`

::: saber.prep
3 changes: 3 additions & 0 deletions docs/api/validate.md
@@ -0,0 +1,3 @@
# `saber.validate`

::: saber.validate
41 changes: 41 additions & 0 deletions docs/data/discharge-data.md
@@ -0,0 +1,41 @@
# Required Hydrological Datasets

1. Hindcast/retrospective discharge for every stream segment (reporting point) in the model. This is a time series of
   discharge, i.e. a hydrograph, for each stream segment. The data should be saved in parquet format and named
   `hindcast_series_table.parquet`. The DataFrame should have:
    1. An index named `datetime` of type `datetime`, containing the datetime stamp for each simulated value (row).
    2. One column per stream; the column name is the stream's model ID (a string) and contains the discharge for each
       time step.
2. Observed discharge data for each gauge, one file per gauge, named `{gauge_id}.csv`. The DataFrame should have:
    1. `datetime`: The datetime stamp for the measurements.
    2. A column whose name is the unique `gauge_id`, containing the discharge for each time step.

The `hindcast_series_table.parquet` should look like this:

| datetime | model_id_1 | model_id_2 | model_id_3 | ... |
|------------|------------|------------|------------|-----|
| 1985-01-01 | 50 | 50 | 50 | ... |
| 1985-01-02 | 60 | 60 | 60 | ... |
| 1985-01-03 | 70 | 70 | 70 | ... |
| ... | ... | ... | ... | ... |

Each gauge's csv file should look like this:

| datetime | discharge |
|------------|-----------|
| 1985-01-01 | 50 |
| 1985-01-02 | 60 |
| 1985-01-03 | 70 |
| ... | ... |
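A minimal `pandas` sketch for writing tables in these shapes; the model IDs, gauge ID, and file paths below are illustrative placeholders, not part of the SABER API:

```python
import pandas as pd

# Hypothetical hindcast table: a datetime index plus one string-named column per stream
hindcast = pd.DataFrame(
    {'10001': [50.0, 60.0, 70.0], '10002': [50.0, 60.0, 70.0]},
    index=pd.to_datetime(['1985-01-01', '1985-01-02', '1985-01-03']),
)
hindcast.index.name = 'datetime'
hindcast.columns = hindcast.columns.astype(str)  # model ID columns must be strings
hindcast.to_parquet('tables/hindcast_series_table.parquet')

# Hypothetical observed discharge file for a single gauge
obs = pd.DataFrame({
    'datetime': pd.to_datetime(['1985-01-01', '1985-01-02', '1985-01-03']),
    'gauge_001': [50.0, 60.0, 70.0],
})
obs.to_csv('gauge_001.csv', index=False)
```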

## Things to check

Be sure that both datasets:

- Are in the same units (e.g. m3/s)
- Are in the same time zone (e.g. UTC)
- Are in the same time step (e.g. daily average)
- Do not contain any non-numeric values (e.g. ICE, none, etc.)
- Do not contain rows with missing values (e.g. NaN or blank cells)
- Have been cleaned of any incorrect values (e.g. no negative values)
- Do not contain any duplicate rows
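The sketch below shows how a few of these checks might be scripted with `pandas`; the daily-frequency assumption and file path are examples only:

```python
import pandas as pd

df = pd.read_parquet('tables/hindcast_series_table.parquet')

assert df.index.is_unique, 'duplicate datetime rows'
assert not df.isna().any().any(), 'missing values present'
assert all(dtype.kind in 'if' for dtype in df.dtypes), 'non-numeric values present'
assert (df >= 0).all().all(), 'negative discharge values present'
assert pd.infer_freq(df.index) == 'D', 'time step is not a uniform daily average'
```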
46 changes: 46 additions & 0 deletions docs/data/gis-data.md
@@ -0,0 +1,46 @@
# Required GIS Datasets

1. Drainage lines (usually delineated center lines) with at least the following attributes (columns)
for each feature:
- `model_id`: A unique identifier/ID, any alphanumeric utf-8 string will suffice
- `downstream_model_id`: The ID of the next downstream reach
- `strahler_order`: The Strahler stream order of each reach
- `model_drain_area`: Cumulative upstream drainage area
- `x`: The x coordinate of the centroid of each feature (precalculated for faster results later)
- `y`: The y coordinate of the centroid of each feature (precalculated for faster results later)

2. Points representing the location of each available river gauging station, with at least the
following attributes (columns) for each feature:
- `gauge_id`: A unique identifier/ID, any alphanumeric utf-8 string will suffice.
- `model_id`: The ID of the stream segment which corresponds to that gauge.

The `drain_table.parquet` should look like this:

| downstream_model_id | model_id | model_area | strahler_order | x | y |
|---------------------|-----------------|--------------|----------------|-----|-----|
| unique_stream_# | unique_stream_# | area in km^2 | stream_order | ## | ## |
| unique_stream_# | unique_stream_# | area in km^2 | stream_order | ## | ## |
| unique_stream_# | unique_stream_# | area in km^2 | stream_order | ## | ## |
| ... | ... | ... | ... | ... | ... |
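A minimal `geopandas` sketch for building a table like this from the drainage lines; the shapefile path and source column names are assumptions about your GIS data and will likely need to be adapted:

```python
import geopandas as gpd
import pandas as pd

gdf = gpd.read_file('drainage_lines.shp')  # hypothetical path

# Pre-compute centroid coordinates (assumes a projected coordinate system)
gdf['x'] = gdf.geometry.centroid.x
gdf['y'] = gdf.geometry.centroid.y

cols = ['downstream_model_id', 'model_id', 'model_drain_area', 'strahler_order', 'x', 'y']
drain_table = pd.DataFrame(gdf[cols])
drain_table['model_id'] = drain_table['model_id'].astype(str)
drain_table['downstream_model_id'] = drain_table['downstream_model_id'].astype(str)
drain_table.to_parquet('tables/drain_table.parquet')
```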

The `gauge_table.parquet` should look like this:

| model_id | gauge_id | gauge_area |
|-------------------|------------------|--------------|
| unique_stream_num | unique_gauge_num | area in km^2 |
| unique_stream_num | unique_gauge_num | area in km^2 |
| unique_stream_num | unique_gauge_num | area in km^2 |
| ... | ... | ... |
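A matching sketch, with the same caveats about paths and column names, for `gauge_table.parquet`:

```python
import geopandas as gpd
import pandas as pd

gauges = gpd.read_file('gauges.shp')  # hypothetical path
gauge_table = pd.DataFrame(gauges.drop(columns='geometry'))[['model_id', 'gauge_id']].astype(str)
gauge_table.to_parquet('tables/gauge_table.parquet')
```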


## Things to check

Be sure that both datasets:

- Are in the same projected coordinate system
- Only contain gauges and reaches within the area of interest. Clip/delete anything else.
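A rough `geopandas` sketch of both checks; the EPSG code and file paths are placeholders:

```python
import geopandas as gpd

drains = gpd.read_file('drainage_lines.shp')
gauges = gpd.read_file('gauges.shp')
boundary = gpd.read_file('region_boundary.shp')  # hypothetical area of interest

# Put everything in one projected coordinate system (EPSG:3857 is only an example)
drains = drains.to_crs(epsg=3857)
gauges = gauges.to_crs(epsg=3857)
boundary = boundary.to_crs(epsg=3857)

# Keep only features inside the area of interest
drains = gpd.clip(drains, boundary)
gauges = gpd.clip(gauges, boundary)
```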

Other things to consider:

- You may find it helpful to also have the catchments, adjoint catchments, and a watershed boundary polygon for
visualization purposes.
58 changes: 5 additions & 53 deletions docs/data/index.md
@@ -1,55 +1,13 @@
# Required Datasets

## GIS Datasets
SABER requires [GIS Datasets](./gis-data.md) and [Hydrological Datasets](./discharge-data.md).

1. Drainage lines (usually delineated center lines) with at least the following attributes (columns)
for each feature:
- `model_id`: A unique identifier/ID, any alphanumeric utf-8 string will suffice
- `downstream_model_id`: The ID of the next downstream reach
- `strahler_order`: The strahler stream order of each reach
- `model_drain_area`: Cumulative upstream drainage area
- `x`: The x coordinate of the centroid of each feature (precalculated for faster results later)
- `y`: The y coordinate of the centroid of each feature (precalculated for faster results later)
2. Points representing the location of each of the river gauging station available with at least the
following attributes (columns) for each feature:
- `gauge_id`: A unique identifier/ID, any alphanumeric utf-8 string will suffice.
- `model_id`: The ID of the stream segment which corresponds to that gauge.
These datasets ***need to be prepared independently before using `saber-hbc` functions***. You should organize the datasets in a working
directory that contains 3 subdirectories, as shown below. SABER will expect your inputs to be in the `tables` directory
with the correct names and will generate many files to populate the `gis` and `clusters` directories.

Be sure that both datasets:
Example project directory structure:

- Are in the same projected coordinate system
- Only contain gauges and reaches within the area of interest. Clip/delete anything else.

Other things to consider:

- You may find it helpful to also have the catchments, adjoint catchments, and a watershed boundary polygon for
visualization purposes.

## Hydrological Datasets

1. Hindcast/Retrospective/Historical Simulation for every stream segment (reporting point) in the model. This is a time
series of discharge (Q) for each stream segment. The data should be in a tabular format that can be read by `pandas`.
The data should have two columns:
1. `datetime`: The datetime stamp for the measurements
2. A column whose name is the unique `model_id` containing the discharge for each time step.
2. Observed discharge data for each gauge
1. `datetime`: The datetime stamp for the measurements
2. A column whose name is the unique `gauge_id` containing the discharge for each time step.

Be sure that both datasets:

- Are in the same units (e.g. m3/s)
- Are in the same time zone (e.g. UTC)
- Are in the same time step (e.g. daily average)
- Do not contain any non-numeric values (e.g. ICE, none, etc.)
- Do not contain rows with missing values (e.g. NaN or blank cells)
- Have been cleaned of any incorrect values (e.g. no negative values)
- Do not contain any duplicate rows

## Working Directory

SABER is designed to read and write many files in a working directory.

tables/
# This directory contains all the input datasets
drain_table.parquet
@@ -64,9 +22,3 @@ SABER is designed to read and write many files in a working directory.
gis/
# this directory contains outputs from the SABER commands
...
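A small `pathlib` sketch for creating this layout up front; the working directory path is a placeholder, and the subdirectory names are taken from the description above:

```python
from pathlib import Path

workdir = Path('/path/to/project')  # hypothetical working directory
for subdir in ('tables', 'gis', 'clusters'):
    (workdir / subdir).mkdir(parents=True, exist_ok=True)
```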

`drain_table.parquet` is a table of the attribute table from the drainage lines GIS dataset. It can be generated with
`saber.prep.gis_tables()`.

`gauge_table.parquet` is a table of the attribute table from the drainage lines GIS dataset. It can be generated with
`saber.prep.gis_tables()`.
3 changes: 2 additions & 1 deletion docs/requirements.txt
@@ -1,2 +1,3 @@
mkdocs==1.3
mkdocs-material==8.4
mkdocs-material==8.4
mkdocstrings-python==0.7.1
86 changes: 18 additions & 68 deletions docs/user-guide/data_preparation.md
@@ -1,80 +1,30 @@
# Prepare Spatial Data (scripts not provided)
This step instructs you to collect 3 gis files and use them to generate 2 tables. All 5 files (3 gis files and 2
tables) should go in the `gis_inputs` directory
# Processing Input Data

1. Clip model drainage lines and catchments shapefile to extents of the region of interest.
For speed/efficiency, merge their attribute tables and save as a csv.
- read drainage line shapefile and with GeoPandas
- delete all columns ***except***: NextDownID, COMID, Tot_Drain_, order_
- rename the columns:
- NextDownID -> downstream_model_id
- COMID -> model_id
- Tot_Drain -> drainage_area
- order_ -> stream_order
- compute the x and y coordinates of the centroid of each feature (needs the geometry column)
- delete geometry column
- save as `drain_table.csv` in the `gis_inputs` directory
Before following these steps, you should have prepared the required datasets and organized them in a working directory.
Refer to the [Required Datasets](../data/index.md) page for more information.

Tip to compute the x and y coordinates using geopandas
***Prereqs:***

1. Create a working directory and subdirectories
2. Prepare the `drain_table` and `gauge_table` files.
3. Prepare the `hindcast_series_table` file.

Your table should look like this:
## Prepare Flow Duration Curve Data

| downstream_model_id | model_id | model_drain_area | stream_order | x | y |
|---------------------|-----------------|------------------|--------------|-----|-----|
| unique_stream_# | unique_stream_# | area in km^2 | stream_order | ## | ## |
| unique_stream_# | unique_stream_# | area in km^2 | stream_order | ## | ## |
| unique_stream_# | unique_stream_# | area in km^2 | stream_order | ## | ## |
| ... | ... | ... | ... | ... | ... |

1. Prepare a csv of the attribute table of the gauge locations shapefile.
- You need the columns:
- model_id
- gauge_id
- drainage_area (if known)

Your table should look like this (column order is irrelevant):

| model_id | gauge_drain_area | gauge_id |
|-------------------|------------------|------------------|
| unique_stream_num | area in km^2 | unique_gauge_num |
| unique_stream_num | area in km^2 | unique_gauge_num |
| unique_stream_num | area in km^2 | unique_gauge_num |
| ... | ... | ... |

# Prepare Discharge Data

This step instructs you to gather simulated data and observed data. The raw simulated data (netCDF) and raw observed
data (csvs) should be included in the `data_inputs` folder. You may keep them in another location and provide the path
as an argument in the functions that need it. These datasets are used to generate several additional csv files which
are stored in the `data_processed` directory and are used in later steps. The netCDF file may have any name and the
directory of observed data csvs should be called `obs_csvs`.

Use the dat

1. Create a single large csv of the historical simulation data with a datetime column and 1 column per stream segment labeled by the stream's ID number.

| datetime | model_id_1 | model_id_2 | model_id_3 |
|------------|------------|------------|------------|
| 1979-01-01 | 50 | 50 | 50 |
| 1979-01-02 | 60 | 60 | 60 |
| 1979-01-03 | 70 | 70 | 70 |
| ... | ... | ... | ... |

2. Process the large simulated discharge csv to create a 2nd csv with the flow duration curve on each segment (script provided).
Process the `hindcast_series_table` to create a second table containing the flow duration curve for each segment.

| p_exceed | model_id_1 | model_id_2 | model_id_3 |
|----------|------------|------------|------------|
| 100 | 0 | 0 | 0 |
| 99 | 10 | 10 | 10 |
| 98 | 20 | 20 | 20 |
| 97.5 | 10 | 10 | 10 |
| 95 | 20 | 20 | 20 |
| ... | ... | ... | ... |

3. Process the large historical discharge csv to create a 3rd csv with the monthly averages on each segment (script provided).
Then process the FDC data to create a third table with scaled/transformed FDC values for each segment.

| month | model_id_1 | model_id_2 | model_id_3 |
|-------|------------|------------|------------|
| 1 | 60 | 60 | 60 |
| 2 | 30 | 30 | 30 |
| 3 | 70 | 70 | 70 |
| ... | ... | ... | ... |
| model_id | Q100 | Q97.5 | Q95 |
|----------|------|-------|-----|
| 1 | 60 | 50 | 40 |
| 2 | 60 | 50 | 40 |
| 3 | 60 | 50 | 40 |
| ... | ... | ... | ... |
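A minimal `pandas`/`numpy` sketch of the flow duration curve step described above; the exceedance probabilities and output file name are illustrative, and the scaled/transformed table is not shown:

```python
import numpy as np
import pandas as pd

hindcast = pd.read_parquet('tables/hindcast_series_table.parquet')

# Example exceedance probabilities; the real workflow may use a different set
p_exceed = np.array([100, 99, 98, 97.5, 95, 90, 75, 50, 25, 10, 5, 1, 0])

# Flow equaled or exceeded p% of the time is the (100 - p)th percentile of the series
fdc = pd.DataFrame(
    {mid: np.percentile(hindcast[mid].to_numpy(), 100 - p_exceed) for mid in hindcast.columns}
)
fdc.index = pd.Index(p_exceed, name='p_exceed')
fdc.to_parquet('tables/fdc_table.parquet')  # hypothetical file name
```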
7 changes: 6 additions & 1 deletion docs/user-guide/index.md
@@ -1,6 +1,8 @@
# User Guide

We anticipate the primary usage of `saber-hbc` will be in scripts or workflows that process data in isolated environments,
While following this guide, you may also want to refer to the [API Documentation](../api).

We anticipate the primary usage of `saber` will be in scripts or workflows that process data in isolated environments,
such as web servers or interactively in notebooks, rather than using the API in an app. The package's API is designed with
many modular, compartmentalized functions intended to provide flexibility for running specific portions of the SABER process
or for repeating certain parts if workflows fail or parameters need to be adjusted.
@@ -20,3 +22,6 @@ logging.basicConfig(
format='%(asctime)s: %(name)s - %(message)s'
)
```

## Example Script

2 changes: 1 addition & 1 deletion docs/user-guide/validation.md
@@ -27,4 +27,4 @@ obs_data_dir = '/path/to/obs/data/directory' # optional - if data not in workdi

saber.validate.sample_gauges(workdir)
saber.validate.run_series(workdir, drain_shape, obs_data_dir)
```
```
17 changes: 15 additions & 2 deletions mkdocs.yml
@@ -7,7 +7,10 @@ repo_url: https://github.com/rileyhales/saber-hbc/
theme: material
nav:
- Home: index.md
- Required Datasets: data/index.md
- Required Datasets:
- Summary: data/index.md
- GIS Datasets: data/gis-data.md
- Discharge Datasets: data/discharge-data.md
- User Guide:
- Using SABER: user-guide/index.md
- Data Preparation: user-guide/data_preparation.md
@@ -17,5 +20,15 @@ nav:
- Bias Correction: user-guide/bias_correction.md
- Validation: user-guide/validation.md
- Demonstration: demo/index.md
- API Docs: api/index.md
- API Docs:
- API Reference: api/index.md
- saber.prep: api/prep.md
- saber.cluster: api/cluster.md
- saber.assign: api/assign.md
- saber.gis: api/gis.md
- saber.validate: api/validate.md
- Cite SABER: cite/index.md

plugins:
- search
- mkdocstrings
2 changes: 1 addition & 1 deletion saber/__init__.py
@@ -14,5 +14,5 @@
]

__author__ = 'Riley C. Hales'
__version__ = '0.5.0'
__version__ = '0.6.0'
__license__ = 'BSD 3 Clause Clear'