implement cross-platform trajectory output #108

Open
wants to merge 14 commits into
base: main
35 changes: 19 additions & 16 deletions docs/configuration.md
@@ -13,12 +13,13 @@ The following parameters are found in `r/run_stilt.r` and are used to configure

### Parallel simulation settings

| Arg | Description |
| --------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `n_nodes` | If using SLURM for job submission, number of nodes to utilize |
| `n_cores` | Number of cores per node to parallelize simulations by receptor locations and times |
| `slurm` | Logical indicating the use of rSLURM to submit job(s). When using SLURM, a `<stilt_wd>/_rslurm` directory is created to contain the SLURM submission scripts and node-specific log files. |
| `slurm_options` | Named list of options passed to `sbatch` using `rslurm::slurm_apply()`. This typically includes `time`, `account`, and `partition` values |
| Arg | Description |
| -------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `n_nodes` | If using SLURM for job submission, number of nodes to utilize |
| `n_cores` | Number of cores per node to parallelize simulations by receptor locations and times |
| `processes_per_node` | Number of processes to run on each node. Can be set higher than `n_cores` for nodes that support [hyperthreading](https://scicomp.ethz.ch/wiki/Using_hyperthreading) |
| `slurm` | Logical indicating the use of rSLURM to submit job(s). When using SLURM, a `<stilt_wd>/_rslurm` directory is created to contain the SLURM submission scripts and node-specific log files. |
| `slurm_options` | Named list of options passed to `sbatch` using `rslurm::slurm_apply()`. This typically includes `time`, `account`, and `partition` values |

### Receptor placement

@@ -63,6 +64,7 @@ str(receptors)
| -------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `met_path` | Absolute path to ARL compatible meteorological data files |
| `met_file_format` | String detailing file naming convention for meteorological data files using a mixture of datetime and regex syntax. The formatting string accepts `grep`-compatible regular expressions (`.*.arl`), `strftime`-compatible datetime strings (`%Y%m%d%H`) or any combination of the two. Datetime syntax is expanded to all unique combinations required for the receptor and simulation duration, and the intersection between the requested files and files available in `met_path` is determined with `grep`, allowing partial matching and compatible regular expressions to be used to identify the relevant data. Matching does not require the full format to be specified - e.g. `*.arl`, `%Y`, `%Y%m%d`, `%Y%m%d_d0.*.arl` would all match with a file named `20180130_d01.arl`. |
| `n_hours_per_met_file` | Number of hours per meteorological data file. To determine the number of hours in an ARL compatible meteorological data file, refer to the README, including the file naming convention, provided by the data source. For example, the [NOAA HRRR README](https://www.ready.noaa.gov/data/archives/hrrr/README.TXT) specifies a "6 hour data file beginning with 00z - 05z in the first file of the day". Defaults to 6 |
| `met_subgrid_buffer` | Percent to extend footprint area for meteorological subdomain when using `met_subgrid_enable`. Defaults to 0.1 (10%) |
| `met_subgrid_enable` | Enables extraction of spatial subdomains from files in `met_path` using HYSPLIT's `xtrct_grid` binary prior to executing simulations. If enabled, will create files in `<output_wd>/met/`. This can substantially accelerate simulation speed at the cost of increased disk usage. Defaults to disabled |
| `met_subgrid_levels` | If set, extracts the defined number of vertical levels from the meteorological data files to further accelerate simulations. Defaults to `NA`, which includes all vertical levels available |
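
The matching described for `met_file_format` can be sketched roughly as follows. This is an illustration in Python, not STILT's R implementation: the function name `find_met_files`, the hourly expansion step, and the backward padding by `n_hours_per_met_file - 1` hours are all assumptions, and Python's `re` syntax differs from `grep` in places (for example, a pattern cannot start with `*`).

```python
import re
from datetime import datetime, timedelta


def find_met_files(available, fmt, start, n_hours, n_hours_per_met_file=6):
    """Return the names in `available` that match `fmt` for one simulation.

    `start` is the receptor time; a negative `n_hours` runs backward in time.
    """
    end = start + timedelta(hours=n_hours)
    # Assumption: pad the window backward by one file length so that a file
    # starting before the window but overlapping it is still requested.
    first = min(start, end) - timedelta(hours=n_hours_per_met_file - 1)
    last = max(start, end)

    # Expand the strftime codes once per hour and keep the unique results.
    patterns = set()
    t = first
    while t <= last:
        patterns.add(t.strftime(fmt))
        t += timedelta(hours=1)

    # Partial matching: a file is kept if any expanded pattern is found in it.
    return sorted(f for f in available if any(re.search(p, f) for p in patterns))


files = ["20180129_d01.arl", "20180130_d01.arl", "20180131_d01.arl"]
receptor_time = datetime(2018, 1, 30, 18)
print(find_met_files(files, "%Y%m%d_d0.*.arl", receptor_time, n_hours=-12))
# ['20180130_d01.arl']
```

Because the expanded patterns are intersected with the files that actually exist, expanding more hours than strictly necessary only costs a few extra comparisons.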
@@ -72,16 +74,17 @@ str(receptors)

### Model control

| Arg | Description |
| --------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `n_hours` | Number of hours to run each simulation; negative indicates backward in time |
| `numpar` | Number of particles to be run; defaults to 200 |
| `rm_dat` | Logical indicating whether to delete `PARTICLE.DAT` after each simulation. Defaults to TRUE to reduce disk space since all of the trajectory information is also stored in `STILT_OUTPUT.rds` alongside the calculated upstream influence footprint |
| `run_foot` | Logical indicating whether to produce footprints. If FALSE, `run_trajec` must be TRUE. This can be useful when calculating trajectories separate from footprints |
| `run_trajec` | Logical indicating whether to produce new trajectories with `hycs_std`. If FALSE, will try to load the previous trajectory outputs. This is often useful for regridding purposes |
| `simulation_id` | Unique identifier for each simulation; defaults to NA which determines a unique identifier for each simulation by hashing the time and receptor location |
| `timeout` | Number of seconds to allow `hycs_std` to complete before sending SIGTERM and moving to the next simulation; defaults to 3600 (1 hour) |
| `varsiwant` | Character vector of 4-letter `hycs_std` variables. Defaults to the minimum required variables including `'time', 'indx', 'long', 'lati', 'zagl', 'foot', 'mlht', 'dens', 'samt', 'sigw', 'tlgr'`. Can optionally include options listed below. |
| Arg | Description |
| --------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `n_hours` | Number of hours to run each simulation; negative indicates backward in time |
| `numpar` | Number of particles to be run; defaults to 200 |
| `rm_dat` | Logical indicating whether to delete `PARTICLE.DAT` after each simulation. Defaults to TRUE to reduce disk space since all of the trajectory information is also stored in `STILT_OUTPUT.rds` alongside the calculated upstream influence footprint |
| `run_foot` | Logical indicating whether to produce footprints. If FALSE, `run_trajec` must be TRUE. This can be useful when calculating trajectories separate from footprints |
| `run_trajec` | Logical indicating whether to produce new trajectories with `hycs_std`. If FALSE, will try to load the previous trajectory outputs. This is often useful for regridding purposes |
| `trajec_fmt` | File extension for trajectory output files. Defaults to `rds` (serialized R data). Can be set to `parquet` for cross-platform output or as an empty string `''` to disable writing trajectory outputs. The `arrow` package must be manually installed to write `parquet` output |
| `simulation_id` | Unique identifier for each simulation; defaults to NA which determines a unique identifier for each simulation by hashing the time and receptor location |
| `timeout` | Number of seconds to allow `hycs_std` to complete before sending SIGTERM and moving to the next simulation; defaults to 3600 (1 hour) |
| `varsiwant` | Character vector of 4-letter `hycs_std` variables. Defaults to the minimum required variables including `'time', 'indx', 'long', 'lati', 'zagl', 'foot', 'mlht', 'dens', 'samt', 'sigw', 'tlgr'`. Can optionally include options listed below. |

#### Optional `varsiwant` arguments

2 changes: 1 addition & 1 deletion docs/execution.md
@@ -48,7 +48,7 @@ Rscript r/run_stilt.r

![Parallel simulations with SLURM](static/img/chart-parallel.png)

If `slurm = TRUE`, STILT will distribute the simulations across `n_nodes` using `n_cores` on each node (total parallel worker count is `n_nodes * n_cores`). This will create a `<stilt_wd>/_rslurm` directory which contains SLURM submission scripts and logs from each node.
If `slurm = TRUE`, STILT will distribute the simulations across `n_nodes` using `n_cores` on each node (total parallel worker count is `n_nodes * n_cores`). This will create a `<stilt_wd>/_rslurm` directory which contains SLURM submission scripts and logs from each node. For nodes that support [hyperthreading](https://scicomp.ethz.ch/wiki/Using_hyperthreading), the job allocation per node can be increased beyond the number of cores per node via `processes_per_node`.
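
The worker arithmetic in the paragraph above can be illustrated with a short sketch. It only demonstrates the counting, not how rSLURM actually schedules work: the helper `plan_workers`, the contiguous even chunking, and the default of one process per core are assumptions made for this example.

```python
import math


def plan_workers(n_simulations, n_nodes, n_cores, processes_per_node=None):
    """Summarize how a batch of simulations could be spread over SLURM nodes."""
    if processes_per_node is None:
        # Assumption: without an explicit value, one process runs per core.
        processes_per_node = n_cores
    # Give each node one contiguous chunk of roughly equal size.
    per_node = max(1, math.ceil(n_simulations / n_nodes))
    chunks = [
        range(i, min(i + per_node, n_simulations))
        for i in range(0, n_simulations, per_node)
    ]
    return {
        "workers": n_nodes * n_cores,
        "processes": n_nodes * processes_per_node,
        "chunks": chunks,
    }


plan = plan_workers(n_simulations=100, n_nodes=2, n_cores=4, processes_per_node=8)
print(plan["workers"], plan["processes"], [len(c) for c in plan["chunks"]])
# 8 16 [50, 50]
```

With hyperthreading, `processes` exceeds `workers` because each physical core is asked to host more than one process.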

```bash
Rscript r/run_stilt.r
35 changes: 30 additions & 5 deletions docs/output-files.md
@@ -2,22 +2,26 @@

The model outputs can be found in the directory configured with `output_wd` (defaults to `<stilt_wd>/out/`, see [project structure](http://localhost:3000/#/project-structure)). STILT outputs two files for analysis:

- a `<simulation_id>_traj.rds` file containing the trajectories of the particle ensemble
- a `<simulation_id>_traj.<trajec_fmt>` file containing the trajectories of the particle ensemble
- a `<simulation_id>_foot.nc` file containing gridded footprint values and metadata

Simulation identifiers follow a `yyyymmddHHMM_lati_long_zagl` convention, see [project structure](project-structure.md?id=outby-id).
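
As a sketch of the naming scheme, the helper below builds the two output paths for one simulation. It is written in Python for illustration only: `output_files`, the `sim_0001` identifier, and the `/stilt/out` directory are hypothetical, and restricting `trajec_fmt` to the three documented values is an assumption.

```python
from pathlib import PurePosixPath

# Values documented for `trajec_fmt`; '' disables trajectory output.
TRAJEC_FORMATS = ("rds", "parquet", "")


def output_files(output_wd, simulation_id, trajec_fmt="rds"):
    """Return the expected trajectory and footprint paths for one simulation."""
    if trajec_fmt not in TRAJEC_FORMATS:
        raise ValueError(f"unsupported trajec_fmt: {trajec_fmt!r}")
    sim_dir = PurePosixPath(output_wd) / "by-id" / simulation_id
    traj = None
    if trajec_fmt:
        traj = str(sim_dir / f"{simulation_id}_traj.{trajec_fmt}")
    return {"traj": traj, "foot": str(sim_dir / f"{simulation_id}_foot.nc")}


print(output_files("/stilt/out", "sim_0001", "parquet")["traj"])
# /stilt/out/by-id/sim_0001/sim_0001_traj.parquet
```

Returning `None` for the trajectory path when `trajec_fmt` is empty lets calling code skip the file without a separate flag.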

## Particle trajectories

Particle trajectories and simulation configuration information are packaged and saved in a compressed `.rds` file ([serialized single R object](https://stat.ethz.ch/R-manual/R-devel/library/base/html/readRDS.html)) with the naming convention `<simulation_id>_traj.rds`. Preserving the particle trajectories enables regridding the footprints at a later time without the computational cost of recalculating particle trajectories.
Particle trajectories and simulation configuration information are packaged and saved with the naming convention `<simulation_id>_traj.<trajec_fmt>`. Preserving the particle trajectories enables regridding the footprints at a later time without the computational cost of recalculating particle trajectories.

This object can be loaded with `readRDS(<path>)` and is structured as
Different output formats are available for the particle trajectories, as specified by the `trajec_fmt` parameter. The default format is `rds`, a [serialized R data](https://stat.ethz.ch/R-manual/R-devel/library/base/html/readRDS.html) object. Other options are `parquet` for [Apache Parquet](https://parquet.apache.org/) or an empty string `''` to disable writing trajectory outputs.
> Formats other than `rds` will be less efficient, but `parquet` has the advantage of being widely supported by other software packages.

In R, this object can be loaded with `read_traj(<path>)` and is structured as

```r
traj <- readRDS('<simulation_id>_traj.rds')
source('r/dependencies.r')
traj <- read_traj('<simulation_id>_traj.<trajec_fmt>')
str(traj)
# List of 4
# $ file : chr "<stilt_wd>/out/by-id/<simulation_id>/<simulation_id>_traj.rds"
# $ file : chr "<stilt_wd>/out/by-id/<simulation_id>/<simulation_id>_traj.<trajec_fmt>"
# $ receptor:List of 4
# ..$ run_time: POSIXct[1:1], format: "1983-09-18 21:00:00"
# ..$ lati : num 39.6
@@ -40,6 +44,27 @@ str(traj)

The `traj$receptor` object is a named list with the time and location of the release point for the simulation. The `traj$particle` object is a data frame containing each particle's position and characteristics over time.

Alternatively, `parquet`-formatted files can be read in Python with the `pandas` and `pyarrow` packages.

```python
import pandas as pd
import pyarrow.parquet as pq

# Read the parquet file into a pandas DataFrame
particle = pd.read_parquet('<simulation_id>_traj.parquet')
particle.head()
# | | time | indx | long | lati | zagl | foot | mlht | dens | samt | sigw | tlgr | foot_no_hnf_dilution |
# |---:|-------:|-------:|-------:|-------:|-------:|--------:|-------:|-------:|-------:|-------:|-------:|-----------------------:|
# | 0 | -1 | 1 | -80.4 | 39.6 | 5.63 | 0.0626 | 1459 | 1.1 | 1 | 1.08 | 1.67 | 0.00224 |
# | 1 | -1 | 2 | -80.4 | 39.6 | 41.37 | 0.043 | 1459 | 1.1 | 1 | 1.09 | 4.98 | 0.00224 |
# | 2 | -1 | 3 | -80.4 | 39.6 | 30.95 | 0.043 | 1459 | 1.1 | 1 | 1.09 | 4.98 | 0.00224 |
# | 3 | -1 | 4 | -80.4 | 39.6 | 3.32 | 0.0626 | 1459 | 1.1 | 1 | 1.08 | 1.67 | 0.00224 |
# | 4 | -1 | 5 | -80.4 | 39.6 | 28.88 | 0.0627 | 1459 | 1.1 | 1 | 1.08 | 1.67 | 0.00224 |

# Receptor & config parameter metadata is stored in the parquet `FileMetaData.metadata` attribute
metadata = pq.read_metadata('<simulation_id>_traj.parquet').metadata
```
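
pyarrow exposes the file-level key-value metadata as a mapping with `bytes` keys and values, so it usually needs decoding before use. The sketch below shows one way to do that with the standard library. How STILT serializes each value is not stated here, so trying JSON first and falling back to plain text is an assumption, as are the helper name `decode_metadata` and the sample keys.

```python
import json


def decode_metadata(raw):
    """Decode a bytes-to-bytes metadata mapping into Python objects."""
    decoded = {}
    for key, value in (raw or {}).items():
        text = value.decode("utf-8")
        try:
            # Assumption: structured values may be stored as JSON strings.
            decoded[key.decode("utf-8")] = json.loads(text)
        except json.JSONDecodeError:
            # Otherwise keep the value as plain text.
            decoded[key.decode("utf-8")] = text
    return decoded


# Hypothetical metadata, shaped like the mapping pyarrow returns.
sample = {b"n_hours": b"-24", b"receptor": b'{"lati": 39.6}', b"note": b"plain text"}
print(decode_metadata(sample))
# {'n_hours': -24, 'receptor': {'lati': 39.6}, 'note': 'plain text'}
```

Passing `None` (which pyarrow returns when a file has no key-value metadata) yields an empty dictionary.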

## Gridded footprints

Footprints are packaged and saved in a compressed NetCDF file using [Climate and Forecast (CF)](http://cfconventions.org) compliant metadata with the naming convention `<simulation_id>_foot.nc`. This object contains information about the model domain, the grid resolution, and footprint values. This object is typically a three-dimensional array with dimensions ordered (_x_, _y_, _t_). However, the object will only have dimensions (_x_, _y_) for time-integrated footprints.