Commit

Merge pull request #361 from HopkinsIDD/documentation-gitbook
10/25/2024 Sync GitBook From documentation-gitbook Into main
jcblemai authored Oct 26, 2024
2 parents ff11c02 + f441ab8 commit fee8c3a
Showing 8 changed files with 272 additions and 13 deletions.
3 changes: 2 additions & 1 deletion documentation/gitbook/SUMMARY.md
Expand Up @@ -27,6 +27,7 @@
* [(OLD) Configuration setup](model-inference/inference-implementation/old-configuration-setup.md)
* [Code structure](model-inference/inference-implementation/code-structure.md)
* [Inference Model Output](model-inference/inference-model-output.md)
* [Inference with EMCEE](model-inference/inference-with-emcee.md)

## 🖥️ More

Expand All @@ -50,8 +51,8 @@
* [Advanced run guides](how-to-run/advanced-run-guides/README.md)
* [Running with Docker locally 🛳](how-to-run/advanced-run-guides/running-with-docker-locally.md)
* [Running locally in a conda environment 🐍](how-to-run/advanced-run-guides/quick-start-guide-conda.md)
* [Running on SLURM HPC](how-to-run/advanced-run-guides/slurm-submission-on-marcc.md)
* [Running on AWS 🌳](how-to-run/advanced-run-guides/running-on-aws.md)
* [Running On A HPC With Slurm](how-to-run/advanced-run-guides/running-on-a-hpc-with-slurm.md)
* [Common errors](how-to-run/common-errors.md)
* [Useful commands](how-to-run/useful-commands.md)
* [Tips, tricks, FAQ](how-to-run/tips-tricks-faq.md)
Expand Down
Expand Up @@ -16,10 +16,11 @@ For longer inference runs across multiple slots, we provide instructions and scr

## Running longer inference runs across multiple slots

{% content-ref url="slurm-submission-on-marcc.md" %}
[slurm-submission-on-marcc.md](slurm-submission-on-marcc.md)
{% endcontent-ref %}

{% content-ref url="running-on-aws.md" %}
[running-on-aws.md](running-on-aws.md)
{% endcontent-ref %}

{% content-ref url="running-on-a-hpc-with-slurm.md" %}
[running-on-a-hpc-with-slurm.md](running-on-a-hpc-with-slurm.md)
{% endcontent-ref %}

@@ -0,0 +1,148 @@
---
description: Tutorial on how to install and run flepiMoP on a supported HPC with slurm.
---

# Running On A HPC With Slurm

These details cover how to install and initialize `flepiMoP` on an HPC environment and submit a job with slurm.

{% hint style="warning" %}
Currently only JHU's Rockfish and UNC's Longleaf HPC clusters are supported. If you need support for a new HPC cluster please file an issue in [the `flepiMoP` GitHub repository](https://github.com/HopkinsIDD/flepiMoP/issues).
{% endhint %}

## Installing `flepiMoP`

This task needs to be run once to do the initial install of `flepiMoP`.

{% hint style="info" %}
On JHU's Rockfish you'll need to run these steps in a slurm interactive job. This can be launched with `/data/apps/helpers/interact -n 4 -m 12GB -t 4:00:00`, but please consult the [Rockfish user guide](https://www.arch.jhu.edu/guide/) for up-to-date information.
{% endhint %}

Obtain a temporary clone of the `flepiMoP` repository. The install script will place a permanent clone in the correct location once it has run. You may first need to set up git on the HPC cluster you are using before running this step.

```
$ git clone git@github.com:HopkinsIDD/flepiMoP.git --depth 1
Cloning into 'flepiMoP'...
remote: Enumerating objects: 487, done.
remote: Counting objects: 100% (487/487), done.
remote: Compressing objects: 100% (424/424), done.
remote: Total 487 (delta 59), reused 320 (delta 34), pack-reused 0 (from 0)
Receiving objects: 100% (487/487), 84.04 MiB | 41.45 MiB/s, done.
Resolving deltas: 100% (59/59), done.
Updating files: 100% (411/411), done.
```

Run the `hpc_install_or_update.sh` script, substituting `<cluster-name>` with either `rockfish` or `longleaf`. This script will prompt you for the location to place the `flepiMoP` clone and the name of the conda environment that it will create. If this is your first time using this script, accepting the defaults is the quickest way to get started. Also, expect this script to take a while the first time that you run it.

```
$ ./flepiMoP/build/hpc_install_or_update.sh <cluster-name>
```

Remove the temporary clone of the `flepiMoP` repository created before. This step is not required, but does help alleviate confusion later.

```
$ rm -rf flepiMoP/
```

## Updating `flepiMoP`

Updating `flepiMoP` is designed to work just the same as installing `flepiMoP`. Make sure that your clone of the `flepiMoP` repository is set to the branch you're working with (if doing development or operations work) and then run the `hpc_install_or_update.sh` script, substituting `<cluster-name>` with either `rockfish` or `longleaf`.

```
$ ./flepiMoP/build/hpc_install_or_update.sh <cluster-name>
```

## Initialize The Created `flepiMoP` Environment

These steps to initialize the environment need to be run on a per-run or as-needed basis.

Change directory to where the full clone of the `flepiMoP` repository was placed (the location is stated in the output of the script above), and then run the `hpc_init.sh` script, substituting `<cluster-name>` with either `rockfish` or `longleaf`. This script assumes the same defaults as the previous script for the location of the `flepiMoP` clone and the name of the conda environment. It will also ask about a project directory and config. If this is your first time initializing `flepiMoP`, it might be helpful to clone [the `flepimop_sample` GitHub repository](https://github.com/HopkinsIDD/flepimop_sample) to the same directory to use as a test.

```
$ source batch/hpc_init.sh <cluster-name>
```

Upon completion, the script outputs a sample set of commands that you can run to quickly test whether the installation/initialization has gone okay.

## Submitting A Batch Inference Job To Slurm

When an inference batch job is launched, a few post-processing scripts are run automatically via `postprocessing-scripts.sh`. You can change what is run by editing this script.

A batch job can be submitted after this by running the following:

<pre><code><strong>$ cd $PROJECT_PATH
</strong><strong>$ python $FLEPI_PATH/batch/inference_job_launcher.py --slurm 2>&#x26;1 | tee ${FLEPI_RUN_INDEX}_submission.log
</strong></code></pre>

This launches a batch job on your HPC, with each slot on a separate node. The command attempts to infer the required arguments from your environment variables (e.g., whether this is a resume, what the run\_id is, etc.). The `2>&1 | tee ...` part copies the launcher's output, including error messages, to a log file while still printing it to the terminal; it has no impact on your submission.
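
How the redirection behaves can be seen in a small stand-alone sketch. The `fake_launcher` function and the `demo_run` value are placeholders invented for illustration and are not part of `flepiMoP`. The point is only that `2>&1` merges error messages into standard output before `tee` copies everything to both the terminal and a file, and that the braces in `${FLEPI_RUN_INDEX}` keep the shell from reading `_submission` as part of the variable name.

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for the launcher: one line to stdout, one to stderr.
fake_launcher() {
  echo "job submitted"
  echo "warning: example message" >&2
}

FLEPI_RUN_INDEX="demo_run"            # placeholder run id for this demo
log_dir="$(mktemp -d)"                # scratch directory so nothing is overwritten
log_file="$log_dir/${FLEPI_RUN_INDEX}_submission.log"

# 2>&1 sends stderr into the pipe as well; tee prints and saves both lines.
fake_launcher 2>&1 | tee "$log_file"
```

Without `2>&1`, the warning line would still appear on the terminal but would be missing from the log file.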

If you'd like to have more control, you can specify the arguments manually:

```
$ python $FLEPI_PATH/batch/inference_job_launcher.py --slurm \
-c $CONFIG_PATH \
-p $FLEPI_PATH \
--data-path $DATA_PATH \
--upload-to-s3 True \
--id $FLEPI_RUN_INDEX \
--fs-folder /scratch4/primary-user/flepimop-runs \
--restart-from-location $RESUME_LOCATION
```

For more detailed arguments and advanced usage of the `inference_job_launcher.py` script, please refer to its `--help` output.

After the job is successfully submitted, you will be in a new branch of the project repository. For documentation purposes, we recommend committing the ground truth data files to the branch on GitHub, substituting `<your-commit-message>` with a description of the contents:

<pre><code><strong>$ git add data/
</strong>$ git commit -m "&#x3C;your-commit-message>"
$ git push --set-upstream origin $( git rev-parse --abbrev-ref HEAD )
</code></pre>

## Monitoring Submitted Jobs

During an inference batch run, log files will show the progress of each array/slot. These log files will show up in your project directory and have the file name structure:

```
log_{scenario}_{FLEPI_RUN_INDEX}_{JOB_NAME}_{seir_modifier_scenario}_{outcome_modifiers_scenario}_{array number}.txt
```

To view these as they are being written, type:

```
cat log_{scenario}_{FLEPI_RUN_INDEX}_{JOB_NAME}_{seir_modifier_scenario}_{outcome_modifiers_scenario}_{array number}.txt
```

or your file viewing command of choice. Other commands that are helpful for monitoring the status of your runs (note that `<Job ID>` here is the SLURM job ID, _not_ the `JOB_NAME` set by flepiMoP):

| SLURM command | What does it do? |
| ------------------ | --------------------------------------------------------------------------------------------------------------- |
| `squeue -u $USER` | Displays the names and statuses of all jobs submitted by the user. Job status might be `R` (running) or `PD` (pending). |
| `seff <Job ID>` | Displays information related to the efficiency of resource usage by the job |
| `sacct` | Displays accounting data for all jobs and job steps |
| `scancel <Job ID>` | This cancels a job. If you want to cancel/kill all jobs submitted by a user, you can type `scancel -u $USER` |
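
Because the log file names are long, it can be convenient to look up the newest one automatically. The following is a minimal sketch, assuming the logs sit in one directory and match `log_*.txt`; the function name `latest_log_tail` is hypothetical and not part of `flepiMoP`.

```bash
# Print the last lines of the most recently modified log_*.txt file in a directory.
# Usage: latest_log_tail [directory] [number_of_lines]
latest_log_tail() {
  local dir="${1:-.}" lines="${2:-20}"
  local newest
  # ls -t sorts by modification time, newest first; keep only the first entry.
  newest="$(ls -t "$dir"/log_*.txt 2>/dev/null | head -n 1)"
  if [ -z "$newest" ]; then
    echo "no log_*.txt files found in $dir" >&2
    return 1
  fi
  echo "==> $newest"
  tail -n "$lines" "$newest"
}
```

To keep following a file as it grows, `tail -f` on the reported file name is the usual choice.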



## Other Tips & Tricks

### Moving files to your local computer <a href="#moving-files-to-your-local-computer" id="moving-files-to-your-local-computer"></a>

Often you'll need to move files back and forth between your HPC and your local computer. To do this, your HPC might suggest [Filezilla](https://filezilla-project.org/) or [Globus file manager](https://www.globus.org/). You can also use the `scp` or `rsync` commands (check what works for your HPC).

```
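# To get files from HPC to local computer
scp -r <user>@<data transfer node>:"<file path of what you want>" <where you want to put it in your local>
# or, with rsync (same source and destination order as scp)
rsync -av <user>@<data transfer node>:"<file path of what you want>" <where you want to put it in your local>
# To get files from local computer to HPC
rsync local-file user@remote-host:remote-file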
```

### Other helpful commands <a href="#other-helpful-commands" id="other-helpful-commands"></a>

If your system is approaching a file number quota, you can find subfolders that contain a large number of files by typing:

```
find . -maxdepth 1 -type d | while read -r dir
do printf "%s:\t" "$dir"; find "$dir" -type f | wc -l; done
```
26 changes: 20 additions & 6 deletions documentation/gitbook/how-to-run/quick-start-guide.md
Expand Up @@ -109,7 +109,7 @@ If you just want to [run a forward simulation](quick-start-guide.md#non-inferenc
To [run an inference run](quick-start-guide.md#inference-run) and to explore your model outputs using provided post-processing functionality, there are some packages you'll need to **install in R**. Open your **R terminal** (at the bottom of RStudio, or in the R IDE), and run the following command to install the necessary R packages:

<pre class="language-r" data-overflow="wrap"><code class="lang-r"><strong># while in R
</strong><strong>install.packages(c("readr","sf","lubridate","tidyverse","gridExtra","reticulate","truncnorm","xts","ggfortify","flextable","doParallel","foreach","optparse","arrow","devtools","cowplot","ggraph","data.table"))
</strong><strong>>install.packages(c("readr","sf","lubridate","tidyverse","gridExtra","reticulate","truncnorm","xts","ggfortify","flextable","doParallel","foreach","optparse","arrow","devtools","cowplot","ggraph","data.table"))
</strong></code></pre>

{% hint style="info" %}
Expand All @@ -128,6 +128,22 @@ Rscript build/local_install.R # Install R packages
```
{% endcode %}

After installing the _flepiMoP_ R packages, we need to do one more step to install the command line tools for the inference package. If you are not running in a conda environment, you need to point this installation step to a location that is on your executable search path (i.e., whenever you call a command from the terminal, the places that are searched to find that executable). To find a consistent location, type

```
>which gempyor-simulate
```

The location that is returned will be of the form `EXECUTABLE_SEARCH_PATH/gempyor-simulate`. Then run the following in an R terminal:

```r
# While in R
>library(inference)
>inference::install_cli("EXECUTABLE_SEARCH_PATH")
```

This installs the inference package's CLI tools.
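
The directory part of that location can also be extracted directly in the shell. This is a minimal sketch; the helper name `bin_dir_of` is hypothetical, and it relies only on the standard `command -v` and `dirname` utilities.

```bash
# Print the directory that holds a command found on the executable search path.
# Returns a non-zero status if the command cannot be found.
bin_dir_of() {
  local exe
  exe="$(command -v "$1")" || return 1   # full path of the executable
  dirname "$exe"                         # strip the file name, keep the directory
}

# Example (only meaningful once gempyor is installed):
#   bin_dir_of gempyor-simulate
```

The printed directory is the value to use in place of `EXECUTABLE_SEARCH_PATH`.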

Each installation step may take a few minutes to run.

{% hint style="info" %}
Expand Down Expand Up @@ -189,7 +205,7 @@ An inference run requires a configuration file that has the `inference` section.

{% code overflow="wrap" %}
```bash
flepimop-inference-main.R -c config_sample_2pop_inference.yml
flepimop-inference-main -c config_sample_2pop_inference.yml
```
{% endcode %}

Expand All @@ -208,7 +224,7 @@ The last few lines visible on the command prompt should be:
If you want to quickly do runs with options different from those encoded in the configuration file, you can do that from the command line, for example

```bash
flepimop-inference-main.R -j 1 -n 1 -k 1 -c config_inference.yml
flepimop-inference-main -j 1 -n 1 -k 1 -c config_inference.yml
```

where:
Expand Down Expand Up @@ -241,9 +257,7 @@ Rscript $FLEPI_PATH/flepimop/main_scripts/inference_main.R -c config_inference_n

## 📈 Examining model output

If your run is successful, you should see your output files in the model\_output folder. The structure of the files in this folder is described in the [Model Output](../gempyor/output-files.md) section. By default, all output files are in .parquet format (a compressed format that can be imported as dataframes using R's arrow package, `arrow::read_parquet`, or opened in the free desktop application [Tad](https://www.tadviewer.com/) for quick viewing). However, you can add the option `--write-csv` to the end of the commands that run the code (e.g., `gempyor-simulate -c config.yml --write-csv`) to have everything saved as .csv files instead.

## 🪜 Next steps

Expand Down
11 changes: 10 additions & 1 deletion documentation/gitbook/how-to-run/tips-tricks-faq.md
Expand Up @@ -4,7 +4,7 @@ description: All the little things to save you time on the clusters

# Tips, tricks, FAQ

### Deleting `model_output/` (or any big folder) is too long on the cluster
## Deleting `model_output/` (or any big folder) is too long on the cluster

Yes, it takes ages because IO can be so slow, and there are many small files. If you are in a hurry, you can do

Expand All @@ -14,3 +14,12 @@ rm -r model_output_old &
```

The first command renames (moves) `model_output`, which is instantaneous. You can now re-run something. To delete the renamed folder, run the second command. The `&` at the end makes it execute in the background.

## Use `seff` to analyze a job
After a job has run (whether it completed, was terminated, or failed), you may run:

```bash
seff JOB_ID
```

to find out how many resources your job used on its node, what caused it to terminate, and so on. If you don't remember the `JOB_ID`, look for the number in the filename of the slurm log (`slurm_{JOB_ID}.out`).
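
Recovering job IDs from the log file names can be scripted. The sketch below assumes the logs follow the `slurm_{JOB_ID}.out` naming pattern mentioned above; the function name `job_ids_from_logs` is hypothetical.

```bash
# List the job IDs encoded in slurm_{JOB_ID}.out file names in a directory.
# Usage: job_ids_from_logs [directory]
job_ids_from_logs() {
  local dir="${1:-.}" f base
  for f in "$dir"/slurm_*.out; do
    [ -e "$f" ] || continue      # skip the literal pattern when nothing matches
    base="${f##*/}"              # strip the directory part
    base="${base#slurm_}"        # drop the "slurm_" prefix
    echo "${base%.out}"          # drop the ".out" suffix
  done
}

# Example (on a cluster where seff is available):
#   for id in $(job_ids_from_logs .); do seff "$id"; done
```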
86 changes: 86 additions & 0 deletions documentation/gitbook/model-inference/inference-with-emcee.md
@@ -0,0 +1,86 @@
# Inference with EMCEE

{% hint style="warning" %}
For now, this only works from the `emcee_batch` branch.
{% endhint %}

### Config changes w.r.t classical inference

Under `inference`, add `method: emcee` and modify the `statistics:` section as shown in the diff below (basically, all resampling goes into one subsection, with some minor name changes).

<figure><img src="../.gitbook/assets/Screenshot 2024-10-25 at 15.19.02.png" alt=""><figcaption><p>left: classical inference config, right: new EMCEE config</p></figcaption></figure>

To see which log-likelihood (llik) options and regularizations are available (e.g., weighting the last weeks more heavily for forecasts, or adding the sum over all subpopulations), see the file `statistics.py`.

### Test on your computer

Install gempyor from the `emcee_batch` branch. Test your config by running:

```bash
flepimop-calibrate -c config_emcee.yml --nwalkers 5 --jobs 5 --niterations 10 --nsamples 5 --id my_rim_id
```

on your laptop. If it works, it should produce:

* plots of simulations run directly from your config
* plots after the fit, showing the fits and the parameter chains
* an h5 file with all the chains
* in model\_output, the final hosp/snpi/seir/... files in the flepiMoP structure.

It will output something like

```
gempyor >> Running ***DETERMINISTIC*** simulation;
gempyor >> ModelInfo USA_inference_all; index: 1; run_id: SMH_Rdisparity_phase_one_phase1_blk1_fixprojnpis_CA-NC_emcee,
gempyor >> prefix: USA_inference_all/SMH_Rdisparity_phase_one_phase1_blk1_fixprojnpis_CA-NC_emcee/;
Loaded subpops in loaded relative probablity file: 51 Intersect with seir simulation: 2 kept
Running Gempyor Inference
LogLoss: 6 statistics and 92 data points,number of NA for each statistic:
incidD_latino 46
incidD_other 0
incidD_asian 0
incidD_black 0
incidD_white 0
incidC_white 24
incidC_black 24
incidC_other 24
incidC_asian 24
incidC_latino 61
incidC 24
incidD 0
dtype: int64
InferenceParameters: with 92 parameters:
seir_modifiers: 84 parameters
outcome_modifiers: 8 parameters
```

Here, it says the config fits 92 parameters. Keep that in mind and choose a number of walkers greater than (ideally two times) this number of parameters.

### Run on cluster

Install gempyor on the cluster. Test it with the command above, then modify this script:

```bash
#!/bin/bash
#SBATCH -N 1
#SBATCH -n 1
#SBATCH --mem=450g
#SBATCH -c 256
#SBATCH -t 00-20:00:00
flepimop-calibrate -c config_NC_emcee.yml --nwalkers 500 --jobs 256 --niterations 2000 --nsamples 250 --id my_id > out_fit256.out 2>&1
```

So you need to have:

* `-c` (number of cores) equal to **roughly half the number of walkers** (slots/parallel chains).
* `--mem` at around two times the number of walkers. Look at the compute nodes you have access to and request something that can be scheduled quickly enough.
* `--nsamples` is the number of final results you want, but it's fine not to care about it; I rerun the sampling from my computer.
* To resume from an existing run, add `--resume` to the previous command and it'll start from the last parameter values in the h5 file.
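
The rules of thumb for walkers and cores can be turned into a quick calculation. This is only a sketch of the arithmetic stated in this guide (walkers at about two times the number of fitted parameters, cores at about half the number of walkers); the function name `suggest_emcee_settings` is hypothetical, and memory is left out because the guide does not state its units.

```bash
# Suggest walker and core counts from the number of fitted parameters.
# Usage: suggest_emcee_settings <number_of_parameters>
suggest_emcee_settings() {
  local nparams="$1"
  local nwalkers=$(( 2 * nparams ))   # ideally two times the parameter count
  local cores=$(( nwalkers / 2 ))     # roughly half the number of walkers
  echo "nwalkers=$nwalkers cores=$cores"
}

# For the 92-parameter config from the sample output:
suggest_emcee_settings 92
```

Treat the result as a starting point and adjust it to the compute nodes you can actually get scheduled on.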

### Postprocess EMCEE

To analyze the run, use `postprocessing/emcee_postprocess.ipynb`.\
First, it plots the chains; then it runs `nsamples` projections (you can choose the number) from the end of the chains and plots the fit, with and without projections.
Expand Up @@ -6,7 +6,7 @@ description: These scripts are run automatically after an inference run

Some information to consider if you'd like your script to be run automatically after an inference run:

* Most R/Python packages are already installed. Try to run your script on the conda environment defined on the [submission page](../../how-to-run/advanced-run-guides/slurm-submission-on-marcc.md) (or easier if you are not set up on MARCC, ask me)
* Most R/Python packages are already installed. Try to run your script on the conda environment defined on the [submission page](../../how-to-run/advanced-run-guides/running-on-a-hpc-with-slurm.md) (or easier if you are not set up on MARCC, ask me)
* There will be some variables set in the environment. These variables are:
* `$CONFIG_PATH` the path to the configuration file;
* `$FLEPI_RUN_INDEX` the run id for this run (e.g., `CH_R3_highVE_pesImm_2022_Jan29`)
Expand Down
