From 01d05b55c10de3fc1304bf7b6013476d20fea057 Mon Sep 17 00:00:00 2001 From: niksirbi Date: Tue, 26 Sep 2023 13:39:49 +0100 Subject: [PATCH 01/29] started guide on SLEAP module --- docs/source/data_analysis/HPC-module-SLEAP.md | 1 + docs/source/data_analysis/index.md | 1 + 2 files changed, 2 insertions(+) create mode 100644 docs/source/data_analysis/HPC-module-SLEAP.md diff --git a/docs/source/data_analysis/HPC-module-SLEAP.md b/docs/source/data_analysis/HPC-module-SLEAP.md new file mode 100644 index 0000000..7e971a3 --- /dev/null +++ b/docs/source/data_analysis/HPC-module-SLEAP.md @@ -0,0 +1 @@ +# Use the SLEAP module on the HPC cluster diff --git a/docs/source/data_analysis/index.md b/docs/source/data_analysis/index.md index 0e020c6..ce094e5 100644 --- a/docs/source/data_analysis/index.md +++ b/docs/source/data_analysis/index.md @@ -5,4 +5,5 @@ Guides related to the analysis of neuroscientific data, spanning a wide range of ```{toctree} :maxdepth: 1 +HPC-module-SLEAP ``` From dca90d35e08ceda2a83c936f4c9b03710bb33d25 Mon Sep 17 00:00:00 2001 From: niksirbi Date: Tue, 26 Sep 2023 16:48:13 +0100 Subject: [PATCH 02/29] add and configure sphinx-copybutton --- docs/requirements.txt | 1 + docs/source/conf.py | 25 +++++++++++++++---------- 2 files changed, 16 insertions(+), 10 deletions(-) diff --git a/docs/requirements.txt b/docs/requirements.txt index e6f6522..403d7ff 100644 --- a/docs/requirements.txt +++ b/docs/requirements.txt @@ -5,3 +5,4 @@ numpydoc pydata-sphinx-theme sphinx sphinx-design +sphinx-copybutton diff --git a/docs/source/conf.py b/docs/source/conf.py index 2392d2a..dbb0368 100644 --- a/docs/source/conf.py +++ b/docs/source/conf.py @@ -31,16 +31,17 @@ # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom # ones. extensions = [ - "sphinx.ext.githubpages", - "sphinx.ext.autodoc", - "sphinx.ext.autosummary", - "sphinx.ext.viewcode", - "sphinx.ext.intersphinx", - "sphinx.ext.napoleon", - "sphinx_design", - "myst_parser", - "numpydoc", - "nbsphinx", + 'sphinx.ext.githubpages', + 'sphinx.ext.autodoc', + 'sphinx.ext.autosummary', + 'sphinx.ext.viewcode', + 'sphinx.ext.intersphinx', + 'sphinx.ext.napoleon', + 'sphinx_design', + 'sphinx_copybutton', + 'myst_parser', + 'numpydoc', + 'nbsphinx', ] # Configure the myst parser to enable cool markdown features @@ -134,3 +135,7 @@ # Hide the "Show Source" button html_show_sourcelink = False + +# Configure the code block copy button +# don't copy line numbers, prompts, or console outputs +copybutton_exclude = ".linenos, .gp, .go" From 85a98f4a331c67f7d1d3c6551eae4d38b4b2d640 Mon Sep 17 00:00:00 2001 From: niksirbi Date: Tue, 26 Sep 2023 17:32:21 +0100 Subject: [PATCH 03/29] added complete draft for the SLEAP guide --- docs/source/data_analysis/HPC-module-SLEAP.md | 634 ++++++++++++++++++ 1 file changed, 634 insertions(+) diff --git a/docs/source/data_analysis/HPC-module-SLEAP.md b/docs/source/data_analysis/HPC-module-SLEAP.md index 7e971a3..dad6d3c 100644 --- a/docs/source/data_analysis/HPC-module-SLEAP.md +++ b/docs/source/data_analysis/HPC-module-SLEAP.md @@ -1 +1,635 @@ # Use the SLEAP module on the HPC cluster + +```{role} bash(code) +:language: bash +``` +```{role} python(code) +:language: python +``` + +This guide explains how to use the [SLEAP](https://sleap.ai/) module that is +installed on the SWC's HPC cluster to run training and/or inference jobs. 
+ +:::{warning} +Some links withing this document point to the +[SWC internal wiki](https://wiki.ucl.ac.uk/display/SI/SWC+Intranet), +which is only accessible from within the SWC network. +::: + +:::{dropdown} Intepreting code blocks wihin this document +:color: info +:icon: info + +Shell commands will be shown in code blocks like this +(with the `$` sign indicating the shell prompt): +```{code-block} console +$ echo "Hello world!" +``` + +Similarly, Python code blocks will appear with the `>>>` sign indicating the +Python interpreter prompt: +```{code-block} pycon +>>> print("Hello world!") +``` + +The expected outputs of both shell and Python commands will be shown without +any prompt: +```{code-block} console +Hello world! +``` +::: + +## Abbreviations +| Acronym | Meaning | +| --- | --- | +| SLEAP | Social LEAP Estimates Animal Poses | +| SWC | Sainsbury Wellcome Centre | +| HPC | High Performance Computing | +| SLURM | Simple Linux Utility for Resource Management | +| GUI | Graphical User Interface | + +## Prerequisites + +### Access to the HPC cluster and SLEAP module +Verify that you can access HPC gateway node (typing your `` both times when prompted): +```{code-block} console +$ ssh @ssh.swc.ucl.ac.uk +$ ssh hpc-gw1 +``` + +If you are wondering about the two SSH commands, see the Appendix for +[Why do we SSH twice?](#why-do-we-ssh-twice). + + +SLEAP should be listed among the available modules: + +```{code-block} console +$ module avail +SLEAP/2023-08-01 +SLEAP/2023-03-13 +``` + +`SLEAP/2023-03-13` corresponds to `sleap v.1.2.9` whereas `SLEAP/2023-08-01` is `v1.3.1`. +We recommend using the latter. + +You can load the latest version by running: + +```{code-block} console +$ module load SLEAP +``` +If you want to load a specific version, you can do so by typing the full module name, +including the date e.g. `module load SLEAP/2023-03-13` + +If a module has been successfully loaded, it will be listed when you run `module list`, +along with other modules it may depend on: + +```{code-block} console +$ module list +Currently Loaded Modulefiles: + 1) cuda/11.8 2) SLEAP/2023-08-01 +``` + +If you have troubles with loading the SLEAP module, see the +[Troubleshooting section](#problems-with-the-sleap-module). + + +### Install SLEAP on your local PC/laptop +While you can delegate the GPU-intensive work to the HPC cluster, +you will still need to do some steps, such as labelling frames, via the SLEAP GUI. +Thus, you also need to install SLEAP on your local PC/laptop. + +We recommend following the official [SLEAP installation guide](https://sleap.ai/installation.html). +To be on the safe side, ensure that the local installation version matches the one on the cluster. + +### Mount the SWC filesystem on your local PC/laptop +The rest of this guide assumes that you have mounted the SWC filesystem on your local PC/laptop. +If you have not done so, please follow the relevant instructions on the +[SWC internal wiki](https://wiki.ucl.ac.uk/display/SSC/SWC+Storage+Platform+Overview). + +We will also assume that the data you are working with are stored in a `ceph` or `winstor` +directory to which you have access to. In the rest of this guide, we will use the path +`/ceph/scratch/neuroinformatics-dropoff/SLEAP_HPC_test_data` which contains a SLEAP project +for test purposes. You should replace this with the path to your own data. + +:::{dropdown} Data storage location matters +:color: warning +:icon: alert-fill + +The cluster has fast acess to data stored on the `ceph` and `winstor` filesystems. 
+If your data is stored elsewhere, make sure to transfer it to `ceph` or `winstor` +before running the job. You can use tools such as [`rsync`](https://linux.die.net/man/1/rsync) +to copy data from your local machine to `ceph` via an ssh connection. For example: + +```{code-block} console +$ rsync -avz @ssh.swc.ucl.ac.uk:/ceph/scratch/neuroinformatics-dropoff/SLEAP_HPC_test_data +``` +::: + +## Model training +This will consist of two parts - [preparing a training job](#prepare-the-training-job) +(on your local SLEAP installation) and [running a training job](#run-the-training-job) +(on the HPC cluster's SLEAP module). Some evaluation metrics for the trained models +can be [viewed via the SLEAP GUI](#evaluate-the-trained-models) on your local SLEAP installation. + +### Prepare the training job +Follow the SLEAP instructions for [Creating a Project](https://sleap.ai/tutorials/new-project.html) +and [Initial Labelling](https://sleap.ai/tutorials/initial-labeling.html). +Ensure that the project file (e.g. `labels.v001.slp`) is saved in the mounted SWC filesystem +(as opposed to your local filesystem). + +Next, follow the instructions in [Remote Training](https://sleap.ai/guides/remote.html#remote-training), +i.e. "Predict" -> "Run Training…" -> "Export Training Job Package…". +- For selecting the right configuration parameters, see [Configuring Models](https://sleap.ai/guides/choosing-models.html#) and [Troubleshooting Workflows](https://sleap.ai/guides/troubleshooting-workflows.html) +- Set the "Predict On" parameter to "nothing". Remote training and inference (prediction) are easiest to run separately on the HPC Cluster. Also unselect "visualize predictions" in training settings, if it's enabled by default. +- If you are working with a top-down camera view, set the "Rotation Min Angle" and "Rotation Max Angle" to -180 and 180 respectively in the "Augmentation" section. +- Make sure to save the exported training job package (e.g. `labels.v001.slp.training_job.zip`) in the mounted SWC filesystem, ideally in the same directory as the project file. +- Unzip the training job package. This will create a folder with the same name (minus the `.zip` extension). This folder contains everything needed to run the training job on the HPC cluster. + +### Run the training job +Login to the HPC cluster as described above. +```{code-block} console +$ ssh @ssh.swc.ucl.ac.uk +$ ssh hpc-gw1 +``` +Navigate to the training job folder (replace with your own path) and list its contents: +```{code-block} console +$ cd /ceph/scratch/neuroinformatics-dropoff/SLEAP_HPC_test_data +$ cd labels.v001.slp.training_job +$ ls -1 +centered_instance.json +centroid.json +inference-script.sh +jobs.yaml +labels.v001.pkg.slp +labels.v001.slp.predictions.slp +slurm_train_script.sh +swc-hpc-pose-estimation +train-script.sh +``` +There should be a `train-script.sh` file created by SLEAP, which already contains the +commands to run the training. You can see the contents of the file by running `cat train-script.sh`: +```{code-block} bash +:caption: train-script.sh +#!/bin/bash +sleap-train centroid.json labels.v001.pkg.slp +sleap-train centered_instance.json labels.v001.pkg.slp +``` +The precise commands will depend on the model configuration you chose in SLEAP. +Here we see two separate training calls, one for the "centroid" and another for +the "centered_instance" model. 
That's because in this example we have chosen +the ["Top-Down"](https://sleap.ai/tutorials/initial-training.html#training-options) +configuration, which consists of two neural networks - the first for isolating +the animal instances (by finding their centroids) and the second for predicting +all the body parts per instance. + +![Top-Down model configuration](https://sleap.ai/_images/topdown_approach.jpg) + +:::{dropdown} More on "Top-Down" vs "Bottom-Up" models +:color: info +:icon: info + +Although the "Top-Down" configuration was designed with multiple animals in mind, +it can also be used for single-animal videos. It makes sense to use it for videos +where the animal occupies a relatively small portion of the frame - see +[Troubleshooting Workflows](https://sleap.ai/guides/troubleshooting-workflows.html) for more info. +::: + +Next you need to create a SLURM batch script, which will schedule the training job +on the HPC cluster. Create a new file called `slurm_train_script.sh` +(You can do this in the terminal with `nano`/`vim` or in a text editor of +your choice on your local PC/laptop). Here we create the script in the same folder +as the training job, but you can save it anywhere you want, or even keep track of it with `git`. + +```{code-block} console +$ nano slurm_train_script.sh +``` + +An example is provided below, followed by explanations. +```{code-block} bash +:caption: slurm_train_script.sh +#!/bin/bash + +#SBATCH -p gpu # partition +#SBATCH -N 1 # number of nodes +#SBATCH --mem 12G # memory pool for all cores +#SBATCH -n 2 # number of cores +#SBATCH -t 0-04:00 # time (D-HH:MM) +#SBATCH --gres gpu:1 # request 1 GPU (of any kind) +#SBATCH -o slurm.%N.%j.out # write STDOUT +#SBATCH -e slurm.%N.%j.err # write STDERR +#SBATCH --mail-type=ALL +#SBATCH --mail-user=name@domain.com + +# Load the SLEAP module +module load SLEAP + +# Define directories for data and exported training job +DATA_DIR=/ceph/scratch/neuroinformatics-dropoff/SLEAP_HPC_test_data +JOB_DIR=$DATA_DIR/labels.v001.slp.training_job +# Go to the job directory +cd $JOB_DIR + +# Run the training script generated by SLEAP +./train-script.sh +``` + +In `nano`, you can save the file by pressing `Ctrl+O` and exit by pressing `Ctrl+X`. + +:::{dropdown} Explanation of the batch script +:color: info +:icon: info +- The `#SBATCH` lines are SLURM directives. They specify the resources needed +for the job, such as the number of nodes, CPUs, memory, etc. +A primer on the most useful SLURM arguments is provided in the [appendix](#slurm-arguments-primer). +For more information see the [SLURM documentation](https://slurm.schedmd.com/sbatch.html). + +- The `#` lines are comments. They are not executed by SLURM, but they are useful +for explaining the script to your future self and others. + +- The `module load SLEAP` line loads the latest SLEAP module and any other modules +it may depend on. + +- The `cd` line changes the working directory to the training job folder. +This is necessary because the `train-script.sh` file contains relative paths +to the model configuration and the project file. + +- The `./train-script.sh` line runs the training job (executes the contained commands). 
+::: + +Now you can submit the batch script via running the following command +(in the same directory as the script): +```{code-block} console +$ sbatch slurm_train_script.sh +Submitted batch job 3445652 +``` +:::{warning} +If you are getting a permission error, make the script files executable +by running in the terminal: + +```{code-block} console +$ chmod +x train-script.sh +$ chmod +x slurm_train_script.sh +``` + +If the scripts are not in the same folder, you will need to specify the full path: +`chmod +x /path/to/script.sh` +::: + +You may monitor the progress of the job in various ways: + +::::{tab-set} + +:::{tab-item} squeue + +View the status of the queued/running jobs with [`squeue`](https://slurm.schedmd.com/squeue.html): + +```{code-block} console +$ squeue -u +JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) +3445652 gpu slurm_ba sirmpila R 23:11 1 gpu-sr670-20 +``` +::: + +:::{tab-item} sacct + +View status of running/completed jobs with [`sacct`](https://slurm.schedmd.com/sacct.html): + +```{code-block} console +$ sacct -u +JobID JobName Partition Account AllocCPUS State ExitCode +------------ ---------- ---------- ---------- ---------- ---------- -------- +3445652 slurm_bat+ gpu swc-ac 2 COMPLETED 0:0 +3445652.bat+ batch swc-ac 2 COMPLETED 0:0 +``` +Run `sacct` with some more helpful arguments +(view jobs from the last 24 hours, including the time elapsed): + +```{code-block} console +$ sacct -u nstest \ +--starttime $(date -d '24 hours ago' +%Y-%m-%dT%H:%M:%S) \ +--endtime $(date +%Y-%m-%dT%H:%M:%S) \ +--format=JobID,JobName,Partition,AllocCPUS,State,Start,End,Elapsed,MaxRSS +``` +::: + +:::{tab-item} view the logs + +View the contents of standard output and error +(the node name and job ID will differ in each case): +```{code-block} console +$ cat slurm.gpu-sr670-20.3445652.out +$ cat slurm.gpu-sr670-20.3445652.err +``` +::: + +:::: + +### Evaluate the trained models +Upon successful completion of the training job, a `models` folder will have +been created in the training job directory. It contains one subfolder per +training run (by defalut prefixed with the date and time of the run). + +```{code-block} console +$ cd /ceph/scratch/neuroinformatics-dropoff/SLEAP_HPC_test_data +$ cd labels.v001.slp.training_job +$ cd models +$ ls -1 +230509_141357.centered_instance +230509_141357.centroid +``` + +Each subfolder holds the trained model files (e.g. `best_model.h5`), +their configurations (`training_config.json`) and some evaluation metrics. + +```{code-block} console +$ cd 230509_141357.centered_instance +$ ls -1 +best_model.h5 +initial_config.json +labels_gt.train.slp +labels_gt.val.slp +labels_pr.train.slp +labels_pr.val.slp +metrics.train.npz +metrics.val.npz +training_config.json +training_log.csv +``` +The SLEAP GUI on your local machine can be used to quickly evaluate the trained models. + +- Select "Predict" -> "Evaluation Metrics for Trained Models..." +- Click on "Add Trained Models(s)" and select the subfolder(s) containing the model(s) you want to evaluate (e.g. `230509_141357.centered_instance`). +- You can view the basic metrics on the shown table or you can also view a more detailed report (including plots) by clicking "View Metrics". + +## Model inference +By inference, we mean using a trained model to predict the labels on new frames/videos. +SLEAP provides the `sleap-track` command line utility for running inference +on a single video or a folder of videos. + +Below is an example SLURM batch script that contains a `sleap-track` call. 
+```{code-block} bash +:caption: slurm_infer_script.sh +#!/bin/bash + +#SBATCH -p gpu # partition +#SBATCH -N 1 # number of nodes +#SBATCH --mem 12G # memory pool for all cores +#SBATCH -n 2 # number of cores +#SBATCH -t 0-01:00 # time (D-HH:MM) +#SBATCH --gres gpu:1 # request 1 GPU (of any kind) +#SBATCH -o slurm.%N.%j.out # write STDOUT +#SBATCH -e slurm.%N.%j.err # write STDERR +#SBATCH --mail-type=ALL +#SBATCH --mail-user=name@domain.com + +# Load the SLEAP module +module load SLEAP + +# Define directories for data and exported training job +DATA_DIR=/ceph/scratch/neuroinformatics-dropoff/SLEAP_HPC_test_data +JOB_DIR=$DATA_DIR/labels.v001.slp.training_job +# Go to the job directory +cd $JOB_DIR + +# Run the inference command +sleap-track $DATA_DIR/videos/M708149_EPM_20200317_165049331-converted.mp4 \ + -m $JOB_DIR/models/230509_141357.centroid/training_config.json \ + -m $JOB_DIR/models/230509_141357.centered_instance/training_config.json \ + --gpu auto \ + --tracking.tracker none \ + -o labels.v001.slp.predictions.slp \ + --verbosity json \ + --no-empty-frames +``` +The script is very similar to the training script, with the following differences: +- The time limit `-t` is set lower, since inference is normally faster than training. This will however depend on the size of the video and the number of models used. +- The `./train-script.sh` line is replaced by the `sleap-track` command. +- The `\` character is used to split the long `sleap-track` command into multiple lines for readability. It is not necessary if the command is written on a single line. + +::: {dropdown} Explanation of the sleap-track arguments +:color: info +:icon: info + + Some important command line arguments are explained below. + You can view a full list of the available arguments by running `sleap-track --help`. +- The first argument is the path to the video file to be processed. +- The `-m` option is used to specify the path to the model configuration file(s) to be used for inference. In this example we use the two models that were trained above. +- The `--gpu` option is used to specify the GPU to be used for inference. The `auto` value will automatically select the GPU with the highes percentage of available memory (of the GPUs that are available on the machine/node) +- The `--tracking.tracker` option is used to specify the tracker for inference. Since in this example we only have one animal, we set it to "none". +- The `-o` option is used to specify the path to the output file containing the predictions. +- The above script will predict all the frames in the video. You may select specific frames via the `--frames` option. For example: `--frames 1-50` or `--frames 1,3,5,7,9`. +::: + +You can submit and monitor the inference job in the same way as the training job. +```{code-block} console +$ sbatch slurm_infer_script.sh +$ squeue -u +``` +Upon completion, a `labels.v001.slp.predictions.slp` file will have been created in the job directory. + +You can use the SLEAP GUI on your local machine to load and view the predictions: +"File" -> "Open Project..." -> select the `labels.v001.slp.predictions.slp` file. + +## The training-inference cycle +Now that you have some predictions, you can keep improving your models by repeating +the training-inference cycle. 
The basic steps are: +- Manually correct some of the predictions: see [Prediction-assisted labeling](https://sleap.ai/tutorials/assisted-labeling.html) +- Merge corrected labels into the initial training set: see [Merging guide](https://sleap.ai/guides/merging.html) +- Save the merged training set as `labels.v002.slp` +- Export a new training job `labels.v002.slp.training_job` (you may reuse the training configurations from `v001`) +- Repeat the training-inference cycle until satisfied + +## Troubleshooting + +### Problems with the SLEAP module + +In this section, we will describe how to test that the SLEAP module is loaded +correctly for you and that it can use the available GPUs. + +Login to the HPC cluster as described [above](#access-to-the-hpc-cluster-and-sleap-module). + +Start an interactive job on a GPU node. This step is necessary, because we need +to test the module's access to the GPU. +```{code-block} console +$ srun -p fast --gres=gpu:1 --pty bash -i +``` +:::{dropdown} Explain the above command +:color: info +:icon: info + +The `-i` stands for "interactive", while `--pty` is short for "pseudo-terminal". +Taken together, the above command will start an interactive bash terminal session +on a node of the "fast" partition, equipped with 1 GPU. +::: + +Load the SLEAP module. +```{code-block} console +$ module load SLEAP +``` + +To verify that the module was loaded successfully: +```{code-block} console +$ module list +Currently Loaded Modulefiles: + 1) SLEAP/2023-08-01 +``` +You can essentially think of the module as a centrally installed conda environment. +When it is loaded, you should be using a particular Python executable. +You can verify this by running: + +```{code-block} console +$ which python +/ceph/apps/ubuntu-20/packages/SLEAP/2023-08-01/bin/python +``` + +Finally we will verify that the `sleap` python package can be imported and can +"see" the GPU. We will mostly just follow the +[relevant SLEAP instructions](https://sleap.ai/installation.html#testing-that-things-are-working). +First, start a Python interpreter: +```{code-block} console +$ python +``` +Next, run the following Python commands: +```{code-block} pycon +>>> import sleap + +>>> sleap.versions() +SLEAP: 1.3.1 +TensorFlow: 2.8.4 +Numpy: 1.21.6 +Python: 3.7.12 +OS: Linux-5.4.0-109-generic-x86_64-with-debian-bullseye-sid + +>>> sleap.system_summary() +GPUs: 1/1 available + Device: /physical_device:GPU:0 + Available: True + Initalized: False + Memory growth: None + +>>> import tensorflow as tf + +>>> print(tf.config.list_physical_devices('GPU')) +[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')] + +>>> tf.constant("Hello world!") + +``` + +::: {warning} +The `import sleap` command may take some time to run (more than a minute). +This is normal. Subsequent imports should be faster. +::: + +If all is as expected, you can exit the Python interpreter, and then exit the GPU node +```{code-block} pycon +>>> exit() +``` +```{code-block} console +$ exit() +``` +To completely exit the HPC cluster, you will need to logout of the SSH session twice: +```bash +$ logout +$ logout +``` +See [Why do we SSH twice?](#why-do-we-ssh-twice) in the Appendix for an explanation. + +## Appendix + +### SLURM arguments primer + +Here are the most important SLURM arguments used in the above examples +in conjunction with `sbatch` or `srun`. + +**Partition (Queue)** +- Name: `--partition` +- Alias: `-p` +- Description: Specifies the partition (or queue) to submit the job to. 
In this case, the job will be submitted to the "gpu" partition. +- Example values: `gpu`, `cpu`, `fast`, `medium` + +**Job Name** +- Name: `--job-name` +- Alias: `-J` +- Description: Specifies a name for the job, which will appear in various SLURM commands and logs, making it easier to identify the job (especially when you have multiple jobs queued up) +- Example values: `training_run_24` + +**Number of Nodes** +- Name: `--nodes` +- Alias: `-N` +- Description: Defines the number of nodes required for the job. +- Example values: `1` +- Note: This should always be `1`, unless you really know what you're doing + +**Number of Cores** +- Name: `--ntasks` +- Alias: `-n` +- Description: Defines the number of cores (or tasks) required for the job. +- Example values: `1`, `4`, `8` + +**Memory Pool for All Cores** +- Name: `--mem` +- Description: Specifies the total amount of memory (RAM) required for the job across all cores (per node) +- Example values: `8G`, `16G`, `32G` + +**Time Limit** +- Name: `--time` +- Alias: `-t` +- Description: Sets the maximum time the job is allowed to run. The format is D-HH:MM, where D is days, HH is hours, and MM is minutes. +- Example values: `0-01:00` (1 hour), `0-04:00` (4 hours), `1-00:00` (1 day). +- Note: If the job exceeds the time limit, it will be terminated by SLURM. On the other hand, avoid requesting way more time than what your job needs, as this may delay its scheduling (depending on resource availability). + +**Generic Resources (GPUs)** +* Name: `--gres` +* Description: Requests generic resources, such as GPUs. +* Example values: `gpu:1`, `gpu:rtx2080:1`, `gpu:rtx5000:1`, `gpu:a100_2g.10gb:1` +* Note: No GPU will be allocated to you unless you specify it via the `--gres` argument (ecen if you are on the "GPU" partition. To request 1 GPU of any kind, use `--gres gpu:1`. To request a specific GPU type, you have to include its name, e.g. `--gres gpu:rtx2080:1`. You can view the available GPU types on the [SWC internal wiki](https://wiki.ucl.ac.uk/display/SSC/CPU+and+GPU+Platform+architecture). + +**Standard Output File** +- Name: `--output` +- Alias: `-o` +- Description: Defines the file where the standard output (STDOUT) will be written. In the examples scripts, it's set to slurm.%N.%j.out, where %N is the node name and %j is the job ID. +- Example values: `slurm.%N.%j.out`, `slurm.MyAwesomeJob.out` +- Note: this file contains the output of the commands executed by the job (i.e. the messages that normally gets printed on the terminal). + +**Standard Error File** +- Name: `--error` +- Alias: `-e` +- Description: Specifies the file where the standard error (STDERR) will be written. In the examples, it's set to slurm.%N.%j.err, where %N is the node name and %j is the job ID. +- Example values: `slurm.%N.%j.err`, `slurm.MyAwesomeJob.err` +- Note: this file is very useful for debugging, as it contains all the error messages produced by the commands executed by the job. + +**Email Notifications** +* Name: `--mail-type` +* Description: Defines the conditions under which the user will be notified by email. +Example values: `ALL`, `BEGIN`, `END`, `FAIL` + +**Email Address** +* Name: `--mail-user` +* Description: Specifies the email address to which notifications will be sent. +* Note: currently this feature does not work on the SWC HPC cluster. + +**Array jobs** +* Name: `--array` +* Description: Job array index values (a list of integers in increasing order). The task index can be accessed via the `SLURM_ARRAY_TASK_ID` environment variable. 
+* Example values: `--array=1-10`, `--array=1-100%5` (100 jobs, but only 5 of them will be allowed to run in parallel at any given time). +* Note: if an array consists of many jobs, using the `%` syntax to limit the maximum number of parallel jobs is recommended to prevent overloading the cluster. + + +### Why do we SSH twice? + +We first need to distinguish the different types of nodes on the SWC HPC system: + +- the *bastion* node (or "jump host") - `ssh.swc.ucl.ac.uk`. This serves as a single entry point to the cluster from external networks. By funneling all external SSH connections through this node, it's easier to monitor, log, and control access, reducing the attack surface. The *bastion* node has very little processing power. It can be used to submit and monitor SLURM jobs, but it shouldn't be used for anything else. +- the *gateway* node - `hpc-gw1`. This is a more powerful machine and can be used for light processing, such as editing your scripts, creating and copying files etc. However don't use it for anything computationally intensive, since this node's resources are shared across all users. +- the *compute* nodes - `enc1-node10`, `gpu-sr670-21`, etc. These are the machinces that actually run the jobs we submit, either interactively via `srun` or via batch scripts submitted with `sbatch`. + +![](../_static/swc_hpc_access_flowchart.png) + +The home directory, as well as the locations where filesystems like `ceph` are mounted, are shared across all of the nodes. + +The first `ssh` command - `ssh @ssh.swc.ucl.ac.uk` only takes you to the *bastion* node. A second command - `ssh hpc-gw1` - is needed to reach the *gateway* node. + +Similarly, if you are on the *gateway* node, typing `logout` once will only get you one layer outo the *bastion* node. You need to type `logout` again to exit the *bastion* node and return to your local machine. + +The *compute* nodes should only be accessed via the SLURM `srun` or `sbatch` commands. This can be done from either the *bastion* or the *gateway* nodes. If you are running an interactive job on one of the *compute* nodes, you can terminate it by typing `exit`. This will return you to the node from which you entered. 
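
If typing the two SSH commands every time becomes tedious, the two hops can be
configured once on your local machine. Below is a minimal sketch of an OpenSSH
client configuration (`~/.ssh/config`) that chains the hops for you. This is an
optional convenience rather than part of the official setup: it assumes an
OpenSSH client, reuses the hostnames mentioned above, and `swc-bastion` /
`swc-gateway` are arbitrary aliases. Replace `YOUR-USERNAME` with your SWC username.

```
# ~/.ssh/config (sketch, optional convenience)
Host swc-bastion
    HostName ssh.swc.ucl.ac.uk
    User YOUR-USERNAME

Host swc-gateway
    HostName hpc-gw1
    User YOUR-USERNAME
    # tunnel through the bastion node to reach the gateway node
    ProxyJump swc-bastion
```

With this in place, `ssh swc-gateway` performs both hops in one step (you will
still be prompted for your password at each hop).
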
From 389932e8c3b7509b0b3e16430701751d8767f98f Mon Sep 17 00:00:00 2001 From: niksirbi Date: Wed, 27 Sep 2023 10:48:42 +0100 Subject: [PATCH 04/29] corrected typos and clarified some spots --- docs/requirements.txt | 2 +- docs/source/conf.py | 2 +- docs/source/data_analysis/HPC-module-SLEAP.md | 233 +++++++++++------- 3 files changed, 141 insertions(+), 96 deletions(-) diff --git a/docs/requirements.txt b/docs/requirements.txt index 403d7ff..18b6962 100644 --- a/docs/requirements.txt +++ b/docs/requirements.txt @@ -4,5 +4,5 @@ nbsphinx numpydoc pydata-sphinx-theme sphinx -sphinx-design sphinx-copybutton +sphinx-design diff --git a/docs/source/conf.py b/docs/source/conf.py index dbb0368..0da9be4 100644 --- a/docs/source/conf.py +++ b/docs/source/conf.py @@ -138,4 +138,4 @@ # Configure the code block copy button # don't copy line numbers, prompts, or console outputs -copybutton_exclude = ".linenos, .gp, .go" +copybutton_exclude = ".linenos, .gp, .go" diff --git a/docs/source/data_analysis/HPC-module-SLEAP.md b/docs/source/data_analysis/HPC-module-SLEAP.md index dad6d3c..4ddcf62 100644 --- a/docs/source/data_analysis/HPC-module-SLEAP.md +++ b/docs/source/data_analysis/HPC-module-SLEAP.md @@ -7,25 +7,25 @@ :language: python ``` -This guide explains how to use the [SLEAP](https://sleap.ai/) module that is +This guide explains how to use the [SLEAP](https://sleap.ai/) module that is installed on the SWC's HPC cluster to run training and/or inference jobs. :::{warning} -Some links withing this document point to the +Some links within this document point to the [SWC internal wiki](https://wiki.ucl.ac.uk/display/SI/SWC+Intranet), which is only accessible from within the SWC network. ::: -:::{dropdown} Intepreting code blocks wihin this document +:::{dropdown} Interpreting code blocks wihin this document :color: info :icon: info -Shell commands will be shown in code blocks like this +Shell commands will be shown in code blocks like this (with the `$` sign indicating the shell prompt): ```{code-block} console $ echo "Hello world!" ``` - + Similarly, Python code blocks will appear with the `>>>` sign indicating the Python interpreter prompt: ```{code-block} pycon @@ -57,7 +57,7 @@ $ ssh @ssh.swc.ucl.ac.uk $ ssh hpc-gw1 ``` -If you are wondering about the two SSH commands, see the Appendix for +If you are wondering about the two SSH commands, see the Appendix for [Why do we SSH twice?](#why-do-we-ssh-twice). @@ -77,7 +77,7 @@ You can load the latest version by running: ```{code-block} console $ module load SLEAP ``` -If you want to load a specific version, you can do so by typing the full module name, +If you want to load a specific version, you can do so by typing the full module name, including the date e.g. `module load SLEAP/2023-03-13` If a module has been successfully loaded, it will be listed when you run `module list`, @@ -86,38 +86,38 @@ along with other modules it may depend on: ```{code-block} console $ module list Currently Loaded Modulefiles: - 1) cuda/11.8 2) SLEAP/2023-08-01 + 1) cuda/11.8 2) SLEAP/2023-08-01 ``` -If you have troubles with loading the SLEAP module, see the +If you have troubles with loading the SLEAP module, see the [Troubleshooting section](#problems-with-the-sleap-module). ### Install SLEAP on your local PC/laptop -While you can delegate the GPU-intensive work to the HPC cluster, -you will still need to do some steps, such as labelling frames, via the SLEAP GUI. 
+While you can delegate the GPU-intensive work to the HPC cluster, +you will need to use the SLEAP GUI for some steps, such as labelling frames. Thus, you also need to install SLEAP on your local PC/laptop. We recommend following the official [SLEAP installation guide](https://sleap.ai/installation.html). To be on the safe side, ensure that the local installation version matches the one on the cluster. ### Mount the SWC filesystem on your local PC/laptop -The rest of this guide assumes that you have mounted the SWC filesystem on your local PC/laptop. -If you have not done so, please follow the relevant instructions on the +The rest of this guide assumes that you have mounted the SWC filesystem on your local PC/laptop. +If you have not done so, please follow the relevant instructions on the [SWC internal wiki](https://wiki.ucl.ac.uk/display/SSC/SWC+Storage+Platform+Overview). -We will also assume that the data you are working with are stored in a `ceph` or `winstor` -directory to which you have access to. In the rest of this guide, we will use the path -`/ceph/scratch/neuroinformatics-dropoff/SLEAP_HPC_test_data` which contains a SLEAP project +We will also assume that the data you are working with are stored in a `ceph` or `winstor` +directory to which you have access to. In the rest of this guide, we will use the path +`/ceph/scratch/neuroinformatics-dropoff/SLEAP_HPC_test_data` which contains a SLEAP project for test purposes. You should replace this with the path to your own data. :::{dropdown} Data storage location matters :color: warning :icon: alert-fill -The cluster has fast acess to data stored on the `ceph` and `winstor` filesystems. -If your data is stored elsewhere, make sure to transfer it to `ceph` or `winstor` -before running the job. You can use tools such as [`rsync`](https://linux.die.net/man/1/rsync) +The cluster has fast access to data stored on the `ceph` and `winstor` filesystems. +If your data is stored elsewhere, make sure to transfer it to `ceph` or `winstor` +before running the job. You can use tools such as [`rsync`](https://linux.die.net/man/1/rsync) to copy data from your local machine to `ceph` via an ssh connection. For example: ```{code-block} console @@ -126,23 +126,23 @@ $ rsync -avz @ssh.swc.ucl.ac.uk:/ceph/scratch/neuroinf ::: ## Model training -This will consist of two parts - [preparing a training job](#prepare-the-training-job) -(on your local SLEAP installation) and [running a training job](#run-the-training-job) -(on the HPC cluster's SLEAP module). Some evaluation metrics for the trained models +This will consist of two parts - [preparing a training job](#prepare-the-training-job) +(on your local SLEAP installation) and [running a training job](#run-the-training-job) +(on the HPC cluster's SLEAP module). Some evaluation metrics for the trained models can be [viewed via the SLEAP GUI](#evaluate-the-trained-models) on your local SLEAP installation. ### Prepare the training job -Follow the SLEAP instructions for [Creating a Project](https://sleap.ai/tutorials/new-project.html) -and [Initial Labelling](https://sleap.ai/tutorials/initial-labeling.html). -Ensure that the project file (e.g. `labels.v001.slp`) is saved in the mounted SWC filesystem +Follow the SLEAP instructions for [Creating a Project](https://sleap.ai/tutorials/new-project.html) +and [Initial Labelling](https://sleap.ai/tutorials/initial-labeling.html). +Ensure that the project file (e.g. `labels.v001.slp`) is saved in the mounted SWC filesystem (as opposed to your local filesystem). 
-Next, follow the instructions in [Remote Training](https://sleap.ai/guides/remote.html#remote-training), +Next, follow the instructions in [Remote Training](https://sleap.ai/guides/remote.html#remote-training), i.e. "Predict" -> "Run Training…" -> "Export Training Job Package…". - For selecting the right configuration parameters, see [Configuring Models](https://sleap.ai/guides/choosing-models.html#) and [Troubleshooting Workflows](https://sleap.ai/guides/troubleshooting-workflows.html) - Set the "Predict On" parameter to "nothing". Remote training and inference (prediction) are easiest to run separately on the HPC Cluster. Also unselect "visualize predictions" in training settings, if it's enabled by default. - If you are working with a top-down camera view, set the "Rotation Min Angle" and "Rotation Max Angle" to -180 and 180 respectively in the "Augmentation" section. -- Make sure to save the exported training job package (e.g. `labels.v001.slp.training_job.zip`) in the mounted SWC filesystem, ideally in the same directory as the project file. +- Make sure to save the exported training job package (e.g. `labels.v001.slp.training_job.zip`) in the mounted SWC filesystem, for example, in the same directory as the project file. - Unzip the training job package. This will create a folder with the same name (minus the `.zip` extension). This folder contains everything needed to run the training job on the HPC cluster. ### Run the training job @@ -162,24 +162,26 @@ inference-script.sh jobs.yaml labels.v001.pkg.slp labels.v001.slp.predictions.slp -slurm_train_script.sh +slurm-train-script.sh swc-hpc-pose-estimation train-script.sh ``` -There should be a `train-script.sh` file created by SLEAP, which already contains the +There should be a `train-script.sh` file created by SLEAP, which already contains the commands to run the training. You can see the contents of the file by running `cat train-script.sh`: ```{code-block} bash :caption: train-script.sh +:name: train-script-sh +:linenos: #!/bin/bash sleap-train centroid.json labels.v001.pkg.slp sleap-train centered_instance.json labels.v001.pkg.slp ``` The precise commands will depend on the model configuration you chose in SLEAP. -Here we see two separate training calls, one for the "centroid" and another for -the "centered_instance" model. That's because in this example we have chosen -the ["Top-Down"](https://sleap.ai/tutorials/initial-training.html#training-options) -configuration, which consists of two neural networks - the first for isolating -the animal instances (by finding their centroids) and the second for predicting +Here we see two separate training calls, one for the "centroid" and another for +the "centered_instance" model. That's because in this example we have chosen +the ["Top-Down"](https://sleap.ai/tutorials/initial-training.html#training-options) +configuration, which consists of two neural networks - the first for isolating +the animal instances (by finding their centroids) and the second for predicting all the body parts per instance. ![Top-Down model configuration](https://sleap.ai/_images/topdown_approach.jpg) @@ -188,26 +190,28 @@ all the body parts per instance. :color: info :icon: info -Although the "Top-Down" configuration was designed with multiple animals in mind, -it can also be used for single-animal videos. 
It makes sense to use it for videos -where the animal occupies a relatively small portion of the frame - see +Although the "Top-Down" configuration was designed with multiple animals in mind, +it can also be used for single-animal videos. It makes sense to use it for videos +where the animal occupies a relatively small portion of the frame - see [Troubleshooting Workflows](https://sleap.ai/guides/troubleshooting-workflows.html) for more info. ::: -Next you need to create a SLURM batch script, which will schedule the training job -on the HPC cluster. Create a new file called `slurm_train_script.sh` -(You can do this in the terminal with `nano`/`vim` or in a text editor of -your choice on your local PC/laptop). Here we create the script in the same folder +Next you need to create a SLURM batch script, which will schedule the training job +on the HPC cluster. Create a new file called `slurm-train-script.sh` +(You can do this in the terminal with `nano`/`vim` or in a text editor of +your choice on your local PC/laptop). Here we create the script in the same folder as the training job, but you can save it anywhere you want, or even keep track of it with `git`. ```{code-block} console -$ nano slurm_train_script.sh +$ nano slurm-train-script.sh ``` An example is provided below, followed by explanations. ```{code-block} bash -:caption: slurm_train_script.sh -#!/bin/bash +:caption: slurm-train-script.sh +:name: slurm-train-script-sh +:linenos: +#!/bin/bash #SBATCH -p gpu # partition #SBATCH -N 1 # number of nodes @@ -218,7 +222,7 @@ An example is provided below, followed by explanations. #SBATCH -o slurm.%N.%j.out # write STDOUT #SBATCH -e slurm.%N.%j.err # write STDERR #SBATCH --mail-type=ALL -#SBATCH --mail-user=name@domain.com +#SBATCH --mail-user=name@domain.com # Load the SLEAP module module load SLEAP @@ -227,7 +231,7 @@ module load SLEAP DATA_DIR=/ceph/scratch/neuroinformatics-dropoff/SLEAP_HPC_test_data JOB_DIR=$DATA_DIR/labels.v001.slp.training_job # Go to the job directory -cd $JOB_DIR +cd $JOB_DIR # Run the training script generated by SLEAP ./train-script.sh @@ -238,40 +242,40 @@ In `nano`, you can save the file by pressing `Ctrl+O` and exit by pressing `Ctrl :::{dropdown} Explanation of the batch script :color: info :icon: info -- The `#SBATCH` lines are SLURM directives. They specify the resources needed -for the job, such as the number of nodes, CPUs, memory, etc. +- The `#SBATCH` lines are SLURM directives. They specify the resources needed +for the job, such as the number of nodes, CPUs, memory, etc. A primer on the most useful SLURM arguments is provided in the [appendix](#slurm-arguments-primer). For more information see the [SLURM documentation](https://slurm.schedmd.com/sbatch.html). -- The `#` lines are comments. They are not executed by SLURM, but they are useful +- The `#` lines are comments. They are not executed by SLURM, but they are useful for explaining the script to your future self and others. - + - The `module load SLEAP` line loads the latest SLEAP module and any other modules it may depend on. - + - The `cd` line changes the working directory to the training job folder. This is necessary because the `train-script.sh` file contains relative paths to the model configuration and the project file. - + - The `./train-script.sh` line runs the training job (executes the contained commands). 
::: -Now you can submit the batch script via running the following command +Now you can submit the batch script via running the following command (in the same directory as the script): ```{code-block} console -$ sbatch slurm_train_script.sh +$ sbatch slurm-train-script.sh Submitted batch job 3445652 ``` :::{warning} -If you are getting a permission error, make the script files executable +If you are getting a permission error, make the script files executable by running in the terminal: ```{code-block} console $ chmod +x train-script.sh -$ chmod +x slurm_train_script.sh +$ chmod +x slurm-train-script.sh ``` -If the scripts are not in the same folder, you will need to specify the full path: +If the scripts are not in the same folder, you will need to specify the full path: `chmod +x /path/to/script.sh` ::: @@ -281,7 +285,7 @@ You may monitor the progress of the job in various ways: :::{tab-item} squeue -View the status of the queued/running jobs with [`squeue`](https://slurm.schedmd.com/squeue.html): +View the status of the queued/running jobs with [`squeue`](https://slurm.schedmd.com/squeue.html): ```{code-block} console $ squeue -u @@ -296,12 +300,12 @@ View status of running/completed jobs with [`sacct`](https://slurm.schedmd.com/s ```{code-block} console $ sacct -u -JobID JobName Partition Account AllocCPUS State ExitCode ------------- ---------- ---------- ---------- ---------- ---------- -------- -3445652 slurm_bat+ gpu swc-ac 2 COMPLETED 0:0 +JobID JobName Partition Account AllocCPUS State ExitCode +------------ ---------- ---------- ---------- ---------- ---------- -------- +3445652 slurm_bat+ gpu swc-ac 2 COMPLETED 0:0 3445652.bat+ batch swc-ac 2 COMPLETED 0:0 ``` -Run `sacct` with some more helpful arguments +Run `sacct` with some more helpful arguments (view jobs from the last 24 hours, including the time elapsed): ```{code-block} console @@ -314,7 +318,7 @@ $ sacct -u nstest \ :::{tab-item} view the logs -View the contents of standard output and error +View the contents of standard output and error (the node name and job ID will differ in each case): ```{code-block} console $ cat slurm.gpu-sr670-20.3445652.out @@ -325,9 +329,9 @@ $ cat slurm.gpu-sr670-20.3445652.err :::: ### Evaluate the trained models -Upon successful completion of the training job, a `models` folder will have -been created in the training job directory. It contains one subfolder per -training run (by defalut prefixed with the date and time of the run). +Upon successful completion of the training job, a `models` folder will have +been created in the training job directory. It contains one subfolder per +training run (by default prefixed with the date and time of the run). ```{code-block} console $ cd /ceph/scratch/neuroinformatics-dropoff/SLEAP_HPC_test_data @@ -338,7 +342,7 @@ $ ls -1 230509_141357.centroid ``` -Each subfolder holds the trained model files (e.g. `best_model.h5`), +Each subfolder holds the trained model files (e.g. `best_model.h5`), their configurations (`training_config.json`) and some evaluation metrics. ```{code-block} console @@ -363,13 +367,15 @@ The SLEAP GUI on your local machine can be used to quickly evaluate the trained ## Model inference By inference, we mean using a trained model to predict the labels on new frames/videos. -SLEAP provides the `sleap-track` command line utility for running inference +SLEAP provides the `sleap-track` command line utility for running inference on a single video or a folder of videos. 
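
If you first want to see which options `sleap-track` accepts, you can inspect its
help text from an interactive session on the cluster (for example, one started with
`srun` as shown in the Troubleshooting section). A quick sketch:

```{code-block} console
$ module load SLEAP
$ sleap-track --help
```
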
Below is an example SLURM batch script that contains a `sleap-track` call. ```{code-block} bash -:caption: slurm_infer_script.sh -#!/bin/bash +:caption: slurm-infer-script.sh +:name: slurm-infer-script-sh +:linenos: +#!/bin/bash #SBATCH -p gpu # partition #SBATCH -N 1 # number of nodes @@ -380,7 +386,7 @@ Below is an example SLURM batch script that contains a `sleap-track` call. #SBATCH -o slurm.%N.%j.out # write STDOUT #SBATCH -e slurm.%N.%j.err # write STDERR #SBATCH --mail-type=ALL -#SBATCH --mail-user=name@domain.com +#SBATCH --mail-user=name@domain.com # Load the SLEAP module module load SLEAP @@ -389,7 +395,7 @@ module load SLEAP DATA_DIR=/ceph/scratch/neuroinformatics-dropoff/SLEAP_HPC_test_data JOB_DIR=$DATA_DIR/labels.v001.slp.training_job # Go to the job directory -cd $JOB_DIR +cd $JOB_DIR # Run the inference command sleap-track $DATA_DIR/videos/M708149_EPM_20200317_165049331-converted.mp4 \ @@ -410,11 +416,11 @@ The script is very similar to the training script, with the following difference :color: info :icon: info - Some important command line arguments are explained below. + Some important command line arguments are explained below. You can view a full list of the available arguments by running `sleap-track --help`. - The first argument is the path to the video file to be processed. - The `-m` option is used to specify the path to the model configuration file(s) to be used for inference. In this example we use the two models that were trained above. -- The `--gpu` option is used to specify the GPU to be used for inference. The `auto` value will automatically select the GPU with the highes percentage of available memory (of the GPUs that are available on the machine/node) +- The `--gpu` option is used to specify the GPU to be used for inference. The `auto` value will automatically select the GPU with the highest percentage of available memory (of the GPUs that are available on the machine/node) - The `--tracking.tracker` option is used to specify the tracker for inference. Since in this example we only have one animal, we set it to "none". - The `-o` option is used to specify the path to the output file containing the predictions. - The above script will predict all the frames in the video. You may select specific frames via the `--frames` option. For example: `--frames 1-50` or `--frames 1,3,5,7,9`. @@ -422,16 +428,16 @@ The script is very similar to the training script, with the following difference You can submit and monitor the inference job in the same way as the training job. ```{code-block} console -$ sbatch slurm_infer_script.sh +$ sbatch slurm-infer-script.sh $ squeue -u ``` -Upon completion, a `labels.v001.slp.predictions.slp` file will have been created in the job directory. +Upon completion, a `labels.v001.slp.predictions.slp` file will have been created in the job directory. -You can use the SLEAP GUI on your local machine to load and view the predictions: +You can use the SLEAP GUI on your local machine to load and view the predictions: "File" -> "Open Project..." -> select the `labels.v001.slp.predictions.slp` file. ## The training-inference cycle -Now that you have some predictions, you can keep improving your models by repeating +Now that you have some predictions, you can keep improving your models by repeating the training-inference cycle. 
The basic steps are: - Manually correct some of the predictions: see [Prediction-assisted labeling](https://sleap.ai/tutorials/assisted-labeling.html) - Merge corrected labels into the initial training set: see [Merging guide](https://sleap.ai/guides/merging.html) @@ -443,7 +449,7 @@ the training-inference cycle. The basic steps are: ### Problems with the SLEAP module -In this section, we will describe how to test that the SLEAP module is loaded +In this section, we will describe how to test that the SLEAP module is loaded correctly for you and that it can use the available GPUs. Login to the HPC cluster as described [above](#access-to-the-hpc-cluster-and-sleap-module). @@ -457,14 +463,49 @@ $ srun -p fast --gres=gpu:1 --pty bash -i :color: info :icon: info -The `-i` stands for "interactive", while `--pty` is short for "pseudo-terminal". -Taken together, the above command will start an interactive bash terminal session +* `-p fast` requests a node from the "fast" partition. This refers to the queue of nodes with a 3-hour time limit. They are meant for short jobs, such as testing. +* `--gres=gpu:1` requests 1 GPU of any kind +* `--pty` is short for "pseudo-terminal". +* The `-i` stands for "interactive" + +Taken together, the above command will start an interactive bash terminal session on a node of the "fast" partition, equipped with 1 GPU. ::: -Load the SLEAP module. +First, let's verify that you are indeed on a node equipped with a functional +GPU, by typing `nvidia-smi`: +```{code-block} console +$ nvidia-smi +Wed Sep 27 10:34:35 2023 ++-----------------------------------------------------------------------------+ +| NVIDIA-SMI 525.125.06 Driver Version: 525.125.06 CUDA Version: 12.0 | +|-------------------------------+----------------------+----------------------+ +| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | +| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | +| | | MIG M. | +|===============================+======================+======================| +| 0 NVIDIA GeForce ... Off | 00000000:41:00.0 Off | N/A | +| 0% 42C P8 22W / 240W | 1MiB / 8192MiB | 0% Default | +| | | N/A | ++-------------------------------+----------------------+----------------------+ + ++-----------------------------------------------------------------------------+ +| Processes: | +| GPU GI CI PID Type Process name GPU Memory | +| ID ID Usage | +|=============================================================================| +| No running processes found | ++-----------------------------------------------------------------------------+ +``` +Your output should look similar to the above. You will be able to see the GPU +name, temperature, memory usage, etc. If you see an error message instead, +(even though you are on a GPU node) please contact the SWC Scientific Computing team. + +Next, load the SLEAP module. ```{code-block} console $ module load SLEAP +Loading SLEAP/2023-08-01 + Loading requirement: cuda/11.8 ``` To verify that the module was loaded successfully: @@ -482,14 +523,20 @@ $ which python /ceph/apps/ubuntu-20/packages/SLEAP/2023-08-01/bin/python ``` -Finally we will verify that the `sleap` python package can be imported and can -"see" the GPU. We will mostly just follow the +Finally we will verify that the `sleap` python package can be imported and can +"see" the GPU. We will mostly just follow the [relevant SLEAP instructions](https://sleap.ai/installation.html#testing-that-things-are-working). 
First, start a Python interpreter: ```{code-block} console $ python ``` Next, run the following Python commands: + +::: {warning} +The {python}`import sleap` command may take some time to run (more than a minute). +This is normal. Subsequent imports should be faster. +::: + ```{code-block} pycon >>> import sleap @@ -504,10 +551,10 @@ OS: Linux-5.4.0-109-generic-x86_64-with-debian-bullseye-sid GPUs: 1/1 available Device: /physical_device:GPU:0 Available: True - Initalized: False + Initialized: False Memory growth: None ->>> import tensorflow as tf +>>> import tensorflow as tf >>> print(tf.config.list_physical_devices('GPU')) [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')] @@ -516,11 +563,6 @@ GPUs: 1/1 available ``` -::: {warning} -The `import sleap` command may take some time to run (more than a minute). -This is normal. Subsequent imports should be faster. -::: - If all is as expected, you can exit the Python interpreter, and then exit the GPU node ```{code-block} pycon >>> exit() @@ -528,6 +570,9 @@ If all is as expected, you can exit the Python interpreter, and then exit the GP ```{code-block} console $ exit() ``` +If you encounter troubles with using the SLEAP module, contact the +Niko Sirmpilatze of the SWC [Neuroinformatics Unit](https://neuroinformatics.dev/). + To completely exit the HPC cluster, you will need to logout of the SSH session twice: ```bash $ logout @@ -545,7 +590,7 @@ in conjunction with `sbatch` or `srun`. **Partition (Queue)** - Name: `--partition` - Alias: `-p` -- Description: Specifies the partition (or queue) to submit the job to. In this case, the job will be submitted to the "gpu" partition. +- Description: Specifies the partition (or queue) to submit the job to. In this case, the job will be submitted to the "gpu" partition. To see a list of all partitions/queues, the nodes they contain and their respective time limits, type `sinfo` when logged in to the HPC cluster. - Example values: `gpu`, `cpu`, `fast`, `medium` **Job Name** @@ -558,7 +603,7 @@ in conjunction with `sbatch` or `srun`. - Name: `--nodes` - Alias: `-N` - Description: Defines the number of nodes required for the job. -- Example values: `1` +- Example values: `1` - Note: This should always be `1`, unless you really know what you're doing **Number of Cores** @@ -576,7 +621,7 @@ in conjunction with `sbatch` or `srun`. - Name: `--time` - Alias: `-t` - Description: Sets the maximum time the job is allowed to run. The format is D-HH:MM, where D is days, HH is hours, and MM is minutes. -- Example values: `0-01:00` (1 hour), `0-04:00` (4 hours), `1-00:00` (1 day). +- Example values: `0-01:00` (1 hour), `0-04:00` (4 hours), `1-00:00` (1 day). - Note: If the job exceeds the time limit, it will be terminated by SLURM. On the other hand, avoid requesting way more time than what your job needs, as this may delay its scheduling (depending on resource availability). 
**Generic Resources (GPUs)** From 8c572bfb3cf3557c33991cc364a0d08eb9de5f4a Mon Sep 17 00:00:00 2001 From: niksirbi Date: Wed, 27 Sep 2023 10:52:11 +0100 Subject: [PATCH 05/29] temporarily enable publishing from this branch for review --- .github/workflows/docs_build_and_deploy.yml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/.github/workflows/docs_build_and_deploy.yml b/.github/workflows/docs_build_and_deploy.yml index 85d3aae..6a7b37f 100644 --- a/.github/workflows/docs_build_and_deploy.yml +++ b/.github/workflows/docs_build_and_deploy.yml @@ -8,7 +8,7 @@ name: Build Sphinx docs and deploy to GitHub Pages on: push: branches: - - main + - sleap-module tags: - '*' pull_request: @@ -38,7 +38,7 @@ jobs: needs: build_sphinx_docs permissions: contents: write - if: github.event_name == 'push' && github.ref_name == 'main' + if: github.event_name == 'push' && github.ref_name == 'sleap-module' runs-on: ubuntu-latest steps: - uses: neuroinformatics-unit/actions/deploy_sphinx_docs@v2 From bd49d714b91d1dc6fc7f4a2958f491f7eba3dd44 Mon Sep 17 00:00:00 2001 From: niksirbi Date: Tue, 10 Oct 2023 20:07:05 +0100 Subject: [PATCH 06/29] updated SLEAP module guide --- docs/source/data_analysis/HPC-module-SLEAP.md | 263 ++++++++---------- 1 file changed, 113 insertions(+), 150 deletions(-) diff --git a/docs/source/data_analysis/HPC-module-SLEAP.md b/docs/source/data_analysis/HPC-module-SLEAP.md index 4ddcf62..862f22c 100644 --- a/docs/source/data_analysis/HPC-module-SLEAP.md +++ b/docs/source/data_analysis/HPC-module-SLEAP.md @@ -1,43 +1,10 @@ # Use the SLEAP module on the HPC cluster -```{role} bash(code) -:language: bash -``` -```{role} python(code) -:language: python -``` - -This guide explains how to use the [SLEAP](https://sleap.ai/) module that is -installed on the SWC's HPC cluster to run training and/or inference jobs. - -:::{warning} -Some links within this document point to the -[SWC internal wiki](https://wiki.ucl.ac.uk/display/SI/SWC+Intranet), -which is only accessible from within the SWC network. -::: - -:::{dropdown} Interpreting code blocks wihin this document -:color: info -:icon: info - -Shell commands will be shown in code blocks like this -(with the `$` sign indicating the shell prompt): -```{code-block} console -$ echo "Hello world!" +```{include} ../_static/swc-wiki-warning.md ``` -Similarly, Python code blocks will appear with the `>>>` sign indicating the -Python interpreter prompt: -```{code-block} pycon ->>> print("Hello world!") -``` - -The expected outputs of both shell and Python commands will be shown without -any prompt: -```{code-block} console -Hello world! +```{include} ../_static/code-blocks-note.md ``` -::: ## Abbreviations | Acronym | Meaning | @@ -50,35 +17,29 @@ Hello world! ## Prerequisites -### Access to the HPC cluster and SLEAP module +### Access to the HPC cluster Verify that you can access HPC gateway node (typing your `` both times when prompted): ```{code-block} console $ ssh @ssh.swc.ucl.ac.uk $ ssh hpc-gw1 ``` +To learn more about accessing the HPC via SSH, see the [relevant how-to guide](../programming/SSH-SWC-cluster.md). -If you are wondering about the two SSH commands, see the Appendix for -[Why do we SSH twice?](#why-do-we-ssh-twice). 
- - -SLEAP should be listed among the available modules: +### Access to the SLEAP module +Once you are on the HPC gateway node, SLEAP should be listed among the available modules when you run `module avail`: ```{code-block} console $ module avail -SLEAP/2023-08-01 SLEAP/2023-03-13 +SLEAP/2023-08-01 ``` +- `SLEAP/2023-03-13` corresponds to `sleap v.1.2.9` +- `SLEAP/2023-08-01` corresponds to `sleap v.1.3.1` -`SLEAP/2023-03-13` corresponds to `sleap v.1.2.9` whereas `SLEAP/2023-08-01` is `v1.3.1`. -We recommend using the latter. - -You can load the latest version by running: - -```{code-block} console -$ module load SLEAP -``` -If you want to load a specific version, you can do so by typing the full module name, -including the date e.g. `module load SLEAP/2023-03-13` +We recommend always using the latest version, which is the one loaded by default +when you run `module load SLEAP`. If you want to load a specific version, +you can do so by typing the full module name, +including the date e.g. `module load SLEAP/2023-03-13`. If a module has been successfully loaded, it will be listed when you run `module list`, along with other modules it may depend on: @@ -89,8 +50,8 @@ Currently Loaded Modulefiles: 1) cuda/11.8 2) SLEAP/2023-08-01 ``` -If you have troubles with loading the SLEAP module, see the -[Troubleshooting section](#problems-with-the-sleap-module). +If you have troubles with loading the SLEAP module, +see this guide's [Troubleshooting section](#problems-with-the-sleap-module). ### Install SLEAP on your local PC/laptop @@ -138,10 +99,10 @@ Ensure that the project file (e.g. `labels.v001.slp`) is saved in the mounted SW (as opposed to your local filesystem). Next, follow the instructions in [Remote Training](https://sleap.ai/guides/remote.html#remote-training), -i.e. "Predict" -> "Run Training…" -> "Export Training Job Package…". +i.e. *Predict* -> *Run Training…* -> *Export Training Job Package…*. - For selecting the right configuration parameters, see [Configuring Models](https://sleap.ai/guides/choosing-models.html#) and [Troubleshooting Workflows](https://sleap.ai/guides/troubleshooting-workflows.html) -- Set the "Predict On" parameter to "nothing". Remote training and inference (prediction) are easiest to run separately on the HPC Cluster. Also unselect "visualize predictions" in training settings, if it's enabled by default. -- If you are working with a top-down camera view, set the "Rotation Min Angle" and "Rotation Max Angle" to -180 and 180 respectively in the "Augmentation" section. +- Set the *Predict On* parameter to *nothing*. Remote training and inference (prediction) are easiest to run separately on the HPC Cluster. Also unselect *Visualize Predictions During Training* in training settings, if it's enabled by default. +- If you are working with a top-down camera view, set the *Rotation Min Angle* and *Rotation Max Angle* to -180 and 180 respectively in the *Augmentation* section. - Make sure to save the exported training job package (e.g. `labels.v001.slp.training_job.zip`) in the mounted SWC filesystem, for example, in the same directory as the project file. - Unzip the training job package. This will create a folder with the same name (minus the `.zip` extension). This folder contains everything needed to run the training job on the HPC cluster. 
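If you prefer to do the unzipping on the cluster rather than on your local machine, a minimal sketch is shown below. It assumes the package was saved to the example test-data directory used throughout this guide and that the `unzip` utility is available on the node you are logged into; adjust the paths to your own project.

```{code-block} console
$ cd /ceph/scratch/neuroinformatics-dropoff/SLEAP_HPC_test_data
$ unzip labels.v001.slp.training_job.zip -d labels.v001.slp.training_job
```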
@@ -153,6 +114,7 @@ $ ssh hpc-gw1 ``` Navigate to the training job folder (replace with your own path) and list its contents: ```{code-block} console +:emphasize-lines: 12 $ cd /ceph/scratch/neuroinformatics-dropoff/SLEAP_HPC_test_data $ cd labels.v001.slp.training_job $ ls -1 @@ -162,14 +124,14 @@ inference-script.sh jobs.yaml labels.v001.pkg.slp labels.v001.slp.predictions.slp -slurm-train-script.sh +train_slurm.sh swc-hpc-pose-estimation train-script.sh ``` There should be a `train-script.sh` file created by SLEAP, which already contains the commands to run the training. You can see the contents of the file by running `cat train-script.sh`: ```{code-block} bash -:caption: train-script.sh +:caption: labels.v001.slp.training_job/train-script.sh :name: train-script-sh :linenos: #!/bin/bash @@ -177,61 +139,64 @@ sleap-train centroid.json labels.v001.pkg.slp sleap-train centered_instance.json labels.v001.pkg.slp ``` The precise commands will depend on the model configuration you chose in SLEAP. -Here we see two separate training calls, one for the "centroid" and another for -the "centered_instance" model. That's because in this example we have chosen -the ["Top-Down"](https://sleap.ai/tutorials/initial-training.html#training-options) +Here we see two separate training calls, one for the 'centroid' and another for +the 'centered_instance' model. That's because in this example we have chosen +the ['Top-Down'](https://sleap.ai/tutorials/initial-training.html#training-options) configuration, which consists of two neural networks - the first for isolating the animal instances (by finding their centroids) and the second for predicting all the body parts per instance. ![Top-Down model configuration](https://sleap.ai/_images/topdown_approach.jpg) -:::{dropdown} More on "Top-Down" vs "Bottom-Up" models +:::{dropdown} More on 'Top-Down' vs 'Bottom-Up' models :color: info :icon: info -Although the "Top-Down" configuration was designed with multiple animals in mind, +Although the 'Top-Down' configuration was designed with multiple animals in mind, it can also be used for single-animal videos. It makes sense to use it for videos where the animal occupies a relatively small portion of the frame - see [Troubleshooting Workflows](https://sleap.ai/guides/troubleshooting-workflows.html) for more info. ::: Next you need to create a SLURM batch script, which will schedule the training job -on the HPC cluster. Create a new file called `slurm-train-script.sh` -(You can do this in the terminal with `nano`/`vim` or in a text editor of +on the HPC cluster. Create a new file called `train_slurm.sh` +(you can do this in the terminal with `nano`/`vim` or in a text editor of your choice on your local PC/laptop). Here we create the script in the same folder as the training job, but you can save it anywhere you want, or even keep track of it with `git`. ```{code-block} console -$ nano slurm-train-script.sh +$ nano train_slurm.sh ``` An example is provided below, followed by explanations. 
```{code-block} bash -:caption: slurm-train-script.sh -:name: slurm-train-script-sh +:caption: train_slurm.sh +:name: train-slurm-sh :linenos: #!/bin/bash -#SBATCH -p gpu # partition +#SBATCH -J slp_train # job name +#SBATCH -p gpu # partition (queue) #SBATCH -N 1 # number of nodes #SBATCH --mem 12G # memory pool for all cores -#SBATCH -n 2 # number of cores -#SBATCH -t 0-04:00 # time (D-HH:MM) +#SBATCH -n 4 # number of cores +#SBATCH -t 0-06:00 # time (D-HH:MM) #SBATCH --gres gpu:1 # request 1 GPU (of any kind) -#SBATCH -o slurm.%N.%j.out # write STDOUT -#SBATCH -e slurm.%N.%j.err # write STDERR +#SBATCH -o slurm.%N.%j.out # STDOUT +#SBATCH -e slurm.%N.%j.err # STDERR #SBATCH --mail-type=ALL -#SBATCH --mail-user=name@domain.com +#SBATCH --mail-user=user@domain.com # Load the SLEAP module module load SLEAP -# Define directories for data and exported training job -DATA_DIR=/ceph/scratch/neuroinformatics-dropoff/SLEAP_HPC_test_data -JOB_DIR=$DATA_DIR/labels.v001.slp.training_job +# Define directories for SLEAP project and exported training job +SLP_DIR=/ceph/scratch/neuroinformatics-dropoff/SLEAP_HPC_test_data +SLP_JOB_NAME=labels.v001.slp.training_job +SLP_JOB_DIR=$SLP_DIR/$SLP_JOB_NAME + # Go to the job directory -cd $JOB_DIR +cd $SLP_JOB_DIR # Run the training script generated by SLEAP ./train-script.sh @@ -263,7 +228,7 @@ to the model configuration and the project file. Now you can submit the batch script via running the following command (in the same directory as the script): ```{code-block} console -$ sbatch slurm-train-script.sh +$ sbatch train_slurm.sh Submitted batch job 3445652 ``` :::{warning} @@ -272,7 +237,7 @@ by running in the terminal: ```{code-block} console $ chmod +x train-script.sh -$ chmod +x slurm-train-script.sh +$ chmod +x train_slurm.sh ``` If the scripts are not in the same folder, you will need to specify the full path: @@ -288,9 +253,9 @@ You may monitor the progress of the job in various ways: View the status of the queued/running jobs with [`squeue`](https://slurm.schedmd.com/squeue.html): ```{code-block} console -$ squeue -u +$ squeue --me JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) -3445652 gpu slurm_ba sirmpila R 23:11 1 gpu-sr670-20 +3445652 gpu slp_train sirmpila R 23:11 1 gpu-sr670-20 ``` ::: @@ -299,20 +264,29 @@ JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) View status of running/completed jobs with [`sacct`](https://slurm.schedmd.com/sacct.html): ```{code-block} console -$ sacct -u +$ sacct JobID JobName Partition Account AllocCPUS State ExitCode ------------ ---------- ---------- ---------- ---------- ---------- -------- -3445652 slurm_bat+ gpu swc-ac 2 COMPLETED 0:0 +3445652 slp_train gpu swc-ac 2 COMPLETED 0:0 3445652.bat+ batch swc-ac 2 COMPLETED 0:0 ``` -Run `sacct` with some more helpful arguments -(view jobs from the last 24 hours, including the time elapsed): +Run `sacct` with some more helpful arguments. 
+For example, you can view jobs from the last 24 hours, displaying the time +elapsed and the peak memory usage in KB (MaxRSS): ```{code-block} console -$ sacct -u nstest \ ---starttime $(date -d '24 hours ago' +%Y-%m-%dT%H:%M:%S) \ ---endtime $(date +%Y-%m-%dT%H:%M:%S) \ ---format=JobID,JobName,Partition,AllocCPUS,State,Start,End,Elapsed,MaxRSS +$ sacct \ + --starttime $(date -d '24 hours ago' +%Y-%m-%dT%H:%M:%S) \ + --endtime $(date +%Y-%m-%dT%H:%M:%S) \ + --format=JobID,JobName,Partition,State,Start,Elapsed,MaxRSS + +JobID JobName Partition State Start Elapsed MaxRSS +------------ ---------- ---------- ---------- ------------------- ---------- ---------- +4043595 slp_infer gpu FAILED 2023-10-10T18:14:31 00:00:35 +4043595.bat+ batch FAILED 2023-10-10T18:14:31 00:00:35 271104K +4043603 slp_infer gpu FAILED 2023-10-10T18:27:32 00:01:37 +4043603.bat+ batch FAILED 2023-10-10T18:27:32 00:01:37 423476K +4043611 slp_infer gpu PENDING Unknown 00:00:00 ``` ::: @@ -361,9 +335,9 @@ training_log.csv ``` The SLEAP GUI on your local machine can be used to quickly evaluate the trained models. -- Select "Predict" -> "Evaluation Metrics for Trained Models..." -- Click on "Add Trained Models(s)" and select the subfolder(s) containing the model(s) you want to evaluate (e.g. `230509_141357.centered_instance`). -- You can view the basic metrics on the shown table or you can also view a more detailed report (including plots) by clicking "View Metrics". +- Select *Predict* -> *Evaluation Metrics for Trained Models...* +- Click on *Add Trained Models(s)* and select the subfolder(s) containing the model(s) you want to evaluate (e.g. `230509_141357.centered_instance`). +- You can view the basic metrics on the shown table or you can also view a more detailed report (including plots) by clicking *View Metrics*. ## Model inference By inference, we mean using a trained model to predict the labels on new frames/videos. @@ -372,38 +346,45 @@ on a single video or a folder of videos. Below is an example SLURM batch script that contains a `sleap-track` call. 
```{code-block} bash -:caption: slurm-infer-script.sh -:name: slurm-infer-script-sh +:caption: infer_slurm.sh +:name: infer-slurm-sh :linenos: #!/bin/bash +#SBATCH -J slp_infer # job name #SBATCH -p gpu # partition #SBATCH -N 1 # number of nodes #SBATCH --mem 12G # memory pool for all cores #SBATCH -n 2 # number of cores -#SBATCH -t 0-01:00 # time (D-HH:MM) +#SBATCH -t 0-02:00 # time (D-HH:MM) #SBATCH --gres gpu:1 # request 1 GPU (of any kind) #SBATCH -o slurm.%N.%j.out # write STDOUT #SBATCH -e slurm.%N.%j.err # write STDERR #SBATCH --mail-type=ALL -#SBATCH --mail-user=name@domain.com +#SBATCH --mail-user=user@domain.com # Load the SLEAP module module load SLEAP -# Define directories for data and exported training job -DATA_DIR=/ceph/scratch/neuroinformatics-dropoff/SLEAP_HPC_test_data -JOB_DIR=$DATA_DIR/labels.v001.slp.training_job +# Define directories for SLEAP project and exported training job +SLP_DIR=/ceph/scratch/neuroinformatics-dropoff/SLEAP_HPC_test_data +VIDEO_DIR=$SLP_DIR/videos +SLP_JOB_NAME=labels.v001.slp.training_job +SLP_JOB_DIR=$SLP_DIR/$SLP_JOB_NAME + # Go to the job directory -cd $JOB_DIR +cd $SLP_JOB_DIR +# Make a directory to store the predictions +mkdir -p predictions # Run the inference command -sleap-track $DATA_DIR/videos/M708149_EPM_20200317_165049331-converted.mp4 \ - -m $JOB_DIR/models/230509_141357.centroid/training_config.json \ - -m $JOB_DIR/models/230509_141357.centered_instance/training_config.json \ +sleap-track $VIDEO_DIR/videos/M708149_EPM_20200317_165049331-converted.mp4 \ + -m $SLP_JOB_DIR/models/231010_164307.centroid/training_config.json \ + -m $SLP_JOB_DIR/models/231010_164307.centered_instance/training_config.json \ --gpu auto \ - --tracking.tracker none \ - -o labels.v001.slp.predictions.slp \ + --tracking.tracker simple \ + --tracking.post_connect_single_breaks 1 \ + -o predictions/labels.v001.slp.predictions.slp \ --verbosity json \ --no-empty-frames ``` @@ -421,20 +402,20 @@ The script is very similar to the training script, with the following difference - The first argument is the path to the video file to be processed. - The `-m` option is used to specify the path to the model configuration file(s) to be used for inference. In this example we use the two models that were trained above. - The `--gpu` option is used to specify the GPU to be used for inference. The `auto` value will automatically select the GPU with the highest percentage of available memory (of the GPUs that are available on the machine/node) -- The `--tracking.tracker` option is used to specify the tracker for inference. Since in this example we only have one animal, we set it to "none". +- The options starting with `--tracking` specify parameters used for tracking the detected instances (animals) across frames. See SLEAP's guide on [tracking methods](https://sleap.ai/guides/proofreading.html#tracking-method-details) for more info. - The `-o` option is used to specify the path to the output file containing the predictions. - The above script will predict all the frames in the video. You may select specific frames via the `--frames` option. For example: `--frames 1-50` or `--frames 1,3,5,7,9`. ::: You can submit and monitor the inference job in the same way as the training job. ```{code-block} console -$ sbatch slurm-infer-script.sh -$ squeue -u +$ sbatch infer_slurm.sh +$ squeue --me ``` Upon completion, a `labels.v001.slp.predictions.slp` file will have been created in the job directory. 
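If you need to predict on several videos, one option is to wrap the `sleap-track` call from the example batch script in a shell loop. The following is a hypothetical sketch, not part of the exported training job package: it reuses the `VIDEO_DIR` and `SLP_JOB_DIR` variables and the model paths from `infer_slurm.sh` above, and writes one predictions file per video. Adjust the file pattern, model folders and tracking options to your own project.

```{code-block} bash
# Hypothetical loop over all .mp4 files in $VIDEO_DIR, reusing the variables
# defined in infer_slurm.sh; each video gets its own predictions file.
for video in "$VIDEO_DIR"/*.mp4; do
    name=$(basename "$video" .mp4)
    sleap-track "$video" \
        -m "$SLP_JOB_DIR/models/231010_164307.centroid/training_config.json" \
        -m "$SLP_JOB_DIR/models/231010_164307.centered_instance/training_config.json" \
        --gpu auto \
        -o "predictions/${name}.predictions.slp" \
        --verbosity json \
        --no-empty-frames
done
```

For larger batches, a SLURM array job (see the `--array` argument in the SLURM arguments primer) may be a better fit, as it lets the scheduler run several videos in parallel.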
You can use the SLEAP GUI on your local machine to load and view the predictions: -"File" -> "Open Project..." -> select the `labels.v001.slp.predictions.slp` file. +*File* -> *Open Project...* -> select the `labels.v001.slp.predictions.slp` file. ## The training-inference cycle Now that you have some predictions, you can keep improving your models by repeating @@ -452,7 +433,7 @@ the training-inference cycle. The basic steps are: In this section, we will describe how to test that the SLEAP module is loaded correctly for you and that it can use the available GPUs. -Login to the HPC cluster as described [above](#access-to-the-hpc-cluster-and-sleap-module). +Login to the HPC cluster as described [above](#access-to-the-hpc-cluster). Start an interactive job on a GPU node. This step is necessary, because we need to test the module's access to the GPU. @@ -463,13 +444,13 @@ $ srun -p fast --gres=gpu:1 --pty bash -i :color: info :icon: info -* `-p fast` requests a node from the "fast" partition. This refers to the queue of nodes with a 3-hour time limit. They are meant for short jobs, such as testing. +* `-p fast` requests a node from the 'fast' partition. This refers to the queue of nodes with a 3-hour time limit. They are meant for short jobs, such as testing. * `--gres=gpu:1` requests 1 GPU of any kind -* `--pty` is short for "pseudo-terminal". -* The `-i` stands for "interactive" +* `--pty` is short for 'pseudo-terminal'. +* The `-i` stands for 'interactive' Taken together, the above command will start an interactive bash terminal session -on a node of the "fast" partition, equipped with 1 GPU. +on a node of the 'fast' partition, equipped with 1 GPU. ::: First, let's verify that you are indeed on a node equipped with a functional @@ -524,7 +505,7 @@ $ which python ``` Finally we will verify that the `sleap` python package can be imported and can -"see" the GPU. We will mostly just follow the +'see' the GPU. We will mostly just follow the [relevant SLEAP instructions](https://sleap.ai/installation.html#testing-that-things-are-working). First, start a Python interpreter: ```{code-block} console @@ -533,7 +514,7 @@ $ python Next, run the following Python commands: ::: {warning} -The {python}`import sleap` command may take some time to run (more than a minute). +The `import sleap` command may take some time to run (more than a minute). This is normal. Subsequent imports should be faster. ::: @@ -578,7 +559,8 @@ To completely exit the HPC cluster, you will need to logout of the SSH session $ logout $ logout ``` -See [Why do we SSH twice?](#why-do-we-ssh-twice) in the Appendix for an explanation. +See [Set up SSH for the SWC HPC cluster](../programming/SSH-SWC-cluster.md) +for more information. ## Appendix @@ -590,7 +572,7 @@ in conjunction with `sbatch` or `srun`. **Partition (Queue)** - Name: `--partition` - Alias: `-p` -- Description: Specifies the partition (or queue) to submit the job to. In this case, the job will be submitted to the "gpu" partition. To see a list of all partitions/queues, the nodes they contain and their respective time limits, type `sinfo` when logged in to the HPC cluster. +- Description: Specifies the partition (or queue) to submit the job to. To see a list of all partitions/queues, the nodes they contain and their respective time limits, type `sinfo` when logged in to the HPC cluster. - Example values: `gpu`, `cpu`, `fast`, `medium` **Job Name** @@ -628,7 +610,7 @@ in conjunction with `sbatch` or `srun`. 
* Name: `--gres` * Description: Requests generic resources, such as GPUs. * Example values: `gpu:1`, `gpu:rtx2080:1`, `gpu:rtx5000:1`, `gpu:a100_2g.10gb:1` -* Note: No GPU will be allocated to you unless you specify it via the `--gres` argument (ecen if you are on the "GPU" partition. To request 1 GPU of any kind, use `--gres gpu:1`. To request a specific GPU type, you have to include its name, e.g. `--gres gpu:rtx2080:1`. You can view the available GPU types on the [SWC internal wiki](https://wiki.ucl.ac.uk/display/SSC/CPU+and+GPU+Platform+architecture). +* Note: No GPU will be allocated to you unless you specify it via the `--gres` argument (even if you are on the 'gpu' partition). To request 1 GPU of any kind, use `--gres gpu:1`. To request a specific GPU type, you have to include its name, e.g. `--gres gpu:rtx2080:1`. You can view the available GPU types on the [SWC internal wiki](https://wiki.ucl.ac.uk/display/SSC/CPU+and+GPU+Platform+architecture). **Standard Output File** - Name: `--output` @@ -645,36 +627,17 @@ in conjunction with `sbatch` or `srun`. - Note: this file is very useful for debugging, as it contains all the error messages produced by the commands executed by the job. **Email Notifications** -* Name: `--mail-type` -* Description: Defines the conditions under which the user will be notified by email. -Example values: `ALL`, `BEGIN`, `END`, `FAIL` +- Name: `--mail-type` +- Description: Defines the conditions under which the user will be notified by email. +- Example values: `ALL`, `BEGIN`, `END`, `FAIL` **Email Address** -* Name: `--mail-user` -* Description: Specifies the email address to which notifications will be sent. -* Note: currently this feature does not work on the SWC HPC cluster. +- Name: `--mail-user` +- Description: Specifies the email address to which notifications will be sent. +- Example values: `user@domain.com` **Array jobs** -* Name: `--array` -* Description: Job array index values (a list of integers in increasing order). The task index can be accessed via the `SLURM_ARRAY_TASK_ID` environment variable. -* Example values: `--array=1-10`, `--array=1-100%5` (100 jobs, but only 5 of them will be allowed to run in parallel at any given time). -* Note: if an array consists of many jobs, using the `%` syntax to limit the maximum number of parallel jobs is recommended to prevent overloading the cluster. - - -### Why do we SSH twice? - -We first need to distinguish the different types of nodes on the SWC HPC system: - -- the *bastion* node (or "jump host") - `ssh.swc.ucl.ac.uk`. This serves as a single entry point to the cluster from external networks. By funneling all external SSH connections through this node, it's easier to monitor, log, and control access, reducing the attack surface. The *bastion* node has very little processing power. It can be used to submit and monitor SLURM jobs, but it shouldn't be used for anything else. -- the *gateway* node - `hpc-gw1`. This is a more powerful machine and can be used for light processing, such as editing your scripts, creating and copying files etc. However don't use it for anything computationally intensive, since this node's resources are shared across all users. -- the *compute* nodes - `enc1-node10`, `gpu-sr670-21`, etc. These are the machinces that actually run the jobs we submit, either interactively via `srun` or via batch scripts submitted with `sbatch`. 
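Putting the three node types together, a typical interactive workflow looks roughly like the sketch below, assembled from commands that appear elsewhere in this guide (`<username>` is a placeholder for your SWC username):

```{code-block} console
$ ssh <username>@ssh.swc.ucl.ac.uk         # 1. land on the bastion node
$ ssh hpc-gw1                              # 2. hop to the gateway node
$ srun -p fast --gres=gpu:1 --pty bash -i  # 3. open an interactive shell on a compute node
```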
- -![](../_static/swc_hpc_access_flowchart.png) - -The home directory, as well as the locations where filesystems like `ceph` are mounted, are shared across all of the nodes. - -The first `ssh` command - `ssh @ssh.swc.ucl.ac.uk` only takes you to the *bastion* node. A second command - `ssh hpc-gw1` - is needed to reach the *gateway* node. - -Similarly, if you are on the *gateway* node, typing `logout` once will only get you one layer outo the *bastion* node. You need to type `logout` again to exit the *bastion* node and return to your local machine. - -The *compute* nodes should only be accessed via the SLURM `srun` or `sbatch` commands. This can be done from either the *bastion* or the *gateway* nodes. If you are running an interactive job on one of the *compute* nodes, you can terminate it by typing `exit`. This will return you to the node from which you entered. +- Name: `--array` +- Description: Job array index values (a list of integers in increasing order). The task index can be accessed via the `SLURM_ARRAY_TASK_ID` environment variable. +- Example values: `--array=1-10` (10 jobs), `--array=1-100%5` (100 jobs, but only 5 of them will be allowed to run in parallel at any given time). +- Note: if an array consists of many jobs, using the `%` syntax to limit the maximum number of parallel jobs is recommended to prevent overloading the cluster. From 981eb18b926761207979706d6afdeb1f248bb0ac Mon Sep 17 00:00:00 2001 From: niksirbi Date: Thu, 9 Nov 2023 15:37:24 +0000 Subject: [PATCH 07/29] updated local sleap installation instructions --- docs/source/data_analysis/HPC-module-SLEAP.md | 46 +++++++++++++++---- 1 file changed, 37 insertions(+), 9 deletions(-) diff --git a/docs/source/data_analysis/HPC-module-SLEAP.md b/docs/source/data_analysis/HPC-module-SLEAP.md index 862f22c..d181bd3 100644 --- a/docs/source/data_analysis/HPC-module-SLEAP.md +++ b/docs/source/data_analysis/HPC-module-SLEAP.md @@ -7,13 +7,13 @@ ``` ## Abbreviations -| Acronym | Meaning | -| --- | --- | -| SLEAP | Social LEAP Estimates Animal Poses | -| SWC | Sainsbury Wellcome Centre | -| HPC | High Performance Computing | -| SLURM | Simple Linux Utility for Resource Management | -| GUI | Graphical User Interface | +| Acronym | Meaning | +| ------- | -------------------------------------------- | +| SLEAP | Social LEAP Estimates Animal Poses | +| SWC | Sainsbury Wellcome Centre | +| HPC | High Performance Computing | +| SLURM | Simple Linux Utility for Resource Management | +| GUI | Graphical User Interface | ## Prerequisites @@ -30,8 +30,10 @@ Once you are on the HPC gateway node, SLEAP should be listed among the available ```{code-block} console $ module avail +... SLEAP/2023-03-13 SLEAP/2023-08-01 +... ``` - `SLEAP/2023-03-13` corresponds to `sleap v.1.2.9` - `SLEAP/2023-08-01` corresponds to `sleap v.1.3.1` @@ -59,8 +61,34 @@ While you can delegate the GPU-intensive work to the HPC cluster, you will need to use the SLEAP GUI for some steps, such as labelling frames. Thus, you also need to install SLEAP on your local PC/laptop. -We recommend following the official [SLEAP installation guide](https://sleap.ai/installation.html). -To be on the safe side, ensure that the local installation version matches the one on the cluster. +We recommend following the official [SLEAP installation guide](https://sleap.ai/installation.html). 
If you already have `conda` installed, you may skip the `mamba` installation steps and opt for installing the `libmamba-solver` for `conda`: + +```{code-block} console +$ conda install -n base conda-libmamba-solver +$ conda config --set solver libmamba +``` +This will get you the much faster dependency resolution that `mamba` provides, without having to install `mamba` itself. +From `conda` version 23.10 onwards (released in November 2023), `libmamba-solver` [is anyway the default](https://conda.org/blog/2023-11-06-conda-23-10-0-release/). + +After that, you can follow the [rest of the SLEAP installation guide](https://sleap.ai/installation.html#conda-package), substituting `conda` for `mamba` in the relevant commands. + +::::{tab-set} + +:::{tab-item} Windows and Linux +```{code-block} console +$ conda create -y -n sleap -c conda-forge -c nvidia -c sleap -c anaconda sleap=1.3.1 +``` +::: + +:::{tab-item} MacOS X and Apple Silicon +```{code-block} console +$ conda create -y -n sleap -c conda-forge -c anaconda -c sleap sleap=1.3.1 +``` +::: + +:::: + +You may exchange `sleap=1.3.1` for other versions. To be on the safe side, ensure that your local installation version matches (or is at least close to) the one installed in the cluster module. ### Mount the SWC filesystem on your local PC/laptop The rest of this guide assumes that you have mounted the SWC filesystem on your local PC/laptop. From 4426663f99e636cc2b33178c325c0b5a60f8f806 Mon Sep 17 00:00:00 2001 From: niksirbi Date: Tue, 14 Nov 2023 14:38:27 +0000 Subject: [PATCH 08/29] fixed broken links --- docs/source/conf.py | 29 ++++++++++++++++++----------- 1 file changed, 18 insertions(+), 11 deletions(-) diff --git a/docs/source/conf.py b/docs/source/conf.py index 0da9be4..d7aa3ee 100644 --- a/docs/source/conf.py +++ b/docs/source/conf.py @@ -31,17 +31,17 @@ # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom # ones. extensions = [ - 'sphinx.ext.githubpages', - 'sphinx.ext.autodoc', - 'sphinx.ext.autosummary', - 'sphinx.ext.viewcode', - 'sphinx.ext.intersphinx', - 'sphinx.ext.napoleon', - 'sphinx_design', - 'sphinx_copybutton', - 'myst_parser', - 'numpydoc', - 'nbsphinx', + "sphinx.ext.githubpages", + "sphinx.ext.autodoc", + "sphinx.ext.autosummary", + "sphinx.ext.viewcode", + "sphinx.ext.intersphinx", + "sphinx.ext.napoleon", + "sphinx_design", + "sphinx_copybutton", + "myst_parser", + "numpydoc", + "nbsphinx", ] # Configure the myst parser to enable cool markdown features @@ -93,19 +93,24 @@ # html_theme = "pydata_sphinx_theme" html_title = "HowTo" +html_theme = "pydata_sphinx_theme" +html_title = "HowTo" # Redirect the webpage to another URL # Sphinx will create the appropriate CNAME file in the build directory # https://www.sphinx-doc.org/en/master/usage/extensions/githubpages.html html_baseurl = "https://howto.neuroinformatics.dev/" +html_baseurl = "https://howto.neuroinformatics.dev/" # Add any paths that contain custom static files (such as style sheets) here, # relative to this directory. They are copied after the builtin static files, # so a file named "default.css" will overwrite the builtin "default.css". 
html_static_path = ["_static"] +html_static_path = ["_static"] html_css_files = [ "css/custom.css", + "css/custom.css", ] html_favicon = "_static/logo_light.png" @@ -124,6 +129,8 @@ "type": "fontawesome", } ], + "logo": { + ], "logo": { "text": "HowTo", "image_light": "logo_light.png", From b668e3dea5154de855f45354a943bb8565d5de71 Mon Sep 17 00:00:00 2001 From: niksirbi Date: Tue, 14 Nov 2023 14:48:29 +0000 Subject: [PATCH 09/29] make link to ssh how-to guide explicit --- docs/source/data_analysis/HPC-module-SLEAP.md | 2 +- docs/source/programming/SSH-SWC-cluster.md | 17 +++++++++-------- 2 files changed, 10 insertions(+), 9 deletions(-) diff --git a/docs/source/data_analysis/HPC-module-SLEAP.md b/docs/source/data_analysis/HPC-module-SLEAP.md index d181bd3..89c8bef 100644 --- a/docs/source/data_analysis/HPC-module-SLEAP.md +++ b/docs/source/data_analysis/HPC-module-SLEAP.md @@ -23,7 +23,7 @@ Verify that you can access HPC gateway node (typing your `` both t $ ssh @ssh.swc.ucl.ac.uk $ ssh hpc-gw1 ``` -To learn more about accessing the HPC via SSH, see the [relevant how-to guide](../programming/SSH-SWC-cluster.md). +To learn more about accessing the HPC via SSH, see the [relevant how-to guide](ssh-cluster-target). ### Access to the SLEAP module Once you are on the HPC gateway node, SLEAP should be listed among the available modules when you run `module avail`: diff --git a/docs/source/programming/SSH-SWC-cluster.md b/docs/source/programming/SSH-SWC-cluster.md index 549445c..36e8241 100644 --- a/docs/source/programming/SSH-SWC-cluster.md +++ b/docs/source/programming/SSH-SWC-cluster.md @@ -1,3 +1,4 @@ +(ssh-cluster-target)= # Set up SSH for the SWC HPC cluster This guide explains how to connect to the SWC's HPC cluster via SSH. @@ -9,14 +10,14 @@ This guide explains how to connect to the SWC's HPC cluster via SSH. ``` ## Abbreviations -| Acronym | Meaning | -| --- | --- | -| SWC | Sainsbury Wellcome Centre | -| HPC | High Performance Computing | -| SLURM | Simple Linux Utility for Resource Management | -| SSH | Secure (Socket) Shell protocol | -| IDE | Integrated Development Environment | -| GUI | Graphical User Interface | +| Acronym | Meaning | +| ------- | -------------------------------------------- | +| SWC | Sainsbury Wellcome Centre | +| HPC | High Performance Computing | +| SLURM | Simple Linux Utility for Resource Management | +| SSH | Secure (Socket) Shell protocol | +| IDE | Integrated Development Environment | +| GUI | Graphical User Interface | ## Prerequisites - You have an SWC account and know your username and password. From 0b53e1c834adcc0a0b0a6027309204efe7973321 Mon Sep 17 00:00:00 2001 From: niksirbi Date: Tue, 14 Nov 2023 14:54:28 +0000 Subject: [PATCH 10/29] clarified comment about camera view --- docs/source/data_analysis/HPC-module-SLEAP.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/data_analysis/HPC-module-SLEAP.md b/docs/source/data_analysis/HPC-module-SLEAP.md index 89c8bef..11e12da 100644 --- a/docs/source/data_analysis/HPC-module-SLEAP.md +++ b/docs/source/data_analysis/HPC-module-SLEAP.md @@ -130,7 +130,7 @@ Next, follow the instructions in [Remote Training](https://sleap.ai/guides/remot i.e. *Predict* -> *Run Training…* -> *Export Training Job Package…*. - For selecting the right configuration parameters, see [Configuring Models](https://sleap.ai/guides/choosing-models.html#) and [Troubleshooting Workflows](https://sleap.ai/guides/troubleshooting-workflows.html) - Set the *Predict On* parameter to *nothing*. 
Remote training and inference (prediction) are easiest to run separately on the HPC Cluster. Also unselect *Visualize Predictions During Training* in training settings, if it's enabled by default. -- If you are working with a top-down camera view, set the *Rotation Min Angle* and *Rotation Max Angle* to -180 and 180 respectively in the *Augmentation* section. +- If you are working with camera view from above or below (as opposed to a side view), set the *Rotation Min Angle* and *Rotation Max Angle* to -180 and 180 respectively in the *Augmentation* section. - Make sure to save the exported training job package (e.g. `labels.v001.slp.training_job.zip`) in the mounted SWC filesystem, for example, in the same directory as the project file. - Unzip the training job package. This will create a folder with the same name (minus the `.zip` extension). This folder contains everything needed to run the training job on the HPC cluster. From 1bb89b581b9550959730c3f85f90c65f79553331 Mon Sep 17 00:00:00 2001 From: niksirbi Date: Tue, 14 Nov 2023 14:58:30 +0000 Subject: [PATCH 11/29] updated default values for batch directives --- docs/source/data_analysis/HPC-module-SLEAP.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/source/data_analysis/HPC-module-SLEAP.md b/docs/source/data_analysis/HPC-module-SLEAP.md index 11e12da..b582935 100644 --- a/docs/source/data_analysis/HPC-module-SLEAP.md +++ b/docs/source/data_analysis/HPC-module-SLEAP.md @@ -206,7 +206,7 @@ An example is provided below, followed by explanations. #SBATCH -J slp_train # job name #SBATCH -p gpu # partition (queue) #SBATCH -N 1 # number of nodes -#SBATCH --mem 12G # memory pool for all cores +#SBATCH --mem 16G # memory pool for all cores #SBATCH -n 4 # number of cores #SBATCH -t 0-06:00 # time (D-HH:MM) #SBATCH --gres gpu:1 # request 1 GPU (of any kind) @@ -382,8 +382,8 @@ Below is an example SLURM batch script that contains a `sleap-track` call. #SBATCH -J slp_infer # job name #SBATCH -p gpu # partition #SBATCH -N 1 # number of nodes -#SBATCH --mem 12G # memory pool for all cores -#SBATCH -n 2 # number of cores +#SBATCH --mem 16G # memory pool for all cores +#SBATCH -n 4 # number of cores #SBATCH -t 0-02:00 # time (D-HH:MM) #SBATCH --gres gpu:1 # request 1 GPU (of any kind) #SBATCH -o slurm.%N.%j.out # write STDOUT From 0625285f14864b87220be8793c4a3fdf331cd1db Mon Sep 17 00:00:00 2001 From: niksirbi Date: Tue, 14 Nov 2023 15:38:06 +0000 Subject: [PATCH 12/29] move SLURM arguments primer to a separate how-to guide --- docs/source/data_analysis/HPC-module-SLEAP.md | 82 +--------- docs/source/programming/SLURM-arguments.md | 147 ++++++++++++++++++ docs/source/programming/index.md | 1 + 3 files changed, 149 insertions(+), 81 deletions(-) create mode 100644 docs/source/programming/SLURM-arguments.md diff --git a/docs/source/data_analysis/HPC-module-SLEAP.md b/docs/source/data_analysis/HPC-module-SLEAP.md index b582935..2a2b345 100644 --- a/docs/source/data_analysis/HPC-module-SLEAP.md +++ b/docs/source/data_analysis/HPC-module-SLEAP.md @@ -237,7 +237,7 @@ In `nano`, you can save the file by pressing `Ctrl+O` and exit by pressing `Ctrl :icon: info - The `#SBATCH` lines are SLURM directives. They specify the resources needed for the job, such as the number of nodes, CPUs, memory, etc. -A primer on the most useful SLURM arguments is provided in the [appendix](#slurm-arguments-primer). +A primer on the most useful SLURM arguments is provided in this [how-to guide](slurm-arguments-target). 
For more information see the [SLURM documentation](https://slurm.schedmd.com/sbatch.html). - The `#` lines are comments. They are not executed by SLURM, but they are useful @@ -589,83 +589,3 @@ $ logout ``` See [Set up SSH for the SWC HPC cluster](../programming/SSH-SWC-cluster.md) for more information. - -## Appendix - -### SLURM arguments primer - -Here are the most important SLURM arguments used in the above examples -in conjunction with `sbatch` or `srun`. - -**Partition (Queue)** -- Name: `--partition` -- Alias: `-p` -- Description: Specifies the partition (or queue) to submit the job to. To see a list of all partitions/queues, the nodes they contain and their respective time limits, type `sinfo` when logged in to the HPC cluster. -- Example values: `gpu`, `cpu`, `fast`, `medium` - -**Job Name** -- Name: `--job-name` -- Alias: `-J` -- Description: Specifies a name for the job, which will appear in various SLURM commands and logs, making it easier to identify the job (especially when you have multiple jobs queued up) -- Example values: `training_run_24` - -**Number of Nodes** -- Name: `--nodes` -- Alias: `-N` -- Description: Defines the number of nodes required for the job. -- Example values: `1` -- Note: This should always be `1`, unless you really know what you're doing - -**Number of Cores** -- Name: `--ntasks` -- Alias: `-n` -- Description: Defines the number of cores (or tasks) required for the job. -- Example values: `1`, `4`, `8` - -**Memory Pool for All Cores** -- Name: `--mem` -- Description: Specifies the total amount of memory (RAM) required for the job across all cores (per node) -- Example values: `8G`, `16G`, `32G` - -**Time Limit** -- Name: `--time` -- Alias: `-t` -- Description: Sets the maximum time the job is allowed to run. The format is D-HH:MM, where D is days, HH is hours, and MM is minutes. -- Example values: `0-01:00` (1 hour), `0-04:00` (4 hours), `1-00:00` (1 day). -- Note: If the job exceeds the time limit, it will be terminated by SLURM. On the other hand, avoid requesting way more time than what your job needs, as this may delay its scheduling (depending on resource availability). - -**Generic Resources (GPUs)** -* Name: `--gres` -* Description: Requests generic resources, such as GPUs. -* Example values: `gpu:1`, `gpu:rtx2080:1`, `gpu:rtx5000:1`, `gpu:a100_2g.10gb:1` -* Note: No GPU will be allocated to you unless you specify it via the `--gres` argument (even if you are on the 'gpu' partition). To request 1 GPU of any kind, use `--gres gpu:1`. To request a specific GPU type, you have to include its name, e.g. `--gres gpu:rtx2080:1`. You can view the available GPU types on the [SWC internal wiki](https://wiki.ucl.ac.uk/display/SSC/CPU+and+GPU+Platform+architecture). - -**Standard Output File** -- Name: `--output` -- Alias: `-o` -- Description: Defines the file where the standard output (STDOUT) will be written. In the examples scripts, it's set to slurm.%N.%j.out, where %N is the node name and %j is the job ID. -- Example values: `slurm.%N.%j.out`, `slurm.MyAwesomeJob.out` -- Note: this file contains the output of the commands executed by the job (i.e. the messages that normally gets printed on the terminal). - -**Standard Error File** -- Name: `--error` -- Alias: `-e` -- Description: Specifies the file where the standard error (STDERR) will be written. In the examples, it's set to slurm.%N.%j.err, where %N is the node name and %j is the job ID. 
-- Example values: `slurm.%N.%j.err`, `slurm.MyAwesomeJob.err` -- Note: this file is very useful for debugging, as it contains all the error messages produced by the commands executed by the job. - -**Email Notifications** -- Name: `--mail-type` -- Description: Defines the conditions under which the user will be notified by email. -- Example values: `ALL`, `BEGIN`, `END`, `FAIL` - -**Email Address** -- Name: `--mail-user` -- Description: Specifies the email address to which notifications will be sent. -- Example values: `user@domain.com` - -**Array jobs** -- Name: `--array` -- Description: Job array index values (a list of integers in increasing order). The task index can be accessed via the `SLURM_ARRAY_TASK_ID` environment variable. -- Example values: `--array=1-10` (10 jobs), `--array=1-100%5` (100 jobs, but only 5 of them will be allowed to run in parallel at any given time). -- Note: if an array consists of many jobs, using the `%` syntax to limit the maximum number of parallel jobs is recommended to prevent overloading the cluster. diff --git a/docs/source/programming/SLURM-arguments.md b/docs/source/programming/SLURM-arguments.md new file mode 100644 index 0000000..88148d6 --- /dev/null +++ b/docs/source/programming/SLURM-arguments.md @@ -0,0 +1,147 @@ +(slurm-arguments-target)= +# SLURM arguments primer + +```{include} ../_static/swc-wiki-warning.md +``` + +## Abbreviations +| Acronym | Meaning | +| ------- | -------------------------------------------- | +| SWC | Sainsbury Wellcome Centre | +| HPC | High Performance Computing | +| SLURM | Simple Linux Utility for Resource Management | +| MPI | Message Passing Interface | + + +## Overview +SLURM is a job scheduler and resource manager used on the SWC HPC cluster. +It is responsible for allocating resources (e.g. CPU cores, GPUs, memory) to jobs submitted by users. +When submitting a job to SLURM, you can specify various arguments to request the resources you need. +These are called SLURM directives, and they are passed to SLURM via the `sbatch` or `srun` commands. + +These are often specified at the top of a SLURM job script, +e.g. the lines that start with `#SBATCH` in the following example: + +```{code-block} bash +#!/bin/bash + +#SBATCH -J my_job # job name +#SBATCH -p gpu # partition (queue) +#SBATCH -N 1 # number of nodes +#SBATCH --mem 16G # memory pool for all cores +#SBATCH -n 4 # number of cores +#SBATCH -t 0-06:00 # time (D-HH:MM) +#SBATCH --gres gpu:1 # request 1 GPU (of any kind) +#SBATCH -o slurm.%N.%j.out # STDOUT +#SBATCH -e slurm.%N.%j.err # STDERR +#SBATCH --mail-type=ALL +#SBATCH --mail-user=user@domain.com +#SBATCH --array=1-12%4 # job array index values + +# load modules +... + +# execute commands +... +``` +This guide provides only a brief overview of the most important SLURM arguments, +to demysify the above directives and help you get started with SLURM. +For a more detailed description see the [SLURM documentation](https://slurm.schedmd.com/sbatch.html). + +## Commonly used arguments + +### Partition (Queue) +- *Name:* `--partition` +- *Alias:* `-p` +- *Description:* Specifies the partition (or queue) to submit the job to. To see a list of all partitions/queues, the nodes they contain and their respective time limits, type `sinfo` when logged in to the HPC cluster. 
+- *Example values:* `gpu`, `cpu`, `fast`, `medium` + +### Job Name +- *Name:* `--job-name` +- *Alias:* `-J` +- *Description:* Specifies a name for the job, which will appear in various SLURM commands and logs, making it easier to identify the job (especially when you have multiple jobs queued up) +- *Example values:* `training_run_24` + +### Number of Nodes +- *Name:* `--nodes` +- *Alias:* `-N` +- *Description:* Defines the number of nodes required for the job. +- *Example values:* `1` + +:::{warning} +This should always be `1`, unless you really know what you're doing, +e.g. you are parallelising your code across multiple nodes with MPI. +::: + +### Number of Cores +- *Name:* `--ntasks` +- *Alias:* `-n` +- *Description:* Defines the number of cores (or tasks) required for the job. +- *Example values:* `4`, `8`, `16` + +### Memory Pool for All Cores +- *Name:* `--mem` +- *Description:* Specifies the total amount of memory (RAM) required for the job across all cores (per node) +- *Example values:* `8G`, `16G`, `32G` + +### Time Limit +- *Name:* `--time` +- *Alias:* `-t` +- *Description:* Sets the maximum time the job is allowed to run. The format is D-HH:MM, where D is days, HH is hours, and MM is minutes. +- *Example values:* `0-01:00` (1 hour), `0-04:00` (4 hours), `1-00:00` (1 day). + +:::{warning} +If the job exceeds the time limit, it will be terminated by SLURM. +On the other hand, avoid requesting way more time than what your job needs, +as this may delay its scheduling (depending on resource availability). +::: + +### Generic Resources (GPUs) +- *Name:* `--gres` +- *Description:* Requests generic resources, such as GPUs. +- *Example values:* `gpu:1`, `gpu:rtx2080:1`, `gpu:rtx5000:1`, `gpu:a100_2g.10gb:1` + +:::{warning} +No GPU will be allocated to you unless you specify it via the `--gres` argument (even if you are on the 'gpu' partition). +To request 1 GPU of any kind, use `--gres gpu:1`. To request a specific GPU type, you have to include its name, e.g. `--gres gpu:rtx2080:1`. +You can view the available GPU types on the [SWC internal wiki](https://wiki.ucl.ac.uk/display/SSC/CPU+and+GPU+Platform+architecture). +::: + +### Standard Output File +- *Name:* `--output` +- *Alias:* `-o` +- *Description:* Defines the file where the standard output (STDOUT) will be written. In the examples scripts, it's set to slurm.%N.%j.out, where %N is the node name and %j is the job ID. +- *Example values:* `slurm.%N.%j.out`, `slurm.MyAwesomeJob.out` + +:::{note} +This file contains the output of the commands executed by the job (i.e. the messages that normally gets printed on the terminal). +::: + +### Standard Error File +- *Name:* `--error` +- *Alias:* `-e` +- *Description:* Specifies the file where the standard error (STDERR) will be written. In the examples, it's set to slurm.%N.%j.err, where %N is the node name and %j is the job ID. +- *Example values:* `slurm.%N.%j.err`, `slurm.MyAwesomeJob.err` + +:::{note} +This file is very useful for debugging, as it contains all the error messages produced by the commands executed by the job. +::: + +### Email Notifications +- *Name:* `--mail-type` +- *Description:* Defines the conditions under which the user will be notified by email. +- *Example values:* `ALL`, `BEGIN`, `END`, `FAIL` + +### Email Address +- *Name:* `--mail-user` +- *Description:* Specifies the email address to which notifications will be sent. 
+- *Example values:* `user@domain.com` + +### Array jobs +- *Name:* `--array` +- *Description:* Job array index values (a list of integers in increasing order). The task index can be accessed via the `SLURM_ARRAY_TASK_ID` environment variable. +- *Example values:* `--array=1-10` (10 jobs), `--array=1-100%5` (100 jobs, but only 5 of them will be allowed to run in parallel at any given time). + +:::{warning} +If an array consists of many jobs, using the `%` syntax to limit the maximum number of parallel jobs is recommended to prevent overloading the cluster. +::: diff --git a/docs/source/programming/index.md b/docs/source/programming/index.md index 629588e..b9e8ccd 100644 --- a/docs/source/programming/index.md +++ b/docs/source/programming/index.md @@ -7,6 +7,7 @@ Small tips and tricks that do not warrant a long-form guide can be found in the ```{toctree} :maxdepth: 1 +SLURM-arguments SSH-SWC-cluster SSH-vscode Mount-ceph-ubuntu From eb04134a23ec49605301cba1c1e8eac274dc5b25 Mon Sep 17 00:00:00 2001 From: niksirbi Date: Tue, 14 Nov 2023 15:46:13 +0000 Subject: [PATCH 13/29] move warning about execute permission earlier --- docs/source/data_analysis/HPC-module-SLEAP.md | 26 ++++++++++++------- 1 file changed, 16 insertions(+), 10 deletions(-) diff --git a/docs/source/data_analysis/HPC-module-SLEAP.md b/docs/source/data_analysis/HPC-module-SLEAP.md index 2a2b345..40e1899 100644 --- a/docs/source/data_analysis/HPC-module-SLEAP.md +++ b/docs/source/data_analysis/HPC-module-SLEAP.md @@ -253,25 +253,31 @@ to the model configuration and the project file. - The `./train-script.sh` line runs the training job (executes the contained commands). ::: -Now you can submit the batch script via running the following command -(in the same directory as the script): -```{code-block} console -$ sbatch train_slurm.sh -Submitted batch job 3445652 -``` :::{warning} -If you are getting a permission error, make the script files executable -by running in the terminal: +Before submitting the job, ensure that you have permissions to execute +both the batch script and the training script generated by SLEAP. +You can make these files executable by running in the terminal: ```{code-block} console $ chmod +x train-script.sh $ chmod +x train_slurm.sh ``` -If the scripts are not in the same folder, you will need to specify the full path: -`chmod +x /path/to/script.sh` +If the scripts are not in your working directory, you will need to specify their full paths: + +```{code-block} console +$ chmod +x /path/to/train-script.sh +$ chmod +x /path/to/train_slurm.sh +``` ::: +Now you can submit the batch script via running the following command +(in the same directory as the script): +```{code-block} console +$ sbatch train_slurm.sh +Submitted batch job 3445652 +``` + You may monitor the progress of the job in various ways: ::::{tab-set} From 92adfa6aef0c276ce306e68e093f40b6e2f3a315 Mon Sep 17 00:00:00 2001 From: niksirbi Date: Tue, 14 Nov 2023 15:59:42 +0000 Subject: [PATCH 14/29] Renamed docs workflow --- .github/workflows/docs_build_and_deploy.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/.github/workflows/docs_build_and_deploy.yml b/.github/workflows/docs_build_and_deploy.yml index 6a7b37f..95dcd90 100644 --- a/.github/workflows/docs_build_and_deploy.yml +++ b/.github/workflows/docs_build_and_deploy.yml @@ -1,4 +1,4 @@ -name: Build Sphinx docs and deploy to GitHub Pages +name: Docs # Generate the documentation on all merges to main, all pull requests, or by # manual workflow dispatch. 
The build job can be used as a CI check that the From b7736a5ac96c0a05d723d4d3f13098a0fe057ae1 Mon Sep 17 00:00:00 2001 From: niksirbi Date: Tue, 14 Nov 2023 17:56:02 +0000 Subject: [PATCH 15/29] added job name to STDOUT and STDERR file names --- docs/source/data_analysis/HPC-module-SLEAP.md | 8 ++++---- docs/source/programming/SLURM-arguments.md | 10 +++++----- 2 files changed, 9 insertions(+), 9 deletions(-) diff --git a/docs/source/data_analysis/HPC-module-SLEAP.md b/docs/source/data_analysis/HPC-module-SLEAP.md index 40e1899..514d46b 100644 --- a/docs/source/data_analysis/HPC-module-SLEAP.md +++ b/docs/source/data_analysis/HPC-module-SLEAP.md @@ -210,8 +210,8 @@ An example is provided below, followed by explanations. #SBATCH -n 4 # number of cores #SBATCH -t 0-06:00 # time (D-HH:MM) #SBATCH --gres gpu:1 # request 1 GPU (of any kind) -#SBATCH -o slurm.%N.%j.out # STDOUT -#SBATCH -e slurm.%N.%j.err # STDERR +#SBATCH -o slurm.%x.%N.%j.out # STDOUT +#SBATCH -e slurm.%x.%N.%j.err # STDERR #SBATCH --mail-type=ALL #SBATCH --mail-user=user@domain.com @@ -392,8 +392,8 @@ Below is an example SLURM batch script that contains a `sleap-track` call. #SBATCH -n 4 # number of cores #SBATCH -t 0-02:00 # time (D-HH:MM) #SBATCH --gres gpu:1 # request 1 GPU (of any kind) -#SBATCH -o slurm.%N.%j.out # write STDOUT -#SBATCH -e slurm.%N.%j.err # write STDERR +#SBATCH -o slurm.%x.%N.%j.out # write STDOUT +#SBATCH -e slurm.%x.%N.%j.err # write STDERR #SBATCH --mail-type=ALL #SBATCH --mail-user=user@domain.com diff --git a/docs/source/programming/SLURM-arguments.md b/docs/source/programming/SLURM-arguments.md index 88148d6..6bfb81b 100644 --- a/docs/source/programming/SLURM-arguments.md +++ b/docs/source/programming/SLURM-arguments.md @@ -32,8 +32,8 @@ e.g. the lines that start with `#SBATCH` in the following example: #SBATCH -n 4 # number of cores #SBATCH -t 0-06:00 # time (D-HH:MM) #SBATCH --gres gpu:1 # request 1 GPU (of any kind) -#SBATCH -o slurm.%N.%j.out # STDOUT -#SBATCH -e slurm.%N.%j.err # STDERR +#SBATCH -o slurm.%x.%N.%j.out # STDOUT +#SBATCH -e slurm.%x.%N.%j.err # STDERR #SBATCH --mail-type=ALL #SBATCH --mail-user=user@domain.com #SBATCH --array=1-12%4 # job array index values @@ -110,8 +110,8 @@ You can view the available GPU types on the [SWC internal wiki](https://wiki.ucl ### Standard Output File - *Name:* `--output` - *Alias:* `-o` -- *Description:* Defines the file where the standard output (STDOUT) will be written. In the examples scripts, it's set to slurm.%N.%j.out, where %N is the node name and %j is the job ID. -- *Example values:* `slurm.%N.%j.out`, `slurm.MyAwesomeJob.out` +- *Description:* Defines the file where the standard output (STDOUT) will be written. In the example script above, it's set to slurm.%x.%N.%j.out, where %x is the job name, %N is the node name and %j is the job ID. +- *Example values:* `slurm.%x.%N.%j.out`, `slurm.MyAwesomeJob.out` :::{note} This file contains the output of the commands executed by the job (i.e. the messages that normally gets printed on the terminal). @@ -120,7 +120,7 @@ This file contains the output of the commands executed by the job (i.e. the mess ### Standard Error File - *Name:* `--error` - *Alias:* `-e` -- *Description:* Specifies the file where the standard error (STDERR) will be written. In the examples, it's set to slurm.%N.%j.err, where %N is the node name and %j is the job ID. +- *Description:* Specifies the file where the standard error (STDERR) will be written. 
In the example script above, it's set to slurm.%x.%N.%j.out, where %x is the job name, %N is the node name and %j is the job ID. - *Example values:* `slurm.%N.%j.err`, `slurm.MyAwesomeJob.err` :::{note} From cab10d21d4e9dcddc6c83626414c2c2ed5e7ba03 Mon Sep 17 00:00:00 2001 From: niksirbi Date: Tue, 14 Nov 2023 18:19:27 +0000 Subject: [PATCH 16/29] fixed wrong path in inference batch script --- docs/source/data_analysis/HPC-module-SLEAP.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/docs/source/data_analysis/HPC-module-SLEAP.md b/docs/source/data_analysis/HPC-module-SLEAP.md index 514d46b..80bc2b2 100644 --- a/docs/source/data_analysis/HPC-module-SLEAP.md +++ b/docs/source/data_analysis/HPC-module-SLEAP.md @@ -412,11 +412,12 @@ cd $SLP_JOB_DIR mkdir -p predictions # Run the inference command -sleap-track $VIDEO_DIR/videos/M708149_EPM_20200317_165049331-converted.mp4 \ +sleap-track $VIDEO_DIR/M708149_EPM_20200317_165049331-converted.mp4 \ -m $SLP_JOB_DIR/models/231010_164307.centroid/training_config.json \ -m $SLP_JOB_DIR/models/231010_164307.centered_instance/training_config.json \ --gpu auto \ --tracking.tracker simple \ + --tracking.similarity centroid \ --tracking.post_connect_single_breaks 1 \ -o predictions/labels.v001.slp.predictions.slp \ --verbosity json \ From 65a36502239c1d294557ed652c9b41ea77f4766e Mon Sep 17 00:00:00 2001 From: niksirbi Date: Tue, 14 Nov 2023 18:42:23 +0000 Subject: [PATCH 17/29] clarify whihc folder should be loaded for evaluating models --- docs/source/data_analysis/HPC-module-SLEAP.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/data_analysis/HPC-module-SLEAP.md b/docs/source/data_analysis/HPC-module-SLEAP.md index 80bc2b2..94378a8 100644 --- a/docs/source/data_analysis/HPC-module-SLEAP.md +++ b/docs/source/data_analysis/HPC-module-SLEAP.md @@ -370,7 +370,7 @@ training_log.csv The SLEAP GUI on your local machine can be used to quickly evaluate the trained models. - Select *Predict* -> *Evaluation Metrics for Trained Models...* -- Click on *Add Trained Models(s)* and select the subfolder(s) containing the model(s) you want to evaluate (e.g. `230509_141357.centered_instance`). +- Click on *Add Trained Models(s)* and select the folder containing the model(s) you want to evaluate. - You can view the basic metrics on the shown table or you can also view a more detailed report (including plots) by clicking *View Metrics*. ## Model inference From 5b8a2b11d846f233b205f7a49301cd6d3f17cd26 Mon Sep 17 00:00:00 2001 From: niksirbi Date: Wed, 22 Nov 2023 10:46:27 +0000 Subject: [PATCH 18/29] increase cores and memory in slurm examples --- docs/source/data_analysis/HPC-module-SLEAP.md | 12 +++++++----- docs/source/programming/SLURM-arguments.md | 2 +- 2 files changed, 8 insertions(+), 6 deletions(-) diff --git a/docs/source/data_analysis/HPC-module-SLEAP.md b/docs/source/data_analysis/HPC-module-SLEAP.md index 94378a8..4df2c5d 100644 --- a/docs/source/data_analysis/HPC-module-SLEAP.md +++ b/docs/source/data_analysis/HPC-module-SLEAP.md @@ -206,8 +206,8 @@ An example is provided below, followed by explanations. 
#SBATCH -J slp_train # job name #SBATCH -p gpu # partition (queue) #SBATCH -N 1 # number of nodes -#SBATCH --mem 16G # memory pool for all cores -#SBATCH -n 4 # number of cores +#SBATCH --mem 32G # memory pool for all cores +#SBATCH -n 8 # number of cores #SBATCH -t 0-06:00 # time (D-HH:MM) #SBATCH --gres gpu:1 # request 1 GPU (of any kind) #SBATCH -o slurm.%x.%N.%j.out # STDOUT @@ -388,10 +388,10 @@ Below is an example SLURM batch script that contains a `sleap-track` call. #SBATCH -J slp_infer # job name #SBATCH -p gpu # partition #SBATCH -N 1 # number of nodes -#SBATCH --mem 16G # memory pool for all cores -#SBATCH -n 4 # number of cores +#SBATCH --mem 64G # memory pool for all cores +#SBATCH -n 16 # number of cores #SBATCH -t 0-02:00 # time (D-HH:MM) -#SBATCH --gres gpu:1 # request 1 GPU (of any kind) +#SBATCH --gres gpu:rtx5000:1 # request 1 GPU (of a specific kind) #SBATCH -o slurm.%x.%N.%j.out # write STDOUT #SBATCH -e slurm.%x.%N.%j.err # write STDERR #SBATCH --mail-type=ALL @@ -425,6 +425,8 @@ sleap-track $VIDEO_DIR/M708149_EPM_20200317_165049331-converted.mp4 \ ``` The script is very similar to the training script, with the following differences: - The time limit `-t` is set lower, since inference is normally faster than training. This will however depend on the size of the video and the number of models used. +- The requested number of cores `n` and memory `--mem` are higher. This will depend on the requirements of the specific job you are running. It's best practice to try with a scaled-down version of your data first, to get an idea of the resources needed. +- The requested GPU is of a specific kind (RTX 5000). This will again depend on the requirements of your job, as the different GPU kinds vary in GPU memory size and compute capabilities (see [wiki](https://wiki.ucl.ac.uk/display/SSC/CPU+and+GPU+Platform+architecture)). - The `./train-script.sh` line is replaced by the `sleap-track` command. - The `\` character is used to split the long `sleap-track` command into multiple lines for readability. It is not necessary if the command is written on a single line. diff --git a/docs/source/programming/SLURM-arguments.md b/docs/source/programming/SLURM-arguments.md index 6bfb81b..7b0d822 100644 --- a/docs/source/programming/SLURM-arguments.md +++ b/docs/source/programming/SLURM-arguments.md @@ -82,7 +82,7 @@ e.g. you are parallelising your code across multiple nodes with MPI. ### Memory Pool for All Cores - *Name:* `--mem` - *Description:* Specifies the total amount of memory (RAM) required for the job across all cores (per node) -- *Example values:* `8G`, `16G`, `32G` +- *Example values:* `8G`, `32G`, `64G` ### Time Limit - *Name:* `--time` From 2026f0f105efc53688b236b6a9ee80da5d60e21e Mon Sep 17 00:00:00 2001 From: niksirbi Date: Wed, 22 Nov 2023 11:07:11 +0000 Subject: [PATCH 19/29] added warning for out-of-memory errors --- docs/source/data_analysis/HPC-module-SLEAP.md | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/docs/source/data_analysis/HPC-module-SLEAP.md b/docs/source/data_analysis/HPC-module-SLEAP.md index 4df2c5d..e696698 100644 --- a/docs/source/data_analysis/HPC-module-SLEAP.md +++ b/docs/source/data_analysis/HPC-module-SLEAP.md @@ -336,6 +336,16 @@ $ cat slurm.gpu-sr670-20.3445652.err :::: +```{dropdown} Out-of-memory (OOM) errors +:color: warning +:icon: alert-fill + +If you encounter out-of-memory errors, there are a few things you can try: +- Request more CPU memory via the `--mem` argument in the SLURM batch script. 
+- Request a specific GPU card type with more GPU memory (e.g. `--gres gpu:a4500:1`). The SWC wiki provides a [list of all GPU card types and their specifications](https://wiki.ucl.ac.uk/display/SSC/CPU+and+GPU+Platform+architecture). +- Reduce the size of your SLEAP models. You may tweak the model backbone architecture, or play with *Input scalng*, *Max stride* and *Batch size*. See SLEAP's [documentation](https://sleap.ai/) and [discussion forum](https://github.com/talmolab/sleap/discussions) for more details. +``` + ### Evaluate the trained models Upon successful completion of the training job, a `models` folder will have been created in the training job directory. It contains one subfolder per From 47c03a336ad3c02e67360ee626d29055c6402480 Mon Sep 17 00:00:00 2001 From: niksirbi Date: Wed, 22 Nov 2023 11:08:33 +0000 Subject: [PATCH 20/29] modified swc wiki warning --- docs/source/_static/swc-wiki-warning.md | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/source/_static/swc-wiki-warning.md b/docs/source/_static/swc-wiki-warning.md index 99a61d2..cbb3e28 100644 --- a/docs/source/_static/swc-wiki-warning.md +++ b/docs/source/_static/swc-wiki-warning.md @@ -2,4 +2,5 @@ Some links within this document point to the [SWC internal wiki](https://wiki.ucl.ac.uk/display/SI/SWC+Intranet), which is only accessible from within the SWC network. +We recommend opening these links in a new tab. ::: From fa423a2a3cf3a7e3a82b8058950dbb20b9915ae2 Mon Sep 17 00:00:00 2001 From: niksirbi Date: Wed, 22 Nov 2023 11:16:35 +0000 Subject: [PATCH 21/29] fixed syntac error --- docs/source/conf.py | 2 -- 1 file changed, 2 deletions(-) diff --git a/docs/source/conf.py b/docs/source/conf.py index d7aa3ee..1be640c 100644 --- a/docs/source/conf.py +++ b/docs/source/conf.py @@ -129,8 +129,6 @@ "type": "fontawesome", } ], - "logo": { - ], "logo": { "text": "HowTo", "image_light": "logo_light.png", From 5ce71baf3a0cb2975484ad75e30ba5245bfbddad Mon Sep 17 00:00:00 2001 From: Niko Sirmpilatze Date: Thu, 23 Nov 2023 13:04:41 +0000 Subject: [PATCH 22/29] Apply suggestions from code review Applied Adam's suggestions from code review Co-authored-by: Adam Tyson --- docs/source/data_analysis/HPC-module-SLEAP.md | 6 +++--- docs/source/programming/SLURM-arguments.md | 9 +++++---- 2 files changed, 8 insertions(+), 7 deletions(-) diff --git a/docs/source/data_analysis/HPC-module-SLEAP.md b/docs/source/data_analysis/HPC-module-SLEAP.md index e696698..1da2457 100644 --- a/docs/source/data_analysis/HPC-module-SLEAP.md +++ b/docs/source/data_analysis/HPC-module-SLEAP.md @@ -1,4 +1,4 @@ -# Use the SLEAP module on the HPC cluster +# Use the SLEAP module on the SWC HPC cluster ```{include} ../_static/swc-wiki-warning.md ``` @@ -35,8 +35,8 @@ SLEAP/2023-03-13 SLEAP/2023-08-01 ... ``` -- `SLEAP/2023-03-13` corresponds to `sleap v.1.2.9` -- `SLEAP/2023-08-01` corresponds to `sleap v.1.3.1` +- `SLEAP/2023-03-13` corresponds to `SLEAP v.1.2.9` +- `SLEAP/2023-08-01` corresponds to `SLEAP v.1.3.1` We recommend always using the latest version, which is the one loaded by default when you run `module load SLEAP`. 
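As a quick sanity check, you can ask the loaded module's Python environment which SLEAP version it provides. This is only an illustrative one-liner — it assumes the module's bundled `python` is first on your `PATH`, and the printed version is what we'd expect for `SLEAP/2023-08-01`:

```{code-block} console
$ python -c "import sleap; print(sleap.__version__)"
1.3.1
```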
If you want to load a specific version, diff --git a/docs/source/programming/SLURM-arguments.md b/docs/source/programming/SLURM-arguments.md index 7b0d822..d411e94 100644 --- a/docs/source/programming/SLURM-arguments.md +++ b/docs/source/programming/SLURM-arguments.md @@ -69,20 +69,19 @@ For a more detailed description see the [SLURM documentation](https://slurm.sche - *Example values:* `1` :::{warning} -This should always be `1`, unless you really know what you're doing, -e.g. you are parallelising your code across multiple nodes with MPI. +This is usually `1` unless you're parallelising your code across multiple nodes with technologies such as MPI. ::: ### Number of Cores - *Name:* `--ntasks` - *Alias:* `-n` - *Description:* Defines the number of cores (or tasks) required for the job. -- *Example values:* `4`, `8`, `16` +- *Example values:* `1`, `5`, `20` ### Memory Pool for All Cores - *Name:* `--mem` - *Description:* Specifies the total amount of memory (RAM) required for the job across all cores (per node) -- *Example values:* `8G`, `32G`, `64G` +- *Example values:* `4G`, `32G`, `64G` ### Time Limit - *Name:* `--time` @@ -94,6 +93,8 @@ e.g. you are parallelising your code across multiple nodes with MPI. If the job exceeds the time limit, it will be terminated by SLURM. On the other hand, avoid requesting way more time than what your job needs, as this may delay its scheduling (depending on resource availability). + +If needed, the systems administrator can extend long-running jobs. ::: ### Generic Resources (GPUs) From cb5d119521b6e50dc7d7e17c4361bf771a011c1f Mon Sep 17 00:00:00 2001 From: niksirbi Date: Thu, 23 Nov 2023 13:14:09 +0000 Subject: [PATCH 23/29] removed duplicate entries from conf.py --- docs/source/conf.py | 5 ----- 1 file changed, 5 deletions(-) diff --git a/docs/source/conf.py b/docs/source/conf.py index 1be640c..f0e5811 100644 --- a/docs/source/conf.py +++ b/docs/source/conf.py @@ -93,24 +93,19 @@ # html_theme = "pydata_sphinx_theme" html_title = "HowTo" -html_theme = "pydata_sphinx_theme" -html_title = "HowTo" # Redirect the webpage to another URL # Sphinx will create the appropriate CNAME file in the build directory # https://www.sphinx-doc.org/en/master/usage/extensions/githubpages.html html_baseurl = "https://howto.neuroinformatics.dev/" -html_baseurl = "https://howto.neuroinformatics.dev/" # Add any paths that contain custom static files (such as style sheets) here, # relative to this directory. They are copied after the builtin static files, # so a file named "default.css" will overwrite the builtin "default.css". 
html_static_path = ["_static"] -html_static_path = ["_static"] html_css_files = [ "css/custom.css", - "css/custom.css", ] html_favicon = "_static/logo_light.png" From 9ec251bfd289ebd3b71dc9e7d99412de4624b8c2 Mon Sep 17 00:00:00 2001 From: niksirbi Date: Thu, 23 Nov 2023 13:24:10 +0000 Subject: [PATCH 24/29] added link to abbreviations tables --- docs/source/data_analysis/HPC-module-SLEAP.md | 14 +++++++------- docs/source/programming/SLURM-arguments.md | 12 ++++++------ docs/source/programming/SSH-SWC-cluster.md | 16 ++++++++-------- 3 files changed, 21 insertions(+), 21 deletions(-) diff --git a/docs/source/data_analysis/HPC-module-SLEAP.md b/docs/source/data_analysis/HPC-module-SLEAP.md index 1da2457..66223df 100644 --- a/docs/source/data_analysis/HPC-module-SLEAP.md +++ b/docs/source/data_analysis/HPC-module-SLEAP.md @@ -7,13 +7,13 @@ ``` ## Abbreviations -| Acronym | Meaning | -| ------- | -------------------------------------------- | -| SLEAP | Social LEAP Estimates Animal Poses | -| SWC | Sainsbury Wellcome Centre | -| HPC | High Performance Computing | -| SLURM | Simple Linux Utility for Resource Management | -| GUI | Graphical User Interface | +| Acronym | Meaning | +| --------------------------------------------------------------- | -------------------------------------------- | +| [SLEAP](https://sleap.ai/) | Social LEAP Estimates Animal Poses | +| [SWC](https://www.sainsburywellcome.org/web/) | Sainsbury Wellcome Centre | +| [HPC](https://en.wikipedia.org/wiki/High-performance_computing) | High Performance Computing | +| [SLURM](https://slurm.schedmd.com/) | Simple Linux Utility for Resource Management | +| [GUI](https://en.wikipedia.org/wiki/Graphical_user_interface) | Graphical User Interface | ## Prerequisites diff --git a/docs/source/programming/SLURM-arguments.md b/docs/source/programming/SLURM-arguments.md index d411e94..664d13f 100644 --- a/docs/source/programming/SLURM-arguments.md +++ b/docs/source/programming/SLURM-arguments.md @@ -5,12 +5,12 @@ ``` ## Abbreviations -| Acronym | Meaning | -| ------- | -------------------------------------------- | -| SWC | Sainsbury Wellcome Centre | -| HPC | High Performance Computing | -| SLURM | Simple Linux Utility for Resource Management | -| MPI | Message Passing Interface | +| Acronym | Meaning | +| --------------------------------------------------------------- | -------------------------------------------- | +| [SWC](https://www.sainsburywellcome.org/web/) | Sainsbury Wellcome Centre | +| [HPC](https://en.wikipedia.org/wiki/High-performance_computing) | High Performance Computing | +| [SLURM](https://slurm.schedmd.com/) | Simple Linux Utility for Resource Management | +| [MPI](https://en.wikipedia.org/wiki/Message_Passing_Interface) | Message Passing Interface | ## Overview diff --git a/docs/source/programming/SSH-SWC-cluster.md b/docs/source/programming/SSH-SWC-cluster.md index 36e8241..610d377 100644 --- a/docs/source/programming/SSH-SWC-cluster.md +++ b/docs/source/programming/SSH-SWC-cluster.md @@ -10,14 +10,14 @@ This guide explains how to connect to the SWC's HPC cluster via SSH. 
``` ## Abbreviations -| Acronym | Meaning | -| ------- | -------------------------------------------- | -| SWC | Sainsbury Wellcome Centre | -| HPC | High Performance Computing | -| SLURM | Simple Linux Utility for Resource Management | -| SSH | Secure (Socket) Shell protocol | -| IDE | Integrated Development Environment | -| GUI | Graphical User Interface | +| Acronym | Meaning | +| ----------------------------------------------------------------------- | -------------------------------------------- | +| [SWC](https://www.sainsburywellcome.org/web/) | Sainsbury Wellcome Centre | +| [HPC](https://en.wikipedia.org/wiki/High-performance_computing) | High Performance Computing | +| [SLURM](https://slurm.schedmd.com/) | Simple Linux Utility for Resource Management | +| [SSH](https://en.wikipedia.org/wiki/Secure_Shell) | Secure (Socket) Shell protocol | +| [IDE](https://en.wikipedia.org/wiki/Integrated_development_environment) | Integrated Development Environment | +| [GUI](https://en.wikipedia.org/wiki/Graphical_user_interface) | Graphical User Interface | ## Prerequisites - You have an SWC account and know your username and password. From ec2ed0485ecbedb9d8d55b4135121fbdfa87da5c Mon Sep 17 00:00:00 2001 From: niksirbi Date: Thu, 23 Nov 2023 13:29:47 +0000 Subject: [PATCH 25/29] Revert "temporarily enable publishing from this branch for review" This reverts commit 8c572bfb3cf3557c33991cc364a0d08eb9de5f4a. --- .github/workflows/docs_build_and_deploy.yml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/.github/workflows/docs_build_and_deploy.yml b/.github/workflows/docs_build_and_deploy.yml index 95dcd90..8a781c1 100644 --- a/.github/workflows/docs_build_and_deploy.yml +++ b/.github/workflows/docs_build_and_deploy.yml @@ -8,7 +8,7 @@ name: Docs on: push: branches: - - sleap-module + - main tags: - '*' pull_request: @@ -38,7 +38,7 @@ jobs: needs: build_sphinx_docs permissions: contents: write - if: github.event_name == 'push' && github.ref_name == 'sleap-module' + if: github.event_name == 'push' && github.ref_name == 'main' runs-on: ubuntu-latest steps: - uses: neuroinformatics-unit/actions/deploy_sphinx_docs@v2 From 62ba48b616a905ed02add9bd0021cf47aa16a367 Mon Sep 17 00:00:00 2001 From: niksirbi Date: Thu, 23 Nov 2023 13:47:08 +0000 Subject: [PATCH 26/29] clarified two memory types --- docs/source/data_analysis/HPC-module-SLEAP.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/source/data_analysis/HPC-module-SLEAP.md b/docs/source/data_analysis/HPC-module-SLEAP.md index 66223df..75d880f 100644 --- a/docs/source/data_analysis/HPC-module-SLEAP.md +++ b/docs/source/data_analysis/HPC-module-SLEAP.md @@ -340,10 +340,10 @@ $ cat slurm.gpu-sr670-20.3445652.err :color: warning :icon: alert-fill -If you encounter out-of-memory errors, there are a few things you can try: -- Request more CPU memory via the `--mem` argument in the SLURM batch script. -- Request a specific GPU card type with more GPU memory (e.g. `--gres gpu:a4500:1`). The SWC wiki provides a [list of all GPU card types and their specifications](https://wiki.ucl.ac.uk/display/SSC/CPU+and+GPU+Platform+architecture). -- Reduce the size of your SLEAP models. You may tweak the model backbone architecture, or play with *Input scalng*, *Max stride* and *Batch size*. See SLEAP's [documentation](https://sleap.ai/) and [discussion forum](https://github.com/talmolab/sleap/discussions) for more details. 
+If you encounter out-of-memory errors, keep in mind that there two main sources of memory usage: +- CPU memory (RAM), specified via the `--mem` argument in the SLURM batch script. This is the memory used by the Python process running the training job and is shared among all the CPU cores. +- GPU memory, this is the memory used by the GPU card(s) and depends on the GPU card type you requested via the `--gres gpu:1` argument in the SLURM batch script. To increase it, you can request a specific GPU card type with more GPU memory (e.g. `--gres gpu:a4500:1`). The SWC wiki provides a [list of all GPU card types and their specifications](https://wiki.ucl.ac.uk/display/SSC/CPU+and+GPU+Platform+architecture). +- If requesting more memory doesn't help, you can try reducing the size of your SLEAP models. You may tweak the model backbone architecture, or play with *Input scalng*, *Max stride* and *Batch size*. See SLEAP's [documentation](https://sleap.ai/) and [discussion forum](https://github.com/talmolab/sleap/discussions) for more details. ``` ### Evaluate the trained models From 6ce227000ba85c1b502b0c629b06315bc6bfab89 Mon Sep 17 00:00:00 2001 From: Niko Sirmpilatze Date: Fri, 24 Nov 2023 16:34:49 +0000 Subject: [PATCH 27/29] Apply suggestions from CHL's code review Co-authored-by: Chang Huan Lo --- docs/source/data_analysis/HPC-module-SLEAP.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/source/data_analysis/HPC-module-SLEAP.md b/docs/source/data_analysis/HPC-module-SLEAP.md index 75d880f..2d13e3f 100644 --- a/docs/source/data_analysis/HPC-module-SLEAP.md +++ b/docs/source/data_analysis/HPC-module-SLEAP.md @@ -343,7 +343,7 @@ $ cat slurm.gpu-sr670-20.3445652.err If you encounter out-of-memory errors, keep in mind that there two main sources of memory usage: - CPU memory (RAM), specified via the `--mem` argument in the SLURM batch script. This is the memory used by the Python process running the training job and is shared among all the CPU cores. - GPU memory, this is the memory used by the GPU card(s) and depends on the GPU card type you requested via the `--gres gpu:1` argument in the SLURM batch script. To increase it, you can request a specific GPU card type with more GPU memory (e.g. `--gres gpu:a4500:1`). The SWC wiki provides a [list of all GPU card types and their specifications](https://wiki.ucl.ac.uk/display/SSC/CPU+and+GPU+Platform+architecture). -- If requesting more memory doesn't help, you can try reducing the size of your SLEAP models. You may tweak the model backbone architecture, or play with *Input scalng*, *Max stride* and *Batch size*. See SLEAP's [documentation](https://sleap.ai/) and [discussion forum](https://github.com/talmolab/sleap/discussions) for more details. +- If requesting more memory doesn't help, you can try reducing the size of your SLEAP models. You may tweak the model backbone architecture, or play with *Input scaling*, *Max stride* and *Batch size*. See SLEAP's [documentation](https://sleap.ai/) and [discussion forum](https://github.com/talmolab/sleap/discussions) for more details. ``` ### Evaluate the trained models @@ -385,7 +385,7 @@ The SLEAP GUI on your local machine can be used to quickly evaluate the trained ## Model inference By inference, we mean using a trained model to predict the labels on new frames/videos. 
-SLEAP provides the `sleap-track` command line utility for running inference +SLEAP provides the [`sleap-track`](https://sleap.ai/guides/cli.html?#inference-and-tracking) command line utility for running inference on a single video or a folder of videos. Below is an example SLURM batch script that contains a `sleap-track` call. @@ -436,7 +436,7 @@ sleap-track $VIDEO_DIR/M708149_EPM_20200317_165049331-converted.mp4 \ The script is very similar to the training script, with the following differences: - The time limit `-t` is set lower, since inference is normally faster than training. This will however depend on the size of the video and the number of models used. - The requested number of cores `n` and memory `--mem` are higher. This will depend on the requirements of the specific job you are running. It's best practice to try with a scaled-down version of your data first, to get an idea of the resources needed. -- The requested GPU is of a specific kind (RTX 5000). This will again depend on the requirements of your job, as the different GPU kinds vary in GPU memory size and compute capabilities (see [wiki](https://wiki.ucl.ac.uk/display/SSC/CPU+and+GPU+Platform+architecture)). +- The requested GPU is of a specific kind (RTX 5000). This will again depend on the requirements of your job, as the different GPU kinds vary in GPU memory size and compute capabilities (see [the SWC wiki](https://wiki.ucl.ac.uk/display/SSC/CPU+and+GPU+Platform+architecture)). - The `./train-script.sh` line is replaced by the `sleap-track` command. - The `\` character is used to split the long `sleap-track` command into multiple lines for readability. It is not necessary if the command is written on a single line. @@ -598,7 +598,7 @@ If all is as expected, you can exit the Python interpreter, and then exit the GP ```{code-block} console $ exit() ``` -If you encounter troubles with using the SLEAP module, contact the +If you encounter troubles with using the SLEAP module, contact Niko Sirmpilatze of the SWC [Neuroinformatics Unit](https://neuroinformatics.dev/). To completely exit the HPC cluster, you will need to logout of the SSH session twice: From c90187571f8be99368cc9580c3804aa431e39f06 Mon Sep 17 00:00:00 2001 From: niksirbi Date: Fri, 24 Nov 2023 16:42:30 +0000 Subject: [PATCH 28/29] added reference to SLEAP model evaluation notebook --- docs/source/data_analysis/HPC-module-SLEAP.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/docs/source/data_analysis/HPC-module-SLEAP.md b/docs/source/data_analysis/HPC-module-SLEAP.md index 2d13e3f..d015695 100644 --- a/docs/source/data_analysis/HPC-module-SLEAP.md +++ b/docs/source/data_analysis/HPC-module-SLEAP.md @@ -383,6 +383,8 @@ The SLEAP GUI on your local machine can be used to quickly evaluate the trained - Click on *Add Trained Models(s)* and select the folder containing the model(s) you want to evaluate. - You can view the basic metrics on the shown table or you can also view a more detailed report (including plots) by clicking *View Metrics*. +For more detailed evaluation metrics, you can refer to [SLEAP's model evaluation notebook](https://sleap.ai/notebooks/Model_evaluation.html). + ## Model inference By inference, we mean using a trained model to predict the labels on new frames/videos. 
SLEAP provides the [`sleap-track`](https://sleap.ai/guides/cli.html?#inference-and-tracking) command line utility for running inference From 9dc11167fa7776ad3e40d16a9ad723520933c902 Mon Sep 17 00:00:00 2001 From: niksirbi Date: Fri, 24 Nov 2023 16:46:20 +0000 Subject: [PATCH 29/29] reordered abbreviations based on order of appearance --- docs/source/data_analysis/HPC-module-SLEAP.md | 2 +- docs/source/programming/SLURM-arguments.md | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/source/data_analysis/HPC-module-SLEAP.md b/docs/source/data_analysis/HPC-module-SLEAP.md index d015695..b89c67b 100644 --- a/docs/source/data_analysis/HPC-module-SLEAP.md +++ b/docs/source/data_analysis/HPC-module-SLEAP.md @@ -12,8 +12,8 @@ | [SLEAP](https://sleap.ai/) | Social LEAP Estimates Animal Poses | | [SWC](https://www.sainsburywellcome.org/web/) | Sainsbury Wellcome Centre | | [HPC](https://en.wikipedia.org/wiki/High-performance_computing) | High Performance Computing | -| [SLURM](https://slurm.schedmd.com/) | Simple Linux Utility for Resource Management | | [GUI](https://en.wikipedia.org/wiki/Graphical_user_interface) | Graphical User Interface | +| [SLURM](https://slurm.schedmd.com/) | Simple Linux Utility for Resource Management | ## Prerequisites diff --git a/docs/source/programming/SLURM-arguments.md b/docs/source/programming/SLURM-arguments.md index 664d13f..346ddde 100644 --- a/docs/source/programming/SLURM-arguments.md +++ b/docs/source/programming/SLURM-arguments.md @@ -7,9 +7,9 @@ ## Abbreviations | Acronym | Meaning | | --------------------------------------------------------------- | -------------------------------------------- | +| [SLURM](https://slurm.schedmd.com/) | Simple Linux Utility for Resource Management | | [SWC](https://www.sainsburywellcome.org/web/) | Sainsbury Wellcome Centre | | [HPC](https://en.wikipedia.org/wiki/High-performance_computing) | High Performance Computing | -| [SLURM](https://slurm.schedmd.com/) | Simple Linux Utility for Resource Management | | [MPI](https://en.wikipedia.org/wiki/Message_Passing_Interface) | Message Passing Interface | @@ -94,7 +94,7 @@ If the job exceeds the time limit, it will be terminated by SLURM. On the other hand, avoid requesting way more time than what your job needs, as this may delay its scheduling (depending on resource availability). -If needed, the systems administrator can extend long-running jobs. +If needed, the systems administrator can extend long-running jobs. ::: ### Generic Resources (GPUs)
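A closing, optional tip: if you are unsure which GPU types you can request with `--gres`, standard SLURM commands can list what each node in the `gpu` partition advertises. The commands below are a generic sketch — the node name is taken from the example SLURM output filename earlier in this guide, and the exact listing depends on the cluster's current configuration:

```{code-block} console
$ sinfo -p gpu -o "%P %N %G"
$ scontrol show node gpu-sr670-20
```

In the `sinfo` output, the `%G` (GRES) column lists the GPU types and counts available on each node, which you can then plug into `--gres gpu:<type>:<count>`.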