How to use the SLEAP HPC module with SLURM #40

Merged: 29 commits merged on Nov 24, 2023
01d05b5
started guide on SLEAP module
niksirbi Sep 26, 2023
dca90d3
add and configure sphinx-copybutton
niksirbi Sep 26, 2023
85a98f4
added complete draft for the SLEAP guide
niksirbi Sep 26, 2023
389932e
corrected typos and clarified some spots
niksirbi Sep 27, 2023
8c572bf
temporarily enable publishing from this branch for review
niksirbi Sep 27, 2023
bd49d71
updated SLEAP module guide
niksirbi Oct 10, 2023
981eb18
updated local sleap installation instructions
niksirbi Nov 9, 2023
4426663
fixed broken links
niksirbi Nov 14, 2023
b668e3d
make link to ssh how-to guide explicit
niksirbi Nov 14, 2023
0b53e1c
clarified comment about camera view
niksirbi Nov 14, 2023
1bb89b5
updated default values for batch directives
niksirbi Nov 14, 2023
0625285
move SLURM arguments primer to a separate how-to guide
niksirbi Nov 14, 2023
eb04134
move warning about execute permission earlier
niksirbi Nov 14, 2023
92adfa6
Renamed docs workflow
niksirbi Nov 14, 2023
b7736a5
added job name to STDOUT and STDERR file names
niksirbi Nov 14, 2023
cab10d2
fixed wrong path in inference batch script
niksirbi Nov 14, 2023
65a3650
clarify which folder should be loaded for evaluating models
niksirbi Nov 14, 2023
5b8a2b1
increase cores and memory in slurm examples
niksirbi Nov 22, 2023
2026f0f
added warning for out-of-memory errors
niksirbi Nov 22, 2023
47c03a3
modified swc wiki warning
niksirbi Nov 22, 2023
fa423a2
fixed syntax error
niksirbi Nov 22, 2023
5ce71ba
Apply suggestions from code review
niksirbi Nov 23, 2023
cb5d119
removed duplicate entries from conf.py
niksirbi Nov 23, 2023
9ec251b
added link to abbreviations tables
niksirbi Nov 23, 2023
ec2ed04
Revert "temporarily enable publishing from this branch for review"
niksirbi Nov 23, 2023
62ba48b
clarified two memory types
niksirbi Nov 23, 2023
6ce2270
Apply suggestions from CHL's code review
niksirbi Nov 24, 2023
c901875
added reference to SLEAP model evaluation notebook
niksirbi Nov 24, 2023
9dc1116
reordered abbreviations based on order of appearance
niksirbi Nov 24, 2023
2 changes: 1 addition & 1 deletion .github/workflows/docs_build_and_deploy.yml
@@ -1,4 +1,4 @@
-name: Build Sphinx docs and deploy to GitHub Pages
+name: Docs

# Generate the documentation on all merges to main, all pull requests, or by
# manual workflow dispatch. The build job can be used as a CI check that the
1 change: 1 addition & 0 deletions docs/requirements.txt
@@ -4,4 +4,5 @@ nbsphinx
numpydoc
pydata-sphinx-theme
sphinx
sphinx-copybutton
sphinx-design
1 change: 1 addition & 0 deletions docs/source/_static/swc-wiki-warning.md
@@ -2,4 +2,5 @@
Some links within this document point to the
[SWC internal wiki](https://wiki.ucl.ac.uk/display/SI/SWC+Intranet),
which is only accessible from within the SWC network.
We recommend opening these links in a new tab.
:::
5 changes: 5 additions & 0 deletions docs/source/conf.py
@@ -38,6 +38,7 @@
"sphinx.ext.intersphinx",
"sphinx.ext.napoleon",
"sphinx_design",
"sphinx_copybutton",
"myst_parser",
"numpydoc",
"nbsphinx",
@@ -134,3 +135,7 @@

# Hide the "Show Source" button
html_show_sourcelink = False

# Configure the code block copy button
# don't copy line numbers, prompts, or console outputs
copybutton_exclude = ".linenos, .gp, .go"
610 changes: 610 additions & 0 deletions docs/source/data_analysis/HPC-module-SLEAP.md

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions docs/source/data_analysis/index.md
@@ -5,4 +5,5 @@ Guides related to the analysis of neuroscientific data, spanning a wide range of
```{toctree}
:maxdepth: 1

HPC-module-SLEAP
```
148 changes: 148 additions & 0 deletions docs/source/programming/SLURM-arguments.md
@@ -0,0 +1,148 @@
(slurm-arguments-target)=
# SLURM arguments primer

```{include} ../_static/swc-wiki-warning.md
```

## Abbreviations
| Acronym | Meaning |
| --------------------------------------------------------------- | -------------------------------------------- |
| [SWC](https://www.sainsburywellcome.org/web/) | Sainsbury Wellcome Centre |
| [HPC](https://en.wikipedia.org/wiki/High-performance_computing) | High Performance Computing |
| [SLURM](https://slurm.schedmd.com/) | Simple Linux Utility for Resource Management |
| [MPI](https://en.wikipedia.org/wiki/Message_Passing_Interface) | Message Passing Interface |


## Overview
SLURM is a job scheduler and resource manager used on the SWC HPC cluster.
It is responsible for allocating resources (e.g. CPU cores, GPUs, memory) to jobs submitted by users.
When submitting a job to SLURM, you can specify various arguments to request the resources you need.
These are called SLURM directives, and they are passed to SLURM via the `sbatch` or `srun` commands.

These are often specified at the top of a SLURM job script,
e.g. the lines that start with `#SBATCH` in the following example:

```{code-block} bash
#!/bin/bash

#SBATCH -J my_job # job name
#SBATCH -p gpu # partition (queue)
#SBATCH -N 1 # number of nodes
#SBATCH --mem 16G # memory pool for all cores
#SBATCH -n 4 # number of cores
#SBATCH -t 0-06:00 # time (D-HH:MM)
#SBATCH --gres gpu:1 # request 1 GPU (of any kind)
#SBATCH -o slurm.%x.%N.%j.out # STDOUT
#SBATCH -e slurm.%x.%N.%j.err # STDERR
#SBATCH --mail-type=ALL
#SBATCH [email protected]
#SBATCH --array=1-12%4 # job array index values

# load modules
...

# execute commands
...
```
This guide provides only a brief overview of the most important SLURM arguments,
to demystify the above directives and help you get started with SLURM.
For a more detailed description see the [SLURM documentation](https://slurm.schedmd.com/sbatch.html).

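To make the workflow concrete, a job script like the one above is typically submitted and monitored from the cluster's login node roughly as follows (the script name is hypothetical; these commands only work on a machine with SLURM installed):

```shell
# Submit the job script; sbatch prints the assigned job ID
sbatch my_job_script.sh

# Check the status of your queued and running jobs
squeue -u $USER

# Cancel a job if needed (replace 123456 with the actual job ID)
scancel 123456
```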
## Commonly used arguments

### Partition (Queue)
- *Name:* `--partition`
- *Alias:* `-p`
- *Description:* Specifies the partition (or queue) to submit the job to. To see a list of all partitions/queues, the nodes they contain and their respective time limits, type `sinfo` when logged in to the HPC cluster.
- *Example values:* `gpu`, `cpu`, `fast`, `medium`

### Job Name
- *Name:* `--job-name`
- *Alias:* `-J`
- *Description:* Specifies a name for the job, which will appear in various SLURM commands and logs, making it easier to identify the job (especially when you have multiple jobs queued up).
- *Example values:* `training_run_24`

### Number of Nodes
- *Name:* `--nodes`
- *Alias:* `-N`
- *Description:* Defines the number of nodes required for the job.
- *Example values:* `1`

:::{warning}
This is usually `1` unless you're parallelising your code across multiple nodes with technologies such as MPI.
:::

### Number of Cores
- *Name:* `--ntasks`
- *Alias:* `-n`
- *Description:* Defines the number of cores (or tasks) required for the job.
- *Example values:* `1`, `5`, `20`

### Memory Pool for All Cores
- *Name:* `--mem`
- *Description:* Specifies the total amount of memory (RAM) required for the job across all cores (per node).
- *Example values:* `4G`, `32G`, `64G`

### Time Limit
- *Name:* `--time`
- *Alias:* `-t`
- *Description:* Sets the maximum time the job is allowed to run. The format is D-HH:MM, where D is days, HH is hours, and MM is minutes.
- *Example values:* `0-01:00` (1 hour), `0-04:00` (4 hours), `1-00:00` (1 day).

:::{warning}
If the job exceeds the time limit, it will be terminated by SLURM.
On the other hand, avoid requesting far more time than your job needs,
as this may delay its scheduling (depending on resource availability).

If needed, the systems administrator can extend long-running jobs.
:::
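When estimating how much time to request, it can help to convert the `D-HH:MM` format into a single number. The helper below is an illustrative bash sketch (not part of SLURM) that turns a time limit into total minutes:

```shell
#!/bin/bash
# Convert a SLURM time limit in D-HH:MM format into total minutes.
# Illustrative helper, not a SLURM command.
slurm_time_to_minutes() {
    local days hours mins
    days="${1%%-*}"                    # part before the dash
    hours="${1#*-}"; hours="${hours%%:*}"
    mins="${1##*:}"
    # force base-10 to avoid octal issues with leading zeros (e.g. "08")
    echo $(( 10#$days * 1440 + 10#$hours * 60 + 10#$mins ))
}

slurm_time_to_minutes "0-06:00"   # prints 360
slurm_time_to_minutes "1-00:00"   # prints 1440
```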

### Generic Resources (GPUs)
- *Name:* `--gres`
- *Description:* Requests generic resources, such as GPUs.
- *Example values:* `gpu:1`, `gpu:rtx2080:1`, `gpu:rtx5000:1`, `gpu:a100_2g.10gb:1`

:::{warning}
No GPU will be allocated to you unless you specify it via the `--gres` argument (even if you are on the 'gpu' partition).
To request 1 GPU of any kind, use `--gres gpu:1`. To request a specific GPU type, you have to include its name, e.g. `--gres gpu:rtx2080:1`.
You can view the available GPU types on the [SWC internal wiki](https://wiki.ucl.ac.uk/display/SSC/CPU+and+GPU+Platform+architecture).
:::

### Standard Output File
- *Name:* `--output`
- *Alias:* `-o`
- *Description:* Defines the file where the standard output (STDOUT) will be written. In the example script above, it's set to `slurm.%x.%N.%j.out`, where `%x` is the job name, `%N` is the node name and `%j` is the job ID.
- *Example values:* `slurm.%x.%N.%j.out`, `slurm.MyAwesomeJob.out`

:::{note}
This file contains the output of the commands executed by the job (i.e. the messages that normally get printed on the terminal).
:::
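To see how the placeholders resolve, here is a small bash sketch (mimicking, not invoking, SLURM's own expansion) with hypothetical job details:

```shell
#!/bin/bash
# Mimic SLURM's filename pattern expansion for %x (job name),
# %N (node name) and %j (job ID). Illustrative only.
expand_slurm_pattern() {
    local pattern="$1" job_name="$2" node="$3" job_id="$4"
    pattern="${pattern//%x/$job_name}"
    pattern="${pattern//%N/$node}"
    pattern="${pattern//%j/$job_id}"
    echo "$pattern"
}

expand_slurm_pattern "slurm.%x.%N.%j.out" "my_job" "gpu-node-01" "123456"
# prints: slurm.my_job.gpu-node-01.123456.out
```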

### Standard Error File
- *Name:* `--error`
- *Alias:* `-e`
- *Description:* Specifies the file where the standard error (STDERR) will be written. In the example script above, it's set to `slurm.%x.%N.%j.err`, where `%x` is the job name, `%N` is the node name and `%j` is the job ID.
- *Example values:* `slurm.%N.%j.err`, `slurm.MyAwesomeJob.err`

:::{note}
This file is very useful for debugging, as it contains all the error messages produced by the commands executed by the job.
:::

### Email Notifications
- *Name:* `--mail-type`
- *Description:* Defines the conditions under which the user will be notified by email.
- *Example values:* `ALL`, `BEGIN`, `END`, `FAIL`

### Email Address
- *Name:* `--mail-user`
- *Description:* Specifies the email address to which notifications will be sent.
- *Example values:* `[email protected]`

### Array jobs
- *Name:* `--array`
- *Description:* Job array index values (a list of integers in increasing order). The task index can be accessed via the `SLURM_ARRAY_TASK_ID` environment variable.
- *Example values:* `--array=1-10` (10 jobs), `--array=1-100%5` (100 jobs, but only 5 of them will be allowed to run in parallel at any given time).

:::{warning}
If an array consists of many jobs, using the `%` syntax to limit the maximum number of parallel jobs is recommended to prevent overloading the cluster.
:::
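A common pattern inside an array job script is to use the task index to select one input from a list. The sketch below uses hypothetical file names; outside a SLURM job, `SLURM_ARRAY_TASK_ID` is unset, so it falls back to 1 to keep the script runnable:

```shell
#!/bin/bash
# Select an input file based on the array task index.
# Outside a SLURM job, default to task 1 so the script stays runnable.
TASK_ID="${SLURM_ARRAY_TASK_ID:-1}"

# Hypothetical inputs; in practice this list might be built with ls or find
INPUT_FILES=(video_01.mp4 video_02.mp4 video_03.mp4)

# Bash array indices start at 0, while these task IDs start at 1
INPUT="${INPUT_FILES[$((TASK_ID - 1))]}"
echo "Task ${TASK_ID} will process ${INPUT}"
```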
17 changes: 9 additions & 8 deletions docs/source/programming/SSH-SWC-cluster.md
@@ -1,3 +1,4 @@
(ssh-cluster-target)=
# Set up SSH for the SWC HPC cluster

This guide explains how to connect to the SWC's HPC cluster via SSH.
@@ -9,14 +10,14 @@
```

## Abbreviations
| Acronym | Meaning |
| --- | --- |
| SWC | Sainsbury Wellcome Centre |
| HPC | High Performance Computing |
| SLURM | Simple Linux Utility for Resource Management |
| SSH | Secure (Socket) Shell protocol |
| IDE | Integrated Development Environment |
| GUI | Graphical User Interface |
| Acronym | Meaning |
| ----------------------------------------------------------------------- | -------------------------------------------- |
| [SWC](https://www.sainsburywellcome.org/web/) | Sainsbury Wellcome Centre |
| [HPC](https://en.wikipedia.org/wiki/High-performance_computing) | High Performance Computing |
| [SLURM](https://slurm.schedmd.com/) | Simple Linux Utility for Resource Management |
| [SSH](https://en.wikipedia.org/wiki/Secure_Shell) | Secure (Socket) Shell protocol |
| [IDE](https://en.wikipedia.org/wiki/Integrated_development_environment) | Integrated Development Environment |
| [GUI](https://en.wikipedia.org/wiki/Graphical_user_interface) | Graphical User Interface |

## Prerequisites
- You have an SWC account and know your username and password.
1 change: 1 addition & 0 deletions docs/source/programming/index.md
@@ -7,6 +7,7 @@ Small tips and tricks that do not warrant a long-form guide can be found in the
```{toctree}
:maxdepth: 1

SLURM-arguments
SSH-SWC-cluster
SSH-vscode
Mount-ceph-ubuntu