more post updates
stmorse committed Sep 11, 2024
1 parent 24d52c1 commit ceed957
Showing 6 changed files with 81 additions and 17 deletions.
10 changes: 5 additions & 5 deletions _data/navigation.yml
@@ -7,11 +7,11 @@
url: /hpc-gitbook/logging-in-and-setting-up-your-hpc-account/create.html
- title: 👋 Login & Basic Setup
url: /hpc-gitbook/logging-in-and-setting-up-your-hpc-account/login-and-basic-setup.html
- title: 🐍 Uploading Files
- title: 📁 Uploading Files
url: /hpc-gitbook/logging-in-and-setting-up-your-hpc-account/filezilla.html
- title: Using the Jump Server and Configuring SSH
- title: 🔒 Using the Jump Server and Configuring SSH
url: /hpc-gitbook/logging-in-and-setting-up-your-hpc-account/configuring-ssh.html
- title: 🐍 Using Python and Conda
- title: 🐍 Intro to using Python and Conda
url: /hpc-gitbook/logging-in-and-setting-up-your-hpc-account/conda-environments.html

- title: The Batch System
@@ -20,14 +20,14 @@
url: /hpc-gitbook/the-batch-system/what-is-the-batch-system.html
- title: 👷 Jobs
url: /hpc-gitbook/the-batch-system/jobs.html
- title: 🗺 PBSTOP - Your Cluster Roadmap
url: /hpc-gitbook/the-batch-system/pbstop-your-cluster-roadmap.html
- title: Interactive Jobs
url: /hpc-gitbook/the-batch-system/interactive-jobs.html
- title: Non-Interactive Jobs
url: /hpc-gitbook/the-batch-system/non-interactive-jobs.html
- title: Checking the status of your jobs
url: /hpc-gitbook/the-batch-system/checking-the-status-of-your-jobs.html
- title: 🗺 PBSTOP - Your Cluster Roadmap
url: /hpc-gitbook/the-batch-system/pbstop-your-cluster-roadmap.html
- title: Deleting Jobs
url: /hpc-gitbook/the-batch-system/deleting-jobs.html

2 changes: 2 additions & 0 deletions logging-in-and-setting-up-your-hpc-account/filezilla.md
Expand Up @@ -12,6 +12,8 @@ scp myFile.txt <user>@bora.sciclone.wm.edu:test

If your [config file is set up](https://d8a-science.github.io/hpc-gitbook/logging-in-and-setting-up-your-hpc-account/configuring-ssh.html), this can simplify to just `scp myFile.txt bora:test`.

(**Note:** even if you don't want the file to go in a subfolder on your remote home, you still need the colon and some placeholder like `~`.)

But there is a wide range of tools you can use to get files onto the HPC --- another is [rsync](https://www.samba.org/rsync/).
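
For instance, a minimal `rsync` transfer might look like the following sketch (assuming the `bora` alias from your SSH config, with `myproject/` as a placeholder local folder):

```
# copy a local folder to your remote home, preserving timestamps and permissions
rsync -avz myproject/ bora:~/myproject/
```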

While not appropriate for large files (say, >100 GB), there are also nice GUI options, like [FileZilla](https://filezilla-project.org/), a free FTP platform.
62 changes: 61 additions & 1 deletion the-batch-system/interactive-jobs.md
@@ -1,6 +1,66 @@
# Interactive Jobs

Interactive jobs function as if you were programming regularly from your terminal. I generally launch interactive jobs when I need to move large amounts of data or when I'm testing or debugging short bits of code. We can launch an interactive job using the following command:
Interactive jobs function as if you were programming regularly from your terminal. We generally launch interactive jobs when we need to move large amounts of data or when we're testing or debugging short bits of code. We can launch an interactive job in similar ways using Slurm or Torque.


## Slurm

In Slurm, you start interactive sessions with `salloc` (think: "slurm allocate"). Go into a subcluster that has GPUs and try:

```
salloc -N 1 -n 1 -t 30:00 --gpus=1
```

Each flag denotes a different resource request: `-N` (or `--nodes`) for nodes, `-n` (or `--ntasks`) for cores, `-t` (or `--time`) for time, and `--gpus` (or `-G`) for ... GPUs. So this is a request for 1 node, 1 core, and 1 GPU, for 30 minutes.
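
For readability, here is the same request written out with the long-form flags (this is just the command above spelled out, not a different request):

```
salloc --nodes=1 --ntasks=1 --time=30:00 --gpus=1
```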

There are many more examples on the updated [HPC website here](https://www.wm.edu/offices/it/services/researchcomputing/using/running_jobs_slurm/).

After running `salloc`, you will see some feedback from the resource manager --- in this case, for such a meager request, it will likely immediately report that it has allocated your requested resources and then place you into a new shell like `[gu03]` or `[vx02]`, depending on your entry subcluster.

Test that you actually got the GPU you requested with a command like:

```
[gu03] nvidia-smi
```

which should return something like

```
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A40 On | 00000000:81:00.0 Off | 0 |
| 0% 22C P8 22W / 300W | 0MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
```

Cool!

Caveats: double-check your `pwd`. Slurm defaults to the directory where you ran `salloc`, while Torque defaults to your home directory. Also double-check your modules: Slurm should carry over what you had loaded at startup, while Torque does not. Some other minor details and differences are [here](https://www.wm.edu/offices/it/services/researchcomputing/using/running_jobs_slurm/).
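
A quick sanity check once you land in the allocation might look like this (assuming the usual `module` command is available on the compute node):

```
pwd            # confirm which directory the job dropped you into
module list    # confirm which modules carried over into the session
```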

Note that it doesn't matter which node you are actually on. Any Conda environments you've created and any files you have will exist just the same on any node. To see this, type `ls`. You'll see the same folders you saw when you ran `ls` from outside of the job. No matter which physical node on Vortex you are running on, you'll always see the same folders and files.

You can run scripts from this node, in the terminal. A tricky bit emerges when you want to connect a Jupyter notebook or IDE session to this node --- we'll need to cover that in a different post.
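
For example, something like the following, where `my_env` and `my_script.py` are placeholder names for an environment and script of your own:

```
conda activate my_env    # a Conda environment you created earlier (placeholder name)
python my_script.py      # runs on the allocated compute node, not the login node
```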

When you are done in an interactive job, you can exit by hitting `ctrl + d` or typing `exit`.


## Torque

Torque is being discontinued on the HPC, but in case that changes, or you find yourself working on a system outside the HPC where this is still useful, here is an example of a command to create an interactive session in Torque:

```
qsub -I -l nodes=1:vortex:ppn=12,walltime=01:00:00
10 changes: 3 additions & 7 deletions the-batch-system/jobs.md
@@ -27,15 +27,11 @@ Below we will cover some basic commands for Torque and Slurm to understand the j

## Basic Commands

Most basic commands have analogues in Torque/Slurm. Tor`q`ue commands typically start with `q`, while `S`lurm commands typically start with `s`.
Most basic commands have analogues in Torque/Slurm. Tor`q`ue commands typically start with `q`, while `S`lurm commands typically start with `s`. The HPC provides a great Rosetta Stone between the two flavors of resource management on their updated page [here](https://www.wm.edu/offices/it/services/researchcomputing/using/running_jobs_slurm/).

`qstat`, `squeue` - Provides a list of all jobs currently in the system, requested times, status ("R" for running, "Q" for in queue) and other factors. You may also provide a username flag, for example: `qstat -u <username>` to limit the list to *your* current jobs.
For example, here are some Torque-specific commands. We'll cover these and the Slurm-specific ones in subsequent posts.

**Slurm-specific:**

*To-do*

**Torque-specific:**
`qstat` - Provides a list of all jobs currently in the system, requested times, status ("R" for running, "Q" for in queue) and other factors. You may also provide a username flag, for example: `qstat -u <username>` to limit the list to *your* current jobs.

`showstats` - Provides a synopsis of cluster-wide availability.
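
As a quick taste of the correspondence (following the Rosetta Stone linked above; treat the Slurm half as a sketch until we cover it properly), checking on your own jobs looks like:

```
qstat -u $USER     # Torque: list only your jobs
squeue -u $USER    # Slurm: the rough equivalent
```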

10 changes: 9 additions & 1 deletion the-batch-system/non-interactive-jobs.md
@@ -1,6 +1,14 @@
# Non-Interactive Jobs

Non-Interactive jobs are much more common, however they require what we call a job script. This is just a text file that might look like the following:
Non-interactive jobs are much more common; however, they require what we call a job script.

## Slurm

**To do.**
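
In the meantime, here is a minimal sketch of what a Slurm job script might look like; the job name, resource values, and script name are illustrative placeholders, not recommendations for this cluster:

```
#!/bin/tcsh
#SBATCH --job-name=test_job
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:30:00

python my_script.py
```

You would submit such a script with `sbatch <scriptname>`, analogous to `qsub` in Torque below.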

## Torque

In Torque, a job script is just a text file that might look like the following:

```
#!/bin/tcsh
4 changes: 1 addition & 3 deletions the-batch-system/what-is-the-batch-system.md
@@ -8,7 +8,7 @@ By submitting jobs, we reserve the resources on the nodes for ourselves. If we w

There are two common ways to approach this resource management and scheduling problem:

- [**Torque**](https://en.wikipedia.org/wiki/TORQUE) (Terascale Open-source Resource and Queue Manager), which uses PBS (Portable Batch System), and is integrated with a scheduler like Maui or Moab, and
- [**Torque**](https://en.wikipedia.org/wiki/TORQUE) (Terascale Open-source Resource and Queue Manager), which uses PBS (Portable Batch System) and is integrated with a scheduler like Maui or Moab.

- [**Slurm**](https://en.wikipedia.org/wiki/Slurm_Workload_Manager) (Simple Linux Utility for Resource Management).

@@ -19,5 +19,3 @@ Most of the resources in this guide (as of September 2024) are written for Torqu
Both systems use a Message Passing Interface (MPI) to coordinate processes in parallel, and both can run jobs in batch mode or interactive mode, among other similarities, although there are also many differences, like where and how jobs are submitted ([check out this summary](https://www.wm.edu/offices/it/services/researchcomputing/using/running_jobs_slurm/)) --- so even instructions written for a Torque context will likely still provide useful help in a Slurm context, with appropriate caution.

In the next post we'll cover basics of jobs, the job queue, and commands for Torque and Slurm, before diving into some more specific examples.

