try 2 node run
mwalmsley committed Nov 3, 2023
1 parent adf9f26 commit 257b4dc
Showing 3 changed files with 15 additions and 3 deletions.
2 changes: 1 addition & 1 deletion only_for_me/narval/finetune.py
@@ -61,7 +61,7 @@
os.path.join(os.environ['SLURM_TMPDIR'], 'walml/finetune/checkpoints'),
accelerator='gpu',
devices=2,
-num_nodes=1,
+num_nodes=2,
strategy='ddp',
precision='16-mixed',
max_epochs=max_epochs,
9 changes: 8 additions & 1 deletion only_for_me/narval/finetune.sh
@@ -1,11 +1,18 @@
#!/bin/bash
#SBATCH --mem=32G
-#SBATCH --nodes=1
+#SBATCH --nodes=2
#SBATCH --time=0:20:0
#SBATCH --tasks-per-node=2
#SBATCH --cpus-per-task=12
#SBATCH --gres=gpu:a100:2

#### SBATCH --mem=32G
#### SBATCH --nodes=1
#### SBATCH --time=0:20:0
#### SBATCH --tasks-per-node=2
#### SBATCH --cpus-per-task=12
#### SBATCH --gres=gpu:a100:2

####
#### SBATCH --mem=16G
#### SBATCH --nodes=1
7 changes: 6 additions & 1 deletion only_for_me/narval/narval.md
@@ -8,6 +8,7 @@ https://prashp.gitlab.io/post/compute-canada-tut/
https://docs.alliancecan.ca/wiki/Python

ssh [email protected]
ssh-copy-id to avoid password in future

module purge
module avail
@@ -51,8 +52,12 @@ and my own cloned repos
pip install --no-deps -e galaxy-datasets
pip install --no-deps -e zoobot

Run training

sbatch only_for_me/narval/finetune.sh

Works with simple images on multi-GPU, single node

Multi-node notes

https://lightning.ai/docs/pytorch/stable/clouds/cluster_intermediate_2.html#
https://pytorch.org/docs/stable/elastic/run.html#environment-variables
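
A minimal sketch of keeping the Trainer arguments in step with the sbatch header, assuming the Lightning SLURM guidance that `num_nodes` should match `--nodes` and `devices` should match `--tasks-per-node`. The helper name `trainer_layout_from_slurm` is hypothetical and not part of this repo; `SLURM_NNODES` and `SLURM_NTASKS_PER_NODE` are environment variables Slurm sets inside a job (the latter only when tasks per node is requested explicitly).

```python
import os
from typing import Mapping


def trainer_layout_from_slurm(env: Mapping[str, str] = os.environ) -> dict:
    """Derive num_nodes/devices from Slurm's job environment.

    Hypothetical helper: falls back to 1 node / 1 device when the
    variables are absent (e.g. when running outside a Slurm job).
    """
    # SLURM_NNODES: number of nodes allocated to the job (from --nodes)
    num_nodes = int(env.get("SLURM_NNODES", "1"))
    # SLURM_NTASKS_PER_NODE: set when --tasks-per-node is given in the sbatch header
    devices = int(env.get("SLURM_NTASKS_PER_NODE", "1"))
    return {"num_nodes": num_nodes, "devices": devices}


if __name__ == "__main__":
    # The result could be unpacked into the Trainer call, e.g. Trainer(**layout, ...)
    layout = trainer_layout_from_slurm()
    print(layout)
```

With the header in finetune.sh (`--nodes=2`, `--tasks-per-node=2`), this would give the same `num_nodes=2`, `devices=2` that finetune.py currently hard-codes, so the two files could not drift apart.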