Add mitch-ish 256 for LUMI #351
Conversation
Why do you think this will run out of the box? That's some confidence!
python scripts/train.py configs/v1_5-mix-medium-mitch-ish.yaml ${@} \
  --run_name=${SLURM_JOB_ID} \
  --global_train_batch_size=4096 \
  --max_duration=238418
What is this duration?
That's 2T tokens
(if my math is right)
Yes, it is, for a batch size of 8M tokens.
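For reference, a quick back-of-the-envelope check of those numbers; this is just a sketch, and the 2048-token sequence length is an assumption (it isn't stated in this diff):

# Assumes sequence_length=2048; global_train_batch_size=4096 sequences per step.
echo $(( 4096 * 2048 ))            # tokens per step: 8,388,608 (~8M)
echo $(( 238418 * 4096 * 2048 ))   # total tokens: 1,999,995,142,144 (~2T)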
module load LUMI/22.08 partition/G

export OLMO_CONTAINER=llm-lumi_latest.sif
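For context, a hypothetical sketch of how OLMO_CONTAINER is typically consumed further down a LUMI launch script; the exact srun/singularity invocation below is an assumption, not the actual contents of this PR:

# Assumed usage pattern: launch training inside the Singularity container on each rank.
srun singularity exec "$OLMO_CONTAINER" \
  python scripts/train.py configs/v1_5-mix-medium-mitch-ish.yaml --run_name=${SLURM_JOB_ID}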
We should run on torch 2.1 now.
What's the name of that image?
llm-lumi-torch32_latest.sif, I think.
Done
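Presumably the fix amounts to swapping the container image, i.e. something like the following (an assumption based on the image name suggested above):

export OLMO_CONTAINER=llm-lumi-torch32_latest.sif   # torch 2.x image mentioned above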
@@ -0,0 +1,59 @@
#!/bin/bash
#SBATCH --job-name=v1.5-mix-medium-mitch-ish
Config is now called mitch7? Or mitch7-something_related_to_large_batch_size?
I don't see any reason to rename the train config. But we could rename this script to v1_5-mix-medium-mitch-ish-large-batch-on-lumi.sh?
v1_5-mix-medium-mitch-ish-large-batch-on-lumi is a bit unwieldy, no? When we talk about it, we just call it mitchish. All the configs are v1.5 now. 7 is shorter than medium.
Renamed to mitch-ish-7b.sh.
Between this config and the one @ibeltagy wrote, which one are we keeping?
We can delete the one on my branch, but give me a few minutes to compare both configs and leave comments for differences.
The differences are:
Does it run on Kempner?
It runs on MosaicML, which is a better comparison to LUMI. The problem with Kempner is that there aren't enough nodes, so FSDP still takes a lot of memory, and we have to use other tricks like activation checkpointing.
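For illustration only, a hedged sketch of the kind of overrides being discussed, following the dotted-flag style used earlier in this script; the option names and values here (activation_checkpointing, fsdp.wrapping_strategy, by_block) are assumptions about the OLMo config schema, not taken from this diff:

# Hypothetical overrides (field names assumed):
python scripts/train.py configs/v1_5-mix-medium-mitch-ish.yaml \
  --activation_checkpointing=true \
  --fsdp.wrapping_strategy=by_block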
The run on MosaicML uses scheduler.name: linear_with_warmup while this one uses cosine. Checking whether this is on purpose or needs to be updated.
@ibeltagy good catch. That's actually been updated on the
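If the intent were to match the MosaicML run, and assuming scheduler.name accepts the same dotted override style used for the other options in this script (an assumption, not verified here), the change would look roughly like:

# Sketch: switch the scheduler to match the MosaicML run (flag name assumed).
python scripts/train.py configs/v1_5-mix-medium-mitch-ish.yaml ${@} \
  --scheduler.name=linear_with_warmup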
This adds a 256-node mitch-ish run for LUMI (2x the batch size). I think this will run as-is, but if not we'll have to try a different FSDP wrapping strategy.
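For a sense of scale, a minimal sketch of the resource-request lines a 256-node LUMI-G header might add on top of the #SBATCH --job-name line shown above; the partition, account, per-node GPU count, and time limit below are assumptions rather than values from this PR:

#SBATCH --nodes=256                  # the 256-node run described above
#SBATCH --ntasks-per-node=8          # assumption: one rank per GCD (8 GCDs per LUMI-G node)
#SBATCH --gpus-per-node=8            # assumption
#SBATCH --partition=standard-g       # assumption: LUMI-G partition name
#SBATCH --time=48:00:00              # assumption
#SBATCH --account=project_xxxxxxxxx  # hypothetical placeholder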