Add mitch-ish 256 for LUMI #351
Conversation
Why do you think this will run out of the box? That's some confidence!
python scripts/train.py configs/v1_5-mix-medium-mitch-ish.yaml ${@} \
  --run_name=${SLURM_JOB_ID} \
  --global_train_batch_size=4096 \
  --max_duration=238418
What is this duration?
That's 2T tokens
(if my math is right)
Yes, it is, for a batch size of 8M tokens.
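For reference, a quick back-of-the-envelope check of those numbers; this is just a sketch, and the 2048-token sequence length is an assumption (it isn't stated in this diff):

# Assumes sequence_length=2048; global_train_batch_size=4096 sequences per step.
echo $(( 4096 * 2048 ))            # tokens per step: 8,388,608 (~8M)
echo $(( 238418 * 4096 * 2048 ))   # total tokens: 1,999,995,142,144 (~2T)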
module load LUMI/22.08 partition/G

export OLMO_CONTAINER=llm-lumi_latest.sif
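For context, a hypothetical sketch of how OLMO_CONTAINER is typically consumed further down a LUMI launch script; the exact srun/singularity invocation below is an assumption, not the actual contents of this PR:

# Assumed usage pattern: launch training inside the Singularity container on each rank.
srun singularity exec "$OLMO_CONTAINER" \
  python scripts/train.py configs/v1_5-mix-medium-mitch-ish.yaml --run_name=${SLURM_JOB_ID}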
We should run on torch 2.1 now.
What's the name of that image?
llm-lumi-torch32_latest.sif, I think.
Done
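Presumably the fix amounts to swapping the container image, i.e. something like the following (an assumption based on the image name suggested above):

export OLMO_CONTAINER=llm-lumi-torch32_latest.sif   # torch 2.x image mentioned above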
@@ -0,0 +1,59 @@
#!/bin/bash
#SBATCH --job-name=v1.5-mix-medium-mitch-ish
Config is now called mitch7? Or mitch7-something_related_to_large_batch_size?
I don't see any reason to rename the train config. But we could rename this script to v1_5-mix-medium-mitch-ish-large-batch-on-lumi.sh?
v1_5-mix-medium-mitch-ish-large-batch-on-lumi is a bit unwieldy, no? When we talk about it, we just call it mitchish. All the configs are v1.5 now. 7 is shorter than medium.
Renamed to mitch-ish-7b.sh.
Between this config and the one @ibeltagy wrote, which one are we keeping?
We can delete the one on my branch, but give me a few minutes to compare both configs and leave comments for differences.
The differences are:
Does it run on Kempner?
It runs on MosaicML, which is a better comparison to LUMI. The problem with Kempner is that there aren't enough nodes, so FSDP still takes a lot of memory, and we have to use other tricks like activation checkpointing.
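For illustration only, a hedged sketch of the kind of overrides being discussed, following the dotted-flag style used earlier in this script; the option names and values here (activation_checkpointing, fsdp.wrapping_strategy, by_block) are assumptions about the OLMo config schema, not taken from this diff:

# Hypothetical overrides (field names assumed):
python scripts/train.py configs/v1_5-mix-medium-mitch-ish.yaml \
  --activation_checkpointing=true \
  --fsdp.wrapping_strategy=by_block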
The run on MosaicML uses scheduler.name: linear_with_warmup while this one uses cosine. Checking whether this is on purpose or needs to be updated.
@ibeltagy good catch. That's actually been updated on the
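If the intent were to match the MosaicML run, and assuming scheduler.name accepts the same dotted override style used for the other options in this script (an assumption, not verified here), the change would look roughly like:

# Sketch: switch the scheduler to match the MosaicML run (flag name assumed).
python scripts/train.py configs/v1_5-mix-medium-mitch-ish.yaml ${@} \
  --scheduler.name=linear_with_warmup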
This adds a 256-node mitch-ish run for LUMI (2x the batch size). I think this will run as-is, but if not we'll have to try a different FSDP wrapping strategy.
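For a sense of scale, a minimal sketch of the resource-request lines a 256-node LUMI-G header might add on top of the #SBATCH --job-name line shown above; the partition, account, per-node GPU count, and time limit below are assumptions rather than values from this PR:

#SBATCH --nodes=256                  # the 256-node run described above
#SBATCH --ntasks-per-node=8          # assumption: one rank per GCD (8 GCDs per LUMI-G node)
#SBATCH --gpus-per-node=8            # assumption
#SBATCH --partition=standard-g       # assumption: LUMI-G partition name
#SBATCH --time=48:00:00              # assumption
#SBATCH --account=project_xxxxxxxxx  # hypothetical placeholder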