
Add mitch-ish 256 for LUMI #351

Merged (4 commits, Nov 6, 2023)

Conversation

@epwalsh (Member) commented Oct 31, 2023

This adds a 256-node mitch-ish run for LUMI (2x the batch size). I think this will run as-is, but if not we'll have to try a different FSDP wrapping strategy.

@epwalsh epwalsh requested a review from dirkgr November 1, 2023 00:10
@dirkgr (Member) left a comment:
Why do you think this will run out of the box? That's some confidence!

python scripts/train.py configs/v1_5-mix-medium-mitch-ish.yaml ${@} \
--run_name=${SLURM_JOB_ID} \
--global_train_batch_size=4096 \
--max_duration=238418
Member:

What is this duration?

Member Author:

That's 2T tokens

Member Author:

(if my math is right)

Member:

Yes, it is, for a batch size of 8M tokens.
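The arithmetic is easy to verify (a minimal sketch; the 2048-token sequence length is an assumption, since only the 4096-sequence global batch size appears in the diff):

```python
# Sanity check for --max_duration=238418 (a sketch; the 2048-token
# sequence length is assumed, not stated in this thread).
global_batch_size = 4096                       # sequences per step
seq_len = 2048                                 # tokens per sequence (assumed)
tokens_per_step = global_batch_size * seq_len  # 8,388,608, i.e. the "8M tokens"
target_tokens = 2 * 10**12                     # 2T tokens

steps = target_tokens // tokens_per_step
print(tokens_per_step, steps)  # 8388608 238418
```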


module load LUMI/22.08 partition/G

export OLMO_CONTAINER=llm-lumi_latest.sif
Member:

We should run on torch 2.1 now.

Member Author:

What's the name of that image?

Member:

llm-lumi-torch32_latest.sif, I think.

Member Author:

Done

@@ -0,0 +1,59 @@
#!/bin/bash
#SBATCH --job-name=v1.5-mix-medium-mitch-ish
Member:

Config is now called mitch7? Or mitch7-something_related_to_large_batch_size?

Member Author:

I don't see any reason to rename the train config. But we could rename this script to v1_5-mix-medium-mitch-ish-large-batch-on-lumi.sh?

Member:

v1_5-mix-medium-mitch-ish-large-batch-on-lumi is a bit unwieldy, no? When we talk about it, we just call it mitchish. All the configs are v1.5 now. 7 is shorter than medium.

Member Author:

Renamed to mitch-ish-7b.sh.

@dirkgr (Member) commented Nov 2, 2023

Between this config and the one @ibeltagy wrote, which one are we keeping?

@ibeltagy (Contributor) commented Nov 2, 2023

> Between this config and the one @ibeltagy wrote, which one are we keeping?

We can delete the one on my branch, but give me a few minutes to compare both configs and leave comments for differences.

@ibeltagy (Contributor) commented Nov 2, 2023

The differences are:

  • flash attention, compile, and FSDP wrapping (all speed/hardware settings)
  • the only real difference: this one doesn't use scheduler.name: linear_with_warmup
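The practical effect of that scheduler difference can be sketched with generic schedule formulas (a sketch with placeholder warmup and peak-LR values; these are textbook shapes, not necessarily OLMo's exact implementation):

```python
import math

PEAK, WARMUP, TOTAL = 3e-4, 2000, 238418  # placeholder values for illustration

def linear_with_warmup(step):
    """Linear warmup to PEAK, then linear decay toward zero."""
    if step < WARMUP:
        return PEAK * step / WARMUP
    return PEAK * (TOTAL - step) / (TOTAL - WARMUP)

def cosine_with_warmup(step):
    """Linear warmup to PEAK, then cosine decay toward zero."""
    if step < WARMUP:
        return PEAK * step / WARMUP
    t = (step - WARMUP) / (TOTAL - WARMUP)
    return PEAK * 0.5 * (1 + math.cos(math.pi * t))

# A quarter of the way through the decay, cosine holds the LR noticeably higher:
step = TOTAL // 4
print(f"linear: {linear_with_warmup(step):.2e}, cosine: {cosine_with_warmup(step):.2e}")
```

Early in the decay cosine stays above linear and late in the decay it drops below, so the two runs would see different effective learning rates even with identical peak values.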

@epwalsh epwalsh changed the base branch from main to mitchish November 2, 2023 21:02
scripts/lumi/mitch-ish-7b.sh (outdated thread, resolved)
@dirkgr (Member) left a comment:

Does it run on Kempner?

@epwalsh (Member Author) commented Nov 2, 2023

> Does it run on Kempner?

It runs on MosaicML, which is a better comparison to LUMI. The problem with Kempner is that there aren't enough nodes, so FSDP still takes a lot of memory per GPU, and we have to resort to other tricks like activation checkpointing.
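The memory point can be made concrete with a back-of-envelope estimate (a rough sketch: the 16 bytes/parameter figure assumes fp16 params and grads plus fp32 Adam master weights and moments, and it deliberately ignores activations, which are what activation checkpointing targets):

```python
def sharded_state_gb(n_params, n_gpus, bytes_per_param=16):
    """Persistent model + optimizer state per GPU under FSDP full sharding.

    16 bytes/param ~= 2 (fp16 params) + 2 (fp16 grads)
                      + 4 + 4 + 4 (fp32 master weights, Adam moments).
    Rough figure; excludes activations, buffers, and fragmentation.
    """
    return n_params * bytes_per_param / n_gpus / 1e9

n_params = 7e9  # the ~7B "mitch-ish" model
# One node vs. a small cluster vs. 256 nodes x 8 devices (assumed topology):
for n_gpus in (8, 64, 2048):
    print(f"{n_gpus:5d} GPUs: {sharded_state_gb(n_params, n_gpus):6.2f} GB/GPU")
```

Since the sharded state scales as 1/world_size, a small cluster carries several GB of state per device before any activations are allocated, while a 2048-device run shards it down to almost nothing.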

@ibeltagy ibeltagy self-requested a review November 2, 2023 21:59
@ibeltagy (Contributor) left a comment:

The run on MosaicML uses scheduler.name: linear_with_warmup while this one uses cosine. Checking if this is on purpose or needs to be updated.

@epwalsh (Member Author) commented Nov 2, 2023

@ibeltagy good catch. That's actually been updated on the mitchish branch which this will merge into.

@epwalsh epwalsh merged commit c481165 into mitchish Nov 6, 2023
1 check passed
@epwalsh epwalsh deleted the epwalsh/mitch-ish-256 branch November 6, 2023 23:40
3 participants