
Difference between 0724 and 0424 7B models #746

Open
jiahai-feng opened this issue Nov 13, 2024 · 1 comment
Labels: type/documentation (An issue or pull request related to documentation)

Comments

@jiahai-feng

📚 The doc issue

What is the difference between the 0724 and 0424 models? I can't find documentation anywhere. The official config files appear to be identical. Looking at the intermediate checkpoints, 0724 looks like a continuation of 0424, resuming from the pre-annealing checkpoint. If so, what is the LR schedule for the continuation, and what is the additional dataset?

Suggest a potential alternative/fix

No response

@jiahai-feng added the type/documentation label on Nov 13, 2024
@aman-17
Member

aman-17 commented Dec 3, 2024

We trained OLMo 7B 0424 with a two-stage curriculum:

  • In the first stage, we trained the model from scratch on the Dolma 1.7 dataset, using a cosine learning rate schedule with a warmup of 2500 steps, a peak learning rate of 3e-4, and a cosine decay to 3e-5 at 3T tokens. We cut this stage off after 2.7T tokens.
  • In the second stage, we trained on a higher-quality subset of Dolma 1.7 for another 50B tokens while linearly decaying the learning rate to 0 (see the schedule sketch after this list).
    Our high-quality Dolma 1.7 subset (1) uses all available Wikipedia, OpenWebMath, and Flan data, (2) removes Dolma CC, CC News, and Megawika, and (3) rebalances the remaining sources to roughly equal proportions. See exact token counts and relative proportions of this second-stage mix below. Both stages contribute equally to the final performance of the OLMo model.
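For a concrete picture of the schedule, here is a minimal sketch of the two-stage learning-rate curve in Python. The numbers (2500-step warmup, peak 3e-4, cosine floor 3e-5 at a 3T-token horizon, 2.7T-token cutoff, 50B-token linear anneal) come from the description above; the token-based parameterization and the warmup-token conversion (assuming a roughly 4M-token global batch) are my assumptions, not the exact OLMo training code.

```python
import math

# Numbers from the description above; the warmup-token conversion assumes
# an ~4M-token global batch, which is an assumption, not the OLMo config.
PEAK_LR = 3e-4
FINAL_COSINE_LR = 3e-5
WARMUP_TOKENS = 2_500 * 4e6   # 2500 warmup steps x assumed ~4M tokens/step
COSINE_HORIZON = 3e12         # cosine nominally decays to the floor at 3T tokens
STAGE1_CUTOFF = 2.7e12        # stage 1 is cut off early, at 2.7T tokens
STAGE2_TOKENS = 50e9          # stage 2: 50B tokens, linear decay to 0


def cosine_lr(tokens: float) -> float:
    """Stage-1 schedule: linear warmup, then cosine decay from PEAK_LR toward FINAL_COSINE_LR."""
    if tokens < WARMUP_TOKENS:
        return PEAK_LR * tokens / WARMUP_TOKENS
    progress = min(tokens / COSINE_HORIZON, 1.0)
    return FINAL_COSINE_LR + 0.5 * (PEAK_LR - FINAL_COSINE_LR) * (1 + math.cos(math.pi * progress))


def lr_at(tokens: float) -> float:
    """Approximate learning rate after `tokens` training tokens, across both stages."""
    if tokens <= STAGE1_CUTOFF:
        return cosine_lr(tokens)
    # Stage 2: linear decay from the LR reached at the 2.7T cutoff down to 0 over 50B tokens.
    start_lr = cosine_lr(STAGE1_CUTOFF)
    frac = min((tokens - STAGE1_CUTOFF) / STAGE2_TOKENS, 1.0)
    return start_lr * (1.0 - frac)
```

Under these assumptions, `lr_at(2.7e12)` is roughly 3.7e-5 (the cosine value at the 2.7T cutoff), and the rate reaches 0 by `lr_at(2.75e12)`, i.e. after the 50B-token stage-2 anneal.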
