Llama Config #317

Merged: dirkgr merged 115 commits into main from the Llama branch on Nov 2, 2023
Conversation

dirkgr (Member) commented Oct 5, 2023

This is a config that tracks Llama as closely as possible.

Differences that we know of:

  • Uses MQA instead of GQA (see the sketch after this list)
  • Different data
  • Different tokenizer
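
As an illustration of that first difference, here is a minimal sketch (not the repo's attention code; all names are made up) of how MQA, GQA, and standard MHA differ only in the number of key/value heads backing the query heads:

```python
# Illustrative only: compare KV-projection sizes for MQA / GQA / MHA.
import torch

d_model, n_heads, head_dim = 512, 8, 64


def kv_projection(n_kv_heads: int) -> torch.nn.Linear:
    # Each KV head produces head_dim-sized keys and values.
    return torch.nn.Linear(d_model, 2 * n_kv_heads * head_dim)


mqa_kv = kv_projection(n_kv_heads=1)        # MQA: all 8 query heads share one KV head
gqa_kv = kv_projection(n_kv_heads=4)        # GQA: pairs of query heads share a KV head
mha_kv = kv_projection(n_kv_heads=n_heads)  # MHA: one KV head per query head

for name, proj in [("MQA", mqa_kv), ("GQA", gqa_kv), ("MHA", mha_kv)]:
    print(name, "KV projection params:", sum(p.numel() for p in proj.parameters()))
```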

TODO:

dirkgr requested a review from 2015aroras on October 31, 2023 at 19:13
dirkgr (Member Author) commented Oct 31, 2023

I want to merge this as it is, and add @2015aroras's Llama block later. @AkshitaB needs some of the changes in here for her config.

dirkgr (Member Author) commented Oct 31, 2023

My top concern is whether any of the code changes in here affect the stability of existing runs, like mitchish.

dirkgr requested a review from AkshitaB on October 31, 2023 at 19:16
@@ -423,8 +424,8 @@ class OptimizerConfig(BaseConfig):
     learning_rate: float = 1.0e-4
     weight_decay: float = 0.01
     betas: Tuple[float, float] = (0.9, 0.95)
-    no_decay_norm_and_bias: bool = True
-    """Do not apply weight decay to norms and biases."""
+    decay_norm_and_bias: bool = False
Member: This is a breaking change. I'd suggest keeping the old option and marking it as deprecated, or we need a preprocessor that renames the old option to the new one when present.
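
A minimal sketch of the preprocessor idea (the function name and dict-based config are assumptions, not the repo's actual API): map the deprecated name onto its replacement, invert the value, and warn so old configs keep working.

```python
import warnings
from typing import Any, Dict


def upgrade_optimizer_config(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Rename the deprecated `no_decay_norm_and_bias` flag to `decay_norm_and_bias`."""
    if "no_decay_norm_and_bias" in raw:
        warnings.warn(
            "`no_decay_norm_and_bias` is deprecated; use `decay_norm_and_bias` instead.",
            DeprecationWarning,
        )
        # The new flag is the negation of the old one.
        raw.setdefault("decay_norm_and_bias", not raw.pop("no_decay_norm_and_bias"))
    return raw


print(upgrade_optimizer_config({"no_decay_norm_and_bias": True}))
# -> {'decay_norm_and_bias': False}
```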

Contributor: It renames the option, and also inverts its value (the new flag is the negation of the old one).

Member Author: I did that in 7244c0b. Please read it carefully! There are a lot of "not"s and "no"s around, and this is exactly the kind of thing where you can run for 500B tokens before you notice you screwed up.

@@ -15,6 +15,7 @@ def init_weights(
     d: Optional[int] = None,
     layer_id: Optional[int] = None,
     std_factor: float = 1.0,
+    type_of_module: str = "",
Member: Nit: make a StrEnum for this.
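
A sketch of what the nit suggests (the variant names are illustrative, not the repo's actual values):

```python
from enum import Enum


class StrEnum(str, Enum):
    """Minimal str-valued enum, standing in for the repo's own StrEnum helper."""

    def __str__(self) -> str:
        return self.value


class ModuleType(StrEnum):
    # Hypothetical variants for the kinds of modules init_weights distinguishes.
    in_module = "in"
    out_module = "out"
    emb = "emb"
    final_out = "final_out"


# Callers then pass ModuleType.emb instead of a bare string like "emb",
# so typos fail loudly instead of silently falling through.
print(ModuleType("emb") is ModuleType.emb)  # True
```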

olmo/config.py (outdated)
@@ -594,6 +595,15 @@ class ShardedCheckpointerType(StrEnum):
     local = "local"


+class ActivationCheckpointingStrategy(StrEnum):
+    none = "none"
Member: Nit: instead of having a "none" variant, we could make ActivationCheckpointingStrategy optional. I think that's more Pythonic.
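
For comparison, a sketch of the two alternatives being discussed (field and variant names are illustrative):

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ActivationCheckpointingStrategy(str, Enum):
    whole_layer = "whole_layer"
    fine_grained = "fine_grained"


@dataclass
class TrainConfig:
    # Option A (what the diff above does): add an explicit `none` variant to the enum.
    # Option B (this suggestion): make the field Optional and use None for "disabled".
    activation_checkpointing: Optional[ActivationCheckpointingStrategy] = None


cfg = TrainConfig()
if cfg.activation_checkpointing is None:
    print("activation checkpointing disabled")
else:
    print(f"checkpointing with strategy: {cfg.activation_checkpointing.value}")
```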

2015aroras (Collaborator) left a review comment

LGTM other than Pete's comment on the breaking change.

},
}
)
if len(no_decay_sorted) > 0:
Collaborator: This could result in fewer than 2 param groups overall. Just checking that there are no foreseeable problems with that.
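
For context, a minimal sketch (names assumed, not the repo's code) of how the no-decay group can end up empty and why the guard matters before handing groups to the optimizer:

```python
import torch

decay_norm_and_bias = True  # new-style flag: apply weight decay to norms and biases too

model = torch.nn.Linear(4, 4, bias=True)
decay, no_decay = [], []
for name, p in model.named_parameters():
    if name.endswith("bias") and not decay_norm_and_bias:
        no_decay.append(p)
    else:
        decay.append(p)

# With decay_norm_and_bias=True everything lands in `decay`, so the
# no-decay group can legitimately be empty; only add it when non-empty.
param_groups = [{"params": decay, "weight_decay": 0.01}]
if len(no_decay) > 0:
    param_groups.append({"params": no_decay, "weight_decay": 0.0})

optimizer = torch.optim.AdamW(param_groups, lr=1e-4)
print(len(optimizer.param_groups))  # -> 1 here, since everything is decayed
```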

Member Author: I actually removed some checks that relied on there always being two param groups. I think that change should already be in here?

Collaborator: I don't recall the checks, but if you have accounted for them, then I'm not worried.

-        if pn.endswith("bias"):
-            # all biases will not be decayed
+        if pn.endswith("bias"):
+            if cfg.optimizer.decay_norm_and_bias:
Collaborator: You could potentially break decay_norm_and_bias up further into decay_norm and decay_bias, or similar.

Member Author: If we wanted to actually experiment with that, I would.
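
Purely for illustration, the finer-grained split suggested above might look like this as config fields (hypothetical, not something the PR ships):

```python
from dataclasses import dataclass


@dataclass
class OptimizerConfig:
    # Hypothetical split of decay_norm_and_bias into separate knobs.
    decay_norm: bool = False
    decay_bias: bool = False
```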

AkshitaB (Contributor) commented

> My top concern is whether any of the code changes in here affect the stability of existing runs, like mitchish.

For my run, the goal, I think, is to stay as close as possible to the "original run" plus weight un-tying. Will committing all the changes here cause unexpected discrepancies? For the decay-related arguments specifically, is it ok to simply keep the current behavior on main and use the older no_decay_norm_and_bias flag? The caveat is that it won't decay embeddings, if I'm reading it right.

dirkgr (Member Author) commented Nov 1, 2023

> The caveat is that it won't decay embeddings, if I'm reading it right.

The new run should decay everything, including embeddings. The new run should use the new flags.

> Will committing all the changes here cause unexpected discrepancies?

If it does, then that's a problem in this PR. I hope it does not, but there is a lot of room for error here, which is why I'm glad we're all looking at it instead of just me.
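
A hedged sketch of the flag combination being described for the new run (decay_norm_and_bias comes from the diff above; decay_embeddings is an assumed name for illustration):

```python
optimizer_overrides = dict(
    learning_rate=1.0e-4,
    weight_decay=0.01,
    betas=(0.9, 0.95),
    decay_norm_and_bias=True,  # new flag: apply weight decay to norms and biases
    decay_embeddings=True,     # assumed flag: decay the embeddings as well
)
```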

dirkgr (Member Author) commented Nov 1, 2023

Argh, I think those function pointer shenanigans make torch.compile() not work :-(
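
For concreteness, a minimal sketch (not the PR's code) of the kind of function-pointer indirection being referred to: a block stores a checkpointing callable and routes its forward through it, which is the sort of dynamic dispatch torch.compile can have trouble tracing.

```python
import torch
from torch.utils.checkpoint import checkpoint


class Block(torch.nn.Module):
    def __init__(self, d: int = 16):
        super().__init__()
        self.ff = torch.nn.Linear(d, d)
        self._activation_checkpoint_fn = None  # set when checkpointing is enabled

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self._activation_checkpoint_fn is not None:
            # Recompute activations during backward instead of storing them.
            return self._activation_checkpoint_fn(self._inner_forward, x)
        return self._inner_forward(x)

    def _inner_forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.ff(x))


block = Block()
block._activation_checkpoint_fn = lambda fn, *args: checkpoint(fn, *args, use_reentrant=False)
y = block(torch.randn(2, 16, requires_grad=True))
y.sum().backward()
```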

dirkgr (Member Author) commented Nov 1, 2023

Never mind, compile() and activation checkpointing were never going to work at the same time.

AkshitaB (Contributor) commented Nov 1, 2023

> The new run should decay everything, including embeddings. The new run should use the new flags.

Alright, I'll wait for this code to be merged.

2015aroras (Collaborator) left a review comment

LGTM

dirkgr merged commit da91f34 into main on Nov 2, 2023 (10 checks passed)
dirkgr deleted the Llama branch on November 2, 2023 at 00:34