Validation step3 #21
base: main
Conversation
…ectly (like dtype != float, requires grad), make flash attention dep optional (still needs to be checked)
lgtm
Looks good to me. I was able to do a minimal test after the dp>1 fix. Have you tried saving and loading checkpoints to make sure the validation dataloader samples are reproducible?
Also, I'd vote to add a way to hold out a portion of the train set and use it as validation, instead of necessarily needing to add an additional dataset. Something like the --split argument in Megatron: https://github.com/NVIDIA/Megatron-LM/blob/81fee9b0047fb3ac6001b5e71e4df89fc01b2a1c/megatron/training/arguments.py#L1760-L1764. How hard would it be to implement this?
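On the checkpoint question: one way to sanity-check it, independent of Nanotron's actual checkpoint format (everything below is a generic PyTorch sketch, not the project's API), is to checkpoint the dataloader's generator state and verify that a resumed loader replays the same batches:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for the validation set.
dataset = TensorDataset(torch.arange(100).float().unsqueeze(1))

def make_loader(rng_state=None):
    # A seeded torch.Generator makes the shuffle order deterministic;
    # its state is the only thing that needs to go into the checkpoint.
    g = torch.Generator()
    g.manual_seed(1234)
    if rng_state is not None:
        g.set_state(rng_state)
    return DataLoader(dataset, batch_size=8, shuffle=True, generator=g), g

# First run: consume one epoch, "checkpoint" the generator, then record the next epoch.
loader, g = make_loader()
_ = list(loader)                                  # epoch 1
torch.save({"rng": g.get_state()}, "val_loader_ckpt.pt")
first_run = [batch[0] for batch in loader][:4]    # start of epoch 2

# Resumed run: rebuild the loader from the checkpointed state and replay epoch 2.
ckpt = torch.load("val_loader_ckpt.pt")
loader, _ = make_loader(ckpt["rng"])
second_run = [batch[0] for batch in loader][:4]

for a, b in zip(first_run, second_run):
    torch.testing.assert_close(a, b)
print("validation batches are reproducible across save/load")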
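On the --split idea: at its core it is just a deterministic index split by ratio over a single dataset. A rough sketch of the mechanism (the split_dataset helper and split_ratio argument below are hypothetical, not existing Nanotron or Megatron code):

import torch
from torch.utils.data import Subset, TensorDataset

def split_dataset(dataset, split_ratio="949,50,1", seed=1234):
    """Split one dataset into train/valid/test subsets by a Megatron-style
    comma-separated ratio string (e.g. "949,50,1")."""
    weights = [float(w) for w in split_ratio.split(",")]
    total = sum(weights)
    n = len(dataset)
    sizes = [int(n * w / total) for w in weights]
    sizes[0] += n - sum(sizes)  # give rounding leftovers to the train split

    # Deterministic shuffle so the held-out split is stable across runs and resumes.
    perm = torch.randperm(n, generator=torch.Generator().manual_seed(seed)).tolist()
    splits, start = [], 0
    for size in sizes:
        splits.append(Subset(dataset, perm[start:start + size]))
        start += size
    return splits  # [train, valid, test]

if __name__ == "__main__":
    data = TensorDataset(torch.arange(1000).float())
    train_set, valid_set, test_set = split_dataset(data, split_ratio="949,50,1")
    print(len(train_set), len(valid_set), len(test_set))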
import torch
import torchtune
import flash_attn
import flash_attn.layers.rotary


class RotaryEmbeddingKyleLikeFA(torch.nn.Module):
    """
    Has the same function signature as FA, for interleaved=True and separate q, kv.
    seqlen_offset = 0
    Does not operate inplace, but that's fine for how it's used in Nanotron.
    """
    def __init__(self, dim: int, base: float):
        super().__init__()
        self.dim = dim
        self.base = float(base)

        self.max_seq_len = None
        self.rpe = None

    def forward(self, q, kv):
        bs, q_len, n_heads, head_dim = q.shape
        assert self.dim == head_dim
        assert (bs, q_len, 2, n_heads, self.dim) == kv.shape

        # (Re)build the torchtune rotary embedding if the sequence length changed.
        if (self.rpe is None) or (self.max_seq_len != q_len):
            self.max_seq_len = q_len
            self.rpe = torchtune.modules.RotaryPositionalEmbeddings(
                dim=self.dim, max_seq_len=self.max_seq_len, base=self.base
            ).to(q.device)
        q_out = self.rpe(q)
        # Only the keys (index 0) are rotated; the values (index 1) pass through unchanged.
        kv_out = torch.stack((self.rpe(kv[:, :, 0]), kv[:, :, 1]), 2)
        return q_out, kv_out


if __name__ == "__main__":
    device = torch.device(0)
    theta = 10000

    batch_size = 3
    dim_qk = 4
    q_len = 256
    kv_len = 256
    n_heads = 4

    max_seq_len = max(q_len, kv_len)
    print(max_seq_len)

    query_states = torch.rand(batch_size, q_len, n_heads, dim_qk, device=device)
    key_value_states = torch.rand(batch_size, kv_len, 2, n_heads, dim_qk, device=device).contiguous()

    interleaved = True
    # interleaved = False
    re1 = flash_attn.layers.rotary.RotaryEmbedding(dim=dim_qk, interleaved=interleaved, base=theta).to(device)
    re2 = torchtune.modules.RotaryPositionalEmbeddings(dim=dim_qk, max_seq_len=max_seq_len, base=theta).to(device)
    re3 = RotaryEmbeddingKyleLikeFA(dim=dim_qk, base=theta).to(device)

    print(key_value_states[:, :, 0].shape)

    out2 = re2(query_states)
    out3 = re2(key_value_states[:, :, 0])
    # out4 = re2(key_value_states[:, :, 1])

    out_eq = re3(query_states, kv=key_value_states)

    # torch.testing.assert_close(out2, query_states)
    out1 = re1(query_states, kv=key_value_states)

    torch.testing.assert_close(out_eq[0], out1[0])
    torch.testing.assert_close(out_eq[1], out1[1])

    # Do this second, since flash-attn's rotary rotates query_states in place.
    torch.testing.assert_close(out1[0], query_states)

    test = torch.stack((out3, key_value_states[:, :, 1]), 2)
    torch.testing.assert_close(out1[1], test)
    # torch.testing.assert_close(out1[1][:, :, 0], out3)

    torch.testing.assert_close(out1[0], out2)

    print("done")
I'd say this one should either go somewhere into the tests or be removed.
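If it does move into the tests, the script above could shrink to a single pytest case along these lines (a sketch; the test name, the importorskip guards, and the CUDA skip condition are assumptions rather than existing project conventions):

import pytest
import torch

# Skip cleanly if the optional dependencies are not installed.
torchtune_modules = pytest.importorskip("torchtune.modules")
flash_attn_rotary = pytest.importorskip("flash_attn.layers.rotary")


@pytest.mark.skipif(not torch.cuda.is_available(), reason="flash-attn rotary kernels require CUDA")
def test_torchtune_rope_matches_flash_attn_rope():
    device = torch.device(0)
    theta = 10000
    batch_size, seq_len, n_heads, dim_qk = 3, 256, 4, 4

    q = torch.rand(batch_size, seq_len, n_heads, dim_qk, device=device)
    kv = torch.rand(batch_size, seq_len, 2, n_heads, dim_qk, device=device).contiguous()

    # Reference rotation with torchtune (clone first: flash-attn rotates q/kv in place below).
    rope_tt = torchtune_modules.RotaryPositionalEmbeddings(dim=dim_qk, max_seq_len=seq_len, base=theta).to(device)
    expected_q = rope_tt(q.clone())
    expected_k = rope_tt(kv[:, :, 0].clone())

    rope_fa = flash_attn_rotary.RotaryEmbedding(dim=dim_qk, interleaved=True, base=theta).to(device)
    out_q, out_kv = rope_fa(q, kv=kv)

    torch.testing.assert_close(out_q, expected_q)
    torch.testing.assert_close(out_kv[:, :, 0], expected_k)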
Co-authored-by: AleHC <[email protected]>