---
theme: gaia
_class: lead
paginate: true
backgroundColor:
marp: true
---
Lecture 5: Dissecting Large Language Models II
A walkthrough based on the HF LLaMA implementation
- LLM architecture walkthrough (open-source LLaMA)
- Building custom datasets
- Custom loss functions and model training/fine-tuning
- Encoder-decoder architecture
  - Input components
    - Input embedding
    - Positional embedding
  - Transformer components
    - Attention module
    - Feed-forward network
LlamaForCausalLM(
(model): LlamaModel(
...
(layers): ModuleList(
(0-15): 16 x LlamaDecoderLayer(
(self_attn): LlamaAttention
(mlp): LlamaMLP
(input_layernorm): LlamaRMSNorm
(post_attention_layernorm): LlamaRMSNorm
)
...
)
(lm_head): Linear(in_features=2048, out_features=128256, bias=False)
)
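A minimal sketch of how such a printout can be reproduced; the checkpoint name "meta-llama/Llama-3.2-1B" is an assumption based on the shapes shown above (16 layers, hidden size 2048, vocab size 128256), not something stated in the slides.

```python
from transformers import AutoModelForCausalLM

# Assumed checkpoint; gated models require prior access approval on the HF Hub.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")
print(model)                  # top-level LlamaForCausalLM structure
print(model.model.layers[0])  # one LlamaDecoderLayer in full detail
```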
(self_attn): LlamaAttention(
(q_proj): Linear(in_features=2048, out_features=2048, bias=False)
(k_proj): Linear(in_features=2048, out_features=512, bias=False)
(v_proj): Linear(in_features=2048, out_features=512, bias=False)
(o_proj): Linear(in_features=2048, out_features=2048, bias=False)
(rotary_emb): LlamaRotaryEmbedding()
)
(mlp): LlamaMLP(
(gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
(up_proj): Linear(in_features=2048, out_features=8192, bias=False)
(down_proj): Linear(in_features=8192, out_features=2048, bias=False)
(act_fn): SiLU()
)
(input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
(post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
- Normalization module
  - LlamaRMSNorm
- Two main modules
  - LlamaAttention
  - LlamaMLP
(input_layernorm): LlamaRMSNorm
- Normalization (in Chinese also called 归一化 or 标准化, ...)
- Purpose: adjust the distribution of the data (scaling), i.e. rescale it
- The data includes: inputs, outputs, intermediate-layer representations, ...
The most classic normalization
Standard score: $z=\frac{x-\mu}{\sigma}$
"Effect": speeds up training convergence, makes the inputs more "regular", reduces overfitting, and improves generalization
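A minimal sketch of the standard score in torch, on made-up toy values:

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0])
z = (x - x.mean()) / x.std()              # standard score: (x - mu) / sigma
print(z.mean().item(), z.std().item())    # ~0.0 and ~1.0
```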
- Different goals:
  - Normalization = adjusting the data
    - For example, min-max scaling: $X'=\frac{X-X_\min}{X_\max-X_\min}$
  - Regularization = adjusting the prediction / loss function
    - For example, adding a penalty term: $\text{loss}=\min\sum_{i=1}^N L(f(x_i), y_i)+\lambda R(\theta_f)$
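A minimal sketch contrasting the two; MSE as the loss $L$ and an L2 penalty as $R(\theta)$ are assumed example choices, not prescribed by the slides:

```python
import torch
import torch.nn as nn

# Normalization: rescale the data itself (min-max scaling to [0, 1])
x = torch.tensor([2.0, 4.0, 6.0, 10.0])
x_scaled = (x - x.min()) / (x.max() - x.min())

# Regularization: leave the data alone, add a penalty R(theta) on the parameters
model = nn.Linear(8, 1)
inputs, targets = torch.randn(4, 8), torch.randn(4, 1)
lam = 1e-2                                             # lambda in the loss above
l2 = sum(p.pow(2).sum() for p in model.parameters())   # R(theta)
loss = nn.functional.mse_loss(model(inputs), targets) + lam * l2
```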
Normalization: "adjusting the data distribution"
- What does the data look like? It is all tensors, with many dimensions!
  - Raw input: vocab embeddings
    - tensor shape: <batch_size, sequence_length, hidden_dim>
  - Intermediate-layer representations of the model (hidden states/representations)
    - tensor shape: <batch_size, sequence_length, hidden_dim>
- Choose the most suitable dimension(s) to normalize over
  - batch: X = [batch_size, sequence_length, hidden_dim]
  - sequence: X = [sequence_length, hidden_dim]
  - hidden: <bs, seq, hidden> => <N, hidden>, X = [hidden_dim]
    - also known as LayerNorm
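A minimal sketch of the "hidden" option, i.e. LayerNorm over hidden_dim; the tensor sizes are toy values, not the real model's:

```python
import torch
import torch.nn as nn

batch_size, seq_len, hidden_dim = 2, 5, 8          # toy sizes
x = torch.randn(batch_size, seq_len, hidden_dim)

ln = nn.LayerNorm(hidden_dim)                      # normalizes over the last dimension
y = ln(x)                                          # shape unchanged: (2, 5, 8)
# each token vector y[b, t, :] now has ~zero mean and ~unit variance
print(y.shape, y[0, 0].mean().item())
```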
The currently popular LayerNorm variant: RMSNorm
- torch 2.4 provides an implementation of the RMSNorm class: torch.nn.RMSNorm
- Principle: "regularizes the summed inputs to a neuron in one layer according to root mean square (RMS)"
$y = \text{RMSNorm}(x)=\frac{x}{\sqrt{\text{RMS}(x)^2+\epsilon}}\cdot\gamma$, where $\text{RMS}(x)=\sqrt{\frac{1}{N}\sum_{i=1}^N x_i^2}$
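A usage sketch of the built-in class (requires torch >= 2.4); the shapes and eps value are toy assumptions:

```python
import torch
import torch.nn as nn

x = torch.randn(2, 5, 8)            # <batch_size, sequence_length, hidden_dim>, toy sizes
rmsnorm = nn.RMSNorm(8, eps=1e-5)   # normalize over the last (hidden) dimension
y = rmsnorm(x)                      # same shape as x
```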
How can we implement RMSNorm ourselves with torch operators?
- Input and output shape: <batch_size, sequence_length, hidden_dim>
- Computations involved
  - squaring, averaging, taking a square root
  - torch provides: tensor.pow, tensor.mean, torch.rsqrt
Let's give it a try:
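A from-scratch sketch using exactly those operators, mirroring the structure of HF's LlamaRMSNorm; the class name and toy sizes are assumptions:

```python
import torch
import torch.nn as nn

class MyRMSNorm(nn.Module):
    def __init__(self, hidden_dim, eps=1e-5):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_dim))  # gamma
        self.eps = eps

    def forward(self, x):
        # x: <batch_size, sequence_length, hidden_dim>
        variance = x.pow(2).mean(-1, keepdim=True)  # mean of squares over hidden_dim
        x = x * torch.rsqrt(variance + self.eps)    # divide by sqrt(mean(x^2) + eps)
        return self.weight * x

x = torch.randn(2, 5, 8)
print(MyRMSNorm(8)(x).shape)                        # torch.Size([2, 5, 8])
```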
Goal: implement RoPE (the corresponding $R_{\Theta, m}^d$)
- Prepare the set of $\theta_i$, together with their cos and sin values
Building the RoPE matrix $R_{\Theta, m}^d$:
- Procedure: compute all the cos values in one batch, then all the sin values in one batch
- torch operators involved
  - torch.arange, torch.sin, torch.cos, torch.outer
Let's give it a try:
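A sketch of precomputing the $\theta_i$ frequencies and the per-position cos/sin tables with those operators; head_dim, max_seq_len, and base=10000 are assumed toy values following the standard RoPE formulation:

```python
import torch

def build_rope_tables(head_dim=8, max_seq_len=16, base=10000.0):
    # theta_i = base^(-2i/d) for i = 0 .. d/2 - 1
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    # angles m * theta_i for every position m and every frequency, via an outer product
    positions = torch.arange(max_seq_len).float()
    angles = torch.outer(positions, inv_freq)      # <max_seq_len, head_dim/2>
    return torch.cos(angles), torch.sin(angles)    # all cos in one batch, then all sin

cos, sin = build_rope_tables()
print(cos.shape, sin.shape)                        # torch.Size([16, 4]) twice
```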
(mlp): LlamaMLP(
(gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
(up_proj): Linear(in_features=2048, out_features=8192, bias=False)
(down_proj): Linear(in_features=8192, out_features=2048, bias=False)
(act_fn): SiLU()
)
- Components: three nn.Linear layers and one SiLU activation
  - SiLU: torch.nn.functional.silu(x)
  - Linear: torch.nn.Linear(in_features, out_features)
- Input tensor: <batch_size, sequence_length, hidden_dim>
- Step 1:
  - pass the input through gate_proj and apply SiLU to obtain the gate tensor
  - pass the input through up_proj to obtain the up tensor
- Step 2: elementwise multiply the gate tensor and the up tensor
- Step 3: pass the product through down_proj to obtain the output (down) tensor
Excerpted from transformers (src/transformers/models/llama/modeling_llama.py):
down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
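A self-contained sketch of the same gate/up/down + SiLU pattern; the class name and toy sizes (instead of the real 2048/8192) are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUMLP(nn.Module):
    def __init__(self, hidden_dim=8, intermediate_dim=32):  # toy sizes
        super().__init__()
        self.gate_proj = nn.Linear(hidden_dim, intermediate_dim, bias=False)
        self.up_proj = nn.Linear(hidden_dim, intermediate_dim, bias=False)
        self.down_proj = nn.Linear(intermediate_dim, hidden_dim, bias=False)

    def forward(self, x):
        gate = F.silu(self.gate_proj(x))      # step 1a: gate tensor (SiLU-activated)
        up = self.up_proj(x)                  # step 1b: up tensor
        return self.down_proj(gate * up)      # steps 2-3: elementwise product, then down_proj

x = torch.randn(2, 5, 8)                      # <batch_size, sequence_length, hidden_dim>
print(SwiGLUMLP()(x).shape)                   # torch.Size([2, 5, 8])
```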