RoPE
#409
I have noticed that RoPE is very similar to the standard sinusoidal positional encoding:
For even indices: $PE_{(pos,\, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right)$

For odd indices: $PE_{(pos,\, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)$

Where: $pos$ is the token position, $i \in \{0, \ldots, d/2 - 1\}$ indexes the dimension pair, and $d$ is the embedding dimension.
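For reference, here is a minimal sketch of this encoding (assuming PyTorch, an even $d$, and illustrative names such as `sinusoidal_pe`):

```python
import torch

def sinusoidal_pe(seq_len, d_model, base=10000.0):
    # PE(pos, 2i)   = sin(pos / base^(2i / d_model))
    # PE(pos, 2i+1) = cos(pos / base^(2i / d_model))
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # (seq_len, 1)
    two_i = torch.arange(0, d_model, 2, dtype=torch.float32)       # 0, 2, 4, ... = 2i
    angles = pos / base ** (two_i / d_model)                       # (seq_len, d_model // 2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)  # even indices get sin
    pe[:, 1::2] = torch.cos(angles)  # odd indices get cos
    return pe
```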
In the original paper, the inverse frequencies use a factor of $-2$ in the exponent for scaling, $\theta_i = 10000^{-2(i-1)/d}$ for $i \in \{1, \ldots, d/2\}$, which seems to come from the factor $2i$ in the PE exponent above; see 3.3 Properties of RoPE (p. 5) in the paper.
Do you know why the Llama 2 and Llama 3 models (3, 3.1, 3.2) do not use such a scaling factor? Does this have any benefits?
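To make the comparison concrete, here is a small sketch of the two inverse-frequency definitions; the Llama-style form follows common open-source implementations and is an assumption here, not a quote from any particular codebase:

```python
import torch

def inv_freq_paper(head_dim, base=10000.0):
    # RoFormer, Sec. 3.3: theta_i = base^(-2(i-1)/d) for i = 1, ..., d/2
    i = torch.arange(1, head_dim // 2 + 1, dtype=torch.float32)
    return base ** (-2.0 * (i - 1) / head_dim)

def inv_freq_llama_style(head_dim, base=10000.0):
    # Llama-style form (assumed, based on common open-source code):
    # 1 / base^(arange(0, d, 2) / d), with no explicit -2 in the exponent
    return 1.0 / base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim)
```

Printing both for a small `head_dim` makes the two scalings easy to compare side by side.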
Replies: 1 comment 1 reply

These are good observations/questions. I will try to get back to you on that some day.