RoPE
#409
I have noticed that RoPE is very similar to the standard sinusoidal positional encoding:
For even indices: $PE_{(pos,\, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right)$

For odd indices: $PE_{(pos,\, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)$

Where: $pos$ is the token position, $i \in \{0, \ldots, d/2 - 1\}$ indexes the dimension pair, and $d$ is the embedding dimension.
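For reference, here is a minimal sketch of this encoding (assuming PyTorch, an even $d$, and illustrative names such as `sinusoidal_pe`):

```python
import torch

def sinusoidal_pe(seq_len, d_model, base=10000.0):
    # PE(pos, 2i)   = sin(pos / base^(2i / d_model))
    # PE(pos, 2i+1) = cos(pos / base^(2i / d_model))
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # (seq_len, 1)
    two_i = torch.arange(0, d_model, 2, dtype=torch.float32)       # 0, 2, 4, ... = 2i
    angles = pos / base ** (two_i / d_model)                       # (seq_len, d_model // 2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)  # even indices get sin
    pe[:, 1::2] = torch.cos(angles)  # odd indices get cos
    return pe
```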
In the original paper, the inverse frequencies use a factor of $-2$ in the exponent for scaling, $\theta_i = 10000^{-2(i-1)/d}$ for $i \in \{1, \ldots, d/2\}$, which seems to come from the factor $2i$ in the PE exponent above; see 3.3 Properties of RoPE (p. 5) in the paper.
Do you know why the Llama 2 and Llama 3 models (3, 3.1, 3.2) do not use such a scaling factor? Does this have any benefits?
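To make the comparison concrete, here is a small sketch of the two inverse-frequency definitions; the Llama-style form follows common open-source implementations and is an assumption here, not a quote from any particular codebase:

```python
import torch

def inv_freq_paper(head_dim, base=10000.0):
    # RoFormer, Sec. 3.3: theta_i = base^(-2(i-1)/d) for i = 1, ..., d/2
    i = torch.arange(1, head_dim // 2 + 1, dtype=torch.float32)
    return base ** (-2.0 * (i - 1) / head_dim)

def inv_freq_llama_style(head_dim, base=10000.0):
    # Llama-style form (assumed, based on common open-source code):
    # 1 / base^(arange(0, d, 2) / d), with no explicit -2 in the exponent
    return 1.0 / base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim)
```

Printing both for a small `head_dim` makes the two scalings easy to compare side by side.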
Replies: 1 comment 1 reply

These are good observations/questions. I will try to get back to you on that some day.