Phi-4 uses the Tiktoken tokenizer (~100k vocabulary).

https://arxiv.org/pdf/2412.08905v1

> we now use the tiktoken tokenizer (for better multilingual support) with a padded vocabulary size of 100,352 (including unused tokens)
Consider adding it as an option to the encoding map so it's easier to create.
machinelearning/src/Microsoft.ML.Tokenizers/Model/TiktokenTokenizer.cs
Lines 1025 to 1035 in 01c4164
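A minimal sketch of what the request enables. `TiktokenTokenizer.CreateForEncoding` and `TiktokenTokenizer.CreateForModel` are existing APIs in Microsoft.ML.Tokenizers; the `"phi-4"` model name is the proposed addition, not something the library currently recognizes.

```csharp
using Microsoft.ML.Tokenizers;

// Today, a Phi-4-compatible tokenizer can be built by naming the underlying
// encoding directly (Phi-4 uses a cl100k_base-style Tiktoken encoding):
Tokenizer tokenizer = TiktokenTokenizer.CreateForEncoding("cl100k_base");

// With the proposed encoding-map entry, callers could instead write the
// model name (hypothetical until the mapping is added):
// Tokenizer tokenizer = TiktokenTokenizer.CreateForModel("phi-4");
```

The second form is the convenience being asked for: users would not need to know which Tiktoken encoding Phi-4 maps to.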
tarekgh