[Feature request] Support tokenizers with binary models #197
Comments
I'd recommend exporting from python with …
Downloading and re-uploading is probably the easiest user solution for now. The ideal scenario is being able to use one of these base models' tokenizers as-is, but I suspect it'll take some work for that to happen. Given the complexity, this might be out of scope, but I wanted to make the feature request so that there's at least a discussion and others can find it if they run into the same issue. I'm not sure how the LLamaTokenizer works, but I'm going to assume it's a SentencePiece model binary.
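If you do go the download-and-re-upload route, the re-uploaded repo can then be loaded like any other tokenizer repo. A minimal sketch (the repo name below is just a placeholder, assuming the binary tokenizer.model was converted to tokenizer.json in python and pushed to the Hub):

import { AutoTokenizer } from 'https://cdn.jsdelivr.net/npm/@xenova/transformers';

// 'your-username/open-llama-tokenizer' is a placeholder for wherever the
// converted tokenizer.json was re-uploaded.
let tokenizer = await AutoTokenizer.from_pretrained('your-username/open-llama-tokenizer');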
If the open llama models use the same tokenizer as the original llama models, you can reuse the tokenizer as follows:

import { AutoTokenizer } from 'https://cdn.jsdelivr.net/npm/@xenova/transformers';
let tokenizer = await AutoTokenizer.from_pretrained('hf-internal-testing/llama-tokenizer');
let model_inputs = await tokenizer("Hello there");
console.log(model_inputs);
// {
// attention_mask: Tensor {
// data: BigInt64Array(3) [1n, 1n, 1n],
// dims: [1, 3],
// size: 3,
// type: "int64",
// },
// input_ids: Tensor {
// data: BigInt64Array(3) [1n, 15043n, 727n],
// dims: [1, 3],
// size: 3,
// type: "int64",
// },
// }
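As a quick sanity check (not from the thread, just a small usage sketch), the produced ids can be decoded back to text with the same tokenizer:

// Convert the BigInt64Array of ids to plain numbers, then decode.
let ids = [...model_inputs.input_ids.data].map(Number);
let text = tokenizer.decode(ids, { skip_special_tokens: true });
console.log(text); // should round-trip back to roughly "Hello there"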
Since I have implemented the LLamaTokenizer already - and the open llama models you linked above do seem to use the same tokenizer structure (…)
Binary Tokenizers
In general, the feature you want added should be supported by HuggingFace's transformers library:
Some HuggingFace tokenizer models use a binary tokenizer file instead of a JSON configuration - namely, the new Open LLama models.
The current load-tokenizer function only checks for JSON configurations.
In order to support binary models, this probably needs to be generalized.
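A rough sketch of what a more general loader could look like - purely illustrative, and the helper parseSentencePieceModel is hypothetical, not something that exists in the library:

// Hypothetical sketch: prefer the JSON tokenizer, and fall back to the
// binary SentencePiece model if the repo only ships tokenizer.model.
async function loadTokenizerConfig(repoUrl) {
    // 1. Preferred path: tokenizer.json.
    let response = await fetch(`${repoUrl}/resolve/main/tokenizer.json`);
    if (response.ok) {
        return { format: 'json', config: await response.json() };
    }

    // 2. Fallback: tokenizer.model, a SentencePiece protobuf binary.
    response = await fetch(`${repoUrl}/resolve/main/tokenizer.model`);
    if (response.ok) {
        const bytes = new Uint8Array(await response.arrayBuffer());
        // parseSentencePieceModel (hypothetical) would need to decode the
        // protobuf and map its vocab/merges onto the JSON structure the
        // existing tokenizer classes already understand.
        return { format: 'sentencepiece', config: parseSentencePieceModel(bytes) };
    }

    throw new Error(`No tokenizer.json or tokenizer.model found at ${repoUrl}`);
}

Converting the binary into the existing JSON structure up front would keep the current tokenizer classes untouched, at the cost of shipping a protobuf parser.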
Reason for request
Why is it important that we add this feature? What is your intended use case? Remember, we are more likely to add support for models/pipelines/tasks that are popular (e.g., many downloads), or contain functionality that does not exist (e.g., new input type).
This is required for doing JavaScript tokenization with OpenLLama models.
Additional context
Add any other context or screenshots about the feature request here.