Add ignore_merges option to BPE tokenizers #716

Merged 1 commit on Apr 18, 2024
10 changes: 9 additions & 1 deletion src/tokenizers.js
@@ -630,10 +630,12 @@ class BPE extends TokenizerModel {
      * Create a BPE instance.
      * @param {Object} config The configuration object for BPE.
      * @param {Object} config.vocab A mapping of tokens to ids.
+     * @param {string[]} config.merges An array of BPE merges as strings.
      * @param {string} config.unk_token The unknown token used for out of vocabulary words.
      * @param {string} config.end_of_word_suffix The suffix to place at the end of each word.
      * @param {string} [config.continuing_subword_suffix] The suffix to insert between words.
-     * @param {Array} config.merges An array of BPE merges as strings.
+     * @param {boolean} [config.byte_fallback=false] Whether to use spm byte-fallback trick (defaults to False)
+     * @param {boolean} [config.ignore_merges=false] Whether or not to match tokens with the vocab before using merges.
      */
     constructor(config) {
         super(config);
@@ -665,6 +667,8 @@ class BPE extends TokenizerModel {
             this.text_encoder = new TextEncoder();
         }

+        this.ignore_merges = this.config.ignore_merges ?? false;
+
         /** @type {Map<string, string[]>} */
         this.cache = new Map();
     }
@@ -826,6 +830,10 @@ class BPE extends TokenizerModel {
         const outputTokens = [];

         for (const token of tokens) {
+            if (this.ignore_merges && this.tokens_to_ids.has(token)) {
+                outputTokens.push(token);
+                continue;
+            }
             const bpe_token_list = this.bpe(token);

             for (const t of bpe_token_list) {
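To illustrate what the new branch changes, here is a minimal, self-contained sketch. `ToyBPE`, its greedy `bpe` loop, and the sample vocab and merges are hypothetical stand-ins written for this example, not the actual `BPE` class from `src/tokenizers.js`. Only the `ignore_merges` check inside `encode` mirrors the lines added in this PR. The sketch assumes a vocab that contains the whole word `hello` even though the merge list cannot build it.

```javascript
// Hypothetical toy model for illustration only (not the library's BPE class).
class ToyBPE {
    constructor({ vocab, merges, ignore_merges = false }) {
        this.tokens_to_ids = new Map(Object.entries(vocab));
        // Earlier merges get a lower rank, i.e. a higher priority.
        this.bpe_ranks = new Map(merges.map((merge, i) => [merge, i]));
        this.ignore_merges = ignore_merges ?? false;
    }

    // Simplified merge loop: repeatedly join the adjacent pair with the lowest rank.
    bpe(token) {
        const word = Array.from(token);
        while (word.length > 1) {
            let best = -1;
            let bestRank = Infinity;
            for (let i = 0; i < word.length - 1; ++i) {
                const rank = this.bpe_ranks.get(`${word[i]} ${word[i + 1]}`);
                if (rank !== undefined && rank < bestRank) {
                    bestRank = rank;
                    best = i;
                }
            }
            if (best === -1) break; // no applicable merge left
            word.splice(best, 2, word[best] + word[best + 1]);
        }
        return word;
    }

    encode(tokens) {
        const outputTokens = [];
        for (const token of tokens) {
            // Same shape as the check added in this PR: a whole-token vocab hit
            // bypasses the merge procedure entirely.
            if (this.ignore_merges && this.tokens_to_ids.has(token)) {
                outputTokens.push(token);
                continue;
            }
            outputTokens.push(...this.bpe(token));
        }
        return outputTokens;
    }
}

// Sample data (assumed): "hello" is in the vocab, but no merge sequence produces it.
const vocab = { h: 0, e: 1, l: 2, o: 3, he: 4, ll: 5, hello: 6 };
const merges = ["h e", "l l"];

const withMerges = new ToyBPE({ vocab, merges });
const withoutMerges = new ToyBPE({ vocab, merges, ignore_merges: true });

console.log(withMerges.encode(["hello"]));    // [ 'he', 'll', 'o' ]
console.log(withoutMerges.encode(["hello"])); // [ 'hello' ]
console.log(withoutMerges.encode(["hell"]));  // [ 'he', 'll' ] (not in vocab, so merges still run)
```

With `ignore_merges` left at its default of `false`, behaviour is unchanged, which matches the `?? false` fallback in the constructor hunk.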