Add support for HerBERT models
xenova committed Sep 7, 2023
1 parent f2fce14 commit baa5869
Showing 3 changed files with 31 additions and 2 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -273,6 +273,7 @@ You can refine your search by selecting the task you're interested in (e.g., [te
1. **[GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2)** (from OpenAI) released with the paper [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**.
1. **[GPT-J](https://huggingface.co/docs/transformers/model_doc/gptj)** (from EleutherAI) released in the repository [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/) by Ben Wang and Aran Komatsuzaki.
1. **[GPTBigCode](https://huggingface.co/docs/transformers/model_doc/gpt_bigcode)** (from BigCode) released with the paper [SantaCoder: don't reach for the stars!](https://arxiv.org/abs/2301.03988) by Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, Logesh Kumar Umapathi, Carolyn Jane Anderson, Yangtian Zi, Joel Lamy Poirier, Hailey Schoelkopf, Sergey Troshin, Dmitry Abulkhanov, Manuel Romero, Michael Lappert, Francesco De Toni, Bernardo García del Río, Qian Liu, Shamik Bose, Urvashi Bhattacharyya, Terry Yue Zhuo, Ian Yu, Paulo Villegas, Marco Zocca, Sourab Mangrulkar, David Lansky, Huu Nguyen, Danish Contractor, Luis Villa, Jia Li, Dzmitry Bahdanau, Yacine Jernite, Sean Hughes, Daniel Fried, Arjun Guha, Harm de Vries, Leandro von Werra.
1. **[HerBERT](https://huggingface.co/docs/transformers/model_doc/herbert)** released with the paper [KLEJ: Comprehensive Benchmark for Polish Language Understanding](https://www.aclweb.org/anthology/2020.acl-main.111.pdf) by Piotr Rybak, Robert Mroczkowski, Janusz Tracz, and Ireneusz Gawlik.
1. **[M2M100](https://huggingface.co/docs/transformers/model_doc/m2m_100)** (from Facebook) released with the paper [Beyond English-Centric Multilingual Machine Translation](https://arxiv.org/abs/2010.11125) by Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin.
1. **[MarianMT](https://huggingface.co/docs/transformers/model_doc/marian)** Machine translation models trained using [OPUS](http://opus.nlpl.eu/) data by Jörg Tiedemann. The [Marian Framework](https://marian-nmt.github.io/) is being developed by the Microsoft Translator Team.
1. **[mBART](https://huggingface.co/docs/transformers/model_doc/mbart)** (from Facebook) released with the paper [Multilingual Denoising Pre-training for Neural Machine Translation](https://arxiv.org/abs/2001.08210) by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
1 change: 1 addition & 0 deletions docs/snippets/6_supported-models.snippet
@@ -21,6 +21,7 @@
1. **[GPT-2](https://huggingface.co/docs/transformers/model_doc/gpt2)** (from OpenAI) released with the paper [Language Models are Unsupervised Multitask Learners](https://blog.openai.com/better-language-models/) by Alec Radford*, Jeffrey Wu*, Rewon Child, David Luan, Dario Amodei** and Ilya Sutskever**.
1. **[GPT-J](https://huggingface.co/docs/transformers/model_doc/gptj)** (from EleutherAI) released in the repository [kingoflolz/mesh-transformer-jax](https://github.com/kingoflolz/mesh-transformer-jax/) by Ben Wang and Aran Komatsuzaki.
1. **[GPTBigCode](https://huggingface.co/docs/transformers/model_doc/gpt_bigcode)** (from BigCode) released with the paper [SantaCoder: don't reach for the stars!](https://arxiv.org/abs/2301.03988) by Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, Logesh Kumar Umapathi, Carolyn Jane Anderson, Yangtian Zi, Joel Lamy Poirier, Hailey Schoelkopf, Sergey Troshin, Dmitry Abulkhanov, Manuel Romero, Michael Lappert, Francesco De Toni, Bernardo García del Río, Qian Liu, Shamik Bose, Urvashi Bhattacharyya, Terry Yue Zhuo, Ian Yu, Paulo Villegas, Marco Zocca, Sourab Mangrulkar, David Lansky, Huu Nguyen, Danish Contractor, Luis Villa, Jia Li, Dzmitry Bahdanau, Yacine Jernite, Sean Hughes, Daniel Fried, Arjun Guha, Harm de Vries, Leandro von Werra.
1. **[HerBERT](https://huggingface.co/docs/transformers/model_doc/herbert)** released with the paper [KLEJ: Comprehensive Benchmark for Polish Language Understanding](https://www.aclweb.org/anthology/2020.acl-main.111.pdf) by Piotr Rybak, Robert Mroczkowski, Janusz Tracz, and Ireneusz Gawlik.
1. **[M2M100](https://huggingface.co/docs/transformers/model_doc/m2m_100)** (from Facebook) released with the paper [Beyond English-Centric Multilingual Machine Translation](https://arxiv.org/abs/2010.11125) by Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Edouard Grave, Michael Auli, Armand Joulin.
1. **[MarianMT](https://huggingface.co/docs/transformers/model_doc/marian)** Machine translation models trained using [OPUS](http://opus.nlpl.eu/) data by Jörg Tiedemann. The [Marian Framework](https://marian-nmt.github.io/) is being developed by the Microsoft Translator Team.
1. **[mBART](https://huggingface.co/docs/transformers/model_doc/mbart)** (from Facebook) released with the paper [Multilingual Denoising Pre-training for Neural Machine Translation](https://arxiv.org/abs/2001.08210) by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer.
31 changes: 29 additions & 2 deletions src/tokenizers.js
@@ -1344,6 +1344,8 @@ class PostProcessor extends Callable {

case 'RobertaProcessing':
return new RobertaProcessing(config);
case 'BertProcessing':
return new BertProcessing(config);

default:
throw new Error(`Unknown PostProcessor type: ${config.type}`);
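            // For reference, a tokenizer.json post_processor entry along these
            // lines (token ids illustrative) is what now reaches the new case above:
            //   { "type": "BertProcessing", "cls": ["[CLS]", 101], "sep": ["[SEP]", 102] }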
@@ -1375,9 +1377,8 @@

/**
* A post-processor that adds special tokens to the beginning and end of the input.
* @extends PostProcessor
*/
class RobertaProcessing extends PostProcessor {
class BertProcessing extends PostProcessor {
/**
* @param {Object} config The configuration for the post-processor.
* @param {string[]} config.cls The special tokens to add to the beginning of the input.
@@ -1408,6 +1409,7 @@ class RobertaProcessing extends PostProcessor {
return tokens;
}
}
class RobertaProcessing extends BertProcessing { } // NOTE: extends BertProcessing
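// Behavior sketch, assuming '[CLS]' / '[SEP]' as the configured special tokens:
//   post_process(['hello', 'world'])  =>  ['[CLS]', 'hello', 'world', '[SEP]']
// RobertaProcessing keeps its previous behavior by inheriting the same logic,
// with '<s>' / '</s>' supplied by its own config.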

/**
* Post processor that replaces special tokens in a template with actual tokens.
@@ -1519,6 +1521,8 @@ class Decoder extends Callable {

case 'CTC':
return new CTCDecoder(config);
case 'BPEDecoder':
return new BPEDecoder(config);
default:
throw new Error(`Unknown Decoder type: ${config.type}`);
}
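        // For reference, a tokenizer.json decoder entry along these lines
        // (suffix value illustrative) now resolves to the new case above:
        //   { "type": "BPEDecoder", "suffix": "</w>" }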
@@ -1625,6 +1629,7 @@ class FuseDecoder extends Decoder {
}
}


class StripDecoder extends Decoder {
constructor(config) {
super(config);
@@ -1846,6 +1851,21 @@ class DecoderSequence extends Decoder {

}

class BPEDecoder extends Decoder {
constructor(config) {
super(config);

this.suffix = this.config.suffix;
}
/** @type {Decoder['decode_chain']} */
decode_chain(tokens) {
return tokens.map((token, i) => {
return token.replaceAll(this.suffix, (i === tokens.length - 1) ? '' : ' ')
});
}
}
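// Worked example of decode_chain above, assuming the config sets suffix to '</w>':
//   decode_chain(['cześć</w>', 'świecie</w>'])  =>  ['cześć ', 'świecie']
// Every suffix becomes a space except on the final token, so a later join
// yields 'cześć świecie'.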


/**
* This PreTokenizer replaces spaces with the given replacement character, adds a prefix space if requested,
* and returns a list of tokens.
@@ -2558,6 +2578,12 @@ export class DebertaV2Tokenizer extends PreTrainedTokenizer {
return add_token_types(inputs);
}
}
export class HerbertTokenizer extends PreTrainedTokenizer {
/** @type {add_token_types} */
prepare_model_inputs(inputs) {
return add_token_types(inputs);
}
}
export class DistilBertTokenizer extends PreTrainedTokenizer { }
export class CamembertTokenizer extends PreTrainedTokenizer { }
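// Usage sketch for the new tokenizer class; the checkpoint id below is an
// illustrative assumption, not something this commit publishes:
//   import { AutoTokenizer } from '@xenova/transformers';
//   const tokenizer = await AutoTokenizer.from_pretrained('Xenova/herbert-base-cased');
//   const encoded = await tokenizer('Witaj świecie!');
// The prepare_model_inputs override above then attaches token_type_ids
// alongside input_ids and attention_mask.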

@@ -3665,6 +3691,7 @@ export class AutoTokenizer {
DebertaTokenizer,
DebertaV2Tokenizer,
BertTokenizer,
HerbertTokenizer,
MobileBertTokenizer,
SqueezeBertTokenizer,
AlbertTokenizer,
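For context, a simplified sketch of the dispatch this registry addition feeds into (an approximation, not the exact source; the lookup key is the tokenizer_class field of the checkpoint's tokenizer_config.json):

    // Checkpoints declaring "tokenizer_class": "HerbertTokenizer" now resolve
    // to the new class instead of falling back to the generic PreTrainedTokenizer.
    const cls = TOKENIZER_CLASS_MAPPING[tokenizer_config.tokenizer_class] ?? PreTrainedTokenizer;
    return new cls(tokenizer_json, tokenizer_config);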
