Merge branch 'AntNLP:main' into main
Showing 8 changed files with 411 additions and 8 deletions.
## Tokenizer

To build a tokenizer, we need to perform three steps:
+ pre-process: split words on whitespace and punctuation, or with tools like [spaCy](https://spacy.io/) and [Moses](https://www.statmt.org/moses/?n=Development.GetStarted) (a rough splitter is sketched after this list).
+ train: build the vocabulary on the corpus.
+ encode: output sub-words according to the vocabulary.
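As a rough illustration of the pre-processing step only, a minimal whitespace-and-punctuation splitter could look like the sketch below. It is a simplified stand-in for spaCy or Moses, not what those tools actually do, and the function name `pre_tokenize` is made up.

```python
import re

def pre_tokenize(text: str) -> list:
    # Keep each run of word characters as a token and emit every
    # punctuation mark as its own token; a real pipeline would use
    # spaCy or Moses here instead of a regular expression.
    return re.findall(r"\w+|[^\w\s]", text)

print(pre_tokenize("Hello, y'all!"))  # ['Hello', ',', 'y', "'", 'all', '!']
```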
### Resources & References
+ Papers:
  + [Byte Pair Encoding (BPE)](https://aclanthology.org/P16-1162.pdf)
  + [WordPiece](https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/37842.pdf)
  + [SentencePiece Unigram](https://arxiv.org/pdf/1804.10959.pdf)
+ Code:
  + [huggingface/tokenizers](https://github.com/huggingface/tokenizers) (Rust implementation)
  + [BPE](https://github.com/rsennrich/subword-nmt/tree/master/subword_nmt) (Python implementation)
  + [BPE (light version)](https://github.com/lovit/WordPieceModel/blob/master/wordpiecemodel/bpe.py) (Python implementation)
  + [SentencePiece](https://github.com/google/sentencepiece) (C++ implementation)
  + [WordPiece](https://github.com/google-research/bert/blob/master/tokenization.py) (Python implementation, without training code)
+ Blogs:
  + https://huggingface.co/docs/transformers/tokenizer_summary
  + https://medium.com/@makcedward/how-subword-helps-on-your-nlp-model-83dd1b836f46
  + https://towardsdatascience.com/wordpiece-subword-based-tokenization-algorithm-1fbd14394ed7
### Implementation
For clarity, we assume that the corpus has been pre-processed with spaCy. The [`Tokenizer` class](tokenizer.py) is then structured as follows (a sketch of BPE-style training follows this list):
+ `__init__`: initialize the tokenizer.
+ `train`: build the vocabulary.
+ `encode`: output sub-words according to the vocabulary.
+ `save`: save the tokenizer to a file.
+ `from_file`: instantiate a new tokenizer from a file.
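To make the `train` step concrete, the sketch below shows the core of BPE merge learning: count adjacent symbol pairs over the frequency-weighted words, merge the most frequent pair, and repeat. It is only an illustration under simplifying assumptions, not the reference implementation; the name `learn_bpe_merges` is made up, and end-of-word markers and the `##` continuation prefix are omitted for brevity.

```python
from collections import Counter
from typing import Dict, List, Tuple

def learn_bpe_merges(word_freqs: Dict[str, int], num_merges: int) -> List[Tuple[str, str]]:
    # word_freqs maps a whitespace-split word to its corpus frequency,
    # e.g. {"low": 5, "lower": 2, "newest": 6, "widest": 3}.
    # Each word starts out as a sequence of single characters.
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Merge the most frequent pair everywhere it occurs.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_corpus = {}
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] = freq
        corpus = new_corpus
    return merges

# e.g. learn_bpe_merges({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 10)
```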
### Verification
##### Our Output
```python
import random
import numpy as np

# Step 1: Set the random seed
SEED = xxx
random.seed(SEED)
np.random.seed(SEED)

# Step 2: Prepare the corpus
# one line per sentence, words are split by whitespace
corpus_file = "xxx"

# Step 3: Build the vocabulary
tokenizer = Tokenizer(
    vocab=None,
    unk_token="[UNK]",
    ...
)
tokenizer.train(
    files=[corpus_file],
    vocab_size=30000,
    ...
)
tokenizer.save("tokenizer.json")

# Step 4: Tokenize
tokenizer = Tokenizer.from_file("tokenizer.json")
output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
```
##### Hugging Face's Output
```python
import random
import numpy as np

# Step 1: Set the random seed
# IMPORTANT! SEED must be the same as ours!
SEED = xxx
random.seed(SEED)
np.random.seed(SEED)

# Step 2: Prepare the corpus
# one line per sentence, words are split by whitespace
corpus_file = "xxx"

# Step 3: Build the vocabulary, here taking BPE as an example
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Keep the same hyper-parameters as ours
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train([corpus_file], trainer)

tokenizer.save("tokenizer.json")

# Step 4: Tokenize
tokenizer = Tokenizer.from_file("tokenizer.json")
output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
```
Lastly, to verify the correctness of our implementation, we compare Hugging Face's output with ours, as in the sketch below.
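A minimal comparison might look like the following sketch. Our `encode` returns a `List[str]`, whereas Hugging Face's `encode` returns an `Encoding` object whose `tokens` attribute holds the sub-word strings; the names `our_tokenizer` and `hf_tokenizer` are illustrative stand-ins for the two tokenizers built above.

```python
# Hypothetical names: `our_tokenizer` is the tokenizer from "Our Output",
# `hf_tokenizer` the one from "Hugging Face's Output", trained on the same
# corpus with the same seed and hyper-parameters.
sentence = "Hello, y'all! How are you 😁 ?"
ours = our_tokenizer.encode(sentence)           # List[str]
theirs = hf_tokenizer.encode(sentence).tokens   # Encoding -> list of sub-words
assert ours == theirs, f"Mismatch:\n{ours}\n{theirs}"
```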
tokenizer.py
from typing import Dict, List, Optional


class Tokenizer:
    """
    Construct a tokenizer, including training and encoding.
    """
    def __init__(self,
                 vocab: Optional[Dict[str, int]] = None,
                 unk_token: str = "[UNK]",
                 prefix: str = "##",
                 lowercase: bool = False,
                 **kwargs) -> None:
        """
        Args:
            vocab (`Dict[str, int]`, optional, defaults to `None`):
                A dictionary of string keys and their ids `{"am": 0,...}`.
            unk_token (`str`, optional, defaults to `[UNK]`):
                The unknown token to be used by the model.
            prefix (`str`, optional, defaults to `##`):
                A prefix to be used for every subword that is not a beginning-of-word.
            lowercase (`bool`, optional, defaults to `False`):
                Whether to lowercase the input.
        """

        if vocab is None:
            self.vocab = {}
        else:
            self.vocab = vocab

        pass

    def train(self,
              files: List[str],
              vocab_size: int = 30000,
              min_frequency: int = 2,
              special_tokens: List[str] = [
                  "[PAD]",
                  "[UNK]",
                  "[CLS]",
                  "[SEP]",
                  "[MASK]",
              ],
              limit_alphabet: int = 1000,
              initial_alphabet: List[str] = [],
              prefix: str = "##",
              **kwargs) -> None:
        """Build the vocabulary.
        Args:
            files (`List[str]`):
                A list of paths to the files that we should use for training.
            vocab_size (`int`, optional, defaults to `30000`):
                The size of the final vocabulary, including all tokens and the alphabet.
                Use 30000 for BPE and WordPiece, and 8000 for SentencePiece Unigram.
            min_frequency (`int`, optional, defaults to `2`):
                The minimum frequency a pair should have in order to be merged.
                Use 0 for WordPiece and SentencePiece Unigram, and 2 for BPE.
            special_tokens (`List[str]`, optional, defaults to `["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]`):
                A list of special tokens the model should know of.
            limit_alphabet (`int`, optional, defaults to `1000`):
                The maximum number of different characters to keep in the alphabet.
            initial_alphabet (`List[str]`, optional, defaults to `[]`):
                A list of characters to include in the initial alphabet, even if not
                seen in the training dataset. If a string contains more than one
                character, only the first one is kept.
            prefix (`str`, optional, defaults to `##`):
                A prefix to be used for every subword that is not a beginning-of-word.
        """
        pass

    def encode(self, sequence: str) -> List[str]:
        """Tokenize.
        Args:
            sequence (`str`):
                The raw text sequence we want to encode.
        """
        pass

    def save(self, path: str) -> None:
        """Save the tokenizer to the file at the given path.
        Args:
            path (`str`):
                A path to a file in which to save the serialized tokenizer.
        """
        pass

    @staticmethod
    def from_file(path: str) -> "Tokenizer":
        """Instantiate a new `Tokenizer` from the file at the given path.
        Args:
            path (`str`):
                A path to a local JSON file representing a previously serialized
                `Tokenizer`.
        """
        pass
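The `encode` stub above is left unimplemented. For illustration only, a greedy longest-match-first lookup in the spirit of WordPiece, using `prefix` for non-initial sub-words and assuming the input is a single pre-processed word, could look like this sketch (the name `wordpiece_encode` is made up, and this is not the reference solution):

```python
from typing import Dict, List

def wordpiece_encode(word: str, vocab: Dict[str, int],
                     unk_token: str = "[UNK]", prefix: str = "##") -> List[str]:
    # Repeatedly take the longest vocabulary entry that matches the start of
    # the remaining characters, prepending `prefix` to every non-initial piece.
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no sub-word covers this span
        pieces.append(piece)
        start = end
    return pieces

# e.g. wordpiece_encode("unaffable", {"un": 0, "##aff": 1, "##able": 2})
# -> ['un', '##aff', '##able']
```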