To build a tokenizer, we need to perform three steps:
- pre-process: split the text into words on whitespace and punctuation, or with tools such as spaCy and Moses (see the sketch after this list).
- train: build the vocabulary on the corpus.
- encode: output sub-words according to the vocabulary.
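For example, here is a minimal pre-processing sketch with spaCy; the file names raw.txt and corpus.txt are placeholders for illustration, not from the original setup:

# A pre-processing sketch, assuming spaCy's blank English pipeline.
# The file names below are hypothetical.
import spacy

nlp = spacy.blank("en")  # tokenizer-only pipeline: splits on whitespace and punctuation

with open("raw.txt", encoding="utf-8") as fin, open("corpus.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        doc = nlp(line.strip())
        # one sentence per line, words separated by whitespace
        fout.write(" ".join(token.text for token in doc) + "\n")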
- Paper:
- Code:
  - HuggingFace tokenizers (Rust implementation)
  - BPE (Python implementation)
  - BPE (light version) (Python implementation)
  - SentencePiece (C++ implementation)
  - WordPiece (Python implementation, without training code)
- Blog:
For clarity, we assume that the corpus has been pre-processed with spaCy. The Tokenizer class is structured as follows:
- __init__: initialize the tokenizer.
- train: build the vocabulary.
- encode: output sub-words according to the vocabulary.
- save: save the tokenizer to a file.
- from_file: instantiate a tokenizer from a file.
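As a reference, the interface can be sketched as below; the method bodies and the JSON serialization format are assumptions for illustration, not the actual implementation:

# A minimal skeleton of the Tokenizer interface described above.
# train/encode bodies are placeholders; the JSON format is an assumption.
import json

class Tokenizer:
    def __init__(self, vocab=None, unk_token="[UNK]"):
        self.vocab = vocab or {}
        self.unk_token = unk_token

    def train(self, files, vocab_size=30000):
        # Build the vocabulary from the corpus files (e.g. BPE merges).
        raise NotImplementedError

    def encode(self, text):
        # Output sub-words according to the vocabulary.
        raise NotImplementedError

    def save(self, path):
        with open(path, "w", encoding="utf-8") as f:
            json.dump({"vocab": self.vocab, "unk_token": self.unk_token}, f)

    @classmethod
    def from_file(cls, path):
        with open(path, encoding="utf-8") as f:
            state = json.load(f)
        return cls(vocab=state["vocab"], unk_token=state["unk_token"])

The expected usage of our Tokenizer is: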
# Step 1: Set the random seed for reproducibility
import random
import numpy as np

SEED = xxx
random.seed(SEED)
np.random.seed(SEED)
# Step 2: Prepare the corpus
# one line per sentence, words are split by whitespace
corpus_file = "xxx"
# Step 3: Build the vocabulary
tokenizer = Tokenizer(
vocab=None,
unk_token="[UNK]",
...
)
tokenizer.train(
files=[corpus_file],
vocab_size=30000,
...
)
tokenizer.save("tokenizer.json")
# Step 4: Tokenize
tokenizer = Tokenizer.from_file("tokenizer.json")
output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
To generate the reference output, we reproduce the same pipeline with HuggingFace's tokenizers library:

# Step 1: Set the random seed
# IMPORTANT! The SEED must be the same as in our implementation!
import random
import numpy as np

SEED = xxx
random.seed(SEED)
np.random.seed(SEED)
# Step 2: Prepare the corpus
# one line per sentence, words are split by whitespace
corpus_file = "xxx"
# Step 3: Build the vocabulary (taking BPE as an example)
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
# Keep the same hyper-parameters as in our implementation
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=30000, special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train([corpus_file], trainer)
tokenizer.save("tokenizer.json")
# Step 4: Tokenize
tokenizer = Tokenizer.from_file("tokenizer.json")
output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
Lastly, to verify the correctness of our implementation, we compare our tokenizer's output with HuggingFace's.
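A minimal comparison sketch follows; it assumes our encode returns a list of sub-word strings, while HuggingFace's encode returns an Encoding whose tokens attribute holds the sub-words (the names ours_tokenizer and hf_tokenizer are illustrative):

# ours_tokenizer: our Tokenizer; hf_tokenizer: the HuggingFace one trained above
text = "Hello, y'all! How are you 😁 ?"
ours = ours_tokenizer.encode(text)            # assumed to be a list of sub-word strings
reference = hf_tokenizer.encode(text).tokens  # HuggingFace Encoding.tokens
assert ours == reference, f"mismatch:\nours: {ours}\nreference: {reference}"
print("Outputs match:", ours)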