Tokenizer

To build a tokenizer, we need to perform three steps:

  • pre-process: split text on whitespace and punctuation, or use tools like spaCy or Moses (a rough sketch follows this list).
  • train: build the vocabulary from the corpus.
  • encode: output sub-words according to the vocabulary.
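For instance, a simple whitespace-and-punctuation splitter (a rough sketch standing in for spaCy or Moses, not a faithful reimplementation; the helper name pre_process is hypothetical) could look like this:

import re

def pre_process(text: str) -> list[str]:
    # Keep runs of word characters together and emit each remaining
    # punctuation mark as its own token, e.g. "Hello," -> ["Hello", ","].
    return re.findall(r"\w+|[^\w\s]", text)

print(pre_process("Hello, y'all!"))  # ['Hello', ',', 'y', "'", 'all', '!']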

Implementation

For clarity, we assume that the corpus has been pre-processed with spaCy. Thus, the structure of the Tokenizer class is as follows (a minimal skeleton is sketched after the list):

  • __init__: initialize the tokenizer state (e.g., the vocabulary and special tokens).
  • train: build the vocabulary.
  • encode: output sub-words according to the vocabulary.
  • save: serialize the tokenizer to a file.
  • from_file: instantiate a tokenizer from a saved file.
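A minimal skeleton of this interface might look like the sketch below; the training and encoding logic is elided, and the JSON layout used by save/from_file is an assumption, not a fixed format:

import json

class Tokenizer:
    def __init__(self, vocab=None, unk_token="[UNK]"):
        # vocab maps sub-word -> id; unk_token stands in for
        # out-of-vocabulary input.
        self.vocab = vocab or {}
        self.unk_token = unk_token

    def train(self, files, vocab_size=30000):
        # Build the vocabulary from the corpus files (merge logic elided).
        ...

    def encode(self, text):
        # Output sub-words according to the vocabulary (lookup elided).
        ...

    def save(self, path):
        # Serialize the tokenizer state to a file.
        with open(path, "w", encoding="utf-8") as f:
            json.dump({"vocab": self.vocab, "unk_token": self.unk_token}, f)

    @classmethod
    def from_file(cls, path):
        # Instantiate a tokenizer from a saved file.
        with open(path, encoding="utf-8") as f:
            state = json.load(f)
        return cls(vocab=state["vocab"], unk_token=state["unk_token"])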

Verification

Our Output
import random
import numpy as np

# Step 1: Set the random seed
SEED = xxx  # elided: use any fixed integer seed
random.seed(SEED)
np.random.seed(SEED)

# Step 2: Prepare the corpus
# one line per sentence, words are split by whitespace
corpus_file = "xxx"

# Step 3: Build the vocabulary
tokenizer = Tokenizer(
    vocab=None,
    unk_token="[UNK]",
    # ... other hyper-parameters elided
)
tokenizer.train(
    files=[corpus_file],
    vocab_size=30000,
    # ... other hyper-parameters elided
)
tokenizer.save("tokenizer.json")

# Step 4: Tokenize
tokenizer = Tokenizer.from_file("tokenizer.json")
output = tokenizer.encode("Hello, y'all! How are you 😁 ?")

Hugging Face's Output
import random
import numpy as np

# Step 1: Set the random seed
# IMPORTANT! SEED must be the same as ours!
SEED = xxx  # elided: same fixed integer seed as above
random.seed(SEED)
np.random.seed(SEED)

# Step 2: Prepare the corpus
# one line per sentence, words are split by whitespace
corpus_file = "xxx"

# Step 3: Build the vocabulary, here taking BPE as an example
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# We should keep the same hyper-parameters as ours
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train([corpus_file], trainer)

tokenizer.save("tokenizer.json")

# Step 4: Tokenize
tokenizer = Tokenizer.from_file("tokenizer.json")
output = tokenizer.encode("Hello, y'all! How are you 😁 ?")

Lastly, to verify the correctness of our implementation, we compare Hugging Face's output with ours.
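
For example, the check could look like the sketch below. The file names ours.json and hf.json are hypothetical (the two runs above would otherwise both write tokenizer.json); Hugging Face's Encoding object exposes tokens and ids, and we assume our encode returns an object with the same fields:

from tokenizers import Tokenizer as HFTokenizer

text = "Hello, y'all! How are you 😁 ?"

# Assume our run saved to "ours.json" and Hugging Face's to "hf.json".
our_tokenizer = Tokenizer.from_file("ours.json")
hf_tokenizer = HFTokenizer.from_file("hf.json")

ours = our_tokenizer.encode(text)
theirs = hf_tokenizer.encode(text)

# With the same seed, corpus, and hyper-parameters, the two
# implementations should agree on both sub-words and ids.
assert ours.tokens == theirs.tokens, (ours.tokens, theirs.tokens)
assert ours.ids == theirs.ids, (ours.ids, theirs.ids)
print("Outputs match:", ours.tokens)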