To build a tokenizer, we need to perform three steps:
- pre-process: split the text into words on whitespace and punctuation, or with tools such as spaCy and Moses (see the sketch after this list).
- train: build the vocabulary on the corpus.
- encode: output sub-words according to the vocabulary.
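For example, here is a minimal pre-processing sketch with spaCy; the file names raw.txt and corpus.txt are placeholders for illustration, not from the original setup:

# A pre-processing sketch, assuming spaCy's blank English pipeline.
# The file names below are hypothetical.
import spacy

nlp = spacy.blank("en")  # tokenizer-only pipeline: splits on whitespace and punctuation

with open("raw.txt", encoding="utf-8") as fin, open("corpus.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        doc = nlp(line.strip())
        # one sentence per line, words separated by whitespace
        fout.write(" ".join(token.text for token in doc) + "\n")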
- Paper:
- Code:
  - HuggingFace tokenizers (Rust implementation)
  - BPE (Python implementation)
  - BPE (light version) (Python implementation)
  - SentencePiece (C++ implementation)
  - WordPiece (Python implementation, without training code)
- Blog:
For clarity, we assume that the corpus has been pre-processed with spaCy. The Tokenizer class is structured as follows:
- __init__: initialize the tokenizer.
- train: build the vocabulary.
- encode: output sub-words according to the vocabulary.
- save: save the tokenizer to a file.
- from_file: instantiate a tokenizer from a file.
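As a reference, the interface can be sketched as below; the method bodies and the JSON serialization format are assumptions for illustration, not the actual implementation:

# A minimal skeleton of the Tokenizer interface described above.
# train/encode bodies are placeholders; the JSON format is an assumption.
import json

class Tokenizer:
    def __init__(self, vocab=None, unk_token="[UNK]"):
        self.vocab = vocab or {}
        self.unk_token = unk_token

    def train(self, files, vocab_size=30000):
        # Build the vocabulary from the corpus files (e.g. BPE merges).
        raise NotImplementedError

    def encode(self, text):
        # Output sub-words according to the vocabulary.
        raise NotImplementedError

    def save(self, path):
        with open(path, "w", encoding="utf-8") as f:
            json.dump({"vocab": self.vocab, "unk_token": self.unk_token}, f)

    @classmethod
    def from_file(cls, path):
        with open(path, encoding="utf-8") as f:
            state = json.load(f)
        return cls(vocab=state["vocab"], unk_token=state["unk_token"])

The expected usage of our Tokenizer is: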
# Step 1: Set the random seed for reproducibility
import random
import numpy as np

SEED = xxx
random.seed(SEED)
np.random.seed(SEED)
# Step 2: Prepare the corpus
# one line per sentence, words are split by whitespace
corpus_file = "xxx"
# Step 3: Build the vocabulary
tokenizer = Tokenizer(
vocab=None,
unk_token="[UNK]",
...
)
tokenizer.train(
files=[corpus_file],
vocab_size=30000,
...
)
tokenizer.save("tokenizer.json")
# Step 4: Tokenize
tokenizer = Tokenizer.from_file("tokenizer.json")
output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
To generate the reference output, we reproduce the same pipeline with HuggingFace's tokenizers library:

# Step 1: Set the random seed
# IMPORTANT! The SEED must be the same as in our implementation!
import random
import numpy as np

SEED = xxx
random.seed(SEED)
np.random.seed(SEED)
# Step 2: Prepare the corpus
# one line per sentence, words are split by whitespace
corpus_file = "xxx"
# Step 3: Build the vocabulary (taking BPE as an example)
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
# Keep the same hyper-parameters as in our implementation
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=30000, special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train([corpus_file], trainer)
tokenizer.save("tokenizer.json")
# Step 4: Tokenize
tokenizer = Tokenizer.from_file("tokenizer.json")
output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
Lastly, to verify the correctness of our implementation, we compare our tokenizer's output with HuggingFace's.
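A minimal comparison sketch follows; it assumes our encode returns a list of sub-word strings, while HuggingFace's encode returns an Encoding whose tokens attribute holds the sub-words (the names ours_tokenizer and hf_tokenizer are illustrative):

# ours_tokenizer: our Tokenizer; hf_tokenizer: the HuggingFace one trained above
text = "Hello, y'all! How are you 😁 ?"
ours = ours_tokenizer.encode(text)            # assumed to be a list of sub-word strings
reference = hf_tokenizer.encode(text).tokens  # HuggingFace Encoding.tokens
assert ours == reference, f"mismatch:\nours: {ours}\nreference: {reference}"
print("Outputs match:", ours)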