Huggingface Transformers Tokenizer in C++

A tokenizer is in charge of preparing the inputs for a model.

The tokenizer can tokenize mixed Chinese-English text on Linux.

This project mainly addresses Chinese character encoding problems that come up when tokenizing in C++.

Requirements

  • Boost

C++ Unicode support