Releases · himkt/konoha
Release v3.0.1
- #41 Add whitespace tokenizer
In [1]: from tiny_tokenizer import WordTokenizer
In [2]: tk = WordTokenizer("whitespace")
In [3]: tk.tokenize("わたし は 猫")
Out[3]: [わたし, は, 猫]
Release v3.0.0
Release v2.1.0
- Support Sudachi tokenizer: #20
from tiny_tokenizer import SentenceTokenizer
from tiny_tokenizer import WordTokenizer

if __name__ == "__main__":
    sentence_tokenizer = SentenceTokenizer()
    tokenizer = WordTokenizer(tokenizer="Sudachi", mode="A")
    #                                              ^^^^^^^^
    # You can choose the splitting mode.
    # (https://github.com/WorksApplications/SudachiPy#as-a-python-package)

    sentence = "我輩は猫である."
    print("input: ", sentence)
    result = tokenizer.tokenize(sentence)
    print(result)
Release v2.0.0
Release v1.3.1
- tiny_tokenizer can now be installed without any word tokenizers.
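As a rough illustration (not from the release notes), this means the sentence splitter is still usable when no word-tokenizer backend is installed; the example sentence, the expected output, and the assumption that SentenceTokenizer.tokenize splits on sentence-ending punctuation are mine:

from tiny_tokenizer import SentenceTokenizer

# No word-tokenizer backend (MeCab, KyTea, ...) is required for this.
sentence_tokenizer = SentenceTokenizer()
print(sentence_tokenizer.tokenize("我輩は猫である。名前はまだ無い。"))
# expected, roughly: ['我輩は猫である。', '名前はまだ無い。']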
Release v1.3.0
- Change the return type of WordTokenizer.tokenize from str to list[str] (#13)
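A minimal sketch of the behavioural change (the "MeCab" backend name is chosen only for illustration, not taken from the release notes):

from tiny_tokenizer import WordTokenizer

tokenizer = WordTokenizer("MeCab")  # backend name is an assumption
tokens = tokenizer.tokenize("我輩は猫である")
assert isinstance(tokens, list)     # previously a single whitespace-joined str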
Release v1.2.0
- Support character/sub-word tokenization.
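A short sketch of what character-level tokenization looks like, assuming the character tokenizer is selected by name like the other backends (the exact string "Character" is an assumption):

from tiny_tokenizer import WordTokenizer

tokenizer = WordTokenizer("Character")
print(tokenizer.tokenize("我輩は猫である"))
# expected: one token per character, e.g. [我, 輩, は, 猫, で, あ, る]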
Release v1.1.0
- Add Dockerfile
- Add docstring
- Update the example