Releases: himkt/konoha

Release v3.0.1

27 Sep 17:13
145f4e0
  • #41 Add whitespace tokenizer
In [1]: from tiny_tokenizer import WordTokenizer
In [2]: tk = WordTokenizer("whitespace")
In [3]: tk.tokenize("わたし は 猫")
Out[3]: [わたし, は, 猫]

Release v3.0.0

27 Sep 05:45

Release v2.1.0

22 Jul 12:15
9b8a3b1
  • Support Sudachi tokenizer: #20
from tiny_tokenizer import SentenceTokenizer
from tiny_tokenizer import WordTokenizer


if __name__ == "__main__":
    sentence_tokenizer = SentenceTokenizer()
    tokenizer = WordTokenizer(tokenizer="Sudachi", mode="A")
    #                                              ^^^^^^^^
    #                                 You can choose splitting mode.
    #
    #      (https://github.com/WorksApplications/SudachiPy#as-a-python-package)
    #

    document = "我輩は猫である."
    print("input: ", document)

    # Split the document into sentences, then word-tokenize each sentence.
    for sentence in sentence_tokenizer.tokenize(document):
        result = tokenizer.tokenize(sentence)
        print(result)

Release v2.0.0

11 Jul 13:00
1e99cb9

This release breaks backward compatibility.

Introduce the Token class.
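
A minimal sketch of how the new class is presumably used; the MeCab backend and the surface attribute on Token are assumptions based on later releases, not confirmed by these notes:

from tiny_tokenizer import WordTokenizer

if __name__ == "__main__":
    tokenizer = WordTokenizer("MeCab")
    tokens = tokenizer.tokenize("吾輩は猫である")

    for token in tokens:
        # Each element is a Token object rather than a plain string;
        # `surface` is assumed here.
        print(token.surface)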

Release v1.3.1

09 Jul 05:40
44743da

tiny_tokenizer can now be installed without word tokenizer backends.
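
For example, sentence-level splitting still works when no word tokenizer backend is installed. A minimal sketch, assuming SentenceTokenizer behaves as in later releases:

from tiny_tokenizer import SentenceTokenizer

if __name__ == "__main__":
    tokenizer = SentenceTokenizer()
    # No MeCab/KyTea required; the word tokenizers are optional.
    print(tokenizer.tokenize("吾輩は猫である。名前はまだ無い。"))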

Release v1.3.0

27 Jun 09:47
443aaa5
  • Change the return type of WordTokenizer.tokenize from str to list[str] (#13)
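
A minimal sketch of the new behaviour, assuming the MeCab backend and the same constructor as in later releases:

from tiny_tokenizer import WordTokenizer

if __name__ == "__main__":
    tokenizer = WordTokenizer("MeCab")
    result = tokenizer.tokenize("吾輩は猫である")

    # tokenize previously returned a single str; it now returns a list.
    print(type(result))  # <class 'list'>
    print(result)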

Release v1.2.0

28 May 09:10

Support character/sub-word tokenization.
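
A minimal sketch of character-level tokenization; the backend name "Character" is an assumption based on later releases:

from tiny_tokenizer import WordTokenizer

if __name__ == "__main__":
    tokenizer = WordTokenizer("Character")
    # Splits the input into one token per character.
    # Sub-word tokenization is also supported but needs a trained model.
    print(tokenizer.tokenize("わたしは猫"))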

Release v1.1.0

25 Dec 11:01
  • Add Dockerfile
  • Add docstring
  • Update the example