New API: Use the new class SoMaJo instead of Tokenizer and SentenceSplitter. The old API is still supported for now but issues deprecation warnings.
Speed-up: Due to a new internal representation of the input text during processing (as a doubly linked list of Token objects), tokenization is now two to three times faster.
Incremental and parallel processing of XML: If a sensible set of eos_tags is specified, XML input is processed incrementally, which allows for arbitrarily large XML input, and processing can also be parallelized.
New option --strip-tags to suppress the output of XML tags.
Support for textual representations of emojis (:smile:, :stuck_out_tongue_winking_eye:, etc.).
Support for textfaces (༼ʘ̚ل͜ʘ̚༽, ╚(ಠ_ಠ)=┐, etc.).
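A minimal sketch of how the new SoMaJo class might be used for plain text and for XML with eos_tags. It assumes SoMaJo 2.0.0 or later is installed and that `tokenize_text`, `tokenize_xml` (with its `strip_tags` keyword) and the `text` attribute of tokens behave as described in the project README; check the README for the authoritative signatures. The helper `render_sentences` and the function `demo` are hypothetical names used only for illustration.

```python
def render_sentences(sentences):
    """Join the token texts of each sentence into one space-separated line.

    Hypothetical helper, not part of SoMaJo: it only assumes that every
    token object exposes a `text` attribute.
    """
    return [" ".join(token.text for token in sentence) for sentence in sentences]


def demo():
    # Imported lazily so the helper above can be used without SoMaJo installed.
    # Assumption: installed via `pip install SoMaJo` (version 2.0.0 or later).
    from somajo import SoMaJo

    # Language codes now name the tokenization guideline (see "Breaking changes").
    tokenizer = SoMaJo("en_PTB")

    # Plain text: an iterable of paragraphs goes in, sentences of tokens come out.
    paragraphs = ["This is a test. Here is another sentence."]
    for line in render_sentences(tokenizer.tokenize_text(paragraphs)):
        print(line)

    # XML: eos_tags lists the tags that always end a sentence, which is what
    # allows incremental processing. strip_tags is assumed to be the Python
    # counterpart of the --strip-tags command-line option.
    xml = "<doc><p>First paragraph.</p><p>Second one.</p></doc>"
    sentences = tokenizer.tokenize_xml(xml, eos_tags=["p"], strip_tags=True)
    for line in render_sentences(sentences):
        print(line)
```

Calling `demo()` prints one tokenized sentence per line. Keeping the rendering in a separate helper means it can be exercised with any objects that have a `text` attribute.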
Breaking changes
Removed the tokenizer script (deprecated since version 1.5.0, released in October 2017). Use somajo-tokenizer instead.
Language codes now include the tokenization guideline: use "de_CMC" instead of "de" and "en_PTB" instead of "en".
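For code that still passes the old language codes around, a small translation table is one possible way to migrate. The sketch below is not part of SoMaJo; the names `LEGACY_LANGUAGE_CODES` and `upgrade_language_code` are hypothetical, and only the two mappings stated above are covered.

```python
# Hypothetical migration helper, not part of SoMaJo.
# Maps the pre-2.0.0 language codes to the codes named in these release notes.
LEGACY_LANGUAGE_CODES = {
    "de": "de_CMC",
    "en": "en_PTB",
}


def upgrade_language_code(code):
    """Return the new-style code for an old one; pass other values through unchanged."""
    return LEGACY_LANGUAGE_CODES.get(code, code)


print(upgrade_language_code("de"))      # de_CMC
print(upgrade_language_code("en_PTB"))  # en_PTB (already new-style)
```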