v2.0.0

@tsproisl tsproisl released this 19 Dec 14:13

New features and improvements

  • New API: Use new class SoMaJo instead of Tokenizer and SentenceSplitter. Currently, the old API is still supported but will issue deprecation warnings.
  • Speed-up: Due to a new internal representation of the input text during processing (as a doubly linked list of Token objects), tokenization is now two to three times faster.
  • Incremental and parallel processing of XML: If a sensible set of eos_tags is specified, XML input is processed incrementally, which allows for arbitrarily large XML input, and processing can also be parallelized.
  • New option --strip-tags to suppress the output of XML tags.
  • Support for textual representations of emojis (:smile:, :stuck_out_tongue_winking_eye:, etc.).
  • Support for textfaces (༼ʘ̚ل͜ʘ̚༽, ╚(ಠ_ಠ)=┐, etc.).
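
The incremental XML processing mentioned above can be pictured with a small standard-library sketch. This is not SoMaJo's implementation; the helper name iter_chunks and the sample document are assumptions made for illustration. It only shows why tags that reliably end a sentence (the eos_tags) make it possible to hand text on piece by piece instead of loading the whole document first.

```python
import io
import xml.etree.ElementTree as ET


def iter_chunks(xml_source, eos_tags):
    """Yield the text of each element whose tag is in eos_tags, one at a time.

    Hypothetical helper for illustration only; not part of SoMaJo.
    """
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag in eos_tags:
            # The element is complete, so its text can be processed now.
            yield "".join(elem.itertext())
            # Drop the element's content once it has been handed on.
            elem.clear()


# Made-up sample input; <p> is assumed to always end a sentence.
xml_data = io.BytesIO(
    b"<text><p>Erster Absatz.</p><p>Zweiter <i>Absatz</i>.</p></text>"
)
chunks = list(iter_chunks(xml_data, eos_tags={"p"}))
print(chunks)
```

Because each chunk is independent of the others, chunks like these could also be distributed to several worker processes, which is the intuition behind the parallel processing mentioned in the same item.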

Breaking changes

  • Removed the tokenizer script (deprecated since version 1.5.0, released in October 2017). Use somajo-tokenizer instead.
  • Language codes now include the tokenization guideline: use "de_CMC" instead of "de" and "en_PTB" instead of "en".
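
For code or configuration that still stores the old language codes, a small lookup can ease the transition. The following is a minimal sketch; the names LANGUAGE_CODE_UPDATES and migrate_language_code are hypothetical and not part of SoMaJo. Only the two mappings listed above are taken from these release notes.

```python
# Old codes and their 2.0.0 replacements, as listed in the release notes.
LANGUAGE_CODE_UPDATES = {"de": "de_CMC", "en": "en_PTB"}


def migrate_language_code(code):
    """Return the new-style code for an old one; leave other codes unchanged.

    Hypothetical helper for illustration only; not part of SoMaJo.
    """
    return LANGUAGE_CODE_UPDATES.get(code, code)


print(migrate_language_code("de"))      # de_CMC
print(migrate_language_code("en_PTB"))  # en_PTB
```

Codes that are already in the new form pass through untouched, so the helper can be applied more than once without harm.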