Releases: tsproisl/SoMaJo

v2.4.3

05 Aug 06:43
  • Move non-abbreviation tokens that should not be split from single_token_abbreviations_<LANG>.txt to single_tokens_<LANG>.txt and add cellular network generations (issue #32).

v2.4.2

19 Feb 12:24
  • Fix issues #28 and #29 (markdown links with trailing symbols after URL part).

v2.4.1

09 Feb 08:52
  • Fix issue #27 (URLs in angle brackets).

v2.4.0

23 Dec 20:32
  • New feature: SoMaJo can output character offsets for tokens, allowing for stand-off tokenization. To enable the feature, pass character_offsets=True to the constructor or use the option --character-offsets on the command line. The character offsets are determined by aligning the tokenized output with the input, so enabling the feature noticeably increases processing time.
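
To illustrate what stand-off tokenization with character offsets means, here is a minimal sketch of the general idea of aligning tokens with the input. It is not SoMaJo's actual alignment code: the function align_tokens is hypothetical, and it assumes every token appears verbatim and in order in the input, which a real tokenizer cannot rely on (for example when it normalizes whitespace or resolves XML entities).

```python
def align_tokens(text, tokens):
    """Return a (start, end) character offset for each token.

    Hypothetical helper for illustration only; assumes each token
    occurs verbatim in `text`, in the same order as in `tokens`.
    """
    offsets = []
    cursor = 0  # position in text up to which tokens have been aligned
    for token in tokens:
        start = text.find(token, cursor)
        if start == -1:
            raise ValueError(f"token {token!r} not found after position {cursor}")
        end = start + len(token)
        offsets.append((start, end))
        cursor = end
    return offsets


text = "Das ist ein Test, oder?"
tokens = ["Das", "ist", "ein", "Test", ",", "oder", "?"]
offsets = align_tokens(text, tokens)

# Each offset pair slices the original input back to the token text.
for token, (start, end) in zip(tokens, offsets):
    print(token, start, end, text[start:end])
```

Because the offsets point back into the unmodified input, annotations can be stored separately from the text. The extra pass over the input for every token also hints at why such an alignment step costs additional processing time.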

v2.3.1

23 Sep 09:10
  • Fix issue #26 (markdown links that contain a URL in the link text).

v2.3.0

14 Aug 18:56
  • Potentially breaking change: The somajo-tokenizer script is automatically created upon installation and bin/somajo-tokenizer is removed. For most users, this does not make a difference. If you used to run your own modified version of SoMaJo directly via bin/somajo-tokenizer, consider installing the project in editable mode (see Development section in README.md).
  • Switch from setup.py to pyproject.toml and restructure the project (source in src, tests in tests).
  • When creating a Token object, only known token classes can be passed.
  • Fix issue #25 (dates at the end of sentences).

v2.2.4

16 Jun 08:45
  • Improvements to tokenization of words containing numbers (e.g. COVID-19-Pandemie, FFP2-Maske).

v2.2.3

02 Feb 10:40
  • Improvements to tokenization: Roman ordinals, abbreviation “Art.” preceding a number, certain units of measurement at the end of a sentence (e.g. km/h).

v2.2.2

12 Sep 17:52
  • Bugfix: Command-line option --sentence_tag implies option --split_sentences.

v2.2.1

08 Mar 08:57
  • Bugfix: Command-line option --strip-tags implies option --xml.