Releases: tsproisl/SoMaJo

v2.2.0

18 Jan 09:50
  • New feature: Prune XML tags and their contents from the input before tokenization (via the command line option --prune TAGNAME1 --prune TAGNAME2 … or by passing prune_tags=["TAGNAME1", "TAGNAME2", …] to tokenize_xml or tokenize_xml_file). This can be useful when processing HTML files, e.g. for removing any <script> and <style> tags from the input.
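To illustrate what pruning means, here is a minimal sketch using only Python's standard library. It does not call SoMaJo and is not SoMaJo's implementation; `prune_elements` is a hypothetical helper, and the input is assumed to be well-formed XML. The point it shows is that a pruned element disappears together with its contents, while the text that follows it is kept.

```python
import xml.etree.ElementTree as ET


def prune_elements(xml_string, prune_tags):
    """Remove the named elements and everything inside them.

    Hypothetical helper for illustration only. Assumes well-formed XML
    and that the root element itself is not one of the pruned tags.
    """
    prune_tags = set(prune_tags)
    root = ET.fromstring(xml_string)
    for parent in list(root.iter()):
        previous = None  # last sibling that was kept
        for child in list(parent):
            if child.tag in prune_tags:
                # Text after the pruned element (its "tail") must survive,
                # so attach it to the preceding sibling or to the parent.
                tail = child.tail or ""
                if previous is None:
                    parent.text = (parent.text or "") + tail
                else:
                    previous.tail = (previous.tail or "") + tail
                parent.remove(child)
            else:
                previous = child
    return ET.tostring(root, encoding="unicode")


page = (
    "<html><head><style>p {}</style></head>"
    "<body><p>Hi<script>x()</script> there</p></body></html>"
)
print(prune_elements(page, ["script", "style"]))
```

In SoMaJo itself, the same effect is requested through the `--prune` option or the `prune_tags` argument described above.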

v2.1.6

13 Dec 15:21
  • Recognize more URLs without protocol.
  • Fix a small bug in the implementation of doubly linked lists.

v2.1.5

24 Aug 14:38
  • Split sequences of hashtags without spaces.
  • Add legal abbreviations (issue #21).
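The hashtag splitting can be pictured with a small regular-expression sketch. This is not SoMaJo's actual rule set: the pattern and the helper name `split_hashtag_sequence` are assumptions made for illustration, with hyphens allowed inside a hashtag as described in the v2.1.2 notes.

```python
import re

# Assumption: a hashtag is "#" followed by word characters,
# optionally continued by hyphen-joined word parts.
HASHTAG = re.compile(r"#\w+(?:-\w+)*")


def split_hashtag_sequence(chunk):
    """Split a whitespace-free run of hashtags into single hashtags.

    Returns [chunk] unchanged if the chunk is not made up entirely of
    hashtags. Hypothetical helper for illustration only.
    """
    tags = HASHTAG.findall(chunk)
    if tags and "".join(tags) == chunk:
        return tags
    return [chunk]


print(split_hashtag_sequence("#foo#bar"))
print(split_hashtag_sequence("#Refugeeswelcome-Bewegung"))
```

The first call yields two hashtags, while the hyphenated compound in the second call stays in one piece.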

v2.1.4

09 Jul 09:10
  • Add a few abbreviations.
  • Improve detection of sentence boundaries when punctuation is followed by emoticons, mentions or hashtags.

v2.1.3

05 Mar 15:04
  • Add a few abbreviations.
  • Improve tokenization of protocol-less URLs.
  • Improve tokenization of a few emoticons and symbols/dingbats.
  • Improve tokenization of gendered nouns (gender star, gender colon).
  • Improve tokenization of simple arithmetic operations.

v2.1.2

29 Jan 07:29

  • Allow hyphens in hashtags. While hyphens cannot be part of Twitter hashtags, we do not want to split compounds like “#Refugeeswelcome-Bewegung”.

v2.1.1

30 Jun 11:11
  • Detection of quotes delimited by apostrophes ('…') is now more conservative (issue #16).

v2.1.0

17 Jun 12:37
  • New feature: Delimit sentences with XML tags (via the command line option --sentence-tag TAGNAME or by passing xml_sentences="TAGNAME" to the constructor). When using this option with XML input, SoMaJo tries hard to produce well-formed XML as output. To achieve this, some tags will need to be closed and re-opened at sentence boundaries. In this paragraph, for example, the italic region contains a sentence boundary:
    <p>Hi <i>there! How</i> are you?</p>
    SoMaJo will close the i tag before the end of the sentence and re-open it afterwards:
    <p> <s> Hi <i> there ! </i> </s> <s> <i> How </i> are you ? </s> </p>
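The close-and-re-open behaviour can be sketched in a few lines of Python. This is an illustration under stated assumptions, not SoMaJo's algorithm: `wrap_sentences` is a hypothetical helper, the input is already tokenized, the sentence-final token positions are given, tags are well nested, and the caller says which tags count as inline (everything else is treated as block-level).

```python
import re

# Matches an opening or closing tag; group 1 is "/" for closing tags,
# group 2 is the tag name.
TAG = re.compile(r"<(/?)([A-Za-z][\w-]*)[^<>]*>")


def wrap_sentences(tokens, sentence_ends, inline_tags=("i", "b", "em"),
                   sentence_tag="s"):
    """Wrap sentences in <s>…</s>, splitting inline tags at boundaries.

    Hypothetical helper. `sentence_ends` is a set of indices into
    `tokens` marking the last token of each sentence.
    """
    out = []
    open_inline = []   # stack of (name, opening tag text)
    in_sentence = False

    def close_sentence():
        nonlocal in_sentence
        # Close inline tags innermost first, but keep them on the stack
        # so they can be re-opened at the start of the next sentence.
        out.extend(f"</{name}>" for name, _ in reversed(open_inline))
        out.append(f"</{sentence_tag}>")
        in_sentence = False

    for idx, tok in enumerate(tokens):
        match = TAG.fullmatch(tok)
        if match is None:  # ordinary token
            if not in_sentence:
                out.append(f"<{sentence_tag}>")
                # Re-open inline tags carried over from before.
                out.extend(text for _, text in open_inline)
                in_sentence = True
            out.append(tok)
            if idx in sentence_ends:
                close_sentence()
        elif match.group(2) in inline_tags:
            if match.group(1):  # closing inline tag
                if open_inline:
                    open_inline.pop()
            else:               # opening inline tag
                open_inline.append((match.group(2), tok))
            # Outside a sentence the tag is emitted (or was already
            # closed) by the sentence logic, so write it only inside one.
            if in_sentence:
                out.append(tok)
        else:  # block-level tag: never let a sentence span it
            if in_sentence:
                close_sentence()
            out.append(tok)
    return out


tokens = ["<p>", "Hi", "<i>", "there", "!", "How", "</i>",
          "are", "you", "?", "</p>"]
print(" ".join(wrap_sentences(tokens, {4, 9})))
# <p> <s> Hi <i> there ! </i> </s> <s> <i> How </i> are you ? </s> </p>
```

The usage lines reproduce the example from this release note: the `i` element is closed before `</s>` and re-opened right after the next `<s>`.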

v2.0.6

12 Jun 16:15
  • Support all textual smileys and textfaces from Signal messenger.
  • Raise a TypeError if tokenize_text is called with a string instead of an iterable of strings (issue #13).
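The reason for the TypeError is that a Python string is itself an iterable of one-character strings, so a bare string would otherwise be processed as many tiny paragraphs. A minimal sketch of such a guard follows; `check_paragraphs` is a hypothetical name and the error message is made up, so this only mirrors the idea and not SoMaJo's code.

```python
def check_paragraphs(paragraphs):
    """Reject a bare string where an iterable of strings is expected.

    Hypothetical helper for illustration only.
    """
    if isinstance(paragraphs, str):
        raise TypeError(
            "expected an iterable of strings, e.g. [text], "
            "not a single string"
        )
    return paragraphs


check_paragraphs(["First paragraph.", "Second paragraph."])  # accepted

try:
    check_paragraphs("Just one string.")
except TypeError as error:
    print(f"TypeError: {error}")
```

Wrapping a single text in a list, as in `[text]`, is the usual way to satisfy such a check.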

v2.0.5

09 Apr 18:29
  • Add heuristics for ambiguous quotation marks (issue #11).
  • Avoid false positives for emoticons that contain a space (issue #12).
  • Correctly tokenize obfuscated email addresses that contain spaces.
  • Do not split tl;dr and its German variant zl;ng.