SoMaJo

SoMaJo is a rule-based tokenizer and sentence splitter that implements tokenization guidelines for German and English. It has a strong focus on web and social media texts: it was originally created as the winning submission to the EmpiriST 2015 shared task on automatic linguistic annotation of computer-mediated communication and social media. It is particularly well suited to tokenizing all kinds of written discourse, for example chats, forums, wiki talk pages, tweets, blog comments, social networks, SMS and WhatsApp dialogues. It also works on more formal texts.
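To give a rough idea of what rule-based tokenization of web and social media text involves, here is a minimal sketch in plain Python. It is not SoMaJo's implementation and does not use its API: the patterns, the `tokenize` function and the class names (`url`, `mention`, `hashtag`, `emoticon`, `word`, `symbol`) are all hypothetical and chosen for illustration only. The point it tries to show is that specific rules for URLs, mentions, hashtags and emoticons have to take precedence over generic word and punctuation rules.

```python
import re

# Illustrative patterns only; these are NOT SoMaJo's actual rules.
# Order matters: the more specific alternatives come first so that they
# win over the generic word and symbol alternatives.
TOKEN_PATTERN = re.compile(
    r"""
    (?P<url>https?://\S+)              # URLs stay in one piece
  | (?P<mention>@\w+)                  # @user mentions
  | (?P<hashtag>\#\w+)                 # hashtags
  | (?P<emoticon>[:;=][-^]?[()DPp])    # a few simple Western emoticons
  | (?P<word>\w+(?:[-']\w+)*)          # words with internal hyphens or apostrophes
  | (?P<symbol>[^\w\s])                # any other single non-space character
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Return a list of (token, token_class) pairs for the given text."""
    return [(m.group(), m.lastgroup) for m in TOKEN_PATTERN.finditer(text)]


if __name__ == "__main__":
    for token, token_class in tokenize("Schau mal @anna: https://example.org :-) #toll"):
        print(f"{token}\t{token_class}")
```

In this sketch the colon after `@anna` comes out as an ordinary symbol, while `:-)` is kept together as a single emoticon token. A naive split on whitespace and punctuation would break the emoticon and the URL apart. A production tokenizer needs far more rules than this, for instance for abbreviations, dates, or punctuation directly following a URL.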

More detailed documentation is available here.