Showing 2 changed files with 12 additions and 10 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
 |  | @@ -1,14 +1,16 @@ |
1 | 1 | SoMaJo |
2 | 2 | ====== |
3 | 3 |  |
4 |  | - SoMaJo is a state-of-the-art tokenizer for German and English web and |
5 |  | - social media texts. It won the `EmpiriST 2015 shared task |
 | 4 | + SoMaJo is a rule-based tokenizer and sentence splitter that implements |
 | 5 | + tokenization guidelines for German and English. It has a strong focus |
 | 6 | + on web and social media texts (it was originally created as the |
 | 7 | + winning submission to the `EmpiriST 2015 shared task |
6 | 8 | <https://sites.google.com/site/empirist2015/>`_ on automatic |
7 | 9 | linguistic annotation of computer-mediated communication / social |
8 |  | - media. As such, it is particularly well-suited to perform tokenization |
9 |  | - on all kinds of written discourse, for example chats, forums, wiki |
10 |  | - talk pages, tweets, blog comments, social networks, SMS and WhatsApp |
11 |  | - dialogues. |
 | 10 | + media) and is particularly well-suited to perform tokenization on all |
 | 11 | + kinds of written discourse, for example chats, forums, wiki talk |
 | 12 | + pages, tweets, blog comments, social networks, SMS and WhatsApp |
 | 13 | + dialogues. Of course it also works on more formal texts. |
12 | 14 |  |
13 | 15 | More detailed documentation is available `here |
14 | 16 | <https://github.com/tsproisl/SoMaJo>`_. |