Documentation

tsproisl · Nov 28, 2023 · c4bf0e7
1 parent d9a572e
Showing 2 changed files with 12 additions and 10 deletions.
diff --git a/README.md b/README.md
@@ -420,7 +420,7 @@ SoMaJo was the system with the highest average F₁ score in the
 EmpiriST 2015 shared task. The performance of the current version on
 the two test sets is summarized in the following table (Training and
 test sets are available from the [official
-website](https://sites.google.com/site/empirist2015/home/gold)):
+website](https://sites.google.com/site/empirist2015/gscl-shared-task-automatic-linguistic-annotation-of-computer-mediated-communication-social-media/gold-standard)):
 
 | Corpus | Precision | Recall | F₁    |
 |--------|-----------|--------|-------|
@@ -430,9 +430,9 @@ website](https://sites.google.com/site/empirist2015/home/gold)):
 
 ## Tokenizing English text
 
-Starting with version 1.8.0, SoMaJo can also tokenize English text. In
-general, we follow the “new” Penn Treebank conventions described, for
-example, in the guidelines for ETTB 2.0 [(Mott et al.,
+SoMaJo can also tokenize English text. In general, we follow the “new”
+Penn Treebank conventions described, for example, in the guidelines
+for ETTB 2.0 [(Mott et al.,
 2009)](https://web.archive.org/web/20110727133755/http://projects.ldc.upenn.edu/gale/task_specifications/ettb_guidelines.pdf)
 and CLEAR [(Warner et al.,
 2012)](https://clear.colorado.edu/compsem/documents/treebank_guidelines.pdf).

diff --git a/README.rst b/README.rst
@@ -1,14 +1,16 @@
 SoMaJo
 ======
 
-SoMaJo is a state-of-the-art tokenizer for German and English web and
-social media texts. It won the `EmpiriST 2015 shared task
+SoMaJo is a rule-based tokenizer and sentence splitter that implements
+tokenization guidelines for German and English. It has a strong focus
+on web and social media texts (it was originally created as the
+winning submission to the `EmpiriST 2015 shared task
 <https://sites.google.com/site/empirist2015/>`_ on automatic
 linguistic annotation of computer-mediated communication / social
-media. As such, it is particularly well-suited to perform tokenization
-on all kinds of written discourse, for example chats, forums, wiki
-talk pages, tweets, blog comments, social networks, SMS and WhatsApp
-dialogues.
+media) and is particularly well-suited to perform tokenization on all
+kinds of written discourse, for example chats, forums, wiki talk
+pages, tweets, blog comments, social networks, SMS and WhatsApp
+dialogues. Of course it also works on more formal texts.
 
 More detailed documentation is available `here
 <https://github.com/tsproisl/SoMaJo>`_.