Commit

Documentation
tsproisl committed Nov 28, 2023
1 parent d9a572e commit c4bf0e7
Showing 2 changed files with 12 additions and 10 deletions.
8 changes: 4 additions & 4 deletions README.md
@@ -420,7 +420,7 @@ SoMaJo was the system with the highest average F₁ score in the
EmpiriST 2015 shared task. The performance of the current version on
the two test sets is summarized in the following table (Training and
test sets are available from the [official
-website](https://sites.google.com/site/empirist2015/home/gold)):
+website](https://sites.google.com/site/empirist2015/gscl-shared-task-automatic-linguistic-annotation-of-computer-mediated-communication-social-media/gold-standard)):

| Corpus | Precision | Recall | F₁ |
|--------|-----------|--------|-------|
@@ -430,9 +430,9 @@ website](https://sites.google.com/site/empirist2015/home/gold)):

## Tokenizing English text

-Starting with version 1.8.0, SoMaJo can also tokenize English text. In
-general, we follow the “new” Penn Treebank conventions described, for
-example, in the guidelines for ETTB 2.0 [(Mott et al.,
+SoMaJo can also tokenize English text. In general, we follow the “new”
+Penn Treebank conventions described, for example, in the guidelines
+for ETTB 2.0 [(Mott et al.,
2009)](https://web.archive.org/web/20110727133755/http://projects.ldc.upenn.edu/gale/task_specifications/ettb_guidelines.pdf)
and CLEAR [(Warner et al.,
2012)](https://clear.colorado.edu/compsem/documents/treebank_guidelines.pdf).
14 changes: 8 additions & 6 deletions README.rst
@@ -1,14 +1,16 @@
SoMaJo
======

-SoMaJo is a state-of-the-art tokenizer for German and English web and
-social media texts. It won the `EmpiriST 2015 shared task
+SoMaJo is a rule-based tokenizer and sentence splitter that implements
+tokenization guidelines for German and English. It has a strong focus
+on web and social media texts (it was originally created as the
+winning submission to the `EmpiriST 2015 shared task
<https://sites.google.com/site/empirist2015/>`_ on automatic
linguistic annotation of computer-mediated communication / social
-media. As such, it is particularly well-suited to perform tokenization
-on all kinds of written discourse, for example chats, forums, wiki
-talk pages, tweets, blog comments, social networks, SMS and WhatsApp
-dialogues.
+media) and is particularly well-suited to perform tokenization on all
+kinds of written discourse, for example chats, forums, wiki talk
+pages, tweets, blog comments, social networks, SMS and WhatsApp
+dialogues. Of course it also works on more formal texts.

More detailed documentation is available `here
<https://github.com/tsproisl/SoMaJo>`_.
