Original dataset | DOI not available yet |
Document type | newspaper (mid-19C to mid-20C) |
Languages | German |
Annotation guidelines | |
Annotation tool | neat |
Original format and tagging scheme | .tsv, IOB |
Annotations | NERC, EL |
Version (used in HIPE-2022) | v1.0 with many corrections from the HIPE 2022 team |
Related publication | Named Entity Linking mit Wikidata und GND |
License |
Coarse-grained tagset | Fine-grained tagset | Nesting applies | Linking applies |
---|---|---|---|
PER | - | no | yes |
LOC | - | no | yes |
ORG | - | no | yes |
N.B. SoNAR guidelines indicate other possible entity types but they are not present in the data.
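As a quick sanity check on the tagset above, the coarse entity types can be tallied from the IOB tags. The following is a minimal sketch, assuming a tab-separated file whose first column is the token and whose second column is the coarse IOB tag, with comment lines starting with `#`. The column layout, the `count_entity_types` helper and the sample rows are illustrative assumptions, not part of the release.

```python
import csv
import io
from collections import Counter


def count_entity_types(tsv_text, tag_column=1):
    """Count entity mentions per type, treating each B- tag as one mention."""
    counts = Counter()
    reader = csv.reader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    for row in reader:
        # Skip blank lines and comment lines (assumed to start with '#').
        if not row or row[0].startswith("#"):
            continue
        # Skip rows that are too short to hold a tag.
        if len(row) <= tag_column:
            continue
        tag = row[tag_column]
        if tag.startswith("B-"):
            counts[tag[2:]] += 1
    return counts


# Made-up sample rows, for illustration only.
SAMPLE = "\n".join([
    "# hypothetical comment line",
    "Herr\tO",
    "Bismarck\tB-PER",
    "in\tO",
    "Berlin\tB-LOC",
    "Deutsche\tB-ORG",
    "Bank\tI-ORG",
])

counts = count_entity_types(SAMPLE)
# Any type outside the three listed in the table would show up here.
unexpected = set(counts) - {"PER", "LOC", "ORG"}
print(dict(counts), unexpected)
```

On the real files, pass the file contents and adjust `tag_column` to wherever the coarse tag actually sits.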
The sonar dataset can be used for:
- Tasks: NERC-Coarse, NEL.
- Challenges: Multilingual Newspaper Coarse, Global Adaptation Coarse.
Please note that the SoNAR dataset has been revised by the HIPE team (see release notes below).
- Documents: sonar documents correspond to newspaper articles (UPDATE)
- Train set: there is no training set for this dataset, only a dev set that is representative of the test set in terms of newspapers and periods.
- Sentence splitting: performed automatically on OCRed text, so sentence boundaries are not always correct (UPDATE).
- Entity linking and metonymic sense: only one linking annotation exists per linked entity.
- Known glitches:
  - no embedded entities are annotated (although the guidelines specify them), which can sometimes lead to inconsistent annotations for complex entities (e.g. company names with a person or location inside)
  - the annotation guidelines were not strictly followed in the original version v1.0
  - the original NEL information was not manually disambiguated; the HIPE 2022 team corrected it according to our understanding of the intended SoNAR annotations.
  - the original data contained discontinuous NER spans, probably due to retokenization that treated all punctuation symbols as separate tokens (including abbreviation periods)
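One way a discontinuous span can surface in IOB-tagged data is as an `I-` tag that does not continue a span of the same type, for example when an abbreviation period inside a name was tagged `O`. The sketch below flags such positions. It is only an illustration under that assumption: the `find_dangling_inside_tags` helper and the sample tags are hypothetical, and this is not the procedure the HIPE team used.

```python
def find_dangling_inside_tags(tags):
    """Return indices of I- tags that do not continue a span of the same type."""
    dangling = []
    previous_type = None  # entity type of the previous token, None if outside
    for index, tag in enumerate(tags):
        if tag.startswith("B-"):
            previous_type = tag[2:]
        elif tag.startswith("I-"):
            # An I- tag is well-formed only right after B-/I- of the same type.
            if previous_type != tag[2:]:
                dangling.append(index)
            previous_type = tag[2:]
        else:
            previous_type = None
    return dangling


# Made-up sequence: the PER span is interrupted by an "O" at index 2.
tags = ["B-PER", "I-PER", "O", "I-PER", "O", "B-LOC"]
dangling = find_dangling_inside_tags(tags)
print(dangling)  # [3]
```

Flagged positions are candidates for manual inspection only; a dangling `I-` tag can also come from a plain annotation slip.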
HIPE-2022 v2.1-test release notes
- test set: EL annotations have been extensively and thoroughly revised (xx QID links were corrected)
- link to release v2.1-test.
HIPE-2022 v2.1 hotfix release notes
- dev set: EL annotation was in the wrong column (NE-NESTED) instead of NE-LIT. Fixed with the hotfix commit cc57462.
- dev set: empty tokens removed
- link to release v2.1-test_allmasked%2Bsonar_hotfix
HIPE-2022 v2.0 release notes
- EL annotation is now part of the release.
- A thorough revision of the NER and NEL annotations has been done by the HIPE team.
- Due to time limits, we could not revise all dev set pages completely to the end of the page. The material that could not be revised was removed from the dev set. As a result, the dev set contains a bit less material than before, but it is fully revised.
- link to release v2.0
HIPE-2022 v1.0 release notes
- EL annotations are not present in the current sonar file and will be added in the next release.
- link to release v1.0