Newseye dataset


Original dataset
Document type	newspaper (mid-19C to mid 20C)
Languages	Finnish, French, German, Swedish
Annotation guidelines
Annotation tool	transkribus
Original format and tagging scheme	`.tsv, IOB`
Annotations	NERC, EL (towards Wikidata, unknown dump)
Version (used in HIPE-2022)	v1.0
Related publication	A Multilingual Dataset for Named Entity Recognition, Entity Linking and Stance Detection in Historical Newspapers
License

Coarse-grained tagset	Fine-grained tagset	Nesting applies	Linking applies
PER	PERS.author	yes	yes
LOC	-	yes	yes
ORG	-	yes	yes
HumanProd	-	yes	yes

The newseye dataset can be used for:

Tasks: NERC-Coarse, NERC-Fine, NEL.
Challenges: Multilingual Newspaper Coarse, Multilingual Newspaper Fine, Global Adaptation Coarse.

Annotation guidelines: mostly compatible with hipe2020 dataset (except for the consideration of metonymic sense for entity linking, see participation guidelines p. 12).
Documents: newseye documents corresponds to newspaper pages.
Document IDs: document ID were added during HIPE-2022 conversion and are composed of: split_language_docNumber (in file).
Sentence splitting: performed automatically on OCRed text (performances not perfect).
Entity linking and metonymic sense: only one linking annotation exists per linked entity.
Known glitches:
- 309 lines/tokens across all datasets in all languages contain the ### subword marker.
- at conversion time, some missing tab (\t) and malformed Wikidata QID were manually corrected.

HIPE-2022 v2.0 release notes

In the document metadata of each file, the value of # hipe2022:document_id has (sligtly) changed: the number in train_[lg]_[nb] is now the sequence number of the document in the file (it was by mistake the token sequential number in previous release). Newseye orginal data does not have document id, it is an addition from HIPE-2022.
The train set of Newseye German data contained quite some parts with no annotations. Therefore we removed the following documents from the file HIPE-2022-v2.0-newseye-train-de.tsv:
- documents train_de_3, train_de_4 and train_de_7, which do not have any NE annotation;
- document train_de_6, which has less than 20 entities for 10,000 tokens.

Also, please note that train_de_10 has a large chunk of ca. 100,000 non-annotated tokens at the beginning.

HIPE-2022 v1.0 release notes

Some parts (mainly tables) of some German documents seem to be missing annotations. This will be fixed for the next HIPE-2022 release.

Provide feedback