V7.0.0 - new genres, Wikification and more

@amir-zeldes amir-zeldes released this 19 Jan 23:18
2a08bcc
  • 20 documents added from four new genres (total tokens: 150,756):
    • Face-to-face conversation (material from the Santa Barbara Corpus courtesy of John Du Bois, UCSB)
    • Political speeches (public domain data)
    • Open-access textbooks from OpenStax
    • YouTube Creative Commons-licensed vlogs
  • New Wikification layer covering all named entities, including nested and pronominal mentions (work by Yi-Ju Lin)
  • Complete overhaul of date/time normalization (work by Nitin Venkateswaran)
  • Added function labels to constituent trees
  • Added addressee information for speakers in UD data
  • Complete overhaul of entity and coreference annotations, incl. separate annotation of split antecedents (work by Yi-Ju Lin and Amir Zeldes)
  • Increased consistency with other UD corpora, incl. new and more comprehensive morphological features
  • Many corrections to all annotation layers