Releases: amir-zeldes/gum

V7.1.0 - enhanced dependencies, consistency overhaul and more

05 May 19:14
4525197

(Note: the annotations in this version are a superset of those used to produce UD_English-GUM in Universal Dependencies V2.8, and the shared content is identical.)

  • Massive round of consistency corrections and harmonization with English Web Treebank, PTB and OntoNotes
  • Added enhanced dependencies
  • More error validations
  • Added multiword tokens to CoNLL-U format (caution: token IDs like 1-2 now in use!)
  • Added reconstructed ellipsis tokens to CoNLL-U format (caution: token IDs like 8.1 now in use!)
  • Added metadata to CoNLL-U files
  • Better escape characters in Wikification
  • ANNIS conversion support for null nodes to accommodate ellipsis tokens
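
Because token IDs such as 1-2 and 8.1 now appear in the CoNLL-U files, code that assumes every ID is a plain integer may break. The following is a minimal sketch of one way to tell the three ID shapes apart and keep only ordinary words. The function names and the sample sentence are invented for illustration and are not taken from GUM or its build bot.

```python
import re

# The three ID shapes in the CoNLL-U ID column:
WORD_ID = re.compile(r"\d+")         # ordinary word, e.g. "3"
RANGE_ID = re.compile(r"\d+-\d+")    # multiword token, e.g. "1-2"
EMPTY_ID = re.compile(r"\d+\.\d+")   # empty node (ellipsis token), e.g. "8.1"


def classify_id(token_id):
    """Return 'word', 'multiword' or 'empty' for a CoNLL-U ID string."""
    if WORD_ID.fullmatch(token_id):
        return "word"
    if RANGE_ID.fullmatch(token_id):
        return "multiword"
    if EMPTY_ID.fullmatch(token_id):
        return "empty"
    raise ValueError(f"unrecognized token ID: {token_id!r}")


def read_words(conllu_text):
    """Yield (id, form) for ordinary words only.

    Comment/metadata lines, blank lines, multiword ranges and
    empty nodes are skipped.
    """
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if classify_id(cols[0]) == "word":
            yield int(cols[0]), cols[1]


def _row(token_id, form):
    # Hypothetical helper: fills the remaining eight columns with "_".
    return "\t".join([token_id, form] + ["_"] * 8)


# Invented sample sentence (not from the corpus), with one multiword
# token (2-3) and one reconstructed ellipsis token (7.1).
SAMPLE = "\n".join([
    "# text = I don't like tea and you coffee",
    _row("1", "I"),
    _row("2-3", "don't"),
    _row("2", "do"),
    _row("3", "n't"),
    _row("4", "like"),
    _row("5", "tea"),
    _row("6", "and"),
    _row("7", "you"),
    _row("7.1", "like"),
    _row("8", "coffee"),
])

words = list(read_words(SAMPLE))
```

Whether to skip multiword ranges and empty nodes or to process them depends on the task; a tool that consumes enhanced dependencies, for example, would need to keep the empty nodes.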

V7.0.0 - new genres, Wikification and more

19 Jan 23:18
2a08bcc
  • 20 documents added from four new genres (total tokens: 150,756):
    • Face to face conversation (material from the Santa Barbara Corpus courtesy of John Du Bois, UCSB)
    • Political speeches (public domain data)
    • Open access textbooks from OpenStax
    • YouTube Creative Commons-licensed vlogs
  • New Wikification layer covering all named entities, including nested and pronominal mentions (work by Yi-Ju Lin)
  • Complete overhaul of date/time normalization (work by Nitin Venkateswaran)
  • Added function labels to constituent trees
  • Added addressee information for speakers in UD data
  • Complete overhaul of entity and coreference annotations, incl. separate annotation of split antecedents (work by Yi-Ju Lin and Amir Zeldes)
  • Increased consistency with other UD corpora, incl. new and more comprehensive morphological features
  • Many corrections to all annotation layers

V6.2.0 - corrections and more consistency

13 Nov 21:41
9704864
  • Massive corrections to entity and coreference annotation
  • Massive corrections to TEI date/time annotations by @nitinvwaran
  • Removed entity type quantity and manually collapsed to other underlying types
  • Added infstat value split for split antecedents (edges are still bridge, but the anaphor is now annotated explicitly rather than just as bridging + giv)
  • Coreference infstat and entity type matching are now validated automatically (no new mention with an antecedent, no given mention without one)
  • src/tsv/ now contains only coref and bridge edges - all other edge types are derived:
    • Coref chains beginning with a non-first/second person pronoun have an initial cata edge in target/
    • Coref chain links between entities whose heads have the appos dependency in syntax receive the appos coref type automatically
    • Non-cataphoric links emanating from a pronoun receive ana
  • Many improvements to syntactic dependency consistency, mostly consolidated with EWT annotation practices
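
The derivation of edge types from plain coref edges described above can be sketched roughly as follows. This is not the build bot's actual code: the `Mention` class, the function name, the order in which the rules are checked, and the reading of "heads have the appos dependency" as "the later mention's head carries the appos relation" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Mention:
    # Hypothetical, simplified view of one mention in a coref chain.
    head_deprel: str               # dependency relation of the head token
    is_pronoun: bool = False
    person: Optional[int] = None   # 1, 2 or 3 for pronouns, else None


def derive_link_types(chain: List[Mention]) -> List[str]:
    """Return one derived type per link between consecutive mentions.

    The chain is given in document order. Assumed rule precedence:
    cata, then appos, then ana, then plain coref.
    """
    types = []
    for i in range(1, len(chain)):
        earlier, later = chain[i - 1], chain[i]
        if i == 1 and earlier.is_pronoun and earlier.person not in (1, 2):
            # Chain begins with a non-first/second person pronoun.
            types.append("cata")
        elif later.head_deprel == "appos":
            # Assumption: appos on the later mention's head marks apposition.
            types.append("appos")
        elif later.is_pronoun:
            # Non-cataphoric link emanating from a pronoun.
            types.append("ana")
        else:
            types.append("coref")
    return types


# Invented chain: a third person pronoun first, then two nominal
# mentions (the second one appositive), a pronoun and another nominal.
chain = [
    Mention("nsubj", is_pronoun=True, person=3),
    Mention("nsubj"),
    Mention("appos"),
    Mention("obj", is_pronoun=True, person=3),
    Mention("obj"),
]
derived = derive_link_types(chain)
```

Under these assumptions only the first link of a chain can become cata, so a chain that opens with a first or second person pronoun falls through to the remaining rules.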

V6.1.0 - corrections and bug fixes

08 Jun 18:51
eb4f19f

Many corrections and fixes to the build bot

V6.0.0 - first release of GUM series 6

06 Mar 02:11
dc9e151

New in this version:

  • 22 documents added (total tokens: 129,660)
  • Discourse parses in Rhetorical Structure Theory now follow RST-DT guidelines
  • 5 new discourse relations (means, manner, attribution, question and same-unit)
  • Discourse dependency representation and lisp-style formats available
  • Now using native Universal Dependencies syntax trees (not automatic conversion)
  • Many manual corrections to lemmatization, POS and other consistency improvements

V5.1.0 - Final release of GUM series 5

31 Oct 14:12
449342a

Final release of GUM5, numerous corrections

  • Global overhaul of some lemmas
  • Fresh constituent parses
  • Corrections to all annotation layers
  • This will be the last version of GUM with Stanford parses as the basis for dependencies (switching to UD as primary gold parses in V6)

V5.0.0 - first release of GUM series 5

21 Mar 17:31
f764070

New in version 5:

  • New documents in academic, bio, fiction and reddit subcorpora
  • Split bridging relations into 3 subtypes:
    • bridge:aggr - aggregate reference to multiple antecedents
    • bridge:def - definite entity introduced by bridging
    • bridge:other - all other cases of bridging
  • Add morph layer based on UD morphology
  • Add sentence type multiple for sentences coordinating multiple types; the type other now only includes sentences not falling into any other category
  • Merge Stanford and UD parses for simultaneous queries in ANNIS/PAULA
  • Separate coreference and bridging visualizations in ANNIS

V4.2.0 - Final release of GUM series 4

20 Jan 01:37
69b3b83

Final release of GUM series 4:

  • Added s_type="multiple" for sentences containing multiple types (previously under "other")
  • Standardized some @rend from "italics" to always "italic"
  • Standardized hyphens/dashes in number ranges to have POS tag 'TO' (e.g. in ranges of years), matching the syntactic analysis
  • Changed some inconsistent POS tags for IPA name pronunciation from FW to NP
  • Added better imperative mood labeling to CoreNLP UD morph features based on manual s_type annotations
  • Removed spurious spans in RST files and fixed some segmentations not conforming to guidelines
  • Numerous assorted error corrections

V4.1.0 - corresponds to UD V2.2

16 May 20:34
c8c391a

Stable release V4.1.0 / V4.1.0nr

  • Version number in top level folder suffixed with nr indicates Reddit data is not included in top level folders
  • To build the complete V4.1.0 see README_reddit.md (source annotation data is included in _build/src/)
  • This version contains the data which was used to generate Universal Dependencies release 2.2 of UniversalDependencies/UD_English-GUM

V4.0.1 - Minor build bot fixes

03 Mar 21:51
9c2665e
  • Build can be run from other directories with relative path
  • UTF-8 encoding specified for UD conversion
  • Annotations identical to V4.0.0