Releases: amir-zeldes/gum
V7.1.0 - enhanced dependencies, consistency overhaul and more
(Note: this version contains the content-identical superset of annotations producing UD_English-GUM in Universal Dependencies V2.8)
- Massive round of consistency corrections and harmonization with English Web Treebank, PTB and OntoNotes
- Added enhanced dependencies
- More error validations
- Added multiword tokens to CoNLL-U format (caution: token IDs like `1-2` now in use!)
- Added reconstructed ellipsis tokens to CoNLL-U format (caution: token IDs like `8.1` now in use!)
- Added metadata to CoNLL-U files
- Better escape characters in Wikification
- ANNIS conversion support for null nodes to accommodate ellipsis tokens
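Because IDs such as `1-2` and `8.1` now appear in the CoNLL-U files, code that assumes every ID is a plain integer may break. The following is a minimal sketch of one way to tell the three ID shapes apart; the function names `classify_id` and `word_rows` and the sample rows are illustrative assumptions and are not part of GUM or its build bot.

```python
import re

# ID shapes defined by the CoNLL-U format:
#   "3"    ordinary word
#   "1-2"  multiword token spanning words 1 through 2
#   "8.1"  empty node, used here for reconstructed ellipsis tokens
WORD_ID = re.compile(r"\d+")
RANGE_ID = re.compile(r"\d+-\d+")
EMPTY_ID = re.compile(r"\d+\.\d+")


def classify_id(token_id):
    """Return 'word', 'multiword' or 'empty' for a CoNLL-U ID string."""
    if WORD_ID.fullmatch(token_id):
        return "word"
    if RANGE_ID.fullmatch(token_id):
        return "multiword"
    if EMPTY_ID.fullmatch(token_id):
        return "empty"
    raise ValueError(f"unrecognized CoNLL-U ID: {token_id!r}")


def word_rows(conllu_text):
    """Yield the column lists of ordinary word rows only."""
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and metadata comments
        cols = line.split("\t")
        if classify_id(cols[0]) == "word":
            yield cols


# Invented two-column rows, for illustration only (real rows have ten columns).
sample = "\n".join([
    "# text = don't",
    "1-2\tdon't",
    "1\tdo",
    "2\tn't",
    "2.1\tgo",
])
forms = [cols[1] for cols in word_rows(sample)]
```

Filtering on the ID shape first keeps later steps, such as converting IDs to integers, from failing on range or decimal IDs.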
V7.0.0 - new genres, Wikification and more
- 20 documents added from four new genres (total tokens: 150,756):
  - Face to face conversation (material from the Santa Barbara Corpus courtesy of John Du Bois, UCSB)
  - Political speeches (public domain data)
  - Open access text books from OpenStax
  - YouTube Creative Commons-licensed vlogs
- New Wikification layer covering all named entities, including nested and pronominal mentions (work by Yi-Ju Lin)
- Complete overhaul of date/time normalization (work by Nitin Venkateswaran)
- Added function labels to constituent trees
- Added addressee information for speakers in UD data
- Complete overhaul of entity and coreference annotations, incl. separate annotation of split antecedents (work by Yi-Ju Lin and Amir Zeldes)
- Increased consistency with other UD corpora, incl. new and more comprehensive morphological features
- Many corrections to all annotation layers
V6.2.0 - corrections and more consistency
- Massive corrections to entity and coreference annotation
- Massive corrections to TEI date/time annotations by @nitinvwaran
- Removed entity type `quantity` and manually collapsed to other underlying types
- Added infstat value `split` for split antecedent (edges are still `bridge`, but anaphor is now annotated explicitly, not just bridging + `giv`)
- Coreference infstat and entity type matching now automatically validated (no new with antecedent/given with no antecedent)
- src/tsv/ now contains only coref and bridge edges - all other edge types are derived:
  - Coref chains beginning with a non-first/second person pronoun have an initial `cata` edge in target/
  - Coref chain links between entities whose heads have the `appos` dependency in syntax receive the `appos` coref type automatically
  - Non-cataphoric links emanating from a pronoun receive `ana`
- Many improvements to syntactic dependency consistency, mostly consolidated with EWT annotation practices
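The edge-type derivation rules listed above for src/tsv/ can be pictured roughly as follows. This is only a sketch under stated assumptions: the `Link` record, its field names and the function `derive_edge_type` are hypothetical, and the order in which the rules are checked is a guess, since the release notes do not say which rule wins when several apply. It is not the build bot's actual code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Link:
    """Hypothetical record for one edge; field names are illustrative only."""
    stored_type: str                      # "coref" or "bridge", as kept in src/tsv/
    from_pronoun: bool = False            # the link emanates from a pronoun
    pronoun_person: Optional[int] = None  # 1, 2 or 3 when from_pronoun is True
    chain_initial: bool = False           # first link of a chain that begins with that pronoun
    heads_appos: bool = False             # entity heads are joined by an appos dependency


def derive_edge_type(link):
    """Derive the output edge type from a stored coref or bridge edge."""
    if link.stored_type != "coref":
        return link.stored_type  # bridge edges are passed through unchanged
    # Assumed precedence: cata, then appos, then ana, else plain coref.
    if link.chain_initial and link.from_pronoun and link.pronoun_person not in (1, 2):
        return "cata"
    if link.heads_appos:
        return "appos"
    if link.from_pronoun:
        return "ana"
    return "coref"
```

Keeping only `coref` and `bridge` in the source and computing the rest means the derived labels cannot drift out of sync with the syntax and pronoun annotations they depend on.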
V6.1.0 - corrections and bug fixes
Many corrections and fixes to the build bot
V6.0.0 - first release of GUM series 6
New in this version:
- 22 documents added (total tokens: 129,660)
- Discourse parses in Rhetorical Structure Theory now follow RST-DT guidelines
- 5 new discourse relations (means, manner, attribution, question and same-unit)
- Discourse dependency representation and lisp-style formats available
- Now using native Universal Dependencies syntax trees (not automatic conversion)
- Many manual corrections to lemmatization, POS and other consistency improvements
V5.1.0 - Final release of GUM series 5
Final release of GUM5, numerous corrections
- global overhaul of some lemmas
- fresh constituent parses
- corrections to all annotation layers
- this will be the last version of GUM with Stanford parses as a basis for dependencies (switching to UD as primary gold parses in V6)
V5.0.0 - first release of GUM series 5
New in version 5:
- New documents in academic, bio, fiction and reddit subcorpora
- Split bridging relations into 3 subtypes:
  - `bridge:aggr` - aggregate reference to multiple antecedents
  - `bridge:def` - definite entity introduced by bridging
  - `bridge:other` - all other cases of bridging
- Add `morph` layer based on UD morphology
- Add sentence type `multiple` for sentences coordinating multiple types; the type `other` now only includes sentences not falling into any other category
- Merge Stanford and UD parses for simultaneous queries in ANNIS/PAULA
- Separate coreference and bridging visualizations in ANNIS
V4.2.0 - Final release of GUM series 4
Final release of GUM series 4:
- Added s_type="multiple" for sentences containing multiple types (previously under "other")
- Standardized some @rend from "italics" to always "italic"
- Standardized hyphens/dashes in number ranges to have POS tag 'TO' (e.g. in ranges of years), matching the syntactic analysis
- Changed some inconsistent POS tags for IPA name pronunciation from FW to NP
- Added better imperative mood labeling to CoreNLP UD morph features based on manual s_type annotations
- Removed spurious spans in RST files and fixed some segmentations not conforming to guidelines
- Numerous assorted error corrections
V4.1.0 - corresponds to UD V2.2
Stable release V4.1.0 / V4.1.0nr
- Version number in top level folder suffixed with nr indicates Reddit data is not included in top level folders
- To build the complete V4.1.0 see README_reddit.md (source annotation data is included in _build/src/)
- This version contains the data which was used to generate Universal Dependencies release 2.2 of UniversalDependencies/UD_English-GUM
V4.0.1 - Minor build bot fixes
- Build can be run from other directories with relative path
- utf8 encoding specified for ud conversion
- Annotations identical to V4.0.0