Merge pull request #78 from amir-zeldes/dev

V7.0.0
amir-zeldes · Jan 19, 2021 · 01f29e4
2 parents 9704864 + dd04099
commit 01f29e4

Showing 1,998 changed files with 4,647,499 additions and 3,671,909 deletions.
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,21 @@
+*.pt
+*.c
+*.so
+*.annis
+*.pyd
+*.DS_Store
+
+_build/*.pt*
+__pycache__/
+utils/__pycache__/
+utils/get_reddit/__pycache__/
+_build/build/
+_build/target/
+_build/utils/pepper/tmp/
+_build/utils/pepper/pepper_out.txt
+_build/utils/pepper/pepper_warning_out.txt
+_build/utils/get_reddit/*.txt
+annis/
+paula/
+_build/src/
+_build/target/
diff --git a/README.md b/README.md
@@ -1,7 +1,7 @@
 # GUM
 Repository for the Georgetown University Multilayer Corpus (GUM)
 
-This repository contains release versions of the Georgetown University Multilayer Corpus (GUM), a corpus of English texts from eight text types:
+This repository contains release versions of the Georgetown University Multilayer Corpus (GUM), a corpus of English texts from twelve written and spoken text types:
 
   * interviews
   * news
@@ -11,11 +11,15 @@ This repository contains release versions of the Georgetown University Multilaye
   * biographies
   * fiction
   * online forum discussions
+  * spontaneous face to face conversations
+  * political speeches
+  * textbooks
+  * vlogs
 
-The corpus is created as part of the course LING-367 (Computational Corpus Linguistics) at Georgetown University. For more details see: http://corpling.uis.georgetown.edu/gum.
+The corpus is created as part of the course LING-367 (Computational Corpus Linguistics) at Georgetown University. For more details see: https://corpling.uis.georgetown.edu/gum.
 
 ## A note about reddit data
-For one of the eight text types in this corpus, reddit forum discussions, plain text data is not supplied. To obtain this data, please run `_build/process_reddit.py`, then `run _build/build_gum.py`. This, and all data, is provided with absolutely no warranty; users agree to use the data under the license with which it is provided, and reddit data is subject to reddit's terms and conditions. See [README_reddit.md](README_reddit.md) for more details.
+For one of the twelve text types in this corpus, reddit forum discussions, plain text data is not supplied. To obtain this data, please run `_build/process_reddit.py`, then `run _build/build_gum.py`. This, and all data, is provided with absolutely no warranty; users agree to use the data under the license with which it is provided, and reddit data is subject to reddit's terms and conditions. See [README_reddit.md](README_reddit.md) for more details.
 
 ## Citing
 To cite this corpus, please refer to the following article:
@@ -35,6 +39,8 @@ Zeldes, Amir (2017) "The GUM Corpus: Creating Multilayer Resources in the Classr
 }
 ```
 
+For a full list of contributors please see [the corpus website](https://corpling.uis.georgetown.edu/gum).
+
 ## Directories
 The corpus is downloadable in multiple formats. Not all formats contain all annotations. The most complete XML representation is in [PAULA XML](https://www.sfb632.uni-potsdam.de/en/paula.html), and the easiest way to search in the corpus is using [ANNIS](http://corpus-tools.org/annis). Other formats may be useful for other purposes. See website for more details.