Commit

Merge pull request #78 from amir-zeldes/dev
V7.0.0
amir-zeldes authored Jan 19, 2021
2 parents 9704864 + dd04099 commit 01f29e4
Showing 1,998 changed files with 4,647,499 additions and 3,671,909 deletions.
21 changes: 21 additions & 0 deletions .gitignore
@@ -0,0 +1,21 @@
*.pt
*.c
*.so
*.annis
*.pyd
*.DS_Store

_build/*.pt*
__pycache__/
utils/__pycache__/
utils/get_reddit/__pycache__/
_build/build/
_build/target/
_build/utils/pepper/tmp/
_build/utils/pepper/pepper_out.txt
_build/utils/pepper/pepper_warning_out.txt
_build/utils/get_reddit/*.txt
annis/
paula/
_build/src/
_build/target/
12 changes: 9 additions & 3 deletions README.md
@@ -1,7 +1,7 @@
# GUM
Repository for the Georgetown University Multilayer Corpus (GUM)

-This repository contains release versions of the Georgetown University Multilayer Corpus (GUM), a corpus of English texts from eight text types:
+This repository contains release versions of the Georgetown University Multilayer Corpus (GUM), a corpus of English texts from twelve written and spoken text types:

* interviews
* news
@@ -11,11 +11,15 @@ This repository contains release versions of the Georgetown University Multilaye
* biographies
* fiction
* online forum discussions
+* spontaneous face to face conversations
+* political speeches
+* textbooks
+* vlogs

-The corpus is created as part of the course LING-367 (Computational Corpus Linguistics) at Georgetown University. For more details see: http://corpling.uis.georgetown.edu/gum.
+The corpus is created as part of the course LING-367 (Computational Corpus Linguistics) at Georgetown University. For more details see: https://corpling.uis.georgetown.edu/gum.

## A note about reddit data
-For one of the eight text types in this corpus, reddit forum discussions, plain text data is not supplied. To obtain this data, please run `_build/process_reddit.py`, then `run _build/build_gum.py`. This, and all data, is provided with absolutely no warranty; users agree to use the data under the license with which it is provided, and reddit data is subject to reddit's terms and conditions. See [README_reddit.md](README_reddit.md) for more details.
+For one of the twelve text types in this corpus, reddit forum discussions, plain text data is not supplied. To obtain this data, please run `_build/process_reddit.py`, then `run _build/build_gum.py`. This, and all data, is provided with absolutely no warranty; users agree to use the data under the license with which it is provided, and reddit data is subject to reddit's terms and conditions. See [README_reddit.md](README_reddit.md) for more details.
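
Editorial note, not part of the committed README: the two-step reddit restore described above could be scripted roughly as follows. This is a minimal sketch, not the project's own tooling. Only the two script names come from the README text; running them with the current Python interpreter, from inside `_build/`, and with no command-line arguments are all assumptions, and `build_commands` and `restore_reddit_data` are hypothetical helper names. `README_reddit.md` remains the authoritative description of the procedure.

```python
import subprocess
import sys
from pathlib import Path

# Script names are taken from the README; the order matters
# (fetch the reddit text first, then rebuild the corpus).
REDDIT_STEPS = ["process_reddit.py", "build_gum.py"]


def build_commands(repo_root):
    """Return (argv, working_directory) pairs for the two steps.

    Assumption: each script is run with the current interpreter,
    from inside _build/, and without extra arguments.
    """
    build_dir = Path(repo_root) / "_build"
    return [([sys.executable, script], build_dir) for script in REDDIT_STEPS]


def restore_reddit_data(repo_root):
    """Run both steps in order, stopping at the first failure."""
    for argv, cwd in build_commands(repo_root):
        subprocess.run(argv, cwd=cwd, check=True)
```

Calling `restore_reddit_data(".")` from a checkout of the repository would run both scripts in sequence; `check=True` makes a failure in the first step abort before the build step starts.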

## Citing
To cite this corpus, please refer to the following article:
@@ -35,6 +39,8 @@ Zeldes, Amir (2017) "The GUM Corpus: Creating Multilayer Resources in the Classr
}
```

+For a full list of contributors please see [the corpus website](https://corpling.uis.georgetown.edu/gum).

## Directories
The corpus is downloadable in multiple formats. Not all formats contain all annotations. The most complete XML representation is in [PAULA XML](https://www.sfb632.uni-potsdam.de/en/paula.html), and the easiest way to search in the corpus is using [ANNIS](http://corpus-tools.org/annis). Other formats may be useful for other purposes. See website for more details.
