To simplify subsequent data loading and modeling, we provide pre-processing scripts for commonly used Event Extraction
datasets. Users can download a dataset and convert it to the unified OmniEvent format by configuring the data path
defined in the run.sh file under the folder with the same name as the dataset.
A unified OmniEvent dataset is a JSON Lines file with the extension .unified.jsonl (for example, train.unified.jsonl,
valid.unified.jsonl, and test.unified.jsonl). JSON Lines is a convenient format for storing structured data because
each line holds exactly one record, so records can be processed one at a time. Taking a record from TAC KBP 2016 as an
example, a piece of data in the unified OmniEvent format looks as follows:
{
    "id": "NYT_ENG_20130910.0002-6",
    "text": "In 1997 , Chun was sentenced to life in prison and Roh to 17 years .",
    "events": [{
        "type": "sentence",
        "triggers": [{
            "id": "em-2342",
            "trigger_word": "sentenced",
            "position": [19, 28],
            "arguments": [{
                "role": "defendant",
                "mentions": [{
                    "id": "m-291",
                    "mention": "Chun",
                    "position": [10, 14]}]}, ... ]}, ... ]}, ... ],
    "negative_triggers": [{
        "id": 0,
        "trigger_word": "In",
        "position": [0, 2]}, ... ],
    "entities": [{
        "type": "PER",
        "mentions": [{
            "id": "m-291",
            "mention": "Chun",
            "position": [10, 14]}, ... ]}, ... ]}
The pre-processing scripts cover most commonly used Event Extraction datasets, so as to minimize the effort of data conversion. Additional pre-processing scripts are still being developed, and you can request support for further datasets via "Pull requests". Currently, pre-processing scripts are available for the following datasets:
- ACE2005: ACE2005-EN, ACE2005-DyGIE, ACE2005-OneIE, ACE2005-ZH
- DuEE: DuEE1.0, DuEE-fin
- ERE: LDC2015E29, LDC2015E68, LDC2015E78
- FewFC
- TAC KBP: TAC KBP 2014, TAC KBP 2015, TAC KBP 2016, TAC KBP 2017
- LEVEN
- MAVEN
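To make the unified record format shown earlier in this section more concrete, the following Python sketch parses a trimmed copy of that example record (the parts elided with "..." are simply left out) and checks that every position pair reproduces the annotated string when used as a character span into text. The helper name iter_spans is hypothetical and not part of OmniEvent, and reading position as a half-open [start, end) character range is an inference from the example, not a documented guarantee.

```python
import json

# A trimmed copy of the example record; the parts elided with "..." are left out.
record_line = json.dumps({
    "id": "NYT_ENG_20130910.0002-6",
    "text": "In 1997 , Chun was sentenced to life in prison and Roh to 17 years .",
    "events": [{
        "type": "sentence",
        "triggers": [{
            "id": "em-2342",
            "trigger_word": "sentenced",
            "position": [19, 28],
            "arguments": [{
                "role": "defendant",
                "mentions": [{"id": "m-291", "mention": "Chun", "position": [10, 14]}],
            }],
        }],
    }],
    "negative_triggers": [{"id": 0, "trigger_word": "In", "position": [0, 2]}],
    "entities": [{
        "type": "PER",
        "mentions": [{"id": "m-291", "mention": "Chun", "position": [10, 14]}],
    }],
})


def iter_spans(record):
    """Yield (kind, surface_string, position) for every annotated span (hypothetical helper)."""
    for event in record.get("events", []):
        for trigger in event["triggers"]:
            yield "trigger", trigger["trigger_word"], trigger["position"]
            for argument in trigger.get("arguments", []):
                for mention in argument["mentions"]:
                    yield "argument", mention["mention"], mention["position"]
    for negative in record.get("negative_triggers", []):
        yield "negative_trigger", negative["trigger_word"], negative["position"]
    for entity in record.get("entities", []):
        for mention in entity["mentions"]:
            yield "entity", mention["mention"], mention["position"]


# One line of a .unified.jsonl file is one JSON object.
record = json.loads(record_line)
spans = list(iter_spans(record))

# Collect spans whose offsets do not reproduce the annotated surface string.
mismatches = [
    (kind, surface, position)
    for kind, surface, position in spans
    if record["text"][position[0]:position[1]] != surface
]
print(len(spans), "spans checked,", len(mismatches), "mismatches")
```

A check of this kind can be useful after converting a new dataset, since offset errors are otherwise easy to miss.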
The first step of data conversion is to download the desired dataset from its corresponding website. For example, the DuEE 1.0 dataset can be downloaded from here.
After downloading the dataset, configure the run.sh file under the folder with the same name as the dataset. For
example, for the DuEE 1.0 dataset, the run.sh file under the path scripts/data_preprocessing/duee should be
configured: the data_dir path must point to the location of the downloaded dataset. You can also change where the
processed dataset is written by configuring the save_dir path:
python duee.py \
--data_dir ../../../data/original/DuEE1.0 \
--save_dir ../../../data/processed/DuEE1.0
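The paths in this command are relative, so where they point depends on the directory the script is run from. As a small sketch, assuming the command is executed from scripts/data_preprocessing/duee (the folder that holds run.sh), the snippet below normalizes both paths relative to the repository root. The helper resolve_from_script_dir is hypothetical and only illustrates the path arithmetic; it is not part of OmniEvent.

```python
import posixpath

# Folder that holds run.sh and duee.py, relative to the repository root.
# Assumption: the command is executed from inside this folder.
script_dir = "scripts/data_preprocessing/duee"


def resolve_from_script_dir(relative_path):
    """Normalize a path that is given relative to script_dir (hypothetical helper)."""
    return posixpath.normpath(posixpath.join(script_dir, relative_path))


data_dir = resolve_from_script_dir("../../../data/original/DuEE1.0")
save_dir = resolve_from_script_dir("../../../data/processed/DuEE1.0")
print(data_dir)
print(save_dir)
```

Under that assumption, the downloaded files would be expected under data/original/DuEE1.0 at the repository root, and the converted files under data/processed/DuEE1.0.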
After downloading the dataset and configuring the corresponding run.sh file, the dataset can be converted to the
unified OmniEvent format by executing the configured run.sh file. For example, for the DuEE1.0 dataset, we can
execute the run.sh file as follows:
bash run.sh
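Once the conversion has finished, a quick sanity check of the output can be helpful. The sketch below counts the records and event types in one .unified.jsonl file. The helper summarize_split is hypothetical, and the two demo records are placeholders written to a temporary file so the snippet runs on its own; in practice, the path would point at a file such as train.unified.jsonl inside the configured save_dir.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def summarize_split(path):
    """Count records and event types in one .unified.jsonl file (hypothetical helper)."""
    num_records = 0
    event_types = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)  # one JSON object per line
            num_records += 1
            # Use .get in case a split carries no "events" field.
            for event in record.get("events", []):
                event_types[event["type"]] += 1
    return num_records, event_types


# Demonstration on a tiny stand-in file with placeholder records.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "train.unified.jsonl"
    demo_records = [
        {"id": "a", "text": "x", "events": [{"type": "sentence", "triggers": []}]},
        {"id": "b", "text": "y", "events": []},
    ]
    demo_path.write_text(
        "\n".join(json.dumps(r) for r in demo_records) + "\n", encoding="utf-8"
    )
    num_records, event_types = summarize_split(demo_path)

print(num_records, dict(event_types))
```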