Name		Name	Last commit message	Last commit date
parent directory ..
ace2005-dygie		ace2005-dygie
ace2005-en		ace2005-en
ace2005-oneie		ace2005-oneie
ace2005-zh		ace2005-zh
aggregation		aggregation
duee		duee
ere		ere
fewfc		fewfc
kbp		kbp
leven		leven
maven		maven
rams		rams
richere		richere
wikievents		wikievents
README.md		README.md
data_link.txt		data_link.txt
generate_mrc_prompt.py		generate_mrc_prompt.py
utils.py		utils.py

README.md

Convert the Dataset into Unified OmniEvent Format

To simplify subsequent data loading and modeling, we provide pre-processing scripts for commonly-used Event Extraction datasets. Users can download the dataset and convert it to the unified OmniEvent format by configuring the data path defined in the run.sh file under the folder with the same name as the dataset.

Unified OmniEvent Format

A unified OmniEvent dataset is a JSON Line file with the extension .unified.jsonl (such as, train.unified.jsonl, valid.unified.jsonl, and test.unified.jsonl), which is a convenient format for storing structured data that enables processing one record, in one line, at a time. Taking a record from TAC KBP 2016 as an example, a piece of data in the unified OmniEvent format could be demonstrated as follows:

{
    "id": "NYT_ENG_20130910.0002-6",
    "text": "In 1997 , Chun was sentenced to life in prison and Roh to 17 years .",
    "events": [{
        "type": "sentence",
        "triggers": [{
            "id": "em-2342",
            "trigger_word": "sentenced",
            "position": [19, 28], 
            "arguments": [{
                "role": "defendant",
                "mentions": [{
                    "id": "m-291",
                    "mention": "Chun",
                    "position": [10, 14]}]}, ... ]}, ... ]} ... ],
    "negative_triggers": [{
        "id": 0,
        "trigger_word": "In",
        "position": [0, 2]}, ... ], 
    "entities":  [{
        "type": "PER",
        "mentions": [{
            "id": "m-291",
            "mention": "Chun",
            "position": [10, 14]}, ... ]}, ... ]}

Supported Datasets

The pre-processing scripts support almost all commonly-used Event Extraction datasets, so as to minimize the data conversion difficulties. Additional pre-processing scripts are still being developed, and you can submit datasets for which you wish us to complete in "Pull requests". Currently, we have developed pre-processing scripts for the following datasets:

ACE2005: ACE2005-EN, ACE2005-DyGIE, ACE2005-OneIE, ACE2005-ZH
DuEE: DuEE1.0, DuEE-fin
ERE: LDC2015E29, LDC2015E68, LDC2015E78
FewFC
TAC KBP: TAC KBP 2014, TAC KBP 2015, TAC KBP 2016, TAC KBP 2017
LEVEN
MAVEN

Dataset Conversion

Step 1: Download the Dataset

The first step of data conversion is to download the proposed dataset from its corresponding website. For example, for the DuEE 1.0 dataset, it could be downloaded from here.

Step 2: Configure the Dataset Path

After downloading the dataset from the Internet, the run.sh file under the folder with the same name as the dataset should be configured. For example, for the DuEE 1.0 dataset, the run.sh file under the path scripts/data_preprocessing/duee should be configured, in which the data_dir path should be the same as the path of placing the downloaded dataset, you can also modify the path of the processed dataset by configuring the save_dir path:

python duee.py \
    --data_dir ../../../data/original/DuEE1.0 \
    --save_dir ../../../data/processed/DuEE1.0

Step 3: Execute the `run.sh` File

After downloading the dataset and configuring the corresponding run.sh file, finally, the dataset could finally be converted to the unified OmniEvent format by executing the configured run.sh file. For example, for the DuEE1.0 dataset, we could execute the run.sh file as follows:

bash run.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

data_processing

data_processing

README.md

Convert the Dataset into Unified OmniEvent Format

Unified OmniEvent Format

Supported Datasets

Dataset Conversion

Step 1: Download the Dataset

Step 2: Configure the Dataset Path

Step 3: Execute the `run.sh` File

Files

data_processing

Directory actions

More options

Directory actions

More options

Latest commit

History

data_processing

Folders and files

parent directory

README.md

Convert the Dataset into Unified OmniEvent Format

Unified OmniEvent Format

Supported Datasets

Dataset Conversion

Step 1: Download the Dataset

Step 2: Configure the Dataset Path

Step 3: Execute the run.sh File

Step 3: Execute the `run.sh` File