Convert the Dataset into Unified OmniEvent Format

To simplify subsequent data loading and modeling, we provide pre-processing scripts for commonly used Event Extraction datasets. After downloading a dataset, you can convert it to the unified OmniEvent format by setting the data paths in the run.sh file located in the folder named after that dataset.

Unified OmniEvent Format

A unified OmniEvent dataset is a JSON Lines file with the extension .unified.jsonl (for example, train.unified.jsonl, valid.unified.jsonl, and test.unified.jsonl). JSON Lines stores one record per line, so structured data can be processed one record at a time. Taking a record from TAC KBP 2016 as an example, a piece of data in the unified OmniEvent format looks as follows:

{
    "id": "NYT_ENG_20130910.0002-6",
    "text": "In 1997 , Chun was sentenced to life in prison and Roh to 17 years .",
    "events": [{
        "type": "sentence",
        "triggers": [{
            "id": "em-2342",
            "trigger_word": "sentenced",
            "position": [19, 28], 
            "arguments": [{
                "role": "defendant",
                "mentions": [{
                    "id": "m-291",
                    "mention": "Chun",
                    "position": [10, 14]}]}, ... ]}, ... ]} ... ],
    "negative_triggers": [{
        "id": 0,
        "trigger_word": "In",
        "position": [0, 2]}, ... ], 
    "entities":  [{
        "type": "PER",
        "mentions": [{
            "id": "m-291",
            "mention": "Chun",
            "position": [10, 14]}, ... ]}, ... ]}

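Judging from the example above, each position field appears to hold character offsets [start, end) into text (for instance, characters 19 to 28 spell "sentenced"). The following minimal sketch reads a unified file one record at a time and reports spans whose offsets do not reproduce their surface text. The function names read_unified and find_span_mismatches are hypothetical helpers, not part of OmniEvent, and the offset convention is an assumption inferred from this single example.

```python
import json
from typing import Iterator


def read_unified(path: str) -> Iterator[dict]:
    """Yield one record per non-empty line of a .unified.jsonl file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def find_span_mismatches(record: dict) -> list:
    """Return (id, expected, actual) for every span whose offsets disagree with its text.

    Assumes position is a [start, end) pair of character offsets into record["text"].
    """
    text = record["text"]
    mismatches = []

    def check(span_id, surface, position):
        start, end = position
        if text[start:end] != surface:
            mismatches.append((span_id, surface, text[start:end]))

    # Event triggers and the argument mentions nested under them.
    for event in record.get("events", []):
        for trigger in event.get("triggers", []):
            check(trigger["id"], trigger["trigger_word"], trigger["position"])
            for argument in trigger.get("arguments", []):
                for mention in argument.get("mentions", []):
                    check(mention["id"], mention["mention"], mention["position"])

    # Negative triggers share the trigger layout but have no arguments.
    for trigger in record.get("negative_triggers", []):
        check(trigger["id"], trigger["trigger_word"], trigger["position"])

    # Entity mentions.
    for entity in record.get("entities", []):
        for mention in entity.get("mentions", []):
            check(mention["id"], mention["mention"], mention["position"])

    return mismatches
```

An empty list from find_span_mismatches means every annotated span in the record lines up with the text under the assumed convention.
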
Supported Datasets

The pre-processing scripts cover most commonly used Event Extraction datasets, to minimize the effort of data conversion. Additional pre-processing scripts are still being developed; if you would like us to support another dataset, you can request it under "Pull requests". Currently, we have developed pre-processing scripts for the following datasets:

  • ACE2005: ACE2005-EN, ACE2005-DyGIE, ACE2005-OneIE, ACE2005-ZH
  • DuEE: DuEE1.0, DuEE-fin
  • ERE: LDC2015E29, LDC2015E68, LDC2015E78
  • FewFC
  • TAC KBP: TAC KBP 2014, TAC KBP 2015, TAC KBP 2016, TAC KBP 2017
  • LEVEN
  • MAVEN

Dataset Conversion

Step 1: Download the Dataset

The first step of data conversion is to download the dataset you want to convert from its corresponding website. For example, the DuEE 1.0 dataset can be downloaded from here.

Step 2: Configure the Dataset Path

After downloading the dataset, configure the run.sh file in the folder named after the dataset. For example, for the DuEE 1.0 dataset, edit the run.sh file under scripts/data_preprocessing/duee: set data_dir to the directory where the downloaded dataset is stored. You can also change where the processed dataset is written by setting save_dir:

python duee.py \
    --data_dir ../../../data/original/DuEE1.0 \
    --save_dir ../../../data/processed/DuEE1.0

Step 3: Execute the run.sh File

After downloading the dataset and configuring the corresponding run.sh file, convert the dataset to the unified OmniEvent format by executing the configured run.sh file. For example, for the DuEE 1.0 dataset, execute the run.sh file as follows:

bash run.sh
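
Once the script finishes, a quick sanity check of the output can be useful. The sketch below counts records and triggers per event type in every .unified.jsonl file it finds in a directory. The helpers summarize_split and summarize_dir are hypothetical names, not part of OmniEvent, and the sketch assumes that the processed files are written directly into save_dir and that a split without annotations may simply lack the events field.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_split(path):
    """Return (number of records, Counter of triggers per event type) for one file."""
    n_records = 0
    type_counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            n_records += 1
            # Records without an "events" field contribute no trigger counts.
            for event in record.get("events", []):
                type_counts[event["type"]] += len(event.get("triggers", []))
    return n_records, type_counts


def summarize_dir(save_dir):
    """Summarize every *.unified.jsonl file found directly in save_dir."""
    files = sorted(Path(save_dir).glob("*.unified.jsonl"))
    return {p.name: summarize_split(p) for p in files}
```

For the configuration shown in Step 2, calling summarize_dir("../../../data/processed/DuEE1.0") from the script folder would return one entry per converted split; an empty dictionary suggests the conversion wrote its files somewhere else.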