LexNET for Noun Compound Relation Classification

This is a Tensorflow implementation of the LexNET algorithm for classifying relationships, specifically applied to classifying the relationships that hold between noun compounds:

olive oil is oil that is made from olives
cooking oil which is oil that is used for cooking
motor oil is oil that is contained in a motor

The model is a supervised classifier that predicts the relationship that holds between the constituents of a two-word noun compound using:

A neural "paraphrase" of each syntactic dependency path that connects the constituents in a large corpus. For example, given a sentence like This fine oil is made from first-press olives, the dependency path is something like oil <NSUBJPASS made PREP> from POBJ> olive.
The distributional information provided by the individual words; i.e., the word embeddings of the two consituents.
The distributional signal provided by the compound itself; i.e., the embedding of the noun compound in context.

The model includes several variants: path-based model uses (1) alone, the distributional model uses (2) alone, and the integrated model uses (1) and (2). The distributional-nc model and the integrated-nc model each add (3).

Training a model requires the following:

A collection of noun compounds that have been labeled using a relation inventory. The inventory describes the specific relationships that you'd like the model to differentiate (e.g. part of versus composed of versus purpose), and generally may consist of tens of classes. You can download the dataset used in the paper from here.
A collection of word embeddings: the path-based model uses the word embeddings as part of the path representation, and the distributional models use the word embeddings directly as prediction features.
The path-based model requires a collection of syntactic dependency parses that connect the constituents for each noun compound. To generate these, you'll need a corpus from which to train this data; we used Wikipedia and the LDC GigaWord5 corpora.

learn_path_embeddings.py is a script that trains and evaluates a path-based model to predict a noun-compound relationship given labeled noun-compounds and dependency parse paths.
learn_classifier.py is a script that trains and evaluates a classifier based on any combination of paths, word embeddings, and noun-compound embeddings.
get_indicative_paths.py is a script that generates the most indicative syntactic dependency paths for a particular relationship.

Also included are utilities for preparing data for training:

text_embeddings_to_binary.py converts a text file containing word embeddings into a binary file that is quicker to load.
extract_paths.py finds all the dependency paths that connect words in a corpus.
sorted_paths_to_examples.py processes the output of extract_paths.py to produce summarized training data.

This code (in particular, the utilities used to prepare the data) differs from the code that was used to prepare data for the paper. Notably, we used a proprietary dependency parser instead of spaCy, which is used here.

Dependencies

TensorFlow: see detailed installation instructions at that site.
SciKit Learn: you can probably just install this with pip install sklearn.
SpaCy: pip install spacy ought to do the trick, along with the English model.

Creating the Model

This sections described the steps necessary to create and evaluate the model described in the paper.

Generate Path Data

To begin, you need three text files:

Corpus. This file should contain natural language sentences, written with one sentence per line. For purposes of exposition, we'll assume that you have English Wikipedia serialized this way in ${HOME}/data/wiki.txt.
Labeled Noun Compound Pairs. This file contain (modfier, head, label) tuples, tab-separated, with one per line. The label represented the relationship between the head and the modifier; e.g., if purpose is one your labels, you could possibly include tooth<tab>paste<tab>purpose.
Word Embeddings. We used the GloVe word embeddings; in particular the 6B token, 300d variant. We'll assume you have this file as ${HOME}/data/glove.6B.300d.txt.

We first processed the embeddings from their text format into something that we can load a little bit more quickly:

./text_embeddings_to_binary.py \
  --input ${HOME}/data/glove.6B.300d.txt \
  --output_vocab ${HOME}/data/vocab.txt \
  --output_npy ${HOME}/data/glove.6B.300d.npy

Next, we'll extract all the dependency parse paths connecting our labeled pairs from the corpus. This process takes a looooong time, but is trivially parallelized using map-reduce if you have access to that technology.

./extract_paths.py \
  --corpus ${HOME}/data/wiki.txt \
  --labeled_pairs ${HOME}/data/labeled-pairs.tsv \
  --output ${HOME}/data/paths.tsv

The file it produces (paths.tsv) is a tab-separated file that contains the modifier, the head, the label, the encoded path, and the sentence from which the path was drawn. (This last is mostly for sanity checking.) A sample row might look something like this (where newlines would actually be tab characters):

navy
captain
owner_emp_use
<X>/PROPN/dobj/>::enter/VERB/ROOT/^::follow/VERB/advcl/<::in/ADP/prep/<::footstep/NOUN/pobj/<::of/ADP/prep/<::father/NOUN/pobj/<::bover/PROPN/appos/<::<Y>/PROPN/compound/<
He entered the Royal Navy following in the footsteps of his father Captain John Bover and two of his elder brothers as volunteer aboard HMS Perseus

This file must be sorted as follows:

sort -k1,3 -t$'\t' paths.tsv > sorted.paths.tsv

In particular, rows with the same modifier, head, and label must appear contiguously.

We next create a file that contains all the relation labels from our original labeled pairs:

awk 'BEGIN {FS="\t"} {print $3}' < ${HOME}/data/labeled-pairs.tsv \
  | sort -u > ${HOME}/data/relations.txt

With these in hand, we're ready to produce the train, validation, and test data:

./sorted_paths_to_examples.py \
   --input ${HOME}/data/sorted.paths.tsv \
   --vocab ${HOME}/data/vocab.txt \
   --relations ${HOME}/data/relations.txt \
   --splits ${HOME}/data/splits.txt \
   --output_dir ${HOME}/data

Here, splits.txt is a file that indicates which "split" (train, test, or validation) you want the pair to appear in. It should be a tab-separate file which conatins the modifier, head, and the dataset ( train, test, or val) into which the pair should be placed; e.g.,:

tooth <TAB> paste <TAB> train
banana <TAB> seat <TAB> test

The program will produce a separate file for each dataset split in the directory specified by --output_dir. Each file is contains tf.train.Example protocol buffers encoded using the TFRecord file format.

Create Path Embeddings

Now we're ready to train the path embeddings using learn_path_embeddings.py:

./learn_path_embeddings.py \
    --train ${HOME}/data/train.tfrecs.gz \
    --val ${HOME}/data/val.tfrecs.gz \
    --text ${HOME}/data/test.tfrecs.gz \
    --embeddings ${HOME}/data/glove.6B.300d.npy
    --relations ${HOME}/data/relations.txt
    --output ${HOME}/data/path-embeddings \
    --logdir /tmp/learn_path_embeddings

The path embeddings will be placed at the location specified by --output.

Train classifiers

Train classifiers and evaluate on the validation and test data using train_classifiers.py script. This shell script fragment will iterate through each dataset, split, corpus, and model type to train and evaluate classifiers.

LOGDIR=/tmp/learn_classifier
for DATASET in tratz/fine_grained tratz/coarse_grained ; do
  for SPLIT in random lexical_head lexical_mod lexical_full ; do
    for CORPUS in wiki_gigiawords ; do
      for MODEL in dist dist-nc path integrated integrated-nc ; do
        # Filename for the log that will contain the classifier results.
        LOGFILE=$(echo "${DATASET}.${SPLIT}.${CORPUS}.${MODEL}.log" | sed -e "s,/,.,g")
        python learn_classifier.py \
          --dataset_dir ~/lexnet/datasets \
          --dataset "${DATASET}" \
          --corpus "${SPLIT}/${CORPUS}" \
          --embeddings_base_path ~/lexnet/embeddings \
          --logdir ${LOGDIR} \
          --input "${MODEL}" > "${LOGDIR}/${LOGFILE}"
      done
    done
  done
done

The log file will contain the final performance (precision, recall, F1) on the train, dev, and test sets, and will include a confusion matrix for each.

Contact

If you have any questions, issues, or suggestions, feel free to contact either @vered1986 or @waterson.

If you use this code for any published research, please include the following citation:

Olive Oil Is Made of Olives, Baby Oil Is Made for Babies: Interpreting Noun Compounds Using Paraphrases in a Neural Model. Vered Shwartz and Chris Waterson. NAACL 2018. link.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

LexNET for Noun Compound Relation Classification

Contents

Dependencies

Creating the Model

Generate Path Data

Create Path Embeddings

Train classifiers

Contact

Files

README.md

Latest commit

History

README.md

File metadata and controls

LexNET for Noun Compound Relation Classification

Contents

Dependencies

Creating the Model

Generate Path Data

Create Path Embeddings

Train classifiers

Contact