Petr Baudis edited this page Oct 8, 2015 · 9 revisions

Knowledge Base Relations

When we are querying a structured knowledge base, whether based on raw question representation or a logical form, we need to map question terminology to the actual graph relations in the KB.

This concerns two specific problems: first, mapping natural language vocabulary to relations; second, finding template subgraphs that capture constraints (such as co-occurrence with another entity, yielding only the "first" entity under a particular ordering, or yielding the count of entities).

Dataset

This is a big TODO.

Baseline

Currently, we use two approaches at once.

Trivial Baseline

First, we produce answers from all the immediate relations of a concept. Some vocabulary mapping is done by assigning each answer the LAT based on the relation name. This is an "emergency" solution.

FBPath Baseline

Second, we produce answers from specific (fixed-label) relation paths. The vocabulary is therefore fixed, and the template subgraph is just a 1- or 2-entity path. The paths are predicted by a logistic regression based multi-label classifier that uses (a few) lexical question features. This is based on Yao: Lean question answering over Freebase from scratch (2015).
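The idea of a multi-label classifier over lexical features can be illustrated with a minimal sketch. This is not the YodaQA implementation: the unigram features, the toy training questions, and the function names (features, train, predict) are assumptions made for illustration only. Each relation path gets its own one-vs-rest logistic regression, trained by plain stochastic gradient ascent.

```python
import math

# Hypothetical toy training data: question -> set of gold relation paths.
TRAIN = [
    ("who directed alien", {("/film/film/directed_by",)}),
    ("who directed blade runner", {("/film/film/directed_by",)}),
    ("who starred in alien", {("/film/film/starring", "/film/performance/actor")}),
    ("who starred in heat", {("/film/film/starring", "/film/performance/actor")}),
]

def features(question):
    # Assumed lexical features: lowercase unigrams only.
    return {"w=" + tok for tok in question.lower().split()}

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def score(w, feats):
    # Probability that one path label applies to the question.
    return sigmoid(w.get("bias", 0.0) + sum(w.get(f, 0.0) for f in feats))

def train(examples, epochs=50, lr=0.5):
    """One-vs-rest logistic regression: one weight dict per path label."""
    labels = sorted({path for _, gold in examples for path in gold})
    weights = {label: {} for label in labels}
    for _ in range(epochs):
        for question, gold in examples:
            feats = features(question)
            for label in labels:
                w = weights[label]
                grad = (1.0 if label in gold else 0.0) - score(w, feats)
                for f in feats | {"bias"}:
                    w[f] = w.get(f, 0.0) + lr * grad
    return weights

def predict(weights, question):
    """Return (path, probability) pairs, most probable path first."""
    feats = features(question)
    scored = [(label, score(w, feats)) for label, w in weights.items()]
    return sorted(scored, key=lambda pair: -pair[1])
```

Calling predict(train(TRAIN), "who directed heat") ranks every known path for the question; the top-scoring paths would then be followed from the question entity to produce candidate answers.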

In Progress: Embedding Vocabulary

We compare the embedding of the question with the embedding of the property label (using a transformation matrix) to determine how likely the property is to be answer-producing. This is work in progress by Silvestr in f/property-selection, based on his sentence selection work.
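One plausible reading of this comparison is a bilinear score sigmoid(q^T M p), where q is the question embedding, p the property label embedding and M the transformation matrix. The sketch below assumes that form, averages word vectors to get q and p, and uses made-up 2-dimensional vectors and an identity M; none of these details are confirmed by the f/property-selection branch.

```python
import math

# Hypothetical toy word vectors; a real model would load trained embeddings.
VECTORS = {
    "who": [0.1, 0.1],
    "directed": [1.0, 0.0],
    "director": [0.9, 0.1],
    "birthplace": [0.0, 1.0],
}

# Assumed transformation matrix (identity here; learned in a real setup).
M = [[1.0, 0.0],
     [0.0, 1.0]]

def mean_vector(text, vectors, dim=2):
    """Average the vectors of known tokens; zero vector if none are known."""
    known = [vectors[t] for t in text.lower().split() if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

def bilinear_score(q, matrix, p):
    """sigmoid(q^T M p): how well a property label fits the question."""
    mp = [sum(matrix[i][j] * p[j] for j in range(len(p))) for i in range(len(q))]
    z = sum(q[i] * mp[i] for i in range(len(q)))
    return 1.0 / (1.0 + math.exp(-z))

def property_score(question, label):
    return bilinear_score(mean_vector(question, VECTORS), M,
                          mean_vector(label, VECTORS))
```

With these toy vectors, property_score("who directed alien", "director") comes out higher than property_score("who directed alien", "birthplace"), which is the ordering such a model is meant to learn.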

TODO properly report property selection MRR.
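For reference, MRR (mean reciprocal rank) over ranked property lists could be computed as in the following sketch. The function name and the input layout (one ranked list and one gold set per question) are assumptions, not part of the existing evaluation scripts.

```python
def mean_reciprocal_rank(ranked_lists, gold_sets):
    """Average over questions of 1/rank of the first correct item (0 if none)."""
    if not ranked_lists:
        return 0.0
    total = 0.0
    for ranked, gold in zip(ranked_lists, gold_sets):
        for rank, item in enumerate(ranked, start=1):
            if item in gold:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)
```

A question whose gold property is ranked first contributes 1, one ranked third contributes 1/3, and one where no gold property is found contributes 0.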

Results

Baseline:

moviesC-test  ab04e7d 2015-10-02 CluesToConcepts Labe... 101/165/233 43.3%/70.8% mrr 0.511 avgtime 435.469
moviesC-test uab04e7d 2015-10-02 CluesToConcepts Labe... 100/175/233 42.9%/75.1% mrr 0.510 avgtime 306.597
moviesC-test vab04e7d 2015-10-02 CluesToConcepts Labe... 101/165/233 43.3%/70.8% mrr 0.512 avgtime 380.044
moviesC-trai  ab04e7d 2015-10-02 CluesToConcepts Labe... 328/400/542 60.5%/73.8% mrr 0.657 avgtime 1133.952
moviesC-trai uab04e7d 2015-10-02 CluesToConcepts Labe... 250/406/542 46.1%/74.9% mrr 0.547 avgtime 812.428
moviesC-trai vab04e7d 2015-10-02 CluesToConcepts Labe... 292/400/542 53.9%/73.8% mrr 0.613 avgtime 1003.900

Using embeddings and transformation matrix:

moviesC-test  22f3433 2015-10-08 Mbprop.txt: Retrain ... 102/169/233 43.8%/72.5% mrr 0.510 avgtime 490.464
moviesC-test u22f3433 2015-10-08 Mbprop.txt: Retrain ... 101/175/233 43.3%/75.1% mrr 0.509 avgtime 356.568
moviesC-test v22f3433 2015-10-08 Mbprop.txt: Retrain ... 103/169/233 44.2%/72.5% mrr 0.516 avgtime 434.219
moviesC-trai  22f3433 2015-10-08 Mbprop.txt: Retrain ... 332/400/542 61.3%/73.8% mrr 0.662 avgtime 1223.704
moviesC-trai u22f3433 2015-10-08 Mbprop.txt: Retrain ... 275/406/542 50.7%/74.9% mrr 0.578 avgtime 891.278
moviesC-trai v22f3433 2015-10-08 Mbprop.txt: Retrain ... 298/400/542 55.0%/73.8% mrr 0.618 avgtime 1090.162

Pending further investigation - it seems this overfits a bit...

In Progress: Branched FBpaths (or, FBgraph)

Instead of fixed-label relation path, consider a more complex subgraph template with other entity references. Our first iteration will keep using the fixed vocabulary, just add a "T-shaped" subgraph of three entities in addition to the path. Investigated by Honza P.

When done, this will yield (with regard to subgraph problem) a baseline that is popular across systems, with three subgraph templates - direct relation, one-hop relation, and T-shaped relation with an extra fixed entity. This is enough to get huge WebQuestions coverage, apparently.

We now call this extension "Branched fbpaths". Branched fbpaths try to cover questions that involve a relation between two concepts in addition to the relation between the question entity and the answer. The two paths must share one common relation.

For example, one path is "/tv/tv_character/appeared_in_tv_program", "/tv/regular_tv_appearance/actor" and the second path is "/tv/tv_character/appeared_in_tv_program", "/tv/regular_tv_appearance/series".

This is typical for questions that look like: Who played character X in film Y?
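A T-shaped lookup of this kind can be sketched on a toy in-memory graph. The ToyGraph class, the branched_answers function and all entity names (character_X, series_Y, actor_A and so on) are hypothetical; a real system would issue a SPARQL query against Freebase instead. The shared relation leads from the question entity to an intermediate node, one branch yields the answer, and the other branch must reach the second question concept.

```python
from collections import defaultdict

class ToyGraph:
    """Minimal triple store: (subject, relation) -> list of objects."""
    def __init__(self, triples):
        self.out = defaultdict(list)
        for s, r, o in triples:
            self.out[(s, r)].append(o)

    def objects(self, s, r):
        return self.out.get((s, r), [])

def branched_answers(graph, entity, witness, shared_rel, answer_rel, witness_rel):
    """Follow shared_rel from entity; keep only intermediate nodes that reach
    the witness concept via witness_rel; return what answer_rel yields there."""
    answers = []
    for mid in graph.objects(entity, shared_rel):
        if witness in graph.objects(mid, witness_rel):
            answers.extend(graph.objects(mid, answer_rel))
    return answers

# Made-up data: character_X was played by different actors in two series.
GRAPH = ToyGraph([
    ("character_X", "/tv/tv_character/appeared_in_tv_program", "appearance_1"),
    ("character_X", "/tv/tv_character/appeared_in_tv_program", "appearance_2"),
    ("appearance_1", "/tv/regular_tv_appearance/actor", "actor_A"),
    ("appearance_1", "/tv/regular_tv_appearance/series", "series_Y"),
    ("appearance_2", "/tv/regular_tv_appearance/actor", "actor_B"),
    ("appearance_2", "/tv/regular_tv_appearance/series", "series_Z"),
])

def who_played(character, series):
    return branched_answers(
        GRAPH, character, series,
        "/tv/tv_character/appeared_in_tv_program",
        "/tv/regular_tv_appearance/actor",
        "/tv/regular_tv_appearance/series",
    )
```

Here who_played("character_X", "series_Y") keeps only the appearance that belongs to series_Y, whereas a plain two-relation path would return the actors from both series.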

Dataset

The WebQuestions dataset (can be obtained from here) was used for training the classifier for branched (T-shaped) fbpath relations. You can get the files from the "d-freebase-rp" directory, and then you need to convert these files to the TSV format using "scripts/json2tsv.py". Finally, you can follow the README file in "data/ml/fbpath" in the yodaqa repository to create the classifier.

Results

Because this type of fbpath corresponds mostly to one type of question among the questions about movies, it improves MRR on the movies dataset. Unfortunately, it makes MRR on the curated dataset worse.

moviesC-test u58e6f15 2015-09-18 Added sparql query f... 100/177/233 42.9%/76.0% mrr 0.506 avgtime 745.413
curated-test u88085fb 2015-09-18 Added sparql query f... 135/329/430 31.4%/76.5% mrr 0.408 avgtime 3921.207

For further information see Benchmarks wiki page.

Other Ideas

Many systems use semantic parsing first to produce a logical form, then learn rules that convert this logical form to a SPARQL query. Often, this SPARQL query is fixed to be essentially just a subgraph template like we do, e.g. the QALD5 winner Xser (the FBGraph subgraphs).

Another subgraph template matching approach is Bast, Haussmann: More accurate question answering on Freebase (2015). It matches the FBGraph subgraphs. Answers are produced aggressively, and for vocabulary, it measures the alignment of Freebase relation(s) with the question: the number of overlapping words, derived words, cosine similarities of word vector embeddings, and indicator words in the question trained by distant supervision.
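Two of these alignment features, word overlap and embedding cosine similarity, are simple enough to sketch. The tokenization rules and function names below are assumptions for illustration and do not reproduce the feature extraction of the cited system.

```python
import math
import re

def relation_words(relation):
    """Split a relation name like "/film/film/directed_by" into its words."""
    return {w for w in re.split(r"[/_]+", relation.lower()) if w}

def overlap(question, relation):
    """Number of distinct words shared by the question and the relation name."""
    question_words = set(re.findall(r"[a-z0-9]+", question.lower()))
    return len(question_words & relation_words(relation))

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

For "Who directed Alien?" and "/film/film/directed_by", the only shared word is "directed", so overlap returns 1. Derived-word and indicator-word features would need a stemmer and a distantly supervised lexicon, which are left out here.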

Distant supervision is common in other systems too (TODO): Wikipedia sentences that contain two entities connected by such a relation in Freebase often have the indicator word on the path between the entities in the dependency parse. Some tools may be reused for this (TODO).

Another way to map vocabulary to relations is using the PATTY resource (Xser). TODO link