
Prescoring Experiments

In the Anssel YodaQA context, we notice that (i) we have hundreds of pairs per s0 on average, and (ii) BM25 is an extremely strong baseline.

Furthermore, in some intended applications like question-to-question matching, we'd like to consider up to thousands of pairs, and full neural-network scoring is not practical in these scenarios.

Therefore, we try to pre-rank using BM25, then score just the top N using the neural model, as sketched below. The goal is to beat the BM25 baseline on YodaQA :).
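For concreteness, a minimal sketch of the scheme, assuming textbook Okapi BM25 and a hypothetical `neural_model.predict(s0, s1)` stand-in for the trained model (the repo's actual interface is the `prescoring_*` configuration parameters shown in the tables below):

```python
def bm25_score(s0_tokens, s1_tokens, idf, avg_len, k1=1.2, b=0.75):
    """Okapi BM25 score of candidate s1 against query s0; idf maps tokens
    to IDF weights (e.g. as stored in the termfreq weights file)."""
    score = 0.0
    for t in set(s0_tokens):
        tf = s1_tokens.count(t)
        score += idf.get(t, 0.0) * tf * (k1 + 1) \
            / (tf + k1 * (1 - b + b * len(s1_tokens) / avg_len))
    return score

def prescore_and_rerank(s0, candidates, neural_model, idf, avg_len, n=24):
    """A cheap BM25 pass ranks all candidate s1s; the expensive neural
    model rescores only the top-n survivors, the rest keep BM25 order."""
    ranked = sorted(candidates, reverse=True,
                    key=lambda s1: bm25_score(s0, s1, idf, avg_len))
    head = sorted(ranked[:n], reverse=True,
                  key=lambda s1: neural_model.predict(s0, s1))
    return head + ranked[n:]
```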

Verdict: This was not a successful strategy for improving accuracy, though it wasn't very harmful either (which is good for the q-to-q prospects). We'll try to use prescoring as an additional feature instead (within the KWWeights experiments).

Baselines

curatedv2:

| Model | trainAllMRR | devMRR | testMAP | testMRR | settings |
|-------|-------------|--------|---------|---------|----------|
| termfreq BM25 #w | 0.483538 | 0.452647 | 0.294300 | 0.484530 | (defaults) |
| rnn | 0.459869 | 0.429780 | 0.228869 | 0.341706 | (defaults) |
| | ±0.035981 | ±0.015609 | ±0.005554 | ±0.010643 | |
| rnn80 transfer | 0.619427 | 0.511896 | 0.310194 | 0.473334 | vocabt='ubuntu' inp_e_dropout=0 dropout=0 ptscorer=B.dot_ptscorer pdim=1 balance_class=True adapt_ubuntu=True (R_uay10649016rn80d0_bal_rmsprop) |
| | ±0.022799 | ±0.008472 | ±0.004359 | ±0.007336 | |

large2470:

| Model | trainAllMRR | devMRR | testMAP | testMRR | settings |
|-------|-------------|--------|---------|---------|----------|
| termfreq BM25 #w | 0.441573 | 0.432115 | 0.313900 | 0.490822 | (defaults) |
| rnn | 0.460984 | 0.382949 | 0.262463 | 0.381298 | (defaults) |
| | ±0.023715 | ±0.006451 | ±0.002641 | ±0.007643 | |
| rnn80 transfer | 0.617490 | 0.518555 | 0.359538 | 0.540985 | vocabt='ubuntu' inp_e_dropout=0 dropout=0 ptscorer=B.dot_ptscorer pdim=1 balance_class=True adapt_ubuntu=True opt='rmsprop' (E_ual10649016rn80d0_bal_rmsprop) |
| | ±0.021733 | ±0.011221 | ±0.002346 | ±0.006991 | |

Initial Experiments

curatedv2 train set BM25 s0 recall (i.e. at least one correct s1 among the top N BM25-scored s1s): N=87 for >95%, N=50 for >90%, N=24 for >85%, N=14 for >80%. See the sketch below for how these figures can be obtained.
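A sketch of this recall-at-N computation, assuming per-question candidate lists of `(s1_tokens, label)` pairs and the `bm25_score()` stand-in from the sketch above:

```python
def recall_at_n(questions, idf, avg_len, n):
    """Fraction of s0s with at least one correct s1 among the top-n
    BM25-scored candidates; questions is a list of (s0_tokens, cands)
    tuples, cands a list of (s1_tokens, label) pairs."""
    hits = 0
    for s0, cands in questions:
        ranked = sorted(cands, reverse=True,
                        key=lambda c: bm25_score(s0, c[0], idf, avg_len))
        hits += any(label == 1 for _, label in ranked[:n])
    return hits / len(questions)

# Sweeping n yields the cutoffs quoted above, e.g.:
# for n in (14, 24, 50, 87):
#     print(n, recall_at_n(train_questions, idf, avg_len, n))
```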

This uses BM25 IDF data generated on curatedv2 (weights-anssel-termfreq--120a2d2e6dcd0c16-bestval.h5). At this point, we are just rescoring already-trained models(?).

N=14 (>80%)
```
data/anssel/yodaqa/curatedv2-training.csv Accuracy: raw 0.946056 (y=0 0.999757, y=1 0.005461), bal 0.502609
data/anssel/yodaqa/curatedv2-training.csv MRR: 0.514757
data/anssel/yodaqa/curatedv2-training.csv MAP: 0.285100
data/anssel/yodaqa/curatedv2-val.csv Accuracy: raw 0.943921 (y=0 0.999780, y=1 0.000000), bal 0.499890
data/anssel/yodaqa/curatedv2-val.csv MRR: 0.339792
data/anssel/yodaqa/curatedv2-val.csv MAP: 0.214000
```

N=24 (>85%)
```
data/anssel/yodaqa/curatedv2-training.csv Accuracy: raw 0.945908 (y=0 0.999394, y=1 0.009102), bal 0.504248
data/anssel/yodaqa/curatedv2-training.csv MRR: 0.507937
data/anssel/yodaqa/curatedv2-training.csv MAP: 0.263000
data/anssel/yodaqa/curatedv2-val.csv Accuracy: raw 0.943783 (y=0 0.999633, y=1 0.000000), bal 0.499817
data/anssel/yodaqa/curatedv2-val.csv MRR: 0.360429
data/anssel/yodaqa/curatedv2-val.csv MAP: 0.201900
```

Thoroughly unimpressive.

N=87 (>95%), ubuntu pre-training, spad=80 (uses curatedv2 IDF data though!)
```
data/anssel/yodaqa/large2470-training.csv Accuracy: raw 0.850053 (y=0 0.871165, y=1 0.542357), bal 0.706761
data/anssel/yodaqa/large2470-training.csv MRR: 0.600323
data/anssel/yodaqa/large2470-training.csv MAP: 0.311200
data/anssel/yodaqa/large2470-val.csv Accuracy: raw 0.810205 (y=0 0.831712, y=1 0.474021), bal 0.652866
data/anssel/yodaqa/large2470-val.csv MRR: 0.532411
data/anssel/yodaqa/large2470-val.csv MAP: 0.257600
```

Might be better than the baseline.

curatedv2 rescoring

This uses the pre-trained model (reported as 2rnn in the official evaluation), just re-evaluating it with prescoring pruning enabled. 16-way.

| Model | trainAllMRR | devMRR | testMAP | testMRR | settings |
|-------|-------------|--------|---------|---------|----------|
| rnn | 0.515577 | 0.429111 | | | prescoring='termfreq' prescoring_weightsf='weights-anssel-termfreq--120a2d2e6dcd0c16-bestval.h5' prescoring_prune=14 (E_ay_2rnn_psTF120a2d2_14) |
| | ±0.025576 | ±0.010399 | | | |
| rnn | 0.512942 | 0.452505 | | | prescoring='termfreq' prescoring_weightsf='weights-anssel-termfreq--120a2d2e6dcd0c16-bestval.h5' prescoring_prune=24 |
| | ±0.029483 | ±0.014008 | | | |
| rnn | 0.480993 | 0.437912 | | | prescoring='termfreq' prescoring_weightsf='weights-anssel-termfreq--120a2d2e6dcd0c16-bestval.h5' prescoring_prune=87 |
| | ±0.034769 | ±0.014417 | | | |

Roughly equivalent to the baseline.

curatedv2 retraining

16x baseline fay_2rnn - 0.429780 (95% [0.414170, 0.445389]):

```
10785075.arien.ics.muni.cz.fay_2rnn etc.
[0.419926, 0.414613, 0.400730, 0.405743, 0.358717, 0.436551, 0.442667, 0.435838, 0.417488, 0.406697, 0.436452, 0.456544, 0.489200, 0.461735, 0.439458, 0.454115, ]
```
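The quoted intervals look like Student-t 95% confidence intervals over the per-run dev MRRs; a sketch that reproduces the baseline figures above (note the population standard deviation, numpy's ddof=0 default, which is assumed here to match the evaluation code):

```python
import numpy as np
import scipy.stats as ss

# Per-run dev MRRs of the 16 fay_2rnn baseline runs listed above.
mrrs = np.array([0.419926, 0.414613, 0.400730, 0.405743, 0.358717, 0.436551,
                 0.442667, 0.435838, 0.417488, 0.406697, 0.436452, 0.456544,
                 0.489200, 0.461735, 0.439458, 0.454115])
mean = mrrs.mean()
# Half-width: t critical value times the standard error of the mean.
halfwidth = ss.t.ppf(0.975, len(mrrs) - 1) * mrrs.std() / np.sqrt(len(mrrs))
print('%f (95%% [%f, %f])' % (mean, mean - halfwidth, mean + halfwidth))
# -> 0.429780 (95% [0.414170, 0.445389])
```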

16x epoch_fract=1, N=50 R_ay_2rnn_ef1psTF120a2d2_50 - 0.385641 (95% [0.371890, 0.399392]):

```
10882101.arien.ics.muni.cz.R_ay_2rnn_ef1psTF120a2d2_50 etc.
[0.364491, 0.392532, 0.376038, 0.367428, 0.386887, 0.384143, 0.376940, 0.357007, 0.402712, 0.422972, 0.362202, 0.406915, 0.438853, 0.403985, 0.395257, 0.331890, ]
```

16x R_ay_2rnn_ef1psTF120a2d2_24 - 0.433632 (95% [0.426472, 0.440791]):

```
10882103.arien.ics.muni.cz.R_ay_2rnn_ef1psTF120a2d2_24 etc.
[0.421236, 0.440794, 0.441250, 0.426530, 0.446562, 0.445044, 0.418378, 0.425502, 0.452239, 0.462429, 0.424718, 0.429560, 0.419612, 0.441844, 0.413010, 0.429399, ]
```

Ubuntu transfer, N=24:

```
data/anssel/yodaqa/curatedv2-val.csv MRR: 0.441103
data/anssel/yodaqa/curatedv2-val.csv MAP: 0.232000
```

large2470 transfer

FIXME generate BM25 IDF data for large2470

12x baseline R_ual10649016rn80d0_bal_rmsprop - 0.518442 (95% [0.509043, 0.527841]):

4x epoch_fract=1, N=24, curatedv2 IDF data R_ual10649016rn80d0_bal_rmsprop_ef1psTF120a2d2_24 (in fact rnn, not rn80!) - 0.485816 (95% [0.479489, 0.492142]):

```
10882249.arien.ics.muni.cz.R_ual10649016rn80d0_bal_rmsprop_ef1psTF120a2d2_24 etc.
[0.485701, 0.485944, 0.480187, 0.491430, ]
```

epoch_fract=1, N=24, curatedv2 IDF data:

```
data/anssel/yodaqa/large2470-val.csv MRR: 0.487289
data/anssel/yodaqa/large2470-val.csv MAP: 0.274500
```

epoch_fract=1, N=87, curatedv2 IDF data:

```
data/anssel/yodaqa/large2470-val.csv MRR: 0.506118
data/anssel/yodaqa/large2470-val.csv MAP: 0.252400
```

epoch_fract=1/4, N=120, curatedv2 IDF data - MRR 0.48.

epoch_fract=1/4, N=24, curatedv2 IDF data:

```
data/anssel/yodaqa/large2470-val.csv MRR: 0.476674
data/anssel/yodaqa/large2470-val.csv MAP: 0.236200
```

large2470-based IDF data: weights-anssel-termfreq-10610395e99810e9-bestval.h5 (used by the psTF1061039 runs below).

4x R_ual10649016rn80d0_bal_rmsprop_ef1psTF120a2d2_87 - 0.499942 (95% [0.482007, 0.517877]):

```
10882248.arien.ics.muni.cz.R_ual10649016rn80d0_bal_rmsprop_ef1psTF120a2d2_87 etc.
[0.500377, 0.517392, 0.486333, 0.495666, ]
```

4x R_ual10649016rn80d0_bal_rmsprop_psTF1061039_87 - 0.476053 (95% [0.468070, 0.484035]):

```
10884022.arien.ics.muni.cz.R_ual10649016rn80d0_bal_rmsprop_psTF1061039_87 etc.
[0.480446, 0.474916, 0.480539, 0.468310, ]
```

4x R_ual10649016rn80d0_bal_rmsprop_psTF1061039_50 - 0.461186 (95% [0.457568, 0.464805]):

```
10884024.arien.ics.muni.cz.R_ual10649016rn80d0_bal_rmsprop_psTF1061039_50 etc.
[0.460701, 0.464667, 0.461077, 0.458301, ]
```

No improvement with large2470-based IDF data.
