1604 Prescoring
In the anssel YodaQA context, we notice that (i) we have hundreds of pairs per s0 on average, and (ii) BM25 is an extremely strong baseline.
Furthermore, in some intended applications like question-to-question matching, we'd like to consider up to thousands of pairs, and NN computation is not practical in these scenarios.
Therefore, we try to pre-rank using BM25, then score just the top N using the neural model. The goal is to beat the BM25 baseline on YodaQA :).
Verdict: This was not a successful strategy for improving accuracy, though it wasn't very harmful either (which is good for the q-to-q prospects). We'll instead try to use prescoring as an additional feature (within the KWWeights experiments).
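For reference, the intended pruning pipeline looks roughly like this; a minimal sketch with hypothetical `bm25_score` / `nn_score` callables, not the actual anssel evaluation code:

```python
import numpy as np

def prescore_rescore(s0, candidates, bm25_score, nn_score, prune_n=24):
    """Rank candidate s1s for one s0: cheap BM25 pre-ranking first,
    then the expensive neural model on just the top prune_n survivors.

    bm25_score(s0, s1) and nn_score(s0, s1) are hypothetical callables.
    """
    # Cheap pass: BM25 over all (possibly thousands of) candidates.
    bm25 = np.array([bm25_score(s0, s1) for s1 in candidates])
    bm25_order = np.argsort(-bm25)
    top, rest = bm25_order[:prune_n], bm25_order[prune_n:]
    # Expensive pass: neural scores for the surviving candidates only.
    nn = np.array([nn_score(s0, candidates[i]) for i in top])
    reranked = top[np.argsort(-nn)]
    # Here the pruned-away candidates are kept at the bottom in BM25 order;
    # the repo's prescoring_prune may simply drop them instead.
    return np.concatenate([reranked, rest])
```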
curatedv2:
Model | trainAllMRR | devMRR | testMAP | testMRR | settings |
---|---|---|---|---|---|
termfreq BM25 #w | 0.483538 | 0.452647 | 0.294300 | 0.484530 | (defaults) |
rnn | 0.459869 | 0.429780 | 0.228869 | 0.341706 | (defaults) |
| | ±0.035981 | ±0.015609 | ±0.005554 | ±0.010643 | |
rnn80 transfer | 0.619427 | 0.511896 | 0.310194 | 0.473334 | vocabt='ubuntu' inp_e_dropout=0 dropout=0 ptscorer=B.dot_ptscorer pdim=1 balance_class=True adapt_ubuntu=True (R_uay10649016rn80d0_bal_rmsprop) |
| | ±0.022799 | ±0.008472 | ±0.004359 | ±0.007336 | |
large2470:
Model | trainAllMRR | devMRR | testMAP | testMRR | settings |
---|---|---|---|---|---|
termfreq BM25 #w | 0.441573 | 0.432115 | 0.313900 | 0.490822 | (defaults) |
rnn | 0.460984 | 0.382949 | 0.262463 | 0.381298 | (defaults) |
| | ±0.023715 | ±0.006451 | ±0.002641 | ±0.007643 | |
rnn80 transfer | 0.617490 | 0.518555 | 0.359538 | 0.540985 | vocabt='ubuntu' inp_e_dropout=0 dropout=0 ptscorer=B.dot_ptscorer pdim=1 balance_class=True adapt_ubuntu=True opt='rmsprop' (E_ual10649016rn80d0_bal_rmsprop) |
| | ±0.021733 | ±0.011221 | ±0.002346 | ±0.006991 | |
curatedv2 train set BM25 s0 recall (i.e. at least one correct s1 among the top N BM25-scored s1s): N=87 for >95%, N=50 for >90%, N=24 for >85%, N=14 for >80%.
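These recall figures can be reproduced with a check along these lines; a sketch assuming the per-s0 candidates are available as (scores, labels) array pairs (the `groups` layout and function name are hypothetical):

```python
import numpy as np

def bm25_recall_at_n(groups, n):
    """Fraction of s0s that keep at least one correct s1 among the
    top-n BM25-scored candidates; groups is a list of
    (bm25_scores, labels) array pairs, one pair per s0."""
    hits = 0
    for scores, labels in groups:
        top = np.argsort(-np.asarray(scores))[:n]
        if np.asarray(labels)[top].any():
            hits += 1
    return hits / float(len(groups))

# e.g. N=24 is the smallest N here with bm25_recall_at_n(groups, N) > 0.85
```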
Using weights-anssel-termfreq--120a2d2e6dcd0c16-bestval.h5 (BM25 IDF data generated on curatedv2). At this point we are just rescoring already-trained models (?).
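On the IDF data: BM25 needs document-frequency statistics collected from the candidate corpus, which is why the weights file is tied to a particular dataset (curatedv2 here). A generic Okapi BM25 sketch for orientation; the repo's termfreq model may differ in details:

```python
import math
from collections import Counter

def build_idf(corpus):
    """IDF statistics from one dataset's candidate sentences (token
    lists); this is why the IDF file is dataset-specific."""
    df = Counter()
    for tokens in corpus:
        df.update(set(tokens))
    n = len(corpus)
    return {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

def bm25(s0, s1, idf, avglen, k1=1.2, b=0.75):
    """Okapi BM25 score of candidate s1 against query s0 (token lists)."""
    tf = Counter(s1)
    score = 0.0
    for t in set(s0):
        if t not in tf:
            continue
        num = tf[t] * (k1 + 1)
        den = tf[t] + k1 * (1 - b + b * len(s1) / avglen)
        score += idf.get(t, 0.0) * num / den
    return score
```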
N=14 (>80%)
data/anssel/yodaqa/curatedv2-training.csv Accuracy: raw 0.946056 (y=0 0.999757, y=1 0.005461), bal 0.502609
data/anssel/yodaqa/curatedv2-training.csv MRR: 0.514757
data/anssel/yodaqa/curatedv2-training.csv MAP: 0.285100
data/anssel/yodaqa/curatedv2-val.csv Accuracy: raw 0.943921 (y=0 0.999780, y=1 0.000000), bal 0.499890
data/anssel/yodaqa/curatedv2-val.csv MRR: 0.339792
data/anssel/yodaqa/curatedv2-val.csv MAP: 0.214000
N=24 (>85%)
data/anssel/yodaqa/curatedv2-training.csv Accuracy: raw 0.945908 (y=0 0.999394, y=1 0.009102), bal 0.504248
data/anssel/yodaqa/curatedv2-training.csv MRR: 0.507937
data/anssel/yodaqa/curatedv2-training.csv MAP: 0.263000
data/anssel/yodaqa/curatedv2-val.csv Accuracy: raw 0.943783 (y=0 0.999633, y=1 0.000000), bal 0.499817
data/anssel/yodaqa/curatedv2-val.csv MRR: 0.360429
data/anssel/yodaqa/curatedv2-val.csv MAP: 0.201900
Thoroughly unimpressive.
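For reference, the MRR and MAP numbers in these logs are the usual per-s0 ranking metrics; a minimal sketch using the same hypothetical (scores, labels) layout as above (how the repo's eval code treats s0s with no correct s1 left after pruning may differ):

```python
import numpy as np

def mrr(groups):
    """Mean reciprocal rank of the first correct s1 per s0; s0s with
    no correct s1 (e.g. all pruned away) are counted as 0 here."""
    rr = []
    for scores, labels in groups:
        ranked = np.asarray(labels)[np.argsort(-np.asarray(scores))]
        correct = np.flatnonzero(ranked)
        rr.append(1.0 / (correct[0] + 1) if len(correct) else 0.0)
    return float(np.mean(rr))

def mean_ap(groups):
    """Mean average precision over s0 groups; groups with no correct
    s1 are skipped in this sketch."""
    aps = []
    for scores, labels in groups:
        ranked = np.asarray(labels)[np.argsort(-np.asarray(scores))]
        hits = np.flatnonzero(ranked)
        if len(hits):
            aps.append(np.mean([(k + 1.0) / (r + 1) for k, r in enumerate(hits)]))
    return float(np.mean(aps))
```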
N=87 (>95%), ubuntu pre-training, spad=80 (uses curatedv2 IDF data though!)
data/anssel/yodaqa/large2470-training.csv Accuracy: raw 0.850053 (y=0 0.871165, y=1 0.542357), bal 0.706761
data/anssel/yodaqa/large2470-training.csv MRR: 0.600323
data/anssel/yodaqa/large2470-training.csv MAP: 0.311200
data/anssel/yodaqa/large2470-val.csv Accuracy: raw 0.810205 (y=0 0.831712, y=1 0.474021), bal 0.652866
data/anssel/yodaqa/large2470-val.csv MRR: 0.532411
data/anssel/yodaqa/large2470-val.csv MAP: 0.257600
Might be better than the baseline.
This uses a pre-trained model (reported as 2rnn in the official evaluation), just re-evaluating it with prescoring pruning enabled. 16-way.
Model | trainAllMRR | devMRR | testMAP | testMRR | settings |
---|---|---|---|---|---|
rnn | 0.515577 | 0.429111 | | | prescoring='termfreq' prescoring_weightsf='weights-anssel-termfreq--120a2d2e6dcd0c16-bestval.h5' prescoring_prune=14 (E_ay_2rnn_psTF120a2d2_14) |
| | ±0.025576 | ±0.010399 | | | |
rnn | 0.512942 | 0.452505 | | | prescoring='termfreq' prescoring_weightsf='weights-anssel-termfreq--120a2d2e6dcd0c16-bestval.h5' prescoring_prune=24 |
| | ±0.029483 | ±0.014008 | | | |
rnn | 0.480993 | 0.437912 | | | prescoring='termfreq' prescoring_weightsf='weights-anssel-termfreq--120a2d2e6dcd0c16-bestval.h5' prescoring_prune=87 |
| | ±0.034769 | ±0.014417 | | | |
Roughly equivalent to the baseline.
16x baseline fay_2rnn - 0.429780 (95% [0.414170, 0.445389]):
10785075.arien.ics.muni.cz.fay_2rnn etc.
[0.419926, 0.414613, 0.400730, 0.405743, 0.358717, 0.436551, 0.442667, 0.435838, 0.417488, 0.406697, 0.436452, 0.456544, 0.489200, 0.461735, 0.439458, 0.454115, ]
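The quoted 95% intervals are consistent with a confidence interval of the mean over the individual runs; a sketch that approximately reproduces the baseline line above (assuming a Student-t interval; the exact constant used by the original tooling may differ slightly):

```python
import numpy as np
from scipy import stats

mrrs = np.array([0.419926, 0.414613, 0.400730, 0.405743,
                 0.358717, 0.436551, 0.442667, 0.435838,
                 0.417488, 0.406697, 0.436452, 0.456544,
                 0.489200, 0.461735, 0.439458, 0.454115])

mean = mrrs.mean()                           # 0.429780
sem = mrrs.std(ddof=1) / np.sqrt(len(mrrs))  # standard error of the mean
lo, hi = stats.t.interval(0.95, len(mrrs) - 1, loc=mean, scale=sem)
print('%f (95%% [%f, %f])' % (mean, lo, hi))
```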
16x epoch_fract=1, N=50 R_ay_2rnn_ef1psTF120a2d2_50 - 0.385641 (95% [0.371890, 0.399392]):
10882101.arien.ics.muni.cz.R_ay_2rnn_ef1psTF120a2d2_50 etc.
[0.364491, 0.392532, 0.376038, 0.367428, 0.386887, 0.384143, 0.376940, 0.357007, 0.402712, 0.422972, 0.362202, 0.406915, 0.438853, 0.403985, 0.395257, 0.331890, ]
16x R_ay_2rnn_ef1psTF120a2d2_24 - 0.433632 (95% [0.426472, 0.440791]):
10882103.arien.ics.muni.cz.R_ay_2rnn_ef1psTF120a2d2_24 etc.
[0.421236, 0.440794, 0.441250, 0.426530, 0.446562, 0.445044, 0.418378, 0.425502, 0.452239, 0.462429, 0.424718, 0.429560, 0.419612, 0.441844, 0.413010, 0.429399, ]
Ubuntu transfer, N=24:
data/anssel/yodaqa/curatedv2-val.csv MRR: 0.441103
data/anssel/yodaqa/curatedv2-val.csv MAP: 0.232000
FIXME generate BM25 IDF data for large2470
12x baseline R_ual10649016rn80d0_bal_rmsprop - 0.518442 (95% [0.509043, 0.527841]).
4x epoch_fract=1, N=24, curatedv2 IDF data R_ual10649016rn80d0_bal_rmsprop_ef1psTF120a2d2_24 (in fact rnn, not rn80!) - 0.485816 (95% [0.479489, 0.492142]):
10882249.arien.ics.muni.cz.R_ual10649016rn80d0_bal_rmsprop_ef1psTF120a2d2_24 etc.
[0.485701, 0.485944, 0.480187, 0.491430, ]
epoch_fract=1, N=24, curatedv2 IDF data:
data/anssel/yodaqa/large2470-val.csv MRR: 0.487289
data/anssel/yodaqa/large2470-val.csv MAP: 0.274500
epoch_fract=1, N=87, curatedv2 IDF data:
data/anssel/yodaqa/large2470-val.csv MRR: 0.506118
data/anssel/yodaqa/large2470-val.csv MAP: 0.252400
epoch_fract=1/4, N=120, curatedv2 IDF data - MRR 0.48.
epoch_fract=1/4, N=24, curatedv2 IDF data:
data/anssel/yodaqa/large2470-val.csv MRR: 0.476674
data/anssel/yodaqa/large2470-val.csv MAP: 0.236200
IDF data generated on large2470: weights-anssel-termfreq-10610395e99810e9-bestval.h5 (used in the psTF1061039 runs below).
4x R_ual10649016rn80d0_bal_rmsprop_ef1psTF120a2d2_87 - 0.499942 (95% [0.482007, 0.517877]):
10882248.arien.ics.muni.cz.R_ual10649016rn80d0_bal_rmsprop_ef1psTF120a2d2_87 etc.
[0.500377, 0.517392, 0.486333, 0.495666, ]
4x R_ual10649016rn80d0_bal_rmsprop_psTF1061039_87 - 0.476053 (95% [0.468070, 0.484035]):
10884022.arien.ics.muni.cz.R_ual10649016rn80d0_bal_rmsprop_psTF1061039_87 etc.
[0.480446, 0.474916, 0.480539, 0.468310, ]
4x R_ual10649016rn80d0_bal_rmsprop_psTF1061039_50 - 0.461186 (95% [0.457568, 0.464805]):
10884024.arien.ics.muni.cz.R_ual10649016rn80d0_bal_rmsprop_psTF1061039_50 etc.
[0.460701, 0.464667, 0.461077, 0.458301, ]
No improvement with large2470-based IDF data.