Could you elaborate more on the extrinsic evaluation? #4
For SNIPS, the classification accuracy of bert-base-uncased trained on a randomly sampled 1% of the training set is as follows:

{
  "acc": {
    "mean": 0.9297142857142857,
    "std": 0.02000145767282727,
    "raw": [
      0.9042857142857142,
      0.9285714285714286,
      0.9542857142857143,
      0.95,
      0.9414285714285714,
      0.9385714285714286,
      0.9085714285714286,
      0.9328571428571428,
      0.9171428571428571,
      0.9028571428571428,
      0.96,
      0.9085714285714286,
      0.9071428571428571,
      0.9457142857142857,
      0.9457142857142857
    ]
  },
  "scarcity": 0.01
}
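For reference, the summary statistics above can be recomputed from the raw per-trial accuracies; a minimal sketch (the `raw` list is copied verbatim from the JSON above, and the reported std matches the sample standard deviation with an n-1 denominator):

```python
# Recompute the reported summary statistics from the raw per-trial accuracies.
import statistics

raw = [
    0.9042857142857142, 0.9285714285714286, 0.9542857142857143, 0.95,
    0.9414285714285714, 0.9385714285714286, 0.9085714285714286,
    0.9328571428571428, 0.9171428571428571, 0.9028571428571428, 0.96,
    0.9085714285714286, 0.9071428571428571, 0.9457142857142857,
    0.9457142857142857,
]

print(f"mean = {statistics.mean(raw):.6f}")   # 0.929714
print(f"std  = {statistics.stdev(raw):.6f}")  # 0.020001 (sample std, n-1 denominator)
```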
Sorry for the late reply. Here is how we sampled the data: we took the initial dataset and randomly sampled it 15 times (both the training and dev sets). With 1% of the data, 92% accuracy with 0.02 std for SNIPS looks too good to be true; you should observe a much larger variance in a 1% experiment.
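For concreteness, a minimal sketch of the repeated-subsampling setup described above (this is not the authors' released code; the example format and the 15-trial loop are assumptions):

```python
# Sketch: draw a fresh 1% subsample of the data for each of 15 trials,
# using the trial index as the random seed so each run is reproducible.
import random

def subsample(examples, fraction, seed):
    """Return a random `fraction` of `examples` for one trial."""
    rng = random.Random(seed)
    k = max(1, int(len(examples) * fraction))
    return rng.sample(examples, k)

# `train_set` and `dev_set` are assumed to be lists of (text, label) pairs.
# for trial in range(15):
#     train_1pct = subsample(train_set, 0.01, seed=trial)
#     dev_1pct = subsample(dev_set, 0.01, seed=trial)
#     # ... fine-tune BERT on train_1pct, select on dev_1pct, record test accuracy ...
```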
Thanks for the reply. The SNIPS results of 92% accuracy at the 1% data level (around 20 examples per class) are definitely plausible. As indirect evidence, you can check out the FSI experiments in this paper, which claims that "BERT generalizes well with just 30 examples"; hence they went with 10 seed examples per class.
You mentioned in the paper that you randomly sampled 1% of the training set and took 5 examples of each class for the validation set. I tried to replicate the baseline results on SST-2 by fine-tuning bert-base-uncased (as mentioned in the paper), but my results are much higher than the reported numbers.
Your Paper: 59.08 (5.59) [15 trials]
My Attempt: 72.89 (6.36) [9 trials]
I could probably increase the number of trials to see if I was just unlucky, but it is unlikely that statistical variance alone could shift the numbers that much. Could you provide more details about your experiments? Did you sample the datasets with a different seed for each trial?
BTW I am using the dataset provided by the authors of CBERT (training set size 6,228). Thanks in advance.
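In case it helps clarify the question, here is a hypothetical sketch of the setup being asked about, where a single per-trial seed drives both the 1% training subsample and the 5-per-class validation sample (the dict layout and "label" field are assumptions, not the paper's code):

```python
# Hypothetical per-trial split: 1% of the training data plus 5 validation
# examples per class, all driven by one seed per trial.
import random
from collections import defaultdict

def sample_splits(examples, seed, train_fraction=0.01, dev_per_class=5):
    """`examples` is assumed to be a list of dicts with a "label" key."""
    rng = random.Random(seed)
    indices = list(range(len(examples)))
    rng.shuffle(indices)

    k = max(1, int(len(examples) * train_fraction))
    train_idx = indices[:k]                      # 1% training subsample

    by_label = defaultdict(list)
    for i in indices[k:]:                        # remaining examples, still shuffled
        by_label[examples[i]["label"]].append(i)
    dev_idx = []
    for label, idxs in by_label.items():
        dev_idx.extend(idxs[:dev_per_class])     # 5 per class for validation
    return train_idx, dev_idx

# e.g. one split per trial:
# for trial in range(15):
#     train_idx, dev_idx = sample_splits(sst2_train, seed=trial)
```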