
base checkpoint selection #10

Open
alsbhn opened this issue Jun 3, 2022 · 3 comments

alsbhn commented Jun 3, 2022

I see in the code that two models (distilbert-base-uncased, msmarco-distilbert-margin-mse) are recommended as initial checkpoints. I tried other Sentence-Transformers models such as all-mpnet-base-v2, but it didn't work. Is there a difference between the architecture of those models and the ones used in the implementation here? Which models can be used as initial checkpoints?
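
For reference, all of these checkpoints do load fine as plain SentenceTransformer models; this is just a quick sketch (nothing GPL-specific) that I use to compare their module stacks:

```python
from sentence_transformers import SentenceTransformer

# Compare the module stacks of a recommended checkpoint and one of the models I tried.
# A plain HF checkpoint such as distilbert-base-uncased gets wrapped as Transformer + (mean) Pooling,
# while the pre-trained Sentence-Transformers models ship their own pooling / normalization config.
for name in ["distilbert-base-uncased", "sentence-transformers/all-mpnet-base-v2"]:
    model = SentenceTransformer(name)
    print(name)
    print(model)  # prints the Transformer / Pooling (/ Normalize) stack
```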

kwang2049 (Member) commented

Hi @alsbhn, could you please tell me what you mean by "didn't work"? Do you mean the code was not runnable with this setting, or is it about the performance?

alsbhn (Author) commented Jun 9, 2022

The code runs without errors; the issue is the performance. When I use distilbert-base-uncased or msmarco-distilbert-margin-mse as the base checkpoint, performance improves after a few tens of thousands of steps, as expected. But with other models like all-mpnet-base-v2 and all-MiniLM-L6-v2, the model does not perform well on my dataset, and performance even decreases as I train for more steps.

kwang2049 (Member) commented

Thanks for pointing out this issue. I need some time to check the exact cause, but I can imagine four potential reasons:
(1) The base checkpoint might already be stronger than the teacher cross-encoder;
(2) The number of training steps might be too small: for some target datasets I found performance can degrade at the beginning, but the final performance improves after longer training (e.g. 100K steps);
(3) The negative miner might be too weak. For this, we can try setting base_ckpt and retrievers to the same checkpoint, e.g. sentence-transformers/all-mpnet-base-v2 (see the first sketch after this list). From my experience, this is very important when using TAS-B as the base checkpoint;
(4) It might be due to the choice of similarity function, dot product vs. cosine similarity (see the second sketch after this list). @nreimers recently found that MarginMSE results in poor in-domain performance when cosine similarity is used (compared with a simple CrossEntropy loss). I am not sure whether the same holds in the domain-adaptation scenario. Note that both all-mpnet-base-v2 and all-MiniLM-L6-v2 were trained with cosine similarity.
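
For (3), the point is that the hard negatives should be mined with a retriever that is at least as strong as the base checkpoint; in GPL this just means passing the same name to base_ckpt and retrievers. Conceptually, the mining step looks like this (a toy sketch with placeholder texts, not the actual GPL code):

```python
from sentence_transformers import SentenceTransformer, util

# Toy sketch: mine hard negatives with the SAME model that is used as the base checkpoint,
# so the mined negatives are actually hard for that (already strong) model.
ckpt = "sentence-transformers/all-mpnet-base-v2"
retriever = SentenceTransformer(ckpt)

corpus = [  # placeholder passages
    "Domain adaptation tunes a dense retriever on the target corpus.",
    "GPL generates queries and pseudo-labels them with a cross-encoder.",
    "Bananas are rich in potassium.",
]
queries = ["how does gpl adapt a dense retriever"]  # placeholder generated query

corpus_emb = retriever.encode(corpus, convert_to_tensor=True)
query_emb = retriever.encode(queries, convert_to_tensor=True)

# Top-k retrieved passages per query; anything retrieved that is not the positive
# passage can then serve as a hard negative for MarginMSE training.
hits = util.semantic_search(query_emb, corpus_emb, top_k=3)
print(hits)
```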
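
And for (4), a quick way to see the score-function difference with these checkpoints (again only a toy sketch, not part of the training code):

```python
from sentence_transformers import SentenceTransformer, util

# Toy sketch: all-mpnet-base-v2 ends with a Normalize module, so its embeddings are
# unit-length; dot product and cosine similarity then coincide and are bounded in [-1, 1].
# MarginMSE regresses the bi-encoder's score margin onto the (unbounded) cross-encoder
# margin, which I suspect interacts badly with such bounded, cosine-style scores.
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

query_emb = model.encode("what is domain adaptation", convert_to_tensor=True)
passage_emb = model.encode(
    ["Domain adaptation tunes a dense retriever on the target corpus.",
     "Bananas are rich in potassium."],
    convert_to_tensor=True,
)

print(query_emb.norm())                        # ~1.0: embeddings come out L2-normalized
print(util.cos_sim(query_emb, passage_emb))    # bounded in [-1, 1]
print(util.dot_score(query_emb, passage_emb))  # identical to cos_sim for this model
```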
