
base checkpoint selection #10

Open
alsbhn opened this issue Jun 3, 2022 · 3 comments

alsbhn commented Jun 3, 2022

I see in the code that two models (distilbert-base-uncased, msmarco-distilbert-margin-mse) are recommended as initial checkpoints. I tried other Sentence-Transformers models such as all-mpnet-base-v2, but it didn't work. Is there a difference between the architecture of those models and the ones used in the implementation here? Which models can be used as initial checkpoints?
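
For reference, all of these checkpoints do load fine as plain SentenceTransformer models; this is just a quick sketch (nothing GPL-specific) that I use to compare their module stacks:

```python
from sentence_transformers import SentenceTransformer

# Compare the module stacks of a recommended checkpoint and one of the models I tried.
# A plain HF checkpoint such as distilbert-base-uncased gets wrapped as Transformer + (mean) Pooling,
# while the pre-trained Sentence-Transformers models ship their own pooling / normalization config.
for name in ["distilbert-base-uncased", "sentence-transformers/all-mpnet-base-v2"]:
    model = SentenceTransformer(name)
    print(name)
    print(model)  # prints the Transformer / Pooling (/ Normalize) stack
```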

kwang2049 (Member) commented

Hi @alsbhn, could you please tell me what you mean by "didn't work"? Do you mean the code was not runnable with this setting, or is it about the performance?

alsbhn (Author) commented Jun 9, 2022

The code runs without errors; the issue is the performance. When I use distilbert-base-uncased or msmarco-distilbert-margin-mse as the base checkpoint, performance improves after a few tens of thousands of steps, as expected. But with other models like all-mpnet-base-v2 and all-MiniLM-L6-v2, the model does not perform well on my dataset, and performance even decreases as I train for more steps.

kwang2049 (Member) commented

Thanks for pointing out this issue. I need some time to check the exact cause, but I can imagine four potential reasons:
(1) The base checkpoint might already be stronger than the teacher cross-encoder;
(2) The number of training steps might be too small: for some target datasets I found performance can degrade at the beginning, but the final performance improves after longer training (e.g. 100K steps);
(3) The negative miner might be too weak. For this, we can try setting base_ckpt and retrievers to the same checkpoint, e.g. sentence-transformers/all-mpnet-base-v2 (see the first sketch after this list). From my experience, this is very important when using TAS-B as the base checkpoint;
(4) It might be due to the choice of similarity function, dot product vs. cosine similarity (see the second sketch after this list). @nreimers recently found that MarginMSE results in poor in-domain performance when cosine similarity is used (compared with a simple CrossEntropy loss). I am not sure whether the same holds in the domain-adaptation scenario. Note that both all-mpnet-base-v2 and all-MiniLM-L6-v2 were trained with cosine similarity.
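
For (3), the point is that the hard negatives should be mined with a retriever that is at least as strong as the base checkpoint; in GPL this just means passing the same name to base_ckpt and retrievers. Conceptually, the mining step looks like this (a toy sketch with placeholder texts, not the actual GPL code):

```python
from sentence_transformers import SentenceTransformer, util

# Toy sketch: mine hard negatives with the SAME model that is used as the base checkpoint,
# so the mined negatives are actually hard for that (already strong) model.
ckpt = "sentence-transformers/all-mpnet-base-v2"
retriever = SentenceTransformer(ckpt)

corpus = [  # placeholder passages
    "Domain adaptation tunes a dense retriever on the target corpus.",
    "GPL generates queries and pseudo-labels them with a cross-encoder.",
    "Bananas are rich in potassium.",
]
queries = ["how does gpl adapt a dense retriever"]  # placeholder generated query

corpus_emb = retriever.encode(corpus, convert_to_tensor=True)
query_emb = retriever.encode(queries, convert_to_tensor=True)

# Top-k retrieved passages per query; anything retrieved that is not the positive
# passage can then serve as a hard negative for MarginMSE training.
hits = util.semantic_search(query_emb, corpus_emb, top_k=3)
print(hits)
```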
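
And for (4), a quick way to see the score-function difference with these checkpoints (again only a toy sketch, not part of the training code):

```python
from sentence_transformers import SentenceTransformer, util

# Toy sketch: all-mpnet-base-v2 ends with a Normalize module, so its embeddings are
# unit-length; dot product and cosine similarity then coincide and are bounded in [-1, 1].
# MarginMSE regresses the bi-encoder's score margin onto the (unbounded) cross-encoder
# margin, which I suspect interacts badly with such bounded, cosine-style scores.
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

query_emb = model.encode("what is domain adaptation", convert_to_tensor=True)
passage_emb = model.encode(
    ["Domain adaptation tunes a dense retriever on the target corpus.",
     "Bananas are rich in potassium."],
    convert_to_tensor=True,
)

print(query_emb.norm())                        # ~1.0: embeddings come out L2-normalized
print(util.cos_sim(query_emb, passage_emb))    # bounded in [-1, 1]
print(util.dot_score(query_emb, passage_emb))  # identical to cos_sim for this model
```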
