DPO training makes the model even worse #394
-
Hi guys, I'm trying to use the DPO training method explained in this document on my own dataset. However, I found that although the DPO loss decreases, the original model loss increases. The training seems to make the model perform much worse on the rejected text and a little worse on the chosen text, which still decreases the DPO loss. I checked my code, and the logic looks the same as the official code. Is this possible in some circumstances? What can I do? Thanks!
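For reference, here is a minimal sketch of the DPO loss as I understand it (variable names like `policy_chosen_logprobs` are just placeholders, not the official implementation). It illustrates why the loss can keep decreasing even when the model gets worse on both the chosen and the rejected text:

```python
# Minimal DPO loss sketch, assuming per-sequence log-probs (summed over tokens)
# for the policy and the frozen reference model. Names are placeholders.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logprobs, policy_rejected_logprobs,
             ref_chosen_logprobs, ref_rejected_logprobs, beta=0.1):
    # Implicit rewards: beta-scaled log-ratios of policy vs. reference.
    chosen_rewards = beta * (policy_chosen_logprobs - ref_chosen_logprobs)
    rejected_rewards = beta * (policy_rejected_logprobs - ref_rejected_logprobs)

    # The loss only depends on the margin (chosen - rejected), so it can
    # decrease even when the chosen reward itself goes down, as long as the
    # rejected reward falls faster.
    margin = chosen_rewards - rejected_rewards
    loss = -F.logsigmoid(margin).mean()
    return loss, chosen_rewards.detach(), rejected_rewards.detach()

# Toy example: the model gets slightly worse on the chosen text (-0.2) and much
# worse on the rejected text (-2.0); the margin is positive, so the loss drops
# below log(2) even though the model degraded on both sequences.
loss, cr, rr = dpo_loss(torch.tensor([-0.2]), torch.tensor([-2.0]),
                        torch.tensor([0.0]), torch.tensor([0.0]))
```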
-
I ran the DPO Colab experiment and added a print statement for the chosen reward and the rejected reward. It looks like the degradation also occurs in your example:
As you can see, although the train reward margin is going up, the train chosen reward is fluctuating and drifting downwards. The overall increase in the reward margin comes from the faster decrease of the rejected reward.
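The print I added looked roughly like this sketch (the helper name and format string are just mine, not from the Colab notebook); it assumes the chosen/rejected rewards returned by a `dpo_loss` function like the one above:

```python
# Hypothetical logging helper for tracking the three quantities discussed here:
# mean chosen reward, mean rejected reward, and their margin per logging step.
import torch

@torch.no_grad()
def log_reward_stats(chosen_rewards, rejected_rewards, step):
    chosen = chosen_rewards.mean().item()
    rejected = rejected_rewards.mean().item()
    margin = chosen - rejected
    # A rising margin combined with a falling chosen reward means the "gain"
    # comes from pushing rejected responses down, not chosen responses up.
    print(f"step {step:5d} | chosen reward {chosen:+.4f} | "
          f"rejected reward {rejected:+.4f} | margin {margin:+.4f}")
```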
Yeah, in my experience, DPO can be very tricky and finicky. Even though the DPO loss improves, the training can still make the model worse (it's also susceptible to collapse). I remember a bunch of papers discussing that. I think one of them was this one, which may be helpful here: https://arxiv.org/abs/2402.13228