DPO training makes the model even worse #394
-
Hi guys, I'm trying to use the DPO training method explained in this document on my own dataset. However, I found that although the DPO loss decreases, the original model loss increases. The training seems to make the model perform much worse on the rejected text and a little worse on the chosen text, which still decreases the DPO loss. I checked my code, and the logic looks the same as the official code. Is this possible in some circumstances? What can I do? Thanks!
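For reference, here is a minimal sketch of the DPO loss as I understand it (variable names like `policy_chosen_logprobs` are just placeholders, not the official implementation). It illustrates why the loss can keep decreasing even when the model gets worse on both the chosen and the rejected text:

```python
# Minimal DPO loss sketch, assuming per-sequence log-probs (summed over tokens)
# for the policy and the frozen reference model. Names are placeholders.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logprobs, policy_rejected_logprobs,
             ref_chosen_logprobs, ref_rejected_logprobs, beta=0.1):
    # Implicit rewards: beta-scaled log-ratios of policy vs. reference.
    chosen_rewards = beta * (policy_chosen_logprobs - ref_chosen_logprobs)
    rejected_rewards = beta * (policy_rejected_logprobs - ref_rejected_logprobs)

    # The loss only depends on the margin (chosen - rejected), so it can
    # decrease even when the chosen reward itself goes down, as long as the
    # rejected reward falls faster.
    margin = chosen_rewards - rejected_rewards
    loss = -F.logsigmoid(margin).mean()
    return loss, chosen_rewards.detach(), rejected_rewards.detach()

# Toy example: the model gets slightly worse on the chosen text (-0.2) and much
# worse on the rejected text (-2.0); the margin is positive, so the loss drops
# below log(2) even though the model degraded on both sequences.
loss, cr, rr = dpo_loss(torch.tensor([-0.2]), torch.tensor([-2.0]),
                        torch.tensor([0.0]), torch.tensor([0.0]))
```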
-
I ran the DPO Colab experiment and added a print statement for the chosen reward and the rejected reward. It looks like the degradation also occurs in your example:
As you can see, although the train reward margin is going up, the train chosen reward is fluctuating and drifting downwards. The overall increase in the reward margin comes from the faster decrease of the rejected reward.
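The print I added looked roughly like this sketch (the helper name and format string are just mine, not from the Colab notebook); it assumes the chosen/rejected rewards returned by a `dpo_loss` function like the one above:

```python
# Hypothetical logging helper for tracking the three quantities discussed here:
# mean chosen reward, mean rejected reward, and their margin per logging step.
import torch

@torch.no_grad()
def log_reward_stats(chosen_rewards, rejected_rewards, step):
    chosen = chosen_rewards.mean().item()
    rejected = rejected_rewards.mean().item()
    margin = chosen - rejected
    # A rising margin combined with a falling chosen reward means the "gain"
    # comes from pushing rejected responses down, not chosen responses up.
    print(f"step {step:5d} | chosen reward {chosen:+.4f} | "
          f"rejected reward {rejected:+.4f} | margin {margin:+.4f}")
```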
Yeah, in my experience, DPO can be very tricky and finicky. Even though the DPO loss improves, the training can still make the model worse (it's also susceptible to collapse). I remember a bunch of papers discussing that. I think one of them was this one, which may be helpful here: https://arxiv.org/abs/2402.13228