Describe the bug
When fine-tuning XTTS_v2 on LJSpeech with the recipe script (recipes/ljspeech/xtts_v2/train_gpt_xtts.py), the loss is NaN from the very first step.
I have tried reducing the learning rate, changing the batch size, and changing the DDP config, but none of it helps.
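To narrow down where the NaNs first appear, I added a quick finiteness check over the model's parameters and gradients after each step. This is plain PyTorch, not part of the recipe; `nonfinite_tensors` is my own helper:

```python
import torch

def nonfinite_tensors(model: torch.nn.Module) -> list:
    """Return names of parameters/gradients that contain NaN or Inf."""
    bad = []
    for name, p in model.named_parameters():
        if not torch.isfinite(p).all():
            bad.append("param:" + name)
        if p.grad is not None and not torch.isfinite(p.grad).all():
            bad.append("grad:" + name)
    return bad
```

Calling this right after loading the checkpoint (before any training step) tells you whether the restored weights are already corrupt, versus the NaN being produced during the forward/backward pass.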
To Reproduce
Run the recipe script as-is on the LJSpeech dataset: recipes/ljspeech/xtts_v2/train_gpt_xtts.py
Expected behavior
The loss should be a finite value, but it is NaN from the start of training.
Logs
> Training Environment:
|> Backend: Accelerate
|> Mixed precision: False
|> Precision: float32
|> Current device: 0
|> Num. of GPUs: 1
|> Num. of CPUs: 28
|> Num. of Torch Threads: 1
|> Torch seed: 1
|> Torch CUDNN: True
|> Torch CUDNN deterministic: False
|> Torch CUDNN benchmark: False
|> Torch TF32 MatMul: False
> Start Tensorboard: tensorboard --logdir=/home/xj_data/liuwenrui/model/TTS/examples/checkpoints/xtts_ljspeech/GPT_XTTS_LJSpeech_FT-September-05-2024_07+56AM-0000000
> Model has 518442047 parameters
> EPOCH: 0/1000
--> /home/xj_data/liuwenrui/model/TTS/examples/checkpoints/xtts_ljspeech/GPT_XTTS_LJSpeech_FT-September-05-2024_07+56AM-0000000
> Filtering invalid eval samples!!
> Total eval samples after filtering: 131
> EVALUATION
|> Synthesizing test sentences.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
--> EVAL PERFORMANCE | > avg_loader_time: 0.1370311975479126 (+0) | > avg_loss_text_ce: nan (+0) | > avg_loss_mel_ce: nan (+0) | > avg_loss: nan (+0)
> EPOCH: 1/1000
--> /home/xj_data/liuwenrui/model/TTS/examples/checkpoints/xtts_ljspeech/GPT_XTTS_LJSpeech_FT-September-05-2024_07+56AM-0000000
> Sampling by language: dict_keys(['en'])
> TRAINING (2024-09-05 07:57:04)
--> TIME: 2024-09-05 07:57:06 -- STEP: 0/406 -- GLOBAL_STEP: 0 | > loss_text_ce: nan (nan) | > loss_mel_ce: nan (nan) | > loss: nan (nan) | > current_lr: 5e-06 | > step_time: 0.6947 (0.6947054862976074) | > loader_time: 0.9726 (0.9726102352142334)
--> TIME: 2024-09-05 07:57:34 -- STEP: 50/406 -- GLOBAL_STEP: 50 | > loss_text_ce: nan (nan) | > loss_mel_ce: nan (nan) | > loss: nan (nan) | > current_lr: 5e-06 | > step_time: 0.5593 (0.5495537853240967) | > loader_time: 0.0208 (0.025657353401184083)
--> TIME: 2024-09-05 07:58:03 -- STEP: 100/406 -- GLOBAL_STEP: 100 | > loss_text_ce: nan (nan) | > loss_mel_ce: nan (nan) | > loss: nan (nan) | > current_lr: 5e-06 | > step_time: 0.5421 (0.5507096624374387) | > loader_time: 0.027 (0.024191355705261236)
--> TIME: 2024-09-05 07:58:32 -- STEP: 150/406 -- GLOBAL_STEP: 150 | > loss_text_ce: nan (nan) | > loss_mel_ce: nan (nan) | > loss: nan (nan) | > current_lr: 5e-06 | > step_time: 0.5438 (0.5505051151911414) | > loader_time: 0.0248 (0.023627797762552902)
--> TIME: 2024-09-05 07:59:01 -- STEP: 200/406 -- GLOBAL_STEP: 200 | > loss_text_ce: nan (nan) | > loss_mel_ce: nan (nan) | > loss: nan (nan) | > current_lr: 5e-06 | > step_time: 0.5429 (0.5503179728984828) | > loader_time: 0.0192 (0.023366824388504036)
--> TIME: 2024-09-05 07:59:29 -- STEP: 250/406 -- GLOBAL_STEP: 250 | > loss_text_ce: nan (nan) | > loss_mel_ce: nan (nan) | > loss: nan (nan) | > current_lr: 5e-06 | > step_time: 0.5364 (0.5505284156799314) | > loader_time: 0.024 (0.023049613952636726)
--> TIME: 2024-09-05 07:59:58 -- STEP: 300/406 -- GLOBAL_STEP: 300 | > loss_text_ce: nan (nan) | > loss_mel_ce: nan (nan) | > loss: nan (nan) | > current_lr: 5e-06 | > step_time: 0.5382 (0.5508304643630979) | > loader_time: 0.0228 (0.022726009686787927)
--> TIME: 2024-09-05 08:00:27 -- STEP: 350/406 -- GLOBAL_STEP: 350 | > loss_text_ce: nan (nan) | > loss_mel_ce: nan (nan) | > loss: nan (nan) | > current_lr: 5e-06 | > step_time: 0.5357 (0.5509397418158392) | > loader_time: 0.0224 (0.022749708720615953)
--> TIME: 2024-09-05 08:00:56 -- STEP: 400/406 -- GLOBAL_STEP: 400 | > loss_text_ce: nan (nan) | > loss_mel_ce: nan (nan) | > loss: nan (nan) | > current_lr: 5e-06 | > step_time: 0.5422 (0.5508994698524474) | > loader_time: 0.0199 (0.022739327549934404)
> EVALUATION
 | > Synthesizing test sentences.
--> EVAL PERFORMANCE | > avg_loader_time: 0.15962785482406616 (+0.022596657276153564) | > avg_loss_text_ce: nan (+nan) | > avg_loss_mel_ce: nan (+nan) | > avg_loss: nan (+nan)
> EPOCH: 2/1000
--> /home/xj_data/liuwenrui/model/TTS/examples/checkpoints/xtts_ljspeech/GPT_XTTS_LJSpeech_FT-September-05-2024_07+56AM-0000000
> TRAINING (2024-09-05 08:01:04)
--> TIME: 2024-09-05 08:01:31 -- STEP: 44/406 -- GLOBAL_STEP: 450 | > loss_text_ce: nan (nan) | > loss_mel_ce: nan (nan) | > loss: nan (nan) | > current_lr: 5e-06 | > step_time: 0.5423 (0.5685172839598219) | > loader_time: 0.0218 (0.02611354806206443)
--> TIME: 2024-09-05 08:02:00 -- STEP: 94/406 -- GLOBAL_STEP: 500 | > loss_text_ce: nan (nan) | > loss_mel_ce: nan (nan) | > loss: nan (nan) | > current_lr: 5e-06 | > step_time: 0.5477 (0.5601014101758914) | > loader_time: 0.0206 (0.024027073636968085)
--> TIME: 2024-09-05 08:02:29 -- STEP: 144/406 -- GLOBAL_STEP: 550 | > loss_text_ce: nan (nan) | > loss_mel_ce: nan (nan) | > loss: nan (nan) | > current_lr: 5e-06 | > step_time: 0.5499 (0.5568753977616625) | > loader_time: 0.0221 (0.023584759897655915)
--> TIME: 2024-09-05 08:02:58 -- STEP: 194/406 -- GLOBAL_STEP: 600 | > loss_text_ce: nan (nan) | > loss_mel_ce: nan (nan) | > loss: nan (nan) | > current_lr: 5e-06 | > step_time: 0.5477 (0.5556149875994809) | > loader_time: 0.0204 (0.023164422241682858)
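Note that the eval loss is already NaN before any optimizer step runs, so the problem may be in the inputs rather than the training dynamics. I also scanned the batches themselves for non-finite values. This is a generic sketch assuming the loader yields dict-like batches of tensors (which is how the GPT trainer's batches are structured); `scan_batch` is my own helper, not a TTS API:

```python
import torch

def scan_batch(batch: dict) -> list:
    """Return keys of floating-point batch tensors that contain NaN/Inf."""
    bad = []
    for key, value in batch.items():
        # Only float tensors can hold NaN; token-id tensors are skipped.
        if torch.is_tensor(value) and value.is_floating_point():
            if not torch.isfinite(value).all():
                bad.append(key)
    return bad
```

Running this over the first few batches would show whether a corrupt audio file (e.g. an all-zero or clipped wav producing NaN mel frames) is poisoning the loss.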
Environment
Additional context
I have been trying to train for 3 days, and the loss is always NaN.
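For anyone hitting the same thing: enabling PyTorch's anomaly detection makes the backward pass raise at the first operation that produces a NaN, which is much more informative than watching the loss print nan. A toy demonstration (standard PyTorch facility, shown on a dummy op rather than the XTTS trainer):

```python
import torch

# With anomaly detection on, autograd records a traceback for each forward
# op and raises a RuntimeError naming the op whose backward produced NaN.
torch.autograd.set_detect_anomaly(True)

x = torch.tensor([-1.0], requires_grad=True)
y = torch.sqrt(x)  # sqrt of a negative number -> NaN in forward and backward
try:
    y.sum().backward()
except RuntimeError as err:
    print("anomaly detected:", err)
```

Wrapping the recipe's training loop (or just one step) with this flag should pinpoint whether the NaN originates in the text CE head, the mel CE head, or earlier in the GPT stack. It slows training noticeably, so it is only for debugging.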