Describe the bug
When fine-tuning XTTS_v2 on LJSpeech with the recipe script (recipes/ljspeech/xtts_v2/train_gpt_xtts.py), the loss is NaN from the very first step.
I have tried reducing the learning rate, changing the batch size, and changing the DDP config, but none of it helps.
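To narrow down where the NaNs first appear, I added a quick finiteness check over the model's parameters and gradients after each step. This is plain PyTorch, not part of the recipe; `nonfinite_tensors` is my own helper:

```python
import torch

def nonfinite_tensors(model: torch.nn.Module) -> list:
    """Return names of parameters/gradients that contain NaN or Inf."""
    bad = []
    for name, p in model.named_parameters():
        if not torch.isfinite(p).all():
            bad.append("param:" + name)
        if p.grad is not None and not torch.isfinite(p.grad).all():
            bad.append("grad:" + name)
    return bad
```

Calling this right after loading the checkpoint (before any training step) tells you whether the restored weights are already corrupt, versus the NaN being produced during the forward/backward pass.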
To Reproduce
Run the recipe script as-is on the LJSpeech dataset: recipes/ljspeech/xtts_v2/train_gpt_xtts.py
Expected behavior
The loss should be a finite value, but it is NaN from the start of training.
Logs
> Training Environment:
|> Backend: Accelerate
|> Mixed precision: False
|> Precision: float32
|> Current device: 0
|> Num. of GPUs: 1
|> Num. of CPUs: 28
|> Num. of Torch Threads: 1
|> Torch seed: 1
|> Torch CUDNN: True
|> Torch CUDNN deterministic: False
|> Torch CUDNN benchmark: False
|> Torch TF32 MatMul: False
> Start Tensorboard: tensorboard --logdir=/home/xj_data/liuwenrui/model/TTS/examples/checkpoints/xtts_ljspeech/GPT_XTTS_LJSpeech_FT-September-05-2024_07+56AM-0000000
> Model has 518442047 parameters
> EPOCH: 0/1000
--> /home/xj_data/liuwenrui/model/TTS/examples/checkpoints/xtts_ljspeech/GPT_XTTS_LJSpeech_FT-September-05-2024_07+56AM-0000000
> Filtering invalid eval samples!!
> Total eval samples after filtering: 131
> EVALUATION
|> Synthesizing test sentences.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
--> EVAL PERFORMANCE | > avg_loader_time: 0.1370311975479126 (+0) | > avg_loss_text_ce: nan (+0) | > avg_loss_mel_ce: nan (+0) | > avg_loss: nan (+0)
> EPOCH: 1/1000
--> /home/xj_data/liuwenrui/model/TTS/examples/checkpoints/xtts_ljspeech/GPT_XTTS_LJSpeech_FT-September-05-2024_07+56AM-0000000
> Sampling by language: dict_keys(['en'])
> TRAINING (2024-09-05 07:57:04)
--> TIME: 2024-09-05 07:57:06 -- STEP: 0/406 -- GLOBAL_STEP: 0 | > loss_text_ce: nan (nan) | > loss_mel_ce: nan (nan) | > loss: nan (nan) | > current_lr: 5e-06 | > step_time: 0.6947 (0.6947054862976074) | > loader_time: 0.9726 (0.9726102352142334)
--> TIME: 2024-09-05 07:57:34 -- STEP: 50/406 -- GLOBAL_STEP: 50 | > loss_text_ce: nan (nan) | > loss_mel_ce: nan (nan) | > loss: nan (nan) | > current_lr: 5e-06 | > step_time: 0.5593 (0.5495537853240967) | > loader_time: 0.0208 (0.025657353401184083)
--> TIME: 2024-09-05 07:58:03 -- STEP: 100/406 -- GLOBAL_STEP: 100 | > loss_text_ce: nan (nan) | > loss_mel_ce: nan (nan) | > loss: nan (nan) | > current_lr: 5e-06 | > step_time: 0.5421 (0.5507096624374387) | > loader_time: 0.027 (0.024191355705261236)
--> TIME: 2024-09-05 07:58:32 -- STEP: 150/406 -- GLOBAL_STEP: 150 | > loss_text_ce: nan (nan) | > loss_mel_ce: nan (nan) | > loss: nan (nan) | > current_lr: 5e-06 | > step_time: 0.5438 (0.5505051151911414) | > loader_time: 0.0248 (0.023627797762552902)
--> TIME: 2024-09-05 07:59:01 -- STEP: 200/406 -- GLOBAL_STEP: 200 | > loss_text_ce: nan (nan) | > loss_mel_ce: nan (nan) | > loss: nan (nan) | > current_lr: 5e-06 | > step_time: 0.5429 (0.5503179728984828) | > loader_time: 0.0192 (0.023366824388504036)
--> TIME: 2024-09-05 07:59:29 -- STEP: 250/406 -- GLOBAL_STEP: 250 | > loss_text_ce: nan (nan) | > loss_mel_ce: nan (nan) | > loss: nan (nan) | > current_lr: 5e-06 | > step_time: 0.5364 (0.5505284156799314) | > loader_time: 0.024 (0.023049613952636726)
--> TIME: 2024-09-05 07:59:58 -- STEP: 300/406 -- GLOBAL_STEP: 300 | > loss_text_ce: nan (nan) | > loss_mel_ce: nan (nan) | > loss: nan (nan) | > current_lr: 5e-06 | > step_time: 0.5382 (0.5508304643630979) | > loader_time: 0.0228 (0.022726009686787927)
--> TIME: 2024-09-05 08:00:27 -- STEP: 350/406 -- GLOBAL_STEP: 350 | > loss_text_ce: nan (nan) | > loss_mel_ce: nan (nan) | > loss: nan (nan) | > current_lr: 5e-06 | > step_time: 0.5357 (0.5509397418158392) | > loader_time: 0.0224 (0.022749708720615953)
--> TIME: 2024-09-05 08:00:56 -- STEP: 400/406 -- GLOBAL_STEP: 400 | > loss_text_ce: nan (nan) | > loss_mel_ce: nan (nan) | > loss: nan (nan) | > current_lr: 5e-06 | > step_time: 0.5422 (0.5508994698524474) | > loader_time: 0.0199 (0.022739327549934404)
> EVALUATION
 | > Synthesizing test sentences.
--> EVAL PERFORMANCE | > avg_loader_time: 0.15962785482406616 (+0.022596657276153564) | > avg_loss_text_ce: nan (+nan) | > avg_loss_mel_ce: nan (+nan) | > avg_loss: nan (+nan)
> EPOCH: 2/1000
--> /home/xj_data/liuwenrui/model/TTS/examples/checkpoints/xtts_ljspeech/GPT_XTTS_LJSpeech_FT-September-05-2024_07+56AM-0000000
> TRAINING (2024-09-05 08:01:04)
--> TIME: 2024-09-05 08:01:31 -- STEP: 44/406 -- GLOBAL_STEP: 450 | > loss_text_ce: nan (nan) | > loss_mel_ce: nan (nan) | > loss: nan (nan) | > current_lr: 5e-06 | > step_time: 0.5423 (0.5685172839598219) | > loader_time: 0.0218 (0.02611354806206443)
--> TIME: 2024-09-05 08:02:00 -- STEP: 94/406 -- GLOBAL_STEP: 500 | > loss_text_ce: nan (nan) | > loss_mel_ce: nan (nan) | > loss: nan (nan) | > current_lr: 5e-06 | > step_time: 0.5477 (0.5601014101758914) | > loader_time: 0.0206 (0.024027073636968085)
--> TIME: 2024-09-05 08:02:29 -- STEP: 144/406 -- GLOBAL_STEP: 550 | > loss_text_ce: nan (nan) | > loss_mel_ce: nan (nan) | > loss: nan (nan) | > current_lr: 5e-06 | > step_time: 0.5499 (0.5568753977616625) | > loader_time: 0.0221 (0.023584759897655915)
--> TIME: 2024-09-05 08:02:58 -- STEP: 194/406 -- GLOBAL_STEP: 600 | > loss_text_ce: nan (nan) | > loss_mel_ce: nan (nan) | > loss: nan (nan) | > current_lr: 5e-06 | > step_time: 0.5477 (0.5556149875994809) | > loader_time: 0.0204 (0.023164422241682858)
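Note that the eval loss is already NaN before any optimizer step runs, so the problem may be in the inputs rather than the training dynamics. I also scanned the batches themselves for non-finite values. This is a generic sketch assuming the loader yields dict-like batches of tensors (which is how the GPT trainer's batches are structured); `scan_batch` is my own helper, not a TTS API:

```python
import torch

def scan_batch(batch: dict) -> list:
    """Return keys of floating-point batch tensors that contain NaN/Inf."""
    bad = []
    for key, value in batch.items():
        # Only float tensors can hold NaN; token-id tensors are skipped.
        if torch.is_tensor(value) and value.is_floating_point():
            if not torch.isfinite(value).all():
                bad.append(key)
    return bad
```

Running this over the first few batches would show whether a corrupt audio file (e.g. an all-zero or clipped wav producing NaN mel frames) is poisoning the loss.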
Environment
Additional context
I have been trying to train for 3 days, and the loss is always NaN.
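For anyone hitting the same thing: enabling PyTorch's anomaly detection makes the backward pass raise at the first operation that produces a NaN, which is much more informative than watching the loss print nan. A toy demonstration (standard PyTorch facility, shown on a dummy op rather than the XTTS trainer):

```python
import torch

# With anomaly detection on, autograd records a traceback for each forward
# op and raises a RuntimeError naming the op whose backward produced NaN.
torch.autograd.set_detect_anomaly(True)

x = torch.tensor([-1.0], requires_grad=True)
y = torch.sqrt(x)  # sqrt of a negative number -> NaN in forward and backward
try:
    y.sum().backward()
except RuntimeError as err:
    print("anomaly detected:", err)
```

Wrapping the recipe's training loop (or just one step) with this flag should pinpoint whether the NaN originates in the text CE head, the mel CE head, or earlier in the GPT stack. It slows training noticeably, so it is only for debugging.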