
### Describe the bug #19

Open · Karliz24 opened this issue Sep 5, 2024 · 0 comments
Karliz24 commented Sep 5, 2024

Describe the bug

When fine-tuning XTTS_v2 on LJSpeech following the recipe script (recipes/ljspeech/xtts_v2/train_gpt_xtts.py), the loss is always NaN.
I tried reducing the learning rate, changing the batch size, and changing the DDP config, but none of it helped.
[screenshot]
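Since the loss is already NaN at step 0 (see the logs below), the usual suspects are non-finite values in the input batches or an op that explodes in the backward pass, rather than the learning rate. A minimal diagnostic sketch (hypothetical helper, not part of the recipe) that could be dropped into the training loop:

```python
# Hypothetical diagnostic helpers, assumed names - not part of train_gpt_xtts.py.
import torch

def check_batch(batch: dict) -> None:
    """Assert that no floating-point tensor in the batch contains NaN/Inf."""
    for name, value in batch.items():
        if torch.is_tensor(value) and value.is_floating_point():
            assert torch.isfinite(value).all(), f"non-finite values in '{name}'"

# Anomaly mode is slow, but it reports which op first produced NaN
# in the backward pass instead of silently propagating it.
torch.autograd.set_detect_anomaly(True)
```

Calling `check_batch` on each batch before the forward pass would distinguish "the data is already broken" from "the model produces the NaN".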

To Reproduce

[screenshot]

Expected behavior

The loss should be a finite value, but it is NaN from the very first step.

Logs

> Training Environment:
 | > Backend: Accelerate
 | > Mixed precision: False
 | > Precision: float32
 | > Current device: 0
 | > Num. of GPUs: 1
 | > Num. of CPUs: 28
 | > Num. of Torch Threads: 1
 | > Torch seed: 1
 | > Torch CUDNN: True
 | > Torch CUDNN deterministic: False
 | > Torch CUDNN benchmark: False
 | > Torch TF32 MatMul: False
 > Start Tensorboard: tensorboard --logdir=/home/xj_data/liuwenrui/model/TTS/examples/checkpoints/xtts_ljspeech/GPT_XTTS_LJSpeech_FT-September-05-2024_07+56AM-0000000

 > Model has 518442047 parameters

 > EPOCH: 0/1000
 --> /home/xj_data/liuwenrui/model/TTS/examples/checkpoints/xtts_ljspeech/GPT_XTTS_LJSpeech_FT-September-05-2024_07+56AM-0000000
 > Filtering invalid eval samples!!
 > Total eval samples after filtering: 131

 > EVALUATION 

 | > Synthesizing test sentences.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.

  --> EVAL PERFORMANCE
     | > avg_loader_time: 0.1370311975479126 (+0)
     | > avg_loss_text_ce: nan (+0)
     | > avg_loss_mel_ce: nan (+0)
     | > avg_loss: nan (+0)


 > EPOCH: 1/1000
 --> /home/xj_data/liuwenrui/model/TTS/examples/checkpoints/xtts_ljspeech/GPT_XTTS_LJSpeech_FT-September-05-2024_07+56AM-0000000
 > Sampling by language: dict_keys(['en'])

 > TRAINING (2024-09-05 07:57:04) 

   --> TIME: 2024-09-05 07:57:06 -- STEP: 0/406 -- GLOBAL_STEP: 0
     | > loss_text_ce: nan  (nan)
     | > loss_mel_ce: nan  (nan)
     | > loss: nan  (nan)
     | > current_lr: 5e-06 
     | > step_time: 0.6947  (0.6947054862976074)
     | > loader_time: 0.9726  (0.9726102352142334)


   --> TIME: 2024-09-05 07:57:34 -- STEP: 50/406 -- GLOBAL_STEP: 50
     | > loss_text_ce: nan  (nan)
     | > loss_mel_ce: nan  (nan)
     | > loss: nan  (nan)
     | > current_lr: 5e-06 
     | > step_time: 0.5593  (0.5495537853240967)
     | > loader_time: 0.0208  (0.025657353401184083)


   --> TIME: 2024-09-05 07:58:03 -- STEP: 100/406 -- GLOBAL_STEP: 100
     | > loss_text_ce: nan  (nan)
     | > loss_mel_ce: nan  (nan)
     | > loss: nan  (nan)
     | > current_lr: 5e-06 
     | > step_time: 0.5421  (0.5507096624374387)
     | > loader_time: 0.027  (0.024191355705261236)


   --> TIME: 2024-09-05 07:58:32 -- STEP: 150/406 -- GLOBAL_STEP: 150
     | > loss_text_ce: nan  (nan)
     | > loss_mel_ce: nan  (nan)
     | > loss: nan  (nan)
     | > current_lr: 5e-06 
     | > step_time: 0.5438  (0.5505051151911414)
     | > loader_time: 0.0248  (0.023627797762552902)


   --> TIME: 2024-09-05 07:59:01 -- STEP: 200/406 -- GLOBAL_STEP: 200
     | > loss_text_ce: nan  (nan)
     | > loss_mel_ce: nan  (nan)
     | > loss: nan  (nan)
     | > current_lr: 5e-06 
     | > step_time: 0.5429  (0.5503179728984828)
     | > loader_time: 0.0192  (0.023366824388504036)


   --> TIME: 2024-09-05 07:59:29 -- STEP: 250/406 -- GLOBAL_STEP: 250
     | > loss_text_ce: nan  (nan)
     | > loss_mel_ce: nan  (nan)
     | > loss: nan  (nan)
     | > current_lr: 5e-06 
     | > step_time: 0.5364  (0.5505284156799314)
     | > loader_time: 0.024  (0.023049613952636726)


   --> TIME: 2024-09-05 07:59:58 -- STEP: 300/406 -- GLOBAL_STEP: 300
     | > loss_text_ce: nan  (nan)
     | > loss_mel_ce: nan  (nan)
     | > loss: nan  (nan)
     | > current_lr: 5e-06 
     | > step_time: 0.5382  (0.5508304643630979)
     | > loader_time: 0.0228  (0.022726009686787927)


   --> TIME: 2024-09-05 08:00:27 -- STEP: 350/406 -- GLOBAL_STEP: 350
     | > loss_text_ce: nan  (nan)
     | > loss_mel_ce: nan  (nan)
     | > loss: nan  (nan)
     | > current_lr: 5e-06 
     | > step_time: 0.5357  (0.5509397418158392)
     | > loader_time: 0.0224  (0.022749708720615953)


   --> TIME: 2024-09-05 08:00:56 -- STEP: 400/406 -- GLOBAL_STEP: 400
     | > loss_text_ce: nan  (nan)
     | > loss_mel_ce: nan  (nan)
     | > loss: nan  (nan)
     | > current_lr: 5e-06 
     | > step_time: 0.5422  (0.5508994698524474)
     | > loader_time: 0.0199  (0.022739327549934404)


 > EVALUATION 

 | > Synthesizing test sentences.

  --> EVAL PERFORMANCE
     | > avg_loader_time: 0.15962785482406616 (+0.022596657276153564)
     | > avg_loss_text_ce: nan (+nan)
     | > avg_loss_mel_ce: nan (+nan)
     | > avg_loss: nan (+nan)


 > EPOCH: 2/1000
 --> /home/xj_data/liuwenrui/model/TTS/examples/checkpoints/xtts_ljspeech/GPT_XTTS_LJSpeech_FT-September-05-2024_07+56AM-0000000

 > TRAINING (2024-09-05 08:01:04) 

   --> TIME: 2024-09-05 08:01:31 -- STEP: 44/406 -- GLOBAL_STEP: 450
     | > loss_text_ce: nan  (nan)
     | > loss_mel_ce: nan  (nan)
     | > loss: nan  (nan)
     | > current_lr: 5e-06 
     | > step_time: 0.5423  (0.5685172839598219)
     | > loader_time: 0.0218  (0.02611354806206443)


   --> TIME: 2024-09-05 08:02:00 -- STEP: 94/406 -- GLOBAL_STEP: 500
     | > loss_text_ce: nan  (nan)
     | > loss_mel_ce: nan  (nan)
     | > loss: nan  (nan)
     | > current_lr: 5e-06 
     | > step_time: 0.5477  (0.5601014101758914)
     | > loader_time: 0.0206  (0.024027073636968085)


   --> TIME: 2024-09-05 08:02:29 -- STEP: 144/406 -- GLOBAL_STEP: 550
     | > loss_text_ce: nan  (nan)
     | > loss_mel_ce: nan  (nan)
     | > loss: nan  (nan)
     | > current_lr: 5e-06 
     | > step_time: 0.5499  (0.5568753977616625)
     | > loader_time: 0.0221  (0.023584759897655915)


   --> TIME: 2024-09-05 08:02:58 -- STEP: 194/406 -- GLOBAL_STEP: 600
     | > loss_text_ce: nan  (nan)
     | > loss_mel_ce: nan  (nan)
     | > loss: nan  (nan)
     | > current_lr: 5e-06 
     | > step_time: 0.5477  (0.5556149875994809)
     | > loader_time: 0.0204  (0.023164422241682858)

Environment

{
    "CUDA": {
        "GPU": [
            "NVIDIA A100-SXM4-80GB",
            "NVIDIA A100-SXM4-80GB"
        ],
        "available": true,
        "version": "11.8"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "2.1.2+cu118",
        "TTS": "0.22.0",
        "numpy": "1.26.3"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.10.14",
        "version": "#1 SMP Mon Jul 22 15:34:17 CST 2024"
    }
}

Additional context

This has been a very frustrating experience. I have been trying to train it for 3 days, and the loss is always NaN.
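Once a single non-finite loss reaches the optimizer, the weights are corrupted and every later step stays NaN, which matches the logs above. A common mitigation (hypothetical helper with assumed names, not the recipe's actual loop) is to skip non-finite steps and clip gradients:

```python
# Hypothetical guard around the optimizer step - assumed names, offered as a
# sketch of a common NaN mitigation, not the actual XTTS training loop.
import torch

def safe_step(loss, model, optimizer, max_norm: float = 1.0) -> bool:
    """Skip optimizer steps with non-finite loss; clip gradients otherwise."""
    if not torch.isfinite(loss):
        optimizer.zero_grad(set_to_none=True)
        return False  # batch skipped, weights untouched
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    return True
```

This does not fix the root cause (here the loss is NaN from step 0, so the input data or the loss computation itself should be inspected first), but it prevents one bad batch from poisoning the whole run.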

Originally posted by @r666ay in coqui-ai/TTS#3988
