I am trying to retrain VPoser on the AMASS dataset, which I downloaded from the official website. I followed the instructions in the README but still hit the error below. After about 200 epochs of training, line 56 of src/human_body_prior/models/vposer_model.py, which constructs torch.distributions.normal.Normal, receives NaN values. It seems to be caused by a data issue.
I would appreciate it if anyone could explain why this happens and how to fix it, or offer any insight.
#training_jobs to be done: 1
GPU available: True, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py:1580: UserWarning: GPU available but not used. Set the gpus flag in your trainer `Trainer(gpus=1)` or script `--gpus=1`.
rank_zero_warn(
[V08_16] -- Total Trainable Parameters Count in vp_model is 0.94 M.
| Name | Type | Params
---------------------------------------
0 | vp_model | VPoser | 936 K
1 | bm_train | BodyModel | 0
---------------------------------------
936 K Trainable params
0 Non-trainable params
936 K Total params
3.745 Total estimated model params size (MB)
Validation sanity check: 0%|| 0/2 [00:00<?, ?it/s]loss_kl:0.02 loss_mesh_rec:1.02 matrot:4.36 jtr:0.54 loss_total:5.95
Validation sanity check: 50%|█████████████████████████████████████████████████████████████ | 1/2 [00:02<00:02, 2.05s/it]loss_kl:0.02 loss_mesh_rec:1.00 matrot:4.34 jtr:0.53 loss_total:5.89
Validation sanity check: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:02<00:00, 1.07it/s][V08_16] -- Epoch 0: val_loss:0.51
[V08_16] -- lr is [0.001]
Training: 0it [00:00, ?it/s][V08_16] -- Created a git archive backup at /data/hualin/vposer_train_gen/V08_16/code/vposer_2023_08_17_13_44_54.tar.gz
Epoch 0: 0%|| 0/7637 [00:00<?, ?it/s]loss_kl:0.02 loss_mesh_rec:1.00 matrot:4.30 jtr:0.53 loss_total:5.86
/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/closure.py:35: LightningDeprecationWarning: One of the returned values {'log', 'progress_bar'} has a `grad_fn`. We will detach it automatically but this behaviour will change in v1.6. Please detach it manually: `return {'loss': ..., 'something': something.detach()}`
  rank_zero_deprecation(
Epoch 0: 0%|...
Epoch 0: 1%|█▌ | 107/7637 [00:25<30:11, 4.16it/s, loss=0.665, v_num=30]loss_kl:0.08 loss_mesh_rec:0.13 matrot:0.36 jtr:0.11 loss_total:0.69
Epoch 0: 1%|█▌ | 108/7637 [00:25<30:09, 4.16it/s, loss=0.663, v_num=30]loss_kl:0.08 loss_mesh_rec:0.13 matrot:0.37 jtr:0.11 loss_total:0.69
Epoch 0: 1%|█▌ | 109/7637 [00:26<30:08, 4.16it/s, loss=0.666, v_num=30]
Traceback (most recent call last):
  File "V02_05.py", line 55, in <module>
    main()
  File "V02_05.py", line 51, in main
    train_vposer_once(job)
  File "/home/hualin//vposer_66/src/human_body_prior/train/vposer_trainer.py", line 361, in train_vposer_once
    trainer.fit(model)
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 737, in fit
    self._call_and_handle_interrupt(
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 682, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 772, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1195, in _run
    self._dispatch()
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1275, in _dispatch
    self.training_type_plugin.start_training(self)
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
    self._results = trainer.run_stage()
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1285, in run_stage
    return self._run_train()
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1315, in _run_train
    self.fit_loop.run()
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 234, in advance
    self.epoch_loop.run(data_fetcher)
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 193, in advance
    batch_output = self.batch_loop.run(batch, batch_idx)
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
    outputs = self.optimizer_loop.run(split_batch, optimizers, batch_idx)
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 215, in advance
    result = self._run_optimization(
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 266, in _run_optimization
    self._optimizer_step(optimizer, opt_idx, batch_idx, closure)
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 378, in _optimizer_step
    lightning_module.optimizer_step(
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/core/lightning.py", line 1652, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/core/optimizer.py", line 164, in step
    trainer.accelerator.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 336, in optimizer_step
    self.precision_plugin.optimizer_step(model, optimizer, opt_idx, closure, **kwargs)
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 163, in optimizer_step
    optimizer.step(closure=closure, **kwargs)
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/torch/optim/optimizer.py", line 140, in wrapper
    out = func(*args, **kwargs)
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/torch/optim/optimizer.py", line 23, in _use_grad
    ret = func(self, *args, **kwargs)
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/torch/optim/adam.py", line 183, in step
    loss = closure()
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 148, in _wrap_closure
    closure_result = closure()
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 160, in __call__
    self._result = self.closure(*args, **kwargs)
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 142, in closure
    step_output = self._step_fn()
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 435, in _training_step
    training_step_output = self.trainer.accelerator.training_step(step_kwargs)
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 216, in training_step
    return self.training_type_plugin.training_step(*step_kwargs.values())
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 213, in training_step
    return self.model.training_step(*args, **kwargs)
  File "/home/hualin//vposer_66/src/human_body_prior/train/vposer_trainer.py", line 232, in training_step
    drec = self(batch['pose_body'].view(-1, 63))
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/hualin//vposer_66/src/human_body_prior/train/vposer_trainer.py", line 107, in forward
    return self.vp_model(pose_body)
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/hualin//vposer_66/src/human_body_prior/models/vposer_model.py", line 121, in forward
    q_z = self.encode(pose_body)
  File "/home/hualin//vposer_66/src/human_body_prior/models/vposer_model.py", line 100, in encode
    return self.encoder_net(pose_body)
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/torch/nn/modules/container.py", line 204, in forward
    input = module(input)
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/hualin///vposer_66/src/human_body_prior/models/vposer_model.py", line 56, in forward
    return torch.distributions.normal.Normal(self.mu(Xout), F.softplus(self.logvar(Xout)))
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/torch/distributions/normal.py", line 56, in __init__
    super(Normal, self).__init__(batch_shape, validate_args=validate_args)
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/torch/distributions/distribution.py", line 56, in __init__
    raise ValueError(
ValueError: Expected parameter loc (Tensor of shape (128, 32)) of distribution Normal(loc: torch.Size([128, 32]), scale: torch.Size([128, 32])) to satisfy the constraint Real(), but found invalid values:
tensor([[nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        ...,
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan]], grad_fn=<AddmmBackward0>)
I've identified and fixed a bug in the geodesic_loss_R class, which is part of the loss function used in VPoser. The problem is in how the geodesic loss for rotation matrices handles the cosine value: it is computed from the trace of the relative rotation and passed to torch.acos without being clamped. The modified code in src/human_body_prior/tools/angle_continuous_repres.py is shown below:
import torch
import torch.nn as nn


class geodesic_loss_R(nn.Module):
    def __init__(self, reduction='batchmean'):
        super(geodesic_loss_R, self).__init__()
        self.reduction = reduction
        self.eps = 1e-6

    # batch geodesic loss for rotation matrices
    def bgdR(self, m1, m2):
        m = torch.bmm(m1, m2.transpose(1, 2))  # batch*3*3
        # cosine of the relative rotation angle, from the trace of m
        cos = (m[:, 0, 0] + m[:, 1, 1] + m[:, 2, 2] - 1) / 2
        # the fix: keep cos strictly inside (-1, 1) before calling acos
        cos = torch.clamp(cos, -1 + self.eps, 1 - self.eps)
        return torch.acos(cos)

    def forward(self, ypred, ytrue):
        theta = self.bgdR(ypred, ytrue)
        if self.reduction == 'mean':
            return torch.mean(theta)
        if self.reduction == 'batchmean':
            return torch.mean(torch.sum(theta, dim=theta.shape[1:]))
        else:
            return theta
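For anyone wondering why the clamp matters, here is a small plain-Python illustration of the underlying math. It is my own sketch rather than code from the repository, and the helper names (clamp, acos_grad, safe_angle) are hypothetical. The derivative of acos(x) is -1/sqrt(1 - x^2), which is unbounded as x approaches ±1, and floating-point rounding can push the trace-derived cosine slightly outside [-1, 1]. In that case math.acos raises an error, whereas torch.acos returns nan. Either effect could plausibly be what poisons the weights and later shows up as NaN in the encoder output, though I have not confirmed that this is the only cause.

```python
import math

EPS = 1e-6  # same value as self.eps in the snippet above


def clamp(x, lo, hi):
    """Scalar stand-in for torch.clamp."""
    return max(lo, min(hi, x))


def acos_grad(x):
    """Analytic derivative of acos(x): -1 / sqrt(1 - x^2), valid for |x| < 1."""
    return -1.0 / math.sqrt(1.0 - x * x)


def safe_angle(cos_value, eps=EPS):
    """acos with the argument clamped strictly inside (-1, 1)."""
    return math.acos(clamp(cos_value, -1.0 + eps, 1.0 - eps))


# A cosine that rounding has pushed just past 1 (hypothetical value).
overshoot = 1.0 + 1e-7

try:
    math.acos(overshoot)
    unclamped_ok = True
except ValueError:
    # math.acos raises outside [-1, 1]; torch.acos would return nan here.
    unclamped_ok = False

angle = safe_angle(overshoot)              # small but finite angle
grad_at_bound = acos_grad(1.0 - EPS)       # large but finite slope

print("unclamped acos succeeded:", unclamped_ok)
print("clamped angle:", angle)
print("gradient at the clamp bound:", grad_at_bound)
```

The clamp trades a tiny bias in the angle near 0 and pi for a bounded gradient, which is usually an acceptable cost in a training loss.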