RuntimeError: Early stopping conditioned on metric val_loss which is not available. Pass in or modify your EarlyStopping callback to use any of the following: `` #52

Open · Drow999 opened this issue Jan 18, 2022 · 1 comment

Drow999 commented Jan 18, 2022

Hello,
I'm trying to train VPoser with my own train and val datasets, but it always says val_loss is not available. I guessed it might be caused by the small validation set, but the error persists even after I reduce the batch size. I found that some versions of pytorch-lightning seem to have this issue. Could you please tell me which version you use, and share any advice if you have run into this as well?

Epoch 0: 88%|████████▊ | 15/17 [00:00<00:00, 37.29it/s, loss=89.1, v_num=29]
Validating: 0it [00:00, ?it/s]
Validating: 0%| | 0/2 [00:00<?, ?it/s]{'weighted_loss': {'loss_kl': tensor(0.0516, device='cuda:0'), 'loss_mesh_rec': tensor(81.0408, device='cuda:0'), 'matrot': tensor(3.5944, device='cuda:0'), 'loss_total': tensor(84.6868, device='cuda:0')}, 'unweighted_loss': {'v2v': tensor(55.1946, device='cuda:0'), 'loss_total': tensor([55.1946], device='cuda:0')}}
{'weighted_loss': {'loss_kl': tensor(0.0580, device='cuda:0'), 'loss_mesh_rec': tensor(86.9938, device='cuda:0'), 'matrot': tensor(3.5297, device='cuda:0'), 'loss_total': tensor(90.5815, device='cuda:0')}, 'unweighted_loss': {'v2v': tensor(59.5597, device='cuda:0'), 'loss_total': tensor([59.5597], device='cuda:0')}}
[1] -- Epoch 0: val_loss:57.38
[1] -- lr is [0.001]
Traceback (most recent call last):
  File "/home/drow/human_body_prior/src/train.py", line 54, in <module>
    main()
  File "/home/drow/human_body_prior/src/train.py", line 50, in main
    train_vposer_once(job)
  File "/home/drow/human_body_prior/src/human_body_prior/train/vposer_trainer.py", line 351, in train_vposer_once
    trainer.fit(model)
  File "/home/drow/anaconda3/envs/vae/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 552, in fit
    self._run(model)
  File "/home/drow/anaconda3/envs/vae/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 917, in _run
    self._dispatch()
  File "/home/drow/anaconda3/envs/vae/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 985, in _dispatch
    self.accelerator.start_training(self)
  File "/home/drow/anaconda3/envs/vae/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 92, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/home/drow/anaconda3/envs/vae/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 161, in start_training
    self._results = trainer.run_stage()
  File "/home/drow/anaconda3/envs/vae/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 995, in run_stage
    return self._run_train()
  File "/home/drow/anaconda3/envs/vae/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1044, in _run_train
    self.fit_loop.run()
  File "/home/drow/anaconda3/envs/vae/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/home/drow/anaconda3/envs/vae/lib/python3.7/site-packages/pytorch_lightning/loops/fit_loop.py", line 200, in advance
    epoch_output = self.epoch_loop.run(train_dataloader)
  File "/home/drow/anaconda3/envs/vae/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 118, in run
    output = self.on_run_end()
  File "/home/drow/anaconda3/envs/vae/lib/python3.7/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 235, in on_run_end
    self._on_train_epoch_end_hook(processed_outputs)
  File "/home/drow/anaconda3/envs/vae/lib/python3.7/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 276, in _on_train_epoch_end_hook
    trainer_hook(processed_epoch_output)
  File "/home/drow/anaconda3/envs/vae/lib/python3.7/site-packages/pytorch_lightning/trainer/callback_hook.py", line 109, in on_train_epoch_end
    callback.on_train_epoch_end(self, self.lightning_module)
  File "/home/drow/anaconda3/envs/vae/lib/python3.7/site-packages/pytorch_lightning/callbacks/early_stopping.py", line 170, in on_train_epoch_end
    self._run_early_stopping_check(trainer)
  File "/home/drow/anaconda3/envs/vae/lib/python3.7/site-packages/pytorch_lightning/callbacks/early_stopping.py", line 185, in _run_early_stopping_check
    logs
  File "/home/drow/anaconda3/envs/vae/lib/python3.7/site-packages/pytorch_lightning/callbacks/early_stopping.py", line 134, in _validate_condition_metric
    raise RuntimeError(error_msg)
RuntimeError: Early stopping conditioned on metric val_loss which is not available. Pass in or modify your EarlyStopping callback to use any of the following: ``
Epoch 0: 100%|██████████| 17/17 [00:00<00:00, 35.66it/s, loss=89.1, v_num=29]
Epoch 0: 100%|██████████| 17/17 [00:00<00:00, 32.31it/s, loss=89.1, v_num=29]
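
For context on what the error means: the empty backticks at the end ("use any of the following: ``") indicate that trainer.callback_metrics contained no entries at all when EarlyStopping ran its check, i.e. nothing was ever registered with self.log(). Printing the loss (the "[1] -- Epoch 0: val_loss:57.38" line) does not register it. Below is a minimal sketch of what EarlyStopping expects in the pytorch-lightning 1.x API; LitModel is just a placeholder, not the actual VPoser trainer:

import torch
import torch.nn.functional as F
import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping

# Placeholder module (not the VPoser model): shows the self.log() call
# that EarlyStopping depends on.
class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return F.mse_loss(self.layer(x), y)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        val_loss = F.mse_loss(self.layer(x), y)
        # The crucial call: self.log() puts "val_loss" into
        # trainer.callback_metrics, which is where EarlyStopping looks
        # up its monitor key. Merely printing or returning the value
        # is not enough in recent pytorch-lightning releases.
        self.log("val_loss", val_loss, on_epoch=True, prog_bar=True)
        return val_loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

trainer = pl.Trainer(
    max_epochs=10,
    callbacks=[EarlyStopping(monitor="val_loss", mode="min")],
)

If vposer_trainer.py was written against an older pytorch-lightning release that still picked up metrics returned from the validation hooks, pinning that older release might be an alternative to patching the code; I don't know the exact version the authors used, hence the question.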

oomq commented Aug 19, 2022

I'm running into the same problem.
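
In case it helps to debug: a quick check (assuming you can add a hook to the LightningModule) of which metrics EarlyStopping can actually see, since it resolves its monitor key against trainer.callback_metrics:

# Hypothetical diagnostic hook, added to the LightningModule:
def on_validation_epoch_end(self):
    # If 'val_loss' never shows up here, it was printed but never self.log()-ed.
    print(sorted(self.trainer.callback_metrics.keys()))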
