How to recover if training is interrupted #538

Open
xyt000-xjj opened this issue Aug 17, 2023 · 4 comments

Comments

@xyt000-xjj

Describe the bug
A clear and concise description of what the bug is.

To Reproduce
Steps to reproduce the behavior:

  1. In '...' directory, run command '...'
  2. See error (copy&paste full log, including exceptions and stacktraces).

Please copy&paste text instead of screenshots for better searchability.

Expected behavior
A clear and concise description of what you expected to happen.

Screenshots
If applicable, add screenshots to help explain your problem.

Desktop (please complete the following information):

  • OS: [e.g. Linux Ubuntu 20.04, Windows 10]
  • PyTorch version (e.g., pytorch 1.9.0)
  • CUDA toolkit version (e.g., CUDA 11.4)
  • NVIDIA driver version
  • GPU [e.g., Titan V, RTX 3090]
  • Docker: did you use Docker? If yes, specify docker image URL (e.g., nvcr.io/nvidia/pytorch:21.08-py3)

Additional context
Add any other context about the problem here.

PDillis commented Oct 27, 2023

Just point the --resume argument in train.py at the last .pkl that was saved and that you want to resume from. Note that the new run's image counter starts from 0 again, so account for that when setting how many images to train for with --kimg.
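
Not from the thread, but for concreteness, a resume invocation might look roughly like the sketch below. The output directory, dataset path, run folder, and snapshot name are placeholders; keep whatever other options your original run used (e.g. --cfg, --gamma, --batch).

```bash
# Placeholder paths; point --resume at the last snapshot of the interrupted run.
python train.py \
  --outdir=~/training-runs \
  --data=~/datasets/my-dataset.zip \
  --gpus=1 \
  --kimg=5000 \
  --resume=~/training-runs/00000-my-dataset/network-snapshot-000400.pkl
```

Because the counter restarts at 0, --kimg here is the additional thousands of images to train on, not the cumulative total across both runs.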

@therealjr

@PDillis I understand that training will resume from that point. However, when I save a snapshot at tick 0 it shows nothing but blurred images. Why are the images resetting entirely? Shouldn't it be generating images like the ones it was trained on, continuing from the point where it left off?

@dookiethedog

My GAN crashed and I was extremely annoyed because I was hitting the exact same issue, so I decided to read into the code. You can set the initial augmentation strength and kimg in the training_loop.py file; this helps, but it does not actually continue the training from where it last ran, it only gives it a rough idea of where to start off again. The devs don't seem to care whether it crashes, as there is no proper resume code. I was able to modify the code and create a proper resume function; however, I will not be able to resume my first GAN, since my changes weren't in place yet and there is no way to pull the settings needed from it. At least for future runs I will be fine, with everything I need stored alongside the pickle file.
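
This is not the patch referenced above, but for anyone rolling their own: a minimal sketch of the idea, assuming you write the training counters out next to each snapshot and read them back when resuming. The function names and the sidecar-file convention are made up; cur_nimg and the ADA probability correspond to whatever your version of training_loop.py tracks.

```python
# Sketch only: save/restore the counters that the stock training loop resets to zero.
import json
import os

def save_resume_state(snapshot_pkl, cur_nimg, augment_p):
    """Write the training counters to a sidecar file next to the network snapshot."""
    state = {'cur_nimg': int(cur_nimg), 'augment_p': float(augment_p)}
    with open(snapshot_pkl + '.resume.json', 'w') as f:
        json.dump(state, f)

def load_resume_state(snapshot_pkl):
    """Return (cur_nimg, augment_p) from the sidecar file, or zeros if it is missing."""
    path = snapshot_pkl + '.resume.json'
    if not os.path.exists(path):
        return 0, 0.0
    with open(path) as f:
        state = json.load(f)
    return state['cur_nimg'], state['augment_p']
```

On resume, the training loop would then initialise its image counter and augmentation probability from these values instead of starting at zero, so ticks, kimg, and ADA strength continue from where the crashed run stopped.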

@frankthequeen

> I was actually able to modify the code and create a perfect resume function

Could you share your code?
