Problems in running the forecasting network #7

Open
varun1352 opened this issue Jul 12, 2024 · 1 comment
Comments

@varun1352

Hi, I trained the interpolator network on the spring mesh database and killed the run after a few epochs. This saved the last.ckpt file locally in the results/checkpoints directory and gave me a run ID for it (which, oddly, is not alphanumeric but an 8-digit integer). As a result, I can't run the command
python run.py experiment=spring_mesh_dyffusion diffusion.interpolator_run_id=<WANDB_RUN_ID>
because it keeps raising AssertionError: run_id must be a string, but is <class 'int'>: 48422040

So, as mentioned in dyffusion.yaml, I set run_id to the ID of that run and the file name to "last.ckpt", and then ran python run.py experiment=spring_mesh_dyffusion
But I am now facing this error:

  File "/home/vd2298/reimplementation/src/train.py", line 66, in run_model
    ckpt_path2 = wandb.restore(ckpt_filename, run_path=wandb.run.path, replace=True, root=os.getcwd()).name
  File "/scratch/vd2298/envs/dyffusion/lib/python3.12/site-packages/wandb/sdk/wandb_run.py", line 4225, in restore
    raise ValueError(f"File {name} not found in {run_path or root}.")
ValueError: File last.ckpt not found in vd2298-new-york-university/DYffusion-spring-mesh/48422040.

This happens even though my checkpoint is saved locally and the file last.ckpt also shows up on my wandb:
[Screenshot 2024-07-12 at 5 57 22 PM]

So I tried the other option of specifying the local path in dyffusion.yaml, but that doesn't work either: it keeps going back to an older run_id and never starts the forecasting network. Can you please suggest what I should try next, or point out what I am doing wrong?
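For the local-path route, here is a minimal sketch of a check that fails loudly when the checkpoint is missing, rather than silently falling back to an older run. The helper name `resolve_local_checkpoint` is hypothetical and not part of the repository; only the results/checkpoints directory and the last.ckpt file name come from the description above.

```python
from pathlib import Path


def resolve_local_checkpoint(ckpt_dir, filename="last.ckpt"):
    """Hypothetical helper: return the absolute path of a local checkpoint.

    Raises FileNotFoundError if the file is not there, so a wrong
    directory is reported immediately.
    """
    path = Path(ckpt_dir).expanduser().resolve() / filename
    if not path.is_file():
        raise FileNotFoundError(f"No checkpoint at {path}")
    return path


# Directory name taken from the issue text; adjust to the actual location.
try:
    print(resolve_local_checkpoint("results/checkpoints"))
except FileNotFoundError as err:
    print(err)
```

Printing the resolved absolute path makes it easier to confirm that the path written in dyffusion.yaml points at the file that was actually saved.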

@salvaRC
Collaborator

salvaRC commented Oct 4, 2024

Hi!

Regarding your first problem: if the ID is a number, can you just pass it as a string and see if that fixes it? E.g., use diffusion.interpolator_run_id="<WANDB_RUN_ID>". Let me know if it doesn't.
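One caveat when trying this from a POSIX shell: plain double quotes are consumed by the shell before run.py sees them, so the value may still arrive as a bare number. The sketch below uses `shlex` from the standard library to show what the program would receive. It assumes that run.py parses Hydra-style overrides, where an unquoted all-digit value becomes an integer; `coerce_run_id` is a hypothetical helper, not something from the repository.

```python
import shlex


def coerce_run_id(run_id):
    """Hypothetical helper: accept an int or str run ID and return a str."""
    return str(run_id)


override = 'diffusion.interpolator_run_id="48422040"'

# shlex.split follows POSIX shell quoting rules, so it mimics what run.py receives.
unquoted = shlex.split("python run.py " + override)
# shlex.quote wraps the override in single quotes so the inner quotes survive.
quoted = shlex.split("python run.py " + shlex.quote(override))

print(unquoted[-1])  # diffusion.interpolator_run_id=48422040
print(quoted[-1])    # diffusion.interpolator_run_id="48422040"
print(shlex.quote(override))  # the form to type in the shell
```

Under that assumption, wrapping the whole override in single quotes, as in 'diffusion.interpolator_run_id="48422040"', is the form most likely to reach the assertion as a string.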

Regarding your second problem, that sounds weird, since it does look like the correct file is saved on wandb. If it's still a problem, can you email me so that we can look into it together? Otherwise, it would help if you could provide the exact command that you ran and a public wandb link to the problematic run.
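In the meantime, one thing that may be worth checking is the exact name under which the file is stored in the run, since a checkpoint uploaded under a subdirectory would not match the bare name last.ckpt. This is only a guess at a possible cause, not a diagnosis. The sketch below uses the wandb public API (`wandb.Api().run(...).files()`); the helper names and the example file names are made up for illustration.

```python
import posixpath


def find_checkpoint_candidates(file_names, target="last.ckpt"):
    """Return the stored names whose final path component equals `target`."""
    return [name for name in file_names if posixpath.basename(name) == target]


def list_run_file_names(run_path):
    """List the names of files uploaded to a run (needs wandb and network access)."""
    import wandb  # imported lazily so the helper above works without wandb installed

    run = wandb.Api().run(run_path)
    return [f.name for f in run.files()]


# Made-up names for illustration; they are NOT taken from the actual run.
names = ["config.yaml", "checkpoints/last.ckpt", "wandb-summary.json"]
print(find_checkpoint_candidates(names))  # ['checkpoints/last.ckpt']
```

For the real run, the input would be `list_run_file_names("vd2298-new-york-university/DYffusion-spring-mesh/48422040")`, using the run path from the error message.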

Thanks!
