
Tensor size issue #1

Open
arunraja-hub opened this issue Nov 28, 2024 · 1 comment

@arunraja-hub

When I tried to run training with python train.py params_x1x3x4_diffusion_mosesaq_20240824 0, as suggested in the README, I got the following error:

RuntimeError: Trying to resize storage that is not resizable

According to lucidrains/denoising-diffusion-pytorch#248, the suggested fix is to change num_workers in the dataloader to 0, but that resulted in the following error:

RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 1176 but got size 595 for tensor number 1 in the list.
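For reference, the change I attempted looks roughly like this (a sketch only; the exact DataLoader construction and arguments in train.py differ, and the only thing I changed was num_workers):

```python
# Sketch of the attempted workaround -- the actual DataLoader call in train.py
# may use different arguments; only num_workers was changed, to 0.
train_loader = torch_geometric.loader.DataLoader(
    dataset,
    batch_size=params['batch_size'],  # whatever batch size train.py already uses
    shuffle=True,
    num_workers=0,  # previously a positive value
)
```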

Could you please provide some guidance on this?

@keiradams
Collaborator

Hi! I have not experienced this error, so I suspect it has something to do with our different training setups or package versions.

To help debug, can you try the following:

  • Make sure you can successfully run the inference code provided in the RUNME_{}.ipynb notebooks.
  • In train.py, make sure you can call dataset[0] after initializing dataset = HeteroDataset(...).
  • In train.py, make sure you can call next(iter(train_loader)) after initializing train_loader = torch_geometric.loader.DataLoader(...), with both batch_size = 1 and batch_size > 1 (see the sketch after this list).
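A minimal sketch of the second and third checks (this is only an illustration, meant to be pasted into train.py after the dataset is constructed there, not run standalone):

```python
# Sketch only: add these lines to train.py right after `dataset` is created
# there, rather than running this snippet on its own.

# Check 2: indexing a single example should work.
print(dataset[0])

# Check 3: building a loader and collating a batch should work,
# both for batch_size = 1 and for batch_size > 1.
import torch_geometric
for bs in (1, 4):
    loader = torch_geometric.loader.DataLoader(dataset, batch_size=bs)
    print(next(iter(loader)))
```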

If all of that works, then I would guess it is related to an issue with DDP (distributed training) in Pytorch-Lightning on your particular system setup. Are you trying to train with 1 GPU? On a CPU? On multiple GPUs? The parameters in parameters/params_x1x3x4_diffusion_mosesaq_20240824.py specify 'num_gpus': 2 and 'multiprocessing_spawn': True. Either of those could be causing issues with your specific setup.
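For example, a hypothetical edit to that parameters file to rule out multi-GPU / spawn-related problems (I'm assuming the file defines a params dictionary; leave every other entry unchanged):

```python
# In parameters/params_x1x3x4_diffusion_mosesaq_20240824.py (hypothetical edit;
# keep all other entries exactly as they are).
params = {
    # ... all other parameters unchanged ...
    'num_gpus': 1,                   # was 2
    'multiprocessing_spawn': False,  # was True
}
```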

Also, does this error occur at the start of the training epochs? Or mid-way through training?

Additionally, make sure that the versions of your packages match those listed in the README, particularly your PyTorch Lightning, PyTorch, and PyG versions.
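For example, you can print the installed versions and compare them against the README:

```python
# Print installed versions to compare against those listed in the README.
import torch
import torch_geometric
import pytorch_lightning

print('torch:', torch.__version__)
print('torch_geometric:', torch_geometric.__version__)
print('pytorch_lightning:', pytorch_lightning.__version__)
```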

It would also help if you could provide the complete error traceback.
