CUDA Out of Memory Error #129

Open
Dong-HoSeo opened this issue Jun 18, 2024 · 3 comments

@Dong-HoSeo

Environment:

GPU: NVIDIA RTX 3060 12GB
PyTorch Version: 2.3.1+cu118
CUDA Version: 11.8
OS: Linux Ubuntu
Python Version: 3.10.13
Sequence length: 651 amino acids

I am encountering a CUDA Out of Memory error when running the run_inference.py script from the RoseTTAFold-All-Atom repository. The error occurs during the model inference step. Below is the detailed error traceback:
Running PSIPRED
Running hhsearch
Error executing job with overrides: []
Traceback (most recent call last):
File "/home/dhseo/Data_HDD2/RoseTTAFold-All-Atom/rf2aa/run_inference.py", line 206, in main
runner.infer()
File "/home/dhseo/Data_HDD2/RoseTTAFold-All-Atom/rf2aa/run_inference.py", line 155, in infer
outputs = self.run_model_forward(input_feats)
File "/home/dhseo/Data_HDD2/RoseTTAFold-All-Atom/rf2aa/run_inference.py", line 121, in run_model_forward
outputs = recycle_step_legacy(self.model,
File "/home/dhseo/Data_HDD2/RoseTTAFold-All-Atom/rf2aa/training/recycling.py", line 30, in recycle_step_legacy
output_i = ddp_model(**input_i)
File "/home/dhseo/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/dhseo/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/dhseo/Data_HDD2/RoseTTAFold-All-Atom/rf2aa/model/RoseTTAFoldModel.py", line 364, in forward
pair, state = self.templ_emb(t1d, t2d, alpha_t, xyz_t, mask_t, pair, state, use_checkpoint=use_checkpoint, p2p_crop=p2p_crop)
File "/home/dhseo/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/dhseo/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/dhseo/Data_HDD2/RoseTTAFold-All-Atom/rf2aa/model/layers/Embeddings.py", line 335, in forward
templ = self.templ_stack(templ, rbf_feat, t1d, use_checkpoint=use_checkpoint, p2p_crop=p2p_crop) # (B, T, L,L, d_templ)
File "/home/dhseo/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/dhseo/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/dhseo/Data_HDD2/RoseTTAFold-All-Atom/rf2aa/model/layers/Embeddings.py", line 185, in forward
templ = self.block[i_block](templ, rbf_feat, state)
File "/home/dhseo/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/dhseo/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/home/dhseo/Data_HDD2/RoseTTAFold-All-Atom/rf2aa/model/Track_module.py", line 374, in forward
gate = einsum('bli,bmj->blmij', left, right).reshape(B,L,L,-1)
File "/home/dhseo/.local/lib/python3.10/site-packages/opt_einsum/contract.py", line 507, in contract
return _core_contract(operands, contraction_list, backend=backend, **einsum_kwargs)
File "/home/dhseo/.local/lib/python3.10/site-packages/opt_einsum/contract.py", line 591, in _core_contract
new_view = _einsum(einsum_str, *tmp_operands, backend=backend, **einsum_kwargs)
File "/home/dhseo/.local/lib/python3.10/site-packages/opt_einsum/sharing.py", line 151, in cached_einsum
return einsum(*args, **kwargs)
File "/home/dhseo/.local/lib/python3.10/site-packages/opt_einsum/contract.py", line 353, in _einsum
return fn(einsum_str, *operands, **kwargs)
File "/home/dhseo/.local/lib/python3.10/site-packages/opt_einsum/backends/torch.py", line 45, in einsum
return torch.einsum(equation, operands)
File "/home/dhseo/.local/lib/python3.10/site-packages/torch/functional.py", line 380, in einsum
return einsum(equation, *_operands)
File "/home/dhseo/.local/lib/python3.10/site-packages/torch/functional.py", line 385, in einsum
return _VF.einsum(equation, operands) # type: ignore[attr-defined]
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.72 GiB. GPU has a total capacity of 11.76 GiB of which 871.88 MiB is free. Including non-PyTorch memory, this process has 10.90 GiB memory in use. Of the allocated memory 9.07 GiB is allocated by PyTorch, and 873.77 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Any insights or suggestions on how to address this CUDA out of memory error would be greatly appreciated. Is there any way to further optimize the memory usage or any specific configurations that can help mitigate this issue?
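
For what it's worth, the traceback itself suggests setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. A minimal sketch of one way to apply that hint; the variable is read when PyTorch initializes its CUDA allocator, so it is safest to set it before importing torch (exporting it in the shell before launching run_inference.py should be equivalent):

import os

# Apply the allocator hint from the error message. The setting is read
# when PyTorch's CUDA caching allocator is initialized, so set it before
# importing torch (or export it in the shell beforehand).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # imported only after the environment variable is in place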

Thank you in advance for your assistance!

@inuyasha10121

I was encountering a similar problem and may have found a potential hot-fix (though if the issue stems from the sequence simply being too long to fit, this will not help). I have been trying to model a protein-RNA complex and found that even the protein alone (516 residues) would not run on a GPU with identical specs to yours. After some print debugging, I discovered that the model gets through the first round of prediction fine but crashes during the second. Digging in, I saw that two tensors are popped from the recycle inputs between rounds; I suspect they are not garbage-collected quickly enough for their GPU memory to be freed before the next cycle starts. Adding a line that forces the CUDA cache to be cleared frees that memory again, and now I'm able to predict the protein structure.

rf2aa/training/recycling.py

def add_recycle_inputs(network_input, output_i, i_cycle, gpu, return_raw=False, use_checkpoint=False):
    input_i = {}
    for key in network_input:
        if key in ['msa_latent', 'msa_full', 'seq']:
            input_i[key] = network_input[key][:,i_cycle].to(gpu, non_blocking=True)
        else:
            input_i[key] = network_input[key]

    L = input_i["msa_latent"].shape[2]
    msa_prev, pair_prev, _, alpha, mask_recycle = output_i
    xyz_prev = ChemData().INIT_CRDS.reshape(1,1,ChemData().NTOTAL,3).repeat(1,L,1,1).to(gpu, non_blocking=True)

    input_i['msa_prev'] = msa_prev
    input_i['pair_prev'] = pair_prev
    input_i['xyz'] = xyz_prev
    input_i['mask_recycle'] = mask_recycle
    input_i['sctors'] = alpha
    input_i['return_raw'] = return_raw
    input_i['use_checkpoint'] = use_checkpoint

    input_i.pop('xyz_prev')
    input_i.pop('alpha_prev')
    torch.cuda.empty_cache() #JME: Force GPU to clear popped tensors
    return input_i
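
If you want to check whether the cache clear is actually returning memory between cycles, a small hypothetical helper (not part of the repository code; the names are mine) can log the allocator state around the added line:

import torch

def log_cuda_mem(tag: str) -> None:
    # memory_allocated(): bytes held by live tensors
    # memory_reserved(): bytes the caching allocator is holding overall
    alloc_gib = torch.cuda.memory_allocated() / 2**30
    reserved_gib = torch.cuda.memory_reserved() / 2**30
    print(f"[{tag}] allocated={alloc_gib:.2f} GiB, reserved={reserved_gib:.2f} GiB")

# Hypothetical usage around the line added in add_recycle_inputs:
#   log_cuda_mem("before empty_cache")
#   torch.cuda.empty_cache()
#   log_cuda_mem("after empty_cache")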

@Sue-Fwl

Sue-Fwl commented Jul 17, 2024

(quoting @inuyasha10121's hot-fix and recycling.py patch from the comment above)

Unfortunately, I'm still facing the same issue as OP after attempting your solution.

@fglaser

fglaser commented Sep 24, 2024

Dear all,

Has this issue been solved?
I am running a complex of a protein with an increasing number of copies of the same ligand (1, 2, 3, etc.). It works perfectly well up to 3 ligands, but with 4 or more it fails. I tried reducing the number of CYCLES as suggested, but it did not work.

I am not an expert, so I would appreciate your help if there is a workaround.

I also tried adding
torch.cuda.empty_cache() #JME: Force GPU to clear popped tensors
but it did not help. Here is the output I get.

  • 16:32:56.514 INFO: Input file = BSA_F4/A/hhblits/t000_.1e-10.a3m

  • 16:32:56.514 INFO: Output file = BSA_F4/A/hhblits/t000_.1e-10.id90cov75.a3m

  • 16:32:56.574 INFO: Input file = BSA_F4/A/hhblits/t000_.1e-10.a3m

  • 16:32:56.574 INFO: Output file = BSA_F4/A/hhblits/t000_.1e-10.id90cov50.a3m

Running HHblits against UniRef30 with E-value cutoff 1e-6

  • 16:33:52.702 INFO: Input file = BSA_F4/A/hhblits/t000_.1e-6.a3m

  • 16:33:52.702 INFO: Output file = BSA_F4/A/hhblits/t000_.1e-6.id90cov75.a3m

  • 16:33:52.814 INFO: Input file = BSA_F4/A/hhblits/t000_.1e-6.a3m

  • 16:33:52.814 INFO: Output file = BSA_F4/A/hhblits/t000_.1e-6.id90cov50.a3m

Running HHblits against UniRef30 with E-value cutoff 1e-3

  • 16:35:32.972 INFO: Input file = BSA_F4/A/hhblits/t000_.1e-3.a3m

  • 16:35:32.972 INFO: Output file = BSA_F4/A/hhblits/t000_.1e-3.id90cov75.a3m

  • 16:35:33.095 INFO: Input file = BSA_F4/A/hhblits/t000_.1e-3.a3m

  • 16:35:33.095 INFO: Output file = BSA_F4/A/hhblits/t000_.1e-3.id90cov50.a3m

Running HHblits against BFD with E-value cutoff 1e-3

  • 16:40:09.185 INFO: Input file = BSA_F4/A/hhblits/t000_.1e-3.bfd.a3m

  • 16:40:09.185 INFO: Output file = BSA_F4/A/hhblits/t000_.1e-3.bfd.id90cov75.a3m

  • 16:40:09.239 INFO: Input file = BSA_F4/A/hhblits/t000_.1e-3.bfd.a3m

  • 16:40:09.239 INFO: Output file = BSA_F4/A/hhblits/t000_.1e-3.bfd.id90cov50.a3m

Running PSIPRED
Running hhsearch
Error executing job with overrides: []
Traceback (most recent call last):
File "/home/fabian/RoseTTAFold-All-Atom/rf2aa/run_inference.py", line 206, in main
runner.infer()
File "/home/fabian/RoseTTAFold-All-Atom/rf2aa/run_inference.py", line 155, in infer
outputs = self.run_model_forward(input_feats)
File "/home/fabian/RoseTTAFold-All-Atom/rf2aa/run_inference.py", line 121, in run_model_forward
outputs = recycle_step_legacy(self.model,
File "/home/fabian/RoseTTAFold-All-Atom/rf2aa/training/recycling.py", line 30, in recycle_step_legacy
output_i = ddp_model(**input_i)
File "/home/fabian/miniforge3/envs/RFAA/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/fabian/RoseTTAFold-All-Atom/rf2aa/model/RoseTTAFoldModel.py", line 364, in forward
pair, state = self.templ_emb(t1d, t2d, alpha_t, xyz_t, mask_t, pair, state, use_checkpoint=use_checkpoint, p2p_crop=p2p_crop)
File "/home/fabian/miniforge3/envs/RFAA/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/fabian/RoseTTAFold-All-Atom/rf2aa/model/layers/Embeddings.py", line 335, in forward
templ = self.templ_stack(templ, rbf_feat, t1d, use_checkpoint=use_checkpoint, p2p_crop=p2p_crop) # (B, T, L,L, d_templ)
File "/home/fabian/miniforge3/envs/RFAA/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/fabian/RoseTTAFold-All-Atom/rf2aa/model/layers/Embeddings.py", line 185, in forward
templ = self.block[i_block](templ, rbf_feat, state)
File "/home/fabian/miniforge3/envs/RFAA/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/fabian/RoseTTAFold-All-Atom/rf2aa/model/Track_module.py", line 412, in forward
pair = pair + self.drop_row(self.row_attn(pair, rbf_feat))
File "/home/fabian/miniforge3/envs/RFAA/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/fabian/RoseTTAFold-All-Atom/rf2aa/model/layers/Attention_module.py", line 458, in forward
value = self.to_v(pair).reshape(B, L, L, self.h, self.dim)
File "/home/fabian/miniforge3/envs/RFAA/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/fabian/miniforge3/envs/RFAA/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.91 GiB (GPU 0; 23.68 GiB total capacity; 21.02 GiB already allocated; 1.43 GiB free; 22.00 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

Thanks a lot,
Fabian
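
For anyone hitting this on longer inputs: the tensors that fail to allocate in both tracebacks above are pair-style objects of shape roughly L x L x d, so peak memory grows quadratically with the total number of residues and atoms. That is why adding one more ligand can tip an otherwise working run over the limit, and why reducing the number of recycles alone may not lower the per-cycle peak. A back-of-the-envelope sketch, with the feature width d inferred only to make the arithmetic line up (it is not a value read from the RFAA config):

def pair_tensor_gib(L: int, d: int, bytes_per_elem: int = 4) -> float:
    # memory for a single L x L x d float32 tensor, in GiB
    return L * L * d * bytes_per_elem / 2**30

# The 1.72 GiB allocation in the first traceback is consistent with one
# 651 x 651 x ~1089-element float32 tensor (651 is the reported sequence
# length; 1089 is an assumed width chosen to match the reported figure):
print(f"{pair_tensor_gib(651, 1089):.2f} GiB")  # prints ~1.72 GiB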
