GPU Out of Memory Issue #137

Open
dongho-Han opened this issue Apr 2, 2024 · 6 comments

dongho-Han commented Apr 2, 2024

When I try to run evaluation with your code, I hit a GPU out-of-memory error.
Specifically, it happens when running this command:

CUDA_VISIBLE_DEVICES=0,1,2,3 mpirun -n 4 python entry.py evaluate --conf_files configs/seem/focalt_unicl_lang_v1.yaml --overrides COCO.INPUT.IMAGE_SIZE 1024 MODEL.DECODER.HIDDEN_DIM 512 MODEL.ENCODER.CONVS_DIM 512 MODEL.ENCODER.MASK_DIM 512 VOC.TEST.BATCH_SIZE_TOTAL 8 TEST.BATCH_SIZE_TOTAL 8 REF.TEST.BATCH_SIZE_TOTAL 8 FP16 True WEIGHT True RESUME_FROM ./pretrained/seem_focalt_v1.pt

Could you share how much memory is needed for evaluation?

Error log:

  File "/home/Segment-Everything-Everywhere-All-At-Once/entry.py", line 75, in <module>
      main()
    File "/home/Segment-Everything-Everywhere-All-At-Once/entry.py", line 70, in main
      trainer.eval()
    File "/home/Segment-Everything-Everywhere-All-At-Once/trainer/default_trainer.py", line 79, in eval
      results = self._eval_on_set(self.save_folder)
    File "/home/Segment-Everything-Everywhere-All-At-Once/trainer/default_trainer.py", line 87, in _eval_on_set
      results = self.pipeline.evaluate_model(self, save_folder)
    File "/home/Segment-Everything-Everywhere-All-At-Once/./pipeline/XDecoderPipeline.py", line 155, in evaluate_model
      outputs = model(batch, mode=eval_type)
    File "/root/anaconda3/envs/seem/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
      return self._call_impl(*args, **kwargs)
    File "/root/anaconda3/envs/seem/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
      return forward_call(*args, **kwargs)
    File "/home/Segment-Everything-Everywhere-All-At-Once/modeling/BaseModel.py", line 19, in forward
      outputs = self.model(*inputs, **kwargs)
    File "/root/anaconda3/envs/seem/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
      return self._call_impl(*args, **kwargs)
    File "/root/anaconda3/envs/seem/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
      return forward_call(*args, **kwargs)
    File "/home/Segment-Everything-Everywhere-All-At-Once/modeling/architectures/seem_model_v1.py", line 318, in forward
      return self.evaluate(batched_inputs)
    File "/home/Segment-Everything-Everywhere-All-At-Once/modeling/architectures/seem_model_v1.py", line 387, in evaluate
      outputs = self.sem_seg_head(features, target_queries=queries_grounding)
    File "/root/anaconda3/envs/seem/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
      return self._call_impl(*args, **kwargs)
    File "/root/anaconda3/envs/seem/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
      return forward_call(*args, **kwargs)
    File "/home/Segment-Everything-Everywhere-All-At-Once/modeling/body/xdecoder_head.py", line 99, in forward
      return self.layers(features, mask, target_queries, target_vlp, task, extra)
    File "/home/Segment-Everything-Everywhere-All-At-Once/modeling/body/xdecoder_head.py", line 102, in layers
      mask_features, transformer_encoder_features, multi_scale_features = self.pixel_decoder.forward_features(features)
    File "/home/Segment-Everything-Everywhere-All-At-Once/modeling/vision/encoder/transformer_encoder_fpn.py", line 293, in forward_features
      cur_fpn = lateral_conv(x)
    File "/root/anaconda3/envs/seem/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
      return self._call_impl(*args, **kwargs)
    File "/root/anaconda3/envs/seem/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
      return forward_call(*args, **kwargs)
    File "/root/anaconda3/envs/seem/lib/python3.9/site-packages/detectron2/layers/wrappers.py", line 110, in forward
      x = self.norm(x)
    File "/root/anaconda3/envs/seem/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
      return self._call_impl(*args, **kwargs)
    File "/root/anaconda3/envs/seem/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
      return forward_call(*args, **kwargs)
    File "/root/anaconda3/envs/seem/lib/python3.9/site-packages/torch/nn/modules/normalization.py", line 279, in forward
      return F.group_norm(
    File "/root/anaconda3/envs/seem/lib/python3.9/site-packages/torch/nn/functional.py", line 2558, in group_norm
      return torch.group_norm(input, num_groups, weight, bias, eps, torch.backends.cudnn.enabled)
  torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.41 GiB. GPU 0 has a total capacty of 23.64 GiB of which 386.50 MiB is free. Process 2385114 has 4.12 GiB memory in use. Process 2385112 has 17.04 GiB memory in use. Process 2385111 has 1.05 GiB memory in use. Process 2385113 has 1.05 GiB memory in use. Of the allocated memory 3.07 GiB is allocated by PyTorch, and 860.55 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I used 4 Titan RTX GPUs with 24576 MiB of memory each.
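The error message above lists four separate processes holding memory on GPU 0. A minimal sketch, using only the numbers quoted in that message, that totals the per-process usage; the helper name `per_process_usage_gib` is made up for this illustration:

```python
import re

# Excerpt of the OOM message quoted above, kept verbatim (including its typo).
OOM_MESSAGE = (
    "GPU 0 has a total capacty of 23.64 GiB of which 386.50 MiB is free. "
    "Process 2385114 has 4.12 GiB memory in use. "
    "Process 2385112 has 17.04 GiB memory in use. "
    "Process 2385111 has 1.05 GiB memory in use. "
    "Process 2385113 has 1.05 GiB memory in use."
)


def per_process_usage_gib(message):
    """Return {pid: GiB in use} parsed from a PyTorch CUDA OOM message."""
    pattern = r"Process (\d+) has ([\d.]+) GiB memory in use"
    return {int(pid): float(gib) for pid, gib in re.findall(pattern, message)}


usage = per_process_usage_gib(OOM_MESSAGE)
total_gib = sum(usage.values())
print(len(usage), "processes on GPU 0 using", round(total_gib, 2), "GiB in total")
```

The four processes together hold roughly 23.26 GiB, which with the 386.50 MiB reported free accounts for the card's 23.64 GiB. That suggests all four mpirun processes were placed on GPU 0 rather than spread across the four cards.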


dongho-Han commented Apr 4, 2024

@MaureenZOU @jwyang
Could you look into this issue?
I also get the error when using seem_samvitb with the same command as in assets/readmes/EVAL.md.
Which values should I change to run your code without running out of GPU memory? As a first step, I changed the batch size to 2, but it still fails.

In INSTALL.md, you mentioned

CUDA enabled GPU with Memory > 8GB (Evaluation)

but I think something is wrong with my setup.
When I check the GPU status, only one GPU is used, even when I change CUDA_VISIBLE_DEVICES and the mpirun process count. The mpirun process count only seems to control the number of concurrent processes on that one GPU.
The image below shows the status when I try to evaluate with 8 GPUs.
Did you use MPI to distribute the work across GPUs or across CPUs?
[Screenshot 2024-04-05 021752]
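For readers hitting the same symptom, here is a hedged sketch of one common way to give each MPI process its own GPU. It assumes mpirun comes from Open MPI, which exports `OMPI_COMM_WORLD_LOCAL_RANK` to every process it launches. The helper names are hypothetical, and the thread does not confirm whether the repository already does this internally.

```python
import os


def local_rank_from_env(env=None):
    """Read the per-node rank that Open MPI exports to each launched process."""
    env = os.environ if env is None else env
    return int(env.get("OMPI_COMM_WORLD_LOCAL_RANK", "0"))


def device_index_for_rank(local_rank, num_visible_gpus):
    """Map a local rank onto a visible GPU (round-robin if oversubscribed)."""
    if num_visible_gpus < 1:
        raise ValueError("no visible GPUs")
    return local_rank % num_visible_gpus


# Simulate 4 ranks with CUDA_VISIBLE_DEVICES=0,1,2,3: each rank gets its own device.
for rank in range(4):
    fake_env = {"OMPI_COMM_WORLD_LOCAL_RANK": str(rank)}
    index = device_index_for_rank(local_rank_from_env(fake_env), 4)
    print("rank", rank, "-> cuda:%d" % index)

# In a real process (assumption: PyTorch is the framework in use), the index
# would be applied before the model is built, for example:
#   torch.cuda.set_device(device_index_for_rank(local_rank_from_env(),
#                                               torch.cuda.device_count()))
```

If every rank ends up on device 0 instead, the result is several processes sharing one card, which matches the status described above.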


juju0111 commented May 3, 2024

same problem!!


Beck-127 commented May 9, 2024

Quoting dongho-Han's Apr 4, 2024 comment above (the seem_samvitb out-of-memory error and only one GPU being used):

Have you solved this problem?

jwyang (Collaborator) commented May 9, 2024

Hi @dongho-Han, I noticed that your script uses TEST.BATCH_SIZE_TOTAL 8 on 4 GPUs. Can you try changing it to 4?
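The arithmetic behind this suggestion can be sketched as follows. It assumes, as the name suggests but the thread does not state outright, that TEST.BATCH_SIZE_TOTAL is split evenly across the mpirun processes. The function name is hypothetical.

```python
def per_gpu_batch_size(batch_size_total, num_processes):
    """Images per process, assuming the total is split evenly across processes."""
    if num_processes < 1:
        raise ValueError("need at least one process")
    if batch_size_total % num_processes != 0:
        raise ValueError("total batch size must be divisible by the process count")
    return batch_size_total // num_processes


# The original command: 8 images in total across 4 processes.
print(per_gpu_batch_size(8, 4))
# The suggested setting: 4 images in total across 4 processes.
print(per_gpu_batch_size(4, 4))
```

Under that assumption, the original command evaluates two images per GPU at a time, while the suggested value brings it down to one.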

MaureenZOU (Collaborator) commented

Same suggestion. Evaluating multiple images on a single GPU causes: 1. Inaccurate evaluation (because of padding). 2. GPU OOM. I usually use 1 GPU for evaluation.
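To illustrate the padding point: when images of different sizes are stacked into one batch, each is typically padded up to the largest height and width in the batch. The sketch below is a generic illustration with made-up image sizes, not the repository's batching code.

```python
def padded_batch_shape(sizes):
    """Smallest (height, width) that fits every image in the batch."""
    max_h = max(h for h, _ in sizes)
    max_w = max(w for _, w in sizes)
    return max_h, max_w


def padding_fraction(sizes):
    """Fraction of the batched tensor's pixels that are padding, not image."""
    max_h, max_w = padded_batch_shape(sizes)
    padded_pixels = len(sizes) * max_h * max_w
    real_pixels = sum(h * w for h, w in sizes)
    return 1.0 - real_pixels / padded_pixels


# Hypothetical sizes: one portrait and one landscape image, as (height, width).
sizes = [(1024, 683), (768, 1024)]
print(padded_batch_shape(sizes))
print(round(padding_fraction(sizes), 3))

# A batch of one image needs no padding at all.
print(padding_fraction([(1024, 683)]))
```

With a batch size of one per GPU, the padded area disappears, which removes both the extra memory and any effect the padding may have on predictions.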

@tyuvraj
Copy link

tyuvraj commented Sep 17, 2024

Hi, I was facing the same issue. Adding a with torch.no_grad() block solved it. You can find the gist file here.
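The gist itself is not reproduced in this thread, so the following is only a toy sketch of the technique, with a small stand-in model instead of SEEM. It shows that a forward pass inside `torch.no_grad()` does not record an autograd graph, so intermediate activations need not be kept for a backward pass.

```python
import torch
from torch import nn

# Stand-in model (assumption: any nn.Module behaves the same way here).
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.GroupNorm(2, 8),
)
model.eval()
batch = torch.randn(1, 3, 32, 32)

# Default behaviour: autograd tracks the forward pass.
out_with_grad = model(batch)

# Inference only: no graph is recorded inside this block.
with torch.no_grad():
    out_no_grad = model(batch)

print(out_with_grad.requires_grad, out_no_grad.requires_grad)  # True False
```

In the traceback above, the forward pass is issued from `outputs = model(batch, mode=eval_type)` in pipeline/XDecoderPipeline.py, so that call is presumably where such a block would go. The exact change made in the gist is not confirmed here.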
