Support sequential cpu offloading with torchao quantized tensors #3085

a-r-r-o-w · 2024-09-06T05:08:45Z

What does this PR do?

As discussed internally in https://huggingface.slack.com/archives/C068ZAHJZCZ/p1725454261264919

In the Diffusers-TorchAO experiments, enabling sequential CPU offloading on a pipeline whose modeling components are quantized with torchao raises the following error:

Traceback

Traceback (most recent call last):
  File "/home/aryan/work/diffusers/dump10.py", line 228, in <module>
    pipe.enable_sequential_cpu_offload()
  File "/home/aryan/work/diffusers/src/diffusers/pipelines/pipeline_utils.py", line 1104, in enable_sequential_cpu_offload
    cpu_offload(model, device, offload_buffers=offload_buffers)
  File "/raid/aryan/nightly-venv/lib/python3.10/site-packages/accelerate/big_modeling.py", line 204, in cpu_offload
    attach_align_device_hook(
  File "/raid/aryan/nightly-venv/lib/python3.10/site-packages/accelerate/hooks.py", line 512, in attach_align_device_hook
    attach_align_device_hook(
  File "/raid/aryan/nightly-venv/lib/python3.10/site-packages/accelerate/hooks.py", line 512, in attach_align_device_hook
    attach_align_device_hook(
  File "/raid/aryan/nightly-venv/lib/python3.10/site-packages/accelerate/hooks.py", line 512, in attach_align_device_hook
    attach_align_device_hook(
  [Previous line repeated 4 more times]
  File "/raid/aryan/nightly-venv/lib/python3.10/site-packages/accelerate/hooks.py", line 503, in attach_align_device_hook
    add_hook_to_module(module, hook, append=True)
  File "/raid/aryan/nightly-venv/lib/python3.10/site-packages/accelerate/hooks.py", line 161, in add_hook_to_module
    module = hook.init_hook(module)
  File "/raid/aryan/nightly-venv/lib/python3.10/site-packages/accelerate/hooks.py", line 308, in init_hook
    set_module_tensor_to_device(module, name, "meta")
  File "/raid/aryan/nightly-venv/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 440, in set_module_tensor_to_device
    new_value = param_cls(new_value, requires_grad=old_value.requires_grad).to(device)
TypeError: AffineQuantizedTensor.__new__() got an unexpected keyword argument 'requires_grad'
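
The last two frames show the mechanism: the value is rebuilt by calling its own class with a `requires_grad` keyword, which only works if that class's constructor accepts it. Below is a minimal, dependency-free sketch of that failure mode. `ToyParameter`, `ToyQuantizedTensor`, `rebuild_like`, and `try_rebuild` are hypothetical stand-ins written for illustration; they are not accelerate or torchao code.

```python
class ToyParameter:
    """Hypothetical stand-in for a parameter class whose constructor accepts requires_grad."""

    def __new__(cls, data, requires_grad=True):
        obj = super().__new__(cls)
        obj.data = data
        obj.requires_grad = requires_grad
        return obj


class ToyQuantizedTensor:
    """Hypothetical stand-in for a tensor subclass with a different constructor signature."""

    requires_grad = False

    def __new__(cls, data, scale):
        obj = super().__new__(cls)
        obj.data = data
        obj.scale = scale
        return obj


def rebuild_like(old_value, new_data):
    # Same pattern as the failing line in the traceback:
    # call the old value's class and forward requires_grad as a keyword.
    param_cls = type(old_value)
    return param_cls(new_data, requires_grad=old_value.requires_grad)


def try_rebuild(old_value, new_data):
    """Return "ok" on success, or the TypeError message on failure."""
    try:
        rebuild_like(old_value, new_data)
        return "ok"
    except TypeError as exc:
        return f"TypeError: {exc}"


print(try_rebuild(ToyParameter([1.0]), [2.0]))
print(try_rebuild(ToyQuantizedTensor([1], 0.5), [2]))
```

The first call succeeds because `ToyParameter.__new__` accepts `requires_grad`; the second fails with a `TypeError` about the unexpected keyword, mirroring the error reported above.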

MRE:

import torch
from diffusers import AutoencoderKLCogVideoX, CogVideoXTransformer3DModel, CogVideoXPipeline
from diffusers.utils import export_to_video
from transformers import T5EncoderModel
from torchao.quantization import quantize_, int8_weight_only

# Either "THUDM/CogVideoX-2b" or "THUDM/CogVideoX-5b"
model_id = "THUDM/CogVideoX-5b"
quantization = int8_weight_only

text_encoder = T5EncoderModel.from_pretrained(model_id, subfolder="text_encoder", torch_dtype=torch.bfloat16)
quantize_(text_encoder, quantization())

transformer = CogVideoXTransformer3DModel.from_pretrained(model_id, subfolder="transformer", torch_dtype=torch.bfloat16)
quantize_(transformer, quantization())

vae = AutoencoderKLCogVideoX.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.bfloat16)
quantize_(vae, quantization())

pipe = CogVideoXPipeline.from_pretrained(
    model_id,
    text_encoder=text_encoder,
    transformer=transformer,
    vae=vae,
    torch_dtype=torch.bfloat16,
)
pipe.enable_sequential_cpu_offload()

video = pipe(
    prompt="a panda dancing",
    num_inference_steps=1,
).frames[0]

export_to_video(video, "output.mp4", fps=8)

This PR adds the discussed/proposed fix. On this branch, the memory usage is as follows for CogVideoX-5b:

memory=0.008
max_memory=3.081
max_reserved=3.908
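
The PR description does not spell out the fix, so the following is only a sketch of one way such an error can be avoided: special-case the quantized class by name and rebuild it from its own fields instead of forwarding `requires_grad`. Everything here (`ToyParam`, `ToyQuantized`, `QUANTIZED_CLASS_NAMES`, `move_to_device`) is a hypothetical illustration and should not be read as the code merged in this PR.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ToyParam:
    """Hypothetical ordinary parameter: constructor accepts requires_grad."""
    data: tuple
    requires_grad: bool = True
    device: str = "cpu"


@dataclass(frozen=True)
class ToyQuantized:
    """Hypothetical quantized value: constructor has no requires_grad argument."""
    data: tuple
    scale: float
    device: str = "cpu"


# Assumption: quantized classes are recognised by class name.
QUANTIZED_CLASS_NAMES = {"ToyQuantized"}


def move_to_device(old_value, device):
    """Rebuild old_value with its own class, then place it on `device`."""
    param_cls = type(old_value)
    if param_cls.__name__ in QUANTIZED_CLASS_NAMES:
        # Special case: pass only the fields this class's constructor understands.
        rebuilt = param_cls(old_value.data, old_value.scale)
    else:
        # Generic path: forwarding requires_grad is safe for ordinary parameters.
        rebuilt = param_cls(old_value.data, requires_grad=old_value.requires_grad)
    return replace(rebuilt, device=device)


print(move_to_device(ToyQuantized((1, 2), 0.5), "meta"))
print(move_to_device(ToyParam((1.0,), requires_grad=False), "meta"))
```

The design choice in this sketch is to keep the generic path unchanged and add a narrow branch for the class with the incompatible constructor, so ordinary parameters are unaffected.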

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@SunMarc @jerryzh168 @sayakpaul

HuggingFaceDocBuilderDev · 2024-09-06T05:12:20Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

SunMarc

LGTM! Thanks!

a-r-r-o-w added 2 commits September 6, 2024 07:02

add support for AffineQuantizedTensor

89d012b

ruff

7e92671

sayakpaul requested review from SunMarc and muellerzr September 6, 2024 05:21

SunMarc approved these changes Sep 6, 2024

SunMarc merged commit 5ad982a into huggingface:main Sep 6, 2024
25 checks passed

a-r-r-o-w deleted the torchao-diffusers-sequential-cpu-offload branch September 6, 2024 06:49

a-r-r-o-w mentioned this pull request Sep 14, 2024

gradio error- for inference sayakpaul/diffusers-torchao#34

Closed
