Revert default behavior of `get_state_dict_from_offload` #3253

kylesayrs · 2024-11-23T02:11:43Z

What does this PR do?

Purpose

Fix a bug caused by a small behavior change in get_state_dict_from_offload that was introduced in accelerate 1.1.0.
This bug only affects models that are executed on the GPU with partial CPU/disk offloading.

Background

Starting in accelerate 1.1.0, get_state_dict_from_offload moves state_dict tensors to the CPU by default.

In the case that a module has parameters on the GPU, get_state_dict_from_offload implements the following procedure:

1. The module starts with a reference to the tensors on the GPU (1).
2. When get_state_dict_from_offload is called, copies of the tensors are created on the CPU (2); these are returned later.
3. After the tensors are copied to the CPU, the reference to the GPU tensors (1) is dropped.
4. After the CPU tensors (2) are moved into the state_dict, the CPU tensors (2) are copied back to the GPU (3).

This procedure is memory efficient because the reference to the GPU tensors (1) is dropped before the GPU tensors (3) are allocated.
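The reference-lifetime argument above can be sketched without a GPU. The snippet below is a toy model only: `ToyTensor`, `ToyModule` and `round_trip` are hypothetical names made up for illustration, not accelerate or torch APIs, and the "GPU" is just a counter. It assumes CPython's reference counting, where an object is freed as soon as its last reference is dropped.

```python
class ToyTensor:
    """Hypothetical stand-in for a tensor; counts how many 'gpu' buffers are alive."""

    live_gpu = 0  # number of 'gpu' buffers currently alive
    peak_gpu = 0  # highest value live_gpu has reached

    def __init__(self, data, device):
        self.data = list(data)
        self.device = device
        if device == "gpu":
            ToyTensor.live_gpu += 1
            ToyTensor.peak_gpu = max(ToyTensor.peak_gpu, ToyTensor.live_gpu)

    def __del__(self):
        # Runs when the last reference disappears (immediately under CPython).
        if self.device == "gpu":
            ToyTensor.live_gpu -= 1

    def to(self, device):
        # Always allocates a new buffer on the target device.
        return ToyTensor(self.data, device)


class ToyModule:
    def __init__(self):
        self.weight = ToyTensor([1.0, 2.0], "gpu")  # (1)


def round_trip(module, state_dict):
    cpu_copy = module.weight.to("cpu")  # (2) copy to the CPU
    module.weight = None                # drop the reference to (1)
    state_dict["weight"] = cpu_copy     # (2) goes into the state_dict
    module.weight = cpu_copy.to("gpu")  # (3) copy back to the GPU


module = ToyModule()
state_dict = {}
round_trip(module, state_dict)
print("peak live GPU buffers:", ToyTensor.peak_gpu)
```

Because (1) is released before (3) is created, the toy counter never sees two GPU buffers alive at once.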

However, in the downstream case of transformers, PreTrainedModel.save_pretrained keeps a reference to the original GPU tensors (1) for longer than needed. Because of this long-lived reference, (1) and (3) are alive at the same time, which increases GPU memory usage.
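A minimal sketch of why an extra reference matters, again as a toy model rather than the actual transformers code path: `GpuTracker`, `ToyTensor` and `save_like` are hypothetical names, and the `held` variable stands in for whatever reference the caller keeps to (1). CPython reference counting is assumed.

```python
class GpuTracker:
    """Hypothetical counter for live and peak 'gpu' buffers."""

    def __init__(self):
        self.live = 0
        self.peak = 0


class ToyTensor:
    """Hypothetical stand-in for a tensor that reports to a GpuTracker."""

    def __init__(self, device, tracker):
        self.device = device
        self.tracker = tracker
        if device == "gpu":
            tracker.live += 1
            tracker.peak = max(tracker.peak, tracker.live)

    def __del__(self):
        if self.device == "gpu":
            self.tracker.live -= 1

    def to(self, device):
        return ToyTensor(device, self.tracker)


def save_like(keep_original_reference):
    """Run the offload round trip and return the peak number of live GPU buffers."""
    tracker = GpuTracker()
    weight = ToyTensor("gpu", tracker)      # (1)
    module = {"weight": weight}
    # Simulates a caller that holds on to (1) for longer than it needs to.
    held = weight if keep_original_reference else None
    del weight

    cpu_copy = module["weight"].to("cpu")   # (2)
    module["weight"] = None                 # the module drops its reference to (1)
    state_dict = {"weight": cpu_copy}
    module["weight"] = cpu_copy.to("gpu")   # (3)
    return tracker.peak


peak_without_extra_ref = save_like(keep_original_reference=False)
peak_with_extra_ref = save_like(keep_original_reference=True)
print(peak_without_extra_ref, peak_with_extra_ref)
```

In the toy model the peak doubles as soon as any outside reference to (1) survives until (3) is allocated, which is the same shape of problem as described for save_pretrained.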

Changes

Change get_state_dict_from_offload so that it does not move parameter tensors by default
- This ensures compatibility with transformers versions released both before and after "Fix save_pretrained for partially offloaded models" (transformers#34890)
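The idea of an opt-in move can be sketched as follows. This is not accelerate's implementation or signature: `collect_state`, its `device` argument and `ToyTensor` are hypothetical names used only to illustrate "do not move unless asked".

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ToyTensor:
    """Hypothetical stand-in for a tensor: a device label plus some values."""

    device: str
    values: tuple

    def to(self, device: str) -> "ToyTensor":
        # Moving to the device the tensor is already on returns the same object.
        if device == self.device:
            return self
        return ToyTensor(device, self.values)


def collect_state(
    params: Dict[str, ToyTensor],
    state_dict: Dict[str, ToyTensor],
    device: Optional[str] = None,
) -> Dict[str, ToyTensor]:
    """Hypothetical helper: copy parameters into state_dict.

    With device=None (the default) tensors are stored as-is and nothing is moved.
    Passing a device opts in to moving each tensor before it is stored.
    """
    for name, tensor in params.items():
        state_dict[name] = tensor if device is None else tensor.to(device)
    return state_dict


params = {"weight": ToyTensor("gpu", (1.0, 2.0))}
default_sd = collect_state(params, {})             # default: no move
moved_sd = collect_state(params, {}, device="cpu")  # explicit opt-in
print(default_sd["weight"].device, moved_sd["weight"].device)
```

With a default that leaves tensors where they are, callers that never asked for a move see no new allocations, while callers that want CPU tensors can still request them explicitly.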

Testing

save_offloaded.py

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("nvidia/Llama-3.1-Nemotron-70B-Instruct-HF", torch_dtype="auto", device_map="auto")
print(model.hf_device_map)
"""
{'model.embed_tokens': 0,
 'model.layers.0': 0, 'model.layers.1': 0, 'model.layers.2': 0, 'model.layers.3': 0,
 'model.layers.4': 0, 'model.layers.5': 0, 'model.layers.6': 0, 'model.layers.7': 0,
 'model.layers.8': 0, 'model.layers.9': 0, 'model.layers.10': 0, 'model.layers.11': 0,
 'model.layers.12': 0, 'model.layers.13': 0, 'model.layers.14': 0, 'model.layers.15': 0,
 'model.layers.16': 0, 'model.layers.17': 0, 'model.layers.18': 0, 'model.layers.19': 0,
 'model.layers.20': 0, 'model.layers.21': 0, 'model.layers.22': 0, 'model.layers.23': 0,
 'model.layers.24': 0, 'model.layers.25': 0, 'model.layers.26': 0, 'model.layers.27': 0,
 'model.layers.28': 0, 'model.layers.29': 0, 'model.layers.30': 0, 'model.layers.31': 0,
 'model.layers.32': 0, 'model.layers.33': 0, 'model.layers.34': 0, 'model.layers.35': 0,
 'model.layers.36': 0, 'model.layers.37': 0, 'model.layers.38': 0, 'model.layers.39': 0,
 'model.layers.40': 0, 'model.layers.41': 0,
 'model.layers.42': 'cpu', 'model.layers.43': 'cpu', 'model.layers.44': 'cpu', 'model.layers.45': 'cpu',
 'model.layers.46': 'cpu', 'model.layers.47': 'cpu', 'model.layers.48': 'cpu', 'model.layers.49': 'cpu',
 'model.layers.50': 'cpu', 'model.layers.51': 'cpu', 'model.layers.52': 'cpu', 'model.layers.53': 'cpu',
 'model.layers.54': 'cpu', 'model.layers.55': 'cpu', 'model.layers.56': 'cpu', 'model.layers.57': 'cpu',
 'model.layers.58': 'cpu', 'model.layers.59': 'cpu', 'model.layers.60': 'cpu', 'model.layers.61': 'cpu',
 'model.layers.62': 'cpu', 'model.layers.63': 'cpu', 'model.layers.64': 'cpu', 'model.layers.65': 'cpu',
 'model.layers.66': 'cpu', 'model.layers.67': 'cpu', 'model.layers.68': 'cpu', 'model.layers.69': 'cpu',
 'model.layers.70': 'cpu', 'model.layers.71': 'cpu', 'model.layers.72': 'cpu', 'model.layers.73': 'cpu',
 'model.layers.74': 'cpu', 'model.layers.75': 'cpu', 'model.layers.76': 'cpu', 'model.layers.77': 'cpu',
 'model.layers.78': 'cpu', 'model.layers.79': 'cpu',
 'model.norm': 'cpu', 'model.rotary_emb': 'cpu', 'lm_head': 'cpu'}
"""

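# Record CUDA allocations while saving so that peak memory can be inspected afterwards
torch.cuda.memory._record_memory_history()
model.save_pretrained("save_dir")
torch.cuda.memory._dump_snapshot("align.pickle")
# Stop recording
torch.cuda.memory._record_memory_history(enabled=None)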

transformers <= 4.46.0 | accelerate < 1.1.0

Original references are not moved

transformers <= 4.46.0 | accelerate == 1.1.0, 1.1.1

Results in bug
Original references are not garbage collected
Calling save_pretrained results in OOM (>80GiB)

transformers <= 4.46.0 | accelerate > 1.1.1 (this branch)

Original references are not moved

transformers > 4.46.0 | accelerate > 1.1.1 (this branch)

Original references are not moved
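For quick reference, the four tested combinations above can be encoded as a small lookup. This is only a sketch of the matrix as written in this PR: `parse` and `expected_outcome` are hypothetical helpers, plain `X.Y.Z` version strings are assumed (no `dev` or `rc` suffixes), and "accelerate > 1.1.1" is taken to mean a release that contains this branch.

```python
def parse(version: str) -> tuple:
    """Turn a plain 'X.Y.Z' string into a tuple of ints for comparison."""
    return tuple(int(part) for part in version.split("."))


def expected_outcome(transformers_version: str, accelerate_version: str) -> str:
    """Look up the outcome reported in the testing matrix of this PR."""
    tf = parse(transformers_version)
    acc = parse(accelerate_version)
    old_transformers = tf <= (4, 46, 0)

    if old_transformers and acc < (1, 1, 0):
        return "original references are not moved"
    if old_transformers and acc in {(1, 1, 0), (1, 1, 1)}:
        return "bug: original references are not garbage collected"
    if acc > (1, 1, 1):
        # Assumes the release contains this branch, for any transformers version.
        return "original references are not moved"
    # Newer transformers with accelerate 1.1.0/1.1.1 (or older) was not tested here.
    return "not covered by the matrix"


print(expected_outcome("4.46.0", "1.1.1"))
```

Combinations that were not tested in this PR deliberately return "not covered by the matrix" instead of guessing.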

Who can review?

Signed-off-by: Kyle Sayers <[email protected]>

HuggingFaceDocBuilderDev · 2024-11-25T11:21:12Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

muellerzr

Oh dear, thanks for the follow up! We have a release planned shortly after thanksgiving so this can be included in it :)

SunMarc

Thanks for the bug fix !

muellerzr · 2024-11-25T14:34:12Z

cc @ArthurZucker for final

change default to None

d1e8ddd

Signed-off-by: Kyle Sayers <[email protected]>

This was referenced Nov 23, 2024

Fix save_pretrained for partially offloaded models huggingface/transformers#34890

Merged

CUDA OOM while saving compressed Llama-3.1-70b with AutoModelForCausalLM vllm-project/llm-compressor#928

Open

kylesayrs marked this pull request as draft November 23, 2024 02:42

kylesayrs added 2 commits November 23, 2024 09:54

introduce move_to_device argument

4460548

Signed-off-by: Kyle Sayers <[email protected]>

remove move_to_device

dba4824

Signed-off-by: Kyle Sayers <[email protected]>

kylesayrs marked this pull request as ready for review November 23, 2024 15:36

muellerzr approved these changes Nov 25, 2024

View reviewed changes

muellerzr requested a review from SunMarc November 25, 2024 13:45

SunMarc approved these changes Nov 25, 2024

View reviewed changes

muellerzr requested a review from ArthurZucker November 25, 2024 14:34
