Community contribution: enabling device_map="auto" support for more vision and multimodal models #29786

Open · 23 of 59 tasks

amyeroberts opened this issue Mar 21, 2024 · 15 comments · Fixed by #29989, #30207, #30379 or #30409 · May be fixed by #33108
Labels
Good Second Issue Issues that are more difficult to do than "Good First" issues - give it a try if you want!



amyeroberts commented Mar 21, 2024

Feature request

transformers models can be easily loaded across multiple devices using device_map="auto". This automatically allocates weights across the available devices, e.g. GPUs, and offloads any remaining weights to CPU and then to disk as necessary. This is useful when running inference with large models.
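
For reference, a minimal sketch of what this looks like from the user side (the checkpoint name below is only an example):

from transformers import AutoModelForCausalLM

# Weights are placed on the available GPUs first, then offloaded to CPU and
# finally to disk if they still don't fit.
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b",  # example checkpoint; any supported model works
    device_map="auto",
)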

To enable this, _no_split_modules has to be defined in the model's PreTrainedModel subclass, e.g. as is done here for LLaMA. It lists the layers which should not be split across devices and should contain as few modules as possible.
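
Concretely, the change is usually a one-line class attribute, roughly as in the LLaMA example linked above (the module names in the list are model-specific):

class LlamaPreTrainedModel(PreTrainedModel):
    ...
    # A decoder layer's weights must all be placed on the same device.
    _no_split_modules = ["LlamaDecoderLayer"]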

Steps to add

  • Pick a model to work on and open a PR - comment on this issue to say which model you're working on
  • Define _no_split_modules in the PreTrainedModel subclass. Try with _no_split_modules = [] first
  • Enable testing
    • Ensure the following tests are not skipped for the model: test_disk_offload_bin, test_disk_offload_safetensors, test_cpu_offload, test_model_parallelism, test_model_parallel_beam_search
    • Run the tests in a multi-GPU environment: pytest tests/models/{MODEL_NAME}/test_modeling_{MODEL_NAME}.py -vv -k "offload or parallelism"

Models

Motivation

Enable a powerful HF feature for all of our vision models

Your contribution

Ping me for review 🤗

amyeroberts added the Good Second Issue label Mar 21, 2024

jla524 commented Mar 23, 2024

I'm working on Resnet

edit: I'm running into a strange issue where the tests pass on one system and fail on another. I'm going to close the PR for now and investigate further.


tnnandi commented Apr 7, 2024

BERT is not included in the above list of models. Does that mean device_map='auto' will be available for BERT models in an upcoming version of HF transformers? I still see the message that BertForSequenceClassification does not support device_map='auto' with transformers 4.39.3.

amyeroberts (Collaborator, Author) commented:

Hi @tnnandi, the list above is just for vision models that I got from a simple grep and filtering. device_map="auto" isn't yet enabled for BERT, c.f. #25296. If you or anyone in the community would like to add it, we'd be happy to review a PR.


jla524 commented Apr 10, 2024

Hi @amyeroberts, hope you are well :) I'm not sure why, but it looks like the unit tests are passing even without defining _no_split_modules. I'm testing on systems with two GPUs. Any idea why this is happening?

$ pytest tests/models/align/test_modeling_align.py -k "parallel or offload"
=================================================================== test session starts ===================================================================
platform linux -- Python 3.10.13, pytest-7.4.4, pluggy-1.4.0
rootdir: /transformers
configfile: pyproject.toml
plugins: hypothesis-6.98.10, xdist-3.5.0, timeout-2.3.1, anyio-4.3.0
collected 262 items / 241 deselected / 21 selected                                                                                                        

tests/models/align/test_modeling_align.py ....................s                                                                                     [100%]

<warnings redacted>
================================================ 20 passed, 1 skipped, 241 deselected, 2 warnings in 5.33s ================================================
$ pytest tests/models/bert/test_modeling_bert.py -k "parallel or offload"
=================================================================== test session starts ===================================================================
platform linux -- Python 3.10.13, pytest-7.4.4, pluggy-1.4.0
rootdir: /transformers
configfile: pyproject.toml
plugins: hypothesis-6.98.10, xdist-3.5.0, timeout-2.3.1, anyio-4.3.0
collected 156 items / 148 deselected / 8 selected                                                                                                         

tests/models/bert/test_modeling_bert.py ........                                                                                                    [100%]

<warnings redacted>
====================================================== 8 passed, 148 deselected, 2 warnings in 4.59s ======================================================

amyeroberts (Collaborator, Author) commented:

@jla524 It's because these tests are skipped if _no_split_modules isn't defined (the default for models), e.g. here for test_disk_offload_bin. This is admittedly confusing and should really be done with self.skipTest.
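
A minimal, self-contained analogue of that guard (hypothetical classes, just to illustrate why a missing _no_split_modules shows up as a pass rather than a skip):

import unittest

class FakeModel:
    _no_split_modules = None  # the default on PreTrainedModel

class OffloadTests(unittest.TestCase):
    all_model_classes = (FakeModel,)

    def test_disk_offload_bin(self):
        for model_class in self.all_model_classes:
            if model_class._no_split_modules is None:
                continue  # silently skipped -> the test still reports "passed"
            self.fail("offload logic would run here")

if __name__ == "__main__":
    unittest.main()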


jla524 commented Apr 19, 2024

Models updated so far:

  • Bit
  • ConvNext
  • ConvNextv2
  • Cvt
  • Donut
  • Efficientnet
  • Focalnet
  • Glpn
  • Imagegpt
  • Levit
  • Maskformer
  • Mgp_str
  • Mobilenet_v1
  • Mobilenet_v2
  • Mobilevit
  • Mobilevitv2
  • Poolformer
  • Regnet
  • Resnet
  • Sam
  • Swiftformer
  • Swin
  • Swinv2
  • Timesformer
  • Timm_backbone
  • Trocr
  • Upernet
  • Yolos


jla524 commented May 17, 2024

Models remaining:

  • Blip
  • Dinat
  • Dpt
  • Flava
  • Git
  • GroupVit
  • Layoutlmv3
  • Mask2former
  • Maskformer
  • Nat
  • Oneformer
  • Perceiver
  • Segformer
  • Swin2sr
  • Timm_backbone
  • Tvlt
  • Tvp
  • Videomae
  • Vision_encoder_decoder
  • Vision_text_dual_encoder
  • Vit_mae
  • X_clip


WenheLI commented Jun 4, 2024

Hi! Would love to take the following models and give them a try:
Videomae
Vision_encoder_decoder
Vision_text_dual_encoder
Vit_mae
X_clip
Thanks!


WenheLI commented Jun 6, 2024

Hi! I encountered an issue while running tests for some models, specifically Vision_text_dual_encoder. Even though I set _no_split_modules, the unit tests are still skipped. Does this mean that test cases for these models are not implemented?

Additionally, I'd like to know how to mark certain models as skipped in the tests. For example, ViTMAEForPreTraining should not be split across different GPUs IMO, but this causes the test case to fail because the test expects the model to be split across devices.


amyeroberts commented Jun 6, 2024

@WenheLI Ah, I should take the vision text dual encoder off the list: we can theoretically load any encoder and decoder there, so it's not possible to know which modules can be split or not. The same goes for the vision encoder-decoder.


Nech-C commented Aug 24, 2024

Hey @amyeroberts, I was experimenting with defining _no_split_modules for the Segformer model, and I encountered some unexpected results when running the tests.

When I set _no_split_modules = [] for the Segformer model, all tests failed because no weights were loaded to the GPUs. Here are the error messages I received:

FAILED tests/models/segformer/test_modeling_segformer.py::SegformerModelTest::test_cpu_offload - AssertionError: Items in the second set but not the first:
FAILED tests/models/segformer/test_modeling_segformer.py::SegformerModelTest::test_disk_offload_bin - ValueError: You are trying to offload the whole model to the disk. Please use the `disk_offload` function instead.
FAILED tests/models/segformer/test_modeling_segformer.py::SegformerModelTest::test_disk_offload_safetensors - ValueError: You are trying to offload the whole model to the disk. Please use the `disk_offload` function instead.
FAILED tests/models/segformer/test_modeling_segformer.py::SegformerModelTest::test_model_parallelism - AssertionError: Items in the second set but not the first:

To investigate further, I ran infer_auto_device_map directly in a Jupyter notebook with the GPU memory allocations used in those tests:

  1. With 70% of the model size allocated to GPU:
from accelerate import infer_auto_device_map
from accelerate.utils import compute_module_sizes

# `model` is the Segformer model instantiated from the test config
total_size = compute_module_sizes(model)[""]
max_memory = {0: int(0.7 * total_size), "cpu": total_size * 2}
print(f"Total model size: {total_size}, max memory: {max_memory}")
print(
    infer_auto_device_map(
        model,
        max_memory=max_memory,
        no_split_module_classes=[],
    )
)

Output:

Total model size: 1195632, max memory: {0: 836942, 'cpu': 2391264}
OrderedDict([('', 'cpu')])
  2. With 90% of the model size allocated to GPU:
max_memory = {0: int(0.9 * total_size), "cpu": total_size * 2}
print(f"Total model size: {total_size}, max memory: {max_memory}")
print(
    infer_auto_device_map(
        model,
        max_memory=max_memory,
        no_split_module_classes=[]
    )
)

Output:

Total model size: 1195632, max memory: {0: 1076068, 'cpu': 2391264}  
OrderedDict([('segformer.encoder.patch_embeddings', 0),  
('segformer.encoder.block.0.0.layer_norm_1', 0),  
('segformer.encoder.block.0.0.attention.self.query', 0),  
('segformer.encoder.block.0.0.attention.self.key', 0),  
('segformer.encoder.block.0.0.attention.self.value', 0),  
('segformer.encoder.block.0.0.attention.self.dropout', 0),  
('segformer.encoder.block.0.0.attention.self.sr', 'cpu'), 
('segformer.encoder.block.0.0.attention.self.layer_norm', 'cpu'),  
('segformer.encoder.block.0.0.attention.output', 'cpu'),  
('segformer.encoder.block.0.0.drop_path', 'cpu'),  
('segformer.encoder.block.0.0.layer_norm_2', 'cpu'),  
('segformer.encoder.block.0.0.mlp', 'cpu'),  
('segformer.encoder.block.1', 'cpu'),  
('segformer.encoder.block.2', 'cpu'),  
('segformer.encoder.block.3', 'cpu'),  
('segformer.encoder.layer_norm', 'cpu'),  
('decode_head', 'cpu')])  

The model can definitely be split into smaller modules as the 90% split case suggests. The problem with the 70% split case doesn't come from the smaller max_memory assigned for the GPU because the modules allocated to the GPU in the 90% case only account for 21,408 bytes of the total 1,195,632 bytes model size. This number (about 1.8% of the total model size) is significantly smaller than both the 70% (836,942 bytes) and 90% (1,076,068 bytes) max_memory defined for the GPU. Therefore, the problem is not the max_memory defined for the GPU, but rather some issues with the infer_auto_device_map function's allocation strategy.

After looking into the infer_auto_device_map function, I believe the logic might not be working as intended for models with highly imbalanced module sizes like Segformer:

max_layer_size, max_layer_names = get_max_layer_size(modules_to_treat, module_sizes, no_split_module_classes)

while len(modules_to_treat) > 0:
    name, module = modules_to_treat.pop(0)
    module_size = module_sizes[name]
    
    device = devices[current_device]
    current_max_size = max_memory[device] if device != "disk" else None
    current_memory_reserved = 0

    if devices[current_device] in main_devices:
        current_max_size = current_max_size - max_layer_size
        current_memory_reserved = max_layer_size

This code reserves space for the largest layer on each main device. For Segformer, where the decode_head (1,107,984 bytes) is significantly larger than other layers, this approach may be too conservative, leaving little room for other layers on the GPU.

if current_max_size is not None and current_memory_used + module_size > current_max_size:
    # Split or not split?
    modules_children = (
        []
        if isinstance(module, nn.Parameter) or isinstance(module, torch.Tensor)
        else list(module.named_children())
    )
    if len(modules_children) == 0 or module.__class__.__name__ in no_split_module_classes:
        # -> no split, we go to the next device
        device_memory_used[device] = current_memory_used + current_memory_reserved
        current_device += 1
        modules_to_treat = [(name, module)] + modules_to_treat
        current_memory_used = 0
    else:
        # -> split, we replace the module studied by its children + parameters
        modules_children = list(module.named_parameters(recurse=False)) + modules_children
        modules_to_treat = [(f"{name}.{n}", v) for n, v in modules_children] + modules_to_treat

This part of the function decides whether to split a module or move to the next device. However, once it moves to the next device (i.e., CPU), it never goes back to the GPU, even if there's available space. This could explain why smaller modules aren't being allocated to the GPU after the decode_head is moved to the CPU.

@amyeroberts, should we wait until infer_auto_device_map gets modified and works for models with uneven weight distributions before working on them? Or is it better to override those offload test functions from ModelTesterMixin in each model's test script so that we can enable device_map="auto" for models with this problem sooner?

amyeroberts (Collaborator, Author) commented:

Hi @Nech-C, thanks for writing all of this up!

I don't know infer_auto_device_map intimately, so I'm not sure on the overall logic. Just from the code snippet, I'm assuming it's taking a greedy approach to the memory allocation, which won't be optimal in all cases (like this one). Rather than incrementing the devices as you said, we might want to keep a running count of available memory and go through the devices in decreasing priority; however, this would be slower. If you'd like to open a PR to address it, I'd be happy to take a look, although I'm not a maintainer of accelerate so I don't make decisions on what should or could get added there :)
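
In pseudo-code, the suggested alternative might look roughly like this (a sketch of the idea only, not accelerate code):

def assign_greedily_with_running_counts(modules, max_memory):
    # `modules` is a list of (name, size) pairs; `max_memory` maps devices to
    # their budgets, ordered from fastest to slowest (e.g. {0: ..., "cpu": ...}).
    remaining = dict(max_memory)  # running count of free memory per device
    device_map = {}
    for name, size in modules:
        for device, free in remaining.items():  # always retry faster devices first
            if size <= free:
                device_map[name] = device
                remaining[device] -= size
                break
        else:
            device_map[name] = "disk"  # nothing had room, offload to disk
    return device_map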

Regarding the order of things, I think having to update the tests is a sign that we should wait: infer_auto_device_map is also what will be called when users do device_map="auto". If the model isn't being well allocated across the available devices, then enabling this for Segformer doesn't make sense.


Nech-C commented Nov 18, 2024

Hi @amyeroberts, sorry for getting back to you after so long. While working on the infer_auto_device_map function, I've gained some insights and would appreciate your guidance on how to proceed. However, it seems like you might no longer actively maintain this library. If that's the case, please let me know if I should open a new issue or reach out to the current maintainers.

The infer_auto_device_map function allocates modules sequentially across devices, both in terms of model layers and device hierarchy. It processes modules from the first layer to the last, and allocates them starting with the fastest device (e.g., GPUs) and moving to slower devices (e.g., CPUs, then disks). Although this method reduces the overhead of moving offloaded modules between devices and simplifies memory calculations, it does not make the most efficient use of available memory.

To successfully assign a module to a device, the available memory must exceed the combined size of the current module and the largest subsequent layer. If this condition is not met, the function moves to the next device without revisiting the current one, assuming no module split happens. If the allocation attempt fails on the first module for a device, the resultant device map won't include this device. This causes the tests to fail.
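
Expressed as a small helper, the check described above is roughly the following (a paraphrase of the quoted accelerate logic, not the exact code):

def fits_on_main_device(current_memory_used, module_size, max_layer_size, device_max_memory):
    # Room must remain for the current module plus the largest layer that has
    # not been placed yet (the reserved `max_layer_size`).
    return current_memory_used + module_size + max_layer_size <= device_max_memory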

When the max_memory parameter is set too low for a device, module allocation can fail during offload/parallel tests, such as test_cpu_offload and test_model_parallelism. These tests rely on fixed split ratios to distribute the model across GPUs. However, this approach struggles with unbalanced models that contain disproportionately large, unsplittable modules (e.g., in deeper layers).

Here are the code snippets from the test file:

model_split_percents = [0.5, 0.7, 0.9]

def test_cpu_offload(self):
    config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
    for model_class in self.all_model_classes:
        if model_class._no_split_modules is None:
            continue
        inputs_dict_class = self._prepare_for_class(inputs_dict, model_class)
        model = model_class(config).eval()
        model = model.to(torch_device)
        torch.manual_seed(0)
        base_output = model(**inputs_dict_class)
        model_size = compute_module_sizes(model)[""]
        # We test several splits of sizes to make sure it works.
        max_gpu_sizes = [int(p * model_size) for p in self.model_split_percents[1:]]

At a minimum, a GPU must have sufficient memory to accommodate the largest layer for inference. To address this, we may need a dynamic function to calculate split sizes during test time instead of relying on fixed ratios.
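
One possible shape for that dynamic calculation, purely as a sketch (this helper is hypothetical, not code from transformers or the accelerate PR):

import torch.nn as nn
from accelerate.utils import compute_module_sizes

def dynamic_gpu_budgets(model: nn.Module, headrooms=(1.2, 1.5, 2.0)):
    # Derive per-test GPU budgets from the largest leaf module instead of
    # fixed fractions of the total model size.
    sizes = compute_module_sizes(model)
    largest_leaf = max(
        sizes[name]
        for name, module in model.named_modules()
        if name and len(list(module.children())) == 0
    )
    total = sizes[""]
    # Each budget leaves at least `headroom` times the largest leaf on the GPU,
    # capped at the total model size.
    return [min(total, int(h * largest_leaf)) for h in headrooms]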

Thank you for reading through this (I know it’s a bit lengthy!). I’d love to hear your thoughts and recommendations on how to proceed.


qubvel commented Nov 18, 2024

Hi @Nech-C, thanks for such a detailed description of infer_auto_device_map. Indeed, many models struggle to pass the tests with the default model_split_percents values, and we are overriding these values for specific models.

I think computing split sizes at test time might not be explicit enough. What do you think about improving the test behavior with a proper message explaining why the test actually fails? Maybe we can check whether any module can fit on a device?


Nech-C commented Nov 19, 2024

Hey @qubvel, sure thing! I've actually been working on a PR that adds warning messages for no-allocation situations in the infer_auto_device_map function. You can check it out: Add warnings and fallback for unassigned devices in infer_auto_device_map. It logs the information using logging. Here is a demonstration:

INFO:accelerate.utils.modeling:Based on the current allocation process, no modules could be assigned to the following devices due to insufficient memory:
  - 0: 113955432 bytes required
  - cpu: 113955432 bytes required
These minimum requirements are specific to this allocation attempt and may vary. Consider increasing the available memory for these devices to at least the specified minimum, or adjusting the model config.

Just so you know, only the first device's minimum requirement is guaranteed to work, since a device's assignment affects all subsequent devices. With the current implementation, accurate minimums for later devices cannot be determined unless the function recomputes the device map with the new configuration.

I also implemented a more flexible allocation strategy in the same PR. When fallback_allocation is set to True and a main device would otherwise receive no module assignment, the function uses BFS to find a module that fits in memory instead of having to take the first module in the list.
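
For reference, usage as described above would look something like this (the fallback_allocation argument comes from the linked accelerate PR, so it is only available once that change is in your installed version):

from accelerate import infer_auto_device_map

device_map = infer_auto_device_map(
    model,
    max_memory=max_memory,
    fallback_allocation=True,  # fall back to a BFS search for a module that fits
)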

The PR should get merged soon. I plan to continue working on the function to address an improvement that SunMarc brought up. If there is anything you would like me to address/change in the function, please let me know. I will try to accommodate those needs.
