Community contribution: enabling device_map="auto" support for more vision and multimodal models #29786

Open · 23 of 59 tasks

amyeroberts opened this issue Mar 21, 2024 · 15 comments · Fixed by #29989, #30207, #30379 or #30409 · May be fixed by #33108
Labels
Good Second Issue Issues that are more difficult to do than "Good First" issues - give it a try if you want!



amyeroberts commented Mar 21, 2024

Feature request

transformers models can be easily loaded across multiple devices using device_map="auto". This automatically allocates weights across the available devices, e.g. GPUs, and offloads any remaining weights to CPU and then to disk as necessary. This is useful when running inference with large models.
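
For reference, a minimal sketch of what this looks like from the user side (the checkpoint name below is only an example):

from transformers import AutoModelForCausalLM

# Weights are placed on the available GPUs first, then offloaded to CPU and
# finally to disk if they still don't fit.
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-1.3b",  # example checkpoint; any supported model works
    device_map="auto",
)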

To enable this, _no_split_modules has to be defined in the model's PreTrainedModel subclass, e.g. as is done here for LLaMA. It lists the layers which should not be split across devices and should contain as few modules as possible.
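
Concretely, the change is usually a one-line class attribute, roughly as in the LLaMA example linked above (the module names in the list are model-specific):

class LlamaPreTrainedModel(PreTrainedModel):
    ...
    # A decoder layer's weights must all be placed on the same device.
    _no_split_modules = ["LlamaDecoderLayer"]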

Steps to add

  • Pick a model to work on and open a PR - comment on this issue to say which model you're working on
  • Define _no_split_modules in the PreTrainedModel subclass. Try with _no_split_modules = [] first
  • Enable testing
    • Ensure the following tests are not skipped for the model: test_disk_offload_bin, test_disk_offload_safetensors, test_cpu_offload, test_model_parallelism, test_model_parallel_beam_search
    • Run the tests in a multi-GPU environment: pytest tests/models/{MODEL_NAME}/test_modeling_{MODEL_NAME}.py -vv -k "offload or parallelism"

Models

Motivation

Enable a powerful HF feature for all of our vision models

Your contribution

Ping me for review 🤗

amyeroberts added the Good Second Issue label Mar 21, 2024

jla524 commented Mar 23, 2024

I'm working on Resnet

edit: I'm running into a strange issue where the tests pass on one system and fail on another. I'm going to close the PR for now and investigate further.


tnnandi commented Apr 7, 2024

BERT is not included in the above list of models. Does that mean device_map='auto' will be available for BERT models in an upcoming version of HF transformers? I still see the message that BertForSequenceClassification does not support device_map='auto' with transformers 4.39.3.

amyeroberts (Collaborator, Author) commented:

Hi @tnnandi, the list above is just for vision models that I got from a simple grep and filtering. device_map="auto" isn't yet enabled for BERT, c.f. #25296. If you or anyone in the community would like to add it, we'd be happy to review a PR.


jla524 commented Apr 10, 2024

Hi @amyeroberts, hope you are well :) I'm not sure why, but it looks like the unit tests are passing even without defining _no_split_modules. I'm testing on systems with two GPUs. Any idea why this is happening?

$ pytest tests/models/align/test_modeling_align.py -k "parallel or offload"
=================================================================== test session starts ===================================================================
platform linux -- Python 3.10.13, pytest-7.4.4, pluggy-1.4.0
rootdir: /transformers
configfile: pyproject.toml
plugins: hypothesis-6.98.10, xdist-3.5.0, timeout-2.3.1, anyio-4.3.0
collected 262 items / 241 deselected / 21 selected                                                                                                        

tests/models/align/test_modeling_align.py ....................s                                                                                     [100%]

<warnings redacted>
================================================ 20 passed, 1 skipped, 241 deselected, 2 warnings in 5.33s ================================================
$ pytest tests/models/bert/test_modeling_bert.py -k "parallel or offload"
=================================================================== test session starts ===================================================================
platform linux -- Python 3.10.13, pytest-7.4.4, pluggy-1.4.0
rootdir: /transformers
configfile: pyproject.toml
plugins: hypothesis-6.98.10, xdist-3.5.0, timeout-2.3.1, anyio-4.3.0
collected 156 items / 148 deselected / 8 selected                                                                                                         

tests/models/bert/test_modeling_bert.py ........                                                                                                    [100%]

<warnings redacted>
====================================================== 8 passed, 148 deselected, 2 warnings in 4.59s ======================================================

amyeroberts (Collaborator, Author) commented:

@jla524 It's because these tests are skipped if _no_split_modules isn't defined (the default for models), e.g. here for test_disk_offload_bin. This is admittedly confusing and should really be done with self.skipTest.
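
A minimal, self-contained analogue of that guard (hypothetical classes, just to illustrate why a missing _no_split_modules shows up as a pass rather than a skip):

import unittest

class FakeModel:
    _no_split_modules = None  # the default on PreTrainedModel

class OffloadTests(unittest.TestCase):
    all_model_classes = (FakeModel,)

    def test_disk_offload_bin(self):
        for model_class in self.all_model_classes:
            if model_class._no_split_modules is None:
                continue  # silently skipped -> the test still reports "passed"
            self.fail("offload logic would run here")

if __name__ == "__main__":
    unittest.main()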


jla524 commented Apr 19, 2024

Models updated so far:

  • Bit
  • ConvNext
  • ConvNextv2
  • Cvt
  • Donut
  • Efficientnet
  • Focalnet
  • Glpn
  • Imagegpt
  • Levit
  • Maskformer
  • Mgp_str
  • Mobilenet_v1
  • Mobilenet_v2
  • Mobilevit
  • Mobilevitv2
  • Poolformer
  • Regnet
  • Resnet
  • Sam
  • Swiftformer
  • Swin
  • Swinv2
  • Timesformer
  • Timm_backbone
  • Trocr
  • Upernet
  • Yolos


jla524 commented May 17, 2024

Models remaining:

  • Blip
  • Dinat
  • Dpt
  • Flava
  • Git
  • GroupVit
  • Layoutlmv3
  • Mask2former
  • Maskformer
  • Nat
  • Oneformer
  • Perceiver
  • Segformer
  • Swin2sr
  • Timm_backbone
  • Tvlt
  • Tvp
  • Videomae
  • Vision_encoder_decoder
  • Vision_text_dual_encoder
  • Vit_mae
  • X_clip


WenheLI commented Jun 4, 2024

Hi! Would love to take the following models and give them a try:
Videomae
Vision_encoder_decoder
Vision_text_dual_encoder
Vit_mae
X_clip
Thanks!


WenheLI commented Jun 6, 2024

Hi! I encountered an issue while running tests for some models, specifically Vision_text_dual_encoder. Even though I set _no_split_modules, the unit tests are still skipped. Does this mean that test cases for these models are not implemented?

Additionally, I'd like to know how to mark certain models as skipped in the tests. For example, ViTMAEForPreTraining should not be split across different GPUs IMO, but this causes the test case to fail because the test expects the model to be split across devices.


amyeroberts commented Jun 6, 2024

@WenheLI Ah, I should take the vision text dual encoder off the list: we can theoretically load any encoder and decoder there, so it's not possible to know which modules can be split or not. The same goes for the vision encoder-decoder.


Nech-C commented Aug 24, 2024

Hey @amyeroberts, I was experimenting with defining _no_split_modules for the Segformer model, and I encountered some unexpected results when running the tests.

When I set _no_split_modules = [] for the Segformer model, all tests failed because no weights were loaded to the GPUs. Here are the error messages I received:

FAILED tests/models/segformer/test_modeling_segformer.py::SegformerModelTest::test_cpu_offload - AssertionError: Items in the second set but not the first:
FAILED tests/models/segformer/test_modeling_segformer.py::SegformerModelTest::test_disk_offload_bin - ValueError: You are trying to offload the whole model to the disk. Please use the `disk_offload` function instead.
FAILED tests/models/segformer/test_modeling_segformer.py::SegformerModelTest::test_disk_offload_safetensors - ValueError: You are trying to offload the whole model to the disk. Please use the `disk_offload` function instead.
FAILED tests/models/segformer/test_modeling_segformer.py::SegformerModelTest::test_model_parallelism - AssertionError: Items in the second set but not the first:

To investigate further, I ran infer_auto_device_map directly in a Jupyter notebook with the GPU memory allocations used in those tests:

  1. With 70% of the model size allocated to GPU:
from accelerate import infer_auto_device_map
from accelerate.utils import compute_module_sizes

# `model` is the Segformer model instantiated from the test config
total_size = compute_module_sizes(model)[""]
max_memory = {0: int(0.7 * total_size), "cpu": total_size * 2}
print(f"Total model size: {total_size}, max memory: {max_memory}")
print(
    infer_auto_device_map(
        model,
        max_memory=max_memory,
        no_split_module_classes=[],
    )
)

Output:

Total model size: 1195632, max memory: {0: 836942, 'cpu': 2391264}
OrderedDict([('', 'cpu')])
  2. With 90% of the model size allocated to GPU:
max_memory = {0: int(0.9 * total_size), "cpu": total_size * 2}
print(f"Total model size: {total_size}, max memory: {max_memory}")
print(
    infer_auto_device_map(
        model,
        max_memory=max_memory,
        no_split_module_classes=[]
    )
)

Output:

Total model size: 1195632, max memory: {0: 1076068, 'cpu': 2391264}  
OrderedDict([('segformer.encoder.patch_embeddings', 0),  
('segformer.encoder.block.0.0.layer_norm_1', 0),  
('segformer.encoder.block.0.0.attention.self.query', 0),  
('segformer.encoder.block.0.0.attention.self.key', 0),  
('segformer.encoder.block.0.0.attention.self.value', 0),  
('segformer.encoder.block.0.0.attention.self.dropout', 0),  
('segformer.encoder.block.0.0.attention.self.sr', 'cpu'), 
('segformer.encoder.block.0.0.attention.self.layer_norm', 'cpu'),  
('segformer.encoder.block.0.0.attention.output', 'cpu'),  
('segformer.encoder.block.0.0.drop_path', 'cpu'),  
('segformer.encoder.block.0.0.layer_norm_2', 'cpu'),  
('segformer.encoder.block.0.0.mlp', 'cpu'),  
('segformer.encoder.block.1', 'cpu'),  
('segformer.encoder.block.2', 'cpu'),  
('segformer.encoder.block.3', 'cpu'),  
('segformer.encoder.layer_norm', 'cpu'),  
('decode_head', 'cpu')])  

The model can definitely be split into smaller modules as the 90% split case suggests. The problem with the 70% split case doesn't come from the smaller max_memory assigned for the GPU because the modules allocated to the GPU in the 90% case only account for 21,408 bytes of the total 1,195,632 bytes model size. This number (about 1.8% of the total model size) is significantly smaller than both the 70% (836,942 bytes) and 90% (1,076,068 bytes) max_memory defined for the GPU. Therefore, the problem is not the max_memory defined for the GPU, but rather some issues with the infer_auto_device_map function's allocation strategy.

After looking into the infer_auto_device_map function, I believe the logic might not be working as intended for models with highly imbalanced module sizes like Segformer:

max_layer_size, max_layer_names = get_max_layer_size(modules_to_treat, module_sizes, no_split_module_classes)

while len(modules_to_treat) > 0:
    name, module = modules_to_treat.pop(0)
    module_size = module_sizes[name]
    
    device = devices[current_device]
    current_max_size = max_memory[device] if device != "disk" else None
    current_memory_reserved = 0

    if devices[current_device] in main_devices:
        current_max_size = current_max_size - max_layer_size
        current_memory_reserved = max_layer_size

This code reserves space for the largest layer on each main device. For Segformer, where the decode_head (1,107,984 bytes) is significantly larger than other layers, this approach may be too conservative, leaving little room for other layers on the GPU.

if current_max_size is not None and current_memory_used + module_size > current_max_size:
    # Split or not split?
    modules_children = (
        []
        if isinstance(module, nn.Parameter) or isinstance(module, torch.Tensor)
        else list(module.named_children())
    )
    if len(modules_children) == 0 or module.__class__.__name__ in no_split_module_classes:
        # -> no split, we go to the next device
        device_memory_used[device] = current_memory_used + current_memory_reserved
        current_device += 1
        modules_to_treat = [(name, module)] + modules_to_treat
        current_memory_used = 0
    else:
        # -> split, we replace the module studied by its children + parameters
        modules_children = list(module.named_parameters(recurse=False)) + modules_children
        modules_to_treat = [(f"{name}.{n}", v) for n, v in modules_children] + modules_to_treat

This part of the function decides whether to split a module or move to the next device. However, once it moves to the next device (i.e., CPU), it never goes back to the GPU, even if there's available space. This could explain why smaller modules aren't being allocated to the GPU after the decode_head is moved to the CPU.

@amyeroberts, should we wait until infer_auto_device_map gets modified and works for models with uneven weight distributions before working on them? Or is it better to override those offload test functions from ModelTesterMixin in each model's test script so that we can enable device_map="auto" for models with this problem sooner?

amyeroberts (Collaborator, Author) commented:

Hi @Nech-C, thanks for writing all of this up!

I don't know infer_auto_device_map intimately, so I'm not sure on the overall logic. Just from the code snippet, I'm assuming it's taking a greedy approach to the memory allocation, which won't be optimal in all cases (like this one). Rather than incrementing the devices as you said, we might want to keep a running count of available memory and go through the devices in decreasing priority; however, this would be slower. If you'd like to open a PR to address it, I'd be happy to take a look, although I'm not a maintainer of accelerate so I don't make decisions on what should or could get added there :)
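
In pseudo-code, the suggested alternative might look roughly like this (a sketch of the idea only, not accelerate code):

def assign_greedily_with_running_counts(modules, max_memory):
    # `modules` is a list of (name, size) pairs; `max_memory` maps devices to
    # their budgets, ordered from fastest to slowest (e.g. {0: ..., "cpu": ...}).
    remaining = dict(max_memory)  # running count of free memory per device
    device_map = {}
    for name, size in modules:
        for device, free in remaining.items():  # always retry faster devices first
            if size <= free:
                device_map[name] = device
                remaining[device] -= size
                break
        else:
            device_map[name] = "disk"  # nothing had room, offload to disk
    return device_map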

Regarding the order of things, I think having to update the tests is a sign that we should wait: infer_auto_device_map is also what will be called when users do device_map="auto". If the model isn't being well allocated across the available devices, then enabling this for Segformer doesn't make sense.


Nech-C commented Nov 18, 2024

Hi @amyeroberts, sorry for getting back to you after so long. While working on the infer_auto_device_map function, I've gained some insights and would appreciate your guidance on how to proceed. However, it seems like you might no longer actively maintain this library. If that's the case, please let me know if I should open a new issue or reach out to the current maintainers.

The infer_auto_device_map function allocates modules sequentially across devices, both in terms of model layers and device hierarchy. It processes modules from the first layer to the last, and allocates them starting with the fastest device (e.g., GPUs) and moving to slower devices (e.g., CPUs, then disks). Although this method reduces the overhead of moving offloaded modules between devices and simplifies memory calculations, it does not make the most efficient use of available memory.

To successfully assign a module to a device, the available memory must exceed the combined size of the current module and the largest subsequent layer. If this condition is not met, the function moves to the next device without revisiting the current one, assuming no module split happens. If the allocation attempt fails on the first module for a device, the resultant device map won't include this device. This causes the tests to fail.
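
Expressed as a small helper, the check described above is roughly the following (a paraphrase of the quoted accelerate logic, not the exact code):

def fits_on_main_device(current_memory_used, module_size, max_layer_size, device_max_memory):
    # Room must remain for the current module plus the largest layer that has
    # not been placed yet (the reserved `max_layer_size`).
    return current_memory_used + module_size + max_layer_size <= device_max_memory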

When the max_memory parameter is set too low for a device, module allocation can fail during offload/parallel tests, such as test_cpu_offload and test_model_parallelism. These tests rely on fixed split ratios to distribute the model across GPUs. However, this approach struggles with unbalanced models that contain disproportionately large, unsplittable modules (e.g., in deeper layers).

Here are the code snippets from the test file:

model_split_percents = [0.5, 0.7, 0.9]

def test_cpu_offload(self):
    config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
    for model_class in self.all_model_classes:
        if model_class._no_split_modules is None:
            continue
        inputs_dict_class = self._prepare_for_class(inputs_dict, model_class)
        model = model_class(config).eval()
        model = model.to(torch_device)
        torch.manual_seed(0)
        base_output = model(**inputs_dict_class)
        model_size = compute_module_sizes(model)[""]
        # We test several splits of sizes to make sure it works.
        max_gpu_sizes = [int(p * model_size) for p in self.model_split_percents[1:]]

At a minimum, a GPU must have sufficient memory to accommodate the largest layer for inference. To address this, we may need a dynamic function to calculate split sizes during test time instead of relying on fixed ratios.
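
One possible shape for that dynamic calculation, purely as a sketch (this helper is hypothetical, not code from transformers or the accelerate PR):

import torch.nn as nn
from accelerate.utils import compute_module_sizes

def dynamic_gpu_budgets(model: nn.Module, headrooms=(1.2, 1.5, 2.0)):
    # Derive per-test GPU budgets from the largest leaf module instead of
    # fixed fractions of the total model size.
    sizes = compute_module_sizes(model)
    largest_leaf = max(
        sizes[name]
        for name, module in model.named_modules()
        if name and len(list(module.children())) == 0
    )
    total = sizes[""]
    # Each budget leaves at least `headroom` times the largest leaf on the GPU,
    # capped at the total model size.
    return [min(total, int(h * largest_leaf)) for h in headrooms]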

Thank you for reading through this (I know it’s a bit lengthy!). I’d love to hear your thoughts and recommendations on how to proceed.


qubvel commented Nov 18, 2024

Hi @Nech-C, thanks for such a detailed description of infer_auto_device_map. Indeed, many models struggle to pass the tests with the default model_split_percents values, and we are overriding these values for specific models.

I think computing split sizes at test time might not be explicit enough. What do you think about improving the test behavior with a proper message explaining why the test actually fails? Maybe we can check whether any module can fit on a device?


Nech-C commented Nov 19, 2024

Hey @qubvel, sure thing! I've actually been working on a PR that adds warning messages for no-allocation situations in the infer_auto_device_map function. You can check it out: Add warnings and fallback for unassigned devices in infer_auto_device_map. It logs the information using logging. Here is a demonstration:

INFO:accelerate.utils.modeling:Based on the current allocation process, no modules could be assigned to the following devices due to insufficient memory:
  - 0: 113955432 bytes required
  - cpu: 113955432 bytes required
These minimum requirements are specific to this allocation attempt and may vary. Consider increasing the available memory for these devices to at least the specified minimum, or adjusting the model config.

Just so you know, only the first device's minimum requirement is guaranteed to work, since a device's assignment affects all subsequent devices. With the current implementation, accurate minimums for later devices cannot be determined unless the function recomputes the device map with the new configuration.

I also implemented a more flexible allocation strategy in the same PR. When fallback_allocation is set to True and a main device would otherwise receive no module assignment, the function uses BFS to find a module that fits in memory instead of having to take the first module in the list.
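
For reference, usage as described above would look something like this (the fallback_allocation argument comes from the linked accelerate PR, so it is only available once that change is in your installed version):

from accelerate import infer_auto_device_map

device_map = infer_auto_device_map(
    model,
    max_memory=max_memory,
    fallback_allocation=True,  # fall back to a BFS search for a module that fits
)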

The PR should get merged soon. I plan to continue working on the function to address an improvement that SunMarc brought up. If there is anything you would like me to address/change in the function, please let me know. I will try to accommodate those needs.
