Question on custom models #88
Hi @vince62s!
Ok, let me explain a bit.
I'll have a more detailed look later, but right now you can take a look at how quantized deployment is done in the tensor_parallel int4 demo on Kaggle.
Are you quantizing the model before making it tensor_parallel?

```python
import torch
import transformers
import accelerate
from transformers.utils.bitsandbytes import replace_with_bnb_linear

import tensor_parallel as tp

# Initialize a model on meta device (zero allocations)
with accelerate.init_empty_weights():
    model = transformers.AutoModelForCausalLM.from_config(...)

# Make it tensor_parallel
model = tp.TensorParallelPreTrainedModel(model)

# Mark linear layers to be quantized
model = replace_with_bnb_linear(
    model,
    quantization_config=transformers.utils.quantization_config.BitsAndBytesConfig(...),
)

# Load, dispatch and quantize the weights
device_map = tp.infer_sharded_device_map(model)  # infer where to put each weight

for <EACH CHECKPOINT>:
    converted_state_dict = tp.convert_state_dict(  # <- tensor_parallel helper function.
        torch.load(<CHECKPOINT PATH>),             #    Creates a tensor_parallel checkpoint from a normal one
        ...
    )

    # Dispatch the checkpoint
    for param_name, param in converted_state_dict.items():
        module_name = param_name
        while len(module_name) > 0 and module_name not in device_map:
            module_name = ".".join(module_name.split(".")[:-1])

        accelerate.utils.set_module_tensor_to_device(model, param_name, device_map[module_name], value=param)
```

From what I see, you're trying to use an already written dispatch code (from within …)
I'll try, but then can you confirm that this line …
It does, unless the model is on meta.
So it HAS to be on meta, otherwise quantization won't work in the snippet above, am I correct?
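(For illustration, a small sketch of the difference, under the assumption that on the meta device replace_with_bnb_linear only swaps the layer classes while the actual quantization happens later, when real weights are dispatched. The model name and the attribute path below are placeholders for a LLaMA-style model.)

```python
import accelerate
import transformers
from transformers.utils.bitsandbytes import replace_with_bnb_linear

# Build an empty (meta) model; no real tensors are allocated here.
with accelerate.init_empty_weights():
    config = transformers.AutoConfig.from_pretrained("huggyllama/llama-7b")  # placeholder model
    model = transformers.AutoModelForCausalLM.from_config(config)

# Mark linear layers for 4-bit quantization.
model = replace_with_bnb_linear(
    model,
    quantization_config=transformers.BitsAndBytesConfig(load_in_4bit=True),
)

layer = model.model.layers[0].self_attn.o_proj  # assumed LLaMA-style attribute path
print(type(layer).__name__)  # Linear4bit -> the class has been swapped already
print(layer.weight.device)   # meta       -> but nothing is quantized yet; that should
                             #               happen when real weights reach a GPU
```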
this step is ok …

```
File "/home/vincent/nlp/OpenNMT-py/onmt/model_builder.py", line 418, in build_model
```
Is the model unfrozen? Is …
I am not seeing such a printout. I tried various things as you mentioned, but no luck. My steps (without talking about tensor_parallel) are as follows: …
I tried to call tp.tensor_parallel between 1) and 2), which seems ok, but then what to do with tp.Sharded or tp.infer_sharded_device_map is a mystery to me.
Firstly, about …
Secondly, about …
tldr: call …
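A condensed sketch of one way to put these pieces together for a model that is not a transformers PreTrainedModel, following the dispatch snippet earlier in this thread. MyModel, the device list, and the checkpoint path are placeholders, the extra convert_state_dict arguments are elided as in the snippet above, and whether the generic tp.tensor_parallel entry point accepts a meta-device model the same way is an assumption.

```python
import torch
import accelerate
import tensor_parallel as tp

# 1) Build the model on the meta device so nothing is allocated yet.
with accelerate.init_empty_weights():
    model = MyModel()  # placeholder: your own nn.Module

# 2) Shard it across the visible GPUs (generic entry point, not transformers-specific).
model = tp.tensor_parallel(model, ["cuda:0", "cuda:1"])

# 3) Ask tensor_parallel where each shard of each weight should live.
device_map = tp.infer_sharded_device_map(model)

# 4) Convert a regular checkpoint into tensor_parallel form and dispatch it.
converted_state_dict = tp.convert_state_dict(torch.load("checkpoint.pt"), ...)
for param_name, param in converted_state_dict.items():
    module_name = param_name
    while len(module_name) > 0 and module_name not in device_map:
        module_name = ".".join(module_name.split(".")[:-1])
    accelerate.utils.set_module_tensor_to_device(
        model, param_name, device_map[module_name], value=param
    )
```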
but the issue is …
I'll merge PR #90, fixing this message (and possibly a problem behind it), today. Then I'll be able to tell what goes wrong there.
Well, to be sure, I just tested the Kaggle notebook, and it is still very unclear to me what is going on: after the state_dict load, both GPUs are loaded with 13-14 GB, which means each of them carries the full model. Even more troublesome, in 4-bit I would expect the model footprint to be 4 GB split over two GPUs, hence 2 GB each.
Those are …
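One way to separate what the model itself occupies from what nvidia-smi-style tools report (a sketch; `model` is assumed to be the tensor_parallel-wrapped model from the notebook):

```python
import torch

# Memory actually held by tensors vs. memory reserved by PyTorch's caching allocator.
# nvidia-smi sees the reserved (plus CUDA-context) memory, which can be much larger
# than the bytes the model parameters themselves occupy.
for i in range(torch.cuda.device_count()):
    allocated = torch.cuda.memory_allocated(i) / 2**30
    reserved = torch.cuda.memory_reserved(i) / 2**30
    print(f"cuda:{i}: allocated {allocated:.2f} GiB, reserved {reserved:.2f} GiB")

# Bytes of this model's parameters per device (4-bit weights are stored packed,
# so this reflects the compressed size).
per_device = {}
for p in model.parameters():
    key = str(p.device)
    per_device[key] = per_device.get(key, 0) + p.numel() * p.element_size()
for device, nbytes in sorted(per_device.items()):
    print(f"{device}: {nbytes / 2**30:.2f} GiB of parameters")
```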
By the way, v1.2.6 has just been released, which fixes the error message discussed above and, hopefully, some other aspects of dispatch.
Then I don't understand anything. How is the model sharded across the 2 devices? Layer N going to device 0 and layer M going to device 1? Or part of layer N going to device 0 and the other part of layer N going to device 1?
I updated the …
Yes, fp4 support for dispatch was added only a few weeks ago.
The entire point of this library is that each layer is split individually, so that each GPU holds only a portion of each layer. This way, all of the GPUs can be utilized simultaneously when running the model.
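A toy illustration of that idea (not the library's actual code): a single Linear layer split into two shards, each of which would live on its own GPU, with the full output recovered by concatenating the two partial outputs.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
full = nn.Linear(1024, 4096, bias=False)

# Split the weight along the output dimension: each shard computes half the outputs.
shard0 = nn.Linear(1024, 2048, bias=False)
shard1 = nn.Linear(1024, 2048, bias=False)
shard0.weight.data = full.weight.data[:2048].clone()
shard1.weight.data = full.weight.data[2048:].clone()

x = torch.randn(1, 1024)
y_full = full(x)
y_split = torch.cat([shard0(x), shard1(x)], dim=-1)
print(torch.allclose(y_full, y_split, atol=1e-6))  # True: the shards reproduce the full layer
```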
While digging I found the following issue: when using the notebook, I see that "tp_wrapped_module" comes before the layer name, o_proj for instance. This has an impact on the set_module_tensor_to_device() function, which uses getattr(module, split) recursively. I have the feeling that tp_wrapped_module does not handle the str method properly. When I don't use TP, the following code gives me the correct class in both cases: …
When I switch to tp for my model, the first print spits out "Linear" instead of "Linear4bit".
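A hypothetical sketch of that kind of check (the module paths and the model objects plain_model / tp_shard are placeholders, not the exact ones from the notebook): it mimics what accelerate.utils.set_module_tensor_to_device does, walking a dotted name with getattr and printing the class of the module it lands on.

```python
def resolve(root, dotted_name):
    """Follow a dotted module path with getattr, one segment at a time."""
    module = root
    for part in dotted_name.split("."):
        module = getattr(module, part)
    return module

# Placeholder paths: the same projection layer, named without and with the
# tensor_parallel wrapper inserted into the attribute chain.
name_without_tp = "model.layers.0.self_attn.o_proj"
name_with_tp = "model.layers.0.self_attn.tp_wrapped_module.o_proj"

print(type(resolve(plain_model, name_without_tp)).__name__)  # expected: Linear4bit
print(type(resolve(tp_shard, name_with_tp)).__name__)        # reportedly prints: Linear
```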
On another note, would it be possible to have the same behavior with a model on "cpu" as on "meta"?
That could come in handy. I'll see what I can do.
Hi,
Without using transformers / accelerate and the like, what are the constraints on a model for it to be tensor-parallelizable?
Does it need to be an nn.Sequential? Do the input dimensions always need to be in the same order?
I am trying to load a model on two GPUs, but only the first one gets memory allocated (both are visible).
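A minimal sketch, assuming the generic tp.tensor_parallel entry point accepts an arbitrary nn.Module (the toy model, the device list, and where the inputs are placed are assumptions, not a confirmed answer to the question above):

```python
import torch
import torch.nn as nn
import tensor_parallel as tp

class TinyMLP(nn.Module):
    """A plain nn.Module (not an nn.Sequential) with a couple of linear layers."""
    def __init__(self, d: int = 1024):
        super().__init__()
        self.up = nn.Linear(d, 4 * d)
        self.act = nn.GELU()
        self.down = nn.Linear(4 * d, d)

    def forward(self, x):
        return self.down(self.act(self.up(x)))

model = TinyMLP()
model = tp.tensor_parallel(model, ["cuda:0", "cuda:1"])  # shard the linear layers across both GPUs

x = torch.randn(2, 1024, device="cuda:0")  # assumption: inputs fed on the first device
print(model(x).shape)

# Check where the parameters actually ended up:
print({str(p.device) for p in model.parameters()})
```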