
ChatGLM3-6B GPU memory usage after quantization #1341

Answered by Wooonster
Wooonster asked this question in Q&A

Clearing the CUDA cache right after quantize seems to resolve it:

import torch
from transformers import AutoModel

# model_path points to the ChatGLM3-6B checkpoint; q is the quantization bit width (4 or 8)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True, device='cuda').quantize(q).half().cuda()

# Check parameter types to confirm the quantization took effect
for name, param in model.named_parameters():
    print(f"{q}-quantized: {name}: {param.dtype}")

# Release cached blocks left over from the pre-quantization weights, then synchronize
torch.cuda.empty_cache()
torch.cuda.synchronize()
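
To check that the cleanup actually returns memory, here is a minimal sketch (the report helper and its tags are illustrative, not part of the original answer) using torch.cuda.memory_allocated() and torch.cuda.memory_reserved(); after empty_cache(), reserved memory should drop while allocated memory stays roughly the same:

import torch

def report(tag):
    # memory_allocated: bytes held by live tensors; memory_reserved: bytes cached by the allocator
    alloc = torch.cuda.memory_allocated() / 1024**2
    reserved = torch.cuda.memory_reserved() / 1024**2
    print(f"{tag}: allocated={alloc:.0f} MiB, reserved={reserved:.0f} MiB")

report("after quantize")
torch.cuda.empty_cache()
torch.cuda.synchronize()
report("after empty_cache")  # reserved should shrink; allocated is unchanged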
nvidia-smi (output truncated): NVIDIA-SMI 550.78, Driver Version 550.78, CUDA Version 12.4

Replies: 1 comment

Answer selected by Wooonster
Category: Q&A · Labels: None yet · 1 participant