- The InternVL2-4B model supports using OpenVINO to accelerate inference. Currently verified only on Linux and tested only on the CPU platform.
- Support for the MTL (Meteor Lake) integrated GPU and for Windows.
- Slicing added to the LLM model to optimize computation.
- INT8 quantization added to the vision model (see the sketch after this list).
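For orientation, this kind of weight compression is typically done with NNCF. The snippet below is a minimal sketch, not this project's actual conversion code; the IR file names, `group_size`, and `ratio` values are assumptions.

```python
# Sketch of NNCF weight compression for OpenVINO models.
# NOTE: illustrative only -- file names and parameters are assumptions.
import openvino as ov
import nncf

core = ov.Core()

# INT8 weight quantization for the vision model (the -vision_int8 flag).
vision_model = core.read_model("vision_model.xml")  # placeholder IR path
vision_int8 = nncf.compress_weights(vision_model,
                                    mode=nncf.CompressWeightsMode.INT8_ASYM)
ov.save_model(vision_int8, "vision_model_int8.xml")

# INT4 weight compression for the LLM (the -llm_int4_com flag).
llm_model = core.read_model("llm_model.xml")  # placeholder IR path
llm_int4 = nncf.compress_weights(
    llm_model,
    mode=nncf.CompressWeightsMode.INT4_SYM,
    group_size=128,  # assumed value; the script may use a different one
    ratio=0.8,       # assumed fraction of layers compressed to INT4
)
ov.save_model(llm_int4, "llm_model_int4.xml")
```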
Setup:

```bash
git clone https://github.com/zhaohb/InternVL2-4B-OV.git
pip install openvino_dev
pip install nncf
pip install transformers==4.37.2
pip install torch
pip install torchvision
```
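A quick sanity check that the environment is usable (a minimal sketch):

```python
# Verify that OpenVINO and NNCF import cleanly and list available devices.
import openvino as ov
import nncf  # logs "NNCF initialized successfully ..." on import

core = ov.Core()
print(ov.get_version())        # e.g. 2024.4.0-16419-54f58b86be2
print(core.available_devices)  # e.g. ['CPU'], plus 'GPU.0' on an MTL iGPU
```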
Additional Operations
1. Download the InternVL2-4B model.
2. Replace modeling_phi3.py in the official model directory with the modeling_phi3.py from this project.
3. Replace modeling_intern_vit.py in the official model directory with the modeling_intern_vit.py from this project.
4. Delete the CUDA API calls from the model files (an illustrative example follows this list).
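The exact lines for step 4 vary by file; this hypothetical before/after shows the kind of CUDA-specific call to remove:

```python
# Hypothetical excerpt -- the actual lines in the model files differ,
# but the edit has this shape.

# Before: the tensor is pinned to a CUDA device, which fails without a GPU build.
# pixel_values = pixel_values.to(torch.bfloat16).cuda()

# After: keep the tensor on CPU in a dtype that OpenVINO handles well.
# pixel_values = pixel_values.to(torch.float32)
```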
Run the demo:

```bash
cd InternVL2-4B-OV

# For the MTL iGPU on Windows
python.exe .\test_ov_internvl2.py -m /path/to/internvl2 -ov ..\internvl2_ov\ -llm_int4_com -vision_int8 -llm_int8_quant -d GPU.0
```
Output:

```
INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, onnx, openvino
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
OpenVINO version
2024.4.0-16419-54f58b86be2
User: Hello, who are you?
Assistant: I am an AI assistant whose name is InternVL, developed jointly by Shanghai AI Lab and SenseTime.
User: <image>
Please describe the image shortly.
Assistant: The image features a close-up view of a red panda resting on a wooden platform. The panda is characterized by its distinctive red fur, white face, and ears. The background shows a natural setting with green foliage and a wooden structure.
User: <image>
Please describe the image shortly.
Assistant: The image features a close-up view of a red panda resting on a wooden platform. The panda is characterized by its distinctive red fur, white face, and ears. The background shows a natural setting with green foliage and a wooden structure.
Vision Pre latency: 226.53 ms, Vision encoder latency: 4881.63 ms, Vision Post latency: 45.78 ms, Vision Mlp latency: 42.10 ms
LLM Model First token latency: 8775.91 ms, Output len: 55, Average token latency: 84.71 ms
```
After the command is executed, the OpenVINO IR will be saved in the directory /path/to/internvl2_ov. If /path/to/internvl2_ov already exists, the model conversion step is skipped and the OpenVINO IR is loaded directly.
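The caching behavior is roughly the following (a sketch; the IR file name below is an assumption, not the script's actual layout):

```python
import os
import openvino as ov

ov_ir_dir = "/path/to/internvl2_ov"
core = ov.Core()

if not os.path.exists(ov_ir_dir):
    os.makedirs(ov_ir_dir)
    # Conversion would happen here, e.g.:
    # ov_model = ov.convert_model(torch_model, example_input=example_inputs)
    # ov.save_model(ov_model, os.path.join(ov_ir_dir, "llm.xml"))

# If the directory already exists, the saved IR is loaded directly.
model = core.read_model(os.path.join(ov_ir_dir, "llm.xml"))  # assumed file name
compiled = core.compile_model(model, "CPU")                  # or "GPU.0" on MTL
```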
If you only want to convert the model, add the -convert_model_only parameter:

```bash
python.exe .\test_ov_internvl2.py -m /path/to/internvl2 -ov ..\internvl2_ov\ -llm_int4_com -vision_int8 -llm_int8_quant -convert_model_only
```
The model: Model link
```
usage: Export InternVL2 Model to IR [-h] [-m MODEL_ID] -ov OV_IR_DIR [-d DEVICE] [-pic PICTURE] [-p PROMPT]
                                    [-max MAX_NEW_TOKENS] [-llm_int4_com] [-vision_int8] [-llm_int8_quant]
                                    [-convert_model_only]

options:
  -h, --help            show this help message and exit
  -m MODEL_ID, --model_id MODEL_ID
                        model_id or directory for loading
  -ov OV_IR_DIR, --ov_ir_dir OV_IR_DIR
                        output directory for saving model
  -d DEVICE, --device DEVICE
                        inference device
  -pic PICTURE, --picture PICTURE
                        picture file
  -p PROMPT, --prompt PROMPT
                        prompt
  -max MAX_NEW_TOKENS, --max_new_tokens MAX_NEW_TOKENS
                        max_new_tokens
  -llm_int4_com, --llm_int4_compress
                        llm int4 weights compress
  -vision_int8, --vision_int8_quant
                        vision int8 weights quantize
  -llm_int8_quant, --llm_int8_quant
                        llm int8 weights quantize
  -convert_model_only, --convert_model_only
                        convert model to ov only, do not do inference test
```