From 80991a9777965333c9b7e1c2d67a58f48c064816 Mon Sep 17 00:00:00 2001 From: ATMxsp01 Date: Tue, 17 Dec 2024 15:32:51 +0800 Subject: [PATCH 1/2] Add --modelscope for more models --- .../GPU/HuggingFace/LLM/chatglm3/README.md | 4 +- .../GPU/HuggingFace/LLM/codegeex2/README.md | 17 ++++- .../GPU/HuggingFace/LLM/codegeex2/generate.py | 24 +++++-- .../GPU/HuggingFace/LLM/glm4/README.md | 35 ++++------ .../GPU/HuggingFace/LLM/glm4/generate.py | 25 +++++-- .../GPU/HuggingFace/LLM/glm4/streamchat.py | 69 ------------------- .../GPU/HuggingFace/LLM/qwen2.5/README.md | 17 ++++- .../GPU/HuggingFace/LLM/qwen2.5/generate.py | 16 ++++- .../GPU/HuggingFace/LLM/qwen2/README.md | 21 ++++-- .../GPU/HuggingFace/LLM/qwen2/generate.py | 20 ++++-- 10 files changed, 124 insertions(+), 124 deletions(-) delete mode 100644 python/llm/example/GPU/HuggingFace/LLM/glm4/streamchat.py diff --git a/python/llm/example/GPU/HuggingFace/LLM/chatglm3/README.md b/python/llm/example/GPU/HuggingFace/LLM/chatglm3/README.md index 7a018f44e65..c7261a15f5d 100644 --- a/python/llm/example/GPU/HuggingFace/LLM/chatglm3/README.md +++ b/python/llm/example/GPU/HuggingFace/LLM/chatglm3/README.md @@ -108,7 +108,7 @@ python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROM ``` Arguments info: -- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the **Hugging Face** or **ModelScope** repo id for the ChatGLM3 model to be downloaded, or the path to the checkpoint folder. It is default to be `'THUDM/chatglm3-6b'` for **Hugging Face** or `ZhipuAI/chatglm3-6b` for **ModelScope**. +- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the **Hugging Face** or **ModelScope** repo id for the ChatGLM3 model to be downloaded, or the path to the checkpoint folder. It is default to be `'THUDM/chatglm3-6b'` for **Hugging Face** or `'ZhipuAI/chatglm3-6b'` for **ModelScope**. - `--prompt PROMPT`: argument defining the prompt to be infered (with integrated prompt format for chat). It is default to be `'AI是什么?'`. - `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It is default to be `32`. - `--modelscope`: using **ModelScope** as model hub instead of **Hugging Face**. @@ -162,7 +162,7 @@ python ./streamchat.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --question ``` Arguments info: -- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the **Hugging Face** or **ModelScope** repo id for the ChatGLM3 model to be downloaded, or the path to the checkpoint folder. It is default to be `'THUDM/chatglm3-6b'` for **Hugging Face** or `ZhipuAI/chatglm3-6b` for **ModelScope**. +- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the **Hugging Face** or **ModelScope** repo id for the ChatGLM3 model to be downloaded, or the path to the checkpoint folder. It is default to be `'THUDM/chatglm3-6b'` for **Hugging Face** or `'ZhipuAI/chatglm3-6b'` for **ModelScope**. - `--question QUESTION`: argument defining the question to ask. It is default to be `"晚上睡不着应该怎么办"`. - `--disable-stream`: argument defining whether to stream chat. If include `--disable-stream` when running the script, the stream chat is disabled and `chat()` API is used. - `--modelscope`: using **ModelScope** as model hub instead of **Hugging Face**. 
diff --git a/python/llm/example/GPU/HuggingFace/LLM/codegeex2/README.md b/python/llm/example/GPU/HuggingFace/LLM/codegeex2/README.md index 7e44b89b2b9..db985e144dd 100644 --- a/python/llm/example/GPU/HuggingFace/LLM/codegeex2/README.md +++ b/python/llm/example/GPU/HuggingFace/LLM/codegeex2/README.md @@ -1,6 +1,6 @@ # CodeGeeX2 -In this directory, you will find examples on how you could apply IPEX-LLM INT4 optimizations on CodeGeeX2 models which is implemented based on the ChatGLM2 architecture trained on more code data on [Intel GPUs](../../../README.md). For illustration purposes, we utilize the [THUDM/codegeex-6b](https://huggingface.co/THUDM/codegeex2-6b) as a reference CodeGeeX2 model. +In this directory, you will find examples on how you could apply IPEX-LLM INT4 optimizations on CodeGeeX2 models which is implemented based on the ChatGLM2 architecture trained on more code data on [Intel GPUs](../../../README.md). For illustration purposes, we utilize the [THUDM/codegeex-6b](https://huggingface.co/THUDM/codegeex2-6b) (or [ZhipuAI/codegeex2-6b](https://www.modelscope.cn/models/ZhipuAI/codegeex2-6b) for ModelScope) as a reference CodeGeeX2 model. ## 0. Requirements To run these examples with IPEX-LLM on Intel GPUs, we have some recommended requirements for your machine, please refer to [here](../../../README.md#requirements) for more information. @@ -16,6 +16,9 @@ conda create -n llm python=3.11 conda activate llm # below command will install intel_extension_for_pytorch==2.1.10+xpu as default pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/ + +# [optional] only needed if you would like to use ModelScope as model hub +pip install modelscope==1.11.0 ``` #### 1.2 Installation on Windows @@ -26,6 +29,9 @@ conda activate llm # below command will install intel_extension_for_pytorch==2.1.10+xpu as default pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/ + +# [optional] only needed if you would like to use ModelScope as model hub +pip install modelscope==1.11.0 ``` ### 2. Download Model and Replace File @@ -104,14 +110,19 @@ set SYCL_CACHE_PERSISTENT=1 > For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile. ### 5. Running examples -``` +```bash +# for Hugging Face model hub python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT + +# for ModelScope model hub +python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT --modelscope ``` Arguments info: -- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the huggingface repo id for the CodeGeeX2 model to be downloaded, or the path to the huggingface checkpoint folder. It is default to be `'THUDM/codegeex-6b'`. +- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the **Hugging Face** or **ModelScope** repo id for the CodeGeeX2 model to be downloaded, or the path to the checkpoint folder. It is default to be `'THUDM/codegeex-6b'` for **Hugging Face** or `'ZhipuAI/codegeex-6b'` for **ModelScope**. - `--prompt PROMPT`: argument defining the prompt to be infered (with integrated prompt format for chat). It is default to be `'# language: Python\n# write a bubble sort function\n'`. - `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It is default to be `128`. 
+- `--modelscope`: using **ModelScope** as model hub instead of **Hugging Face**. #### Sample Output #### [THUDM/codegeex-6b](https://huggingface.co/THUDM/codegeex-6b)
diff --git a/python/llm/example/GPU/HuggingFace/LLM/codegeex2/generate.py b/python/llm/example/GPU/HuggingFace/LLM/codegeex2/generate.py index ddc9dd53c95..cf40c9be9b3 100644 --- a/python/llm/example/GPU/HuggingFace/LLM/codegeex2/generate.py +++ b/python/llm/example/GPU/HuggingFace/LLM/codegeex2/generate.py @@ -28,18 +28,29 @@ if __name__ == '__main__': - parser = argparse.ArgumentParser(description='Predict Tokens using `generate()` API for ChatGLM2 model') - parser.add_argument('--repo-id-or-model-path', type=str, default="THUDM/codegeex2-6b", - help='The huggingface repo id for the CodeGeeX2 model to be downloaded' - ', or the path to the huggingface checkpoint folder') + parser = argparse.ArgumentParser(description='Predict Tokens using `generate()` API for CodeGeeX2 model') + parser.add_argument('--repo-id-or-model-path', type=str, + help='The Hugging Face or ModelScope repo id for the CodeGeeX2 model to be downloaded' + ', or the path to the checkpoint folder') parser.add_argument('--prompt', type=str, default="# language: Python\n# write a bubble sort function\n", help='Prompt to infer') parser.add_argument('--n-predict', type=int, default=128, help='Max tokens to predict') + parser.add_argument('--modelscope', action="store_true", default=False, + help="Use models from modelscope") args = parser.parse_args() - model_path = args.repo_id_or_model_path + if args.modelscope: + from modelscope import AutoTokenizer + model_hub = 'modelscope' + else: + from transformers import AutoTokenizer + model_hub = 'huggingface' + + model_path = args.repo_id_or_model_path if args.repo_id_or_model_path else \ + ("ZhipuAI/codegeex2-6b" if args.modelscope else "THUDM/codegeex2-6b") + # Load model in 4 bit, # which convert the relevant layers in the model into INT4 format # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the from_pretrained function. @@ -48,7 +59,8 @@ load_in_4bit=True, optimize_model=True, trust_remote_code=True, - use_cache=True) + use_cache=True, + model_hub=model_hub) model = model.half().to('xpu') # Load tokenizer
diff --git a/python/llm/example/GPU/HuggingFace/LLM/glm4/README.md b/python/llm/example/GPU/HuggingFace/LLM/glm4/README.md index aa985dd74f8..2bb58d7a150 100644 --- a/python/llm/example/GPU/HuggingFace/LLM/glm4/README.md +++ b/python/llm/example/GPU/HuggingFace/LLM/glm4/README.md @@ -1,5 +1,5 @@ # GLM-4 -In this directory, you will find examples on how you could apply IPEX-LLM INT4 optimizations on GLM-4 models on [Intel GPUs](../../../README.md). For illustration purposes, we utilize the [THUDM/glm-4-9b-chat](https://huggingface.co/THUDM/glm-4-9b-chat) as a reference InternLM model. +In this directory, you will find examples on how you could apply IPEX-LLM INT4 optimizations on GLM-4 models on [Intel GPUs](../../../README.md). For illustration purposes, we utilize the [THUDM/glm-4-9b-chat](https://huggingface.co/THUDM/glm-4-9b-chat) (or [ZhipuAI/glm-4-9b-chat](https://www.modelscope.cn/models/ZhipuAI/glm-4-9b-chat) for ModelScope) as a reference GLM-4 model. ## 0. Requirements To run these examples with IPEX-LLM on Intel GPUs, we have some recommended requirements for your machine, please refer to [here](../../../README.md#requirements) for more information.
@@ -15,6 +15,9 @@ pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-exte # install packages required for GLM-4, it is recommended to use transformers>=4.44 for THUDM/glm-4-9b-chat updated after August 12, 2024 pip install "tiktoken>=0.7.0" transformers==4.44 "trl<0.12.0" + +# [optional] only needed if you would like to use ModelScope as model hub +pip install modelscope==1.11.0 ``` ### 1.2 Installation on Windows @@ -28,6 +31,9 @@ pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-exte # install packages required for GLM-4, it is recommended to use transformers>=4.44 for THUDM/glm-4-9b-chat updated after August 12, 2024 pip install "tiktoken>=0.7.0" transformers==4.44 "trl<0.12.0" + +# [optional] only needed if you would like to use ModelScope as model hub +pip install modelscope==1.11.0 ``` ## 2. Configures OneAPI environment variables for Linux @@ -98,14 +104,19 @@ set SYCL_CACHE_PERSISTENT=1 ### Example 1: Predict Tokens using `generate()` API In the example [generate.py](./generate.py), we show a basic use case for a GLM-4 model to predict the next N tokens using `generate()` API, with IPEX-LLM INT4 optimizations on Intel GPUs. -``` +```bash +# for Hugging Face model hub python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT + +# for ModelScope model hub +python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT --modelscope ``` Arguments info: -- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the huggingface repo id for the GLM-4 model (e.g. `THUDM/glm-4-9b-chat`) to be downloaded, or the path to the huggingface checkpoint folder. It is default to be `'THUDM/glm-4-9b-chat'`. +- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the **Hugging Face** or **ModelScope** repo id for the GLM-4 model (e.g. `THUDM/glm-4-9b-chat`) to be downloaded, or the path to the checkpoint folder. It is default to be `'THUDM/glm-4-9b-chat'` for **Hugging Face** or `'ZhipuAI/glm-4-9b-chat'` for **ModelScope**. - `--prompt PROMPT`: argument defining the prompt to be infered (with integrated prompt format for chat). It is default to be `'AI是什么?'`. - `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It is default to be `32`. +- `--modelscope`: using **ModelScope** as model hub instead of **Hugging Face**. #### Sample Output #### [THUDM/glm-4-9b-chat](https://huggingface.co/THUDM/glm-4-9b-chat) @@ -134,21 +145,3 @@ What is AI? Artificial Intelligence (AI) refers to the simulation of human intelligence in machines that are programmed to think like humans and mimic their actions. The term "art ``` - -### Example 2: Stream Chat using `stream_chat()` API -In the example [streamchat.py](./streamchat.py), we show a basic use case for a GLM-4 model to stream chat, with IPEX-LLM INT4 optimizations. - -**Stream Chat using `stream_chat()` API**: -``` -python ./streamchat.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --question QUESTION -``` - -**Chat using `chat()` API**: -``` -python ./streamchat.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --question QUESTION --disable-stream -``` - -Arguments info: -- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the huggingface repo id for the GLM-4 model to be downloaded, or the path to the huggingface checkpoint folder. It is default to be `'THUDM/glm-4-9b-chat'`. -- `--question QUESTION`: argument defining the question to ask. 
It is default to be `"AI是什么?"`. -- `--disable-stream`: argument defining whether to stream chat. If include `--disable-stream` when running the script, the stream chat is disabled and `chat()` API is used.
diff --git a/python/llm/example/GPU/HuggingFace/LLM/glm4/generate.py b/python/llm/example/GPU/HuggingFace/LLM/glm4/generate.py index ebef3dae4c2..c381814f216 100644 --- a/python/llm/example/GPU/HuggingFace/LLM/glm4/generate.py +++ b/python/llm/example/GPU/HuggingFace/LLM/glm4/generate.py @@ -20,7 +20,6 @@ import numpy as np from ipex_llm.transformers import AutoModel -from transformers import AutoTokenizer # you could tune the prompt based on your own model, # here the prompt tuning refers to https://huggingface.co/THUDM/glm-4-9b-chat/blob/main/tokenization_chatglm.py @@ -28,16 +27,27 @@ if __name__ == '__main__': parser = argparse.ArgumentParser(description='Predict Tokens using `generate()` API for GLM-4 model') - parser.add_argument('--repo-id-or-model-path', type=str, default="THUDM/glm-4-9b-chat", - help='The huggingface repo id for the GLM-4 model to be downloaded' - ', or the path to the huggingface checkpoint folder') + parser.add_argument('--repo-id-or-model-path', type=str, + help='The Hugging Face or ModelScope repo id for the GLM-4 model to be downloaded' + ', or the path to the checkpoint folder') parser.add_argument('--prompt', type=str, default="AI是什么?", help='Prompt to infer') parser.add_argument('--n-predict', type=int, default=32, help='Max tokens to predict') + parser.add_argument('--modelscope', action="store_true", default=False, + help="Use models from modelscope") args = parser.parse_args() - model_path = args.repo_id_or_model_path + + if args.modelscope: + from modelscope import AutoTokenizer + model_hub = 'modelscope' + else: + from transformers import AutoTokenizer + model_hub = 'huggingface' + + model_path = args.repo_id_or_model_path if args.repo_id_or_model_path else \ + ("ZhipuAI/glm-4-9b-chat" if args.modelscope else "THUDM/glm-4-9b-chat") # Load model in 4 bit, # which convert the relevant layers in the model into INT4 format @@ -47,8 +57,9 @@ load_in_4bit=True, optimize_model=True, trust_remote_code=True, - use_cache=True) - model = model.to("xpu") + use_cache=True, + model_hub=model_hub) + model = model.half().to("xpu") # Load tokenizer tokenizer = AutoTokenizer.from_pretrained(model_path,
diff --git a/python/llm/example/GPU/HuggingFace/LLM/glm4/streamchat.py b/python/llm/example/GPU/HuggingFace/LLM/glm4/streamchat.py deleted file mode 100644 index 31be35d9639..00000000000 --- a/python/llm/example/GPU/HuggingFace/LLM/glm4/streamchat.py +++ /dev/null @@ -1,69 +0,0 @@ -# -# Copyright 2016 The BigDL Authors. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License.
-# - -import torch -import time -import argparse -import numpy as np - -from ipex_llm.transformers import AutoModel -from transformers import AutoTokenizer - - -if __name__ == '__main__': - parser = argparse.ArgumentParser(description='Stream Chat for GLM-4 model') - parser.add_argument('--repo-id-or-model-path', type=str, default="THUDM/glm-4-9b-chat", - help='The huggingface repo id for the GLM-4 model to be downloaded' - ', or the path to the huggingface checkpoint folder') - parser.add_argument('--question', type=str, default="晚上睡不着应该怎么办", - help='Qustion you want to ask') - parser.add_argument('--disable-stream', action="store_true", - help='Disable stream chat') - - args = parser.parse_args() - model_path = args.repo_id_or_model_path - disable_stream = args.disable_stream - - # Load model in 4 bit, - # which convert the relevant layers in the model into INT4 format - # When running LLMs on Intel iGPUs for Windows users, we recommend setting `cpu_embedding=True` in the from_pretrained function. - # This will allow the memory-intensive embedding layer to utilize the CPU instead of iGPU. - model = AutoModel.from_pretrained(model_path, - trust_remote_code=True, - load_in_4bit=True, - optimize_model=True, - use_cache=True, - cpu_embedding=True) - - model = model.to('xpu') - - # Load tokenizer - tokenizer = AutoTokenizer.from_pretrained(model_path, - trust_remote_code=True) - - with torch.inference_mode(): - if disable_stream: - # Chat - response, history = model.chat(tokenizer, args.question, history=[]) - print('-'*20, 'Chat Output', '-'*20) - print(response) - else: - # Stream chat - response_ = "" - print('-'*20, 'Stream Chat Output', '-'*20) - for response, history in model.stream_chat(tokenizer, args.question, history=[]): - print(response.replace(response_, ""), end="") - response_ = response diff --git a/python/llm/example/GPU/HuggingFace/LLM/qwen2.5/README.md b/python/llm/example/GPU/HuggingFace/LLM/qwen2.5/README.md index ed4778551db..a51c4e6fc19 100644 --- a/python/llm/example/GPU/HuggingFace/LLM/qwen2.5/README.md +++ b/python/llm/example/GPU/HuggingFace/LLM/qwen2.5/README.md @@ -1,5 +1,5 @@ # Qwen2.5 -In this directory, you will find examples on how you could apply IPEX-LLM INT4 optimizations on Qwen2.5 models on [Intel GPUs](../../../README.md). For illustration purposes, we utilize [Qwen/Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct), [Qwen/Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) and [Qwen/Qwen2.5-14B-Instruct](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct) as reference Qwen2.5 models. +In this directory, you will find examples on how you could apply IPEX-LLM INT4 optimizations on Qwen2.5 models on [Intel GPUs](../../../README.md). For illustration purposes, we utilize [Qwen/Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct), [Qwen/Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) and [Qwen/Qwen2.5-14B-Instruct](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct) (or [Qwen/Qwen2.5-3B-Instruct](https://www.modelscope.cn/models/Qwen/Qwen2.5-3B-Instruct), [Qwen/Qwen2.5-7B-Instruct](https://www.modelscope.cn/models/Qwen/Qwen2.5-7B-Instruct) and [Qwen/Qwen2.5-14B-Instruct](https://www.modelscope.cn/models/Qwen/Qwen2.5-14B-Instruct) for ModelScope) as reference Qwen2.5 models. ## 0. Requirements To run these examples with IPEX-LLM on Intel GPUs, we have some recommended requirements for your machine, please refer to [here](../../../README.md#requirements) for more information. 
@@ -14,6 +14,9 @@ conda create -n llm python=3.11 conda activate llm # below command will install intel_extension_for_pytorch==2.1.10+xpu as default pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/ + +# [optional] only needed if you would like to use ModelScope as model hub +pip install modelscope==1.11.0 ``` #### 1.2 Installation on Windows @@ -24,6 +27,9 @@ conda activate llm # below command will install intel_extension_for_pytorch==2.1.10+xpu as default pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/ + +# [optional] only needed if you would like to use ModelScope as model hub +pip install modelscope==1.11.0 ``` ### 2. Configures OneAPI environment variables for Linux @@ -91,14 +97,19 @@ set SYCL_CACHE_PERSISTENT=1 > For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile. ### 4. Running examples -``` +```bash +# for Hugging Face model hub python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT + +# for ModelScope model hub +python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT --modelscope ``` Arguments info: -- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the huggingface repo id for the Qwen2.5 model (e.g. `Qwen/Qwen2.5-7B-Instruct`) to be downloaded, or the path to the huggingface checkpoint folder. It is default to be `'Qwen/Qwen2.5-7B-Instruct'`. +- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the **Hugging Face** or **ModelScope** repo id for the Qwen2.5 model (e.g. `Qwen/Qwen2.5-7B-Instruct`) to be downloaded, or the path to the checkpoint folder. It is default to be `'Qwen/Qwen2.5-7B-Instruct'`. - `--prompt PROMPT`: argument defining the prompt to be infered (with integrated prompt format for chat). It is default to be `'AI是什么?'`. - `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It is default to be `32`. +- `--modelscope`: using **ModelScope** as model hub instead of **Hugging Face**. 
#### Sample Output ##### [Qwen/Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct) diff --git a/python/llm/example/GPU/HuggingFace/LLM/qwen2.5/generate.py b/python/llm/example/GPU/HuggingFace/LLM/qwen2.5/generate.py index d1befbcb30c..13d31b7ed1c 100644 --- a/python/llm/example/GPU/HuggingFace/LLM/qwen2.5/generate.py +++ b/python/llm/example/GPU/HuggingFace/LLM/qwen2.5/generate.py @@ -18,20 +18,29 @@ import time import argparse -from transformers import AutoTokenizer if __name__ == '__main__': parser = argparse.ArgumentParser(description='Predict Tokens using generate() API for Qwen2.5 model') parser.add_argument('--repo-id-or-model-path', type=str, default="Qwen/Qwen2.5-7B-Instruct", - help='The huggingface repo id for the Qwen2.5 model to be downloaded' + help='The Hugging Face or ModelScope repo id for the Qwen2.5 model to be downloaded' ', or the path to the huggingface checkpoint folder') parser.add_argument('--prompt', type=str, default="AI是什么?", help='Prompt to infer') parser.add_argument('--n-predict', type=int, default=32, help='Max tokens to predict') + parser.add_argument('--modelscope', action="store_true", default=False, + help="Use models from modelscope") args = parser.parse_args() + + if args.modelscope: + from modelscope import AutoTokenizer + model_hub = 'modelscope' + else: + from transformers import AutoTokenizer + model_hub = 'huggingface' + model_path = args.repo_id_or_model_path @@ -42,7 +51,8 @@ load_in_4bit=True, optimize_model=True, trust_remote_code=True, - use_cache=True) + use_cache=True, + model_hub=model_hub) model = model.half().to("xpu") # Load tokenizer diff --git a/python/llm/example/GPU/HuggingFace/LLM/qwen2/README.md b/python/llm/example/GPU/HuggingFace/LLM/qwen2/README.md index 829139d943d..987bb7bd3c3 100644 --- a/python/llm/example/GPU/HuggingFace/LLM/qwen2/README.md +++ b/python/llm/example/GPU/HuggingFace/LLM/qwen2/README.md @@ -1,5 +1,5 @@ # Qwen2 -In this directory, you will find examples on how you could apply IPEX-LLM INT4 optimizations on Qwen2 models on [Intel GPUs](../../../README.md). For illustration purposes, we utilize [Qwen/Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct) and [Qwen/Qwen2-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct) as reference Qwen2 models. +In this directory, you will find examples on how you could apply IPEX-LLM INT4 optimizations on Qwen2 models on [Intel GPUs](../../../README.md). For illustration purposes, we utilize [Qwen/Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct) and [Qwen/Qwen2-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct) (or [Qwen/Qwen2-7B-Instruct](https://www.modelscope.cn/models/Qwen/Qwen2-7B-Instruct) and [Qwen/Qwen2-1.5B-Instruct](https://www.modelscope.cn/models/Qwen/Qwen2-1.5B-Instruct) for ModelScope) as reference Qwen2 models. ## 0. Requirements To run these examples with IPEX-LLM on Intel GPUs, we have some recommended requirements for your machine, please refer to [here](../../../README.md#requirements) for more information. 
@@ -16,6 +16,9 @@ conda activate llm pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/ pip install transformers==4.37.0 # install transformers which supports Qwen2 + +# [optional] only needed if you would like to use ModelScope as model hub +pip install modelscope==1.11.0 ``` #### 1.2 Installation on Windows @@ -28,6 +31,9 @@ conda activate llm pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/ pip install transformers==4.37.0 # install transformers which supports Qwen2 + +# [optional] only needed if you would like to use ModelScope as model hub +pip install modelscope==1.11.0 ``` ### 2. Configures OneAPI environment variables for Linux @@ -95,17 +101,22 @@ set SYCL_CACHE_PERSISTENT=1 > For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or Pro A60, it may take several minutes to compile. ### 4. Running examples -``` +```bash +# for Hugging Face model hub python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT + +# for ModelScope model hub +python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT --modelscope ``` Arguments info: -- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the huggingface repo id for the Qwen2 model (e.g. `Qwen/Qwen2-7B-Instruct`) to be downloaded, or the path to the huggingface checkpoint folder. It is default to be `'Qwen/Qwen2-7B-Instruct'`. +- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the **Hugging Face** or **ModelScope** repo id for the Qwen2 model (e.g. `Qwen/Qwen2-7B-Instruct`) to be downloaded, or the path to the checkpoint folder. It is default to be `'Qwen/Qwen2-7B-Instruct'`. - `--prompt PROMPT`: argument defining the prompt to be infered (with integrated prompt format for chat). It is default to be `'AI是什么?'`. - `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It is default to be `32`. +- `--modelscope`: using **ModelScope** as model hub instead of **Hugging Face**. #### Sample Output -##### [Qwen/Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct) +##### Qwen/Qwen2-7B-Instruct ([Hugging Face](https://huggingface.co/Qwen/Qwen2-7B-Instruct) or [ModelScope](https://www.modelscope.cn/models/Qwen/Qwen2-7B-Instruct)) ```log Inference time: xxxx s -------------------- Prompt -------------------- @@ -122,7 +133,7 @@ What is AI? AI, or Artificial Intelligence, refers to the simulation of human intelligence in machines that are programmed to think and learn like humans and mimic their actions. 
The term may ``` -##### [Qwen/Qwen2-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct) +##### Qwen/Qwen2-1.5B-Instruct ([Hugging Face](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct) or [ModelScope](https://www.modelscope.cn/models/Qwen/Qwen2-1.5B-Instruct)) ```log Inference time: xxxx s -------------------- Prompt -------------------- diff --git a/python/llm/example/GPU/HuggingFace/LLM/qwen2/generate.py b/python/llm/example/GPU/HuggingFace/LLM/qwen2/generate.py index 7d0d1ed072b..fed121290aa 100644 --- a/python/llm/example/GPU/HuggingFace/LLM/qwen2/generate.py +++ b/python/llm/example/GPU/HuggingFace/LLM/qwen2/generate.py @@ -18,21 +18,30 @@ import time import argparse -from transformers import AutoTokenizer import numpy as np if __name__ == '__main__': - parser = argparse.ArgumentParser(description='Qwen2-7B-Instruct') + parser = argparse.ArgumentParser(description='Predict Tokens using `generate()` API for Qwen2 model') parser.add_argument('--repo-id-or-model-path', type=str, default="Qwen/Qwen2-7B-Instruct", - help='The huggingface repo id for the Qwen2 model to be downloaded' - ', or the path to the huggingface checkpoint folder') + help='The Hugging Face or ModelScope repo id for the Qwen2 model to be downloaded' + ', or the path to the checkpoint folder') parser.add_argument('--prompt', type=str, default="AI是什么?", help='Prompt to infer') parser.add_argument('--n-predict', type=int, default=32, help='Max tokens to predict') + parser.add_argument('--modelscope', action="store_true", default=False, + help="Use models from modelscope") args = parser.parse_args() + + if args.modelscope: + from modelscope import AutoTokenizer + model_hub = 'modelscope' + else: + from transformers import AutoTokenizer + model_hub = 'huggingface' + model_path = args.repo_id_or_model_path @@ -43,7 +52,8 @@ load_in_4bit=True, optimize_model=True, trust_remote_code=True, - use_cache=True) + use_cache=True, + model_hub=model_hub) model = model.half().to("xpu") # Load tokenizer From 8d216d063782f791f7b769e6e1cc30cfa13fc808 Mon Sep 17 00:00:00 2001 From: ATMxsp01 Date: Thu, 19 Dec 2024 09:49:51 +0800 Subject: [PATCH 2/2] imporve readme --- .../example/GPU/HuggingFace/LLM/codegeex2/README.md | 10 +++++----- python/llm/example/GPU/HuggingFace/LLM/qwen2/README.md | 4 ++-- 2 files changed, 7 insertions(+), 7 deletions(-) diff --git a/python/llm/example/GPU/HuggingFace/LLM/codegeex2/README.md b/python/llm/example/GPU/HuggingFace/LLM/codegeex2/README.md index db985e144dd..6514e1bfc80 100644 --- a/python/llm/example/GPU/HuggingFace/LLM/codegeex2/README.md +++ b/python/llm/example/GPU/HuggingFace/LLM/codegeex2/README.md @@ -1,6 +1,6 @@ # CodeGeeX2 -In this directory, you will find examples on how you could apply IPEX-LLM INT4 optimizations on CodeGeeX2 models which is implemented based on the ChatGLM2 architecture trained on more code data on [Intel GPUs](../../../README.md). For illustration purposes, we utilize the [THUDM/codegeex-6b](https://huggingface.co/THUDM/codegeex2-6b) (or [ZhipuAI/codegeex2-6b](https://www.modelscope.cn/models/ZhipuAI/codegeex2-6b) for ModelScope) as a reference CodeGeeX2 model. +In this directory, you will find examples on how you could apply IPEX-LLM INT4 optimizations on CodeGeeX2 models which is implemented based on the ChatGLM2 architecture trained on more code data on [Intel GPUs](../../../README.md). 
For illustration purposes, we utilize the [THUDM/codegeex2-6b](https://huggingface.co/THUDM/codegeex2-6b) (or [ZhipuAI/codegeex2-6b](https://www.modelscope.cn/models/ZhipuAI/codegeex2-6b) for ModelScope) as a reference CodeGeeX2 model. ## 0. Requirements To run these examples with IPEX-LLM on Intel GPUs, we have some recommended requirements for your machine, please refer to [here](../../../README.md#requirements) for more information. @@ -35,7 +35,7 @@ pip install modelscope==1.11.0 ``` ### 2. Download Model and Replace File -If you select the codegeex2-6b model ([THUDM/codegeex-6b](https://huggingface.co/THUDM/codegeex2-6b)), please note that their code (`tokenization_chatglm.py`) initialized tokenizer after the call of `__init__` of its parent class, which may result in error during loading tokenizer. To address issue, we have provided an updated file ([tokenization_chatglm.py](./codegeex2-6b/tokenization_chatglm.py)) +If you select the codegeex2-6b model ([THUDM/codegeex2-6b](https://huggingface.co/THUDM/codegeex2-6b) (for **Hugging Face**) or [ZhipuAI/codegeex2-6b](https://www.modelscope.cn/models/ZhipuAI/codegeex2-6b) (for **ModelScope**)), please note that their code (`tokenization_chatglm.py`) initializes the tokenizer after the call of `__init__` of its parent class, which may result in an error during tokenizer loading. To address this issue, we have provided an updated file ([tokenization_chatglm.py](./codegeex2-6b/tokenization_chatglm.py)) ```python def __init__(self, vocab_file, padding_side="left", clean_up_tokenization_spaces=False, **kwargs): @@ -43,7 +43,7 @@ def __init__(self, vocab_file, padding_side="left", clean_up_tokenization_spaces super().__init__(padding_side=padding_side, clean_up_tokenization_spaces=clean_up_tokenization_spaces, **kwargs) ``` -You could download the model from [THUDM/codegeex-6b](https://huggingface.co/THUDM/codegeex2-6b), and replace the file `tokenization_chatglm.py` with [tokenization_chatglm.py](./codegeex2-6b/tokenization_chatglm.py). +You could download the model from [THUDM/codegeex2-6b](https://huggingface.co/THUDM/codegeex2-6b) (for **Hugging Face**) or [ZhipuAI/codegeex2-6b](https://www.modelscope.cn/models/ZhipuAI/codegeex2-6b) (for **ModelScope**), and replace the file `tokenization_chatglm.py` with [tokenization_chatglm.py](./codegeex2-6b/tokenization_chatglm.py). ### 3. Configures OneAPI environment variables for Linux @@ -119,13 +119,13 @@ python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROM ``` Arguments info: -- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the **Hugging Face** or **ModelScope** repo id for the CodeGeeX2 model to be downloaded, or the path to the checkpoint folder. It is default to be `'THUDM/codegeex-6b'` for **Hugging Face** or `'ZhipuAI/codegeex-6b'` for **ModelScope**. +- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the **Hugging Face** or **ModelScope** repo id for the CodeGeeX2 model to be downloaded, or the path to the checkpoint folder. It is default to be `'THUDM/codegeex2-6b'` for **Hugging Face** or `'ZhipuAI/codegeex2-6b'` for **ModelScope**. - `--prompt PROMPT`: argument defining the prompt to be infered (with integrated prompt format for chat). It is default to be `'# language: Python\n# write a bubble sort function\n'`. - `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It is default to be `128`. - `--modelscope`: using **ModelScope** as model hub instead of **Hugging Face**. 
#### Sample Output -#### [THUDM/codegeex-6b](https://huggingface.co/THUDM/codegeex-6b) +#### [THUDM/codegeex2-6b](https://huggingface.co/THUDM/codegeex2-6b) ```log Inference time: xxxx s -------------------- Prompt -------------------- diff --git a/python/llm/example/GPU/HuggingFace/LLM/qwen2/README.md b/python/llm/example/GPU/HuggingFace/LLM/qwen2/README.md index 987bb7bd3c3..ebf094442bc 100644 --- a/python/llm/example/GPU/HuggingFace/LLM/qwen2/README.md +++ b/python/llm/example/GPU/HuggingFace/LLM/qwen2/README.md @@ -116,7 +116,7 @@ Arguments info: - `--modelscope`: using **ModelScope** as model hub instead of **Hugging Face**. #### Sample Output -##### Qwen/Qwen2-7B-Instruct ([Hugging Face](https://huggingface.co/Qwen/Qwen2-7B-Instruct) or [ModelScope](https://www.modelscope.cn/models/Qwen/Qwen2-7B-Instruct)) +##### [Qwen/Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct) ```log Inference time: xxxx s -------------------- Prompt -------------------- @@ -133,7 +133,7 @@ What is AI? AI, or Artificial Intelligence, refers to the simulation of human intelligence in machines that are programmed to think and learn like humans and mimic their actions. The term may ``` -##### Qwen/Qwen2-1.5B-Instruct ([Hugging Face](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct) or [ModelScope](https://www.modelscope.cn/models/Qwen/Qwen2-1.5B-Instruct)) +##### [Qwen/Qwen2-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct) ```log Inference time: xxxx s -------------------- Prompt --------------------
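
For quick reference outside the diff, the sketch below consolidates the hub-selection pattern this patch adds to each example's `generate.py`: the `AutoTokenizer` import comes from either `modelscope` or `transformers`, a hub-specific default repo id is used when none is supplied, and the chosen hub is passed to `from_pretrained` through the `model_hub` keyword. This is a minimal sketch modeled on the Qwen2 example (which keeps the same repo id on both hubs); the prompt template and timing code of the full examples are omitted, and it assumes `ipex-llm[xpu]` (plus `modelscope==1.11.0` when `--modelscope` is used) is installed.

```python
import argparse
import torch

from ipex_llm.transformers import AutoModelForCausalLM

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Minimal sketch of the --modelscope hub selection')
    parser.add_argument('--repo-id-or-model-path', type=str,
                        help='The Hugging Face or ModelScope repo id for the Qwen2 model to be downloaded'
                             ', or the path to the checkpoint folder')
    parser.add_argument('--prompt', type=str, default="AI是什么?", help='Prompt to infer')
    parser.add_argument('--n-predict', type=int, default=32, help='Max tokens to predict')
    parser.add_argument('--modelscope', action="store_true", default=False,
                        help="Use models from modelscope")
    args = parser.parse_args()

    # Import the tokenizer class from the selected hub and remember the hub name,
    # so that ipex-llm resolves the repo id against the right model hub.
    if args.modelscope:
        from modelscope import AutoTokenizer
        model_hub = 'modelscope'
    else:
        from transformers import AutoTokenizer
        model_hub = 'huggingface'

    # Qwen2 uses the same repo id on both hubs; models such as ChatGLM3, GLM-4 and
    # CodeGeeX2 instead fall back to a ZhipuAI/... id when --modelscope is set.
    model_path = args.repo_id_or_model_path if args.repo_id_or_model_path else "Qwen/Qwen2-7B-Instruct"

    # Load the model with INT4 optimizations and tell ipex-llm which hub the id belongs to.
    model = AutoModelForCausalLM.from_pretrained(model_path,
                                                 load_in_4bit=True,
                                                 optimize_model=True,
                                                 trust_remote_code=True,
                                                 use_cache=True,
                                                 model_hub=model_hub)
    model = model.half().to("xpu")

    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

    # Chat-style generation, following the structure of the Qwen2 example.
    messages = [{"role": "user", "content": args.prompt}]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    with torch.inference_mode():
        input_ids = tokenizer([text], return_tensors="pt").input_ids.to("xpu")
        output = model.generate(input_ids, max_new_tokens=args.n_predict)
        print(tokenizer.decode(output[0], skip_special_tokens=True))
```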