[ChatGLM3] support chatglm3

zifeng-radxa · Feb 8, 2024 · 35617f4
1 parent c1ccd53
Showing 41 changed files with 11,492 additions and 2 deletions.
diff --git a/models/ChatGLM3/README.md b/models/ChatGLM3/README.md
@@ -0,0 +1,219 @@
+![](./assets/sophgo_chip.png)
+
+# ChatGLM3-TPU
+
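+This project deploys the large language model [ChatGLM3-6B](https://huggingface.co/THUDM/chatglm3-6b) on the BM1684X. The model is converted to a bmodel with the [TPU-MLIR](https://github.com/sophgo/tpu-mlir) compiler and then deployed with C++ code to a BM1684X in either a PCIe or an SoC environment.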
+
+
+A walkthrough of `ChatGLM` has been written up on Zhihu to make the source code easier to follow:
+
+[ChatGLM2 workflow walkthrough and TPU-MLIR deployment](https://zhuanlan.zhihu.com/p/641975976)
+
+
+## Development Environment
+
+
+1. Pull the Docker image and start a container as follows:
+
+``` shell
+docker pull sophgo/tpuc_dev:latest
+
+# myname1234 is just an example, you can set your own name
+docker run --privileged --name myname1234 -v $PWD:/workspace -it sophgo/tpuc_dev:latest
+```
+The rest of this document assumes that everything lives under the `/workspace` directory inside the container.
+
+
+2. Download `ChatGLM3-6B`. The model is large, so this takes a while.
+
+``` shell
+git lfs install
+git clone https://huggingface.co/THUDM/chatglm3-6b
+```
+
+Then make three changes to the downloaded project:
+- In `config.json`, set `seq_length` to 512;
+
+- In `modeling_chatglm.py`, replace the following code:
+
+```python
+if attention_mask is not None:
+    attention_scores = attention_scores.masked_fill(attention_mask, float("-inf"))
+```
+
+with:
+
+```python
+if attention_mask is not None:
+    attention_scores = attention_scores + (attention_mask * -10000.0)
+```
+
+This change improves efficiency, because `masked_fill` is slow; in addition, converting `masked_fill` to ONNX has some bugs.
+
+- In `modeling_chatglm.py`, replace the following code:
+
+```python
+pytorch_major_version = int(torch.__version__.split('.')[0])
+if pytorch_major_version >= 2:
+```
+
+with:
+
+```python
+pytorch_major_version = int(torch.__version__.split('.')[0])
+if False:
+```
+
+This is needed because ONNX cannot convert the `torch.nn.functional.scaled_dot_product_attention` operator.
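
The two `modeling_chatglm.py` edits above can be illustrated with a small, dependency-free sketch. It is not the model's code: `softmax` and `attention_row` are hypothetical helpers, and the vectors are made-up values. It computes attention explicitly (roughly what remains once the `scaled_dot_product_attention` branch is disabled) and compares the additive `-10000.0` mask with `-inf` filling. The sketch assumes at least one key is left unmasked.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_row(query, keys, values, masked, additive=True):
    """Attention output for a single query vector.

    masked[j] is True when key j must be ignored, mirroring the boolean
    attention_mask used in the README snippet.
    """
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    if additive:
        # Edited form: attention_scores + attention_mask * -10000.0
        scores = [s + (-10000.0 if m else 0.0) for s, m in zip(scores, masked)]
    else:
        # Original form: masked_fill(attention_mask, -inf)
        scores = [float("-inf") if m else s for s, m in zip(scores, masked)]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


# Made-up example data: three keys, the third one is masked out.
query = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
masked = [False, False, True]

out_additive = attention_row(query, keys, values, masked, additive=True)
out_fill = attention_row(query, keys, values, masked, additive=False)
print(out_additive)
print(out_fill)
```

For ordinary score magnitudes, the weight of a masked key underflows to zero in both variants, so the two outputs agree. The additive form needs only an elementwise multiply and add, which is typically easier to export than a fill operation.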
+
+3. Download the `TPU-MLIR` source and build it (alternatively, download a prebuilt release package and unpack it):
+
+``` shell
+git clone https://github.com/sophgo/tpu-mlir.git
+cd tpu-mlir
+source ./envsetup.sh
+./build.sh
+```
+
+4. Download this project, `ChatGLM3-TPU`, as follows:
+
+``` shell
+git clone https://github.com/sophgo/ChatGLM3-TPU.git
+```
+
+## Compiling the Model
+
+1. Add the `ChatGLM3-6B` directory to the Python path:
+
+``` shell
+export PYTHONPATH=/workspace/chatglm3-6b:$PYTHONPATH
+```
+
+2. Export all ONNX models. If a missing component is reported along the way, install it with `pip install <component>`.
+
+``` shell
+cd ChatGLM3-TPU/compile
+python3 export_onnx.py
+```
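+A large number of ONNX models are now exported to the `tmp` directory.
+
+3. Compile the ONNX models
+
+TPU-MLIR currently supports F16, INT8, and INT4 quantization for ChatGLM3, as well as multi-chip distributed inference. By default, it uses F16 quantization and single-chip inference, and produces the file `chatglm3-6b.bmodel`: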
+
+```shell
+./compile.sh
+```
+
+For INT8 or INT4 quantization, run the following command, which produces `chatglm3-6b_int8.bmodel` or `chatglm3-6b_int4.bmodel`:
+
+```shell
+./compile.sh --mode int8 # or int4
+```
+
+For 2-chip inference, run the following command, which produces `chatglm3-6b_f16_2dev.bmodel`; 4-chip and 8-chip builds work the same way:
+
+```shell
+./compile.sh --num_device 2
+```
+
+## Building the Program (C++ Version)
+
+Build as follows; the default target is PCIe:
+
+```shell
+cd ChatGLM3-TPU/demo
+mkdir build
+cd build
+cmake ..
+make
+```
+
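+For the SoC version, there are two ways to build:
+
+Method 1: copy the demo directory to the SoC environment and build it there with the steps above (recommended).
+
+Method 2: cross-compile inside Docker, as follows: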
+
+```shell
+wget https://releases.linaro.org/components/toolchain/binaries/7.5-2019.12/aarch64-linux-gnu/gcc-linaro-7.5.0-2019.12-x86_64_aarch64-linux-gnu.tar.xz
+tar -xvf gcc-linaro-7.5.0-2019.12-x86_64_aarch64-linux-gnu.tar.xz
+mv gcc-linaro-7.5.0-2019.12-x86_64_aarch64-linux-gnu /opt/aarch64-linux-gnu-7.5.0
+cd ChatGLM3-TPU/demo
+mkdir build
+cd build
+cmake .. -DTARGET_ARCH=soc
+make -j
+
+```
+
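+The build produces the `chatglm` executable. Put `chatglm` in the `/ChatGLM3-TPU/demo` directory, and specify the chips and the bmodel path as shown below.
+Run `chatglm`; by default it runs `chatglm3-6b.bmodel` on a single chip: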
+```shell
+./chatglm --model chatglm3-6b.bmodel
+```
+
+To run the INT8 or INT4 model, use:
+```shell
+./chatglm --model chatglm3-6b_int8.bmodel # same for int4
+```
+
+For 2-chip distributed inference, use the following command (for example, to run on chips 2 and 3; run `source /etc/profile` and then `bm-smi` to look up the chip IDs):
+```shell
+./chatglm --model chatglm3-6b_f16_2dev.bmodel --devid 2,3
+```
+
+## Building the Program (Python Web Version)
+
+```shell
+pip install gradio==3.39.0
+cd ChatGLM3-TPU/web_demo
+mkdir build
+cd build
+cmake ..
+make -j
+```
+
+A successful build produces `libtpuchat.so*`. In `web_demo.py`, specify `bmodel_path`, `token_path`, `device_id`, `lib_path` (the .so file produced by the build), and `dev_id`.
+```shell
+python web_demo.py --dev 0 --bmodel_path your_bmodel_path
+```
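+The web demo should then run.
+
+For an SoC environment, refer to the C++ version.
+
+Note: use gradio==3.39.0 if at all possible; other versions cause a variety of problems.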
+
+## Results
+
+The following shows a run in INT8 quantization mode on a single chip:
+
+![](./assets/chatglm.jpg)
+
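+## FAQ
+
+#### Where does sentencepiece come from?
+
+A prebuilt copy is already included in the project, so there is no need to build it. For the curious, the steps are as follows.
+
+Download [sentencepiece](https://github.com/google/sentencepiece) and build it to obtain `libsentencepiece.a`: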
+
+```shell
+git clone https://github.com/google/sentencepiece.git
+cd sentencepiece
+mkdir build
+cd build
+cmake ..
+make -j
+```
+
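+To build for the SoC environment, follow the demo's build method and specify the cross-compiler in the makefile.
+
+#### The demo program does not run
+
+If the demo program fails to run after being copied to the target environment, for example with errors such as missing symbols,
+the cause is that the libraries in the target environment differ. Copy the .so files from the demo's `lib_pcie` (PCIe) or `lib_soc` (SoC) directory to the target environment and link against them.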
+
+
+## Tool Calling
+See: [Tool Calling](./tools_using/README.md)
diff --git a/models/ChatGLM3/compile/compile.sh b/models/ChatGLM3/compile/compile.sh
@@ -0,0 +1,174 @@
+#!/bin/bash
+set -ex
+models=
+mode="f16"
+folder="tmp"
+num_device=1
+mode_args=""
+device_args=""
+quantize_args="--quantize F16"
+name=""
+num_layers=
+out_model=$name.bmodel
+
+while [[ $# -gt 0 ]]; do
+    key="$1"
+
+    case $key in
+        --mode)
+            mode="$2"
+            shift 2
+            ;;
+        --num_device)
+            num_device="$2"
+            shift 2
+            ;;
+        --name)
+            name="$2"
+            shift 2
+            ;;
+        *)
+            echo "Invalid option: $key" >&2
+            exit 1
+            ;;
+    esac
+done
+
+if [ "$name" = "chatglm3-6b" ]; then
+  num_layers=27
+  echo "Compile ChatGLM3-6B"
+else
+  >&2 echo -e "Error: Invalid name $name, the input name must be \033[31mchatglm3-6b\033[0m"
+  exit 1
+fi
+
+# "f16" is the default mode set at the top of the script.
+if [ "$mode" = "int8" ]; then
+    quantize_args="--quantize W8F16"
+elif [ "$mode" = "f16" ]; then
+    quantize_args="--quantize F16"
+elif [ "$mode" = "int4" ]; then
+    quantize_args="--quantize W4F16 --q_group_size 64"
+else
+    echo "Error, unknown quantize mode: $mode" >&2
+    exit 1
+fi
+
+if [ x$num_device != x1 ]; then
+    device_args="--num_device $num_device"
+    out_model=$name'_'$mode'_'$num_device'dev.bmodel'
+else
+    out_model=$name'_'$mode'_1dev.bmodel'
+fi
+
+outdir=${folder}/embedding
+mkdir -p $outdir
+pushd $outdir
+
+model_transform.py \
+    --model_name embedding \
+    --model_def ../onnx/embedding.onnx \
+    --mlir embedding.mlir
+
+
+model_deploy.py \
+    --mlir embedding.mlir \
+    --quantize F16 \
+    --quant_input \
+    --quant_output \
+    --chip bm1684x \
+    $device_args \
+    --model embedding.bmodel
+
+model_transform.py \
+    --model_name embedding_cache \
+    --model_def ../onnx/embedding.onnx \
+    --input_shapes [[1,1]] \
+    --mlir embedding_cache.mlir
+
+
+model_deploy.py \
+    --mlir embedding_cache.mlir \
+    --quantize F16 \
+    --quant_input \
+    --quant_output \
+    --chip bm1684x \
+    --model embedding_cache.bmodel
+
+rm *.npz
+
+models=$models' '$outdir'/embedding.bmodel '$outdir'/embedding_cache.bmodel '
+
+popd
+
+echo $models
+
+outdir=tmp/$mode"_"$num_device"dev"/lm_head
+mkdir -p $outdir
+pushd $outdir
+
+model_transform.py \
+    --model_name lm_head \
+    --model_def ../../lm_head.onnx \
+    --mlir lm_head.mlir
+
+model_deploy.py \
+    --mlir lm_head.mlir \
+    $quantize_args \
+    --quant_input \
+    --quant_output \
+    --chip bm1684x \
+    $device_args \
+    --model lm_head.bmodel
+
+rm *.npz
+
+models=${models}${outdir}'/lm_head.bmodel '
+popd
+
+echo $models
+
+outdir=tmp/$mode"_"$num_device"dev"/block
+mkdir -p $outdir
+
+pushd $outdir
+
+# num_layers holds the index of the last block, so this loops over blocks 0..num_layers.
+for ((i=0; i<=$num_layers; i++)); do
+
+    model_transform.py \
+        --model_name block_$i \
+        --model_def ../../block_$i.onnx \
+        --mlir block_$i.mlir
+
+    model_deploy.py \
+        --mlir block_$i.mlir \
+        $quantize_args \
+        --chip bm1684x \
+        $device_args \
+        --model block_$i.bmodel
+
+    model_transform.py \
+        --model_name block_cache_$i \
+        --model_def ../../block_cache_$i.onnx \
+        --mlir block_cache_$i.mlir
+
+    model_deploy.py \
+        --mlir block_cache_$i.mlir \
+        $quantize_args \
+        --chip bm1684x \
+        $device_args \
+        --model block_cache_$i.bmodel
+
+    rm *.npz
+
+    models=${models}${outdir}'/block_'$i'.bmodel '$outdir'/block_cache_'$i'.bmodel '
+
+done
+popd
+echo $models
+
+model_tool --combine $models -o $out_model
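
The option handling and file naming in `compile.sh` can be summarized in a short Python sketch, which may help when checking which file a given invocation produces. This is an illustrative model of the script rather than part of the project. `compile_plan` and `QUANTIZE_ARGS` are hypothetical names, and mapping both `f16` and `bf16` to `--quantize F16` is an assumption about the intended behavior.

```python
# Hypothetical model of compile.sh's argument handling; not project code.
QUANTIZE_ARGS = {
    "f16": ["--quantize", "F16"],
    "bf16": ["--quantize", "F16"],  # assumption: treated the same as f16
    "int8": ["--quantize", "W8F16"],
    "int4": ["--quantize", "W4F16", "--q_group_size", "64"],
}


def compile_plan(name, mode="f16", num_device=1, last_block=27):
    """Return the flags, output name, and partial bmodels for one run."""
    if name != "chatglm3-6b":
        raise ValueError(f"Invalid name {name!r}, expected 'chatglm3-6b'")
    if mode not in QUANTIZE_ARGS:
        raise ValueError(f"unknown quantize mode {mode!r}")

    # The script only passes --num_device when more than one chip is used.
    device_args = [] if num_device == 1 else ["--num_device", str(num_device)]

    # Embedding models go to tmp/embedding; the rest go to a per-mode folder.
    mode_dir = f"tmp/{mode}_{num_device}dev"
    parts = [
        "tmp/embedding/embedding.bmodel",
        "tmp/embedding/embedding_cache.bmodel",
        f"{mode_dir}/lm_head/lm_head.bmodel",
    ]
    # Blocks are numbered 0..last_block inclusive, each with a cache variant.
    for i in range(last_block + 1):
        parts.append(f"{mode_dir}/block/block_{i}.bmodel")
        parts.append(f"{mode_dir}/block/block_cache_{i}.bmodel")

    return {
        "quantize_args": QUANTIZE_ARGS[mode],
        "device_args": device_args,
        "out_model": f"{name}_{mode}_{num_device}dev.bmodel",
        "parts": parts,
    }


plan = compile_plan("chatglm3-6b", mode="int4", num_device=2)
print(plan["out_model"])
print(len(plan["parts"]), "partial bmodels are combined by model_tool")
```

Under this reading, the script requires `--name chatglm3-6b` and always names the combined file `<name>_<mode>_<N>dev.bmodel`, including the single-chip case.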