[ChatGLM] add demo_parallel.cpp

zifeng-radxa · Feb 19, 2024 · cdc20d9 · cdc20d9
1 parent 35617f4
commit cdc20d9

Showing 38 changed files with 7,159 additions and 41 deletions.
diff --git a/.gitignore b/.gitignore
diff --git a/.gitmodules b/.gitmodules
diff --git a/README.md b/README.md
diff --git a/models/Baichuan2/README.md b/models/Baichuan2/README.md
@@ -0,0 +1,182 @@
+![image](./assets/sophgo_chip.png)
+
+# Baichuan2-TPU
+
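+This project deploys the large language model [Baichuan2-7B](https://huggingface.co/baichuan-inc/Baichuan2-7B-Chat) on the BM1684X. The model is converted to a bmodel with the [TPU-MLIR](https://github.com/sophgo/tpu-mlir) compiler and deployed with C++ code to a BM1684X in either a PCIe or an SoC environment.
+
+The instructions below assume a PCIe environment; for an SoC environment, follow the notes given at the relevant steps.
+
+# Directory layout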
+```
+.
+├── README.md                           # usage instructions
+├── requirements.txt                    # required Python wheel packages
+├── assets
+├── compile
+│   ├── compile.sh                      # script that compiles the TPU model
+│   ├── export_onnx_fast.py             # script that exports the ONNX files
+│   ├── modeling_baichuan.py            # backup of the file that replaces the one in Baichuan2-7B-chat
+│   └── torch_inference.py              # torch inference script
+├── demo                                # Baichuan2 C++ source files
+│   ├── CMakeLists.txt
+│   └── demo.cpp                        # main program
+├── src                                 # build dependencies
+│   ├── include
+│   ├── lib_pcie
+│   └── lib_soc
+├── model                               # model files (the bmodel must be downloaded)
+│   ├── baichuan2-7b-test_int8.bmodel
+│   └── tokenizer.model
+└── web_demo                            # web demo with a browser chat example
+    ├── chat.cpp
+    ├── chat.py
+    ├── CMakeLists.txt
+    └── web_demo.py
+```
+----------------------------
+
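+# Stage 1: Model compilation
+
+## Notes
+* Model compilation must be done inside docker; it cannot be done outside docker.
+
+### Step 1: Download the model
+The Baichuan2 model is fully open source on Hugging Face. Follow the download steps on the official page to fetch the model and its weights.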
+```bash
+# Make sure you have git-lfs installed (https://git-lfs.com)
+git lfs install
+git clone https://huggingface.co/baichuan-inc/Baichuan2-7B-Chat
+
+# if you want to clone without large files – just their pointers
+# prepend your git clone with the following env var:
+GIT_LFS_SKIP_SMUDGE=1
+```
+
+### Step 2: Set up docker
+
+Pull the docker image and start a container as follows:
+
+``` shell
+docker pull sophgo/tpuc_dev:latest
+
+# myname1234 is just an example, you can set your own name
+docker run --privileged --name myname1234 -v $PWD:/workspace -it sophgo/tpuc_dev:latest
+```
+
+### Step 3: Download and build TPU-MLIR
+
+``` shell
+git clone git@github.com:sophgo/tpu-mlir.git
+cd tpu-mlir
+source ./envsetup.sh
+./build.sh
+```
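+* PS: Each time you re-enter the docker environment and need to compile a model, you must run `source ./envsetup.sh` and `./build.sh` in this directory again before the model can be compiled.
+
+### Step 4: Download this project and install requirements.txt
+Download transformers, sentencepiece, Baichuan2-TPU, and the .bin model from Baidu Netdisk, then replace modeling_baichuan.py in transformers.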
+
+``` shell
+git clone https://github.com/sophgo/Baichuan2-TPU.git
+cd Baichuan2-TPU
+pip install -r requirements.txt
+```
+
+### Step 5: Replace modeling_baichuan.py, edit config.json, and generate the ONNX files
+In the config.json of the Baichuan2-7B-chat project, change max_position_embeddings and model_max_length from 4096 to 512.
+
+``` shell
+cd compile
+cp modeling_baichuan.py $BAICHUAN2_PATH
+python export_onnx_fast.py --model_path your_model_path
+```
+
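+* PS1: your_model_path is the location of the downloaded original model, e.g. "../../torch2onnx/Baichuan2-7B-Chat"; choose the 7b or 13b model as needed.
+* PS2: To debug rather than generate all the ONNX models in one go, change num_layers on line 240 to 1 and use the function to compare whether the single-block result matches ...
+
+### Step 6: Generate the bmodel file
+
+Generate the model: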
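The config.json change in Step 5 can be scripted instead of edited by hand. The following is a minimal sketch, assuming both fields appear in config.json as plain `"key": number` pairs; the helper name `set_seq_len` is hypothetical and not part of this project.

```bash
#!/bin/bash
# Sketch only: set max_position_embeddings and model_max_length in a
# Hugging Face style config.json. set_seq_len is a hypothetical helper.
set_seq_len() {
    local config="$1" seqlen="$2"
    # Write to a temporary file and move it into place, so the same command
    # works with both GNU and BSD sed (their -i options differ).
    sed -E \
        -e "s/(\"max_position_embeddings\"[[:space:]]*:[[:space:]]*)[0-9]+/\1${seqlen}/" \
        -e "s/(\"model_max_length\"[[:space:]]*:[[:space:]]*)[0-9]+/\1${seqlen}/" \
        "$config" > "$config.tmp" && mv "$config.tmp" "$config"
}

# Example (path is an assumption about where the model was cloned):
# set_seq_len ./Baichuan2-7B-Chat/config.json 512
```

Other fields are left untouched, so the edit can be checked afterwards with `git diff` in the model directory.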
+
+``` shell
+./compile.sh --mode int8
+```
+
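+* PS1: After compilation, a file named baichuan2-{X}b_{Y}_{Z}dev.bmodel is generated under Baichuan2-TPU/compile, where X is 7 or 13, Y is the data type selected with `mode` when running `compile.sh`, and Z is the number of chips used for inference (if num_device is not specified, the {Z}dev part is omitted).
+* PS2: Generating the bmodel takes roughly 3 hours or more. At least 64 GB of RAM and 200 GB of disk space are recommended; otherwise you are likely to hit OOM or "no space left" errors.
+* PS3: The provided lib_pcie and lib_soc currently contain only the single-chip shared libraries; multi-chip support will be added in a later update.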
+
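The output naming can be sketched as a small shell function. This is an illustration based on how the `compile.sh` in this commit assembles `out_model`; the function name `bmodel_name` is hypothetical. As that script is written, the `1dev` suffix is appended even when `--num_device` is not passed.

```bash
#!/bin/bash
# Sketch only: reproduce the output file name that compile.sh builds from
# --name, --mode and --num_device. bmodel_name is a hypothetical helper.
bmodel_name() {
    local name="${1:-baichuan2-7b}"   # script default when --name is absent
    local mode="${2:-f16}"            # script default when --mode is absent
    local num_device="${3:-1}"        # script default when --num_device is absent
    echo "${name}_${mode}_${num_device}dev.bmodel"
}

# Example:
# bmodel_name baichuan2-7b int8 1
```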
+----------------------------
+
+# Stage 2: Building the executables (optional)
+
+## Preparation
+* Prepare the bmodel: Stage 1 produces a compiled bmodel file. Alternatively, download the precompiled bmodel we provide:
+```shell
+cd Baichuan2-TPU/model
+pip3 install dfss
+# baichuan2-7B
+python3 -m dfss --url=open@sophgo.com:sophon-demo/baichuan2/baichuan2-7b-test_int8.bmodel
+```
+This gives you the compiled single-chip int8 bmodel file.
+
+## Building the program (C++ version)
+
+Build as follows; the default is the PCIe version:
+
+```shell
+cd Baichuan2-TPU/demo
+mkdir build
+cd build
+cmake ..
+make
+```
+
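+For the SoC version, there are two ways to build:
+
+Method 1: Copy the demo directory to the SoC environment and build it with the steps above (recommended).
+
+Method 2: Cross-compile inside docker, as follows: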
+
+```shell
+wget https://releases.linaro.org/components/toolchain/binaries/7.5-2019.12/aarch64-linux-gnu/gcc-linaro-7.5.0-2019.12-x86_64_aarch64-linux-gnu.tar.xz
+tar -xvf gcc-linaro-7.5.0-2019.12-x86_64_aarch64-linux-gnu.tar.xz
+mv gcc-linaro-7.5.0-2019.12-x86_64_aarch64-linux-gnu /opt/aarch64-linux-gnu-7.5.0
+cd Baichuan2-TPU/demo
+mkdir build
+cd build
+cmake .. -DTARGET_ARCH=soc # the SoC has only one chip, so multi-chip builds are not supported
+make -j
+```
+
+The build produces the `baichuan2` executable.
+
+Run `baichuan2`:
+```shell
+./baichuan2 --model ../model/baichuan2-7b-test_int8.bmodel --dev dev_id
+```
+
+## Building the program (Python web version) [single chip]
+
+```shell
+pip install gradio==3.39.0
+cd Baichuan2-TPU/web_demo
+mkdir build
+cd build
+cmake ..
+make -j
+```
+
+A successful build produces `libtpuchat.so*` in the `build` folder. You can then set bmodel\_path, token\_path, device\_id, lib\_path (the path of the built `libtpuchat.so*`, by default under `./build`), and dev\_id in web_demo.py.
+```shell
+python web_demo.py
+```
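+This starts the web demo.
+* PS: As long as the locations of token\_path and lib\_path above are left unchanged, you only need to set bmodel\_path to run the program.
+
+For an SoC environment, refer to the C++ version.
+
+* PS: Use gradio==3.39.0 where possible; other versions can cause a variety of problems.
+
+# FAQ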
+* Adjust NUM_LAYERS in `demo/demo.cpp` or `web_demo/chat.cpp` to match the actual number of blocks; the default is Baichuan2-7B (NUM_LAYERS=32).
diff --git a/models/Baichuan2/compile/compile.sh b/models/Baichuan2/compile/compile.sh
@@ -0,0 +1,186 @@
+#!/bin/bash
+set -ex
+models=
+mode="f16"
+folder="tmp"
+num_device=1
+mode_args=""
+device_args=""
+quantize_args="--quantize F16"
+name=""
+num_layers=
+out_model=$name.bmodel
+
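+# Arguments are parsed further down, so $name is always empty here.
+# Set the default model; --name baichuan2-13b overrides it later,
+# and num_layers is chosen from the final name after parsing.
+if [ -z "$name" ]; then
+    name="baichuan2-7b"
+    echo "Default model: Baichuan2-7B"
+fi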
+
+while [[ $# -gt 0 ]]; do
+    key="$1"
+
+    case $key in
+        --mode)
+            mode="$2"
+            shift 2
+            ;;
+        --num_device)
+            num_device="$2"
+            shift 2
+            ;;
+        --name)
+            name="$2"
+            shift 2
+            ;;
+        *)
+            echo "Invalid option: $key" >&2
+            exit 1
+            ;;
+        :)
+            echo "Option -$OPTARG requires an argument." >&2
+            exit 1
+            ;;
+    esac
+done
+
+if [ x$mode == x"int8" ] || [ x$mode == x"int4" ]; then
+    if [ x$mode == x"int8" ]; then
+        quantize_args="--quantize W8F16"
+    else
+        quantize_args="--quantize W4BF16 --q_group_size 64"
+    fi
+    out_model=$name'_'$mode'.bmodel'
+fi
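In the block above, any `--mode` value other than `int8` or `int4` silently keeps `--quantize F16`, so a typo such as `in8` would compile an F16 model and name the output after the typo. A stricter mapping could look like the sketch below. The function name `quantize_args_for` and the rejection of unknown modes are assumptions; the flag strings are the ones this script already passes to `model_deploy.py`.

```bash
#!/bin/bash
# Sketch only: map a --mode value to the quantization flags used by
# compile.sh, and fail early on anything unrecognized.
quantize_args_for() {
    case "$1" in
        f16)  echo "--quantize F16" ;;
        int8) echo "--quantize W8F16" ;;
        int4) echo "--quantize W4BF16 --q_group_size 64" ;;
        *)
            echo "Unsupported mode: $1" >&2
            return 1
            ;;
    esac
}

# Example:
# quantize_args=$(quantize_args_for "$mode") || exit 1
```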
+
+if [ x$name == x"baichuan2-7b" ] || [ x$name == x"baichuan2-13b" ]; then
+    if [ x$name == x"baichuan2-7b" ]; then
+        num_layers=32
+    else
+        num_layers=40
+    fi
+fi
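If `--name` is neither `baichuan2-7b` nor `baichuan2-13b`, `num_layers` stays empty and the later `for ((i=0; i<$num_layers; i++))` loop fails with an arithmetic syntax error, but only after the embedding and lm_head models have already been compiled. A minimal sketch of an early check follows; the helper name `num_layers_for` is hypothetical, and the layer counts are the ones set in this script.

```bash
#!/bin/bash
# Sketch only: resolve the block count from the model name and reject
# names the script does not know about.
num_layers_for() {
    case "$1" in
        baichuan2-7b)  echo 32 ;;
        baichuan2-13b) echo 40 ;;
        *)
            echo "Unknown model name: $1" >&2
            return 1
            ;;
    esac
}

# Example:
# num_layers=$(num_layers_for "$name") || exit 1
```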
+
+if [ x$num_device != x1 ]; then
+    device_args="--num_device $num_device"
+    out_model=$name'_'$mode'_'$num_device'dev.bmodel'
+else
+    out_model=$name'_'$mode'_1dev.bmodel'
+fi
+
+outdir=${folder}/embedding
+mkdir -p $outdir
+pushd $outdir
+
+seqlen=512
+model_transform.py \
+    --model_name embedding \
+    --model_def ../embedding.onnx \
+    --input_shapes [[$seqlen]] \
+    --mlir embedding_${seqlen}.mlir
+
+
+model_deploy.py \
+    --mlir embedding_$seqlen.mlir \
+    --quantize F16 \
+    --chip bm1684x \
+    $device_args \
+    --model embedding_${seqlen}_f16.bmodel
+
+model_transform.py \
+    --model_name embedding_cache \
+    --model_def ../embedding.onnx \
+    --input_shapes [[1]] \
+    --mlir embedding_1.mlir
+
+
+model_deploy.py \
+    --mlir embedding_1.mlir \
+    --quantize F16 \
+    --chip bm1684x \
+    $device_args \
+    --model embedding_1_f16.bmodel
+
+rm *.npz
+
+models=$models' '$outdir'/embedding_1_f16.bmodel '$outdir'/embedding_'$seqlen'_f16.bmodel '
+
+popd
+
+echo $models
+
+outdir=${folder}/$mode"_"$num_device"dev"/lm_head
+mkdir -p $outdir
+pushd $outdir
+
+model_transform.py \
+    --model_name lm_head \
+    --model_def ../../lm_head.onnx \
+    --mlir lm_head.mlir
+
+
+model_deploy.py \
+    --mlir lm_head.mlir \
+    --quantize F16 \
+    --chip bm1684x \
+    --model lm_head.bmodel
+
+rm *.npz
+
+models=${models}${outdir}'/lm_head.bmodel '
+popd
+
+echo $models
+
+outdir=${folder}/$mode"_"$num_device"dev"/block
+mkdir -p $outdir
+
+pushd $outdir
+# $outdir was created before pushd; no further mkdir is needed here.
+
+for ((i=0; i<$num_layers; i++))
+do
+
+model_transform.py \
+    --model_name block_$i \
+    --model_def ../../block_$i.onnx \
+    --mlir block_$i.mlir
+
+model_deploy.py \
+    --mlir block_$i.mlir \
+    $quantize_args \
+    --chip bm1684x \
+    --quant_output \
+    --quant_output_list 2,3 \
+    $device_args \
+    --model block_$i.bmodel
+
+model_transform.py \
+    --model_name block_cache_$i \
+    --model_def ../../block_cache_${i}.onnx \
+    --mlir block_cache_$i.mlir
+
+model_deploy.py \
+    --mlir block_cache_$i.mlir \
+    $quantize_args \
+    --chip bm1684x \
+    --quant_input \
+    --quant_output \
+    --quant_input_list 4,5 \
+    --quant_output_list 2,3 \
+    $device_args \
+    --model block_cache_$i.bmodel
+
+rm *.npz
+# rm ../../block_$i.onnx
+# rm ../../block_cache_$i.onnx
+
+models=${models}${outdir}'/block_'$i'.bmodel '$outdir'/block_cache_'$i'.bmodel '
+
+done
+popd
+echo $models
+
+model_tool --combine $models -o $out_model
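The final step hands every path collected in `$models` to `model_tool --combine`. Since the full run takes hours, it can help to confirm that each intermediate bmodel exists before combining. The following is a minimal sketch under that assumption; `check_models` is a hypothetical helper and not part of this script.

```bash
#!/bin/bash
# Sketch only: report every missing bmodel path and return non-zero if
# at least one is absent.
check_models() {
    local missing=0 path
    for path in "$@"; do
        if [ ! -f "$path" ]; then
            echo "Missing bmodel: $path" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example ($models is left unquoted on purpose so it splits into paths):
# check_models $models && model_tool --combine $models -o "$out_model"
```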