Serving Configuration

Overview

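This document describes the configuration options for C++ Serving and Python Pipeline:

Model configuration file: generated automatically when a model is converted; describes the model's inputs and outputs
C++ Serving: for high-performance scenarios; covers quick start and custom configuration
Python Pipeline: for scenarios that combine single operators with multiple models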

Model configuration file

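Before describing the Server configuration, we first introduce the model configuration file. Converting a model into a PaddleServing model generates serving_client_conf.prototxt and serving_server_conf.prototxt. The two files have identical content: the parameter information for the model's inputs and outputs, which makes it easier to assemble request parameters. The file is used by both the Server and the Client, and users do not need to modify it. For the conversion procedure, see the document "How to save a model for Paddle Serving" (怎样保存用于Paddle Serving的模型). The protobuf format is defined in core/configure/proto/general_model_config.proto.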

An example:

feed_var {
  name: "x"
  alias_name: "x"
  is_lod_tensor: false
  feed_type: 1
  shape: 13
}
fetch_var {
  name: "concat_1.tmp_0"
  alias_name: "concat_1.tmp_0"
  is_lod_tensor: false
  fetch_type: 1
  shape: 3
  shape: 640
  shape: 640
}

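Where:

feed_var: model input
fetch_var: model output
name: the variable name
alias_name: an alias that corresponds to the name
is_lod_tensor: whether the variable is a LoD tensor; see "LoD field description" (Lod字段说明) for details
feed_type: the data type, as listed below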

feed_type	Type
0	INT64
1	FLOAT32
2	INT32
3	FP64
4	INT16
5	FP16
6	BF16
7	UINT8
8	INT8
20	STRING

shape: the dimensions of the data
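The fields described above can be read programmatically. The following is a minimal sketch, not the parser Paddle Serving itself uses: `parse_model_conf` is a hypothetical helper that handles only the flat `feed_var` / `fetch_var` layout shown in the example, and it assumes that `fetch_type` uses the same codes as the `feed_type` table.

```python
import re

# Type codes copied from the feed_type table above.
# Assumption: fetch_type uses the same codes.
TYPE_NAMES = {
    0: "INT64", 1: "FLOAT32", 2: "INT32", 3: "FP64", 4: "INT16",
    5: "FP16", 6: "BF16", 7: "UINT8", 8: "INT8", 20: "STRING",
}

SAMPLE = '''
feed_var {
  name: "x"
  alias_name: "x"
  is_lod_tensor: false
  feed_type: 1
  shape: 13
}
fetch_var {
  name: "concat_1.tmp_0"
  alias_name: "concat_1.tmp_0"
  is_lod_tensor: false
  fetch_type: 1
  shape: 3
  shape: 640
  shape: 640
}
'''


def parse_model_conf(text):
    """Hypothetical helper: turn flat feed_var/fetch_var blocks into dicts."""
    variables = []
    for kind, body in re.findall(r"(feed_var|fetch_var)\s*\{(.*?)\}", text, re.S):
        var = {"kind": kind, "shape": []}
        for key, value in re.findall(r"(\w+):\s*(\S+)", body):
            value = value.strip('"')
            if key == "shape":
                var["shape"].append(int(value))  # shape is a repeated field
            elif key in ("feed_type", "fetch_type"):
                var["type"] = TYPE_NAMES[int(value)]
            elif key == "is_lod_tensor":
                var[key] = value == "true"
            else:
                var[key] = value
        variables.append(var)
    return variables


for var in parse_model_conf(SAMPLE):
    print(var["kind"], var["alias_name"], var["type"], var["shape"])
```

For real files, the protobuf definition in core/configure/proto/general_model_config.proto is the authoritative description of the format.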

C++ Serving

1. Quick start and shutdown

A service can be started quickly by specifying the model and the port. The start command is:

python3 -m paddle_serving_server.serve --model serving_model --port 9393

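This command generates the configuration files automatically and starts C++ Serving with them. For example, the command above creates the directory workdir_9393 with the following structure: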

workdir_9393
├── general_infer_0
│   ├── fluid_time_file
│   ├── general_model.prototxt
│   └── model_toolkit.prototxt
├── infer_service.prototxt
├── resource.prototxt
└── workflow.prototxt

The full list of startup arguments is given in the table below:

Argument	Type	Default	Description
`--thread`	int	`2`	Number of brpc service threads
`--runtime_thread_num`	int[]	`0`	Number of threads for each model in asynchronous mode
`--batch_infer_size`	int[]	`32`	Batch size for each model in asynchronous mode
`--gpu_ids`	str[]	`"-1"`	GPU card ids for each model
`--port`	int	`9292`	Exposed port of current service to users
`--model`	str[]	`""`	Path of paddle model directory to be served
`--mem_optim_off`	-	-	Disable memory / graphic memory optimization
`--ir_optim`	bool	False	Enable analysis and optimization of calculation graph
`--use_mkl` (Only for cpu version)	-	-	Run inference with MKL. Requires ir_optim to be enabled.
`--use_trt` (Only for trt version)	-	-	Run inference with TensorRT. Requires ir_optim to be enabled.
`--use_lite` (Only for Intel x86 CPU or ARM CPU)	-	-	Run PaddleLite inference. Requires ir_optim to be enabled.
`--use_xpu`	-	-	Run PaddleLite inference with Baidu Kunlun XPU. Requires ir_optim to be enabled.
`--precision`	str	FP32	Precision Mode, support FP32, FP16, INT8
`--use_calib`	bool	False	Use TRT int8 calibration
`--gpu_multi_stream`	bool	False	Enable GPU multi-stream mode (EnableGpuMultiStream) for higher QPS
`--use_ascend_cl`	bool	False	Enable for ascend910; Use with use_lite for ascend310
`--request_cache_size`	int	`0`	Bytes size of request cache. By default, the cache is disabled
`--enable_prometheus`	bool	False	Use Prometheus
`--prometheus_port`	int	19393	Port of the Prometheus
`--use_dist_model`	bool	False	Use distributed model or not
`--dist_carrier_id`	str	""	Carrier id of distributed model
`--dist_cfg_file`	str	""	Config file of distributed model
`--dist_endpoints`	str	""	Endpoints of distributed model, separated by commas
`--dist_nranks`	int	0	The number of ranks in the distributed model
`--dist_subgraph_index`	int	-1	The subgraph index of distributed model
`--dist_master_serving`	bool	False	The master serving of distributed inference
`--min_subgraph_size`	str	""	The min size of subgraph
`--gpu_memory_mb`	int	50	Initially allocate GPU storage size, 50 MB default.
`--cpu_math_thread_num`	int	1	Initialize the number of CPU computing threads
`--trt_workspace_size`	int	33554432	Initial GPU memory size allocated for TensorRT (1 << 25)
`--trt_use_static`	bool	False	Initialize TRT with static data

To deploy a single model on multiple GPU cards:

python3 -m paddle_serving_server.serve --model serving_model --thread 10 --port 9292 --gpu_ids 0,1,2

To deploy two models in one service:

python3 -m paddle_serving_server.serve --model serving_model_1 serving_model_2 --thread 10 --port 9292
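When the launch command is generated from a script, it can help to assemble it as an argument list. The sketch below uses only flags that appear in the table above; `build_serve_command` is a hypothetical helper, and passing one comma-separated `--gpu_ids` entry per model is an assumption based on the `str[]` type in the table.

```python
import shlex


def build_serve_command(models, port, thread=None, gpu_ids=None):
    """Hypothetical helper: assemble the launch command as an argv list."""
    cmd = ["python3", "-m", "paddle_serving_server.serve", "--model", *models]
    if thread is not None:
        cmd += ["--thread", str(thread)]
    cmd += ["--port", str(port)]
    if gpu_ids:
        # Assumption: one entry per model, each listing that model's cards.
        cmd += ["--gpu_ids", *[",".join(str(i) for i in ids) for ids in gpu_ids]]
    return cmd


# One model on three cards, mirroring the multi-GPU command above.
print(shlex.join(build_serve_command(["serving_model"], 9292, thread=10,
                                     gpu_ids=[[0, 1, 2]])))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting issues with model paths.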

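To stop the Serving service, run the following command in the directory where Serving was started or in the path given by the SERVING_HOME environment variable: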

python3 -m paddle_serving_server.serve stop

The stop argument sends SIGINT to C++ Serving; replacing it with kill sends SIGKILL instead.

2. Starting with a custom configuration

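In most cases the automatically generated configuration is sufficient. For special scenarios, users can also define the configuration files themselves. These files are service.prototxt, workflow.prototxt, resource.prototxt, model_toolkit.prototxt, and proj.conf. The start command is: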

/bin/serving --flagfile=proj.conf

2.1 proj.conf

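proj.conf passes service parameters and specifies the paths of the other configuration files. If a parameter is passed more than once, the last value takes effect.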

# for paddle inference
--precision=fp32
--use_calib=False
--reload_interval_s=10
# for brpc
--max_concurrency=0
--num_threads=10
--bthread_concurrency=10
--max_body_size=536870912
# default path
--inferservice_path=conf
--inferservice_file=infer_service.prototxt
--resource_path=conf
--resource_file=resource.prototxt
--workflow_path=conf
--workflow_file=workflow.prototxt

The description and default value of each parameter are listed in the table below:

name	Default	Description
precision	"fp32"	Precision Mode, support FP32, FP16, INT8
use_calib	False	Only for deployment with TensorRT
reload_interval_s	10	Reload interval
max_concurrency	0	Limit of request processing in parallel, 0: unlimited
num_threads	10	Number of brpc service thread
bthread_concurrency	10	Number of bthread
max_body_size	536870912	Max size of brpc message
inferservice_path	"conf"	Path of inferservice conf
inferservice_file	"infer_service.prototxt"	Filename of inferservice conf
resource_path	"conf"	Path of resource conf
resource_file	"resource.prototxt"	Filename of resource conf
workflow_path	"conf"	Path of workflow conf
workflow_file	"workflow.prototxt"	Filename of workflow conf
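The "last value takes effect" rule for proj.conf can be illustrated with a small reader. This is a sketch only: `parse_flagfile` is a hypothetical helper, not the flag parser the serving binary uses, and it handles only the `--name=value` and `#` comment forms shown in the example.

```python
def parse_flagfile(text):
    """Hypothetical helper: read --name=value lines; later duplicates win."""
    flags = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, _, value = line.lstrip("-").partition("=")
        flags[name] = value  # a repeated flag overwrites the earlier value
    return flags


conf = """
# for brpc
--num_threads=10
--bthread_concurrency=10
--num_threads=16
"""
flags = parse_flagfile(conf)
print(flags["num_threads"])
```

Here `num_threads` appears twice, and the value kept is the one from the later line.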

2.2 service.prototxt

service.prototxt configures the list of services mounted on a Paddle Serving instance. Its load path is specified with --inferservice_path and --inferservice_file. For the protobuf format, see InferServiceConf in core/configure/server_configure.protobuf. An example:

port: 8010
services {
  name: "GeneralModelService"
  workflows: "workflow1"
}

Where:

port: the port the Serving instance listens on.
services: use the default configuration; it must not be modified. name specifies the service name, and workflow1 is defined in workflow.prototxt.

2.3 workflow.prototxt

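workflow.prototxt describes the concrete workflow. Its load path is specified with --workflow_path and --workflow_file. For the protobuf format, see the Workflow type in configure/server_configure.protobuf. In the example below, the workflow consists of three OPs: GeneralReaderOp reads the data, GeneralInferOp depends on GeneralReaderOp and runs prediction, and GeneralResponseOp returns the prediction result: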

workflows {
  name: "workflow1"
  workflow_type: "Sequence"
  nodes {
    name: "general_reader_0"
    type: "GeneralReaderOp"
  }
  nodes {
    name: "general_infer_0"
    type: "GeneralInferOp"
    dependencies {
      name: "general_reader_0"
      mode: "RO"
    }
  }
  nodes {
    name: "general_response_0"
    type: "GeneralResponseOp"
    dependencies {
      name: "general_infer_0"
      mode: "RO"
    }
  }
}

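Where:

name: the workflow name, used to look up the workflow from service.prototxt
workflow_type: only "Sequence" is supported
nodes: all nodes chained into the workflow; multiple nodes can be configured, and they are linked through their dependencies
node.name: corresponds one-to-one with node.type; see python/paddle_serving_server/dag.py for details
node.type: the class name of the OP executed by the node; it matches the name of a concrete OP class under serving/op/
node.dependencies: the list of upstream nodes this node depends on
node.dependencies.name: must match the name of a node in the workflow
node.dependencies.mode: RO (Read Only) or RW (Read Write)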
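The way dependencies chain nodes into a sequence can be checked with a topological sort. The sketch below hard-codes the three nodes from the example and uses Python's standard `graphlib` module (Python 3.9+); it illustrates the ordering only and is not how C++ Serving schedules OPs internally.

```python
from graphlib import TopologicalSorter

# node name -> names of the upstream nodes it depends on,
# copied from the workflow1 example above
dependencies = {
    "general_reader_0": [],
    "general_infer_0": ["general_reader_0"],
    "general_response_0": ["general_infer_0"],
}

# static_order() yields every node after all of its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

A dependency name that does not match any node, or a cycle between nodes, would show up here as an unexpected extra node or a `CycleError`, which makes this a cheap sanity check before starting the server.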

2.4 resource.prototxt

resource.prototxt specifies the model configuration files. Its load path is specified with --resource_path and --resource_file. For the protobuf format, see ResourceConf in core/configure/proto/server_configure.proto. An example:

model_toolkit_path: "conf"
model_toolkit_file: "general_infer_0/model_toolkit.prototxt"
general_model_path: "conf"
general_model_file: "general_infer_0/general_model.prototxt"

Where:

model_toolkit_path: the directory that contains model_toolkit.prototxt
model_toolkit_file: the file name of model_toolkit.prototxt
general_model_path: the directory that contains general_model.prototxt
general_model_file: the file name of general_model.prototxt

2.5 model_toolkit.prototxt

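model_toolkit.prototxt configures the model information and the prediction engine. For the protobuf format, see ModelToolkitConf in core/configure/proto/server_configure.proto. The on-disk path of model_toolkit.prototxt cannot be overridden with command-line arguments. An example: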

engines {
  name: "general_infer_0"
  type: "PADDLE_INFER"
  reloadable_meta: "uci_housing_model/fluid_time_file"
  reloadable_type: "timestamp_ne"
  model_dir: "uci_housing_model"
  gpu_ids: -1
  enable_memory_optimization: true
  enable_ir_optimization: false
  use_trt: false
  use_lite: false
  use_xpu: false
  use_gpu: false
  combined_model: false
  gpu_multi_stream: false
  use_ascend_cl: false
  runtime_thread_num: 0
  batch_infer_size: 32
  enable_overrun: false
  allow_split_request: true
}

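Where:

name: the engine name; it corresponds to node.name in workflow.prototxt and to the name of the directory that contains the file
type: the type of the prediction engine. Only "PADDLE_INFER" is currently supported
reloadable_meta: the file's content is currently meaningless; its mtime is used to decide whether the reload time threshold has been exceeded
reloadable_type: the reload condition to check: timestamp_ne/timestamp_gt/md5sum/revision/none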

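reloadable_type	Meaning
timestamp_ne	The mtime timestamp of the file given by reloadable_meta has changed
timestamp_gt	The mtime timestamp of the file given by reloadable_meta is greater than or equal to the mtime recorded at the last check
md5sum	Currently unused; once configured, the model is never reloaded
revision	Currently unused; once configured, the model is never reloaded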

model_dir: path to the model files
gpu_ids: the GPU device ids the engine uses at runtime; multiple ids can be specified, for example:

# Use GPUs 0, 1, and 2
gpu_ids: 0
gpu_ids: 1
gpu_ids: 2

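enable_memory_optimization: whether to enable memory optimization
enable_ir_optimization: whether to enable IR optimization
use_trt: whether to enable TensorRT; use_gpu must be enabled as well
use_lite: whether to enable PaddleLite
use_xpu: whether to use Kunlun XPU
use_gpu: whether to use the GPU
combined_model: whether to use a combined model file
gpu_multi_stream: whether to enable GPU multi-stream mode
use_ascend_cl: whether to use Ascend; enabled alone it targets Ascend 910, enabled together with use_lite it targets Ascend 310
runtime_thread_num: if greater than 0, asynchronous (Async) mode is enabled and that number of predictor instances is created
batch_infer_size: the maximum batch size in Async mode
enable_overrun: in Async mode, always put the whole task into the task queue
allow_split_request: in Async mode, allow a task to be split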
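The reload conditions listed for reloadable_type can be summarized as a small decision function. This is an illustrative sketch of the table only, not the engine's implementation: `should_reload` is a hypothetical name, and treating `none` as "never reload" is an assumption, since the table does not describe that value.

```python
def should_reload(reloadable_type, current_mtime, last_mtime):
    """Hypothetical helper mirroring the reloadable_type table."""
    if reloadable_type == "timestamp_ne":
        # Reload when the mtime of the reloadable_meta file has changed.
        return current_mtime != last_mtime
    if reloadable_type == "timestamp_gt":
        # The table states "greater than or equal to" the recorded mtime.
        return current_mtime >= last_mtime
    if reloadable_type in ("md5sum", "revision"):
        # Documented as currently unused: never reloads.
        return False
    if reloadable_type == "none":
        # Assumption: not described in the table; treated as never reload.
        return False
    raise ValueError(f"unknown reloadable_type: {reloadable_type}")


print(should_reload("timestamp_ne", current_mtime=1700000100, last_mtime=1700000000))
```

In practice the mtime would come from something like `os.path.getmtime` on the file named by reloadable_meta, checked once every reload_interval_s seconds.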

2.6 general_model.prototxt

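The content of general_model.prototxt is identical to that of the model configuration file serving_server_conf.prototxt; it describes the model's input and output parameters. An example: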

feed_var {
  name: "x"
  alias_name: "x"
  is_lod_tensor: false
  feed_type: 1
  shape: 13
}
fetch_var {
  name: "fc_0.tmp_1"
  alias_name: "price"
  is_lod_tensor: false
  fetch_type: 1
  shape: 1
}

Python Pipeline

Quick start and shutdown

The start command for Python Pipeline is:

python3 web_service.py

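To stop the Serving service, run the following command in the directory where Pipeline was started or in the path given by the SERVING_HOME environment variable: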

python3 -m paddle_serving_server.serve stop

The stop argument sends SIGINT to Pipeline Serving; replacing it with kill sends SIGKILL instead.

Configuration file

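Python Pipeline provides a user-friendly programming framework for multi-model composite services and is suited to applications that combine several models. Its configuration file is in YAML format and is named config.yaml by default. An example: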

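#rpc port. rpc_port and http_port must not both be empty. When rpc_port is empty and http_port is set, rpc_port is automatically set to http_port+1
rpc_port: 18090

#http port. rpc_port and http_port must not both be empty. When rpc_port is available and http_port is empty, http_port is not generated automatically
http_port: 9999

#worker_num, the maximum concurrency. When build_dag_each_worker=True, the framework creates worker_num processes, each of which builds its own gRPC server and DAG
##When build_dag_each_worker=False, the framework sets max_workers=worker_num on the gRPC thread pool of the main thread
worker_num: 20

#build_dag_each_worker. False: the framework creates one DAG inside the process; True: the framework creates multiple independent DAGs, one in each process
build_dag_each_worker: false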

dag:
    #op resource type. True: thread model; False: process model
    is_thread_op: False

    #number of retries
    retry: 1

    #profiling. True: generate Timeline performance data, at some cost to performance; False: disabled
    use_profile: false
    tracer:
        interval_s: 10

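    #client type: brpc, grpc, or local_predictor. local_predictor does not start a Serving service and predicts in-process
    #client_type: local_predictor

    #maximum channel length, 0 by default
    #channel_size: 0

    #for tensor parallelism in distributed large-model scenarios: after the first result is received, the remaining results are discarded to improve speed
    #channel_recv_frist_arrive: False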

op:
    det:
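        #concurrency. Thread concurrency when is_thread_op=True; otherwise process concurrency
        concurrency: 6

        #Serving IPs
        #server_endpoints: ["127.0.0.1:9393"]

        #fetch list, based on the alias_name of fetch_var in client_config
        #fetch_list: ["concat_1.tmp_0"]

        #client-side configuration of the det model
        #client_config: serving_client_conf.prototxt

        #timeout for interaction with Serving, in ms
        #timeout: 3000

        #number of retries for interaction with Serving; no retry by default
        #retry: 1

        # batch size for requests to Serving, 1 by default. When batch_size>1, set auto_batching_timeout; otherwise the op blocks until batch_size requests have arrived
        #batch_size: 2

        # batching timeout, used together with batch_size
        #auto_batching_timeout: 2000

        #when the op configuration has no server_endpoints, the local service configuration is read from local_service_conf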
        local_service_conf:
            #client type: brpc, grpc, or local_predictor. local_predictor does not start a Serving service and predicts in-process
            client_type: local_predictor

            #det model path
            model_config: ocr_det_model

            #fetch list, based on the alias_name of fetch_var in client_config
            fetch_list: ["concat_1.tmp_0"]

            # device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu, 5=arm ascend310, 6=arm ascend910
            device_type: 0

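            #compute device ids. When devices is "" or omitted, prediction runs on the CPU; when devices is "0" or "0,1,2", prediction runs on the GPU cards listed
            devices: ""

            #use_mkldnn. When enabling mkldnn, ir_optim=True must also be set; otherwise it has no effect
            #use_mkldnn: True

            #ir_optim. When enabling TensorRT, ir_optim=True must also be set; otherwise it has no effect
            ir_optim: True

            #number of CPU compute threads; enabling this in CPU scenarios reduces the latency of a single request
            #thread_num: 10

            #precision, the prediction precision. Lowering it can speed up prediction
            #GPU supports: "fp32"(default), "fp16", "int8"
            #CPU supports: "fp32"(default), "fp16", "bf16"(mkldnn); not supported: "int8"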
            precision: "fp32"

            #mem_optim, memory / graphic memory optimization
            #mem_optim: True

            #use_calib, Use TRT int8 calibration
            #use_calib: False

            #use_mkldnn, Use mkldnn for cpu
            #use_mkldnn: False

            #The cache capacity of different input shapes for mkldnn
            #mkldnn_cache_capacity: 0

            #mkldnn_op_list, op list accelerated using MKLDNN, None default
            #mkldnn_op_list: []

            #mkldnn_bf16_op_list,op list accelerated using MKLDNN bf16, None default.
            #mkldnn_bf16_op_list: []

            #min_subgraph_size,the minimal subgraph size for opening tensorrt to optimize, 3 default
            #min_subgraph_size: 3
    rec:
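        #concurrency. Thread concurrency when is_thread_op=True; otherwise process concurrency
        concurrency: 3

        #timeout, in ms
        timeout: -1

        #number of retries for interaction with Serving; no retry by default
        retry: 1

        #when the op configuration has no server_endpoints, the local service configuration is read from local_service_conf
        local_service_conf:

            #client type: brpc, grpc, or local_predictor. local_predictor does not start a Serving service and predicts in-process
            client_type: local_predictor

            #rec model path
            model_config: ocr_rec_model

            #fetch list, based on the alias_name of fetch_var in client_config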
            fetch_list: ["ctc_greedy_decoder_0.tmp_0", "softmax_0.tmp_0"]

            # device_type, 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu, 5=arm ascend310, 6=arm ascend910
            device_type: 0

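            #compute device ids. When devices is "" or omitted, prediction runs on the CPU; when devices is "0" or "0,1,2", prediction runs on the GPU cards listed
            devices: ""

            #use_mkldnn. When enabling mkldnn, ir_optim=True must also be set; otherwise it has no effect
            #use_mkldnn: True

            #ir_optim. When enabling TensorRT, ir_optim=True must also be set; otherwise it has no effect
            ir_optim: True

            #number of CPU compute threads; enabling this in CPU scenarios reduces the latency of a single request
            #thread_num: 10

            #precision, the prediction precision. Lowering it can speed up prediction
            #GPU supports: "fp32"(default), "fp16", "int8"
            #CPU supports: "fp32"(default), "fp16", "bf16"(mkldnn); not supported: "int8"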
            precision: "fp32"
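The rule for rpc_port and http_port stated in the comments at the top of config.yaml can be written out as a short function. This is a sketch of the documented behaviour only; `resolve_ports` is a hypothetical name, and an empty port is represented here as `None`.

```python
def resolve_ports(rpc_port=None, http_port=None):
    """Hypothetical helper applying the rpc_port/http_port rule from config.yaml."""
    if rpc_port is None and http_port is None:
        raise ValueError("rpc_port and http_port must not both be empty")
    if rpc_port is None:
        # rpc_port is derived from http_port when only http_port is set.
        rpc_port = http_port + 1
    # http_port is never generated automatically, so it may stay None.
    return rpc_port, http_port


print(resolve_ports(http_port=9999))
print(resolve_ports(rpc_port=18090))
```

The asymmetry is the point to remember: an HTTP-only configuration still opens an RPC port on the next port number, while an RPC-only configuration opens no HTTP port.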

Single machine, multiple cards

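For single-machine multi-card inference, M OP processes are bound to N GPU cards, which requires three parameters in config.yaml. First select process mode, so that the concurrency equals the number of processes, then set devices. Binding works by cycling through the GPU card ids as processes start. For example, with 7 OP processes and device ids 0, 1, and 2, the 1st, 4th, and 7th processes are bound to card 0, the 2nd and 5th to card 1, and the 3rd and 6th to card 2.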

#op resource type. True: thread model; False: process model
is_thread_op: False

#concurrency. Thread concurrency when is_thread_op=True; otherwise process concurrency
concurrency: 7

devices: "0,1,2"
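The binding described above is a round-robin assignment, which can be sketched as follows. `bind_processes_to_devices` is a hypothetical helper written to reproduce the 7-process, 3-card example; it is not the framework's own code.

```python
def bind_processes_to_devices(concurrency, devices):
    """Hypothetical helper: map 1-based OP process numbers to GPU card ids."""
    device_ids = [d.strip() for d in devices.split(",") if d.strip()]
    return {
        proc: device_ids[(proc - 1) % len(device_ids)]  # cycle through the cards
        for proc in range(1, concurrency + 1)
    }


binding = bind_processes_to_devices(concurrency=7, devices="0,1,2")
for proc, card in binding.items():
    print(f"process {proc} -> card {card}")
```

When the concurrency is not a multiple of the number of cards, the lower-numbered cards receive one more process each, as card 0 does in this example.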

Heterogeneous hardware

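Besides CPU and GPU, Python Pipeline supports deployment on a variety of heterogeneous hardware, controlled by device_type and devices in config.yaml. device_type takes precedence; when it is omitted, the type is inferred from devices. The device_type values are: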

CPU(Intel) : 0
GPU : 1
TensorRT : 2
CPU(Arm) : 3
XPU : 4
Ascend310(Arm) : 5
Ascend910(Arm) : 6

Hardware configuration in config.yaml:

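#compute device type. When omitted, devices decides (CPU/GPU). 0=cpu, 1=gpu, 2=tensorRT, 3=arm cpu, 4=kunlun xpu, 5=arm ascend310, 6=arm ascend910
device_type: 0
#compute device ids. device_type takes precedence in deciding the hardware type. When devices is "" or omitted, prediction runs on the CPU; when it is "0" or "0,1,2", prediction runs on the GPU cards listed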
devices: "" # "0,1"
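The precedence between device_type and devices can be expressed as a small lookup. The sketch below follows the rule stated above; `resolve_device_type` is a hypothetical name, an omitted device_type is represented as `None`, and the returned labels are the ones used in the configuration comment.

```python
# device_type codes as listed above
DEVICE_TYPES = {
    0: "cpu", 1: "gpu", 2: "tensorRT", 3: "arm cpu",
    4: "kunlun xpu", 5: "arm ascend310", 6: "arm ascend910",
}


def resolve_device_type(device_type=None, devices=""):
    """Hypothetical helper: device_type wins; otherwise infer CPU/GPU from devices."""
    if device_type is not None:
        return DEVICE_TYPES[device_type]
    # Without device_type, an empty devices string means CPU, anything else GPU.
    return "gpu" if devices.strip() else "cpu"


print(resolve_device_type(device_type=2, devices="0"))
print(resolve_device_type(devices="0,1"))
print(resolve_device_type())
```

One consequence of this rule is that TensorRT, Arm, XPU, and Ascend targets can only be selected through device_type, since devices alone distinguishes only CPU from GPU.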

Low-precision inference

Python Pipeline supports low-precision inference. The precision types supported by CPU, GPU, and TensorRT are:

CPU
- fp32(default)
- fp16
- bf16(mkldnn)
GPU
- fp32(default)
- fp16 (only effective with TRT)
- int8
TensorRT
- fp32(default)
- fp16
- int8

#precision, the prediction precision. Lowering it can speed up prediction
#GPU supports: "fp32"(default), "fp16"(TensorRT), "int8"
#CPU supports: "fp32"(default), "fp16", "bf16"(mkldnn); not supported: "int8"
precision: "fp32"

#use_calib, enable it when using int8
use_calib: True
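A configuration can be checked against the supported precision list before the service is started. The following is a minimal sketch built from the list above; `check_precision` and the device labels are hypothetical, and the check does not capture the note that fp16 on GPU is only effective with TRT.

```python
# Supported precision types per device, copied from the list above.
SUPPORTED_PRECISION = {
    "cpu": {"fp32", "fp16", "bf16"},      # bf16 requires mkldnn
    "gpu": {"fp32", "fp16", "int8"},      # fp16 is only effective with TRT
    "tensorrt": {"fp32", "fp16", "int8"},
}


def check_precision(device, precision="fp32"):
    """Hypothetical helper: return True if the device lists this precision."""
    return precision.lower() in SUPPORTED_PRECISION[device.lower()]


print(check_precision("cpu", "int8"))
print(check_precision("tensorrt", "int8"))
```

Such a check only catches unsupported combinations; whether a lower precision is acceptable for a given model still has to be validated on its accuracy.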
