
llm inference docs #8976


Merged
merged 10 commits into from Aug 27, 2024
Changes from 7 commits
12 changes: 12 additions & 0 deletions README.md
@@ -127,6 +127,18 @@ The Unified Checkpoint large-model storage format supports dynamic expansion of the model parameter distribution
| Yuan2 | ✅ | ✅ | ✅ | 🚧 | 🚧 | 🚧 | 🚧 | ✅ |
------------------------------------------------------------------------------------------

* [Large model inference](./llm/docs/predict/inference.md) now supports the LLaMA, Qwen, Mistral, ChatGLM, Bloom, and Baichuan model families, with Weight-Only INT8 and INT4 inference, as well as WAC (weight, activation, Cache KV) INT8 and FP8 quantized inference. The supported LLM inference models are listed below:

| Model / supported quantization | FP16/BF16 | WINT8 | WINT4 | INT8-A8W8 | FP8-A8W8 | INT8-A8W8C8 |
|:--------------------------------------------:|:---------:|:-----:|:-----:|:---------:|:--------:|:-----------:|
| [LLaMA](./llm/docs/predict/llama.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| [Qwen](./llm/docs/predict/qwen.md) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| [Qwen-Moe](./llm/docs/predict/qwen.md) | ✅ | ✅ | ✅ | 🚧 | 🚧 | 🚧 |
| [Mixtral](./llm/docs/predict/mixtral.md) | ✅ | ✅ | ✅ | 🚧 | 🚧 | 🚧 |
| ChatGLM | ✅ | ✅ | ✅ | 🚧 | 🚧 | 🚧 |
| Bloom | ✅ | ✅ | ✅ | 🚧 | 🚧 | 🚧 |
| BaiChuan | ✅ | ✅ | ✅ | ✅ | ✅ | 🚧 |
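
The WINT8/WINT4 columns above refer to weight-only quantization: weights are stored in int8/int4 with a floating-point scale and dequantized back to FP16/BF16 inside the matmul. The sketch below illustrates the per-row INT8 arithmetic in plain Python; it is an illustrative model of the idea, not PaddleNLP's actual fused GPU kernels.

```python
def quantize_wint8(weights):
    """Per-row weight-only INT8 quantization (illustrative sketch only).

    Returns int8-range integer values plus one floating-point scale per row.
    """
    quantized, scales = [], []
    for row in weights:
        scale = max(abs(w) for w in row) / 127.0
        scale = scale if scale > 0 else 1.0  # avoid divide-by-zero on all-zero rows
        quantized.append([round(w / scale) for w in row])
        scales.append(scale)
    return quantized, scales


def dequantize_wint8(quantized, scales):
    """Recover approximate FP weights: w ≈ q * scale."""
    return [[q * s for q in row] for row, s in zip(quantized, scales)]
```

Storing weights this way roughly halves weight memory versus FP16, and the per-element quantization error is bounded by half a scale step.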

## Installation

### Environment dependencies
2 changes: 1 addition & 1 deletion docs/llm/docs/inference.md
25 changes: 5 additions & 20 deletions llm/README.md
@@ -226,22 +226,7 @@ python run_finetune.py ./config/llama/ptq_argument.json

### 5. Inference

In addition to standard model inference, PaddleNLP provides high-performance inference with built-in dynamic insertion and end-to-end operator fusion strategies that greatly speed up parallel inference.

- **Standard model inference**: PaddleNLP offers both dynamic-graph and static-graph inference, making it easy to quickly validate model inference results (including LoRA and PrefixTuning).

```shell
# Dynamic-graph inference example
python ./predict/predictor.py --model_name_or_path meta-llama/Llama-2-7b-chat --data_file ./data/dev.json --dtype float16

# Static-graph inference example
# step 1: export the static graph
python ./predict/export_model.py --model_name_or_path meta-llama/Llama-2-7b-chat --output_path ./inference --dtype float16
# step 2: run static-graph inference
python ./predict/predictor.py --model_name_or_path ./inference --data_file ./data/dev.json --dtype float16 --mode static
```

- **InferenceModel high-performance inference**: PaddleNLP also provides a high-performance inference model to speed up parallel inference, supporting FP16, Prefix Tuning, WINT8, and A8W8 inference modes.
PaddleNLP provides high-performance inference with built-in dynamic insertion and end-to-end operator fusion strategies that greatly speed up parallel inference, supporting FP16/BF16, WINT8, WINT4, A8W8, and A8W8C8 inference modes.

<div align="center">
<img width="500" alt="llm" src="https://github.com/PaddlePaddle/PaddleNLP/assets/63761690/fb248224-0ad1-4d6a-a1ca-3a8dd765c41d">
@@ -253,17 +238,17 @@ python ./predict/predictor.py --model_name_or_path ./inference --data_file ./dat
</div>

```shell
# High-performance dynamic-graph inference example
# Dynamic-graph inference example
python ./predict/predictor.py --model_name_or_path meta-llama/Llama-2-7b-chat --inference_model --dtype float16

# High-performance static-graph inference example
# Static-graph inference example
# step 1: export the static graph
python ./predict/export_model.py --model_name_or_path meta-llama/Llama-2-7b-chat --inference_model --output_path ./inference --dtype float16
# step 2: run static-graph inference
python ./predict/predictor.py --model_name_or_path ./inference --inference_model --dtype "float16" --mode "static"
```

For more on standard and high-performance model inference, see the [LLM inference docs](./docs/inference.md).
For more on model inference usage, see the [LLM inference docs](./docs/predict/inference.md).

### 6. Service deployment

@@ -287,7 +272,7 @@ python -m paddle.distributed.launch --gpus "0,1,2,3,4,5,6,7" ./predict/flask_ser

- `port`: Gradio UI service port, default 8011.
- `flask_port`: Flask service port, default 8010.
- For other parameters, see the inference parameter configuration in the [inference docs](./docs/inference.md).
- For other parameters, see the inference parameter configuration in the [inference docs](./docs/predict/inference.md).

In addition, to run inference through an API script, see the `./predict/request_flask_server.py` file.
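
For a quick client-side sketch, the snippet below shows what such an API call might look like. The endpoint path and JSON field names here are assumptions for illustration only; consult `./predict/request_flask_server.py` for the schema the server actually expects.

```python
import json

# Hypothetical endpoint path; the real port is set by --flask_port (default 8010).
FLASK_URL = "http://127.0.0.1:8010/api/chat"

def build_payload(prompt, max_length=512):
    # Field names here are illustrative assumptions, not the confirmed schema;
    # see ./predict/request_flask_server.py for the real request format.
    return json.dumps({"context": prompt, "max_length": max_length})

# To actually send the request (requires the Flask server to be running):
# import requests
# response = requests.post(FLASK_URL, data=build_payload("Hello"))
# print(response.text)
```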

2 changes: 1 addition & 1 deletion llm/docs/dcu_install.md
@@ -64,4 +64,4 @@ cd -
```

### High-performance inference:
Hygon (DCU) inference commands are identical to the GPU commands; see the [LLM inference tutorial](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/docs/inference.md).
Hygon (DCU) inference commands are identical to the GPU commands; see the [LLM inference tutorial](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/docs/predict/inference.md).