
llm inference docs #8976


Merged
merged 10 commits into from Aug 27, 2024

Conversation

Sunny-bot1
Contributor

@Sunny-bot1 Sunny-bot1 commented Aug 21, 2024

PR types

Others

PR changes

Docs

Description

update llm inference docs


paddle-bot bot commented Aug 21, 2024

Thanks for your contribution!


codecov bot commented Aug 21, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 53.98%. Comparing base (24fa97e) to head (32232fb).
Report is 228 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #8976      +/-   ##
===========================================
- Coverage    54.05%   53.98%   -0.08%     
===========================================
  Files          650      650              
  Lines       103883   104167     +284     
===========================================
+ Hits         56157    56235      +78     
- Misses       47726    47932     +206     


- [Ascend NPU](../../npu/llama/README.md)
- [Hygon K100](../dcu_install.md)
- [Enflame GCU](../../gcu/llama/README.md)
- [X86 CPU](../../../csrc/cpu/README.md)
Collaborator

Same as above.

Comment on lines 129 to 151
### 4.1 Environment Setup

Clone the code locally:

```shell
git clone https://github.com/PaddlePaddle/PaddleNLP.git
export PYTHONPATH=/path/to/PaddleNLP:$PYTHONPATH
```

PaddleNLP provides high-performance custom operators for Transformer-series models to speed up inference and decoding. Install the custom operator library before use:


```shell
git clone https://github.com/PaddlePaddle/PaddleNLP
# Install custom operators for GPU devices
cd ./paddlenlp/csrc && python setup_cuda.py install
# Install custom operators for XPU devices
cd ./paddlenlp/csrc/xpu/src && sh cmake_build.sh
# Install custom operators for DCU devices
cd ./paddlenlp/csrc && python setup_hip.py install
```

Change to the run directory to get started:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The environment setup section is duplicated; this one can be removed.

Collaborator

Link to llm/docs/predict/installation.md

README.md Outdated
@@ -127,6 +127,18 @@ The Unified Checkpoint large-model storage format supports dynamic expansion over the model parameter distribution
| Yuan2 | ✅ | ✅ | ✅ | 🚧 | 🚧 | 🚧 | 🚧 | ✅ |
------------------------------------------------------------------------------------------

* Large model inference now covers the LLaMA, Qwen, Mistral, ChatGLM, Bloom, and Baichuan series, supporting Weight-Only INT8 and INT4 inference as well as INT8 and FP8 quantized inference over WAC (weights, activations, Cache KV). The supported LLM inference models are listed below:
Collaborator

Link "large model inference" to llm/docs/predict/inference.md


1. `quant_type` accepts `weight_only_int8`, `weight_only_int4`, `a8w8`, and `a8w8_fp8`.
2. `a8w8` and `a8w8_fp8` require additional activation and weight scale calibration tables, so the `model_name_or_path` passed for inference must be a quantized model produced by PTQ calibration. See the [LLM quantization tutorial](../quantization.md) for exporting quantized models.
3. `cachekv_int8_type` accepts `dynamic` and `static`; `static` requires an additional cache KV scale calibration table, so the `model_name_or_path` passed in must be a quantized model produced by PTQ calibration. See the [LLM quantization tutorial](../quantization.md) for exporting quantized models.
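For context, the notes above might combine into a single invocation roughly like the following sketch. The `llm/predict/predictor.py` script path and the local model directory are assumptions; `model_name_or_path`, `quant_type`, and `cachekv_int8_type` are the flags documented above:

```shell
# Hypothetical weight-only INT8 run with static cache KV quantization.
# The script path and model directory are assumptions for illustration.
python llm/predict/predictor.py \
    --model_name_or_path ./ptq_calibrated_model \
    --quant_type weight_only_int8 \
    --cachekv_int8_type static
```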
Collaborator

Note here that `dynamic` is no longer maintained and is not recommended.


PaddleNLP supports a variety of hardware platforms and precisions, including:

| Precision | Ada | Ampere | Turing | Volta | Kunlun XPU | Ascend NPU | Hygon K100 | Enflame GCU | x86 CPU |
Collaborator

Recommend centering the table.

PaddleNLP already includes high-performance inference implementations, supporting:
| Models | Example Models |
|--------|----------------|
|Llama 3.1, Llama 3, Llama 2|`meta-llama/Meta-Llama-3-8B`, `meta-llama/Meta-Llama-3.1-8B`, `meta-llama/Meta-Llama-3.1-405B`, etc.|
Collaborator

Recommend listing all model sizes here to make them easier to find.


- `model_name_or_path`: Required. The pretrained model name or local model path, used to warm-start the model and tokenizer. Defaults to None.

- `dtype`: Required. The dtype of the model parameters. Defaults to None. Must be provided if neither `lora_path` nor `prefix_path` is passed.
Collaborator

If neither `lora_path` nor `prefix_path` is passed, the `dtype` parameter must be provided.


- `batch_size`: Batch size. Defaults to 1. Larger values use more GPU memory; smaller values use less.

- `data_file`: The JSON file to run inference on. Defaults to None.
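As a hedged illustration of what `data_file` might contain, assuming a JSON Lines layout with `src`/`tgt` fields (the exact schema is not shown in this thread):

```shell
# Create a sample data_file in JSON Lines form. The "src"/"tgt" field names
# are an assumption for illustration; check the inference docs for the exact
# schema expected by the predictor.
cat > data.json <<'EOF'
{"src": "Write a haiku about the sea.", "tgt": ""}
{"src": "Explain continuous batching in one sentence.", "tgt": ""}
EOF

# Sanity-check that every line parses as JSON.
python -c "import json; [json.loads(l) for l in open('data.json')]"
```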
Collaborator

Suggest giving the file content format here.


- `output_file`: The file to which inference results are saved. Defaults to output.json.

- `device`: Runtime device. Defaults to gpu. Supported values are `cpu`, `gpu`, `xpu`.
Collaborator

Are there only these three device options? This is inconsistent with the hardware support table above.


- `mode`: Whether to run inference in dynamic graph or static graph mode. Supported values are `dynamic` and `static`; defaults to `dynamic`.

- `avx_model`: Whether to use AvxModel for CPU inference. Defaults to False. See the [CPU inference tutorial]().
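Taken together, the documented flags might compose into a run such as the following sketch. The script path `llm/predict/predictor.py` and the `float16` dtype are assumptions; the model name is one of the example models listed above:

```shell
# Hypothetical dynamic-graph GPU inference run; the script path is assumed,
# while the flags and their defaults are the ones documented above.
python llm/predict/predictor.py \
    --model_name_or_path meta-llama/Meta-Llama-3-8B \
    --dtype float16 \
    --batch_size 2 \
    --data_file data.json \
    --output_file output.json \
    --device gpu \
    --mode dynamic
```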
Collaborator

The link is missing here.

PaddleNLP provides high-performance custom operators for Transformer-series models to speed up inference and decoding. Install the custom operator library before use:

```shell
git clone https://github.com/PaddlePaddle/PaddleNLP
```
Collaborator

The repository is cloned twice here.


```shell
cd PaddleNLP/llm
```
Collaborator

It would be better to link to the best practices or other applications here.

```shell
# Dynamic graph inference
export DEVICES=0,1
python -m paddle.distributed.launch \
```
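The quoted command is truncated; a full multi-GPU dynamic-graph launch might continue roughly as follows. The script path and the flags after `launch` are assumptions, not taken from the diff:

```shell
# Hypothetical continuation of the distributed dynamic-graph launch above.
# Script path and model flags are assumptions; see inference.md for the
# authoritative parameter list.
export DEVICES=0,1
python -m paddle.distributed.launch \
    --gpus "$DEVICES" \
    llm/predict/predictor.py \
    --model_name_or_path meta-llama/Meta-Llama-3-8B \
    --mode dynamic
```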
Collaborator

Some introductory explanation is missing, e.g. which parameters are used and what they mean; a brief note would help.

Contributor Author

The parameter meanings are already covered on the inference.md page, so they are not repeated here.


- Supports multi-hardware large model inference, including [Kunlun XPU](../../xpu/llama/README.md), [Ascend NPU](../../npu/llama/README.md), [Hygon K100](../dcu_install.md), [Enflame GCU](../../gcu/llama/README.md), [X86 CPU](../cpu_install.md), etc.

- Provides server-oriented deployment services supporting continuous batching, streaming output, and more, with multiple client forms over HTTP, RPC, and RESTful
Collaborator

"supports multiple client forms over HTTP, RPC, and RESTful" -> "supports gRPC and HTTP service interfaces"

Contributor Author

"supports multiple client forms over HTTP, RPC, and RESTful" -> "supports gRPC and HTTP service interfaces"

Fixed.


- Transformer-based large models (e.g. Llama, Qwen)

- Mixture-of-Experts large models (e.g. Mixtral)
Collaborator

Lines 25-29 can be removed; item 1 below also covers model support, and the supported model list can be seen there.

Contributor Author

Lines 25-29 can be removed; item 1 below also covers model support, and the supported model list can be seen there.

Deleted.

Collaborator

@DrownFish19 DrownFish19 left a comment

LGTM

@wawltor wawltor merged commit a2f9558 into PaddlePaddle:develop Aug 27, 2024
10 of 12 checks passed
lixcli pushed a commit to lixcli/PaddleNLP that referenced this pull request Aug 28, 2024
* update inference docs

* update

* update

* update

* update

* fix comments

* fix comments

* fix comments

* update inference.md
Mangodadada pushed a commit to Mangodadada/PaddleNLP that referenced this pull request Sep 10, 2024