
llm inference docs #8976


Merged
merged 10 commits into from Aug 27, 2024

Conversation

Sunny-bot1
Contributor

@Sunny-bot1 Sunny-bot1 commented Aug 21, 2024

PR types

Others

PR changes

Docs

Description

update llm inference docs


paddle-bot bot commented Aug 21, 2024

Thanks for your contribution!


codecov bot commented Aug 21, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 53.98%. Comparing base (24fa97e) to head (32232fb).
Report is 228 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #8976      +/-   ##
===========================================
- Coverage    54.05%   53.98%   -0.08%     
===========================================
  Files          650      650              
  Lines       103883   104167     +284     
===========================================
+ Hits         56157    56235      +78     
- Misses       47726    47932     +206     


- [Ascend NPU](../../npu/llama/README.md)
- [Hygon K100](../dcu_install.md)
- [Enflame GCU](../../gcu/llama/README.md)
- [X86 CPU](../../../csrc/cpu/README.md)
Collaborator

Same as above.

Comment on lines 129 to 151
### 4.1 Environment Setup

Clone the code locally:

```shell
git clone https://github.com/PaddlePaddle/PaddleNLP.git
export PYTHONPATH=/path/to/PaddleNLP:$PYTHONPATH
```

PaddleNLP provides high-performance custom operators for Transformer-series models to speed up inference and decoding. Install the custom operator library before use:


```shell
git clone https://github.com/PaddlePaddle/PaddleNLP
# Install custom operators for GPU devices
cd ./paddlenlp/csrc && python setup_cuda.py install
# Install custom operators for XPU devices
cd ./paddlenlp/csrc/xpu/src && sh cmake_build.sh
# Install custom operators for DCU devices
cd ./paddlenlp/csrc && python setup_hip.py install
```

Change to the run directory to get started:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The environment setup section is duplicated; this one can be removed.

Collaborator

Link to llm/docs/predict/installation.md

README.md Outdated
@@ -127,6 +127,18 @@ The Unified Checkpoint large-model storage format supports dynamic expansion over the model parameter distribution
| Yuan2 | ✅ | ✅ | ✅ | 🚧 | 🚧 | 🚧 | 🚧 | ✅ |
------------------------------------------------------------------------------------------

* Large model inference now covers the LLaMA, Qwen, Mistral, ChatGLM, Bloom, and Baichuan series, supporting Weight-Only INT8 and INT4 inference as well as INT8 and FP8 quantized inference over WAC (weights, activations, Cache KV). The supported LLM inference models are listed below:
Collaborator

Link "large model inference" to llm/docs/predict/inference.md


1. `quant_type` accepts `weight_only_int8`, `weight_only_int4`, `a8w8`, and `a8w8_fp8`.
2. `a8w8` and `a8w8_fp8` require additional activation and weight scale calibration tables, so the `model_name_or_path` passed for inference must be a quantized model produced by PTQ calibration. See the [LLM quantization tutorial](../quantization.md) for exporting quantized models.
3. `cachekv_int8_type` accepts `dynamic` and `static`; `static` requires an additional cache KV scale calibration table, so the `model_name_or_path` passed in must be a quantized model produced by PTQ calibration. See the [LLM quantization tutorial](../quantization.md) for exporting quantized models.
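For context, the notes above might combine into a single invocation roughly like the following sketch. The `llm/predict/predictor.py` script path and the local model directory are assumptions; `model_name_or_path`, `quant_type`, and `cachekv_int8_type` are the flags documented above:

```shell
# Hypothetical weight-only INT8 run with static cache KV quantization.
# The script path and model directory are assumptions for illustration.
python llm/predict/predictor.py \
    --model_name_or_path ./ptq_calibrated_model \
    --quant_type weight_only_int8 \
    --cachekv_int8_type static
```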
Collaborator

Note here that `dynamic` is no longer maintained and is not recommended.


PaddleNLP supports a variety of hardware platforms and precisions, including:

| Precision | Ada | Ampere | Turing | Volta | Kunlun XPU | Ascend NPU | Hygon K100 | Enflame GCU | x86 CPU |
Collaborator

Recommend centering the table.

PaddleNLP already includes high-performance inference implementations, supporting:
| Models | Example Models |
|--------|----------------|
|Llama 3.1, Llama 3, Llama 2|`meta-llama/Meta-Llama-3-8B`, `meta-llama/Meta-Llama-3.1-8B`, `meta-llama/Meta-Llama-3.1-405B`, etc.|
Collaborator

Recommend listing all model sizes here to make them easier to find.


- `model_name_or_path`: Required. The pretrained model name or local model path, used to warm-start the model and tokenizer. Defaults to None.

- `dtype`: Required. The dtype of the model parameters. Defaults to None. Must be provided if neither `lora_path` nor `prefix_path` is passed.
Collaborator

If neither `lora_path` nor `prefix_path` is passed, the `dtype` parameter must be provided.


- `batch_size`: Batch size. Defaults to 1. Larger values use more GPU memory; smaller values use less.

- `data_file`: The JSON file to run inference on. Defaults to None.
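As a hedged illustration of what `data_file` might contain, assuming a JSON Lines layout with `src`/`tgt` fields (the exact schema is not shown in this thread):

```shell
# Create a sample data_file in JSON Lines form. The "src"/"tgt" field names
# are an assumption for illustration; check the inference docs for the exact
# schema expected by the predictor.
cat > data.json <<'EOF'
{"src": "Write a haiku about the sea.", "tgt": ""}
{"src": "Explain continuous batching in one sentence.", "tgt": ""}
EOF

# Sanity-check that every line parses as JSON.
python -c "import json; [json.loads(l) for l in open('data.json')]"
```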
Collaborator

Suggest giving the file content format here.


- `output_file`: The file to which inference results are saved. Defaults to output.json.

- `device`: Runtime device. Defaults to gpu. Supported values are `cpu`, `gpu`, `xpu`.
Collaborator

Are there only these three device options? This is inconsistent with the hardware support table above.


- `mode`: Whether to run inference in dynamic graph or static graph mode. Supported values are `dynamic` and `static`; defaults to `dynamic`.

- `avx_model`: Whether to use AvxModel for CPU inference. Defaults to False. See the [CPU inference tutorial]().
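Taken together, the documented flags might compose into a run such as the following sketch. The script path `llm/predict/predictor.py` and the `float16` dtype are assumptions; the model name is one of the example models listed above:

```shell
# Hypothetical dynamic-graph GPU inference run; the script path is assumed,
# while the flags and their defaults are the ones documented above.
python llm/predict/predictor.py \
    --model_name_or_path meta-llama/Meta-Llama-3-8B \
    --dtype float16 \
    --batch_size 2 \
    --data_file data.json \
    --output_file output.json \
    --device gpu \
    --mode dynamic
```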
Collaborator

The link is missing here.

PaddleNLP provides high-performance custom operators for Transformer-series models to speed up inference and decoding. Install the custom operator library before use:

```shell
git clone https://github.com/PaddlePaddle/PaddleNLP
```
Collaborator

The repository is cloned twice here.


```shell
cd PaddleNLP/llm
```
Collaborator

It would be better to link to the best practices or other applications here.

```shell
# Dynamic graph inference
export DEVICES=0,1
python -m paddle.distributed.launch \
```
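The quoted command is truncated; a full multi-GPU dynamic-graph launch might continue roughly as follows. The script path and the flags after `launch` are assumptions, not taken from the diff:

```shell
# Hypothetical continuation of the distributed dynamic-graph launch above.
# Script path and model flags are assumptions; see inference.md for the
# authoritative parameter list.
export DEVICES=0,1
python -m paddle.distributed.launch \
    --gpus "$DEVICES" \
    llm/predict/predictor.py \
    --model_name_or_path meta-llama/Meta-Llama-3-8B \
    --mode dynamic
```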
Collaborator

Some introductory explanation is missing, e.g. which parameters are used and what they mean; a brief note would help.

Contributor Author

The parameter meanings are already covered on the inference.md page, so they are not repeated here.


- Supports multi-hardware large model inference, including [Kunlun XPU](../../xpu/llama/README.md), [Ascend NPU](../../npu/llama/README.md), [Hygon K100](../dcu_install.md), [Enflame GCU](../../gcu/llama/README.md), [X86 CPU](../cpu_install.md), etc.

- Provides server-oriented deployment services supporting continuous batching, streaming output, and more, with multiple client forms over HTTP, RPC, and RESTful
Collaborator

"supports multiple client forms over HTTP, RPC, and RESTful" -> "supports gRPC and HTTP service interfaces"

Contributor Author

"supports multiple client forms over HTTP, RPC, and RESTful" -> "supports gRPC and HTTP service interfaces"

Fixed.


- Transformer-based large models (e.g. Llama, Qwen)

- Mixture-of-Experts large models (e.g. Mixtral)
Collaborator

Lines 25-29 can be removed; item 1 below also covers model support, and the supported model list can be seen there.

Contributor Author

Lines 25-29 can be removed; item 1 below also covers model support, and the supported model list can be seen there.

Deleted.

Collaborator

@DrownFish19 DrownFish19 left a comment

LGTM

@wawltor wawltor merged commit a2f9558 into PaddlePaddle:develop Aug 27, 2024
10 of 12 checks passed
lixcli pushed a commit to lixcli/PaddleNLP that referenced this pull request Aug 28, 2024
* update inference docs

* update

* update

* update

* update

* fix comments

* fix comments

* fix comments

* update inference.md
Mangodadada pushed a commit to Mangodadada/PaddleNLP that referenced this pull request Sep 10, 2024