
Commit 5e1f01f

[Custom Devices] feat(sdaa): support sdaa backend infer (#9570)
1. Add SDAA Python paddlenlp_ops setup and README. 2. Update LLM scripts and README.
1 parent 98a1cdc commit 5e1f01f

File tree

10 files changed: +341 −8 lines changed

csrc/sdaa/README.md

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
# PaddleNLP Custom OPs

This document describes how to compile and install the PaddleNLP SDAA custom OPs.

# 1. Install PaddleCustomDevice

Follow the [PaddleCustomDevice SDAA installation guide](https://github.com/PaddlePaddle/PaddleCustomDevice/blob/develop/backends/sdaa/README_cn.md) to install it.

# 2. Install paddlenlp_ops
```shell
python setup_sdaa.py build bdist_wheel

pip install dist/paddlenlp_ops*.whl
```
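To sanity-check the install, a minimal smoke test like the following sketch (not part of this commit) confirms the wheel landed and the package imports; it assumes PaddleCustomDevice from step 1 is already installed:

```python
# Smoke test for the freshly installed wheel (illustrative, not shipped here).
import importlib.metadata

# The version string comes from setup_sdaa.py ("0.0.0").
print(importlib.metadata.version("paddlenlp_ops"))

# Importing pulls in paddle_sdaa.sdaa_ext; this fails if step 1 was skipped.
import paddlenlp_ops
```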
csrc/sdaa/python/paddlenlp_ops/__init__.py

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from paddle_sdaa.sdaa_ext import *
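The package body is a single wildcard re-export: everything public in paddle_sdaa.sdaa_ext surfaces under the paddlenlp_ops namespace. A hedged sketch of what that enables (op names depend entirely on the installed paddle_sdaa build, so none are guaranteed):

```python
# List whatever the SDAA extension re-exported into paddlenlp_ops.
import paddlenlp_ops

exported = [name for name in dir(paddlenlp_ops) if not name.startswith("_")]
print(exported)  # the custom kernels registered by paddle_sdaa.sdaa_ext
```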

csrc/sdaa/setup_sdaa.py

Lines changed: 59 additions & 0 deletions
@@ -0,0 +1,59 @@
# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os

from setuptools import Distribution, setup

packages = []
package_data = {}


class BinaryDistribution(Distribution):
    def has_ext_modules(self):
        return True


def main():
    setup(
        name="paddlenlp_ops",
        version="0.0.0",
        description="PaddleNLP SDAA CustomOps",
        long_description="",
        long_description_content_type="text/markdown",
        author_email="Paddle-better@baidu.com",
        maintainer="PaddlePaddle",
        maintainer_email="Paddle-better@baidu.com",
        project_urls={},
        license="Apache Software License",
        packages=[
            "paddlenlp_ops",
        ],
        include_package_data=True,
        package_data={
            "": ["*.py"],
        },
        package_dir={
            "": "python",
        },
        zip_safe=False,
        distclass=BinaryDistribution,
        entry_points={"console_scripts": []},
        classifiers=[],
        keywords="PaddleNLP SDAA CustomOps",
    )


if __name__ == "__main__":
    main()
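A note on the BinaryDistribution subclass: setuptools tags a wheel as pure Python (py3-none-any) unless the distribution reports extension modules, so overriding has_ext_modules() forces a platform-specific wheel for the prebuilt SDAA binaries. A small sketch (not shipped in this commit; the filename check is an assumption) to verify that after `python setup_sdaa.py build bdist_wheel`:

```python
# Check that the built wheel is platform-tagged rather than pure Python.
# Run from csrc/sdaa after building; filename tags are illustrative.
import glob

for wheel in glob.glob("dist/paddlenlp_ops*.whl"):
    assert "none-any" not in wheel, "expected a platform-tagged wheel"
    print(wheel)
```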

docs/llm/sdaa/llama/README.md

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
../../../../llm/sdaa/llama/README.md

llm/docs/predict/inference.md

Lines changed: 8 additions & 7 deletions
@@ -39,13 +39,13 @@ PaddleNLP 中已经添加高性能推理模型相关实现,已验证过的模

PaddleNLP supports multiple hardware platforms and precisions, including:

-| Precision | Hopper | Ada | Ampere | Turing | Volta | Kunlun XPU | Ascend NPU | Hygon K100 | Enflame GCU | x86 CPU |
-|:---------:|:------:|:---:|:------:|:------:|:-----:|:----------:|:----------:|:----------:|:-----------:|:-------:|
-| FP32 |  |  |  |  |  |  |  |  |  |  |
-| FP16 |  |  |  |  |  |  |  |  |  |  |
-| BF16 |  |  |  |  |  |  |  |  |  |  |
-| INT8 |  |  |  |  |  |  |  |  |  |  |
-| FP8 | 🚧 |  |  |  |  |  |  |  |  |  |
+| Precision | Hopper | Ada | Ampere | Turing | Volta | Kunlun XPU | Ascend NPU | Hygon K100 | Enflame GCU | Tecorigin SDAA | x86 CPU |
+|:---------:|:------:|:---:|:------:|:------:|:-----:|:----------:|:----------:|:----------:|:-----------:|:--------------:|:-------:|
+| FP32 |  |  |  |  |  |  |  |  |  |  |  |
+| FP16 |  |  |  |  |  |  |  |  |  |  |  |
+| BF16 |  |  |  |  |  |  |  |  |  |  |  |
+| INT8 |  |  |  |  |  |  |  |  |  |  |  |
+| FP8 | 🚧 |  |  |  |  |  |  |  |  |  |  |

## 3. Inference parameters

@@ -196,6 +196,7 @@ python ./predict/predictor.py --model_name_or_path meta-llama/Llama-2-7b-chat --

 - [Ascend NPU](../../npu/llama/README.md)
 - [Hygon K100](../dcu_install.md)
 - [Enflame GCU](../../gcu/llama/README.md)
+- [Tecorigin SDAA](../../sdaa/llama/README.md)
 - [X86 CPU](../cpu_install.md)

## Acknowledgements

llm/docs/predict/installation.md

Lines changed: 3 additions & 1 deletion
@@ -16,6 +16,8 @@ cd PaddleNLP/csrc && python setup_cuda.py install
 cd PaddleNLP/csrc/xpu/src && sh cmake_build.sh
 # Install the custom ops for DCU devices
 cd PaddleNLP/csrc && python setup_hip.py install
+# Install the custom ops for SDAA devices
+cd PaddleNLP/csrc/sdaa && python setup_sdaa.py install
 ```

Change to the run directory and you can get started:

@@ -32,4 +34,4 @@ cd PaddleNLP/llm

To get the best inference performance:

-- [Best practices](./best_practices.md)
+- [Best practices](./best_practices.md)
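Since each backend ships its own custom-op setup script, a hedged sketch for checking which backend the local PaddlePaddle build exposes before choosing a script (the "sdaa" device string is an assumption based on the commands above):

```python
# Report visible backends to decide which custom-op setup script applies.
import paddle

if paddle.device.is_compiled_with_cuda():
    print("CUDA build -> cd PaddleNLP/csrc && python setup_cuda.py install")

# Custom devices (e.g. SDAA) are registered by plugin packages such as paddle_sdaa.
custom_types = paddle.device.get_all_custom_device_type() or []
if "sdaa" in custom_types:
    print("SDAA build -> cd PaddleNLP/csrc/sdaa && python setup_sdaa.py install")
```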

llm/sdaa/llama/README.md

Lines changed: 187 additions & 0 deletions
@@ -0,0 +1,187 @@
## 🚣‍♂️ Running the Llama-2-13b-chat model with PaddleNLP on Tecorigin SDAA 🚣

PaddleNLP has deeply adapted and optimized the Llama-2-13b-chat model for Tecorigin SDAA, unifying the sdaa-device inference entry point with the GPU one: migrating an inference job only requires changing the device.
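In practice that unification means a GPU run migrates by swapping the device string; a minimal sketch, assuming the paddle_sdaa plugin is installed and registers the device name "sdaa":

```python
# Minimal device swap: the same PaddleNLP inference code path, targeting sdaa.
import paddle

paddle.set_device("sdaa")  # was: paddle.set_device("gpu")
print(paddle.device.get_device())
```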
## 🚀 Quick start 🚀

### 0. Machine setup. Before getting started, you need a machine fitted with a Tecorigin T100 accelerator card, with the following requirements:

| Chip type | Driver version |
| --- | --- |
| Tecorigin T100 | 1.3.0 |


### 1. Environment setup (this will take you 5–15 minutes)

#### 1.1 Pull the image
```bash
# Note: this image bundles the precompiled PaddlePaddle package, TecoDriver, TecoToolKit, etc., so PaddleNLP models can run out of the box
wget http://mirrors.tecorigin.com/repository/teco-3rd-repo/custom_device/ubuntu22.04/x86_64/1.3.0/paddle_sdaa_1.3.0_llm_infer.tar
docker load < paddle_sdaa_1.3.0_llm_infer.tar
```

#### 1.2 Start the container with a command like the following
```bash
docker run -itd --name="paddle-sdaa-dev" --net=host --privileged --cap-add SYS_PTRACE --cap-add SYS_ADMIN --shm-size 128g jfrog.tecorigin.net/tecotp-docker/release/ubuntu22.04/x86_64/paddle_sdaa:1.3.0-llm-infer /bin/bash
```

#### 1.3 Clone the PaddleNLP repository and install dependencies
```bash
# PaddleNLP is a natural language processing and large language model (LLM) development library built on PaddlePaddle, hosting a range of large models implemented on the framework, including Llama-2-13b-chat. To make full use of PaddleNLP, you need to clone the whole repository.
git clone https://github.com/PaddlePaddle/PaddleNLP.git
cd PaddleNLP
export PYTHONPATH=/path/to/PaddleNLP:$PYTHONPATH
pip install -r requirements.txt
cd csrc/sdaa && python setup_sdaa.py install && cd ../../llm/sdaa/llama
```
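Before running inference, it can help to confirm that the container's PaddlePaddle build actually registers the sdaa backend; a hedged sketch (API names as in recent Paddle releases):

```python
# Confirm the sdaa custom device is registered and enumerate its instances.
import paddle

print(paddle.device.get_all_custom_device_type())   # expect: ['sdaa']
print(paddle.device.get_available_custom_device())  # e.g. ['sdaa:0', 'sdaa:1', ...]
```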
### 2. Inference (this will take you 15–30 minutes)
#### 2.1 Dynamic-graph distributed inference

Run the following command to start inference:
```bash
bash dynamic_infer_llama_sdaa.sh
```
The first run downloads the weights automatically; you can use the auto-downloaded weights, or download them separately and specify the weight path. Once the run succeeds, you can see the generated inference results.

The sample saves the downloaded meta-llama/Llama-2-13b-chat weights folder under /workspace/weights; example output:
```
[2024-12-10 15:42:51,992] [    INFO] - set state for layer 30
[2024-12-10 15:42:53,666] [    INFO] - set state for layer 31
[2024-12-10 15:42:55,202] [    INFO] - set state for layer 32
[2024-12-10 15:42:56,724] [    INFO] - set state for layer 33
[2024-12-10 15:42:58,314] [    INFO] - set state for layer 34
[2024-12-10 15:43:00,041] [    INFO] - set state for layer 35
[2024-12-10 15:43:01,515] [    INFO] - set state for layer 36
[2024-12-10 15:43:03,034] [    INFO] - set state for layer 37
[2024-12-10 15:43:04,746] [    INFO] - set state for layer 38
[2024-12-10 15:43:06,390] [    INFO] - set state for layer 39
[2024-12-10 15:43:08,682] [    INFO] - We are using <class 'paddlenlp.transformers.llama.configuration.LlamaConfig'> to load '/workspace/weights/meta-llama/Llama-2-13b-chat'.
[2024-12-10 15:43:08,682] [    INFO] - Loading configuration file /workspace/weights/meta-llama/Llama-2-13b-chat/config.json
[2024-12-10 15:43:08,683] [    INFO] - Loading configuration file /workspace/weights/meta-llama/Llama-2-13b-chat/generation_config.json
[2024-12-10 15:43:08,752] [    INFO] - Start predict
[2024-12-10 15:43:08,789] [    INFO] - We are using <class 'paddlenlp.transformers.llama.tokenizer.LlamaTokenizer'> to load '/workspace/weights/meta-llama/Llama-2-13b-chat'.
[2024-12-10 15:43:08,806] [    INFO] - Start read result message
[2024-12-10 15:43:08,806] [    INFO] - Current path is /workspace/paddlenlp/llm
[2024-12-10 15:43:29,178] [    INFO] - running spend 20.372194528579712
[2024-12-10 15:43:29,187] [    INFO] - Finish read result message
[2024-12-10 15:43:29,192] [    INFO] - End predict
***********Source**********
解释一下温故而知新
***********Target**********

***********Output**********
"温故而知新" (wēn gù er zhī xīn) is a Chinese idiom that means "to understand the old in order to know the new." It is often used to convey the idea that one must have a deep understanding of the past and traditional ways of doing things in order to truly appreciate and understand new ideas and innovations.

The phrase is often used in the context of education, where students are encouraged to study the classics and learn from the past in order to gain a solid foundation for understanding new concepts and ideas. It is also used in business and technology, where companies may look to the past for inspiration and guidance as they develop new products and services.

In essence, "温故而知新" suggests that one cannot truly understand the new without first understanding the old, and that a deep appreciation for the past is essential for making progress and innovation.
```
#### 2.2 Static-graph distributed inference

##### 2.2.1 Static-graph export

Run the following command to export the static graph, in preparation for static-graph distributed inference:
```bash
bash static_export_llama_sdaa.sh
```
Once the run succeeds, you can inspect the model export results; sample output:
```bash
[2024-12-10 15:30:28,991] [    INFO] - set state for layer 24
[2024-12-10 15:30:30,246] [    INFO] - set state for layer 25
[2024-12-10 15:30:31,586] [    INFO] - set state for layer 26
[2024-12-10 15:30:32,892] [    INFO] - set state for layer 27
[2024-12-10 15:30:34,228] [    INFO] - set state for layer 28
[2024-12-10 15:30:35,530] [    INFO] - set state for layer 29
[2024-12-10 15:30:36,925] [    INFO] - set state for layer 30
[2024-12-10 15:30:38,233] [    INFO] - set state for layer 31
[2024-12-10 15:30:39,635] [    INFO] - set state for layer 32
[2024-12-10 15:30:40,992] [    INFO] - set state for layer 33
[2024-12-10 15:30:42,375] [    INFO] - set state for layer 34
[2024-12-10 15:30:43,717] [    INFO] - set state for layer 35
[2024-12-10 15:30:45,076] [    INFO] - set state for layer 36
[2024-12-10 15:30:46,423] [    INFO] - set state for layer 37
[2024-12-10 15:30:47,827] [    INFO] - set state for layer 38
[2024-12-10 15:30:49,216] [    INFO] - set state for layer 39
[2024-12-10 15:30:51,136] [    INFO] - We are using <class 'paddlenlp.transformers.llama.configuration.LlamaConfig'> to load '/workspace/weights/meta-llama/Llama-2-13b-chat'.
[2024-12-10 15:30:51,136] [    INFO] - Loading configuration file /workspace/weights/meta-llama/Llama-2-13b-chat/config.json
[2024-12-10 15:30:51,137] [    INFO] - Loading configuration file /workspace/weights/meta-llama/Llama-2-13b-chat/generation_config.json
/root/miniconda3/envs/paddle_env/lib/python3.10/site-packages/paddle/jit/dy2static/program_translator.py:747: UserWarning: full_graph=False don't support input_spec arguments. It will not produce any effect.
You can set full_graph=True, then you can assign input spec.

  warnings.warn(
/root/miniconda3/envs/paddle_env/lib/python3.10/site-packages/paddle/jit/api.py:1106: UserWarning: What you save is a function, and `jit.save` will generate the name of the model file according to `path` you specify. When loading these files with `jit.load`, you get a `TranslatedLayer` whose inference result is the same as the inference result of the function you saved.
  warnings.warn(
I1210 15:30:58.707722 1174678 program_interpreter.cc:242] New Executor is Running.
[2024-12-10 15:31:10,381] [    INFO] - Configuration saved in ./output_dir/exported_model/llama2_13b_chat_wint8_block_size32/config.json
[2024-12-10 15:31:10,382] [    INFO] - Configuration saved in ./output_dir/exported_model/llama2_13b_chat_wint8_block_size32/generation_config.json
[2024-12-10 15:31:10,382] [    INFO] - tokenizer config file saved in ./output_dir/exported_model/llama2_13b_chat_wint8_block_size32/tokenizer_config.json
[2024-12-10 15:31:10,382] [    INFO] - Special tokens file saved in ./output_dir/exported_model/llama2_13b_chat_wint8_block_size32/special_tokens_map.json
[2024-12-10 15:31:10,383] [    INFO] - Chat-template config file saved in ./output_dir/exported_model/llama2_13b_chat_wint8_block_size32/chat_template.json
LAUNCH INFO 2024-12-10 15:31:12,346 Pod completed
LAUNCH INFO 2024-12-10 15:31:12,347 Exit code 0
```
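Before moving on, you can check the export artifacts on disk; a small sketch using the output directory from the log above:

```python
# List the exported static-graph artifacts (directory name from the log above).
import os

export_dir = "./output_dir/exported_model/llama2_13b_chat_wint8_block_size32"
for name in sorted(os.listdir(export_dir)):
    print(name)
```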
##### 2.2.2 Static-graph distributed inference

Run the following command for static-graph distributed inference:
```bash
bash static_infer_llama_sdaa.sh
```
Once the run succeeds, you can see the generated inference results; sample output:
```bash
[2024-12-10 15:36:24,150] [    INFO] topology.py:370 - Total 4 data comm group(s) create successfully!
[2024-12-10 15:36:24,150] [    INFO] topology.py:370 - Total 1 model comm group(s) create successfully!
[2024-12-10 15:36:24,150] [    INFO] topology.py:370 - Total 4 sharding comm group(s) create successfully!
[2024-12-10 15:36:24,150] [    INFO] topology.py:290 - HybridParallelInfo: rank_id: 0, mp_degree: 4, sharding_degree: 1, pp_degree: 1, dp_degree: 1, sep_degree: 1, mp_group: [0, 1, 2, 3], sharding_group: [0], pp_group: [0], dp_group: [0], sep:group: None, check/clip group: [0, 1, 2, 3]
[2024-12-10 15:36:24,152] [    INFO] - We are using <class 'paddlenlp.transformers.llama.tokenizer.LlamaTokenizer'> to load 'output_dir/exported_model/llama2_13b_chat_wint8_block_size32'.
[2024-12-10 15:36:24,164] [    INFO] - We are using <class 'paddlenlp.transformers.llama.configuration.LlamaConfig'> to load 'output_dir/exported_model/llama2_13b_chat_wint8_block_size32'.
[2024-12-10 15:36:24,164] [    INFO] - Loading configuration file output_dir/exported_model/llama2_13b_chat_wint8_block_size32/config.json
[2024-12-10 15:36:24,165] [    INFO] - We are using <class 'paddlenlp.transformers.llama.configuration.LlamaConfig'> to load 'output_dir/exported_model/llama2_13b_chat_wint8_block_size32'.
[2024-12-10 15:36:24,165] [    INFO] - Loading configuration file output_dir/exported_model/llama2_13b_chat_wint8_block_size32/config.json
[2024-12-10 15:36:24,198] [    INFO] - We are using <class 'paddlenlp.transformers.llama.configuration.LlamaConfig'> to load 'output_dir/exported_model/llama2_13b_chat_wint8_block_size32'.
[2024-12-10 15:36:24,198] [    INFO] - Loading configuration file output_dir/exported_model/llama2_13b_chat_wint8_block_size32/config.json
[2024-12-10 15:36:24,199] [    INFO] - Loading configuration file output_dir/exported_model/llama2_13b_chat_wint8_block_size32/generation_config.json
I1210 15:36:24.239424 1334951 analysis_predictor.cc:2142] MKLDNN is enabled
I1210 15:36:24.239473 1334951 analysis_predictor.cc:2167] CustomDevice is enabled
I1210 15:36:24.239486 1334951 analysis_predictor.cc:2210] Model is mixed precision type with float16, we will use a new PassStrategy. Note that only GPU/XPU backend is supported for now.
I1210 15:36:24.239490 1334951 analysis_predictor.cc:2259] Ir optimization is turned off, no ir pass will be executed.
--- Running analysis [ir_graph_build_pass]
I1210 15:36:24.260483 1334951 executor.cc:183] Old Executor is Running.
--- Running analysis [ir_analysis_pass]
--- Running analysis [ir_params_sync_among_devices_pass]
I1210 15:36:25.863914 1334951 ir_params_sync_among_devices_pass.cc:140] Sync params from CPU to sdaa:0
--- Running analysis [adjust_cudnn_workspace_size_pass]
--- Running analysis [inference_op_replace_pass]
--- Running analysis [save_optimized_model_pass]
--- Running analysis [ir_graph_to_program_pass]
I1210 15:36:29.991195 1334951 analysis_predictor.cc:2348] ======= ir optimization completed =======
I1210 15:36:30.000306 1334951 gen_comm_id_helper.cc:212] Server listening on: 127.0.1.1:36942 successful.
I1210 15:36:30.088883 1334951 task_node.cc:43] Constructing TaskNode for DistModelInf. The TaskNode's id is: 0. And the TaskNode's max_run_time and max_slot_num will be set to 1.
LAUNCH INFO 2024-12-10 15:37:24,254 Pod completed
LAUNCH INFO 2024-12-10 15:37:24,254 Exit code 0
I1210 15:36:30.189157 1334951 server.cpp:1107] Server[paddle::distributed::MessageServiceImpl] is serving on port=36942.
I1210 15:36:30.189195 1334951 server.cpp:1110] Check out http://dmx-19:36942 in web browser.
I1210 15:36:30.189320 1334951 message_bus.cc:201] Message bus's listen port thread starts successful.
[2024-12-10 15:36:31,284] [    INFO] - Start predict
[2024-12-10 15:36:31,296] [    INFO] - preprocess spend 0.010512113571166992
[2024-12-10 15:36:31,355] [    INFO] - We are using <class 'paddlenlp.transformers.llama.tokenizer.LlamaTokenizer'> to load 'output_dir/exported_model/llama2_13b_chat_wint8_block_size32'.
[2024-12-10 15:36:31,378] [    INFO] - Start read result message
[2024-12-10 15:36:31,378] [    INFO] - Current path is /workspace/paddlenlp/llm
[2024-12-10 15:37:22,118] [    INFO] - running spend 50.736462116241455
[2024-12-10 15:37:22,125] [    INFO] - Finish read result message
[2024-12-10 15:37:22,132] [    INFO] - End predict
***********Source**********
解释一下温故而知新
***********Target**********

***********Output**********
"温故而知新" (wēn gù er zhī xīn) is a Chinese idiom that means "to know the old in order to discern the new." It is often used to describe the idea that one can gain a deeper understanding of something new by studying and appreciating the past.

The word "温" (wēn) in this idiom means "old" or "past," and "故" (gù) means "olden days" or "former times." The word "知" (zhī) means "to know" or "to understand," and "新" (xīn) means "new."

The idiom "温故而知新" suggests that by studying and understanding the past, one can gain a deeper appreciation for the present and make more informed decisions about the future. It is often used in the context of learning from history, understanding cultural traditions, and appreciating the value of experience and wisdom.

For example, if someone is trying a new type of food for the first time, they might say "I need to study the old recipes to know the new flavors" (我需要学习古老的菜谱,才能了解新的味道). This means that by understanding the traditional methods and ingredients used in the past, they can better appreciate the new dish and its unique qualities.

Overall, "温故而知新" is a reminder that understanding the past can help us navigate the present and make more informed decisions about the future.
I1210 15:37:22.926474 1334951 server.cpp:1167] Server[paddle::distributed::MessageServiceImpl] is going to quit
```
