
Commit 5e1f01f

[Custom Devices] feat(sdaa): support sdaa backend infer (#9570)
1. Add SDAA Python paddlenlp_ops setup and README. 2. Update LLM scripts and README.
1 parent 98a1cdc commit 5e1f01f

File tree

10 files changed: +341 −8 lines changed

csrc/sdaa/README.md

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
# PaddleNLP Custom OPs

This document describes how to compile and install the PaddleNLP SDAA custom OPs.

# 1. Install PaddleCustomDevice

Follow the [PaddleCustomDevice SDAA installation guide](https://github.com/PaddlePaddle/PaddleCustomDevice/blob/develop/backends/sdaa/README_cn.md) to install it.

# 2. Install paddlenlp_ops
```shell
python setup_sdaa.py build bdist_wheel

pip install dist/paddlenlp_ops*.whl
```
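To sanity-check the install, a minimal smoke test like the following sketch (not part of this commit) confirms the wheel landed and the package imports; it assumes PaddleCustomDevice from step 1 is already installed:

```python
# Smoke test for the freshly installed wheel (illustrative, not shipped here).
import importlib.metadata

# The version string comes from setup_sdaa.py ("0.0.0").
print(importlib.metadata.version("paddlenlp_ops"))

# Importing pulls in paddle_sdaa.sdaa_ext; this fails if step 1 was skipped.
import paddlenlp_ops
```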
csrc/sdaa/python/paddlenlp_ops/__init__.py

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from paddle_sdaa.sdaa_ext import *
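The package body is a single wildcard re-export: everything public in paddle_sdaa.sdaa_ext surfaces under the paddlenlp_ops namespace. A hedged sketch of what that enables (op names depend entirely on the installed paddle_sdaa build, so none are guaranteed):

```python
# List whatever the SDAA extension re-exported into paddlenlp_ops.
import paddlenlp_ops

exported = [name for name in dir(paddlenlp_ops) if not name.startswith("_")]
print(exported)  # the custom kernels registered by paddle_sdaa.sdaa_ext
```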

csrc/sdaa/setup_sdaa.py

Lines changed: 59 additions & 0 deletions
@@ -0,0 +1,59 @@
# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os

from setuptools import Distribution, setup

packages = []
package_data = {}


class BinaryDistribution(Distribution):
    def has_ext_modules(self):
        return True


def main():
    setup(
        name="paddlenlp_ops",
        version="0.0.0",
        description="PaddleNLP SDAA CustomOps",
        long_description="",
        long_description_content_type="text/markdown",
        author_email="Paddle-better@baidu.com",
        maintainer="PaddlePaddle",
        maintainer_email="Paddle-better@baidu.com",
        project_urls={},
        license="Apache Software License",
        packages=[
            "paddlenlp_ops",
        ],
        include_package_data=True,
        package_data={
            "": ["*.py"],
        },
        package_dir={
            "": "python",
        },
        zip_safe=False,
        distclass=BinaryDistribution,
        entry_points={"console_scripts": []},
        classifiers=[],
        keywords="PaddleNLP SDAA CustomOps",
    )


if __name__ == "__main__":
    main()
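A note on the BinaryDistribution subclass: setuptools tags a wheel as pure Python (py3-none-any) unless the distribution reports extension modules, so overriding has_ext_modules() forces a platform-specific wheel for the prebuilt SDAA binaries. A small sketch (not shipped in this commit; the filename check is an assumption) to verify that after `python setup_sdaa.py build bdist_wheel`:

```python
# Check that the built wheel is platform-tagged rather than pure Python.
# Run from csrc/sdaa after building; filename tags are illustrative.
import glob

for wheel in glob.glob("dist/paddlenlp_ops*.whl"):
    assert "none-any" not in wheel, "expected a platform-tagged wheel"
    print(wheel)
```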

docs/llm/sdaa/llama/README.md

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
../../../../llm/sdaa/llama/README.md

llm/docs/predict/inference.md

Lines changed: 8 additions & 7 deletions
@@ -39,13 +39,13 @@ PaddleNLP 中已经添加高性能推理模型相关实现,已验证过的模

PaddleNLP supports multiple hardware platforms and precisions, including:

-| Precision | Hopper | Ada | Ampere | Turing | Volta | Kunlun XPU | Ascend NPU | Hygon K100 | Enflame GCU | x86 CPU |
-|:---------:|:------:|:---:|:------:|:------:|:-----:|:----------:|:----------:|:----------:|:-----------:|:-------:|
-| FP32 |  |  |  |  |  |  |  |  |  |  |
-| FP16 |  |  |  |  |  |  |  |  |  |  |
-| BF16 |  |  |  |  |  |  |  |  |  |  |
-| INT8 |  |  |  |  |  |  |  |  |  |  |
-| FP8 | 🚧 |  |  |  |  |  |  |  |  |  |
+| Precision | Hopper | Ada | Ampere | Turing | Volta | Kunlun XPU | Ascend NPU | Hygon K100 | Enflame GCU | Tecorigin SDAA | x86 CPU |
+|:---------:|:------:|:---:|:------:|:------:|:-----:|:----------:|:----------:|:----------:|:-----------:|:--------------:|:-------:|
+| FP32 |  |  |  |  |  |  |  |  |  |  |  |
+| FP16 |  |  |  |  |  |  |  |  |  |  |  |
+| BF16 |  |  |  |  |  |  |  |  |  |  |  |
+| INT8 |  |  |  |  |  |  |  |  |  |  |  |
+| FP8 | 🚧 |  |  |  |  |  |  |  |  |  |  |

## 3. Inference parameters

@@ -196,6 +196,7 @@ python ./predict/predictor.py --model_name_or_path meta-llama/Llama-2-7b-chat --

 - [Ascend NPU](../../npu/llama/README.md)
 - [Hygon K100](../dcu_install.md)
 - [Enflame GCU](../../gcu/llama/README.md)
+- [Tecorigin SDAA](../../sdaa/llama/README.md)
 - [X86 CPU](../cpu_install.md)

## Acknowledgements

llm/docs/predict/installation.md

Lines changed: 3 additions & 1 deletion
@@ -16,6 +16,8 @@ cd PaddleNLP/csrc && python setup_cuda.py install
 cd PaddleNLP/csrc/xpu/src && sh cmake_build.sh
 # Install the custom ops for DCU devices
 cd PaddleNLP/csrc && python setup_hip.py install
+# Install the custom ops for SDAA devices
+cd PaddleNLP/csrc/sdaa && python setup_sdaa.py install
 ```

Change to the run directory and you can get started:

@@ -32,4 +34,4 @@ cd PaddleNLP/llm

To get the best inference performance:

-- [Best practices](./best_practices.md)
+- [Best practices](./best_practices.md)
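Since each backend ships its own custom-op setup script, a hedged sketch for checking which backend the local PaddlePaddle build exposes before choosing a script (the "sdaa" device string is an assumption based on the commands above):

```python
# Report visible backends to decide which custom-op setup script applies.
import paddle

if paddle.device.is_compiled_with_cuda():
    print("CUDA build -> cd PaddleNLP/csrc && python setup_cuda.py install")

# Custom devices (e.g. SDAA) are registered by plugin packages such as paddle_sdaa.
custom_types = paddle.device.get_all_custom_device_type() or []
if "sdaa" in custom_types:
    print("SDAA build -> cd PaddleNLP/csrc/sdaa && python setup_sdaa.py install")
```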

llm/sdaa/llama/README.md

Lines changed: 187 additions & 0 deletions
@@ -0,0 +1,187 @@
## 🚣‍♂️ Running the Llama-2-13b-chat model with PaddleNLP on Tecorigin SDAA 🚣

PaddleNLP has deeply adapted and optimized the Llama-2-13b-chat model for Tecorigin SDAA, unifying the sdaa-device inference entry point with the GPU one: migrating an inference job only requires changing the device.
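In practice that unification means a GPU run migrates by swapping the device string; a minimal sketch, assuming the paddle_sdaa plugin is installed and registers the device name "sdaa":

```python
# Minimal device swap: the same PaddleNLP inference code path, targeting sdaa.
import paddle

paddle.set_device("sdaa")  # was: paddle.set_device("gpu")
print(paddle.device.get_device())
```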
## 🚀 Quick start 🚀

### 0. Machine setup. Before getting started, you need a machine fitted with a Tecorigin T100 accelerator card, with the following requirements:

| Chip type | Driver version |
| --- | --- |
| Tecorigin T100 | 1.3.0 |


### 1. Environment setup (this will take you 5–15 minutes)

#### 1.1 Pull the image
```bash
# Note: this image bundles the precompiled PaddlePaddle package, TecoDriver, TecoToolKit, etc., so PaddleNLP models can run out of the box
wget http://mirrors.tecorigin.com/repository/teco-3rd-repo/custom_device/ubuntu22.04/x86_64/1.3.0/paddle_sdaa_1.3.0_llm_infer.tar
docker load < paddle_sdaa_1.3.0_llm_infer.tar
```

#### 1.2 Start the container with a command like the following
```bash
docker run -itd --name="paddle-sdaa-dev" --net=host --privileged --cap-add SYS_PTRACE --cap-add SYS_ADMIN --shm-size 128g jfrog.tecorigin.net/tecotp-docker/release/ubuntu22.04/x86_64/paddle_sdaa:1.3.0-llm-infer /bin/bash
```

#### 1.3 Clone the PaddleNLP repository and install dependencies
```bash
# PaddleNLP is a natural language processing and large language model (LLM) development library built on PaddlePaddle, hosting a range of large models implemented on the framework, including Llama-2-13b-chat. To make full use of PaddleNLP, you need to clone the whole repository.
git clone https://github.com/PaddlePaddle/PaddleNLP.git
cd PaddleNLP
export PYTHONPATH=/path/to/PaddleNLP:$PYTHONPATH
pip install -r requirements.txt
cd csrc/sdaa && python setup_sdaa.py install && cd ../../llm/sdaa/llama
```
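Before running inference, it can help to confirm that the container's PaddlePaddle build actually registers the sdaa backend; a hedged sketch (API names as in recent Paddle releases):

```python
# Confirm the sdaa custom device is registered and enumerate its instances.
import paddle

print(paddle.device.get_all_custom_device_type())   # expect: ['sdaa']
print(paddle.device.get_available_custom_device())  # e.g. ['sdaa:0', 'sdaa:1', ...]
```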
### 2. Inference (this will take you 15–30 minutes)
#### 2.1 Dynamic-graph distributed inference

Run the following command to start inference:
```bash
bash dynamic_infer_llama_sdaa.sh
```
The first run downloads the weights automatically; you can use the auto-downloaded weights, or download them separately and specify the weight path. Once the run succeeds, you can see the generated inference results.

The sample saves the downloaded meta-llama/Llama-2-13b-chat weights folder under /workspace/weights; example output:
```
[2024-12-10 15:42:51,992] [    INFO] - set state for layer 30
[2024-12-10 15:42:53,666] [    INFO] - set state for layer 31
[2024-12-10 15:42:55,202] [    INFO] - set state for layer 32
[2024-12-10 15:42:56,724] [    INFO] - set state for layer 33
[2024-12-10 15:42:58,314] [    INFO] - set state for layer 34
[2024-12-10 15:43:00,041] [    INFO] - set state for layer 35
[2024-12-10 15:43:01,515] [    INFO] - set state for layer 36
[2024-12-10 15:43:03,034] [    INFO] - set state for layer 37
[2024-12-10 15:43:04,746] [    INFO] - set state for layer 38
[2024-12-10 15:43:06,390] [    INFO] - set state for layer 39
[2024-12-10 15:43:08,682] [    INFO] - We are using <class 'paddlenlp.transformers.llama.configuration.LlamaConfig'> to load '/workspace/weights/meta-llama/Llama-2-13b-chat'.
[2024-12-10 15:43:08,682] [    INFO] - Loading configuration file /workspace/weights/meta-llama/Llama-2-13b-chat/config.json
[2024-12-10 15:43:08,683] [    INFO] - Loading configuration file /workspace/weights/meta-llama/Llama-2-13b-chat/generation_config.json
[2024-12-10 15:43:08,752] [    INFO] - Start predict
[2024-12-10 15:43:08,789] [    INFO] - We are using <class 'paddlenlp.transformers.llama.tokenizer.LlamaTokenizer'> to load '/workspace/weights/meta-llama/Llama-2-13b-chat'.
[2024-12-10 15:43:08,806] [    INFO] - Start read result message
[2024-12-10 15:43:08,806] [    INFO] - Current path is /workspace/paddlenlp/llm
[2024-12-10 15:43:29,178] [    INFO] - running spend 20.372194528579712
[2024-12-10 15:43:29,187] [    INFO] - Finish read result message
[2024-12-10 15:43:29,192] [    INFO] - End predict
***********Source**********
解释一下温故而知新
***********Target**********

***********Output**********
"温故而知新" (wēn gù er zhī xīn) is a Chinese idiom that means "to understand the old in order to know the new." It is often used to convey the idea that one must have a deep understanding of the past and traditional ways of doing things in order to truly appreciate and understand new ideas and innovations.

The phrase is often used in the context of education, where students are encouraged to study the classics and learn from the past in order to gain a solid foundation for understanding new concepts and ideas. It is also used in business and technology, where companies may look to the past for inspiration and guidance as they develop new products and services.

In essence, "温故而知新" suggests that one cannot truly understand the new without first understanding the old, and that a deep appreciation for the past is essential for making progress and innovation.
```
#### 2.2 Static-graph distributed inference

##### 2.2.1 Static-graph export

Run the following command to export the static graph, in preparation for static-graph distributed inference:
```bash
bash static_export_llama_sdaa.sh
```
Once the run succeeds, you can inspect the model export results; sample output:
```bash
[2024-12-10 15:30:28,991] [    INFO] - set state for layer 24
[2024-12-10 15:30:30,246] [    INFO] - set state for layer 25
[2024-12-10 15:30:31,586] [    INFO] - set state for layer 26
[2024-12-10 15:30:32,892] [    INFO] - set state for layer 27
[2024-12-10 15:30:34,228] [    INFO] - set state for layer 28
[2024-12-10 15:30:35,530] [    INFO] - set state for layer 29
[2024-12-10 15:30:36,925] [    INFO] - set state for layer 30
[2024-12-10 15:30:38,233] [    INFO] - set state for layer 31
[2024-12-10 15:30:39,635] [    INFO] - set state for layer 32
[2024-12-10 15:30:40,992] [    INFO] - set state for layer 33
[2024-12-10 15:30:42,375] [    INFO] - set state for layer 34
[2024-12-10 15:30:43,717] [    INFO] - set state for layer 35
[2024-12-10 15:30:45,076] [    INFO] - set state for layer 36
[2024-12-10 15:30:46,423] [    INFO] - set state for layer 37
[2024-12-10 15:30:47,827] [    INFO] - set state for layer 38
[2024-12-10 15:30:49,216] [    INFO] - set state for layer 39
[2024-12-10 15:30:51,136] [    INFO] - We are using <class 'paddlenlp.transformers.llama.configuration.LlamaConfig'> to load '/workspace/weights/meta-llama/Llama-2-13b-chat'.
[2024-12-10 15:30:51,136] [    INFO] - Loading configuration file /workspace/weights/meta-llama/Llama-2-13b-chat/config.json
[2024-12-10 15:30:51,137] [    INFO] - Loading configuration file /workspace/weights/meta-llama/Llama-2-13b-chat/generation_config.json
/root/miniconda3/envs/paddle_env/lib/python3.10/site-packages/paddle/jit/dy2static/program_translator.py:747: UserWarning: full_graph=False don't support input_spec arguments. It will not produce any effect.
You can set full_graph=True, then you can assign input spec.

  warnings.warn(
/root/miniconda3/envs/paddle_env/lib/python3.10/site-packages/paddle/jit/api.py:1106: UserWarning: What you save is a function, and `jit.save` will generate the name of the model file according to `path` you specify. When loading these files with `jit.load`, you get a `TranslatedLayer` whose inference result is the same as the inference result of the function you saved.
  warnings.warn(
I1210 15:30:58.707722 1174678 program_interpreter.cc:242] New Executor is Running.
[2024-12-10 15:31:10,381] [    INFO] - Configuration saved in ./output_dir/exported_model/llama2_13b_chat_wint8_block_size32/config.json
[2024-12-10 15:31:10,382] [    INFO] - Configuration saved in ./output_dir/exported_model/llama2_13b_chat_wint8_block_size32/generation_config.json
[2024-12-10 15:31:10,382] [    INFO] - tokenizer config file saved in ./output_dir/exported_model/llama2_13b_chat_wint8_block_size32/tokenizer_config.json
[2024-12-10 15:31:10,382] [    INFO] - Special tokens file saved in ./output_dir/exported_model/llama2_13b_chat_wint8_block_size32/special_tokens_map.json
[2024-12-10 15:31:10,383] [    INFO] - Chat-template config file saved in ./output_dir/exported_model/llama2_13b_chat_wint8_block_size32/chat_template.json
LAUNCH INFO 2024-12-10 15:31:12,346 Pod completed
LAUNCH INFO 2024-12-10 15:31:12,347 Exit code 0
```
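Before moving on, you can check the export artifacts on disk; a small sketch using the output directory from the log above:

```python
# List the exported static-graph artifacts (directory name from the log above).
import os

export_dir = "./output_dir/exported_model/llama2_13b_chat_wint8_block_size32"
for name in sorted(os.listdir(export_dir)):
    print(name)
```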
##### 2.2.2 Static-graph distributed inference

Run the following command for static-graph distributed inference:
```bash
bash static_infer_llama_sdaa.sh
```
Once the run succeeds, you can see the generated inference results; sample output:
```bash
[2024-12-10 15:36:24,150] [    INFO] topology.py:370 - Total 4 data comm group(s) create successfully!
[2024-12-10 15:36:24,150] [    INFO] topology.py:370 - Total 1 model comm group(s) create successfully!
[2024-12-10 15:36:24,150] [    INFO] topology.py:370 - Total 4 sharding comm group(s) create successfully!
[2024-12-10 15:36:24,150] [    INFO] topology.py:290 - HybridParallelInfo: rank_id: 0, mp_degree: 4, sharding_degree: 1, pp_degree: 1, dp_degree: 1, sep_degree: 1, mp_group: [0, 1, 2, 3], sharding_group: [0], pp_group: [0], dp_group: [0], sep:group: None, check/clip group: [0, 1, 2, 3]
[2024-12-10 15:36:24,152] [    INFO] - We are using <class 'paddlenlp.transformers.llama.tokenizer.LlamaTokenizer'> to load 'output_dir/exported_model/llama2_13b_chat_wint8_block_size32'.
[2024-12-10 15:36:24,164] [    INFO] - We are using <class 'paddlenlp.transformers.llama.configuration.LlamaConfig'> to load 'output_dir/exported_model/llama2_13b_chat_wint8_block_size32'.
[2024-12-10 15:36:24,164] [    INFO] - Loading configuration file output_dir/exported_model/llama2_13b_chat_wint8_block_size32/config.json
[2024-12-10 15:36:24,165] [    INFO] - We are using <class 'paddlenlp.transformers.llama.configuration.LlamaConfig'> to load 'output_dir/exported_model/llama2_13b_chat_wint8_block_size32'.
[2024-12-10 15:36:24,165] [    INFO] - Loading configuration file output_dir/exported_model/llama2_13b_chat_wint8_block_size32/config.json
[2024-12-10 15:36:24,198] [    INFO] - We are using <class 'paddlenlp.transformers.llama.configuration.LlamaConfig'> to load 'output_dir/exported_model/llama2_13b_chat_wint8_block_size32'.
[2024-12-10 15:36:24,198] [    INFO] - Loading configuration file output_dir/exported_model/llama2_13b_chat_wint8_block_size32/config.json
[2024-12-10 15:36:24,199] [    INFO] - Loading configuration file output_dir/exported_model/llama2_13b_chat_wint8_block_size32/generation_config.json
I1210 15:36:24.239424 1334951 analysis_predictor.cc:2142] MKLDNN is enabled
I1210 15:36:24.239473 1334951 analysis_predictor.cc:2167] CustomDevice is enabled
I1210 15:36:24.239486 1334951 analysis_predictor.cc:2210] Model is mixed precision type with float16, we will use a new PassStrategy. Note that only GPU/XPU backend is supported for now.
I1210 15:36:24.239490 1334951 analysis_predictor.cc:2259] Ir optimization is turned off, no ir pass will be executed.
--- Running analysis [ir_graph_build_pass]
I1210 15:36:24.260483 1334951 executor.cc:183] Old Executor is Running.
--- Running analysis [ir_analysis_pass]
--- Running analysis [ir_params_sync_among_devices_pass]
I1210 15:36:25.863914 1334951 ir_params_sync_among_devices_pass.cc:140] Sync params from CPU to sdaa:0
--- Running analysis [adjust_cudnn_workspace_size_pass]
--- Running analysis [inference_op_replace_pass]
--- Running analysis [save_optimized_model_pass]
--- Running analysis [ir_graph_to_program_pass]
I1210 15:36:29.991195 1334951 analysis_predictor.cc:2348] ======= ir optimization completed =======
I1210 15:36:30.000306 1334951 gen_comm_id_helper.cc:212] Server listening on: 127.0.1.1:36942 successful.
I1210 15:36:30.088883 1334951 task_node.cc:43] Constructing TaskNode for DistModelInf. The TaskNode's id is: 0. And the TaskNode's max_run_time and max_slot_num will be set to 1.
LAUNCH INFO 2024-12-10 15:37:24,254 Pod completed
LAUNCH INFO 2024-12-10 15:37:24,254 Exit code 0
I1210 15:36:30.189157 1334951 server.cpp:1107] Server[paddle::distributed::MessageServiceImpl] is serving on port=36942.
I1210 15:36:30.189195 1334951 server.cpp:1110] Check out http://dmx-19:36942 in web browser.
I1210 15:36:30.189320 1334951 message_bus.cc:201] Message bus's listen port thread starts successful.
[2024-12-10 15:36:31,284] [    INFO] - Start predict
[2024-12-10 15:36:31,296] [    INFO] - preprocess spend 0.010512113571166992
[2024-12-10 15:36:31,355] [    INFO] - We are using <class 'paddlenlp.transformers.llama.tokenizer.LlamaTokenizer'> to load 'output_dir/exported_model/llama2_13b_chat_wint8_block_size32'.
[2024-12-10 15:36:31,378] [    INFO] - Start read result message
[2024-12-10 15:36:31,378] [    INFO] - Current path is /workspace/paddlenlp/llm
[2024-12-10 15:37:22,118] [    INFO] - running spend 50.736462116241455
[2024-12-10 15:37:22,125] [    INFO] - Finish read result message
[2024-12-10 15:37:22,132] [    INFO] - End predict
***********Source**********
解释一下温故而知新
***********Target**********

***********Output**********
"温故而知新" (wēn gù er zhī xīn) is a Chinese idiom that means "to know the old in order to discern the new." It is often used to describe the idea that one can gain a deeper understanding of something new by studying and appreciating the past.

The word "温" (wēn) in this idiom means "old" or "past," and "故" (gù) means "olden days" or "former times." The word "知" (zhī) means "to know" or "to understand," and "新" (xīn) means "new."

The idiom "温故而知新" suggests that by studying and understanding the past, one can gain a deeper appreciation for the present and make more informed decisions about the future. It is often used in the context of learning from history, understanding cultural traditions, and appreciating the value of experience and wisdom.

For example, if someone is trying a new type of food for the first time, they might say "I need to study the old recipes to know the new flavors" (我需要学习古老的菜谱,才能了解新的味道). This means that by understanding the traditional methods and ingredients used in the past, they can better appreciate the new dish and its unique qualities.

Overall, "温故而知新" is a reminder that understanding the past can help us navigate the present and make more informed decisions about the future.
I1210 15:37:22.926474 1334951 server.cpp:1167] Server[paddle::distributed::MessageServiceImpl] is going to quit
```
