Add ft compile doc and scripts #3292

Merged 4 commits on Sep 17, 2022
4 changes: 4 additions & 0 deletions faster_tokenizer/README.md
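@@ -99,3 +99,7 @@ A: There are three situations in which turning on the `use_faster=True` switch may fail to improve perf...
2. The loaded tokenizer type does not yet have a Faster version. Faster versions currently exist for four tokenizers: BERT, ERNIE, TinyBERT, and ERNIE-M. If `use_faster` is turned on while loading a tokenizer that has no Faster version, PaddleNLP emits the following warning: "The tokenizer XXX doesn't have the faster version. Please check the map paddlenlp.transformers.auto.tokenizer.FASTER_TOKENIZER_MAPPING_NAMES to see which faster tokenizers are currently supported."

3. The text to be tokenized is too short (for example, an average text length below 5). In this case tokenization may not be the performance bottleneck of the overall text preprocessing, so overall performance does not improve even with FasterTokenizer.

## Related documentation

[FasterTokenizer Build Guide](docs/compile/README.md)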
13 changes: 13 additions & 0 deletions faster_tokenizer/docs/compile/README.md
@@ -0,0 +1,13 @@
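# FasterTokenizer Build Guide

This document describes how to build FasterTokenizer in two forms: as a C++ library and as a Python library. Refer to the guide for your platform:

- [Building on Linux & Mac](./how_to_build_linux_and_mac.md)
- [Building on Windows](./how_to_build_windows.md)

FasterTokenizer is built with CMake. The build options available on every platform are listed in the table below.

| Option | Description | Notes |
|:---- | :--- | :--- |
| WITH_PYTHON | Whether to build the Python library | Enabled by default |
| WITH_TESTING | Whether to build the C++ unit tests | Disabled by default |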
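
As a rough illustration of how the two options in the table combine, the sketch below assembles the flags for a CMake configure line. The function name `ft_cmake_args` is hypothetical and not part of the repository; its defaults mirror the table, and `-DCMAKE_BUILD_TYPE=Release` is taken from the build commands in the platform guides.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): print the CMake flags
# for a given option combination.
# Arguments: $1 = WITH_PYTHON (default ON), $2 = WITH_TESTING (default OFF).
ft_cmake_args() {
  local with_python="${1:-ON}"
  local with_testing="${2:-OFF}"
  printf -- '-DWITH_PYTHON=%s -DWITH_TESTING=%s -DCMAKE_BUILD_TYPE=Release\n' \
    "$with_python" "$with_testing"
}

# Example: C++ library only, with the C++ unit tests enabled.
# The command is only printed here, not executed.
echo "cmake .. $(ft_cmake_args OFF ON)"
```

Printing the command instead of running it keeps the sketch safe to try outside a source checkout.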
36 changes: 36 additions & 0 deletions faster_tokenizer/docs/compile/how_to_build_linux_and_mac.md
@@ -0,0 +1,36 @@
# Building on Linux & Mac

## Requirements

- cmake >= 3.10
- gcc >= 8.2.0
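
The minimum versions listed here could be checked before starting a build. The following is a minimal sketch, assuming purely numeric dotted version strings; `version_ge` is a hypothetical helper, not something the repository provides, and only the `cmake` check is wired up.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed when dotted numeric version $1 >= $2.
# Missing components count as 0, so "3.10" equals "3.10.0".
version_ge() {
  local IFS=.
  local -a have=($1) want=($2)
  local i h w
  for ((i = 0; i < ${#have[@]} || i < ${#want[@]}; i++)); do
    h=${have[i]:-0}
    w=${want[i]:-0}
    # Force base 10 so components with leading zeros are not read as octal.
    ((10#$h > 10#$w)) && return 0
    ((10#$h < 10#$w)) && return 1
  done
  return 0
}

# "cmake --version" prints "cmake version X.Y.Z" on its first line.
if command -v cmake >/dev/null 2>&1; then
  cmake_version="$(cmake --version | head -n 1 | awk '{print $3}')"
  cmake_version="${cmake_version%%[!0-9.]*}"  # drop suffixes such as "-rc1"
  if version_ge "$cmake_version" 3.10; then
    echo "cmake $cmake_version: OK"
  else
    echo "cmake $cmake_version: too old, need >= 3.10"
  fi
else
  echo "cmake not found"
fi
```

The same function can be applied to a gcc version string with `8.2.0` as the second argument.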

## Building the C++ library

```bash
git clone https://github.com/PaddlePaddle/PaddleNLP.git
cd PaddleNLP/faster_tokenizer
mkdir build && cd build
cmake .. -DWITH_PYTHON=OFF -DWITH_TESTING=OFF -DCMAKE_BUILD_TYPE=Release
make -j8
```

The built C++ library is placed in the `cpp` directory under the current directory.

## Building the Python library

```bash
git clone https://github.com/PaddlePaddle/PaddleNLP.git
cd PaddleNLP/faster_tokenizer
mkdir build && cd build
# Set up the Python environment
export LD_LIBRARY_PATH=/opt/_internal/cpython-3.6.0/lib/:${LD_LIBRARY_PATH}
export PATH=/opt/_internal/cpython-3.6.0/bin/:${PATH}

cmake .. -DWITH_PYTHON=ON -DWITH_TESTING=OFF -DCMAKE_BUILD_TYPE=Release
make -j8
```

The built wheel package is placed in the `dist` directory under the current directory.

For more build options, see the [Build Guide](./README.md).
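
The source does not give the wheel's file name, so a follow-up step would have to discover it. The sketch below shows one way to do that with a glob; `find_wheel` is a hypothetical helper, and the install command is only printed, not run.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the first wheel found in a directory
# (default: dist) and return non-zero if there is none.
find_wheel() {
  local dir="${1:-dist}"
  local whl
  for whl in "$dir"/*.whl; do
    # An unmatched glob stays literal, so check that the file exists.
    if [ -e "$whl" ]; then
      printf '%s\n' "$whl"
      return 0
    fi
  done
  echo "no wheel found in $dir" >&2
  return 1
}

# Usage from the build directory.
if whl="$(find_wheel dist 2>/dev/null)"; then
  echo "python -m pip install $whl"
else
  echo "build the Python library first"
fi
```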
42 changes: 42 additions & 0 deletions faster_tokenizer/docs/compile/how_to_build_windows.md
@@ -0,0 +1,42 @@
# Building on Windows

## Requirements

- cmake >= 3.10
- VS 2019
- ninja

Once these dependencies are installed, open the `x64 Native Tools Command Prompt for VS 2019` from the Windows Start menu and run the build steps below in it.

## Building the C++ library

```bat
git clone https://github.com/PaddlePaddle/PaddleNLP.git
cd PaddleNLP/faster_tokenizer
mkdir build && cd build
cmake .. -G "Ninja" -DWITH_PYTHON=OFF -DWITH_TESTING=OFF -DCMAKE_BUILD_TYPE=Release
ninja -j8
```

The built C++ library is placed in the `cpp` directory under the current directory.

## Building the Python library

```bat
git clone https://github.com/PaddlePaddle/PaddleNLP.git
cd PaddleNLP/faster_tokenizer
mkdir build && cd build
rem The Python interpreter, include directory and library must be specified explicitly
cmake .. -G "Ninja" -DWITH_PYTHON=ON ^
-DWITH_TESTING=OFF ^
-DCMAKE_BUILD_TYPE=Release ^
-DPYTHON_EXECUTABLE=C:\Python37\python.exe ^
-DPYTHON_INCLUDE_DIR=C:\Python37\include ^
-DPYTHON_LIBRARY=C:\Python37\libs\python37.lib
ninja -j8
```

The built wheel package is placed in the `dist` directory under the current directory.

For more build options, see the [Build Guide](./README.md).
2 changes: 1 addition & 1 deletion faster_tokenizer/faster_tokenizer/CMakeLists.txt
@@ -6,7 +6,7 @@ add_subdirectory(postprocessors)
 add_subdirectory(core)
 add_subdirectory(utils)
 # set the relative path of shared library
-if (UNIX)
+if (NOT APPLE AND NOT WIN32)
   set(CMAKE_SHARED_LINKER_FLAGS "${CMAKE_SHARED_LINKER_FLAGS} -Wl,-rpath='$ORIGIN'")
 endif()

7 changes: 7 additions & 0 deletions faster_tokenizer/run_build_cpp_lib.bat
@@ -0,0 +1,7 @@
if not exist build_cpp mkdir build_cpp
cd build_cpp
for /d %%G in ("*") do rmdir /s /q "%%G"
del /q *
cmake .. -G "Ninja" -DWITH_PYTHON=OFF -DWITH_TESTING=OFF -DCMAKE_BUILD_TYPE=Release
ninja -j20
cd ..
21 changes: 21 additions & 0 deletions faster_tokenizer/run_build_cpp_lib.sh
@@ -0,0 +1,21 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Can be used in linux and mac
mkdir -p build_cpp
cd build_cpp
rm -rf *
# C++ library only, so the Python library is switched off
cmake .. -DWITH_PYTHON=OFF -DWITH_TESTING=OFF -DCMAKE_BUILD_TYPE=Release
make -j48
cd ..
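
One caveat with the `cd` followed by `rm -rf *` pattern is that the delete still runs if the `cd` fails, in whatever directory the shell happens to be in. A more defensive variant might look like the sketch below. `enter_clean_build_dir` is a hypothetical helper, not part of this pull request, and unlike `rm -rf *` it also removes dotfiles.

```bash
#!/usr/bin/env bash
# Hypothetical helper: create a build directory if needed, enter it, and
# empty it. Returns non-zero without deleting anything if the directory
# cannot be created or entered.
enter_clean_build_dir() {
  local dir="$1"
  [ -n "$dir" ] || return 1
  mkdir -p "$dir" || return 1
  cd "$dir" || return 1
  # Delete everything inside the directory, but not the directory itself.
  find . -mindepth 1 -delete
}

# Usage sketch, with the C++ build command from the Linux & Mac guide:
#   enter_clean_build_dir build_cpp \
#     && cmake .. -DWITH_PYTHON=OFF -DWITH_TESTING=OFF -DCMAKE_BUILD_TYPE=Release \
#     && make -j8
#   cd ..
```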
14 changes: 14 additions & 0 deletions faster_tokenizer/run_build_py_lib.bat
@@ -0,0 +1,14 @@
for %%x in (6 7 8 9) do (
    if not exist build_py3%%x mkdir build_py3%%x
    cd build_py3%%x
    for /d %%G in ("*") do rmdir /s /q "%%G"
    del /q *
    cmake .. -G "Ninja" -DWITH_PYTHON=ON ^
        -DWITH_TESTING=OFF ^
        -DCMAKE_BUILD_TYPE=Release ^
        -DPYTHON_EXECUTABLE=C:\Python3%%x\python.exe ^
        -DPYTHON_INCLUDE_DIR=C:\Python3%%x\include ^
        -DPYTHON_LIBRARY=C:\Python3%%x\libs\python3%%x.lib
    ninja -j20
    cd ..
)
35 changes: 35 additions & 0 deletions faster_tokenizer/run_build_py_lib.sh
@@ -0,0 +1,35 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Can be used in linux and mac
# build python lib
mkdir -p build_py36 build_py37 build_py38 build_py39
for py_version in 6 7 8 9;
do
  cd build_py3${py_version}
  rm -rf *
  platform="$(uname -s)"
  if [[ $platform == Linux* ]];
  then
    export LD_LIBRARY_PATH=/opt/_internal/cpython-3.${py_version}.0/lib/:${LD_LIBRARY_PATH}
    export PATH=/opt/_internal/cpython-3.${py_version}.0/bin/:${PATH}
  else
    export LD_LIBRARY_PATH=/Users/paddle/miniconda2/envs/py3${py_version}/lib/:${LD_LIBRARY_PATH}
    export PATH=/Users/paddle/miniconda2/envs/py3${py_version}/bin/:${PATH}
  fi
  cmake .. -DWITH_PYTHON=ON -DWITH_TESTING=OFF -DCMAKE_BUILD_TYPE=Release
  make -j24
  cd ..
done
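
The scripts in this change hard-code the parallel job count (`-j24`, `-j48`, `-j20`), which may not match the machine they run on. The sketch below shows one way the count could be derived instead. `detect_jobs` is a hypothetical helper, and it assumes `getconf _NPROCESSORS_ONLN` is available, as it is on Linux and macOS.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the number of online CPUs, falling back to 1
# when the value cannot be determined.
detect_jobs() {
  local n
  n="$(getconf _NPROCESSORS_ONLN 2>/dev/null)" || n=""
  case "$n" in
    '' | *[!0-9]*) n=1 ;;  # empty or non-numeric output
  esac
  printf '%s\n' "$n"
}

# The make invocation is only printed here, not executed.
echo "make -j$(detect_jobs)"
```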