diff --git a/docs/FAQ.md b/docs/FAQ.md
index 12853cf627fd..db2b4ac52fc6 100644
--- a/docs/FAQ.md
+++ b/docs/FAQ.md
@@ -61,7 +61,6 @@
 **A:** 通过使用PaddleNLP提供的 `load_dataset`, `MapDataset` 和 `IterDataset` ,可以方便的自定义属于自己的数据集哦,也欢迎您贡献数据集到PaddleNLP repo。
 
 从本地文件创建数据集时,我们 **推荐** 根据本地数据集的格式给出读取function并传入 `load_dataset()` 中创建数据集。
-以[waybill_ie](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/information_extraction/waybill_ie)快递单信息抽取任务中的数据为例:
 
 ```python
 from paddlenlp.datasets import load_dataset
@@ -368,12 +367,12 @@ model.set_state_dict(paddle.load("xxx_para"))
 
 动转静,即将动态图的模型转为可用于部署的静态图模型。
 动态图接口更加易用,python 风格的交互式编程体验,对于模型开发更为友好,而静态图相比于动态图在性能方面有更绝对的优势。因此动转静提供了这样的桥梁,同时兼顾开发成本和性能。
 可以参考官方文档 [动态图转静态图文档](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/guides/04_dygraph_to_static/index_cn.html),使用 `paddle.jit.to_static` 完成动转静。
- 另外,在 PaddleNLP 我们也提供了导出静态图模型的例子,可以参考 [waybill_ie 模型导出](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/information_extraction/waybill_ie/#%E6%A8%A1%E5%9E%8B%E5%AF%BC%E5%87%BA)。
+ 另外,在 PaddleNLP 我们也提供了导出静态图模型的例子。
 
 (2)借助Paddle Inference部署
 
 动转静之后保存下来的模型可以借助Paddle Inference完成高性能推理部署。Paddle Inference内置高性能的CPU/GPU Kernel,结合细粒度OP横向纵向融合等策略,并集成 TensorRT 实现模型推理的性能提升。具体可以参考文档 [Paddle Inference 简介](https://paddleinference.paddlepaddle.org.cn/master/product_introduction/inference_intro.html)。
- 为便于初次上手的用户更易理解 NLP 模型如何使用Paddle Inference,PaddleNLP 也提供了对应的例子以供参考,可以参考 [/PaddleNLP/examples](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/) 下的deploy目录,如[基于ERNIE的命名实体识别模型部署](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/examples/information_extraction/waybill_ie/deploy/python)。
+ 为便于初次上手的用户更易理解 NLP 模型如何使用Paddle Inference,PaddleNLP 也提供了对应的例子以供参考,可以参考 [/PaddleNLP/examples](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/legacy/examples/) 下的deploy目录。
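The FAQ hunk above now describes dynamic-to-static export and Paddle Inference deployment without a linked walkthrough. A minimal sketch of the `paddle.jit.to_static` flow it refers to is given here; the `ernie-3.0-medium-zh` checkpoint, the two `int64` input specs, and the output path are illustrative assumptions, not part of the removed example.

```python
import paddle
from paddlenlp.transformers import AutoModel

# Assumed checkpoint for illustration; any paddle.nn.Layer exports the same way.
model = AutoModel.from_pretrained("ernie-3.0-medium-zh")
model.eval()

# Declare variable-length input specs so the traced static graph
# accepts arbitrary batch sizes and sequence lengths.
static_model = paddle.jit.to_static(
    model,
    input_spec=[
        paddle.static.InputSpec(shape=[None, None], dtype="int64", name="input_ids"),
        paddle.static.InputSpec(shape=[None, None], dtype="int64", name="token_type_ids"),
    ],
)

# Writes inference.pdmodel / inference.pdiparams, the file pair that
# Paddle Inference loads for high-performance deployment.
paddle.jit.save(static_model, "./static_graph/inference")
```

The `None` dimensions in each `InputSpec` are what keep the exported graph usable for varying batch sizes and sequence lengths at inference time.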
diff --git a/docs/data_prepare/dataset_self_defined.rst b/docs/data_prepare/dataset_self_defined.rst
index 673d78a26acb..fbb1d8ee81c7 100644
--- a/docs/data_prepare/dataset_self_defined.rst
+++ b/docs/data_prepare/dataset_self_defined.rst
@@ -9,8 +9,6 @@
 从本地文件创建数据集时,我们 **推荐** 根据本地数据集的格式给出读取function并传入 :func:`load_dataset` 中创建数据集。
 
-以 `waybill_ie `__ 快递单信息抽取任务中的数据为例:
-
 .. code-block::
 
     from paddlenlp.datasets import load_dataset
@@ -44,7 +42,7 @@
 从 :class:`paddle.io.Dataset/IterableDataset` 创建数据集
 -------------------
 
-虽然PaddlePddle内置的 :class:`Dataset` 和 :class:`IterableDataset` 是可以直接接入 :class:`DataLoader` 用于模型训练的,但有时我们希望更方便的使用一些数据处理(例如convert to feature, 数据清洗,数据增强等)。而PaddleNLP内置的 :class:`MapDataset` 和 :class:`IterDataset` 正好提供了能实现以上功能的API。
+虽然PaddlePaddle内置的 :class:`Dataset` 和 :class:`IterableDataset` 是可以直接接入 :class:`DataLoader` 用于模型训练的,但有时我们希望更方便的使用一些数据处理(例如convert to feature, 数据清洗,数据增强等)。而PaddleNLP内置的 :class:`MapDataset` 和 :class:`IterDataset` 正好提供了能实现以上功能的API。
 
 所以如果您习惯使用 :class:`paddle.io.Dataset/IterableDataset` 创建数据集的话。只需要在原来的数据集上套上一层 :class:`MapDataset` 或 :class:`IterDataset` 就可以把原来的数据集对象转换成PaddleNLP的数据集。
diff --git a/docs/locale/en/LC_MESSAGES/FAQ.po b/docs/locale/en/LC_MESSAGES/FAQ.po
index 2439942650b8..23ede15f32aa 100644
--- a/docs/locale/en/LC_MESSAGES/FAQ.po
+++ b/docs/locale/en/LC_MESSAGES/FAQ.po
@@ -188,7 +188,6 @@ msgstr ""
 #: ../FAQ.md:63
 msgid ""
 "从本地文件创建数据集时,我们 推荐 根据本地数据集的格式给出读取function并传入 load_dataset() 中创建数据集。 "
-"以waybill_ie快递单信息抽取任务中的数据为例:"
 msgstr ""
 
 #: ../FAQ.md:84
diff --git a/docs/locale/en/LC_MESSAGES/data_prepare/dataset_self_defined.po b/docs/locale/en/LC_MESSAGES/data_prepare/dataset_self_defined.po
index 3d076811d2f3..08850795c7a3 100644
--- a/docs/locale/en/LC_MESSAGES/data_prepare/dataset_self_defined.po
+++ b/docs/locale/en/LC_MESSAGES/data_prepare/dataset_self_defined.po
@@ -37,13 +37,6 @@ msgid ""
 "中创建数据集。"
 msgstr ""
 
-#: ../data_prepare/dataset_self_defined.rst:12
-msgid ""
-"以 `waybill_ie "
-"`__"
-" 快递单信息抽取任务中的数据为例:"
-msgstr ""
-
 #: ../data_prepare/dataset_self_defined.rst:32
 msgid ""
 "我们推荐将数据读取代码写成生成器(generator)的形式,这样可以更好的构建 :class:`MapDataset` 和 "
 msgstr ""
diff --git a/examples/code_generation/codegen/README.md b/examples/code_generation/codegen/README.md
deleted file mode 100644
index 6ef14f5bcbc8..000000000000
--- a/examples/code_generation/codegen/README.md
+++ /dev/null
@@ -1,326 +0,0 @@
-# 代码生成:写代码的AI助理
-
-**目录**
-- [代码生成](#代码生成)
-  - [简介](#简介)
-  - [特色](#特色)
-  - [效果展示](#效果展示)
-  - [Github Copilot插件配置](#GithubCopilot插件配置)
-  - [环境依赖](#环境依赖)
-  - [代码结构说明](#代码结构说明)
-  - [启动服务](#启动服务)
-  - [配置参数](#配置参数说明)
-  - [测试服务](#测试服务)
-  - [配置插件](#配置插件)
-  - [注意事项](#注意事项)
-  - [训练定制](#训练定制)
-  - [数据准备](#数据准备)
-  - [从本地文件创建数据集](#从本地文件创建数据集)
-  - [模型训练](#模型训练)
-  - [TaskFlow调用](#TaskFlow调用)
-  - [更多使用案例](#更多使用案例)
-  - [模型列表](#模型列表)
-  - [References](#references)
-
-
-## 简介
-代码生成是根据编程人员的输入,生成出编程人员想要的代码,能够帮助编程人员甚至独立生成代码,提高编程效率。
-
-
-### 特色
-
-本项目是基于预训练语言模型CodeGen的代码生成,具有以下优势:
-- **效果领先**。CodeGen(16B)在HumanEval benchmark上评估指标已经超过[OpenAI's Codex](https://arxiv.org/pdf/2107.03374.pdf)。
-- **免费的Github Copilot**。支持通过Github Copilot调用该模型,让你免费体验代码AI助理。
-- **高性能**。基于[FastGeneration](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/fast_generation)打造高性能推理,毫秒级响应。具体加速指标可参考[perf](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/fast_generation/README.md)。
-- **支持自定义数据集训练**。可增加自己的代码数据加以微调,让其更智能。
-- **开箱即用**。本项目提供TaskFlow接口,无需训练,仅需几行代码便可预测。
-
-
-## 效果展示
-
-- Github Copilot代码提示效果展示
-
-- 解算法题效果展示。求解无重复字符的最长子串的长度
-```python
-from paddlenlp import Taskflow
-
-prompt = "def lengthOfLongestSubstring(self, s: str) -> int:"
-codegen = Taskflow("code_generation", model="Salesforce/codegen-2B-mono", decode_strategy="greedy_search", repetition_penalty=1.0)
-print(codegen(prompt))
-```
-结果输出为:
-```python
-    if not s:
-        return 0
-
-    start = 0
-    end = 0
-    max_len = 0
-
-    while end < len(s):
-        if s[end] not in s[start:end]:
-            max_len = max(max_len, end - start + 1)
-            end += 1
-        else:
-            start += 1
-
-    return max_len
-```
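The Taskflow demo above pins greedy search. As a hedged sketch of the same interface with sampling instead (the keyword names mirror the generation parameters listed under 配置参数说明 below; the smaller 350M checkpoint, the decoding values, and the prompt are assumptions chosen to keep the example light), decoding behavior can be changed through the constructor:

```python
from paddlenlp import Taskflow

# Assumed configuration: a smaller checkpoint with sampling-based decoding.
codegen = Taskflow(
    "code_generation",
    model="Salesforce/codegen-350M-mono",
    decode_strategy="sampling",
    top_k=10,
    temperature=0.5,
    max_length=64,
)
print(codegen("def fibonacci(n):"))
```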
-
-
-## Jupyter Lab插件配置
-
-请参考[codegenJupyterLabExt](https://github.com/chenqianhe/codegenJupyterLabExt), 感谢生态开发者[@chenqianhe](https://github.com/chenqianhe)的贡献!👏👏
-
-## GithubCopilot插件配置
-
-**以VS Code的插件为例**
-
-### 环境依赖
-- PaddleNLP >= 2.4.0
-- PaddlePaddle >= 2.3.1
-
-其他依赖:`pip install -r requirements.txt`
-
-### 代码结构说明
-
-以下是本项目主要代码结构及说明:
-
-```text
-codegen/
-├── requirements.txt # 环境依赖
-├── codegen_server.py # server启动脚本
-├── run_clm.py # 训练评估脚本
-├── run_clm.sh # 启动脚本
-└── README.md # 说明文档
-```
-
-### 启动服务
-
-```shell
-python codegen_server.py
-```
-
-#### 配置参数说明
-在codegen_server.py中配置如下参数:
-- `model_name_or_path`:模型名,默认为 "Salesforce/codegen-350M-mono"
-- `device`:运行设备,默认为"gpu"
-- `temperature`:解码参数temperature,默认为0.5
-- `top_k`:解码参数top_k,默认为10
-- `top_p`:解码参数top_p,默认为1.0
-- `repetition_penalty`:解码重复惩罚项,默认为1.0
-- `min_length`:生成的最小长度,默认为0
-- `max_length`:生成的最大长度,默认为16
-- `decode_strategy`:解码策略,默认为"greedy_search"
-- `use_fast`:是否使用FastGeneration,可加速推理,默认为True
-- `use_fp16_decoding`:是否使用fp16推理,可节省显存和加速推理,默认为True
-
-### 测试服务
-```python
-import openai
-openai.api_key = 'dummy'
-openai.api_base = 'http://127.0.0.1:8978'
-result = openai.Completion.create(
-    engine='codegen', prompt='def hello', max_tokens=16, temperature=0.1)
-print(result)
-'''
- JSON: {
-  "id": "cmpl-dmhoeHmcw9DJ4NeqOJDQVKv3iivJ0",
-  "choices": [
-    {
-      "text": "_world():\n    print(\"Hello World!\")\n\n\n#",
-      "index": 0,
-      "finish_reason": "stop",
-      "logprobs": null,
-    }
-  ],
-  "usage": {
-    "completion_tokens": null,
-    "prompt_tokens": null,
-    "total_tokens": null
-  }
-}
-'''
-```
-**注意**:如果要从本地访问服务器,`127.0.0.1`需要换成服务器的对外IP。
-
-
-### 配置插件
-打开用户设置([settings.json](https://code.visualstudio.com/docs/getstarted/settings#_settings-file-locations)),增加一行配置
-```json
-    "github.copilot.advanced": {
-        "debug.overrideEngine": "codegen",
-        "debug.testOverrideProxyUrl": "http://127.0.0.1:8978",
-        "debug.overrideProxyUrl": "http://127.0.0.1:8978"
-    },
-```
-接下来就可以愉快地使用了😊。
-
-
-#### 注意事项
-- 如果使用FastGeneration,需要设置[codegen_server.py](#配置参数说明)中`use_fast=True`,第一次推理会涉及到编译,会耗费一些时间。FastGeneration的环境依赖参考[这里](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/paddlenlp/ops/README.md#%E4%BD%BF%E7%94%A8%E7%8E%AF%E5%A2%83%E8%AF%B4%E6%98%8E)。
-- 如果要使用自己训练好的模型,可以设置[codegen_server.py](#配置参数说明)中`model_name_or_path`为本地模型路径。
-- 如果要从本地访问服务器,上述的`127.0.0.1`需要换成服务器的对外IP。
-- 如果出现下方的提示和报错,则说明FastGeneration没有启动成功,需要定位下失败的原因。或者也可设置`use_fast=False`,不启动FastGeneration加速,但推理速度会较慢。
-```shell
-    FastGeneration is not available, and the original version would be used instead.
-```
-```shell
-    RuntimeError: (NotFound) There are no kernels which are registered in the unsqueeze2 operator.
-    [Hint: Expected kernels_iter != all_op_kernels.end(), but received kernels_iter == all_op_kernels.end().]
(at /home/Paddle/paddle/fluid/imperative/prepared_operator.cc:341)
- [operator < unsqueeze2 > error]
-```
-- 本代码也支持插件[fauxpilot](https://marketplace.visualstudio.com/items?itemName=Venthe.fauxpilot),感谢[@linonetwo](https://github.com/linonetwo)测试。`settings.json`中配置"fauxpilot.server": "http://服务器ip:8978/v1/engines"
-
-## 训练定制
-
-### 数据准备
-
-#### 从本地文件创建数据集
-
-在许多情况下,我们需要使用本地数据集来训练我们的代码生成模型,本项目支持使用固定格式本地数据集文件进行训练。
-
-本地数据集文件格式如下:
-- train.json/test.json 文件格式:
-每行为一个jsonline
-```text
-{
-    "code": "from paddlenlp.transformers import CodeGenForCausalLM\n\n\nmodel = CodeGenForCausalLM.from_pretrained('Salesforce/codegen-2B-mono')\n"
-}
-```
-
-更多数据集读取格式详见[数据集加载](https://paddlenlp.readthedocs.io/zh/latest/data_prepare/dataset_load.html#)和[自定义数据集](https://paddlenlp.readthedocs.io/zh/latest/data_prepare/dataset_self_defined.html)。
-
-
-### 模型训练
-运行如下命令即可在样例训练集上进行finetune,并在样例验证集上进行验证。
-
-```shell
-# GPU启动,参数`--gpus`指定训练所用的GPU卡号,可以是单卡,也可以多卡
-unset CUDA_VISIBLE_DEVICES
-
-python -m paddle.distributed.launch --gpus 0,1 run_clm.py \
-    --model_name_or_path Salesforce/codegen-350M-mono \
-    --block_size 1024 \
-    --output_dir output \
-    --train_file train.json \
-    --validation_file test.json \
-    --num_train_epochs 5 \
-    --logging_steps 10 \
-    --save_steps 1000 \
-    --per_device_train_batch_size 2 \
-    --per_device_eval_batch_size 2 \
-    --learning_rate 1e-4 \
-    --warmup_ratio 0.1 \
-    --do_train \
-    --do_eval \
-    --device gpu
-```
-使用多卡训练可以指定多个GPU卡号,例如 --gpus "0,1"
-
-关键参数释义如下:
-- `gpus` 指示了训练所用的GPU卡号。
-- `model_name_or_path` 指示了finetune使用的具体预训练模型,可以是PaddleNLP提供的预训练模型(详见[模型列表](#模型列表)),或者是本地的预训练模型。如果使用本地的预训练模型,可以配置本地模型的目录地址,例如: ./checkpoints/model_xx/,目录中需包含paddle预训练模型model_state.pdparams。如果使用PaddleNLP提供的预训练模型,可以选择下面其中之一。
-- `block_size` 表示训练时文本被切分成的块的大小,即每条训练样本包含的token数。
-- `output_dir` 表示模型的保存路径。
-- `train_file` 本地训练数据地址,数据格式须为上述的jsonline格式。
-- `validation_file` 本地测试数据地址,数据格式须为上述的jsonline格式。
-- `num_train_epochs` 表示训练轮数。
-- `logging_steps` 表示日志打印间隔。
-- `save_steps` 表示模型保存及评估间隔。
-- `per_device_train_batch_size` 表示训练时**每张卡**上的样本数目。
-- `per_device_eval_batch_size` 表示测试时**每张卡**上的样本数目。
-- `learning_rate` 表示基础学习率大小,将与learning rate scheduler产生的值相乘作为当前学习率。
-- `warmup_ratio` 表示学习率逐渐升高到基础学习率(即上面配置的learning_rate)所需要的迭代数占总步数的比例,最早的使用可以参考[这篇论文](https://arxiv.org/pdf/1706.02677.pdf)。
-- `do_train` 表示是否训练。
-- `do_eval` 表示是否评测。
-- `device` 表示使用的设备,从gpu和cpu中选择。
-
-可通过`bash run_clm.sh`启动训练,更多参数详情和参数的默认值请参考`run_clm.py`。
-
-程序运行时将会自动进行训练和验证,训练过程中会自动保存模型在指定的`output_dir`中。
-如:
-```text
-./output/
-│── model_config.json
-│── model_state.pdparams
-│── tokenizer_config.json
-│── special_tokens_map.json
-│── added_tokens.json
-│── vocab.json
-│── merges.txt
-└── ...
-```
-
-**NOTE:** 如需恢复模型训练,`model_name_or_path`配置本地模型的目录地址即可。
-
-
-## TaskFlow调用
-参考[TaskFlow文档](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/model_zoo/taskflow.md)
-
-## 更多使用案例
-
-- 根据注释/功能描述写代码
-
-```python
-import re
-import paddle
-from paddlenlp.transformers import CodeGenTokenizer, CodeGenForCausalLM
-
-# The supported models are shown in the following table
-model_name = 'Salesforce/codegen-2B-mono'
-# Init tokenizer
-tokenizer = CodeGenTokenizer.from_pretrained(model_name)
-# Init model
-model = CodeGenForCausalLM.from_pretrained(model_name)
-
-prompt = "# this function prints hello world"
-inputs = tokenizer([prompt])
-inputs = {k: paddle.to_tensor(v) for (k, v) in inputs.items()}
-# Generate
-output, score = model.generate(inputs['input_ids'],
-                               max_length=128,
-                               decode_strategy='greedy_search')
-# Decode the result
-print(
-    tokenizer.decode(output[0],
-                     truncate_before_pattern=[r"\n\n^#", "^'''", "\n\n\n"],
-                     skip_special_tokens=True,
-                     spaces_between_special_tokens=False))
-```
-结果输出为:
-```python
-def hello_world():
-    print("Hello World")
-
-hello_world()
-```
-
-## 模型列表
-| 模型名称 | 说明 |
-| :--------------------------------- | -------------------------------- |
-| Salesforce/codegen-350M-mono | 基于Python数据集BIGPYTHON训练 |
-| Salesforce/codegen-2B-mono | 基于Python数据集BIGPYTHON训练 |
-| Salesforce/codegen-6B-mono | 基于Python数据集BIGPYTHON训练 |
-| Salesforce/codegen-16B-mono | 基于Python数据集BIGPYTHON训练 |
-| Salesforce/codegen-350M-nl | 基于自然语言数据集THEPILE训练 |
-| Salesforce/codegen-2B-nl | 基于自然语言数据集THEPILE训练 |
-| Salesforce/codegen-6B-nl | 基于自然语言数据集THEPILE训练 |
-| Salesforce/codegen-16B-nl | 基于自然语言数据集THEPILE训练 |
-| Salesforce/codegen-350M-multi | 基于多编程语言数据集BIGQUERY训练 |
-| Salesforce/codegen-2B-multi | 基于多编程语言数据集BIGQUERY训练 |
-| Salesforce/codegen-6B-multi | 基于多编程语言数据集BIGQUERY训练 |
-| Salesforce/codegen-16B-multi | 基于多编程语言数据集BIGQUERY训练 |
-
-## References
-- Nijkamp, Erik, et al. "A conversational paradigm for program synthesis." arXiv preprint arXiv:2203.13474 (2022).
-- [https://github.com/features/copilot/](https://github.com/features/copilot/)
-- [https://github.com/AndPuQing/Papilot](https://github.com/AndPuQing/Papilot)
diff --git a/examples/code_generation/codegen/codegen_server.py b/examples/code_generation/codegen/codegen_server.py
deleted file mode 100644
index e0c246063bf9..000000000000
--- a/examples/code_generation/codegen/codegen_server.py
+++ /dev/null
@@ -1,137 +0,0 @@
-# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
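-# Overview: this module stands up a FastAPI app whose POST
-# /v1/engines/codegen/completions route mimics the OpenAI completions API that
-# the Copilot plugin expects: the request prompt is tokenized, CodeGen's
-# generate() produces a continuation, and the decoded text is returned either
-# as an OpenAI-style JSON body or as an SSE stream when "stream" is true.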
-import random -import string -import time - -import paddle -import uvicorn -from fastapi import FastAPI, Response, status -from pydantic import BaseModel -from sse_starlette.sse import EventSourceResponse - -from paddlenlp.transformers import CodeGenForCausalLM, CodeGenTokenizer -from paddlenlp.utils.log import logger - - -class DefaultConfig: - model_name_or_path = "Salesforce/codegen-350M-mono" - device = "gpu" - temperature = 0.5 - top_k = 10 - top_p = 1.0 - repetition_penalty = 1.0 - min_length = 0 - max_length = 16 - decode_strategy = "greedy_search" - use_faster = True - use_fp16_decoding = True - default_dtype = "float16" if use_faster and use_fp16_decoding else "float32" - - -class Input(BaseModel): - prompt: str - stream: bool = False - - -class Output(BaseModel): - id: str - model: str = "codegen" - object: str = "text_completion" - created: int = int(time.time()) - choices: list = None - usage = { - "completion_tokens": None, - "prompt_tokens": None, - "total_tokens": None, - } - - -generate_config = DefaultConfig() -paddle.set_device(generate_config.device) -paddle.set_default_dtype(generate_config.default_dtype) - -tokenizer = CodeGenTokenizer.from_pretrained(generate_config.model_name_or_path) -model = CodeGenForCausalLM.from_pretrained(generate_config.model_name_or_path) - -app = FastAPI() - - -def random_completion_id(): - return "cmpl-" + "".join(random.choice(string.ascii_letters + string.digits) for _ in range(29)) - - -@app.post("/v1/engines/codegen/completions", status_code=200) -async def gen(item: Input): - item = item.dict() - logger.info(f"Request: {item}") - temperature = item.get("temperature", generate_config.temperature) - top_k = item.get("top_k", generate_config.top_k) - if temperature == 0.0: - temperature = 1.0 - top_k = 1 - repetition_penalty = item.get("frequency_penalty", generate_config.repetition_penalty) - - start_time = time.time() - logger.info("Start generating code") - tokenized = tokenizer([item["prompt"]], truncation=True, return_tensors="pd") - output, _ = model.generate( - tokenized["input_ids"], - max_length=16, - min_length=generate_config.min_length, - decode_strategy=generate_config.decode_strategy, - top_k=top_k, - repetition_penalty=repetition_penalty, - temperature=temperature, - use_fast=generate_config.use_faster, - use_fp16_decoding=generate_config.use_fp16_decoding, - ) - logger.info("Finish generating code") - end_time = time.time() - logger.info(f"Time cost: {end_time - start_time}") - output = tokenizer.decode(output[0], skip_special_tokens=True) - logger.info(f"Generated code: {output}") - output_json = Output( - id=random_completion_id(), - choices=[ - { - "text": output, - "index": 0, - "finish_reason": "stop", - "logprobs": None, - } - ], - usage={ - "completion_tokens": None, - "prompt_tokens": None, - "total_tokens": None, - }, - ).json() - - def stream_response(response): - yield f"{response}\n\n" - yield "data: [DONE]\n\n" - - if item.get("stream", False): - return EventSourceResponse(stream_response(output_json)) - else: - return Response( - status_code=status.HTTP_200_OK, - content=output_json, - media_type="application/json", - ) - - -if __name__ == "__main__": - uvicorn.run("codegen_server:app", host="0.0.0.0", port=8978) diff --git a/examples/code_generation/codegen/requirements.txt b/examples/code_generation/codegen/requirements.txt deleted file mode 100644 index ae00f4799fa1..000000000000 --- a/examples/code_generation/codegen/requirements.txt +++ /dev/null @@ -1,7 +0,0 @@ -fastapi==0.79.0 -pydantic==1.9.1 
-python-dotenv==0.20.0 -sse_starlette==0.10.3 -uvicorn==0.17.6 -openai==0.8.0 -regex==2022.6.2 \ No newline at end of file diff --git a/examples/code_generation/codegen/run_clm.py b/examples/code_generation/codegen/run_clm.py deleted file mode 100644 index 4e9d5668763f..000000000000 --- a/examples/code_generation/codegen/run_clm.py +++ /dev/null @@ -1,162 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import math -from dataclasses import dataclass, field -from functools import partial -from itertools import chain -from typing import Optional - -import paddle -import paddle.nn as nn -from datasets import load_dataset - -from paddlenlp.data import DataCollatorWithPadding -from paddlenlp.trainer import PdArgumentParser, Trainer, TrainingArguments, set_seed -from paddlenlp.transformers import CodeGenForCausalLM, CodeGenTokenizer -from paddlenlp.utils.log import logger - - -@dataclass -class ModelArguments: - model_name_or_path: Optional[str] = field( - default="Salesforce/codegen-350M-mono", - metadata={"help": ("Path to pre-trained model.")}, - ) - overwrite_cache: Optional[bool] = field( - default=False, - metadata={"help": ("Whether to overwrite cache for dataset.")}, - ) - - -@dataclass -class DataArguments: - train_file: Optional[str] = field( - default=None, - metadata={"help": "The input training data file."}, - ) - validation_file: Optional[str] = field( - default=None, - metadata={"help": "The input validation data file."}, - ) - block_size: Optional[int] = field( - default=None, - metadata={"help": ("The training dataset will be truncated in block of this size for training. 
")}, - ) - - -def compute_metrics(eval_preds): - labels = paddle.to_tensor(eval_preds.label_ids, dtype="int64") - logits = paddle.to_tensor(eval_preds.predictions) - loss_fct = nn.CrossEntropyLoss() - eval_loss = loss_fct(logits[:, :-1, :], labels[:, 1:]) - perplexity = math.exp(eval_loss) - return {"perplexity": perplexity} - - -def convert_example(examples, tokenizer): - """convert examples into necessary features""" - # Convert raw text to feature - tokenized_examples = tokenizer( - examples["code"], return_attention_mask=True, return_position_ids=False, return_token_type_ids=False - ) - return tokenized_examples - - -def group_texts(examples, block_size): - concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()} - total_length = len(concatenated_examples[list(examples.keys())[0]]) - if total_length >= block_size: - total_length = (total_length // block_size) * block_size - result = { - k: [t[i : i + block_size] for i in range(0, total_length, block_size)] - for k, t in concatenated_examples.items() - } - result["labels"] = result["input_ids"].copy() - return result - - -def process_ds(dataset, tokenizer, overwrite_cache, block_size): - trans_func = partial(convert_example, tokenizer=tokenizer) - dataset = dataset.map( - trans_func, batched=True, remove_columns=dataset.column_names, load_from_cache_file=overwrite_cache - ) - trans_func = partial(group_texts, block_size=block_size) - dataset = dataset.map(trans_func, batched=True, load_from_cache_file=overwrite_cache) - return dataset - - -def do_train(): - parser = PdArgumentParser((ModelArguments, DataArguments, TrainingArguments)) - model_args, data_args, training_args = parser.parse_args_into_dataclasses() - - paddle.set_device(training_args.device) - if paddle.distributed.get_world_size() > 1: - paddle.distributed.init_parallel_env() - - set_seed(training_args.seed) - - model = CodeGenForCausalLM.from_pretrained(model_args.model_name_or_path) - - tokenizer = CodeGenTokenizer.from_pretrained(model_args.model_name_or_path) - - train_set = load_dataset("json", data_files=data_args.train_file, split="train") - dev_set = load_dataset("json", data_files=data_args.validation_file, split="train") - - if data_args.block_size is None: - block_size = tokenizer.model_max_length - if block_size > 1024: - logger.warning( - f"The tokenizer picked seems to have a very large `model_max_length` ({tokenizer.model_max_length}). " - "Picking 1024 instead. You can change that default value by passing --block_size xxx." - ) - block_size = 1024 - else: - if data_args.block_size > tokenizer.model_max_length: - logger.warning( - f"The block_size passed ({data_args.block_size}) is larger than the maximum length for the model" - f"({tokenizer.model_max_length}). Using block_size={tokenizer.model_max_length}." 
- ) - block_size = min(data_args.block_size, tokenizer.model_max_length) - - train_set = process_ds(train_set, tokenizer, model_args.overwrite_cache, block_size) - dev_set = process_ds(dev_set, tokenizer, model_args.overwrite_cache, block_size) - - batchify_fn = DataCollatorWithPadding(tokenizer, return_attention_mask=True) - - trainer = Trainer( - model=model, - args=training_args, - train_dataset=train_set if training_args.do_train else None, - eval_dataset=dev_set if training_args.do_eval else None, - tokenizer=tokenizer, - data_collator=batchify_fn, - compute_metrics=compute_metrics, - ) - - if training_args.do_train: - train_results = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint) - metrics = train_results.metrics - trainer.save_model() - trainer.log_metrics("train", metrics) - trainer.save_metrics("train", metrics) - trainer.save_state() - - if training_args.do_eval: - eval_metrics = trainer.evaluate() - trainer.log_metrics("eval", eval_metrics) - - -if __name__ == "__main__": - do_train() diff --git a/examples/code_generation/codegen/run_clm.sh b/examples/code_generation/codegen/run_clm.sh deleted file mode 100644 index 4c76ea178e3f..000000000000 --- a/examples/code_generation/codegen/run_clm.sh +++ /dev/null @@ -1,30 +0,0 @@ -# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -python -m paddle.distributed.launch --gpus 0,1 run_clm.py \ - --model_name_or_path Salesforce/codegen-350M-mono \ - --block_size 1024 \ - --output_dir output \ - --train_file train.json \ - --validation_file test.json \ - --num_train_epochs 5 \ - --logging_steps 10 \ - --save_steps 1000 \ - --per_device_train_batch_size 2 \ - --per_device_eval_batch_size 2 \ - --learning_rate 1e-4 \ - --warmup_ratio 0.1 \ - --do_train \ - --do_eval \ - --device gpu diff --git a/examples/dialogue/dgu/args.py b/examples/dialogue/dgu/args.py deleted file mode 100644 index 4139474c906b..000000000000 --- a/examples/dialogue/dgu/args.py +++ /dev/null @@ -1,118 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
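-# Overview: parse_args() declares the CLI flags shared by all DGU tasks; most
-# of them default to None so that set_default_args() below can fill in
-# per-task values (save_steps, logging_steps, epochs, max_seq_len, batch
-# sizes, ...) keyed on --task_name.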
- -import argparse - - -# yapf: disable -def parse_args(): - parser = argparse.ArgumentParser(__doc__) - parser.add_argument("--task_name", default=None, type=str, required=True, help="The name of the task to train.") - parser.add_argument("--model_name_or_path", default='bert-base-uncased', type=str, help="Path to pre-trained bert model or shortcut name.") - parser.add_argument("--output_dir", default=None, type=str, help="The output directory where the checkpoints will be saved.") - parser.add_argument("--data_dir", default=None, type=str, help="The directory where the dataset will be load.") - parser.add_argument("--init_from_ckpt", default=None, type=str, help="The path of checkpoint to be loaded.") - parser.add_argument("--max_seq_len", default=None, type=int, help="The maximum total input sequence length after tokenization for trainng. Sequences longer than this will be truncated, sequences shorter will be padded.") - parser.add_argument("--test_max_seq_len", default=None, type=int, help="The maximum total input sequence length after tokenization for testing. Sequences longer than this will be truncated, sequences shorter will be padded.") - parser.add_argument("--batch_size", default=None, type=int, help="Batch size per GPU/CPU for training.") - parser.add_argument("--test_batch_size", default=None, type=int, help="Batch size per GPU/CPU for testing.") - parser.add_argument("--learning_rate", default=None, type=float, help="The initial learning rate for Adam.") - parser.add_argument("--weight_decay", default=0.01, type=float, help="Weight decay if we apply some.") - parser.add_argument("--epochs", default=None, type=int, help="Total number of training epochs to perform.") - parser.add_argument("--logging_steps", default=None, type=int, help="Log every X updates steps.") - parser.add_argument("--save_steps", default=None, type=int, help="Save checkpoint every X updates steps.") - parser.add_argument("--seed", default=42, type=int, help="Random seed for initialization.") - parser.add_argument("--warmup_proportion", default=0.1, type=float, help="The proportion of warmup.") - parser.add_argument("--max_grad_norm", default=1.0, type=float, help="The max value of grad norm.") - parser.add_argument("--do_train", default=True, type=eval, help="Whether training.") - parser.add_argument("--do_eval", default=True, type=eval, help="Whether evaluation.") - parser.add_argument("--do_test", default=True, type=eval, help="Whether testing.") - parser.add_argument("--device", type=str, default="gpu", help="Device for selecting for the training.") - - args = parser.parse_args() - return args -# yapf: enable - - -def set_default_args(args): - args.task_name = args.task_name.lower() - if args.task_name == "udc": - if not args.save_steps: - args.save_steps = 1000 - if not args.logging_steps: - args.logging_steps = 100 - if not args.epochs: - args.epochs = 2 - if not args.max_seq_len: - args.max_seq_len = 210 - if not args.test_batch_size: - args.test_batch_size = 100 - elif args.task_name == "dstc2": - if not args.save_steps: - args.save_steps = 400 - if not args.logging_steps: - args.logging_steps = 20 - if not args.epochs: - args.epochs = 40 - if not args.learning_rate: - args.learning_rate = 5e-5 - if not args.max_seq_len: - args.max_seq_len = 256 - if not args.test_max_seq_len: - args.test_max_seq_len = 512 - elif args.task_name == "atis_slot": - if not args.save_steps: - args.save_steps = 100 - if not args.logging_steps: - args.logging_steps = 10 - if not args.epochs: - args.epochs = 50 - elif 
args.task_name == "atis_intent": - if not args.save_steps: - args.save_steps = 100 - if not args.logging_steps: - args.logging_steps = 10 - if not args.epochs: - args.epochs = 20 - elif args.task_name == "mrda": - if not args.save_steps: - args.save_steps = 500 - if not args.logging_steps: - args.logging_steps = 200 - if not args.epochs: - args.epochs = 7 - elif args.task_name == "swda": - if not args.save_steps: - args.save_steps = 500 - if not args.logging_steps: - args.logging_steps = 200 - if not args.epochs: - args.epochs = 3 - else: - raise ValueError("Not support task: %s." % args.task_name) - - if not args.data_dir: - args.data_dir = "./DGU_datasets/" + args.task_name - if not args.output_dir: - args.output_dir = "./checkpoints/" + args.task_name - if not args.learning_rate: - args.learning_rate = 2e-5 - if not args.batch_size: - args.batch_size = 32 - if not args.test_batch_size: - args.test_batch_size = args.batch_size - if not args.max_seq_len: - args.max_seq_len = 128 - if not args.test_max_seq_len: - args.test_max_seq_len = args.max_seq_len diff --git a/examples/dialogue/dgu/data.py b/examples/dialogue/dgu/data.py deleted file mode 100644 index 469134f7cfc7..000000000000 --- a/examples/dialogue/dgu/data.py +++ /dev/null @@ -1,509 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import os -import numpy as np -from typing import List - -from paddle.io import Dataset - -# The input data bigin with '[CLS]', using '[SEP]' split conversation content( -# Previous part, current part, following part, etc.). If there are multiple -# conversation in split part, using 'INNER_SEP' to further split. -INNER_SEP = "[unused0]" - - -def get_label_map(label_list): - """Create label maps""" - label_map = {} - for (i, l) in enumerate(label_list): - label_map[l] = i - return label_map - - -class UDCv1(Dataset): - """ - The UDCv1 dataset is using in task Dialogue Response Selection. - The source dataset is UDCv1(Ubuntu Dialogue Corpus v1.0). 
See detail at - http://dataset.cs.mcgill.ca/ubuntu-corpus-1.0/ - """ - - MAX_LEN_OF_RESPONSE = 60 - LABEL_MAP = get_label_map(["0", "1"]) - - def __init__(self, data_dir, mode="train"): - super(UDCv1, self).__init__() - self._data_dir = data_dir - self._mode = mode - self.read_data() - - def read_data(self): - if self._mode == "train": - data_path = os.path.join(self._data_dir, "train.txt") - elif self._mode == "dev": - data_path = os.path.join(self._data_dir, "dev.txt-small") - elif self._mode == "test": - data_path = os.path.join(self._data_dir, "test.txt") - self.data = [] - with open(data_path, "r", encoding="utf8") as fin: - for line in fin: - if not line: - continue - arr = line.rstrip("\n").split("\t") - if len(arr) < 3: - print("Data format error: %s" % "\t".join(arr)) - print("Data row contains at least three parts: label\tconversation1\t.....\tresponse.") - continue - label = arr[0] - text_a = arr[1:-1] - text_b = arr[-1] - self.data.append([label, text_a, text_b]) - - @classmethod - def get_label(cls, label): - return cls.LABEL_MAP[label] - - @classmethod - def num_classes(cls): - return len(cls.LABEL_MAP) - - @classmethod - def convert_example(cls, example, tokenizer, max_seq_length=512): - """Convert a glue example into necessary features.""" - - def _truncate_and_concat(text_a: List[str], text_b: str, tokenizer, max_seq_length): - tokens_b = tokenizer.tokenize(text_b) - tokens_b = tokens_b[: min(cls.MAX_LEN_OF_RESPONSE, len(tokens_b))] - tokens_a = [] - for text in text_a: - tokens_a.extend(tokenizer.tokenize(text)) - tokens_a.append(INNER_SEP) - tokens_a = tokens_a[:-1] - if len(tokens_a) > max_seq_length - len(tokens_b) - 3: - tokens_a = tokens_a[len(tokens_a) - max_seq_length + len(tokens_b) + 3 :] - tokens, segment_ids = [], [] - tokens.extend([tokenizer.cls_token] + tokens_a + [tokenizer.sep_token]) - segment_ids.extend([0] * len(tokens)) - tokens.extend(tokens_b + [tokenizer.sep_token]) - segment_ids.extend([1] * (len(tokens_b) + 1)) - input_ids = tokenizer.convert_tokens_to_ids(tokens) - return input_ids, segment_ids - - label, text_a, text_b = example - label = np.array([cls.get_label(label)], dtype="int64") - input_ids, segment_ids = _truncate_and_concat(text_a, text_b, tokenizer, max_seq_length) - return input_ids, segment_ids, label - - def __getitem__(self, index): - return self.data[index] - - def __len__(self): - return len(self.data) - - -class DSTC2(Dataset): - """ - The dataset DSTC2 is using in task Dialogue State Tracking. - The source dataset is DSTC2(Dialog State Tracking Challenges 2). 
See detail at - https://github.com/matthen/dstc - """ - - LABEL_MAP = get_label_map([str(i) for i in range(217)]) - - def __init__(self, data_dir, mode="train"): - super(DSTC2, self).__init__() - self._data_dir = data_dir - self._mode = mode - self.read_data() - - def read_data(self): - def _concat_dialogues(examples): - """concat multi turns dialogues""" - new_examples = [] - max_turns = 20 - for i in range(len(examples)): - multi_turns = examples[max(i - max_turns, 0) : i + 1] - new_qa = "\1".join([example[0] for example in multi_turns]) - new_examples.append((new_qa.split("\1"), examples[i][1])) - return new_examples - - if self._mode == "train": - data_path = os.path.join(self._data_dir, "train.txt") - elif self._mode == "dev": - data_path = os.path.join(self._data_dir, "dev.txt") - elif self._mode == "test": - data_path = os.path.join(self._data_dir, "test.txt") - self.data = [] - with open(data_path, "r", encoding="utf8") as fin: - pre_idx = -1 - examples = [] - for line in fin: - if not line: - continue - arr = line.rstrip("\n").split("\t") - if len(arr) != 3: - print("Data format error: %s" % "\t".join(arr)) - print("Data row should contains three parts: id\tquestion\1answer\tlabel1 label2 ...") - continue - idx = arr[0] - qa = arr[1] - label_list = arr[2].split() - if idx != pre_idx: - if idx != 0: - examples = _concat_dialogues(examples) - self.data.extend(examples) - examples = [] - pre_idx = idx - examples.append((qa, label_list)) - if examples: - examples = _concat_dialogues(examples) - self.data.extend(examples) - - @classmethod - def get_label(cls, label): - return cls.LABEL_MAP[label] - - @classmethod - def num_classes(cls): - return len(cls.LABEL_MAP) - - @classmethod - def convert_example(cls, example, tokenizer, max_seq_length=512): - """Convert a glue example into necessary features.""" - - def _truncate_and_concat(texts: List[str], tokenizer, max_seq_length): - tokens = [] - for text in texts: - tokens.extend(tokenizer.tokenize(text)) - tokens.append(INNER_SEP) - tokens = tokens[:-1] - if len(tokens) > max_seq_length - 2: - tokens = tokens[len(tokens) - max_seq_length + 2 :] - tokens = [tokenizer.cls_token] + tokens + [tokenizer.sep_token] - segment_ids = [0] * len(tokens) - input_ids = tokenizer.convert_tokens_to_ids(tokens) - return input_ids, segment_ids - - texts, labels = example - input_ids, segment_ids = _truncate_and_concat(texts, tokenizer, max_seq_length) - labels = [cls.get_label(l) for l in labels] - label = np.zeros(cls.num_classes(), dtype="int64") - for l in labels: - label[l] = 1 - return input_ids, segment_ids, label - - def __getitem__(self, index): - return self.data[index] - - def __len__(self): - return len(self.data) - - -class ATIS_DSF(Dataset): - """ - The dataset ATIS_DSF is using in task Dialogue Slot Filling. - The source dataset is ATIS(Airline Travel Information System). 
See detail at - https://www.kaggle.com/siddhadev/ms-cntk-atis - """ - - LABEL_MAP = get_label_map([str(i) for i in range(130)]) - - def __init__(self, data_dir, mode="train"): - super(ATIS_DSF, self).__init__() - self._data_dir = data_dir - self._mode = mode - self.read_data() - - def read_data(self): - if self._mode == "train": - data_path = os.path.join(self._data_dir, "train.txt") - elif self._mode == "dev": - data_path = os.path.join(self._data_dir, "dev.txt") - elif self._mode == "test": - data_path = os.path.join(self._data_dir, "test.txt") - self.data = [] - with open(data_path, "r", encoding="utf8") as fin: - for line in fin: - if not line: - continue - arr = line.rstrip("\n").split("\t") - if len(arr) != 2: - print("Data format error: %s" % "\t".join(arr)) - print("Data row should contains two parts: conversation_content\tlabel1 label2 label3.") - continue - text = arr[0] - label_list = arr[1].split() - self.data.append([text, label_list]) - - @classmethod - def get_label(cls, label): - return cls.LABEL_MAP[label] - - @classmethod - def num_classes(cls): - return len(cls.LABEL_MAP) - - @classmethod - def convert_example(cls, example, tokenizer, max_seq_length=512): - """Convert a glue example into necessary features.""" - text, labels = example - tokens, label_list = [], [] - words = text.split() - assert len(words) == len(labels) - for word, label in zip(words, labels): - piece_words = tokenizer.tokenize(word) - tokens.extend(piece_words) - label = cls.get_label(label) - label_list.extend([label] * len(piece_words)) - if len(tokens) > max_seq_length - 2: - tokens = tokens[len(tokens) - max_seq_length + 2 :] - label_list = label_list[len(tokens) - max_seq_length + 2 :] - tokens = [tokenizer.cls_token] + tokens + [tokenizer.sep_token] - label_list = [0] + label_list + [0] - segment_ids = [0] * len(tokens) - input_ids = tokenizer.convert_tokens_to_ids(tokens) - label = np.array(label_list, dtype="int64") - return input_ids, segment_ids, label - - def __getitem__(self, index): - return self.data[index] - - def __len__(self): - return len(self.data) - - -class ATIS_DID(Dataset): - """ - The dataset ATIS_ID is using in task Dialogue Intent Detection. - The source dataset is ATIS(Airline Travel Information System). 
See detail at - https://www.kaggle.com/siddhadev/ms-cntk-atis - """ - - LABEL_MAP = get_label_map([str(i) for i in range(26)]) - - def __init__(self, data_dir, mode="train"): - super(ATIS_DID, self).__init__() - self._data_dir = data_dir - self._mode = mode - self.read_data() - - def read_data(self): - if self._mode == "train": - data_path = os.path.join(self._data_dir, "train.txt") - elif self._mode == "dev": - data_path = os.path.join(self._data_dir, "dev.txt") - elif self._mode == "test": - data_path = os.path.join(self._data_dir, "test.txt") - self.data = [] - with open(data_path, "r", encoding="utf8") as fin: - for line in fin: - if not line: - continue - arr = line.rstrip("\n").split("\t") - if len(arr) != 2: - print("Data format error: %s" % "\t".join(arr)) - print("Data row should contains two parts: label\tconversation_content.") - continue - label = arr[0] - text = arr[1] - self.data.append([label, text]) - - @classmethod - def get_label(cls, label): - return cls.LABEL_MAP[label] - - @classmethod - def num_classes(cls): - return len(cls.LABEL_MAP) - - @classmethod - def convert_example(cls, example, tokenizer, max_seq_length=512): - """Convert a glue example into necessary features.""" - label, text = example - tokens = tokenizer.tokenize(text) - if len(tokens) > max_seq_length - 2: - tokens = tokens[len(tokens) - max_seq_length + 2 :] - tokens = [tokenizer.cls_token] + tokens + [tokenizer.sep_token] - segment_ids = [0] * len(tokens) - input_ids = tokenizer.convert_tokens_to_ids(tokens) - label = np.array([cls.get_label(label)], dtype="int64") - return input_ids, segment_ids, label - - def __getitem__(self, index): - return self.data[index] - - def __len__(self): - return len(self.data) - - -def read_da_data(data_dir, mode): - def _concat_dialogues(examples): - """concat multi turns dialogues""" - new_examples = [] - for i in range(len(examples)): - label, caller, text = examples[i] - cur_txt = "%s : %s" % (caller, text) - pre_txt = ["%s : %s" % (item[1], item[2]) for item in examples[max(0, i - 5) : i]] - suf_txt = ["%s : %s" % (item[1], item[2]) for item in examples[i + 1 : min(len(examples), i + 3)]] - sample = [label, pre_txt, cur_txt, suf_txt] - new_examples.append(sample) - return new_examples - - if mode == "train": - data_path = os.path.join(data_dir, "train.txt") - elif mode == "dev": - data_path = os.path.join(data_dir, "dev.txt") - elif mode == "test": - data_path = os.path.join(data_dir, "test.txt") - data = [] - with open(data_path, "r", encoding="utf8") as fin: - pre_idx = -1 - examples = [] - for line in fin: - if not line: - continue - arr = line.rstrip("\n").split("\t") - if len(arr) != 4: - print("Data format error: %s" % "\t".join(arr)) - print("Data row should contains four parts: id\tlabel\tcaller\tconversation_content.") - continue - idx, label, caller, text = arr - if idx != pre_idx: - if idx != 0: - examples = _concat_dialogues(examples) - data.extend(examples) - examples = [] - pre_idx = idx - examples.append((label, caller, text)) - if examples: - examples = _concat_dialogues(examples) - data.extend(examples) - return data - - -def truncate_and_concat( - pre_txt: List[str], cur_txt: str, suf_txt: List[str], tokenizer, max_seq_length, max_len_of_cur_text -): - cur_tokens = tokenizer.tokenize(cur_txt) - cur_tokens = cur_tokens[: min(max_len_of_cur_text, len(cur_tokens))] - pre_tokens = [] - for text in pre_txt: - pre_tokens.extend(tokenizer.tokenize(text)) - pre_tokens.append(INNER_SEP) - pre_tokens = pre_tokens[:-1] - suf_tokens = [] - for text in 
suf_txt: - suf_tokens.extend(tokenizer.tokenize(text)) - suf_tokens.append(INNER_SEP) - suf_tokens = suf_tokens[:-1] - if len(cur_tokens) + len(pre_tokens) + len(suf_tokens) > max_seq_length - 4: - left_num = max_seq_length - 4 - len(cur_tokens) - if len(pre_tokens) > len(suf_tokens): - suf_num = int(left_num / 2) - suf_tokens = suf_tokens[:suf_num] - pre_num = left_num - len(suf_tokens) - pre_tokens = pre_tokens[max(0, len(pre_tokens) - pre_num) :] - else: - pre_num = int(left_num / 2) - pre_tokens = pre_tokens[max(0, len(pre_tokens) - pre_num) :] - suf_num = left_num - len(pre_tokens) - suf_tokens = suf_tokens[:suf_num] - tokens, segment_ids = [], [] - tokens.extend([tokenizer.cls_token] + pre_tokens + [tokenizer.sep_token]) - segment_ids.extend([0] * len(tokens)) - tokens.extend(cur_tokens + [tokenizer.sep_token]) - segment_ids.extend([1] * (len(cur_tokens) + 1)) - if suf_tokens: - tokens.extend(suf_tokens + [tokenizer.sep_token]) - segment_ids.extend([0] * (len(suf_tokens) + 1)) - input_ids = tokenizer.convert_tokens_to_ids(tokens) - return input_ids, segment_ids - - -class MRDA(Dataset): - """ - The dataset MRDA is using in task Dialogue Act. - The source dataset is MRDA(Meeting Recorder Dialogue Act). See detail at - https://www.aclweb.org/anthology/W04-2319.pdf - """ - - MAX_LEN_OF_CUR_TEXT = 50 - LABEL_MAP = get_label_map([str(i) for i in range(5)]) - - def __init__(self, data_dir, mode="train"): - super(MRDA, self).__init__() - self.data = read_da_data(data_dir, mode) - - @classmethod - def get_label(cls, label): - return cls.LABEL_MAP[label] - - @classmethod - def num_classes(cls): - return len(cls.LABEL_MAP) - - @classmethod - def convert_example(cls, example, tokenizer, max_seq_length=512): - """Convert a glue example into necessary features.""" - label, pre_txt, cur_txt, suf_txt = example - label = np.array([cls.get_label(label)], dtype="int64") - input_ids, segment_ids = truncate_and_concat( - pre_txt, cur_txt, suf_txt, tokenizer, max_seq_length, cls.MAX_LEN_OF_CUR_TEXT - ) - return input_ids, segment_ids, label - - def __getitem__(self, index): - return self.data[index] - - def __len__(self): - return len(self.data) - - -class SwDA(Dataset): - """ - The dataset SwDA is using in task Dialogue Act. - The source dataset is SwDA(Switchboard Dialog Act). See detail at - http://compprag.christopherpotts.net/swda.html - """ - - MAX_LEN_OF_CUR_TEXT = 50 - LABEL_MAP = get_label_map([str(i) for i in range(42)]) - - def __init__(self, data_dir, mode="train"): - super(SwDA, self).__init__() - self.data = read_da_data(data_dir, mode) - - @classmethod - def get_label(cls, label): - return cls.LABEL_MAP[label] - - @classmethod - def num_classes(cls): - return len(cls.LABEL_MAP) - - @classmethod - def convert_example(cls, example, tokenizer, max_seq_length=512): - """Convert a glue example into necessary features.""" - label, pre_txt, cur_txt, suf_txt = example - label = np.array([cls.get_label(label)], dtype="int64") - input_ids, segment_ids = truncate_and_concat( - pre_txt, cur_txt, suf_txt, tokenizer, max_seq_length, cls.MAX_LEN_OF_CUR_TEXT - ) - return input_ids, segment_ids, label - - def __getitem__(self, index): - return self.data[index] - - def __len__(self): - return len(self.data) diff --git a/examples/dialogue/dgu/main.py b/examples/dialogue/dgu/main.py deleted file mode 100644 index f5ca4faf4572..000000000000 --- a/examples/dialogue/dgu/main.py +++ /dev/null @@ -1,290 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. 
-# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import os -import random -import time -import numpy as np -from functools import partial - -import paddle -import paddle.nn as nn -import paddle.nn.functional as F -import paddle.distributed as dist -from paddle.io import DataLoader, DistributedBatchSampler, BatchSampler -from paddle.optimizer import AdamW -from paddle.metric import Accuracy - -from paddlenlp.datasets import MapDataset -from paddlenlp.data import Stack, Tuple, Pad -from paddlenlp.transformers import BertTokenizer, BertForSequenceClassification, BertForTokenClassification -from paddlenlp.transformers import LinearDecayWithWarmup - -from args import parse_args, set_default_args -import data -import metric - -TASK_CLASSES = { - "udc": (data.UDCv1, metric.RecallAtK), - "dstc2": (data.DSTC2, metric.JointAccuracy), - "atis_slot": (data.ATIS_DSF, metric.F1Score), - "atis_intent": (data.ATIS_DID, Accuracy), - "mrda": (data.MRDA, Accuracy), - "swda": (data.SwDA, Accuracy), -} - - -def set_seed(seed): - random.seed(seed) - np.random.seed(seed) - paddle.seed(seed) - - -def load_ckpt(args, model, optimizer=None): - if args.init_from_ckpt: - params_state_dict = paddle.load(args.init_from_ckpt + ".pdparams") - model.set_state_dict(params_state_dict) - if optimizer: - opt_state_dict = paddle.load(args.init_from_ckpt + ".pdopt") - optimizer.set_state_dict(opt_state_dict) - print("Loaded checkpoint from %s" % args.init_from_ckpt) - - -def save_ckpt(model, optimizer, output_dir, name): - params_path = os.path.join(output_dir, "{}.pdparams".format(name)) - opt_path = os.path.join(output_dir, "{}.pdopt".format(name)) - paddle.save(model.state_dict(), params_path) - paddle.save(optimizer.state_dict(), opt_path) - - -class DGULossFunction(nn.Layer): - def __init__(self, task_name): - super(DGULossFunction, self).__init__() - - self.task_name = task_name - self.loss_fn = self.get_loss_fn() - - def get_loss_fn(self): - if self.task_name in ["udc", "atis_slot", "atis_intent", "mrda", "swda"]: - return F.cross_entropy - elif self.task_name == "dstc2": - return nn.BCEWithLogitsLoss(reduction="sum") - - def forward(self, logits, labels): - if self.task_name in ["udc", "atis_intent", "mrda", "swda"]: - loss = self.loss_fn(logits, labels) - elif self.task_name == "dstc2": - loss = self.loss_fn(logits, paddle.cast(labels, dtype=logits.dtype)) - elif self.task_name == "atis_slot": - labels = paddle.unsqueeze(labels, axis=-1) - loss = self.loss_fn(logits, labels) - return loss - - -def print_logs(args, step, logits, labels, loss, total_time, metric): - if args.task_name in ["udc", "atis_intent", "mrda", "swda"]: - if args.task_name == "udc": - metric = Accuracy() - metric.reset() - correct = metric.compute(logits, labels) - metric.update(correct) - acc = metric.accumulate() - print("step %d - loss: %.4f - acc: %.4f - %.3fs/step" % (step, loss, acc, total_time / args.logging_steps)) - elif args.task_name == "dstc2": - metric.reset() - metric.update(logits, labels) - joint_acc = metric.accumulate() - print( - 
"step %d - loss: %.4f - joint_acc: %.4f - %.3fs/step" - % (step, loss, joint_acc, total_time / args.logging_steps) - ) - elif args.task_name == "atis_slot": - metric.reset() - metric.update(logits, labels) - f1_micro = metric.accumulate() - print( - "step %d - loss: %.4f - f1_micro: %.4f - %.3fs/step" - % (step, loss, f1_micro, total_time / args.logging_steps) - ) - - -def train(args, model, train_data_loader, dev_data_loader, metric, n_procs, rank): - num_examples = len(train_data_loader) * args.batch_size * n_procs - max_train_steps = args.epochs * len(train_data_loader) - print("\nNum train examples: %d" % num_examples) - print("Max train steps: %d" % max_train_steps) - print("Warmup proportion: %.2f" % args.warmup_proportion) - - lr_scheduler = LinearDecayWithWarmup(args.learning_rate, max_train_steps, args.warmup_proportion) - - # Generate parameter names needed to perform weight decay. - # All bias and LayerNorm parameters are excluded. - decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] - optimizer = AdamW( - learning_rate=lr_scheduler, - parameters=model.parameters(), - weight_decay=args.weight_decay, - apply_decay_param_fun=lambda x: x in decay_params, - grad_clip=nn.ClipGradByGlobalNorm(args.max_grad_norm), - ) - loss_fn = DGULossFunction(args.task_name) - - load_ckpt(args, model, optimizer) - - step = 0 - best_metric = 0.0 - total_time = 0.0 - for epoch in range(args.epochs): - print("\nEpoch %d/%d" % (epoch + 1, args.epochs)) - batch_start_time = time.time() - for batch in train_data_loader: - step += 1 - input_ids, segment_ids, labels = batch - logits = model(input_ids, segment_ids) - loss = loss_fn(logits, labels) - loss.backward() - optimizer.step() - lr_scheduler.step() - optimizer.clear_grad() - total_time += time.time() - batch_start_time - if step % args.logging_steps == 0: - print_logs(args, step, logits, labels, loss, total_time, metric) - total_time = 0.0 - if step % args.save_steps == 0 or step == max_train_steps: - if rank == 0: - save_ckpt(model, optimizer, args.output_dir, step) - if args.do_eval: - print("\nEval begin...") - metric_out = evaluation(args, model, dev_data_loader, metric) - if rank == 0 and metric_out > best_metric: - best_metric = metric_out - save_ckpt(model, optimizer, args.output_dir, "best") - print("Best model, step: %d\n" % step) - batch_start_time = time.time() - - -@paddle.no_grad() -def evaluation(args, model, data_loader, metric): - model.eval() - metric.reset() - for batch in data_loader: - input_ids, segment_ids, labels = batch - logits = model(input_ids, segment_ids) - if args.task_name in ["atis_intent", "mrda", "swda"]: - correct = metric.compute(logits, labels) - metric.update(correct) - else: - metric.update(logits, labels) - model.train() - metric_out = metric.accumulate() - print("Total samples: %d" % (len(data_loader) * args.test_batch_size)) - if args.task_name == "udc": - print("R1@10: %.4f - R2@10: %.4f - R5@10: %.4f\n" % (metric_out[0], metric_out[1], metric_out[2])) - return metric_out[0] - elif args.task_name == "dstc2": - print("Joint_acc: %.4f\n" % metric_out) - return metric_out - elif args.task_name == "atis_slot": - print("F1_micro: %.4f\n" % metric_out) - return metric_out - elif args.task_name in ["atis_intent", "mrda", "swda"]: - print("Acc: %.4f\n" % metric_out) - return metric_out - - -def create_data_loader(args, dataset_class, trans_func, batchify_fn, mode): - dataset = dataset_class(args.data_dir, mode) - dataset = MapDataset(dataset).map(trans_func, 
lazy=True) - if mode == "train": - batch_sampler = DistributedBatchSampler(dataset, batch_size=args.batch_size, shuffle=True) - else: - batch_sampler = BatchSampler(dataset, batch_size=args.test_batch_size, shuffle=False) - data_loader = DataLoader(dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) - return data_loader - - -def main(args): - paddle.set_device(args.device) - world_size = dist.get_world_size() - rank = dist.get_rank() - if world_size > 1 and args.do_train: - dist.init_parallel_env() - - set_seed(args.seed) - - dataset_class, metric_class = TASK_CLASSES[args.task_name] - tokenizer = BertTokenizer.from_pretrained(args.model_name_or_path) - trans_func = partial(dataset_class.convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_len) - test_trans_func = partial(dataset_class.convert_example, tokenizer=tokenizer, max_seq_length=args.test_max_seq_len) - metric = metric_class() - - if args.task_name in ("udc", "dstc2", "atis_intent", "mrda", "swda"): - batchify_fn = lambda samples, fn=Tuple( - Pad(axis=0, pad_val=tokenizer.pad_token_id), # input - Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # segment - Stack(dtype="int64"), # label - ): fn(samples) - model = BertForSequenceClassification.from_pretrained( - args.model_name_or_path, num_classes=dataset_class.num_classes() - ) - elif args.task_name == "atis_slot": - batchify_fn = lambda samples, fn=Tuple( - Pad(axis=0, pad_val=tokenizer.pad_token_id), # input - Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # segment - Pad(axis=0, pad_val=0, dtype="int64"), # label - ): fn(samples) - model = BertForTokenClassification.from_pretrained( - args.model_name_or_path, num_classes=dataset_class.num_classes(), dropout=0.0 - ) - if world_size > 1 and args.do_train: - model = paddle.DataParallel(model) - - if args.do_train: - train_data_loader = create_data_loader(args, dataset_class, trans_func, batchify_fn, "train") - if args.do_eval: - dev_data_loader = create_data_loader(args, dataset_class, test_trans_func, batchify_fn, "dev") - else: - dev_data_loader = None - train(args, model, train_data_loader, dev_data_loader, metric, world_size, rank) - - if args.do_test: - if rank == 0: - test_data_loader = create_data_loader(args, dataset_class, test_trans_func, batchify_fn, "test") - if args.do_train: - # If do_eval=True, use best model to evaluate the test data. - # Otherwise, use final model to evaluate the test data. - if args.do_eval: - args.init_from_ckpt = os.path.join(args.output_dir, "best") - load_ckpt(args, model) - else: - if not args.init_from_ckpt: - raise ValueError('"init_from_ckpt" should be set.') - load_ckpt(args, model) - print("\nTest begin...") - evaluation(args, model, test_data_loader, metric) - - -def print_args(args): - print("----------- Configuration Arguments -----------") - for arg, value in sorted(vars(args).items()): - print("%s: %s" % (arg, value)) - print("------------------------------------------------") - - -if __name__ == "__main__": - args = parse_args() - set_default_args(args) - print_args(args) - - main(args) diff --git a/examples/dialogue/dgu/metric.py b/examples/dialogue/dgu/metric.py deleted file mode 100644 index b5ef869f768c..000000000000 --- a/examples/dialogue/dgu/metric.py +++ /dev/null @@ -1,245 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import numpy as np
-
-import paddle
-import paddle.nn as nn
-from paddle.metric import Metric
-
-
-class RecallAtK(Metric):
-    """
-    Recall@K is the fraction of relevant results among the retrieved Top K
-    results, used to evaluate the performance of Dialogue Response Selection.
-
-    Note that this class manages the Recall@K score only for binary
-    classification tasks.
-    """
-
-    def __init__(self, name="Recall@K", *args, **kwargs):
-        super(RecallAtK, self).__init__(*args, **kwargs)
-        self._name = name
-        self.softmax = nn.Softmax()
-        self.reset()
-
-    def reset(self):
-        """
-        Resets all of the metric state.
-        """
-        self.num_samples = 0
-        self.p_at_1_in_10 = 0.0
-        self.p_at_2_in_10 = 0.0
-        self.p_at_5_in_10 = 0.0
-
-    def get_p_at_n_in_m(self, data, n, m, idx):
-        """
-        Return 1 if the positive candidate at ``idx`` ranks within the top n
-        of the m candidates, else 0.
-        """
-        pos_score = data[idx][0]
-        curr = data[idx : idx + m]
-        curr = sorted(curr, key=lambda x: x[0], reverse=True)
-        if curr[n - 1][0] <= pos_score:
-            return 1
-        return 0
-
-    def update(self, logits, labels):
-        """
-        Update the states based on the current mini-batch prediction results.
-
-        Args:
-            logits (Tensor): The predicted value is a Tensor with
-                shape [batch_size, 2] and type float32 or float64.
-            labels (Tensor): The ground truth value is a 2D Tensor,
-                its shape is [batch_size, 1] and type is int64.
-        """
-        probs = self.softmax(logits)
-        probs = probs.numpy()
-        labels = labels.numpy()
-        assert probs.shape[0] == labels.shape[0]
-        data = []
-        for prob, label in zip(probs, labels):
-            data.append((prob[1], label))
-        assert len(data) % 10 == 0
-
-        length = int(len(data) / 10)
-        self.num_samples += length
-        for i in range(length):
-            idx = i * 10
-            assert data[idx][1] == 1
-            self.p_at_1_in_10 += self.get_p_at_n_in_m(data, 1, 10, idx)
-            self.p_at_2_in_10 += self.get_p_at_n_in_m(data, 2, 10, idx)
-            self.p_at_5_in_10 += self.get_p_at_n_in_m(data, 5, 10, idx)
-
-    def accumulate(self):
-        """
-        Calculate the final Recall@K.
-
-        Returns:
-            A list of scalar floats: results of the calculated R1@K, R2@K, R5@K.
-        """
-        metrics_out = [
-            self.p_at_1_in_10 / self.num_samples,
-            self.p_at_2_in_10 / self.num_samples,
-            self.p_at_5_in_10 / self.num_samples,
-        ]
-        return metrics_out
-
-    def name(self):
-        """
-        Returns metric name
-        """
-        return self._name
-
-
-class JointAccuracy(Metric):
-    """
-    The joint accuracy rate is used to evaluate the performance of multi-turn
-    Dialogue State Tracking. For each turn, the dialogue state prediction is
-    considered correct if and only if every state in state_list is predicted
-    correctly; the joint accuracy is then 1, and 0 otherwise.
-    """
-
-    def __init__(self, name="JointAccuracy", *args, **kwargs):
-        super(JointAccuracy, self).__init__(*args, **kwargs)
-        self._name = name
-        self.sigmoid = nn.Sigmoid()
-        self.reset()
-
-    def reset(self):
-        """
-        Resets all of the metric state.
-        """
-        self.num_samples = 0
-        self.correct_joint = 0.0
-
-    def update(self, logits, labels):
-        """
-        Update the states based on the current mini-batch prediction results.
-
-        Args:
-            logits (Tensor): The predicted value is a Tensor with
-                shape [batch_size, num_classes] and type float32 or float64.
-            labels (Tensor): The ground truth value is a 2D Tensor,
-                its shape is [batch_size, num_classes] and type is int64.
-        """
-        probs = self.sigmoid(logits)
-        probs = probs.numpy()
-        labels = labels.numpy()
-        assert probs.shape[0] == labels.shape[0]
-        assert probs.shape[1] == labels.shape[1]
-        for i in range(probs.shape[0]):
-            pred, refer = [], []
-            for j in range(probs.shape[1]):
-                if probs[i][j] >= 0.5:
-                    pred.append(j)
-                if labels[i][j] == 1:
-                    refer.append(j)
-            if not pred:
-                pred = [np.argmax(probs[i])]
-            if pred == refer:
-                self.correct_joint += 1
-        self.num_samples += probs.shape[0]
-
-    def accumulate(self):
-        """
-        Calculate the final JointAccuracy.
-
-        Returns:
-            A scalar float: results of the calculated JointAccuracy.
-        """
-        joint_acc = self.correct_joint / self.num_samples
-        return joint_acc
-
-    def name(self):
-        """
-        Returns metric name
-        """
-        return self._name
-
-
-class F1Score(Metric):
-    """
-    F1-score is the harmonic mean of precision and recall. Micro-averaging
-    builds a global confusion matrix over all examples and then calculates
-    the F1-score from it. This class is used to evaluate the performance of
-    Dialogue Slot Filling.
-    """
-
-    def __init__(self, name="F1Score", *args, **kwargs):
-        super(F1Score, self).__init__(*args, **kwargs)
-        self._name = name
-        self.reset()
-
-    def reset(self):
-        """
-        Resets all of the metric state.
-        """
-        self.tp = {}
-        self.fn = {}
-        self.fp = {}
-
-    def update(self, logits, labels):
-        """
-        Update the states based on the current mini-batch prediction results.
-
-        Args:
-            logits (Tensor): The predicted value is a Tensor with
-                shape [batch_size, seq_len, num_classes] and type float32 or
-                float64.
-            labels (Tensor): The ground truth value is a 2D Tensor,
-                its shape is [batch_size, seq_len] and type is int64.
-        """
-        probs = paddle.argmax(logits, axis=-1)
-        probs = probs.numpy()
-        labels = labels.numpy()
-        assert probs.shape[0] == labels.shape[0]
-        assert probs.shape[1] == labels.shape[1]
-        for i in range(probs.shape[0]):
-            start, end = 1, probs.shape[1]
-            while end > start:
-                if labels[i][end - 1] != 0:
-                    break
-                end -= 1
-            prob, label = probs[i][start:end], labels[i][start:end]
-            for y_pred, y in zip(prob, label):
-                if y_pred == y:
-                    self.tp[y] = self.tp.get(y, 0) + 1
-                else:
-                    self.fp[y_pred] = self.fp.get(y_pred, 0) + 1
-                    self.fn[y] = self.fn.get(y, 0) + 1
-
-    def accumulate(self):
-        """
-        Calculate the final micro F1 score.
-
-        Returns:
-            A scalar float: results of the calculated micro F1 score.
-        """
-        tp_total = sum(self.tp.values())
-        fn_total = sum(self.fn.values())
-        fp_total = sum(self.fp.values())
-        # Guard against a zero denominator when no true positives were seen.
-        if tp_total == 0:
-            return 0.0
-        p_total = float(tp_total) / (tp_total + fp_total)
-        r_total = float(tp_total) / (tp_total + fn_total)
-        f1_micro = 2 * p_total * r_total / (p_total + r_total)
-        return f1_micro
-
-    def name(self):
-        """
-        Returns metric name
-        """
-        return self._name
diff --git a/examples/dialogue/lic2021_baseline/args.py b/examples/dialogue/lic2021_baseline/args.py
deleted file mode 100644
index cc08c7d618f6..000000000000
--- a/examples/dialogue/lic2021_baseline/args.py
+++ /dev/null
@@ -1,58 +0,0 @@
-# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import argparse - - -# yapf: disable -def parse_args(): - parser = argparse.ArgumentParser(__doc__) - parser.add_argument('--model_name_or_path', type=str, default='unified_transformer-12L-cn', help='The path or shortcut name of the pre-trained model.') - parser.add_argument('--save_dir', type=str, default='./checkpoints', help='The directory where the checkpoints will be saved.') - parser.add_argument('--output_path', type=str, default='./predict.txt', help='The file path where the infer result will be saved.') - parser.add_argument('--train_data_path', type=str, default='./datasets/train.txt', help='Specify the path to load train data.') - parser.add_argument('--valid_data_path', type=str, default='./datasets/valid.txt', help='Specify the path to load valid data.') - parser.add_argument('--test_data_path', type=str, default='./datasets/test.txt', help='Specify the path to load test data.') - parser.add_argument('--logging_steps', type=int, default=500, help='Log every X updates steps.') - parser.add_argument('--save_steps', type=int, default=8000, help='Save checkpoint every X updates steps.') - parser.add_argument('--seed', type=int, default=2021, help='Random seed for initialization.') - parser.add_argument('--batch_size', type=int, default=8192, required=True, help='Batch size per GPU/CPU for training.') - parser.add_argument('--lr', type=float, default=1e-5, help='The initial learning rate.') - parser.add_argument('--weight_decay', type=float, default=0.01, help='The weight decay for optimizer.') - parser.add_argument('--epochs', type=int, default=10, help='Total number of training epochs to perform.') - parser.add_argument('--warmup_steps', type=int, default=4000, help='The number of warmup steps.') - parser.add_argument('--max_grad_norm', type=float, default=0.1, help='The max value of grad norm.') - parser.add_argument('--sort_pool_size', type=int, default=65536, help='The pool size for sort in build batch data.') - parser.add_argument('--min_dec_len', type=int, default=1, help='The minimum sequence length of generation.') - parser.add_argument('--max_dec_len', type=int, default=64, help='The maximum sequence length of generation.') - parser.add_argument('--num_samples', type=int, default=1, help='The decode numbers in generation.') - parser.add_argument('--decode_strategy', type=str, default='sampling', help='The decode strategy in generation.') - parser.add_argument('--top_k', type=int, default=0, help='The number of highest probability vocabulary tokens to keep for top-k sampling.') - parser.add_argument('--temperature', type=float, default=1.0, help='The value used to module the next token probabilities.') - parser.add_argument('--top_p', type=float, default=1.0, help='The cumulative probability for top-p sampling.') - parser.add_argument('--num_beams', type=int, default=0, help='The number of beams for beam search.') - parser.add_argument('--length_penalty', type=float, default=1.0, help='The exponential penalty to the sequence length for beam search.') - parser.add_argument('--early_stopping', type=eval, default=False, help='Whether to stop the beam search when at 
least `num_beams` sentences are finished per batch or not.') - parser.add_argument('--device', type=str, default='gpu', help='Device for selecting for the training.') - - args = parser.parse_args() - return args -# yapf: enable - - -def print_args(args): - print("----------- Configuration Arguments -----------") - for arg, value in sorted(vars(args).items()): - print("%s: %s" % (arg, value)) - print("------------------------------------------------") diff --git a/examples/dialogue/lic2021_baseline/data.py b/examples/dialogue/lic2021_baseline/data.py deleted file mode 100644 index 6bc938d1b729..000000000000 --- a/examples/dialogue/lic2021_baseline/data.py +++ /dev/null @@ -1,258 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -from glob import glob - -import numpy as np -import paddle.distributed as dist -from paddle.io import IterableDataset - -from paddlenlp.transformers.tokenizer_utils import convert_to_unicode - - -class DialogueDataset(IterableDataset): - def __init__( - self, - filepattern, - batch_size, - pad_token_id, - bos_token_id, - sort_pool_size=2**16, - seed=1, - n_procs=None, - rank=None, - mode="test", - ): - super(DialogueDataset, self).__init__() - - self.file_list = glob(filepattern) - self.sort_pool_size = 0 if mode == "test" else sort_pool_size - self.n_procs = n_procs if n_procs else dist.get_world_size() - self.rank = rank if rank else dist.get_rank() - self.batch_size = batch_size * self.n_procs - self.shuffle = True if mode == "train" else False - self.mode = mode - self.pad_id = pad_token_id - self.bos_id = bos_token_id - self.global_rng = np.random.RandomState(seed) - - assert len(self.file_list) > 0, "There is no files in %s." 
% filepattern - - def load_file(self, file_path): - with open(file_path, "r", encoding="utf-8") as fin: - for i, line in enumerate(fin): - cols = convert_to_unicode(line).strip().split(";") - cols = list(map(lambda x: list(map(int, x.split(" "))), cols)) - if len(cols) > 3: - cols = cols[:3] - token_ids, type_ids, pos_ids = cols - if self.mode == "test": - tgt_start_idx = len(cols[0]) - else: - tgt_start_idx = token_ids.index(self.bos_id, 1) - sample = [token_ids, type_ids, pos_ids, tgt_start_idx] - yield sample - - def get_sorted_batch(self, pool): - """Generate sorted batches from pool.""" - pool = sorted(pool, key=lambda sample: len(sample[0])) - batches = [] - batch, max_len = [], 0 - for sample in pool: - max_len = max(max_len, len(sample[0])) - if self.mode == "test": - to_append = len(batch) < self.batch_size - else: - to_append = (len(batch) + 1) * max_len <= self.batch_size - if to_append: - batch.append(sample) - else: - batches.append(batch) - batch, max_len = [sample], len(sample[0]) - if len(batch) > 0: - batches.append(batch) - if self.shuffle: - self.global_rng.shuffle(batches) - for batch in batches: - yield batch - - @property - def get_batch(self): - all_files = list(self.file_list) - if self.shuffle: - self.global_rng.shuffle(all_files) - if self.sort_pool_size > 0: - pool = [] - for file_path in all_files: - for sample in self.load_file(file_path): - pool.append(sample) - if len(pool) == self.sort_pool_size: - for batch in self.get_sorted_batch(pool): - yield batch - pool = [] - if len(pool) > 0: - for batch in self.get_sorted_batch(pool): - yield batch - else: - batch, max_len = [], 0 - for file_path in all_files: - for sample in self.load_file(file_path): - max_len = max(max_len, len(sample[0])) - if self.mode == "test": - to_append = len(batch) < self.batch_size - else: - to_append = (len(batch) + 1) * max_len <= self.batch_size - if to_append: - batch.append(sample) - else: - yield batch - batch, max_len = [sample], len(sample[0]) - if len(batch) > 0: - yield batch - - def pad_batch_data(self, batch): - """Pad the instances to the max sequence length in batch.""" - max_len = max(map(len, batch)) - batch_data = np.array([list(data) + [self.pad_id] * (max_len - len(data)) for data in batch], dtype="int64") - return batch_data - - def gen_tgt_label_and_pos(self, batch_token_ids, batch_tgt_start_idx): - max_len = max(map(len, batch_token_ids)) - tgt_label = [] - tgt_pos = [] - for sent_index, sent in enumerate(batch_token_ids): - sent_b_index = batch_tgt_start_idx[sent_index] - tgt_label.extend(sent[sent_b_index + 1 :]) - tgt_pos.extend([sent_index * max_len + i for i in range(sent_b_index, len(sent) - 1)]) - tgt_label = np.array(tgt_label).astype("int64") - tgt_pos = np.array(tgt_pos).astype("int64") - - return tgt_label, tgt_pos - - def gen_self_attn_mask(self, batch_token_ids, batch_tgt_start_idx): - max_len = max(map(len, batch_token_ids)) - input_mask_data = np.zeros((len(batch_token_ids), max_len, max_len)) - for index, mask_data in enumerate(input_mask_data): - start = batch_tgt_start_idx[index] - end = len(batch_token_ids[index]) - mask_data[:end, :start] = 1.0 - # Generate the lower triangular matrix using the slice of matrix - b = np.tril(np.ones([end - start, end - start]), 0) - mask_data[start:end, start:end] = b - return input_mask_data.astype("float32") - - def __iter__(self): - for batch_data in self.get_batch: - # sample [token_ids, type_ids, pos_ids, tgt_start_idx] - # raw_batch [sample0, sample1, ...] 
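-            # With multiple workers, shard the raw batch so that each rank
-            # keeps every n_procs-th sample starting at its own rank index.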
-            if self.n_procs > 1:
-                batch_data = batch_data[self.rank :: self.n_procs]
-            batch_data = zip(*batch_data)
-            token_ids, type_ids, pos_ids, tgt_start_idx = batch_data
-
-            pad_token_ids = self.pad_batch_data(token_ids)
-            pad_type_ids = self.pad_batch_data(type_ids)
-            pad_pos_ids = self.pad_batch_data(pos_ids)
-
-            generation_mask = self.gen_self_attn_mask(token_ids, tgt_start_idx)
-
-            if self.mode == "test":
-                # [batch_size, 1]
-                tgt_ids = np.array([[self.bos_id]] * len(token_ids), dtype="int64")
-                tgt_type = np.ones((len(token_ids), 1), dtype="int64")
-                tgt_pos = np.array(tgt_start_idx, dtype="int64").reshape(-1, 1)
-                tgt_generation_mask = generation_mask[:, 0:1, :].astype("float32")
-
-                pad_token_ids = np.concatenate((pad_token_ids, tgt_ids), axis=1)
-                pad_type_ids = np.concatenate((pad_type_ids, tgt_type), axis=1)
-                pad_pos_ids = np.concatenate((pad_pos_ids, tgt_pos), axis=1)
-                generation_mask = np.concatenate((generation_mask, tgt_generation_mask), axis=1)
-
-                append_mask = np.zeros((generation_mask.shape[0], generation_mask.shape[1], 1), dtype="float32")
-                append_mask[:, -1, :] = 1.0
-                generation_mask = np.concatenate((generation_mask, append_mask), axis=2)
-                generation_mask = (generation_mask - 1.0) * 1e9
-                generation_mask = np.expand_dims(generation_mask, axis=1)
-                yield (pad_token_ids, pad_type_ids, pad_pos_ids, generation_mask)
-            else:
-                tgt_label, tgt_pos = self.gen_tgt_label_and_pos(token_ids, tgt_start_idx)
-                generation_mask = (generation_mask - 1.0) * 1e9
-                generation_mask = np.expand_dims(generation_mask, axis=1)
-                yield (pad_token_ids, pad_type_ids, pad_pos_ids, generation_mask, tgt_label, tgt_pos)
-
-
-def post_process_response(token_ids, tokenizer):
-    """
-    Post-process the decoded sequence: truncate everything from the first
-    ``[SEP]`` (end-of-sequence) token and convert the kept ids to tokens.
- """ - eos_pos = len(token_ids) - for i, tok_id in enumerate(token_ids): - if tok_id == tokenizer.sep_token_id: - eos_pos = i - break - token_ids = token_ids[:eos_pos] - tokens = tokenizer.convert_ids_to_tokens(token_ids) - response = tokenizer.merge_subword(tokens) - return token_ids, response - - -def get_in_turn_repetition(pred, is_cn=False): - """Get in-turn repetition.""" - if len(pred) == 0: - return 1.0 - if isinstance(pred[0], str): - pred = [tok.lower() for tok in pred] - if is_cn: - pred = "".join(pred) - tri_grams = set() - for i in range(len(pred) - 2): - tri_gram = tuple(pred[i : i + 3]) - if tri_gram in tri_grams: - return True - tri_grams.add(tri_gram) - return False - - -def select_response(ids, scores, tokenizer, max_dec_len=None, num_samples=1): - ids = ids.numpy().tolist() - scores = scores.numpy() - - if len(ids) != len(scores) or (len(ids) % num_samples) != 0: - raise ValueError("the length of `ids` is {}, but the `num_samples` is {}".format(len(ids), num_samples)) - - group = [] - tmp = [] - for pred, score in zip(ids, scores): - pred_token_ids, pred_tokens = post_process_response(pred, tokenizer) - num_token = len(pred_token_ids) - response = " ".join(pred_tokens) - - in_turn_repetition = get_in_turn_repetition(pred_tokens, True) or get_in_turn_repetition(pred_token_ids) - # not ending - if max_dec_len is not None and num_token >= max_dec_len: - score -= 1e3 - elif in_turn_repetition: - score -= 1e3 - - tmp.append([response, score]) - if len(tmp) == num_samples: - group.append(tmp) - tmp = [] - - results = [] - for preds in group: - preds = sorted(preds, key=lambda x: -x[1]) - results.append(preds[0][0]) - return results diff --git a/examples/dialogue/lic2021_baseline/finetune.py b/examples/dialogue/lic2021_baseline/finetune.py deleted file mode 100644 index 8f74d75ef84b..000000000000 --- a/examples/dialogue/lic2021_baseline/finetune.py +++ /dev/null @@ -1,149 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
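-# Fine-tuning entry for the LIC2021 dialogue baseline: trains
-# UnifiedTransformerLMHeadModel with token-level cross-entropy and
-# periodically reports perplexity on the validation set.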
- -import math -import os -import time - -import paddle -import paddle.distributed as dist -import paddle.nn as nn -import paddle.nn.functional as F -from args import parse_args, print_args -from data import DialogueDataset -from paddle.io import DataLoader -from paddle.optimizer import AdamW -from paddle.optimizer.lr import NoamDecay - -from paddlenlp.transformers import ( - UnifiedTransformerLMHeadModel, - UnifiedTransformerTokenizer, -) - - -def save_ckpt(model, tokenizer, save_dir, name): - output_dir = os.path.join(save_dir, "model_{}".format(name)) - if not os.path.exists(output_dir): - os.makedirs(output_dir) - # Need better way to get inner model of DataParallel - model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model - model_to_save.save_pretrained(output_dir) - tokenizer.save_pretrained(output_dir) - - -def main(args): - paddle.set_device(args.device) - paddle.seed(args.seed) - world_size = dist.get_world_size() - if world_size > 1: - dist.init_parallel_env() - - model = UnifiedTransformerLMHeadModel.from_pretrained(args.model_name_or_path) - tokenizer = UnifiedTransformerTokenizer.from_pretrained(args.model_name_or_path) - if world_size > 1: - model = paddle.DataParallel(model) - - train_dataset = DialogueDataset( - args.train_data_path, - args.batch_size, - tokenizer.pad_token_id, - tokenizer.cls_token_id, - args.sort_pool_size, - args.seed, - mode="train", - ) - train_dataloader = DataLoader(train_dataset, return_list=True, batch_size=None) - valid_dataset = DialogueDataset( - args.valid_data_path, - args.batch_size, - tokenizer.pad_token_id, - tokenizer.cls_token_id, - args.sort_pool_size, - mode="valid", - ) - valid_dataloader = DataLoader(valid_dataset, return_list=True, batch_size=None) - - lr_scheduler = NoamDecay(1 / (args.warmup_steps * (args.lr**2)), args.warmup_steps) - # Generate parameter names needed to perform weight decay. - # All bias and LayerNorm parameters are excluded. 
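-    # The exclusion below matches by substring ("bias"/"norm") on the names
-    # from model.named_parameters(); NoamDecay warms the learning rate up to
-    # args.lr at step args.warmup_steps, then decays it as 1/sqrt(step).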
- decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] - optimizer = AdamW( - learning_rate=lr_scheduler, - parameters=model.parameters(), - weight_decay=args.weight_decay, - apply_decay_param_fun=lambda x: x in decay_params, - grad_clip=nn.ClipGradByGlobalNorm(args.max_grad_norm), - ) - - step = 0 - total_time = 0.0 - for epoch in range(args.epochs): - print("\nEpoch %d/%d" % (epoch + 1, args.epochs)) - batch_start_time = time.time() - for inputs in train_dataloader: - step += 1 - token_ids, type_ids, pos_ids, generation_mask, tgt_label, tgt_pos = inputs - - logits = model(token_ids, type_ids, pos_ids, generation_mask, tgt_pos) - loss = F.cross_entropy(logits, tgt_label) - loss.backward() - optimizer.step() - lr_scheduler.step() - optimizer.clear_grad() - - total_time += time.time() - batch_start_time - if step % args.logging_steps == 0: - ppl = paddle.exp(loss) - print( - "step %d - loss: %.4f - ppl: %.4f - lr: %.7f - %.3fs/step" - % (step, loss, ppl, optimizer.get_lr(), total_time / args.logging_steps) - ) - total_time = 0.0 - if step % args.save_steps == 0: - evaluation(model, valid_dataloader) - if dist.get_rank() == 0: - save_ckpt(model, tokenizer, args.save_dir, step) - batch_start_time = time.time() - - -@paddle.no_grad() -def evaluation(model, data_loader): - print("\nEval begin...") - model.eval() - total_tokens = 0 - total_loss = 0.0 - start_time = time.time() - step = 0 - for inputs in data_loader: - step += 1 - token_ids, type_ids, pos_ids, generation_mask, tgt_label, tgt_pos = inputs - - logits = model(token_ids, type_ids, pos_ids, generation_mask, tgt_pos) - loss = F.cross_entropy(logits, tgt_label, reduction="sum") - - total_loss += float(loss.numpy()) - total_tokens += tgt_label.shape[0] - - avg_loss = total_loss / total_tokens - ppl = math.exp(avg_loss) - avg_speed = (time.time() - start_time) / step - print("loss: %.4f - ppl: %.4f - %.3fs/step\n" % (avg_loss, ppl, avg_speed)) - model.train() - - -if __name__ == "__main__": - args = parse_args() - print_args(args) - - main(args) diff --git a/examples/dialogue/lic2021_baseline/infer.py b/examples/dialogue/lic2021_baseline/infer.py deleted file mode 100644 index b41cb6fcf2d0..000000000000 --- a/examples/dialogue/lic2021_baseline/infer.py +++ /dev/null @@ -1,88 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
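-# Batch inference for the LIC2021 dialogue baseline: decodes candidate
-# responses with model.generate() and writes the best-scoring one per input
-# (ranked by select_response from data.py) to args.output_path.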
- -import time - -import paddle -from args import parse_args, print_args -from data import DialogueDataset, select_response -from paddle.io import DataLoader - -from paddlenlp.transformers import ( - UnifiedTransformerLMHeadModel, - UnifiedTransformerTokenizer, -) - - -def main(args): - paddle.set_device(args.device) - paddle.seed(args.seed) - - model = UnifiedTransformerLMHeadModel.from_pretrained(args.model_name_or_path) - tokenizer = UnifiedTransformerTokenizer.from_pretrained(args.model_name_or_path) - - test_dataset = DialogueDataset( - args.test_data_path, args.batch_size, tokenizer.pad_token_id, tokenizer.cls_token_id, mode="test" - ) - test_dataloader = DataLoader(test_dataset, return_list=True, batch_size=None) - - infer(model, test_dataloader, tokenizer) - - -@paddle.no_grad() -def infer(model, data_loader, tokenizer): - print("\nInfer begin...") - model.eval() - total_time = 0.0 - start_time = time.time() - responses = [] - for step, inputs in enumerate(data_loader, 1): - token_ids, type_ids, pos_ids, generation_mask = inputs - ids, scores = model.generate( - input_ids=token_ids, - token_type_ids=type_ids, - position_ids=pos_ids, - attention_mask=generation_mask, - max_length=args.max_dec_len, - min_length=args.min_dec_len, - decode_strategy=args.decode_strategy, - temperature=args.temperature, - top_k=args.top_k, - top_p=args.top_p, - num_beams=args.num_beams, - length_penalty=args.length_penalty, - early_stopping=args.early_stopping, - num_return_sequences=args.num_samples, - use_fast=False, - ) - - total_time += time.time() - start_time - if step % args.logging_steps == 0: - print("step %d - %.3fs/step" % (step, total_time / args.logging_steps)) - total_time = 0.0 - results = select_response(ids, scores, tokenizer, args.max_dec_len, args.num_samples) - responses.extend(results) - - start_time = time.time() - - with open(args.output_path, "w", encoding="utf-8") as fout: - for response in responses: - fout.write(response + "\n") - print("\nSave inference result into: %s" % args.output_path) - - -if __name__ == "__main__": - args = parse_args() - print_args(args) - main(args) diff --git a/examples/dialogue/plato-2/imgs/case.jpg b/examples/dialogue/plato-2/imgs/case.jpg deleted file mode 100644 index e3378e4164a3..000000000000 Binary files a/examples/dialogue/plato-2/imgs/case.jpg and /dev/null differ diff --git a/examples/dialogue/plato-2/imgs/network.png b/examples/dialogue/plato-2/imgs/network.png deleted file mode 100644 index c14de8e75d74..000000000000 Binary files a/examples/dialogue/plato-2/imgs/network.png and /dev/null differ diff --git a/examples/dialogue/plato-2/interaction.py b/examples/dialogue/plato-2/interaction.py deleted file mode 100644 index 6c4227b3d0d1..000000000000 --- a/examples/dialogue/plato-2/interaction.py +++ /dev/null @@ -1,104 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
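-# Interactive PLATO-2 demo: reads user utterances from stdin; type [NEXT] to
-# start a new conversation and [EXIT] to quit.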
- - -import argparse -from collections import namedtuple - -import paddle -from model import Plato2InferModel -from readers.nsp_reader import NSPReader -from readers.plato_reader import PlatoReader -from termcolor import colored, cprint -from utils import gen_inputs -from utils.args import parse_args - -from paddlenlp.trainer.argparser import strtobool - - -def setup_args(): - """Setup arguments.""" - parser = argparse.ArgumentParser() - group = parser.add_argument_group("Model") - group.add_argument("--init_from_ckpt", type=str, default="") - group.add_argument("--vocab_size", type=int, default=8001) - group.add_argument("--latent_type_size", type=int, default=20) - group.add_argument("--num_layers", type=int, default=24) - - group = parser.add_argument_group("Task") - group.add_argument("--is_cn", type=strtobool, default=False) - - args, _ = parser.parse_known_args() - NSPReader.add_cmdline_args(parser) - - args = parse_args(parser) - args.batch_size *= args.latent_type_size - - # print(json.dumps(args, indent=2)) - return args - - -def load_params(model, init_from_ckpt): - state_dict = paddle.load(init_from_ckpt) - model.set_state_dict(state_dict) - - -def interact(args): - """Inference main function.""" - plato_reader = PlatoReader(args) - nsp_reader = NSPReader(args) - - if args.num_layers == 24: - n_head = 16 - hidden_size = 1024 - elif args.num_layers == 32: - n_head = 32 - hidden_size = 2048 - else: - raise ValueError( - "The pre-trained model only support 24 or 32 layers, " "but received num_layers=%d." % args.num_layers - ) - - model = Plato2InferModel(nsp_reader, args.num_layers, n_head, hidden_size) - load_params(model, args.init_from_ckpt) - model.eval() - - Example = namedtuple("Example", ["src", "data_id"]) - context = [] - start_info = "Enter [EXIT] to quit the interaction, [NEXT] to start a new conversation." - cprint(start_info, "yellow", attrs=["bold"]) - while True: - user_utt = input(colored("[Human]: ", "red", attrs=["bold"])).strip() - if user_utt == "[EXIT]": - break - elif user_utt == "[NEXT]": - context = [] - cprint(start_info, "yellow", attrs=["bold"]) - else: - context.append(user_utt) - example = Example(src=" [SEP] ".join(context), data_id=0) - record = plato_reader._convert_example_to_record(example, is_infer=True) - data = plato_reader._pad_batch_records([record], is_infer=True) - inputs = gen_inputs(data, args.latent_type_size) - inputs["tgt_ids"] = inputs["tgt_ids"].astype("int64") - pred = model(inputs)[0] - bot_response = pred["response"] - print(colored("[Bot]:", "blue", attrs=["bold"]), colored(bot_response, attrs=["bold"])) - context.append(bot_response) - return - - -if __name__ == "__main__": - args = setup_args() - interact(args) diff --git a/examples/dialogue/plato-2/model.py b/examples/dialogue/plato-2/model.py deleted file mode 100644 index 437a70887683..000000000000 --- a/examples/dialogue/plato-2/model.py +++ /dev/null @@ -1,458 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-from collections import namedtuple
-
-import paddle
-import paddle.nn as nn
-import paddle.nn.functional as F
-
-
-def post_process_context(token_ids, reader, merge=True):
-    """Post-process the context sequence."""
-    context = []
-    utt = []
-    for tok_id in token_ids[1:]:
-        if tok_id == reader.eos_id:
-            utt = reader.tokenizer.convert_ids_to_tokens(utt)
-            if merge:
-                utt = reader.tokenizer.merge_subword(utt)
-            context.append(utt)
-            utt = []
-        else:
-            utt.append(tok_id)
-    return context
-
-
-def post_process_response(token_ids, reader, merge=True):
-    """
-    Post-process the decoded sequence: truncate from the first ``eos`` token
-    and drop the leading ``bos`` token.
-    """
-    eos_pos = len(token_ids)
-    for i, tok_id in enumerate(token_ids):
-        if tok_id == reader.eos_id:
-            eos_pos = i
-            break
-    token_ids = token_ids[1:eos_pos]
-    response = reader.tokenizer.convert_ids_to_tokens(token_ids)
-    if merge:
-        response = reader.tokenizer.merge_subword(response)
-    return token_ids, response
-
-
-def get_cross_turn_repetition(context, pred_tokens, eos_idx, is_cn=False):
-    """Get cross-turn repetition."""
-    if len(pred_tokens) == 0:
-        return 1.0
-    if is_cn:
-        context = ["".join(utt) for utt in context]
-        pred_tokens = "".join(pred_tokens)
-
-    pred_tri_grams = set()
-    for i in range(len(pred_tokens) - 2):
-        tri_gram = tuple(pred_tokens[i : i + 3])
-        pred_tri_grams.add(tri_gram)
-    for utt in context:
-        for i in range(len(utt) - 2):
-            tri_gram = tuple(utt[i : i + 3])
-            if tri_gram in pred_tri_grams:
-                return 1.0
-    return 0.0
-
-
-def get_in_turn_repetition(pred, is_cn=False):
-    """Get in-turn repetition."""
-    if len(pred) == 0:
-        return 1.0
-    if isinstance(pred[0], str):
-        pred = [tok.lower() for tok in pred]
-        if is_cn:
-            pred = "".join(pred)
-    tri_grams = set()
-    for i in range(len(pred) - 2):
-        tri_gram = tuple(pred[i : i + 3])
-        if tri_gram in tri_grams:
-            return 1.0
-        tri_grams.add(tri_gram)
-    return 0.0
-
-
-class Plato2EncoderLayer(nn.Layer):
-    def __init__(self, n_head, hidden_size, attn_dropout, act_dropout):
-        super(Plato2EncoderLayer, self).__init__()
-
-        self.self_attn = nn.MultiHeadAttention(hidden_size, n_head, attn_dropout)
-        self.pre_norm_layer = nn.LayerNorm(hidden_size)
-        self.post_norm_layer = nn.LayerNorm(hidden_size)
-        self.fc1 = nn.Linear(hidden_size, hidden_size * 4)
-        self.fc2 = nn.Linear(hidden_size * 4, hidden_size)
-
-        self.dropout_layer = nn.Dropout(act_dropout)
-        self.gelu_layer = nn.GELU()
-
-    def forward(self, x, attn_mask, cache):
-        query = self.pre_norm_layer(x)
-        attn_output, new_cache = self.self_attn(query, None, None, attn_mask, cache)
-        attn_output = self.dropout_layer(attn_output)
-        attn_output = attn_output + x
-        ffd_input = self.post_norm_layer(attn_output)
-
-        ffd_output = self.fc1(ffd_input)
-        ffd_output = self.gelu_layer(ffd_output)
-        ffd_output = self.dropout_layer(ffd_output)
-
-        ffd_output = self.fc2(ffd_output)
-        ffd_output = self.dropout_layer(ffd_output)
-        out = ffd_output + attn_output
-
-        return out, new_cache
-
-    def gen_cache(self, key):
-        return self.self_attn.gen_cache(key)
-
-
-class Plato2Encoder(nn.Layer):
-    def __init__(
-        self, vocab_size, type_size, max_position_seq_len, num_layers, n_head, hidden_size, attn_dropout, act_dropout
-    ):
-        super(Plato2Encoder, self).__init__()
-
-        self.n_head = n_head
-
-        self.word_embedding_layer = nn.Embedding(vocab_size, hidden_size)
-        self.sent_embedding_layer = nn.Embedding(type_size, hidden_size)
-
self.pos_embedding_layer = nn.Embedding(max_position_seq_len, hidden_size)
-
-        self.encoder_layers = []
-        for i in range(num_layers):
-            encoder_layer = Plato2EncoderLayer(n_head, hidden_size, attn_dropout, act_dropout)
-            self.encoder_layers.append(encoder_layer)
-            self.add_sublayer("layers." + str(i), encoder_layer)
-        self.post_encoder_layer_norm = nn.LayerNorm(hidden_size)
-
-        self.dropout_layer = nn.Dropout(act_dropout)
-
-    def forward(self, caches, token_ids, type_ids, pos_ids, generation_mask, aux_emb=None):
-        out, self_attn_mask = self.gen_input(token_ids, type_ids, pos_ids, generation_mask, aux_emb)
-
-        new_caches = []
-        for i, encoder_layer in enumerate(self.encoder_layers):
-            out, new_cache = encoder_layer(out, self_attn_mask, caches[i])
-            new_caches.append(new_cache)
-
-        enc_output = self.post_encoder_layer_norm(out)
-        return enc_output, new_caches
-
-    def gen_input(self, token_ids, type_ids, pos_ids, input_mask, aux_emb=None):
-        token_emb_out = self.word_embedding_layer(token_ids)
-        type_emb_out = self.sent_embedding_layer(type_ids)
-        pos_emb_out = self.pos_embedding_layer(pos_ids)
-        emb_out = token_emb_out + type_emb_out + pos_emb_out
-
-        # auxiliary memory embeddings
-        if aux_emb is not None:
-            emb_out = paddle.concat([aux_emb, emb_out], axis=1)
-
-        emb_out = self.dropout_layer(emb_out)
-
-        # generate n-head self-attention mask
-        self_attn_mask = input_mask
-        self_attn_mask = paddle.scale(x=self_attn_mask, scale=1e4, bias=-1.0, bias_after_scale=False)
-        n_head_self_attn_mask = paddle.stack(x=[self_attn_mask] * self.n_head, axis=1)
-        n_head_self_attn_mask.stop_gradient = True
-
-        return emb_out, n_head_self_attn_mask
-
-    def gen_caches(self, key):
-        caches = [encoder_layer.gen_cache(key) for encoder_layer in self.encoder_layers]
-        return caches
-
-
-class NSP(nn.Layer):
-    def __init__(
-        self, vocab_size, type_size, max_position_seq_len, num_layers, n_head, hidden_size, attn_dropout, act_dropout
-    ):
-        super(NSP, self).__init__()
-
-        self.n_head = n_head
-        self.hidden_size = hidden_size
-
-        self.word_embedding_layer = nn.Embedding(vocab_size, hidden_size)
-        self.sent_embedding_layer = nn.Embedding(type_size, hidden_size)
-        self.pos_embedding_layer = nn.Embedding(max_position_seq_len, hidden_size)
-
-        encoder_layer = nn.TransformerEncoderLayer(
-            hidden_size, n_head, hidden_size * 4, act_dropout, "gelu", attn_dropout, act_dropout, True
-        )
-        encoder_norm = nn.LayerNorm(hidden_size)
-        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers, encoder_norm)
-        self.fc1 = nn.Linear(hidden_size, hidden_size)
-        self.fc2 = nn.Linear(hidden_size, 2)
-
-        self.dropout_layer = nn.Dropout(act_dropout)
-        self.tanh_layer = nn.Tanh()
-        self.softmax = nn.Softmax()
-
-    def forward(self, inputs):
-        token_ids = inputs["token_ids"]
-        type_ids = inputs["type_ids"]
-        pos_ids = inputs["pos_ids"]
-        attention_mask = inputs["attention_mask"]
-        label_pos = inputs["label_pos"]
-
-        out, self_attn_mask = self.gen_input(token_ids, type_ids, pos_ids, attention_mask)
-        # [-1, seq_len, hidden_size]
-        enc_out = self.encoder(out, self_attn_mask)
-
-        enc_out = paddle.reshape(enc_out, [-1, self.hidden_size])
-        label_pos = paddle.cast(label_pos, "int64")
-        out = paddle.gather(enc_out, label_pos)
-        pooled_out = self.fc1(out)
-        pooled_out = self.tanh_layer(pooled_out)
-
-        # [-1, 2]
-        logits = self.fc2(pooled_out)
-        probs = self.softmax(logits)
-
-        return probs
-
-    def gen_input(self, token_ids, type_ids, pos_ids, input_mask, aux_emb=None):
-        token_emb_out = self.word_embedding_layer(token_ids)
-        type_emb_out =
self.sent_embedding_layer(type_ids) - pos_emb_out = self.pos_embedding_layer(pos_ids) - emb_out = token_emb_out + type_emb_out + pos_emb_out - - # auxiliary memory embeddings - if aux_emb is not None: - emb_out = paddle.concat([aux_emb, emb_out], axis=1) - - emb_out = self.dropout_layer(emb_out) - - # generate n-head self-attention mask - self_attn_mask = input_mask - self_attn_mask = paddle.scale(x=self_attn_mask, scale=1e4, bias=-1.0, bias_after_scale=False) - n_head_self_attn_mask = paddle.stack(x=[self_attn_mask] * self.n_head, axis=1) - n_head_self_attn_mask.stop_gradient = True - - return emb_out, n_head_self_attn_mask - - -class Plato2InferModel(nn.Layer): - def __init__( - self, - nsp_reader, - num_layers, - n_head, - hidden_size, - vocab_size=8001, - type_size=2, - latent_type_size=20, - max_position_seq_len=256, - act_dropout=0.1, - attn_dropout=0.1, - max_dec_len=64, - min_dec_len=1, - topk=10, - ): - super(Plato2InferModel, self).__init__() - - self.nsp_reader = nsp_reader - self.num_layers = num_layers - self.latent_type_size = latent_type_size - self.max_dec_len = max_dec_len - self.min_dec_len = min_dec_len - self.topk = topk - self.unk_id = 0 - self.bos_id = 1 - self.eos_id = 2 - self.mask_id = 8000 - self.after_eos = paddle.ones([vocab_size]) * -1e9 - self.after_eos[self.eos_id] = 0 - self.is_cn = False - self.batch_size = 1 - - self.latent_weight = paddle.create_parameter([hidden_size, latent_type_size], "float32") - - self.plato2_encoder = Plato2Encoder( - vocab_size, type_size, max_position_seq_len, num_layers, n_head, hidden_size, attn_dropout, act_dropout - ) - - self.logits_fc_layer = nn.Linear(hidden_size, hidden_size) - self.logits_layer_norm = nn.LayerNorm(hidden_size) - self.logits_bias = paddle.create_parameter([vocab_size], "float32", is_bias=True) - - self.nsp_predictor = NSP( - vocab_size, type_size, max_position_seq_len, num_layers, n_head, hidden_size, attn_dropout, act_dropout - ) - - self.gelu_layer = nn.GELU() - self.softmax = nn.Softmax() - - @paddle.no_grad() - def forward(self, inputs): - token_ids = inputs["token_ids"] - type_ids = inputs["type_ids"] - pos_ids = inputs["pos_ids"] - generation_mask = inputs["generation_mask"] - latent_id = inputs["latent_id"] - data_id = inputs["data_id"] - - # [-1, 1, latent_type_size] - latent_id = F.one_hot(latent_id, self.latent_type_size) - # [-1, 1, hidden_size] - latent_emb = paddle.matmul(latent_id, self.latent_weight, transpose_y=True) - - caches = self.plato2_encoder.gen_caches(token_ids) - - # [-1, seq_len + 1, hidden_size] - enc_out, new_caches = self.plato2_encoder(caches, token_ids, type_ids, pos_ids, generation_mask, latent_emb) - - pred_ids = self.decode(inputs, new_caches) - - nsp_inputs = self.gen_nsp_input(token_ids, pred_ids) - # [-1, 2] - probs = self.nsp_predictor(nsp_inputs) - - return self.get_results(data_id, token_ids, pred_ids, probs) - - def decode(self, inputs, caches): - tgt_ids = inputs["tgt_ids"] - tgt_pos = inputs["tgt_pos"] - tgt_generation_mask = inputs["tgt_generation_mask"] - predictions = tgt_ids - - # TODO - step = 0 - while step < self.max_dec_len: - # [-1, 1] - append_mask = paddle.cast(tgt_ids != self.eos_id, dtype=tgt_generation_mask.dtype) - tgt_generation_mask = paddle.concat([tgt_generation_mask, paddle.unsqueeze(append_mask, 1)], axis=-1) - tgt_sent = paddle.ones([tgt_generation_mask.shape[0], 1], dtype=tgt_ids.dtype) - - # [-1, 1, hidden_size] - out, caches = self.plato2_encoder(caches, tgt_ids, tgt_sent, tgt_pos, tgt_generation_mask) - out = paddle.squeeze(out, axis=1) 
-
-            # [-1, hidden_size]
-            trans = self.logits_fc_layer(out)
-            trans = self.gelu_layer(trans)
-            trans = self.logits_layer_norm(trans)
-
-            # [-1, vocab_size]
-            logits = (
-                paddle.matmul(trans, self.plato2_encoder.word_embedding_layer.weight, transpose_y=True)
-                + self.logits_bias
-            )
-            logits[:, self.unk_id] = -1e9
-            logits[:, self.bos_id] = -1e9
-            logits[:, self.mask_id] = -1e9
-            if step < self.min_dec_len:
-                logits[:, self.eos_id] = -1e9
-            logits = logits * append_mask + (1 - append_mask) * self.after_eos
-            probs = self.softmax(logits)
-
-            # [-1, topk]
-            topk_probs, _ = paddle.topk(probs, k=self.topk)
-            mask = paddle.cast(probs >= topk_probs[:, -1:], "float32")
-            sums = paddle.sum(topk_probs, axis=-1, keepdim=True)
-            new_probs = probs * mask / sums
-            # [-1, 1]
-            sampling_ids = paddle.multinomial(new_probs)
-
-            step = step + 1
-            tgt_ids = sampling_ids
-            tgt_pos = tgt_pos + 1
-            predictions = paddle.concat([predictions, tgt_ids], axis=1)
-        return predictions
-
-    def gen_nsp_input(self, token_ids, pred_ids):
-        token_ids = token_ids.numpy()
-        pred_ids = pred_ids.numpy()
-
-        def __reader__():
-            headers = ["src", "tgt", "data_id"]
-
-            Example = namedtuple("Example", headers)
-
-            for i, (raw, pred) in enumerate(zip(token_ids, pred_ids)):
-                context = post_process_context(raw, self.nsp_reader, merge=False)
-                _, response = post_process_response(pred, self.nsp_reader, merge=False)
-                context_tokenized_input = " [SEP] ".join(" ".join(utt) for utt in context)
-                response_tokenized_input = " ".join(response)
-                example = Example(src=context_tokenized_input, tgt=response_tokenized_input, data_id=i)
-                data = self.nsp_reader._convert_example_to_record(example, is_infer=True)
-                yield data
-            return
-
-        generator = self.nsp_reader.data_generator(
-            reader=__reader__,
-            is_infer=True,
-            phase="test",
-        )
-        inputs = next(generator())
-
-        # print('\nnsp_inputs:')
-        for key in inputs:
-            inputs[key] = paddle.to_tensor(inputs[key])
-            if key in ["token_ids", "type_ids", "pos_ids"]:
-                inputs[key] = paddle.squeeze(inputs[key], axis=-1)
-            # print(key, inputs[key].shape)
-            # print(inputs[key])
-        return inputs
-
-    def get_results(self, data_id, token_ids, pred_ids, probs):
-        data_id = data_id.numpy()
-        token_ids = token_ids.numpy()
-        pred_ids = pred_ids.numpy()
-        probs = probs.numpy()
-
-        infos = []
-        for raw, pred, prob in zip(token_ids, pred_ids, probs):
-            tokens = post_process_context(raw, self.nsp_reader)
-            pred_token_ids, pred_tokens = post_process_response(pred, self.nsp_reader)
-            info = {}
-            info["response"] = " ".join(pred_tokens)
-            cross_turn_repetition = get_cross_turn_repetition(tokens, pred_tokens, self.nsp_reader.eos_id, self.is_cn)
-            in_turn_repetition = max(
-                get_in_turn_repetition(pred_tokens, self.is_cn), get_in_turn_repetition(pred_token_ids)
-            )
-
-            info["score"] = float(prob[1])
-            if len(pred_token_ids) >= self.max_dec_len:
-                info["score"] -= 1e3
-            elif cross_turn_repetition > 0:
-                info["score"] -= 1e3
-            elif in_turn_repetition > 0:
-                info["score"] -= 1e3
-            infos.append(info)
-
-        results = []
-        pre_idx = 0
-        sample = []
-        for idx, info in zip(data_id, infos):
-            if idx != pre_idx:
-                sample = sorted(sample, key=lambda info: -info["score"])
-                result = sample[0]
-                result["data_id"] = pre_idx
-                results.append(result)
-                sample = []
-                pre_idx = idx
-            sample.append(info)
-        if sample:
-            sample = sorted(sample, key=lambda info: -info["score"])
-            result = sample[0]
-            result["data_id"] = pre_idx
-            results.append(result)
-        return results
diff --git a/examples/dialogue/plato-2/readers/dialog_reader.py
b/examples/dialogue/plato-2/readers/dialog_reader.py
deleted file mode 100644
index 00339c1f0a0e..000000000000
--- a/examples/dialogue/plato-2/readers/dialog_reader.py
+++ /dev/null
@@ -1,436 +0,0 @@
-# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""Dialogue Reader."""
-
-import csv
-import gzip
-from collections import namedtuple
-from contextlib import contextmanager
-
-import numpy as np
-import utils.tokenization as tokenization
-from utils import pad_batch_data
-from utils.masking import mask
-
-from paddlenlp.trainer.argparser import strtobool
-
-
-class DialogReader(object):
-    """The implementation of DialogReader."""
-
-    @classmethod
-    def add_cmdline_args(cls, parser):
-        """Add cmdline arguments."""
-        group = parser.add_argument_group("Reader")
-        group.add_argument("--max_src_len", type=int, default=128)
-        group.add_argument("--max_tgt_len", type=int, default=128)
-        group.add_argument("--truncate_first_turn", type=strtobool, default=False)
-        group.add_argument("--file_format", type=str, default="file", choices=["file", "filelist"])
-        group.add_argument("--data_format", type=str, default="raw", choices=["raw", "tokenized", "numerical"])
-        group.add_argument("--in_tokens", type=strtobool, default=False)
-        group.add_argument("--batch_size", type=int, default=16)
-        group.add_argument("--continuous_position", type=strtobool, default=True)
-        group.add_argument("--random_seed", type=int, default=11)
-        group.add_argument("--sort_pool_size", type=int, default=2**16)
-
-        group = parser.add_argument_group("Tokenizer")
-        group.add_argument("--tokenizer", type=str, default="SentencePieceTokenizer")
-        args, _ = parser.parse_known_args()
-        tokenizer_cls = getattr(tokenization, args.tokenizer)
-        tokenizer_cls.add_cmdline_args(parser)
-        return group
-
-    def __init__(self, args):
-        tokenizer_cls = getattr(tokenization, args.tokenizer)
-        self.tokenizer = tokenizer_cls(args)
-        self.vocab = self.tokenizer.vocab
-        self.pad_id = args.pad_id = self.vocab["[PAD]"]
-        self.bos_id = args.bos_id = self.vocab["[CLS]"]
-        self.eos_id = args.eos_id = self.vocab["[SEP]"]
-        self.unk_id = args.unk_id = self.vocab["[UNK]"]
-        self.mask_id = args.mask_id = self.vocab["[MASK]"]
-        self.vocab_size = args.get("vocab_size", 0)
-        self.max_src_len = args.max_src_len
-        self.max_tgt_len = args.max_tgt_len
-        self.truncate_first_turn = args.truncate_first_turn
-        self.file_format = args.file_format
-        self.data_format = args.data_format
-        self.in_tokens = args.in_tokens
-        self.batch_size = args.batch_size
-        self.continuous_position = args.continuous_position
-        self.sort_pool_size = args.sort_pool_size
-
-        # random_seed must be set for data slicing when using multi-gpu
-        self.global_rng = np.random.RandomState(args.random_seed)
-
-        # training progress
-        self.current_example = 0
-        self.current_epoch = 0
-        self.num_examples = 0
-
-        # model related
-
-        self.fields = ["token_ids", "type_ids", "pos_ids"]
-        self.num_numerical_fields = len(self.fields)
-        self.fields +=
["tgt_start_idx", "data_id"] - self.sort_key = lambda record: [len(record.token_ids)] - - self.Record = namedtuple("Record", self.fields, defaults=(None,) * len(self.fields)) - - self.features = {} - return - - def get_train_progress(self): - """Gets progress for training phase.""" - return self.current_epoch, self.current_file_index, self.total_file - - def _convert_example_to_record(self, example, is_infer): - # process src - src_token_ids = [] - src_pos_ids = [] - - # tokenize src - s_token_ids_list = [] - for s in example.src.split("[SEP]"): - s = tokenization.convert_to_unicode(s).strip() - - if self.data_format == "tokenized": - s_tokens = s.split(" ") - else: - s_tokens = self.tokenizer.tokenize(s) - - s_token_ids = self.tokenizer.convert_tokens_to_ids(s_tokens) + [self.eos_id] - s_token_ids_list.append(s_token_ids) - - # trim src - idx = len(s_token_ids_list) - 1 - total_token_num = 1 - while idx >= 0: - total_token_num += len(s_token_ids_list[idx]) - if total_token_num > self.max_src_len: - if self.truncate_first_turn and idx == 0: - truncated_ids = s_token_ids_list[idx][: self.max_src_len - total_token_num] - if len(truncated_ids) > 1: - s_token_ids_list[idx] = truncated_ids[:-1] + [self.eos_id] - idx -= 1 - break - idx -= 1 - - for i, s_token_ids in enumerate(s_token_ids_list[idx + 1 :], idx + 1): - src_token_ids += s_token_ids - src_pos_ids += list(range(1, len(s_token_ids) + 1)) - - src_token_ids = [self.bos_id] + src_token_ids - src_type_ids = [0] * len(src_token_ids) - src_pos_ids = [0] + src_pos_ids - assert ( - len(src_token_ids) == len(src_type_ids) == len(src_pos_ids) - ), "not len(src_token_ids) == len(src_type_ids) == len(src_pos_ids)" - - token_ids = src_token_ids - type_ids = src_type_ids - pos_ids = src_pos_ids - tgt_start_idx = len(token_ids) - - if not is_infer: - # process tgt - # tokenize tgt - tgt = tokenization.convert_to_unicode(example.tgt).strip() - if self.data_format == "tokenized": - tgt_tokens = tgt.split(" ") - else: - tgt_tokens = self.tokenizer.tokenize(tgt) - - tgt_token_ids = self.tokenizer.convert_tokens_to_ids(tgt_tokens) - tgt_token_ids.append(self.eos_id) - - # trim tgt - if len(tgt_token_ids) > self.max_tgt_len - 1: - tgt_token_ids = tgt_token_ids[: self.max_tgt_len - 1] - - tgt_token_ids = [self.bos_id] + tgt_token_ids - tgt_type_ids = [1] * len(tgt_token_ids) - tgt_pos_ids = list(range(1, len(tgt_token_ids) + 1)) - assert ( - len(tgt_token_ids) == len(tgt_type_ids) == len(tgt_pos_ids) - ), "not len(tgt_token_ids) == len(tgt_type_ids) == len(tgt_pos_ids)" - - token_ids += tgt_token_ids - type_ids += tgt_type_ids - pos_ids += tgt_pos_ids - - assert len(token_ids) == len(type_ids) == len(pos_ids), "not len(token_ids) == len(type_ids) == len(pos_ids)" - - if self.continuous_position: - src_pos_ids = list(range(len(src_token_ids))) - if not is_infer: - tgt_pos_ids = list(range(len(tgt_token_ids))) - pos_ids = list(range(len(token_ids))) - - field_values = {"token_ids": src_token_ids, "type_ids": src_type_ids, "pos_ids": src_pos_ids} - field_values["tgt_start_idx"] = tgt_start_idx - field_values["data_id"] = example.data_id - - record = self.Record(**field_values) - return record - - def _read_tsv(self, fp, phase, is_infer, delimiter="\t", quotechar=None): - """Reads a tab separated value file.""" - csv.field_size_limit(2**20) - reader = csv.reader(fp, delimiter=delimiter, quotechar=quotechar) - headers = next(reader) - headers.append("data_id") - Example = namedtuple("Example", headers) - - for i, line in enumerate(reader): - example = 
Example(*line, data_id=i) - if is_infer or phase.endswith("test"): - self.features[phase][i] = example - record = self._convert_example_to_record(example, is_infer) - yield record - - def _read_numerical_file(self, fp, delimiter=";"): - for i, line in enumerate(fp): - cols = tokenization.convert_to_unicode(line).strip().split(delimiter) - cols = list(map(lambda x: list(map(int, x.split(" "))), cols)) - if len(cols) > self.num_numerical_fields: - cols = cols[: self.num_numerical_fields] - tgt_start_idx = cols[0].index(self.bos_id, 1) - record = self.Record(*cols, tgt_start_idx=tgt_start_idx, data_id=i) - yield record - - def _read_file(self, input_file, phase, is_infer): - def __wrapper__(): - with open_file(input_file) as fp: - if self.data_format == "numerical": - records = self._read_numerical_file(fp) - else: - records = self._read_tsv(fp, phase, is_infer) - for record in records: - yield record - - return __wrapper__ - - def _read_files(self, filelist, phase, is_infer, shuffle_files): - input_files = open(filelist).readlines() - - def __wrapper__(): - if shuffle_files: - self.global_rng.shuffle(input_files) - - if phase == "train": - self.total_file = len(input_files) - for file_index, input_file in enumerate(input_files, 1): - if phase == "train": - self.current_file_index = file_index - self.current_file = input_file - file_reader = self._read_file(input_file.strip(), phase, is_infer) - for record in file_reader(): - yield record - - return __wrapper__ - - def _batch_reader(self, reader, phase=None, is_infer=False, sort_pool_size=2**16): - """Construct a batch reader.""" - - def update_max_lens(max_lens, record): - """Update max_lens.""" - if max_lens is None: - return self.sort_key(record) - else: - return [max(max_len, l) for max_len, l in zip(max_lens, self.sort_key(record))] - - def get_batch(reader): - """Generate batches from reader.""" - batch, max_lens = [], None - for record in reader(): - if record is None: - yield batch - batch, max_lens = [], None - continue - - self.current_example += 1 - max_lens = update_max_lens(max_lens, record) - if self.in_tokens: - to_append = (len(batch) + 1) * sum(max_lens) <= self.batch_size - else: - to_append = len(batch) < self.batch_size - if to_append: - batch.append(record) - else: - yield batch - batch, max_lens = [record], self.sort_key(record) - - if len(batch) > 0: - yield batch - - def get_sorted_batch(pool): - """Generate sorted batches from pool.""" - pool = sorted(pool, key=self.sort_key) - batches = [] - batch, max_lens = [], None - for record in pool: - self.current_example += 1 - max_lens = update_max_lens(max_lens, record) - if self.in_tokens: - to_append = (len(batch) + 1) * sum(max_lens) <= self.batch_size - else: - to_append = len(batch) < self.batch_size - if to_append: - batch.append(record) - else: - batches.append(batch) - batch, max_lens = [record], self.sort_key(record) - - if len(batch) > 0: - batches.append(batch) - self.global_rng.shuffle(batches) - - for batch in batches: - yield batch - - def __wrapper__(): - if sort_pool_size > 0: - pool = [] - for record in reader(): - pool.append(record) - if len(pool) == sort_pool_size: - for batch in get_sorted_batch(pool): - yield batch - pool = [] - if len(pool) > 0: - for batch in get_sorted_batch(pool): - yield batch - else: - for batch in get_batch(reader): - yield batch - - return __wrapper__ - - def _distributed_batch_reader(self, batch_reader, num_part, part_id, is_test=False): - def __wrapper__(): - batches = [] - for batch in batch_reader(): - batches.append(batch) 
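-                # Round-robin sharding: buffer num_part batches, then let this
-                # worker consume only the batch at its part_id.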
- if len(batches) == num_part: - yield batches[part_id] - batches = [] - if is_test and 0 <= part_id < len(batches): - yield batches[part_id] - return - - return __wrapper__ - - def data_generator( - self, input_file=None, reader=None, num_epochs=1, num_part=1, part_id=0, phase=None, is_infer=False - ): - """Data generator.""" - - def __wrapper__(): - if is_infer or phase.endswith("test"): - self.features[phase] = {} - - nonlocal reader - if reader is None: - if self.file_format == "filelist": - reader = self._read_files(input_file, phase, is_infer, not phase.endswith("test")) - else: - if phase == "train": - self.total_file = 1 - self.current_file_index = 1 - self.current_file = input_file - reader = self._read_file(input_file, phase, is_infer) - - batch_reader = self._batch_reader( - reader, phase, is_infer, sort_pool_size=self.sort_pool_size if not is_infer else 0 - ) - if phase == "train": - batch_reader = self._distributed_batch_reader(batch_reader, num_part, part_id) - elif phase.startswith("distributed"): - batch_reader = self._distributed_batch_reader(batch_reader, num_part, part_id, is_test=True) - - for epoch_index in range(num_epochs): - if phase == "train": - self.current_example = 0 - self.current_epoch = epoch_index + 1 - for batch in batch_reader(): - yield self._pad_batch_records(batch, is_infer) - - return __wrapper__ - - def _gen_self_attn_mask(self, batch_token_ids, batch_tgt_start_idx=None, is_unidirectional=True, shift_len=0): - max_len = max(map(len, batch_token_ids)) - input_mask_data = np.zeros((len(batch_token_ids), max_len + shift_len, max_len + shift_len)) - if is_unidirectional: - for index, mask_data in enumerate(input_mask_data): - start = 0 if batch_tgt_start_idx is None else batch_tgt_start_idx[index] - end = len(batch_token_ids[index]) - mask_data[: end + shift_len, : start + shift_len] = 1.0 - # Generate the lower triangular matrix using the slice of matrix - b = np.tril(np.ones([end - start, end - start]), 0) - mask_data[start + shift_len : end + shift_len, start + shift_len : end + shift_len] = b - else: - for index, token_ids in enumerate(batch_token_ids): - input_mask_data[index, : len(token_ids) + shift_len, : len(token_ids) + shift_len] = 1.0 - return input_mask_data.astype("float32") - - def _pad_batch_records(self, batch_records, is_infer): - """ - Padding batch records and construct model's inputs. 
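-
-        Returns a dict of padded numpy inputs (token/type/position ids plus the
-        generation mask); at inference time it also adds tgt_ids, tgt_pos and
-        init_score to seed decoding.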
- """ - batch = {} - batch_token_ids = [record.token_ids for record in batch_records] - batch_type_ids = [record.type_ids for record in batch_records] - batch_pos_ids = [record.pos_ids for record in batch_records] - batch["token_ids"] = pad_batch_data(batch_token_ids, pad_id=self.pad_id) - batch["type_ids"] = pad_batch_data(batch_type_ids, pad_id=self.pad_id) - batch["pos_ids"] = pad_batch_data(batch_pos_ids, pad_id=self.pad_id) - - batch_tgt_start_idx = [record.tgt_start_idx for record in batch_records] - batch["generation_mask"] = self._gen_self_attn_mask(batch_token_ids, batch_tgt_start_idx=batch_tgt_start_idx) - - if is_infer: - tgt_ids = np.array([[[self.bos_id]]] * len(batch_token_ids), dtype="int64") - if self.continuous_position: - tgt_pos = np.array(batch_tgt_start_idx, dtype="int64") - else: - tgt_pos = np.zeros_like(batch_tgt_start_idx, dtype="int64") - tgt_pos = tgt_pos.reshape(-1, 1, 1) - batch["init_score"] = np.zeros_like(tgt_ids, dtype="float32").reshape(-1, 1).tolist() - batch["tgt_ids"] = tgt_ids.tolist() - batch["tgt_pos"] = tgt_pos.tolist() - - batch["tgt_generation_mask"] = batch["generation_mask"][:, 0:1, :].astype("float32") - else: - batch["tgt_label"], batch["tgt_pos"] = mask( - batch_tokens=batch_token_ids, - vocab_size=self.vocab_size, - sent_b_starts=batch_tgt_start_idx, - is_unidirectional=True, - ) - - batch_data_id = [record.data_id for record in batch_records] - batch["data_id"] = np.array(batch_data_id).astype("int64").reshape([-1, 1]) - return batch - - -@contextmanager -def open_file(filename): - """Open file.""" - if filename.endswith(".gz"): - fp = gzip.open(filename, "rt") - else: - fp = open(filename) - yield fp - fp.close() diff --git a/examples/dialogue/plato-2/readers/nsp_reader.py b/examples/dialogue/plato-2/readers/nsp_reader.py deleted file mode 100644 index 968da9ff1ba0..000000000000 --- a/examples/dialogue/plato-2/readers/nsp_reader.py +++ /dev/null @@ -1,152 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
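As an aside on the `_gen_self_attn_mask` helper above: for a single unpadded sequence with no shift, the mask it builds reduces to the sketch below (illustrative only; 1.0 marks an allowed attention edge, context positions are visible everywhere, and response positions see only their prefix):

```python
import numpy as np

def unidirectional_mask(seq_len: int, tgt_start: int) -> np.ndarray:
    """1.0 = attend, 0.0 = blocked (same convention as the reader above)."""
    mask = np.zeros((seq_len, seq_len), dtype="float32")
    mask[:, :tgt_start] = 1.0  # context tokens are visible from every position
    n = seq_len - tgt_start
    # response tokens attend to the already-generated prefix only
    mask[tgt_start:, tgt_start:] = np.tril(np.ones((n, n), dtype="float32"))
    return mask

print(unidirectional_mask(5, 2))
```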
-"""NSP Reader.""" - -from collections import namedtuple - -import numpy as np -from readers.dialog_reader import DialogReader -from utils import pad_batch_data -from utils.masking import mask - -from paddlenlp.trainer.argparser import strtobool - - -class NSPReader(DialogReader): - """NSP Reader.""" - - @classmethod - def add_cmdline_args(cls, parser): - """Add cmdline argurments.""" - group = DialogReader.add_cmdline_args(parser) - group.add_argument( - "--attention_style", type=str, default="bidirectional", choices=["bidirectional", "unidirectional"] - ) - group.add_argument("--mix_negative_sample", type=strtobool, default=False) - return group - - def __init__(self, args): - super(NSPReader, self).__init__(args) - self.fields.append("label") - self.Record = namedtuple("Record", self.fields, defaults=(None,) * len(self.fields)) - - self.attention_style = args.attention_style - self.mix_negative_sample = args.mix_negative_sample - return - - def _convert_example_to_record(self, example, is_infer): - record = super(NSPReader, self)._convert_example_to_record(example, False) - if "label" in example._fields: - record = record._replace(label=int(example.label)) - return record - - def _mix_negative_sample(self, reader, neg_pool_size=2**16): - def gen_from_pool(pool): - num_samples = len(pool) - if num_samples == 1: - # only one sample: it is impossible to generate negative sample - yield pool[0]._replace(label=1) - return - self.global_rng.shuffle(pool) - for i in range(num_samples): - pool[i] = pool[i]._replace(label=1) - j = (i + 1) % num_samples - idx_i = pool[i].tgt_start_idx - idx_j = pool[j].tgt_start_idx - field_values = {} - field_values["token_ids"] = pool[i].token_ids[:idx_i] + pool[j].token_ids[idx_j:] - field_values["type_ids"] = pool[i].type_ids[:idx_i] + pool[j].type_ids[idx_j:] - field_values["pos_ids"] = list(range(len(field_values["token_ids"]))) - neg_record = self.Record(**field_values, tgt_start_idx=idx_i, data_id=-1, label=0) - pool.append(neg_record) - assert len(neg_record.token_ids) <= self.max_seq_len - self.global_rng.shuffle(pool) - for record in pool: - yield record - - def __wrapper__(): - pool = [] - for record in reader(): - pool.append(record) - if len(pool) == neg_pool_size: - for record in gen_from_pool(pool): - yield record - pool = [] - if len(pool) > 0: - for record in gen_from_pool(pool): - yield record - - return __wrapper__ - - def _batch_reader(self, reader, phase=None, is_infer=False, sort_pool_size=2**16): - if self.mix_negative_sample: - reader = self._mix_negative_sample(reader) - return super(NSPReader, self)._batch_reader( - reader, phase=phase, is_infer=is_infer, sort_pool_size=sort_pool_size - ) - - def _pad_batch_records(self, batch_records, is_infer): - """ - Padding batch records and construct model's inputs. 
- """ - batch = {} - batch_token_ids = [record.token_ids for record in batch_records] - batch_type_ids = [record.type_ids for record in batch_records] - batch_pos_ids = [record.pos_ids for record in batch_records] - batch_tgt_start_idx = [record.tgt_start_idx for record in batch_records] - batch_label = [record.label for record in batch_records] - - if self.attention_style == "unidirectional": - batch["token_ids"] = pad_batch_data(batch_token_ids, pad_id=self.pad_id) - batch["type_ids"] = pad_batch_data(batch_type_ids, pad_id=self.pad_id) - batch["pos_ids"] = pad_batch_data(batch_pos_ids, pad_id=self.pad_id) - tgt_label, tgt_pos, label_pos = mask( - batch_tokens=batch_token_ids, - vocab_size=self.vocab_size, - bos_id=self.bos_id, - sent_b_starts=batch_tgt_start_idx, - labels=batch_label, - is_unidirectional=True, - ) - attention_mask = self._gen_self_attn_mask(batch_token_ids, batch_tgt_start_idx) - else: - batch_mask_token_ids, tgt_label, tgt_pos, label_pos = mask( - batch_tokens=batch_token_ids, - vocab_size=self.vocab_size, - bos_id=self.bos_id, - eos_id=self.eos_id, - mask_id=self.mask_id, - sent_b_starts=batch_tgt_start_idx, - labels=batch_label, - is_unidirectional=False, - ) - if not is_infer: - batch_token_ids = batch_mask_token_ids - batch["token_ids"] = pad_batch_data(batch_token_ids, pad_id=self.pad_id) - batch["type_ids"] = pad_batch_data(batch_type_ids, pad_id=self.pad_id) - batch["pos_ids"] = pad_batch_data(batch_pos_ids, pad_id=self.pad_id) - attention_mask = self._gen_self_attn_mask(batch_token_ids, is_unidirectional=False) - - batch["attention_mask"] = attention_mask - batch["label_pos"] = label_pos - - if not is_infer: - batch_label = np.array(batch_label).astype("int64").reshape([-1, 1]) - batch["label"] = batch_label - batch["tgt_label"] = tgt_label - batch["tgt_pos"] = tgt_pos - - batch_data_id = [record.data_id for record in batch_records] - batch["data_id"] = np.array(batch_data_id).astype("int64").reshape([-1, 1]) - return batch diff --git a/examples/dialogue/plato-2/readers/plato_reader.py b/examples/dialogue/plato-2/readers/plato_reader.py deleted file mode 100644 index 3d3cd790ee76..000000000000 --- a/examples/dialogue/plato-2/readers/plato_reader.py +++ /dev/null @@ -1,85 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -"""Plato Reader.""" - -import numpy as np - -from readers.dialog_reader import DialogReader -from utils import pad_batch_data -from utils.masking import mask - - -class PlatoReader(DialogReader): - """The implement of PlatoReader""" - - def __init__(self, args): - super(PlatoReader, self).__init__(args) - self.latent_type_size = args.latent_type_size - self.use_bow = args.use_bow - - def _pad_batch_records(self, batch_records, is_infer): - """ - Padding batch records and construct model's inputs. 
- """ - batch = {} - batch_token_ids = [record.token_ids for record in batch_records] - batch_type_ids = [record.type_ids for record in batch_records] - batch_pos_ids = [record.pos_ids for record in batch_records] - - batch_tgt_start_idx = [record.tgt_start_idx for record in batch_records] - - batch_size = len(batch_token_ids) - - # padding - batch["token_ids"] = pad_batch_data(batch_token_ids, pad_id=self.pad_id) - batch["type_ids"] = pad_batch_data(batch_type_ids, pad_id=self.pad_id) - batch["pos_ids"] = pad_batch_data(batch_pos_ids, pad_id=self.pad_id) - - batch["generation_mask"] = self._gen_self_attn_mask( - batch_token_ids, batch_tgt_start_idx=batch_tgt_start_idx, is_unidirectional=True, shift_len=1 - ) - if not is_infer: - batch["recognition_mask"] = self._gen_self_attn_mask(batch_token_ids, is_unidirectional=False, shift_len=1) - - if is_infer: - tgt_ids = np.array([[[self.bos_id]]] * batch_size, dtype="int64") - if self.continuous_position: - tgt_pos = np.array(batch_tgt_start_idx, dtype="int64") - else: - tgt_pos = np.zeros_like(batch_tgt_start_idx, dtype="int64") - tgt_pos = tgt_pos.reshape(-1, 1, 1) - batch["init_score"] = np.zeros_like(tgt_ids, dtype="float32").reshape(-1, 1).tolist() - batch["tgt_ids"] = tgt_ids.tolist() - batch["tgt_pos"] = tgt_pos.tolist() - batch["parent_idx"] = np.array(range(batch_size), dtype="int32") - - batch["tgt_generation_mask"] = batch["generation_mask"][:, 0:1, :].astype("float32") - else: - mask_return_list = mask( - batch_tokens=batch_token_ids, - vocab_size=self.vocab_size, - sent_b_starts=batch_tgt_start_idx, - is_unidirectional=True, - use_latent=True, - use_bow=self.use_bow, - ) - batch["tgt_label"] = mask_return_list[0] - batch["tgt_pos"] = mask_return_list[1] - if self.use_bow: - batch["bow_label"] = mask_return_list[2] - batch["bow_pos"] = mask_return_list[3] - - batch_data_id = [record.data_id for record in batch_records] - batch["data_id"] = np.array(batch_data_id).astype("int64").reshape([-1, 1]) - return batch diff --git a/examples/dialogue/plato-2/utils/__init__.py b/examples/dialogue/plato-2/utils/__init__.py deleted file mode 100644 index 1a9ff1098dd3..000000000000 --- a/examples/dialogue/plato-2/utils/__init__.py +++ /dev/null @@ -1,51 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
-"""Utils.""" - -from itertools import chain - -import numpy as np -import paddle - - -def repeat_array(array, times): - """Repeate numpy array.""" - if isinstance(array, list): - return list(chain(*([array] * times))) - else: - return np.concatenate([array] * times, axis=0) - - -def gen_inputs(inputs, latent_type_size): - batch_size = len(inputs["data_id"]) - inputs = {name: repeat_array(array, latent_type_size) for name, array in inputs.items()} - # Add latent_id - inputs["latent_id"] = np.array( - [i for i in range(latent_type_size) for _ in range(batch_size)], dtype="int64" - ).reshape([-1, 1]) - - # print('\nplato_inputs:') - for key in inputs: - inputs[key] = paddle.to_tensor(inputs[key]) - if key in ["token_ids", "type_ids", "pos_ids", "tgt_ids", "tgt_pos", "data_id"]: - inputs[key] = paddle.squeeze(inputs[key], axis=-1) - # print(key, inputs[key].shape, inputs[key].dtype) - return inputs - - -def pad_batch_data(insts, pad_id=0): - """Pad the instances to the max sequence length in batch.""" - max_len = max(map(len, insts)) - inst_data = np.array([list(inst) + [pad_id] * (max_len - len(inst)) for inst in insts]) - return inst_data.astype("int64").reshape([-1, max_len, 1]) diff --git a/examples/dialogue/plato-2/utils/args.py b/examples/dialogue/plato-2/utils/args.py deleted file mode 100644 index b112acf6ba73..000000000000 --- a/examples/dialogue/plato-2/utils/args.py +++ /dev/null @@ -1,88 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -"""Parse argument.""" - -import argparse -import json - - -class Args(dict): - """Arguments class - - Store arguments in training / infer / ... scripts. 
- """ - - def __getattr__(self, name): - if name in self.keys(): - return self[name] - for v in self.values(): - if isinstance(v, Args): - if name in v: - return v[name] - return None - - def get(self, key, default_value=None): - """Get the value of corresponding key.""" - if key in self.keys(): - return self[key] - for v in self.values(): - if isinstance(v, Args): - if key in v: - return v[key] - return default_value - - def __setattr__(self, name, value): - self[name] = value - - def save(self, filename): - with open(filename, "w") as fp: - json.dump(self, fp, ensure_ascii=False, indent=4, sort_keys=False) - - def load(self, filename, group_name=None): - if group_name is not None: - if group_name not in self: - self[group_name] = Args() - self[group_name].load(filename) - return - with open(filename, "r") as fp: - params_dict = json.load(fp) - for k, v in params_dict.items(): - if isinstance(v, dict): - self[k].update(Args(v)) - else: - self[k] = v - - -def parse_args(parser: argparse.ArgumentParser, allow_unknown=False) -> Args: - """Parse hyper-parameters from cmdline.""" - if allow_unknown: - parsed, _ = parser.parse_known_args() - else: - parsed = parser.parse_args() - args = Args() - optional_args = parser._action_groups[1] - for action in optional_args._group_actions[1:]: - arg_name = action.dest - args[arg_name] = getattr(parsed, arg_name) - for group in parser._action_groups[2:]: - group_args = Args() - for action in group._group_actions: - arg_name = action.dest - group_args[arg_name] = getattr(parsed, arg_name) - if len(group_args) > 0: - if group.title in args: - args[group.title].update(group_args) - else: - args[group.title] = group_args - return args diff --git a/examples/dialogue/plato-2/utils/masking.py b/examples/dialogue/plato-2/utils/masking.py deleted file mode 100644 index fb6be808448a..000000000000 --- a/examples/dialogue/plato-2/utils/masking.py +++ /dev/null @@ -1,119 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
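A usage sketch for the `Args` container and `parse_args` above (illustrative; the argument names are made up, and the definitions from `utils/args.py` are assumed to be in scope):

```python
import argparse

# Assumes the Args / parse_args definitions from utils/args.py above.
parser = argparse.ArgumentParser()
parser.add_argument("--batch_size", type=int, default=8)
group = parser.add_argument_group("Model")
group.add_argument("--hidden_size", type=int, default=768)

args = parse_args(parser)
print(args.batch_size)   # 8, found at the top level
print(args.hidden_size)  # 768, found recursively inside the "Model" group
print(args.Model)        # the nested Args object for the argument group
```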
-"""Reader utils.""" - -import numpy as np - - -def mask( - batch_tokens, - vocab_size, - bos_id=1, - eos_id=2, - mask_id=3, - sent_b_starts=None, - labels=None, - is_unidirectional=False, - use_latent=False, - use_bow=False, -): - """ - Add mask for batch_tokens, return out, mask_label, mask_pos; - Note: mask_pos responding the batch_tokens after padded; - """ - batch_tokens = np.copy(batch_tokens) - max_len = max(map(len, batch_tokens)) - mask_label = [] - mask_pos = [] - if labels is not None: - label_pos = [] - - if is_unidirectional: - # unidirectional language model - if use_latent: - max_len += 1 - shift_len = 1 - else: - shift_len = 0 - for sent_index, sent in enumerate(batch_tokens): - sent_b_index = sent_b_starts[sent_index] if sent_b_starts is not None else 0 - if labels is not None: - label_pos.append(sent_index * max_len + len(sent) - 1 + shift_len) - mask_label.extend(sent[sent_b_index + 1 :]) - mask_pos.extend([sent_index * max_len + i + shift_len for i in range(sent_b_index, len(sent) - 1)]) - mask_label = np.array(mask_label).astype("int64").reshape([-1, 1]) - mask_pos = np.array(mask_pos).astype("int64").reshape([-1, 1]) - return_list = [mask_label, mask_pos] - - # latent related (bow label and pos) - if use_latent and use_bow: - bow_label = [] - bow_pos = [] - for sent_index, sent in enumerate(batch_tokens): - sent_b_index = sent_b_starts[sent_index] if sent_b_starts is not None else 0 - - def __filter__(tok_id): - # TODO: exclude [EOS] from bow loss - return True - - bow_pos.extend([sent_index for i in range(sent_b_index + 1, len(sent)) if __filter__(sent[i])]) - bow_label.extend([sent[i] for i in range(sent_b_index + 1, len(sent)) if __filter__(sent[i])]) - bow_label = np.array(bow_label).astype("int64").reshape([-1, 1]) - bow_pos = np.array(bow_pos).astype("int64").reshape([-1, 1]) - return_list += [bow_label, bow_pos] - else: - # bidirectional mask language model - total_token_num = sum(map(len, batch_tokens)) - prob_mask = np.random.rand(total_token_num) - # TODO: fix replace_ids, include [UNK] - replace_ids = np.random.randint(3, high=vocab_size, size=total_token_num) - prob_index = 0 - for sent_index, sent in enumerate(batch_tokens): - # add pair label position - if labels is not None: - label_pos.append(sent_index * max_len) - - # add mask label and position - for token_index, token in enumerate(sent): - if token == eos_id or token == bos_id: - continue - prob = prob_mask[prob_index + token_index] - if prob > 0.15: - continue - elif 0.03 < prob <= 0.15: - # mask - mask_label.append(sent[token_index]) - sent[token_index] = mask_id - mask_pos.append(sent_index * max_len + token_index) - elif 0.015 < prob <= 0.03: - # random replace - mask_label.append(sent[token_index]) - sent[token_index] = replace_ids[prob_index + token_index] - mask_pos.append(sent_index * max_len + token_index) - else: - # keep the original token - mask_label.append(sent[token_index]) - mask_pos.append(sent_index * max_len + token_index) - - prob_index += len(sent) - - mask_label = np.array(mask_label).astype("int64").reshape([-1, 1]) - mask_pos = np.array(mask_pos).astype("int64").reshape([-1, 1]) - return_list = [batch_tokens, mask_label, mask_pos] - - if labels is not None: - label_pos = np.array(label_pos).astype("int64").reshape([-1, 1]) - assert len(labels) == len(label_pos) - return_list.append(label_pos) - return return_list diff --git a/examples/dialogue/plato-2/utils/tokenization.py b/examples/dialogue/plato-2/utils/tokenization.py deleted file mode 100644 index 
7d4741ba984e..000000000000 --- a/examples/dialogue/plato-2/utils/tokenization.py +++ /dev/null @@ -1,189 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -"""Tokenization classes.""" - -import collections -import sentencepiece as spm -import unicodedata - -from utils.args import str2bool - - -def clean_text(text): - """Performs invalid character removal and whitespace cleanup on text.""" - text = text.replace("“", '"').replace("”", '"').replace("‘", "'").replace("’", "'").replace("—", "-") - - output = [] - for char in text: - if _is_control(char): - continue - if _is_whitespace(char): - output.append(" ") - else: - output.append(char) - return "".join(output) - - -def preprocess_text(inputs, remove_space=True, lower=False): - """preprocess data by removing extra space and normalize data.""" - outputs = inputs - if remove_space: - outputs = " ".join(inputs.strip().split()) - - outputs = unicodedata.normalize("NFKD", outputs) - outputs = "".join([c for c in outputs if not unicodedata.combining(c)]) - if lower: - outputs = outputs.lower() - - return outputs - - -def encode_pieces(spm_model, text, return_unicode=True, sample=False): - """turn sentences into word pieces.""" - # liujiaxiang: add for ernie-albert, mainly consider for “/”/‘/’/— causing too many unk - text = clean_text(text) - - if not sample: - pieces = spm_model.EncodeAsPieces(text) - else: - pieces = spm_model.SampleEncodeAsPieces(text, 64, 0.1) - - return pieces - - -def encode_ids(spm_model, text, sample=False): - """turn sentences into word pieces.""" - pieces = encode_pieces(spm_model, text, return_unicode=False, sample=sample) - ids = [spm_model.PieceToId(piece) for piece in pieces] - return ids - - -def convert_to_unicode(text): - """Converts `text` to Unicode (if it's not already), assuming utf-8 input.""" - if isinstance(text, str): - return text - elif isinstance(text, bytes): - return text.decode("utf-8", "ignore") - else: - raise ValueError("Unsupported string type: %s" % (type(text))) - - -def load_vocab(vocab_file): - """Loads a vocabulary file into a dictionary.""" - vocab = collections.OrderedDict() - fin = open(vocab_file, "r", encoding="UTF-8") - for num, line in enumerate(fin): - items = convert_to_unicode(line.rstrip()).split("\t") - if len(items) > 2: - break - token = items[0] - index = items[1] if len(items) == 2 else num - token = token.strip() - vocab[token] = int(index) - return vocab - - -def convert_by_vocab(vocab, items): - """Converts a sequence of [tokens|ids] using the vocab.""" - output = [] - for item in items: - output.append(vocab[item]) - return output - - -class SentencePieceTokenizer(object): - """Runs end-to-end tokenziation.""" - - @classmethod - def add_cmdline_args(cls, parser): - """Add cmdline argurments.""" - group = parser.add_argument_group("Tokenizer") - group.add_argument("--vocab_path", type=str, required=True) - group.add_argument("--do_lower_case", type=str2bool, default=False) - 
group.add_argument("--spm_model_file", type=str, required=True) - return group - - def __init__(self, args): - self.spm_model = spm.SentencePieceProcessor() - self.spm_model.Load(args.spm_model_file) - self.vocab = load_vocab(args.vocab_path) - self.do_lower_case = args.do_lower_case - self.inv_vocab = {v: k for k, v in self.vocab.items()} - - def tokenize(self, text): - """Tokenizes a piece of text.""" - text = preprocess_text(text, lower=self.do_lower_case) - return encode_pieces(self.spm_model, text, return_unicode=True) - - def convert_tokens_to_ids(self, tokens): - """Convert tokens to ids.""" - ret = [] - unk_id = self.vocab[""] - for token in tokens: - if token in self.vocab: - ret.append(self.vocab[token]) - else: - ret.append(unk_id) - return ret - - def convert_ids_to_tokens(self, ids): - """Convert ids to tokens.""" - return convert_by_vocab(self.inv_vocab, ids) - - def merge_subword(self, tokens): - """Merge subword.""" - ret = [] - for token in tokens: - if token.startswith("▁"): - ret.append(token[1:]) - else: - if len(ret): - ret[-1] += token - else: - ret.append(token) - - ret = [token for token in ret if token] - return ret - - def convert_ids_to_str(self, ids): - """Convert ids to string.""" - tokens = self.convert_ids_to_tokens(ids) - tokens = self.merge_subword(tokens) - res = " ".join(tokens).replace("", "") - res = res.replace("", "\n").replace("\n ", "\n").strip() - return res - - -def _is_whitespace(char): - """Checks whether `chars` is a whitespace character.""" - # \t, \n, and \r are technically contorl characters but we treat them - # as whitespace since they are generally considered as such. - if char == " " or char == "\t" or char == "\n" or char == "\r": - return True - cat = unicodedata.category(char) - if cat == "Zs": - return True - return False - - -def _is_control(char): - """Checks whether `chars` is a control character.""" - # These are technically control characters but we count them as whitespace - # characters. - if char == "\t" or char == "\n" or char == "\r": - return False - cat = unicodedata.category(char) - if cat.startswith("C"): - return True - return False diff --git a/examples/dialogue/plato-xl b/examples/dialogue/plato-xl deleted file mode 120000 index 3225c24e6351..000000000000 --- a/examples/dialogue/plato-xl +++ /dev/null @@ -1 +0,0 @@ -../../model_zoo/plato-xl \ No newline at end of file diff --git a/examples/dialogue/unified_transformer/finetune.py b/examples/dialogue/unified_transformer/finetune.py deleted file mode 100644 index daeb700d3605..000000000000 --- a/examples/dialogue/unified_transformer/finetune.py +++ /dev/null @@ -1,165 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
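A minimal round-trip sketch for the `SentencePieceTokenizer` above; it assumes the class and its module dependencies are importable, and the vocab/model paths are placeholders for real files:

```python
import argparse

# Assumes the SentencePieceTokenizer defined above; paths are placeholders.
parser = argparse.ArgumentParser()
SentencePieceTokenizer.add_cmdline_args(parser)
args = parser.parse_args(["--vocab_path", "vocab.txt", "--spm_model_file", "spm.model"])

tokenizer = SentencePieceTokenizer(args)
pieces = tokenizer.tokenize("Hello world")   # e.g. ['▁Hello', '▁world']
ids = tokenizer.convert_tokens_to_ids(pieces)
print(tokenizer.merge_subword(pieces))       # ['Hello', 'world']
print(tokenizer.convert_ids_to_str(ids))
```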
- -import argparse -import math -import os -import time - -import paddle -import paddle.distributed as dist -import paddle.nn as nn -import paddle.nn.functional as F -from datasets import load_dataset -from paddle.optimizer import AdamW -from paddle.optimizer.lr import NoamDecay -from utils import create_data_loader, print_args, set_seed - -from paddlenlp.transformers import ( - UnifiedTransformerLMHeadModel, - UnifiedTransformerTokenizer, -) - - -# yapf: disable -def parse_args(): - parser = argparse.ArgumentParser(__doc__) - parser.add_argument('--model_name_or_path', type=str, default='unified_transformer-12L-cn-luge', help='The path or shortcut name of the pre-trained model.') - parser.add_argument('--save_dir', type=str, default='./checkpoints', help='The directory where the checkpoints will be saved.') - parser.add_argument('--logging_steps', type=int, default=100, help='Log every X updates steps.') - parser.add_argument('--save_steps', type=int, default=1000, help='Save checkpoint every X updates steps.') - parser.add_argument('--seed', type=int, default=2021, help='Random seed for initialization.') - parser.add_argument('--batch_size', type=int, default=16, help='Batch size per GPU/CPU for training.') - parser.add_argument('--lr', type=float, default=5e-5, help='The initial learning rate.') - parser.add_argument('--weight_decay', type=float, default=0.01, help='The weight decay for optimizer.') - parser.add_argument('--epochs', type=int, default=3, help='Total number of training epochs to perform.') - parser.add_argument('--warmup_steps', type=int, default=2500, help='The number of warmup steps.') - parser.add_argument('--max_grad_norm', type=float, default=0.1, help='The max value of grad norm.') - parser.add_argument('--max_seq_len', type=int, default=512, help='The maximum sequence length of training.') - parser.add_argument('--max_response_len', type=int, default=128, help='The maximum response sequence length of training.') - parser.add_argument('--max_knowledge_len', type=int, default=256, help='The maximum knowledge sequence length of training.') - parser.add_argument('--device', type=str, default='gpu', help='The device to select for training the model.') - - args = parser.parse_args() - return args -# yapf: enable - - -def save_ckpt(model, tokenizer, save_dir, name): - output_dir = os.path.join(save_dir, "model_{}".format(name)) - if not os.path.exists(output_dir): - os.makedirs(output_dir) - # Need better way to get inner model of DataParallel - model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model - model_to_save.save_pretrained(output_dir) - tokenizer.save_pretrained(output_dir) - - -def train(args): - paddle.set_device(args.device) - world_size = dist.get_world_size() - if world_size > 1: - dist.init_parallel_env() - - set_seed(args.seed) - - model = UnifiedTransformerLMHeadModel.from_pretrained(args.model_name_or_path) - tokenizer = UnifiedTransformerTokenizer.from_pretrained(args.model_name_or_path) - - if world_size > 1: - model = paddle.DataParallel(model) - - train_ds, dev_ds = load_dataset("duconv", split=("train", "dev")) - train_ds, train_data_loader = create_data_loader(train_ds, tokenizer, args, "train") - dev_ds, dev_data_loader = create_data_loader(dev_ds, tokenizer, args, "dev") - - lr_scheduler = NoamDecay(1 / (args.warmup_steps * (args.lr**2)), args.warmup_steps) - # Generate parameter names needed to perform weight decay. - # All bias and LayerNorm parameters are excluded. 
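-    # Note: exclusion is by name substring, so any parameter whose name
-    # contains "bias" or "norm" is skipped by weight decay.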
- decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] - optimizer = AdamW( - learning_rate=lr_scheduler, - parameters=model.parameters(), - weight_decay=args.weight_decay, - apply_decay_param_fun=lambda x: x in decay_params, - grad_clip=nn.ClipGradByGlobalNorm(args.max_grad_norm), - ) - - step = 0 - total_time = 0.0 - best_ppl = 1e9 - for epoch in range(args.epochs): - print("\nEpoch %d/%d" % (epoch + 1, args.epochs)) - batch_start_time = time.time() - for inputs in train_data_loader: - step += 1 - labels = inputs[-1] - - logits = model(*inputs[:-1]) - loss = F.cross_entropy(logits, labels) - loss.backward() - optimizer.step() - lr_scheduler.step() - optimizer.clear_grad() - - total_time += time.time() - batch_start_time - if step % args.logging_steps == 0: - ppl = paddle.exp(loss) - print( - "step %d - loss: %.4f - ppl: %.4f - lr: %.7f - %.3fs/step" - % (step, loss, ppl, optimizer.get_lr(), total_time / args.logging_steps) - ) - total_time = 0.0 - if step % args.save_steps == 0: - ppl = evaluation(model, dev_data_loader) - if dist.get_rank() == 0: - save_ckpt(model, tokenizer, args.save_dir, step) - if ppl < best_ppl: - best_ppl = ppl - save_ckpt(model, tokenizer, args.save_dir, "best") - print("Saved step {} as best model.\n".format(step)) - batch_start_time = time.time() - print("\nTraining completed.") - - -@paddle.no_grad() -def evaluation(model, data_loader): - print("\nEval begin...") - model.eval() - total_tokens = 0 - total_loss = 0.0 - start_time = time.time() - step = 0 - for inputs in data_loader: - step += 1 - labels = inputs[-1] - - logits = model(*inputs[:-1]) - loss = F.cross_entropy(logits, labels, reduction="sum") - - total_loss += loss.item() - total_tokens += labels.shape[0] - - avg_loss = total_loss / total_tokens - ppl = math.exp(avg_loss) - avg_speed = (time.time() - start_time) / step - print("loss: %.4f - ppl: %.4f - %.3fs/step" % (avg_loss, ppl, avg_speed)) - model.train() - return ppl - - -if __name__ == "__main__": - args = parse_args() - print_args(args) - train(args) diff --git a/examples/dialogue/unified_transformer/infer.py b/examples/dialogue/unified_transformer/infer.py deleted file mode 100644 index 3781f73e16aa..000000000000 --- a/examples/dialogue/unified_transformer/infer.py +++ /dev/null @@ -1,146 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
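For reference, the corpus-level perplexity computed by `evaluation` above reduces to the exponential of the mean per-token cross-entropy; a sketch with made-up numbers:

```python
import math

# total_loss: summed token-level cross-entropy; total_tokens: number of target tokens.
def perplexity(total_loss: float, total_tokens: int) -> float:
    return math.exp(total_loss / total_tokens)

print(perplexity(1385.0, 500))  # ~15.96 (illustrative numbers)
```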
- -import argparse -import time - -import paddle -from datasets import load_dataset -from utils import create_data_loader, print_args, select_response, set_seed - -from paddlenlp.metrics import BLEU, Distinct -from paddlenlp.transformers import ( - UnifiedTransformerLMHeadModel, - UnifiedTransformerTokenizer, -) - - -# yapf: disable -def parse_args(): - parser = argparse.ArgumentParser(__doc__) - parser.add_argument('--model_name_or_path', type=str, default='unified_transformer-12L-cn-luge', help='The path or shortcut name of the pre-trained model.') - parser.add_argument('--output_path', type=str, default='./predict.txt', help='The file path where the infer result will be saved.') - parser.add_argument('--logging_steps', type=int, default=100, help='Log every X updates steps.') - parser.add_argument('--seed', type=int, default=2021, help='Random seed for initialization.') - parser.add_argument('--batch_size', type=int, default=16, help='Batch size per GPU/CPU for training.') - parser.add_argument('--max_seq_len', type=int, default=512, help='The maximum sequence length of training.') - parser.add_argument('--max_response_len', type=int, default=128, help='The maximum response sequence length of training.') - parser.add_argument('--max_knowledge_len', type=int, default=256, help='The maximum knowledge sequence length of training.') - parser.add_argument('--min_dec_len', type=int, default=1, help='The minimum sequence length of generation.') - parser.add_argument('--max_dec_len', type=int, default=64, help='The maximum sequence length of generation.') - parser.add_argument('--num_return_sequences', type=int, default=20, help='The numbers of returned sequences for one input in generation.') - parser.add_argument('--decode_strategy', type=str, default='sampling', help='The decode strategy in generation.') - parser.add_argument('--top_k', type=int, default=0, help='The number of highest probability vocabulary tokens to keep for top-k sampling.') - parser.add_argument('--temperature', type=float, default=1.0, help='The value used to module the next token probabilities.') - parser.add_argument('--top_p', type=float, default=1.0, help='The cumulative probability for top-p sampling.') - parser.add_argument('--num_beams', type=int, default=0, help='The number of beams for beam search.') - parser.add_argument('--length_penalty', type=float, default=1.0, help='The exponential penalty to the sequence length for beam search.') - parser.add_argument('--early_stopping', type=eval, default=False, help='Whether to stop the beam search when at least `num_beams` sentences are finished per batch or not.') - parser.add_argument('--device', type=str, default='gpu', help='The device to select for training the model.') - parser.add_argument('--faster', action='store_true', help='Whether to process inference using faster transformer. ') - parser.add_argument('--use_fp16_decoding', action='store_true', help='Whether to use fp16 when using faster transformer. Only works when using faster transformer. ') - - args = parser.parse_args() - return args -# yapf: enable - - -def calc_bleu_and_distinct(preds, targets): - assert len(preds) == len(targets), ( - "The length of pred_responses should be equal to the length of " - "target_responses. 
But received {} and {}.".format(len(preds), len(targets)) - ) - bleu1 = BLEU(n_size=1) - bleu2 = BLEU(n_size=2) - distinct1 = Distinct(n_size=1) - distinct2 = Distinct(n_size=2) - for pred, target in zip(preds, targets): - pred_tokens = pred.split() - target_token = target.split() - - bleu1.add_inst(pred_tokens, [target_token]) - bleu2.add_inst(pred_tokens, [target_token]) - - distinct1.add_inst(pred_tokens) - distinct2.add_inst(pred_tokens) - - print("\n" + "*" * 15) - print("The auto evaluation result is:") - print("BLEU-1:", bleu1.score()) - print("BLEU-2:", bleu2.score()) - print("DISTINCT-1:", distinct1.score()) - print("DISTINCT-2:", distinct2.score()) - - -@paddle.no_grad() -def infer(args): - paddle.set_device(args.device) - set_seed(args.seed) - - model = UnifiedTransformerLMHeadModel.from_pretrained(args.model_name_or_path) - tokenizer = UnifiedTransformerTokenizer.from_pretrained(args.model_name_or_path) - - test_ds = load_dataset("duconv", split="test_1") - test_ds, test_data_loader = create_data_loader(test_ds, tokenizer, args, "test") - - model.eval() - total_time = 0.0 - start_time = time.time() - pred_responses = [] - for step, inputs in enumerate(test_data_loader, 1): - input_ids, token_type_ids, position_ids, attention_mask, seq_len = inputs - output = model.generate( - input_ids=input_ids, - token_type_ids=token_type_ids, - position_ids=position_ids, - attention_mask=attention_mask, - seq_len=seq_len, - max_length=args.max_dec_len, - min_length=args.min_dec_len, - decode_strategy=args.decode_strategy, - temperature=args.temperature, - top_k=args.top_k, - top_p=args.top_p, - num_beams=args.num_beams, - length_penalty=args.length_penalty, - early_stopping=args.early_stopping, - num_return_sequences=args.num_return_sequences, - use_fp16_decoding=args.use_fp16_decoding, - use_fast=args.faster, - ) - - total_time += time.time() - start_time - if step % args.logging_steps == 0: - print("step %d - %.3fs/step" % (step, total_time / args.logging_steps)) - total_time = 0.0 - - ids, scores = output - results = select_response(ids, scores, tokenizer, args.max_dec_len, args.num_return_sequences) - pred_responses.extend(results) - - start_time = time.time() - - with open(args.output_path, "w", encoding="utf-8") as fout: - for response in pred_responses: - fout.write(response + "\n") - print("\nSave inference result into: %s" % args.output_path) - - target_responses = [example["response"] for example in test_ds] - calc_bleu_and_distinct(pred_responses, target_responses) - - -if __name__ == "__main__": - args = parse_args() - print_args(args) - infer(args) diff --git a/examples/dialogue/unified_transformer/interaction.py b/examples/dialogue/unified_transformer/interaction.py deleted file mode 100644 index cde62e5057d6..000000000000 --- a/examples/dialogue/unified_transformer/interaction.py +++ /dev/null @@ -1,107 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
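The `generate` call above returns `num_return_sequences` candidates per source; conceptually, `select_response` groups them and keeps the highest-scoring one. A simplified sketch (not the exact library code, which also penalizes truncation and in-turn repetition):

```python
# Flat lists of decoded strings and their scores, n candidates per input.
def pick_best(responses, scores, n):
    best = []
    for i in range(0, len(responses), n):
        group = list(zip(responses[i:i + n], scores[i:i + n]))
        best.append(max(group, key=lambda pair: pair[1])[0])
    return best

print(pick_best(["a", "b", "c", "d"], [0.1, 0.9, 0.7, 0.2], 2))  # ['b', 'c']
```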
- -import argparse - -import paddle -from termcolor import colored, cprint -from utils import print_args, select_response, set_seed - -from paddlenlp.transformers import ( - UnifiedTransformerLMHeadModel, - UnifiedTransformerTokenizer, -) - - -# yapf: disable -def parse_args(): - parser = argparse.ArgumentParser(__doc__) - parser.add_argument('--model_name_or_path', type=str, default='plato-mini', help='The path or shortcut name of the pre-trained model.') - parser.add_argument('--seed', type=int, default=None, help='Random seed for initialization.') - parser.add_argument('--min_dec_len', type=int, default=1, help='The minimum sequence length of generation.') - parser.add_argument('--max_dec_len', type=int, default=64, help='The maximum sequence length of generation.') - parser.add_argument('--num_return_sequences', type=int, default=20, help='The numbers of returned sequences for one input in generation.') - parser.add_argument('--decode_strategy', type=str, default='sampling', help='The decode strategy in generation.') - parser.add_argument('--top_k', type=int, default=5, help='The number of highest probability vocabulary tokens to keep for top-k sampling.') - parser.add_argument('--temperature', type=float, default=1.0, help='The value used to module the next token probabilities.') - parser.add_argument('--top_p', type=float, default=1.0, help='The cumulative probability for top-p sampling.') - parser.add_argument('--num_beams', type=int, default=0, help='The number of beams for beam search.') - parser.add_argument('--length_penalty', type=float, default=1.0, help='The exponential penalty to the sequence length for beam search.') - parser.add_argument('--early_stopping', type=eval, default=False, help='Whether to stop the beam search when at least `num_beams` sentences are finished per batch or not.') - parser.add_argument('--device', type=str, default='gpu', help='The device to select for training the model.') - - args = parser.parse_args() - return args -# yapf: enable - - -def interaction(args, model, tokenizer): - history = [] - start_info = "Enter [EXIT] to quit the interaction, [NEXT] to start a new conversation." 
- cprint(start_info, "yellow", attrs=["bold"]) - while True: - user_utt = input(colored("[Human]: ", "red", attrs=["bold"])).strip() - if user_utt == "[EXIT]": - break - elif user_utt == "[NEXT]": - history = [] - cprint(start_info, "yellow", attrs=["bold"]) - else: - history.append(user_utt) - inputs = tokenizer.dialogue_encode( - history, add_start_token_as_response=True, return_tensors=True, is_split_into_words=False - ) - inputs["input_ids"] = inputs["input_ids"].astype("int64") - ids, scores = model.generate( - input_ids=inputs["input_ids"], - token_type_ids=inputs["token_type_ids"], - position_ids=inputs["position_ids"], - attention_mask=inputs["attention_mask"], - max_length=args.max_dec_len, - min_length=args.min_dec_len, - decode_strategy=args.decode_strategy, - temperature=args.temperature, - top_k=args.top_k, - top_p=args.top_p, - num_beams=args.num_beams, - length_penalty=args.length_penalty, - early_stopping=args.early_stopping, - num_return_sequences=args.num_return_sequences, - use_fast=True, - ) - bot_response = select_response( - ids, scores, tokenizer, args.max_dec_len, args.num_return_sequences, keep_space=False - )[0] - print(colored("[Bot]:", "blue", attrs=["bold"]), colored(bot_response, attrs=["bold"])) - history.append(bot_response) - return - - -def main(args): - paddle.set_device(args.device) - if args.seed is not None: - set_seed(args.seed) - - # Initialize the model and tokenizer - model = UnifiedTransformerLMHeadModel.from_pretrained(args.model_name_or_path) - tokenizer = UnifiedTransformerTokenizer.from_pretrained(args.model_name_or_path) - - model.eval() - interaction(args, model, tokenizer) - - -if __name__ == "__main__": - args = parse_args() - print_args(args) - main(args) diff --git a/examples/dialogue/unified_transformer/utils.py b/examples/dialogue/unified_transformer/utils.py deleted file mode 100644 index 90585d69e0ee..000000000000 --- a/examples/dialogue/unified_transformer/utils.py +++ /dev/null @@ -1,265 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import random -from functools import partial - -import numpy as np - -import paddle -import paddle.distributed as dist -from paddle.io import DataLoader, DistributedBatchSampler, BatchSampler -from paddlenlp.data import Pad - - -def print_args(args): - print("----------- Configuration Arguments -----------") - for arg, value in sorted(vars(args).items()): - print("%s: %s" % (arg, value)) - print("------------------------------------------------") - - -def set_seed(seed): - # Use the same data seed(for data shuffle) for all procs to guarantee data - # consistency after sharding. - random.seed(seed) - np.random.seed(seed) - # Maybe different op seeds(for dropout) for different procs is better. 
- paddle.seed(seed + dist.get_rank()) - - -def preprocess_examples(examples, mode="train"): - """ - For training set and dev set, treat each utterance of the first speaker as - the response, and concatenate the goal, knowledge and the dialog’s previous - utterances as the history. In this way, multiple history-response pairs - are constructed. - """ - if mode == "test": - return examples - new_examples = {} - goal = [] - knowledge = [] - history = [] - response = [] - - conv = examples["conversation"] - for index, conversation in enumerate(conv): - for i in range(0, len(conversation), 2): - goal.append(examples["goal"][index]) - knowledge.append(examples["knowledge"][index]) - history.append(conversation[:i]) - response.append(conversation[i]) - new_examples["goal"] = goal - new_examples["knowledge"] = knowledge - new_examples["history"] = history - new_examples["response"] = response - - return new_examples - - -def convert_example(example, tokenizer, max_seq_len=512, max_response_len=128, max_knowledge_len=256, mode="train"): - """Convert all examples into necessary features.""" - goal = example["goal"] - knowledge = example["knowledge"] - goal_knowledge = " ".join([" ".join(lst) for lst in goal + knowledge]) - - if mode != "test": - tokenized_example = tokenizer.dialogue_encode( - example["history"], - response=example["response"], - knowledge=goal_knowledge, - task_type="knowledge", - max_seq_len=max_seq_len, - max_response_len=max_response_len, - max_knowledge_len=max_knowledge_len, - return_length=True, - ) - response_start = tokenized_example["input_ids"].index(tokenizer.cls_token_id, 1) - response_end = tokenized_example["seq_len"] - # Use to gather the logits corresponding to the labels during training - tokenized_example["masked_positions"] = list(range(response_start, response_end - 1)) - tokenized_example["labels"] = tokenized_example["input_ids"][response_start + 1 : response_end] - return tokenized_example - else: - tokenized_example = tokenizer.dialogue_encode( - example["history"], - knowledge=goal_knowledge, - task_type="knowledge", - max_seq_len=max_seq_len, - max_knowledge_len=max_knowledge_len, - add_start_token_as_response=True, - return_length=True, - ) - - if "response" in example: - tokenized_example["response"] = example["response"] - return tokenized_example - - -def batchify_fn(batch_examples, pad_val, mode): - def pad_mask(batch_attention_mask): - batch_size = len(batch_attention_mask) - max_len = max(map(len, batch_attention_mask)) - attention_mask = np.ones((batch_size, max_len, max_len), dtype="float32") * -1e9 - for i, mask_data in enumerate(attention_mask): - seq_len = len(batch_attention_mask[i]) - mask_data[-seq_len:, -seq_len:] = np.array(batch_attention_mask[i], dtype="float32") - # In order to ensure the correct broadcasting mechanism, expand one - # dimension to the second dimension (n_head of Transformer). 
- attention_mask = np.expand_dims(attention_mask, axis=1) - return attention_mask - - pad_func = Pad(pad_val=pad_val, pad_right=False, dtype="int64") - - input_ids = pad_func([example["input_ids"] for example in batch_examples]) - token_type_ids = pad_func([example["token_type_ids"] for example in batch_examples]) - position_ids = pad_func([example["position_ids"] for example in batch_examples]) - - attention_mask = pad_mask([example["attention_mask"] for example in batch_examples]) - - if mode != "test": - max_len = max([example["seq_len"] for example in batch_examples]) - masked_positions = np.concatenate( - [ - np.array(example["masked_positions"]) + (max_len - example["seq_len"]) + i * max_len - for i, example in enumerate(batch_examples) - ] - ) - labels = np.concatenate([np.array(example["labels"], dtype="int64") for example in batch_examples]) - return input_ids, token_type_ids, position_ids, attention_mask, masked_positions, labels - else: - seq_len = np.asarray([example["seq_len"] for example in batch_examples]).astype("int32") - return input_ids, token_type_ids, position_ids, attention_mask, seq_len - - -def create_data_loader(dataset, tokenizer, args, mode): - trans_func1 = partial(preprocess_examples, mode=mode) - trans_func2 = partial( - convert_example, - tokenizer=tokenizer, - max_seq_len=args.max_seq_len, - max_response_len=args.max_response_len, - max_knowledge_len=args.max_knowledge_len, - mode=mode, - ) - remove_columns = None - if mode in ["train", "dev"]: - remove_columns = ["id", "conversation"] - - dataset = dataset.map(trans_func1, batched=True, batch_size=None, remove_columns=remove_columns).map(trans_func2) - if mode == "train": - batch_sampler = DistributedBatchSampler(dataset, batch_size=args.batch_size, shuffle=True) - else: - batch_sampler = BatchSampler(dataset, batch_size=args.batch_size, shuffle=False) - collate_fn = partial(batchify_fn, pad_val=tokenizer.pad_token_id, mode=mode) - data_loader = DataLoader(dataset, batch_sampler=batch_sampler, collate_fn=collate_fn, return_list=True) - return dataset, data_loader - - -def post_process_response(token_ids, tokenizer): - """Post-process the decoded sequence. 
Truncate from the first .""" - eos_pos = len(token_ids) - for i, tok_id in enumerate(token_ids): - if tok_id == tokenizer.sep_token_id: - eos_pos = i - break - token_ids = token_ids[:eos_pos] - tokens = tokenizer.convert_ids_to_tokens(token_ids) - tokens = tokenizer.merge_subword(tokens) - return token_ids, tokens - - -def get_in_turn_repetition(pred, is_cn=False): - """Get in-turn repetition.""" - if len(pred) == 0: - return 1.0 - if isinstance(pred[0], str): - pred = [tok.lower() for tok in pred] - if is_cn: - pred = "".join(pred) - tri_grams = set() - for i in range(len(pred) - 2): - tri_gram = tuple(pred[i : i + 3]) - if tri_gram in tri_grams: - return True - tri_grams.add(tri_gram) - return False - - -def select_response(ids, scores, tokenizer, max_dec_len=None, num_return_sequences=1, keep_space=True): - results = [] - group = [] - tmp = [] - if scores is not None: - ids = ids.numpy() - scores = scores.numpy() - - if len(ids) != len(scores) or (len(ids) % num_return_sequences) != 0: - raise ValueError( - "the length of `ids` is {}, but the `num_return_sequences` is {}".format( - len(ids), num_return_sequences - ) - ) - - for pred, score in zip(ids, scores): - pred_token_ids, pred_tokens = post_process_response(pred, tokenizer) - num_token = len(pred_token_ids) - if keep_space: - response = " ".join(pred_tokens) - else: - response = "".join(pred_tokens) - - in_turn_repetition = get_in_turn_repetition(pred_tokens, True) or get_in_turn_repetition(pred_token_ids) - # not ending - if max_dec_len is not None and num_token >= max_dec_len: - score -= 1e3 - elif in_turn_repetition: - score -= 1e3 - - tmp.append([response, score]) - if len(tmp) == num_return_sequences: - group.append(tmp) - tmp = [] - - for preds in group: - preds = sorted(preds, key=lambda x: -x[1]) - results.append(preds[0][0]) - else: - ids = ids.numpy() - - for pred in ids: - pred_token_ids, pred_tokens = post_process_response(pred, tokenizer) - num_token = len(pred_token_ids) - if keep_space: - response = " ".join(pred_tokens) - else: - response = "".join(pred_tokens) - - in_turn_repetition = get_in_turn_repetition(pred_tokens, True) or get_in_turn_repetition(pred_token_ids) - - last_pos = 0 - if (max_dec_len is not None and num_token >= max_dec_len) or in_turn_repetition: - tmp.append([response]) - else: - tmp.insert(last_pos, [response]) - last_pos += 1 - - if len(tmp) == num_return_sequences: - group.append(tmp) - tmp = [] - - for preds in group: - results.append(preds[0][0]) - return results diff --git a/examples/few_shot/README.md b/examples/few_shot/README.md deleted file mode 100644 index 317ce55d0683..000000000000 --- a/examples/few_shot/README.md +++ /dev/null @@ -1,34 +0,0 @@ -# Few-Shot Learning (FSL) - -Few-Shot Learning 旨在研究如何从少量有监督的训练样本中学习出具有良好泛化性的模型,对训练数据很少或监督数据获取成本极高的应用场景有很大价值。 - -随着大规模预训练模型的不断涌现,FSL 结合预训练模型的先验知识和强大的泛化能力在下游任务效果上取得了显著提升,为大规模预训练模型结合 FSL 的工业落地应用带来了无限可能性。 - -我们旨在为 FSL 领域的研究者提供简单易用、全面、前沿的 FSL 策略库,便于研究者基于 FSL 策略库将注意力集中在算法创新上。我们会持续开源 FSL 领域的前沿学术工作,并在中文小样本学习测评基准 [FewCLUE](https://github.com/CLUEbenchmark/FewCLUE) 上进行评测。 - -## Benchmark -我们在 FewCLUE 9 个任务的 test_public.json 测试集上进行了效果评测 - -| 算法 | 预训练模型 | eprstmt | csldcp | iflytek | tnews | ocnli | bustm | chid | csl | cluewsc | -| ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ |------------ | ------------ | ---------- | -| PET | ERNIE-1.0-Large-CW | 88.03 | 63.79 | 56.43 | 56.57 | 56.27 | 72.69 | 91.39 | 76.00 | 78.79 | -| P-Tuning | ERNIE-1.0-Large-CW | 89.84 | 64.57 | 45.80 | 
57.41 | 44.13 | 68.51 | 90.00 | 74.67 | 73.26 | -| EFL | ERNIE-1.0-Large-CW | 90.82 | 54.48 | 46.71 | 54.43 | 43.17 | 72.63 | 85.71 | 61.52 | 80.02 | - -**注释**: -- 表格中 CHID 数据集的指标与 FewCLUE 榜单指标计算方式不同。 -- 由于 iflytek 和 csldcp 标签数较多,每条样本采样 5 个非正确标签作为负样本训练评测。 -- 为统一配置,除 EFL-iflytek 外均训练 1000 steps,EFL-iflytek 训练 5000 steps。 - -## Models -- [P-tuning](./p-tuning) -- [EFL](./efl) -- [PET](./pet) - -## References - -- [1] X. Liu et al., “GPT Understands, Too,” arXiv:2103.10385 [cs], Mar. 2021, Accessed: Mar. 22, 2021. [Online]. Available: http://arxiv.org/abs/2103.10385. - -- [2] Wang, Sinong, Han Fang, Madian Khabsa, Hanzi Mao, and Hao Ma. “Entailment as Few-Shot Learner.” ArXiv:2104.14690 [Cs], April 29, 2021. http://arxiv.org/abs/2104.14690. - -- [3] Schick, Timo, and Hinrich Schütze. “Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference.” ArXiv:2001.07676 [Cs], January 25, 2021. http://arxiv.org/abs/2001.07676. diff --git a/examples/few_shot/RGL/README.md b/examples/few_shot/RGL/README.md deleted file mode 100644 index 14c183d8a6ed..000000000000 --- a/examples/few_shot/RGL/README.md +++ /dev/null @@ -1,129 +0,0 @@ -# RGL: A Simple yet Effective Relation Graph Augmented Prompt-based Tuning Approach for Few-Shot Learning - -This is the implementation of the paper [RGL: A Simple yet Effective Relation Graph Augmented Prompt-based Tuning Approach for Few-Shot Learning](https://aclanthology.org/2022.findings-naacl.81/). - -**************************** Updates ***************************** - -2022-07-11: Our training code has been released. - -2022-04-08: Our paper has been accepted to Findings of [NAACL 2022](https://aclanthology.org/2022.findings-naacl.81/)! - -# Overview - -

-*(Figure: overview of the RGL approach)*

-
-We propose a simple yet effective Relation Graph augmented Learning (RGL) method that obtains better performance on few-shot natural language understanding tasks.
-
-RGL constructs a relation graph based on the label consistency between samples in the same batch, and learns to solve the resulting node classification and link prediction problems on the relation graphs. In this way, RGL fully exploits the limited supervised information, which boosts the tuning effectiveness.
-
-# Prepare the data
-
-We evaluate on the GLUE variant for few-shot learning in the paper, including SST-2, SST-5, MR, CR, MPQA, Subj, TREC, CoLA, MNLI, MNLI-mm, SNLI, QNLI, RTE, MRPC, QQP and STS-B. Please download the [datasets](https://paddlenlp.bj.bcebos.com/datasets/k-shot-glue/rgl-k-shot.zip) and extract the data files to the path ``./data/k-shot``.
-
-
-# Experiments
-
-The structure of the code:
-
-```
-├── scripts/
-│   ├── run_pet.sh  # Script for PET
-│   └── run_rgl.sh  # Script for RGL
-├── template.py     # The parser for prompt templates
-├── verbalizer.py   # The mapping from labels to corresponding words
-├── tokenizer.py    # The tokenizer wrapper that conducts text truncation
-├── utils.py        # The tools
-└── rgl.py          # The training process of RGL
-```
-
-## How to define a template
-
-Following [OpenPrompt](https://github.com/thunlp/OpenPrompt/tree/main), we define a template as a list of dictionaries. The key for raw text from the dataset is `text`, and the corresponding value is the keyword of that text in the loaded dataset, where by default we use `text_a` to denote the first sentence in every example and `text_b` to denote the other sentences.
-
-For example, given the template ``{'text':'text_a'} It was {'mask'}.`` and a sample text ``nothing happens , and it happens to flat characters .``, the input text will be ``nothing happens , and it happens to flat characters . It was <mask>.``, where ``<mask>`` is the tokenizer's mask token (see the sketch further below).
-
-
-## Quick start
-
-Run the following command for prompt-tuning.
-
-```
-export CUDA_VISIBLE_DEVICES=0
-python rgl.py \
---output_dir ./checkpoints/ \
---dataset SST-2 \
---data_path ./data/k-shot/SST-2/16-13/ \
---max_seq_length 128 \
---max_steps 1000 \
---logging_step 10 \
---eval_step 100 \
---batch_size 4 \
---alpha 0.1 \
---seed 13 \
---learning_rate 1e-5 \
---template "{'text':'text_a'} It was {'mask'}." \
---verbalizer "{'0':'terrible','1':'great'}"
-```
-
-The configurations consist of:
-- ``output_dir``: The directory to save model checkpoints.
-- ``dataset``: The dataset name for few-shot learning.
-- ``data_path``: The path to the data files of ``dataset``.
-- ``max_seq_length``: The maximum length of input text, including the prompt.
-- ``max_steps``: The maximum number of training steps.
-- ``logging_step``: Print logs every ``logging_step`` steps.
-- ``eval_step``: Evaluate the model every ``eval_step`` steps.
-- ``batch_size``: The number of samples per batch.
-- ``alpha``: The weight of the loss proposed in RGL.
-- ``seed``: Random seed.
-- ``learning_rate``: The learning rate for tuning.
-- ``template``: The template that defines how to combine text data and prompt.
-- ``verbalizer``: The verbalizer that maps labels to words in the vocabulary.
-
-
-## Multiple runs for the best results
-
-To reproduce our experiments, you can use the scripts to get the results under different settings. We have defined the templates and the verbalizers in both ``./scripts/run_pet.sh`` and ``./scripts/run_rgl.sh``. You can refer to these scripts for more details.
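-
-A minimal, hypothetical sketch of the template wrapping described in "How to define a template" above. It assumes PaddleNLP is installed and that you run it from this directory so ``template.py`` and ``data.py`` are importable; the exact mask token string depends on the tokenizer:
-
-```python
-from paddlenlp.transformers import AutoTokenizer
-
-from data import InputExample
-from template import ManualTemplate
-
-tokenizer = AutoTokenizer.from_pretrained("roberta-large")
-template = ManualTemplate(tokenizer, "{'text':'text_a'} It was {'mask'}.")
-example = InputExample(
-    uid="demo-0",
-    text_a="nothing happens , and it happens to flat characters .",
-    cls_label="0",
-)
-
-# wrap_one_example() splits the rendered template into the parts to tokenize
-# (text pieces with mask/shortenable flags) and the remaining fields (labels).
-to_tokenize, rest = template.wrap_one_example(example)
-print([part["text"] for part in to_tokenize])
-# e.g. ['nothing happens , and it happens to flat characters .', ' It was', '<mask>', '.']
-# for a RoBERTa tokenizer.
-```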
- -### Run PET - -``` -bash ./scripts/run_pet.sh SST-2 0 -``` - -where ``SST-2`` specifies the dataset used for prompt-tuning and you can replace it with any other downloaded datasets in ``./data/k-shot/ ``. Besides, ``0`` refers to the gpu device id. - -**NOTE**: The dataset name is case-sensitive to run the scripts. - -### Run RGL - -``` -bash ./scripts/run_rgl.sh SST-2 0 -``` - -Please see the descriptions above for the arguments. - - -# Citation - -Please cite our paper if you use RGL in your work: -``` -@inproceedings{wang-etal-2022-rgl, - title = "{RGL}: A Simple yet Effective Relation Graph Augmented Prompt-based Tuning Approach for Few-Shot Learning", - author = "Wang, Yaqing and - Tian, Xin and - Xiong, Haoyi and - Li, Yueyang and - Chen, Zeyu and - Guo, Sheng and - Dou, Dejing", - booktitle = "Findings of the Association for Computational Linguistics: NAACL 2022", - year = "2022", - publisher = "Association for Computational Linguistics", - url = "https://aclanthology.org/2022.findings-naacl.81", - pages = "1078--1084", -} - -``` diff --git a/examples/few_shot/RGL/data.py b/examples/few_shot/RGL/data.py deleted file mode 100644 index 32efac286aad..000000000000 --- a/examples/few_shot/RGL/data.py +++ /dev/null @@ -1,496 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import csv -import json -import os -from abc import abstractmethod -from collections import defaultdict -from dataclasses import dataclass, field - -import paddle -import pandas as pd -from paddle.metric import Accuracy - -from paddlenlp.datasets import MapDataset -from paddlenlp.metrics import AccuracyAndF1, Mcc, PearsonAndSpearman - - -@dataclass -class InputExample(object): - """Data structure of every example in datasets.""" - - uid: str = field(default=None, metadata={"help": "A unique identifier of the example."}) - text_a: str = field(default=None, metadata={"help": "The first text sequence in each example."}) - text_b: str = field(default=None, metadata={"help": "The other text sequences in each example."}) - cls_label: int = field(default=None, metadata={"help": "The label of classification tasks."}) - seq_label: list = field(default=None, metadata={"help": "The label of generation tasks."}) - meta: dict = field(default=None, metadata={"help": "An optional dictionary of other data for each example."}) - - def __repr__(self): - content = {k: v for k, v in self.__dict__.items() if v is not None} - content = json.dumps(content, indent=2, sort_keys=True) + "\n" - return str(content) - - def keys(self, keep_none=False): - return [key for key in self.__dict__.keys() if getattr(self, key) is not None] - - -class InputFeatures(dict): - """ - Data structure of every wrapped example or a batch of examples as the input of model. - - Args: - input_ids (paddle.Tensor): - The token ids. - attention_mask (paddle.Tensor): - The mask ids. - token_type_ids (paddle.Tensor, optional): - The token type ids. 
- input_embeds (paddle.Tensor, optional): - The embeddings of soft tokens. - mask_ids (paddle.Tensor, optional): - The mask ids where 1 denotes that a token is a mask, 0 denotes it is not a mask. - cls_label (list, optional): - The label of classification task. - seq_label (list, optional): - The label of generation task. - uid (list, optional): - The unique id(s) for example(s). - """ - - input_keys = [ - "input_ids", - "attention_mask", - "token_type_ids", - "input_embeds", - "cls_label", - "seq_label", - "label", - "uid", - "mask_ids", - "soft_token_ids", - ] - - def __init__( - self, - input_ids=None, - attention_mask=None, - token_type_ids=None, - input_embeds=None, - mask_ids=None, - label=None, - cls_label=None, - seq_label=None, - uid=None, - soft_token_ids=None, - ): - self.input_ids = input_ids - self.attention_mask = attention_mask - self.token_type_ids = token_type_ids - self.input_embeds = input_embeds - self.label = label - self.cls_label = cls_label - self.seq_label = seq_label - self.mask_ids = mask_ids - self.uid = uid - self.soft_token_ids = soft_token_ids - - @classmethod - def add_keys(cls, *args): - cls.input_keys.extend(args) - - def keys(self, keep_none=False): - if keep_none: - return self.input_keys - else: - return [key for key in self.input_keys if getattr(self, key) is not None] - - def values(self, keep_none=False): - return [getattr(self, key) for key in self.keys(keep_none=keep_none)] - - def items(self): - return [(key, getattr(self, key)) for key in self.keys()] - - def __len__(self): - return len(self.keys()) - - def __repr__(self): - return str(json.dumps(self.items()) + "\n") - - def __getitem__(self, key): - return getattr(self, key) - - def __iter__(self): - return iter(self.keys()) - - def __contains__(self, key, keep_none): - return key in self.keys(keep_none) - - def __setitem__(self, key, value): - if key not in self.input_keys: - raise KeyError("{} not in predefined keys, use add_keys to add it.".format(key)) - setattr(self, key, value) - - @staticmethod - def collate_fn(batch): - """Collate batch data in form of InputFeatures.""" - new_batch = {} - for key in batch[0]: - values = [b[key] for b in batch] - try: - new_batch[key] = paddle.to_tensor(values) - except ValueError: - new_batch[key] = values - - return InputFeatures(**new_batch) - - -class DataProcessor(object): - """Base class for reading datasets from files.""" - - def __init__(self, labels=None): - self._labels = labels - if labels is not None: - self._labels = sorted(labels) - - @property - def labels(self): - if not getattr(self, "_labels"): - raise ValueError("labels and label_mappings are not setted yet.") - return self._labels - - @labels.setter - def labels(self, labels): - if labels is not None: - self._labels = sorted(labels) - - @property - def label_mapping(self): - if not getattr(self, "_labels"): - raise ValueError("labels and label_mappings are not setted yet.") - if not getattr(self, "_label_mapping"): - self._label_mapping = {k: i for i, k in enumerate(self._labels)} - return self._label_mapping - - @label_mapping.setter - def label_mapping(self, label_mapping): - if getattr(self, "_labels"): - assert self._labels == sorted(list(label_mapping.keys())) - self._label_mapping = label_mapping - - @abstractmethod - def get_examples(self, data_dir, split): - raise NotImplementedError - - def get_train_examples(self, data_dir): - return self.get_examples(data_dir, "train") - - def get_dev_examples(self, data_dir): - return self.get_examples(data_dir, "dev") - - def 
get_test_exaples(self, data_dir): - return self.get_examples(data_dir, "test") - - @classmethod - def read_tsv(cls, input_file, quotechar=None): - with open(input_file, "r", encoding="utf-8-sig") as f: - data = csv.reader(f, delimiter="\t", quotechar=quotechar) - return [x for x in data] - - @classmethod - def read_csv(cls, input_file, header=None): - data = pd.read_csv(input_file, header=header) - return data.values.tolist() - - @classmethod - def read_json(cls, input_file): - with open(input_file, "r") as f: - data = [json.loads(x) for x in f.readlines()] - return data - - -class BoolQProcessor(DataProcessor): - def __init__(self): - super().__init__(["False", "True"]) - self.split_map = {"train": "train", "dev": "dev32", "test": "val"} - - def get_examples(self, data_dir, split): - split = self.split_map[split] - raw_data = self.read_json(os.path.join(data_dir, split + ".jsonl")) - examples = [] - for i, line in enumerate(raw_data): - examples.append( - InputExample( - uid="%s-%d" % (split, i), - text_a=line["passage"], - text_b=line["question"], - cls_label=str(line["label"]), - ) - ) - - return examples - - -class MrpcProcesser(DataProcessor): - def __init__(self): - super().__init__(["0", "1"]) - - def get_examples(self, data_dir, split): - raw_data = self.read_tsv(os.path.join(data_dir, split + ".tsv")) - examples = [] - for i, line in enumerate(raw_data): - if i == 0: - continue - examples.append(InputExample(uid="%s-%d" % (split, i), text_a=line[3], text_b=line[4], cls_label=line[0])) - - return examples - - -class MnliProcessor(DataProcessor): - def __init__(self): - super().__init__(["contradiction", "entailment", "neutral"]) - - def _process_file(self, split): - if split in ["dev", "test"]: - return split + "_matched" - return split - - def get_examples(self, data_dir, split): - split = self._process_file(split) - raw_data = self.read_tsv(os.path.join(data_dir, split + ".tsv")) - examples = [] - for i, line in enumerate(raw_data): - if i == 0: - continue - examples.append( - InputExample(uid="%s-%s" % (split, line[0]), text_a=line[8], text_b=line[9], cls_label=line[-1]) - ) - return examples - - -class MnliMismatchedProcessor(MnliProcessor): - def _process_file(self, split): - if split == "dev": - return split + "_matched" - if split == "test": - return split + "_mismatched" - return split - - -class SnliProcessor(DataProcessor): - def __init__(self): - super().__init__(["contradiction", "entailment", "neutral"]) - - def get_examples(self, data_dir, split): - raw_data = self.read_tsv(os.path.join(data_dir, split + ".tsv")) - examples = [] - for i, line in enumerate(raw_data): - if i == 0: - continue - examples.append( - InputExample(uid="%s-%s" % (split, line[0]), text_a=line[7], text_b=line[8], cls_label=line[-1]) - ) - return examples - - -class ColaProcessor(DataProcessor): - def __init__(self): - super().__init__(["0", "1"]) - - def get_examples(self, data_dir, split): - raw_data = self.read_tsv(os.path.join(data_dir, split + ".tsv")) - examples = [] - for i, line in enumerate(raw_data): - examples.append(InputExample(uid="%s-%d" % (split, i), text_a=line[3], text_b=None, cls_label=line[1])) - return examples - - -class Sst2Processor(DataProcessor): - def __init__(self): - super().__init__(["0", "1"]) - - def get_examples(self, data_dir, split): - raw_data = self.read_tsv(os.path.join(data_dir, split + ".tsv")) - examples = [] - for i, line in enumerate(raw_data): - if i == 0: - continue - examples.append(InputExample(uid="%s-%d" % (split, i), text_a=line[0], text_b=None, 
cls_label=line[1])) - return examples - - -class StsbProcessor(DataProcessor): - def __init__(self): - super().__init__(["0", "1"]) - - def get_examples(self, data_dir, split): - raw_data = self.read_tsv(os.path.join(data_dir, split + ".tsv")) - examples = [] - for i, line in enumerate(raw_data): - if i == 0: - continue - examples.append( - InputExample(uid="%s-%s" % (split, line[0]), text_a=line[7], text_b=line[8], cls_label=line[-1]) - ) - return examples - - -class QqpProcessor(DataProcessor): - def __init__(self): - super().__init__(["0", "1"]) - - def get_examples(self, data_dir, split): - raw_data = self.read_tsv(os.path.join(data_dir, split + ".tsv")) - examples = [] - for i, line in enumerate(raw_data): - if i == 0: - continue - try: - examples.append( - InputExample(uid="%s-%s" % (split, line[0]), text_a=line[3], text_b=line[4], cls_label=line[5]) - ) - except IndexError: - continue - return examples - - -class QnliProcessor(DataProcessor): - def __init__(self): - super().__init__(["entailment", "not_entailment"]) - - def get_examples(self, data_dir, split): - raw_data = self.read_tsv(os.path.join(data_dir, split + ".tsv")) - examples = [] - for i, line in enumerate(raw_data): - if i == 0: - continue - examples.append( - InputExample(uid="%s-%s" % (split, line[0]), text_a=line[1], text_b=line[2], cls_label=line[-1]) - ) - return examples - - -class RteProcessor(DataProcessor): - def __init__(self): - super().__init__(["entailment", "not_entailment"]) - - def get_examples(self, data_dir, split): - raw_data = self.read_tsv(os.path.join(data_dir, split + ".tsv")) - examples = [] - for i, line in enumerate(raw_data): - if i == 0: - continue - examples.append( - InputExample(uid="%s-%s" % (split, line[0]), text_a=line[1], text_b=line[2], cls_label=line[-1]) - ) - return examples - - -class WnliProcessor(DataProcessor): - def __init__(self): - super().__init__(["0", "1"]) - - def get_examples(self, data_dir, split): - raw_data = self.read_tsv(os.path.join(data_dir, split + ".tsv")) - examples = [] - for i, line in enumerate(raw_data): - if i == 0: - continue - examples.append( - InputExample(uid="%s-%s" % (split, line[0]), text_a=line[1], text_b=line[2], cls_label=line[-1]) - ) - return examples - - -class TextClassificationProcessor(DataProcessor): - def __init__(self, task_name): - NUM_LABELS = {"mr": 2, "sst-5": 5, "subj": 2, "trec": 6, "cr": 2, "mpqa": 2} - assert task_name in NUM_LABELS, "task_name not supported." - self.task_name = task_name - self._labels = list(range(NUM_LABELS[self.task_name])) - - def get_examples(self, data_dir, split): - raw_data = self.read_csv(os.path.join(data_dir, split + ".csv")) - examples = [] - for i, line in enumerate(raw_data): - examples.append(InputExample(uid="%s-%d" % (split, i), text_a=line[1], cls_label=line[0])) - return examples - - -# The processor mapping for datasets in RGL paper. 
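-# NOTE: keys must be lowercase; check_args() in utils.py lower-cases the
-# --dataset argument before it is used to index this mapping.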
-PROCESSOR_MAPPING = { - "mrpc": MrpcProcesser(), - "mnli": MnliProcessor(), - "mnli-mm": MnliMismatchedProcessor(), - "snli": SnliProcessor(), - "cola": ColaProcessor(), - "sst-2": Sst2Processor(), - "sts-b": StsbProcessor(), - "qqp": QqpProcessor(), - "qnli": QnliProcessor(), - "rte": RteProcessor(), - "wnli": WnliProcessor(), - "cr": TextClassificationProcessor("cr"), - "mr": TextClassificationProcessor("mr"), - "sst-5": TextClassificationProcessor("sst-5"), - "subj": TextClassificationProcessor("subj"), - "mpqa": TextClassificationProcessor("mpqa"), - "trec": TextClassificationProcessor("trec"), - "boolq": BoolQProcessor(), -} - -# The task mapping for datasets. -TASK_MAPPING = defaultdict(lambda: "classification") -TASK_MAPPING["sts-b"] = "regression" - -# The metric mapping for datasets. -METRIC_MAPPING = defaultdict(Accuracy) -METRIC_MAPPING.update( - { - "mrpc": AccuracyAndF1(name=["acc", "precision", "recall", "f1", "acc_and_f1"]), - "qqp": AccuracyAndF1(name=["acc", "precision", "recall", "f1", "acc_and_f1"]), - "cola": Mcc(), - "sts-b": PearsonAndSpearman(name=["pearson", "spearman", "corr"]), - } -) - - -def load_dataset(dataset, data_path=None, splits=[]): - """ - Read datasets from files. - - Args: - dataset (str): - The dataset name in lowercase. - data_path (str): - The path to the dataset directory, including train, dev or test file. - splits (list): - Which file(s) of dataset to read, such as ['train', 'dev', 'test']. - - """ - assert len(splits) > 0, "No splits, can not load dataset {}".format(dataset) - processor = PROCESSOR_MAPPING[dataset] - data = [] - if "train" in splits: - train_examples = processor.get_train_examples(data_path) - data.append(MapDataset(train_examples)) - if "dev" in splits: - dev_examples = processor.get_dev_examples(data_path) - data.append(MapDataset(dev_examples)) - if "test" in splits: - test_examples = processor.get_test_exaples(data_path) - data.append(MapDataset(test_examples)) - data.append(processor.labels) - return data diff --git a/examples/few_shot/RGL/rgl.py b/examples/few_shot/RGL/rgl.py deleted file mode 100644 index dd137c71b700..000000000000 --- a/examples/few_shot/RGL/rgl.py +++ /dev/null @@ -1,239 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
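-
-# rgl.py trains a prompt-tuned masked LM; when --alpha > 0 it adds the
-# relation-graph (contrastive) loss from the RGL paper, and alpha == 0
-# recovers plain prompt tuning (PET).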
- -import argparse -import os -from functools import partial - -import numpy as np -import paddle -import paddle.nn as nn -from data import METRIC_MAPPING, TASK_MAPPING, InputFeatures, load_dataset -from template import ManualTemplate -from tokenizer import MLMTokenizerWrapper -from utils import ( - LinearSchedulerWarmup, - check_args, - convert_example, - create_dataloader, - set_seed, -) -from verbalizer import ManualVerbalizer -from visualdl import LogWriter - -from paddlenlp.transformers import AutoModelForMaskedLM, AutoTokenizer -from paddlenlp.utils.log import logger - -# yapf: disable -parser = argparse.ArgumentParser('Implementation of RGL paper.') -parser.add_argument('--seed', type=int, default=1000, help='Random seed.') -parser.add_argument('--device', type=str, default='gpu', choices=['gpu', 'cpu'], help='Device for training, default to gpu.') -parser.add_argument('--dataset', type=str, default='SST-2', help='The build-in few-shot dataset.') -parser.add_argument('--data_path', type=str, default=None, help='The path to local dataset in .tsv files.') - -parser.add_argument('--model_name_or_path', type=str, default='roberta-large', help='The build-in pretrained LM or the path to local model parameters.') -parser.add_argument('--template', type=str, default="{'text':'text_a'} It was {'mask'}.", help='The input template.') -parser.add_argument('--verbalizer', type=str, default="{'0':'terrible', '1':'great'}", help='The label mapping of output.') -parser.add_argument('--alpha', type=float, default=0, help='The weight of link prediction loss in RGL.') -parser.add_argument('--max_seq_length', type=int, default=512, help='The maximum length of input text.') -parser.add_argument('--max_grad_norm', type=float, default=1.0, help='The maximum norm of all parameters.') - -parser.add_argument('--num_epoch', type=int, default=0, help='The number of epoch for training.') -parser.add_argument('--max_steps', type=int, default=1000, help='Maximum steps, which overwrites num_epoch.') -parser.add_argument('--batch_size', type=int, default=32, help='The number of samples used per step.') -parser.add_argument('--learning_rate', type=float, default=1e-5, help='The learning rate of optimizer.') -parser.add_argument('--weight_decay', type=float, default=0.0, help='Weight decay if we apply some.') -parser.add_argument('--warmup_steps', type=int, default=0, help='The warmup steps for leanring rate scheduler.') -parser.add_argument('--logging_step', type=int, default=100, help='Print logs every logging_step steps.') -parser.add_argument('--eval_step', type=int, default=100, help='Evaluate model every eval_step steps.') -parser.add_argument('--save_best', action='store_true', help='Save the best model according to evaluation results. 
Save the last checkpoint if False.') -parser.add_argument('--output_dir', type=str, default='./checkpoints/', help='The path to save checkpoints.') -parser.add_argument('--overwrite_output', action='store_true', help='Whether overwrite the output_dir.') -args = parser.parse_args() -# yapf: enable - -check_args(args) -for arg in vars(args): - logger.info(format(arg, "<20") + format(str(getattr(args, arg)), "<")) - - -@paddle.no_grad() -def evaluate(model, dataloader, metric, verbalizer, task_type, bound=(0, 5)): - if task_type == "regression": - logsoftmax = nn.LogSoftmax(axis=-1) - lb, ub = bound - model.eval() - metric.reset() - for batch in dataloader: - logits = model(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"]) - label_logits = verbalizer.process_logits(logits, batch["mask_ids"]) - if task_type == "regression": - label_logits = logsoftmax(label_logits) - label_logits = paddle.exp(label_logits[..., 1].unsqueeze(-1)) * (ub - lb) + lb - correct = metric.compute(label_logits, batch["label"]) - metric.update(correct) - score = metric.accumulate() - score = score if isinstance(score, (list, tuple)) else [score] - logger.info("{:>20}".format("Evaluation results:")) - for name, value in zip(metric.name(), score): - logger.info("{:>20} = {:.6f}".format(name, value)) - model.train() - return score[0] - - -def contrastive_loss(sentence_embeddings, labels, task_type="classification"): - """Compute the loss proposed in RGL method.""" - - def _raw_equal(x, y): - return int(x == y) - - def _max_equal(x, y): - return int(np.argmax(x, axis=0) == np.argmax(y, axis=0)) - - equal_int = _raw_equal if task_type == "classification" else _max_equal - bce_metric = nn.CrossEntropyLoss() - cos_metric = nn.CosineSimilarity(axis=0, eps=1e-6) - batch_size = sentence_embeddings.shape[0] - loss = 0 - for i in range(batch_size): - for j in range(batch_size): - score = cos_metric(sentence_embeddings[i], sentence_embeddings[j]) - score = score.unsqueeze(0) - logits = paddle.concat([(1 - score) * 50, (1 + score) * 50], axis=-1) - label = paddle.to_tensor(equal_int(labels[i], labels[j])) - loss += bce_metric(logits.reshape([-1, logits.shape[-1]]), label.unsqueeze(0)) - loss = loss / (batch_size * (batch_size - 1)) - loss = loss / 100 - return loss - - -def main(): - paddle.set_device(args.device) - set_seed(args.seed) - - task_type = TASK_MAPPING[args.dataset] - model = AutoModelForMaskedLM.from_pretrained(args.model_name_or_path) - tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path) - tokenizer_wrapper = MLMTokenizerWrapper(args.max_seq_length, tokenizer) - - train_ds, dev_ds, test_ds, label_list = load_dataset( - args.dataset, data_path=args.data_path, splits=["train", "dev", "test"] - ) - - template = ManualTemplate(tokenizer, args.template) - logger.info("Set template: {}".format(template.template)) - verbalizer = ManualVerbalizer(tokenizer, labels=label_list, label_to_words=eval(args.verbalizer), prefix=" ") - logger.info("Set verbalizer: {}".format(args.verbalizer)) - - trans_fn = partial(convert_example, template=template, verbalizer=verbalizer, tokenizer_wrapper=tokenizer_wrapper) - - train_loader = create_dataloader(train_ds, "train", args.batch_size, InputFeatures.collate_fn, trans_fn) - dev_loader = create_dataloader(dev_ds, "dev", args.batch_size, InputFeatures.collate_fn, trans_fn) - test_loader = create_dataloader(test_ds, "test", args.batch_size, InputFeatures.collate_fn, trans_fn) - if args.max_steps > 0: - num_epoch = args.max_steps // len(train_loader) + 
int(args.max_steps % len(train_loader) > 0) - max_steps = args.max_steps - else: - num_epoch = args.num_epoch - max_steps = args.num_epoch * len(train_loader) - - lr_scheduler = LinearSchedulerWarmup(args.learning_rate, args.warmup_steps, max_steps) - decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] - optimizer = paddle.optimizer.AdamW( - learning_rate=lr_scheduler, - parameters=model.parameters(), - weight_decay=args.weight_decay, - grad_clip=paddle.nn.ClipGradByGlobalNorm(args.max_grad_norm), - apply_decay_param_fun=lambda x: x in decay_params, - ) - - metric_fn = METRIC_MAPPING[args.dataset] - if task_type == "regression": - loss_fn = nn.KLDivLoss() - lb, ub = 0, 5 - logsoftmax = nn.LogSoftmax(axis=-1) - else: - loss_fn = nn.CrossEntropyLoss() - with LogWriter(logdir="./log/pet/train") as writer: - best_metric = -float("inf") - global_step = 1 - global_loss = 0 - for epoch in range(1, num_epoch + 1): - for step, batch in enumerate(train_loader, start=1): - writer.add_scalar("train/lr", lr_scheduler.get_lr(), global_step) - - logits = model(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"]) - label_logits = verbalizer.process_logits(logits, batch["mask_ids"]) - if task_type == "regression": - label_logits = logsoftmax(label_logits) - - labels = paddle.stack( - [ - 1 - (batch["label"].reshape([-1]) - lb) / (ub - lb), - (batch["label"].reshape([-1]) - lb) / (ub - lb), - ], - axis=-1, - ) - loss = loss_fn(label_logits.reshape([-1, 2]), labels) - else: - labels = paddle.to_tensor(batch["label"], dtype="int64") - loss = loss_fn(label_logits.reshape([-1, label_logits.shape[-1]]), labels.reshape([-1])) - if args.alpha > 0: - con_loss = contrastive_loss(logits, labels, task_type=task_type) - loss += args.alpha * con_loss - global_loss += loss.item() - - loss.backward() - optimizer.step() - lr_scheduler.step() - optimizer.clear_grad() - - writer.add_scalar("train/loss", loss.item(), global_step) - - if global_step % args.logging_step == 0: - avg_loss = global_loss / args.logging_step - logger.info( - "Epoch: {:3d}/{:3d}, Global Step: {:4d}, Loss: {:e}".format( - epoch, num_epoch, global_step, avg_loss - ) - ) - global_loss = 0 - - if global_step % args.eval_step == 0: - logger.info("{0:-^30}".format(" Validate ")) - value = evaluate(model, dev_loader, metric_fn, verbalizer, task_type) - if args.save_best and value > best_metric: - best_metric = value - save_path = os.path.join(args.output_dir, "model_best") - if not os.path.exists(save_path): - os.makedirs(save_path) - model.save_pretrained(save_path) - tokenizer.save_pretrained(save_path) - - global_step += 1 - if global_step > max_steps: - break - - logger.info("{0:-^30}".format(" Test ")) - evaluate(model, test_loader, metric_fn, verbalizer, task_type) - if not args.save_best: - save_path = os.path.join(args.output_dir, "model_last") - if not os.path.exists(save_path): - os.makedirs(save_path) - model.save_pretrained(save_path) - tokenizer.save_pretrained(save_path) - - -if __name__ == "__main__": - main() diff --git a/examples/few_shot/RGL/scripts/run_pet.sh b/examples/few_shot/RGL/scripts/run_pet.sh deleted file mode 100644 index ca50d8c6bd00..000000000000 --- a/examples/few_shot/RGL/scripts/run_pet.sh +++ /dev/null @@ -1,114 +0,0 @@ -dataset=$1 -device=$2 - -MAX_LEN=128 -dataname=$dataset - -case $dataset in - CoLA) - temp="{'text':'text_a'} This is {'mask'}." 
- verb="{'0':'incorrect','1':'correct'}" - ;; - MRPC) - temp="{'text':'text_a'}{'mask'},{'text':'text_b'}" - verb="{'0':'No','1':'Yes'}" - ;; - QQP) - temp="{'text':'text_a'}{'mask'},{'text':'text_b'}" - verb="{'0':'No','1':'Yes'}" - ;; - STS-B) - temp="{'text':'text_a'}{'mask'},{'text':'text_b'}" - verb="{'0':'No','1':'Yes'}" - ;; - MNLI) - temp="{'text':'text_a'}?{'mask'},{'text':'text_b'}" - verb="{'contradiction':'No','entailment':'Yes','neutral':'Maybe'}" - MAX_LEN=256 - ;; - MNLI-mm) - temp="{'text':'text_a'}?{'mask'},{'text':'text_b'}" - verb="{'contradiction':'No','entailment':'Yes','neutral':'Maybe'}" - MAX_LEN=256 - dataname='MNLI' - ;; - SNLI) - temp="{'text':'text_a'}?{'mask'},{'text':'text_b'}" - verb="{'contradiction':'No','entailment':'Yes','neutral':'Maybe'}" - MAX_LEN=256 - ;; - QNLI) - temp="{'text':'text_a'}?{'mask'},{'text':'text_b'}" - verb="{'not_entailment':'No','entailment':'Yes'}" - ;; - RTE) - temp="{'text':'text_a'}?{'mask'},{'text':'text_b'}" - verb="{'not_entailment':'No','entailment':'Yes'}" - MAX_LEN=256 - ;; - mr) - temp="{'text':'text_a'} It was {'mask'}" - verb="{0:'terrible',1:'great'}" - MAX_LEN=160 - ;; - sst-5) - temp="{'text':'text_a'} It was {'mask'}." - temp="{'text':'text_a'} {'mask'}" - verb="{0:'terrible',1:'bad',2:'okay',3:'good',4:'great'}" - ;; - SST-2) - temp="{'text':'text_a'} It was {'mask'}." - verb="{'0':'terrible','1':'great'}" - ;; - subj) - temp="{'text':'text_a'} This is {'mask'}." - verb="{0:'subjective',1:'objective'}" - MAX_LEN=256 - ;; - trec) - temp="{'mask'}:{'text':'text_a'}" - verb="{0:'Description',1:'Entity',2:'Expression',3:'Human',4:'Location',5:'Number'}" - ;; - cr) - temp="{'text':'text_a'} It was {'mask'}." - verb="{0:'terrible',1:'great'}" - MAX_LEN=160 - ;; - mpqa) - temp="{'text':'text_a'} It was {'mask'}" - verb="{0:'terrible',1:'great'}" - MAX_LEN=128 - ;; - -esac - -echo $temp -echo $verb - - -ALPHA=0 -for seed in 13 21 42 87 100 -do - for lr in 1e-5 2e-5 5e-5 - do - for bs in 2 4 8 - do - CUDA_VISIBLE_DEVICES=$device python rgl.py \ - --output_dir ./ckpt_pet_roberta_$seed/ \ - --dataset $dataset \ - --data_path ./data/k-shot/$dataname/16-$seed/ \ - --max_seq_length $MAX_LEN \ - --max_steps 1000 \ - --logging_step 10 \ - --eval_step 100 \ - --batch_size $bs \ - --alpha $ALPHA \ - --seed $seed \ - --learning_rate $lr \ - --template "$temp" \ - --verbalizer "$verb" \ - --overwrite_output - done - done -done - diff --git a/examples/few_shot/RGL/scripts/run_rgl.sh b/examples/few_shot/RGL/scripts/run_rgl.sh deleted file mode 100644 index 9b1a5d2dc216..000000000000 --- a/examples/few_shot/RGL/scripts/run_rgl.sh +++ /dev/null @@ -1,115 +0,0 @@ -dataset=$1 -device=$2 - -MAX_LEN=128 -dataname=$dataset - -case $dataset in - CoLA) - temp="{'text':'text_a'} This is {'mask'}." 
- verb="{'0':'incorrect','1':'correct'}" - ;; - MRPC) - temp="{'text':'text_a'}{'mask'},{'text':'text_b'}" - verb="{'0':'No','1':'Yes'}" - ;; - QQP) - temp="{'text':'text_a'}{'mask'},{'text':'text_b'}" - verb="{'0':'No','1':'Yes'}" - ;; - STS-B) - temp="{'text':'text_a'}{'mask'},{'text':'text_b'}" - verb="{'0':'No','1':'Yes'}" - ;; - MNLI) - temp="{'text':'text_a'}?{'mask'},{'text':'text_b'}" - verb="{'contradiction':'No','entailment':'Yes','neutral':'Maybe'}" - MAX_LEN=256 - ;; - MNLI-mm) - temp="{'text':'text_a'}?{'mask'},{'text':'text_b'}" - verb="{'contradiction':'No','entailment':'Yes','neutral':'Maybe'}" - MAX_LEN=256 - dataname='MNLI' - ;; - SNLI) - temp="{'text':'text_a'}?{'mask'},{'text':'text_b'}" - verb="{'contradiction':'No','entailment':'Yes','neutral':'Maybe'}" - MAX_LEN=256 - ;; - QNLI) - temp="{'text':'text_a'}?{'mask'},{'text':'text_b'}" - verb="{'not_entailment':'No','entailment':'Yes'}" - ;; - RTE) - temp="{'text':'text_a'}?{'mask'},{'text':'text_b'}" - verb="{'not_entailment':'No','entailment':'Yes'}" - MAX_LEN=256 - ;; - mr) - temp="{'text':'text_a'} It was {'mask'}" - verb="{0:'terrible',1:'great'}" - MAX_LEN=160 - ;; - sst-5) - temp="{'text':'text_a'} It was {'mask'}." - temp="{'text':'text_a'} {'mask'}" - verb="{0:'terrible',1:'bad',2:'okay',3:'good',4:'great'}" - ;; - SST-2) - temp="{'text':'text_a'} It was {'mask'}." - verb="{'0':'terrible','1':'great'}" - ;; - subj) - temp="{'text':'text_a'} This is {'mask'}." - verb="{0:'subjective',1:'objective'}" - MAX_LEN=256 - ;; - trec) - temp="{'mask'}:{'text':'text_a'}" - verb="{0:'Description',1:'Entity',2:'Expression',3:'Human',4:'Location',5:'Number'}" - ;; - cr) - temp="{'text':'text_a'} It was {'mask'}." - verb="{0:'terrible',1:'great'}" - MAX_LEN=160 - ;; - mpqa) - temp="{'text':'text_a'} It was {'mask'}" - verb="{0:'terrible',1:'great'}" - MAX_LEN=128 - ;; - -esac - -echo $temp -echo $verb - - -for seed in 13 21 42 87 100 -do - for lr in 1e-5 2e-5 5e-5 - do - for bs in 2 4 8 - do - for alpha in 0.1 0.3 0.5 0.7 1 - do - CUDA_VISIBLE_DEVICES=$device python rgl.py \ - --output_dir ./ckpt_rgl_$seed/ \ - --dataset $dataset \ - --data_path ./data/k-shot/$dataname/16-$seed/ \ - --max_seq_length $MAX_LEN \ - --max_steps 1000 \ - --logging_step 100 \ - --eval_step 1000 \ - --batch_size $bs \ - --alpha $alpha \ - --seed $seed \ - --learning_rate $lr \ - --template "$temp" \ - --verbalizer "$verb" \ - --overwrite_output - done - done - done -done diff --git a/examples/few_shot/RGL/template.py b/examples/few_shot/RGL/template.py deleted file mode 100644 index 9f0561fc2402..000000000000 --- a/examples/few_shot/RGL/template.py +++ /dev/null @@ -1,391 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -from abc import abstractmethod - -import paddle -import paddle.nn as nn -from data import InputExample - -from paddlenlp.utils.log import logger - - -class Template(nn.Layer): - """ - Base template class used to preprocess the inputs of model. 
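-
-    A Template parses the prompt string into parts and combines InputExample
-    fields, hard prompt text, and mask placeholders into the model input text.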
- - Args: - tokenizer (paddlenlp.transformers.PretrainedTokenizer): - The tokenizer of pretrained models. - text_mapping (dict): - The dictionary to map text name in template to that in InputExample. - For example, {'premise': 'text_a', 'hypothesis': 'text_b'}. - - """ - - registered_input_names = ["mask_ids", "shortenable_ids"] - - def __init__(self, tokenizer, text_mapping=None): - super().__init__() - self.tokenizer = tokenizer - self.text_mapping = text_mapping - self._process_lock = False - - self.part_start = "{" - self.part_end = "}" - - @property - def template(self): - if not hasattr(self, "_template"): - raise RuntimeError("Property template has not been set before used.") - return self._template - - @template.setter - def template(self, template): - if template is None: - return - self._template = template - self.process_template() - - @abstractmethod - def process_template(self): - """A hook to process template text when it is set.""" - raise NotImplementedError - - def get_default_mask_ids(self): - """List to denote whether an item in template is a mask token.""" - return [1 if "mask" in p else 0 for p in self.template] - - def get_default_shortenable_ids(self): - """List to denote whther an item in template can be truncated.""" - idx = [] - for p in self.template: - if "shortenable" in p: - idx.append(1 if p["shortenable"] else 0) - else: - idx.append(1 if "text" in p else 0) - return idx - - def incorporate_template_text(self, example, template=None): - """Replace each item in template with real text.""" - inputs = template.copy() if self.template is None else self.template.copy() - - for i, p in enumerate(inputs): - if "text" in p: - inputs[i] = p["add_prefix_space"] + getattr(example, p["text"]) - elif "mask" in p: - inputs[i] = self.tokenizer.mask_token - elif "hard" in p: - inputs[i] = p["add_prefix_space"] + p["hard"] - elif "sep" in p: - inputs[i] = self.tokenizer.sep_token - else: - raise ValueError("can not parse {}".format(p)) - - return inputs - - def parse_inputs(self, inputs: str): - """Parse items from the input template text.""" - parsed = [] - i = 0 - while i < len(inputs): - p = {"add_prefix_space": " " if (i > 0 and inputs[i - 1] == " ") else ""} - while i < len(inputs) and inputs[i] == " ": - p["add_prefix_space"] = " " - i = i + 1 - if i == len(inputs): - break - - if inputs[i] == self.part_start: - j = i + 1 - count_part = 1 - while j < len(inputs): - if inputs[j] == self.part_end: - count_part -= 1 - if count_part == 0: - break - elif inputs[j] == self.part_start: - count_part += 1 - j = j + 1 - if j == len(inputs): - raise ValueError( - "{} at position {} has no corresponding {}".format(self.part_start, i, self.part_end) - ) - try: - part = eval("{%s}" % inputs[i + 1 : j]) - if isinstance(part, set): - part = {k: None for k in part} - p.update(part) - except: - import traceback - - logger.error(traceback.format_exc()) - logger.error("syntax error in {}".format("{%s}" % inputs[i + 1 : j])) - exit() - i = j + 1 - else: - j = i + 1 - while j < len(inputs): - if inputs[j] == self.part_start: - break - j = j + 1 - p["hard"] = inputs[i:j].rstrip(" ") - i = j - parsed.append(p) - - return parsed - - def wrap_one_example(self, example): - """Process InputExample according to the predefined template.""" - if self.template is None: - raise ValueError("template has not been initialized.") - if isinstance(example, InputExample): - text = self.incorporate_template_text(example) - - non_empty_keys = example.keys() - for key in self.text_mapping: - if 
self.text_mapping[key] in non_empty_keys: - non_empty_keys.remove(self.text_mapping[key]) - - keys, values = ["text"], [text] - for name in self.registered_input_names: - keys.append(name) - v = None - if hasattr(self, name) and getattr(self, name) is not None: - v = getattr(self, name) - elif hasattr(self, "get_default_" + name): - v = getattr(self, "get_default_" + name)() - setattr(self, name, v) - else: - raise ValueError( - """ - Template's part attribute '{}' is registered but not - initialized. Try using template.{} = [...] to - initialize or create a get_default_{}(self) - method in your template.""".format( - name, name, name - ) - ) - values.append(v) - - wrapped_parts_to_tokenize = [] - for value in list(zip(*values)): - wrapped_parts_to_tokenize.append(dict(zip(keys, value))) - - wrapped_parts_not_to_tokenize = {key: getattr(example, key) for key in non_empty_keys} - return [wrapped_parts_to_tokenize, wrapped_parts_not_to_tokenize] - else: - raise TypeError("InputExample") - - -class ManualTemplate(Template): - """ - ManualTemplate for hard prompt methods, such as PET, EFL. - """ - - def __init__(self, tokenizer, template=None, text_mapping={"text_a": "text_a", "text_b": "text_b"}): - super().__init__(tokenizer=tokenizer, text_mapping=text_mapping) - self.template = template - - def process_template(self): - self._template = self.parse_inputs(self._template) - - -class SoftTemplate(Template): - """ - SoftTemplate on the input layer for soft prompt methods, such as p-tuning. - """ - - registered_input_names = ["soft_token_ids", "mask_ids", "shortenable_ids"] - - def __init__(self, tokenizer, model, template=None, text_mapping={"text_a": "text_a", "text_b": "text_b"}): - super().__init__(tokenizer=tokenizer, text_mapping=text_mapping) - for module in model.children(): - if type(module).__name__.endswith("Model"): - self.token_embeddings = module.embeddings.word_embeddings - break - self.token_embeddings.weight.stop_gradient = True - self.embedding_size = self.token_embeddings.weight.shape[-1] - self.template = template - - def process_template(self): - self._template = self.parse_inputs(self._template) - self.process_soft_tokens() - self.generate_parameters() - - def incorporate_template_text(self, example, template=None): - """Replace each item in template with real text.""" - inputs = template.copy() if self.template is None else self.template.copy() - - for i, p in enumerate(inputs): - if "text" in p: - inputs[i] = p["add_prefix_space"] + getattr(example, p["text"]) - elif "mask" in p: - inputs[i] = self.tokenizer.mask_token - elif "hard" in p: - inputs[i] = p["add_prefix_space"] + p["hard"] - elif "soft" in p: - inputs[i] = p["add_prefix_space"] + p["soft"] - elif "sep" in p: - inputs[i] = self.tokenizer.sep_token - else: - raise ValueError("can not parse {}".format(p)) - - return inputs - - def process_soft_tokens(self): - inputs = [] - soft_token_ids = [] - num_soft_token = 0 - soft2word_init = {} - soft_id_reindex = {} - - for part in self.template: - if "soft" not in part and "soft_id" not in part: - soft_token_ids.append(0) - inputs.append(part) - continue - - if "soft" in part and part["soft"] is not None: - if "duplicate" in part: - logger.warnings("Ignore ``duplicate``. It is " "incompatible with ``soft`` with text values.") - - # Get word tokens and ids for soft token initialization. 
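-                # e.g. {'soft': 'It was'} creates one soft token per sub-word
-                # of "It was"; generate_parameters() later initializes those
-                # rows of soft_embeddings from the matching word embeddings.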
- init_token_ids = self.tokenizer( - part["add_prefix_space"] + part["soft"], add_special_tokens=False, return_token_type_ids=False - )["input_ids"] - init_tokens = self.tokenizer.convert_ids_to_tokens(init_token_ids) - assert len(init_tokens) == len(init_token_ids) - - # Create soft ids and corresponding ``soft`` part in template. - next_num_soft = num_soft_token + 1 - num_soft_token += len(init_tokens) - id_list = list(range(next_num_soft, num_soft_token + 1)) - - soft_token_ids.extend(id_list) - inputs.extend([{"add_prefix_space": part["add_prefix_space"], "soft": token} for token in init_tokens]) - for soft_id, word_id in zip(id_list, init_token_ids): - soft2word_init[soft_id] = word_id - - # Check the ids of ``soft`` and ``soft_id``. - if "soft_id" in part: - if part["soft_id"] in soft_id_reindex: - assert id_list == soft_id_reindex[part["soft_id"]] - else: - soft_id_reindex[part["soft_id"]] = id_list - continue - - if "soft_id" in part and part["soft_id"] in soft_id_reindex: - if "duplicate" in part: - logger.warnings("Ignore ``duplicate``. Initialize " "``soft`` by ``soft_id`` directly.") - id_list = soft_id_reindex[part["soft_id"]] - - elif "duplicate" in part: - assert isinstance(part["duplicate"], int) - if "same" in part: - num_soft_token += 1 - id_list = [num_soft_token for _ in range(part["duplicate"])] - else: - next_num_soft = num_soft_token + 1 - num_soft_token += part["duplicate"] - id_list = list(range(next_num_soft, num_soft_token + 1)) - else: - num_soft_token += 1 - id_list = [num_soft_token] - - if "soft_id" in part: - soft_id_reindex[part["soft_id"]] = id_list - - soft_token_ids.extend(id_list) - inputs.extend([{"add_prefix_space": part["add_prefix_space"], "soft": ""} for _ in range(len(id_list))]) - - self._template = inputs - self.soft_token_ids = soft_token_ids - self.num_soft_token = num_soft_token - self.soft2word_init = soft2word_init - - if self.num_soft_token == 0: - logger.warnings("No soft tokens in template. " "Use ManualTemplate for better performance.") - - def generate_parameters(self): - """ - Generate parameters for soft tokens. 
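-
-        The soft embedding table has shape [num_soft_token + 1, embedding_size],
-        and rows listed in soft2word_init are initialized from the word
-        embeddings of the corresponding tokens.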
- """ - if self.num_soft_token == 0: - return None - self.soft_embeddings = nn.Embedding(self.num_soft_token + 1, self.embedding_size) - - weight = self.soft_embeddings.weight.clone().detach() - for soft_id, word_id in self.soft2word_init.items(): - weight[soft_id] = self.token_embeddings(paddle.to_tensor(word_id)) - self.soft_embeddings.weight.set_value(weight) - - def process_batch(self, batch): - word_embeds = self.token_embeddings(batch["input_ids"]) - batch["input_ids"] = None - if not hasattr(self, "soft_embeddings"): - batch["input_embeds"] = word_embeds - else: - soft_embeds = self.soft_embeddings(batch["soft_token_ids"]) - input_embeds = paddle.where((batch["soft_token_ids"] > 0).unsqueeze(-1), soft_embeds, word_embeds) - batch["input_embeds"] = input_embeds - return batch - - -class PTuningTemplate(SoftTemplate): - def __init__( - self, tokenizer, model, template, prompt_encoder="lstm", text_mapping={"text_a": "text_a", "text_b": "text_b"} - ): - super().__init__(tokenizer=tokenizer, model=model, text_mapping=text_mapping) - self.prompt_encoder = prompt_encoder - self.template = template - - def generate_parameters(self): - super().generate_parameters() - if self.prompt_encoder == "lstm": - self.lstm_head = nn.LSTM( - input_size=self.embedding_size, - hidden_size=self.embedding_size, - num_layers=2, - direction="bidirect", - time_major=False, - ) - self.mlp_head = nn.Sequential( - nn.Linear(2 * self.embedding_size, self.embedding_size), - nn.ReLU(), - nn.Linear(self.embedding_size, self.embedding_size), - ) - elif self.prompt_encoder == "mlp": - self.mlp_head = nn.Sequential( - nn.Linear(self.embedding_size, self.embedding_size), - nn.ReLU(), - nn.Linear(self.embedding_size, self.embedding_size), - ) - else: - raise ValueError("Unsupported soft token encoder: {}".format(self.prompt_encoder)) - - def process_batch(self, batch): - word_embeds = self.token_embeddings(batch["input_ids"]) - batch["input_ids"] = None - if not hasattr(self, "soft_embeddings"): - batch["input_embeds"] = word_embeds - else: - soft_embeds = self.soft_embeddings(batch["soft_token_ids"]) - if self.prompt_encoder == "lstm": - soft_embeds = self.lstm_head(soft_embeds)[0] - soft_embeds = self.mlp_head(soft_embeds) - - input_embeds = paddle.where((batch["soft_token_ids"] > 0).unsqueeze(-1), soft_embeds, word_embeds) - batch["input_embeds"] = input_embeds - return batch diff --git a/examples/few_shot/RGL/tokenizer.py b/examples/few_shot/RGL/tokenizer.py deleted file mode 100644 index 91f2fbd1fad6..000000000000 --- a/examples/few_shot/RGL/tokenizer.py +++ /dev/null @@ -1,261 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import itertools -import warnings -from collections import defaultdict -from functools import partial - -import numpy as np - - -class TokenizerWrapper: - """ - Process examples encoded by template, such as truncating and padding. - - Args: - max_seq_length (int): - The maximum length of input data (prompt and text). 
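-            Special tokens added by the tokenizer also count toward this
-            limit when truncating.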
- tokenizer (paddlenlp.transformers.PreTrainedTokenizer): - The tokenizer of pretrained model. - truncate_method (str): - How to truncate input data. - Choices: ``tail``, ``head``, ``manual``. - create_token_type_ids (bool): - Whether to create token_type_ids for inputs. - seq_length_list (list, optional): - The list of maximum length for every part in input data. - """ - - def __init__(self, max_seq_length, tokenizer, truncate_method="tail", create_token_type_ids=False, **kwargs): - self.max_seq_length = max_seq_length - self.tokenizer = tokenizer - if truncate_method == "manual": - assert hasattr(kwargs, "seq_length_list"), "seq_length_list " "should be defined for manual truncation." - self.seq_length_list = kwargs["seq_length_list"] - self.truncate_fn = partial(self.truncate_from_end, etype="tail") - elif truncate_method == "tail" or truncate_method == "head": - self.truncate_fn = partial(self.truncate_from_end, etype=truncate_method) - else: - raise NotImplementedError - - self.create_token_type_ids = create_token_type_ids - - self.num_truncated_sentences = 0 - self.total_passed_sentences = 0 - - @property - def special_tokens_maps(self): - if not hasattr(self, "_special_tokens_map"): - self._special_tokens_map = { - "": getattr(self.tokenizer, "cls_token", ""), - "": getattr(self.tokenizer, "sep_token", ""), - "": getattr(self.tokenizer, "pad_token", ""), - "": getattr(self.tokenizer, "mask_token", ""), - "": getattr(self.tokenizer, "unk_token", ""), - } - return self._special_tokens_map - - @property - def truncate_rate(self): - if self.total_passed_sentences == 0: - return None - else: - return self.num_truncated_sentences / self.total_passed_sentences - - @staticmethod - def truncate_by_manual(input_dict, max_len_list=[]): - """ - Truncate input data by manually defined maximum sequence length. - - Args: - input_dict (dict): - The dictionary of an input example. - max_len_list (list): - The maximum length of every part in example. - ``-1`` denotes that there is no limit on length. 
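-
-        Example (hypothetical): with max_len_list=[4, -1], the first
-        shortenable part is cut to its first 4 tokens, the second is kept
-        whole, and non-shortenable parts never consume a slot of the list.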
-        """
-        truncated_dict = defaultdict(list)
-        shortenable_ids = input_dict["shortenable_ids"]
-        truncated_dict["shortenable_ids"] = shortenable_ids
-        for attr_name, attr_values in input_dict.items():
-            text_idx = 0
-            for i, value in enumerate(attr_values):
-                if shortenable_ids[i][0] == 0:
-                    continue
-                if text_idx >= len(max_len_list):
-                    break
-                if len(value) > 0:
-                    max_len = max_len_list[text_idx]
-                    if max_len < 0:
-                        attr_values[i] = value
-                    else:
-                        attr_values[i] = value[:max_len]
-                text_idx += 1
-            truncated_dict[attr_name] = attr_values
-        return truncated_dict
-
-    @staticmethod
-    def truncate_from_end(input_dict, num_tokens_to_truncate=0, etype="tail"):
-        assert etype in ["head", "tail"]
-        step = 1 if etype == "head" else -1
-        idx_offset = 0 if etype == "head" else 1
-        truncated_dict = defaultdict(list)
-        shortenable_ids = input_dict["shortenable_ids"]
-        for attr_name in input_dict:
-            attr_values = input_dict[attr_name]
-            count = num_tokens_to_truncate
-            for i, value in enumerate(attr_values[::step]):
-                index = int(step * (idx_offset + i))
-                if len(value) == 0 or shortenable_ids[index][0] == 0:
-                    continue
-                if count < len(value):
-                    attr_values[index] = value[:-count]
-                else:
-                    attr_values[index] = []
-                count -= len(value)
-                if count <= 0:
-                    break
-            truncated_dict[attr_name] = attr_values
-
-        return truncated_dict
-
-    @staticmethod
-    def concate_parts(input_dict):
-        for key in input_dict:
-            input_dict[key] = list(itertools.chain(*input_dict[key]))
-        return input_dict
-
-    @staticmethod
-    def padding(input_dict, max_len, pad_id_for_inputs=0, pad_id_for_others: int = 0) -> dict:
-        for key, value in input_dict.items():
-            if len(input_dict[key]) > max_len:
-                raise ValueError(
-                    f"""Truncated seq length of '{key}' still greater than
-                        max length {max_len}. One possible reason is that
-                        there are not enough shortenable parts in template. Try adding
-                        {{"shortenable": "True"}} property.
- """ - ) - if "input" in key: - input_dict[key].extend([pad_id_for_inputs] * (max_len - len(value))) - else: - input_dict[key].extend([pad_id_for_others] * (max_len - len(value))) - return input_dict - - def truncate(self, inputs): - if hasattr(self, "seq_length_list"): - inputs = self.truncate_by_manual(inputs, self.seq_length_list) - total_tokens = sum([len(part) for part in inputs["input_ids"]]) - num_specials = self.num_special_tokens_to_add - num_tokens_to_truncate = total_tokens - self.max_seq_length + num_specials - self.total_passed_sentences += 1 - if num_tokens_to_truncate > 0: - self.num_truncated_sentences += 1 - inputs = self.truncate_fn(input_dict=inputs, num_tokens_to_truncate=num_tokens_to_truncate) - return inputs - - def add_special_tokens(self, encode_inputs): - for key in encode_inputs: - if key == "input_ids": - with warnings.catch_warnings(): - warnings.simplefilter("ignore") - encode_inputs[key] = self.tokenizer.build_inputs_with_special_tokens(encode_inputs[key]) - else: - special_tokens_mask = np.array(self.tokenizer.get_special_tokens_mask(encode_inputs[key])) - with_special_tokens = np.array(self.tokenizer.build_inputs_with_special_tokens(encode_inputs[key])) - with_special_tokens[special_tokens_mask == 1] = 0 - encode_inputs[key] = with_special_tokens.tolist() - return encode_inputs - - -class MLMTokenizerWrapper(TokenizerWrapper): - input_keys = ["input_ids", "attention_mask", "token_type_ids"] - - @property - def mask_token(self): - return self.tokenizer.mask_token - - @property - def mask_token_id(self): - return self.tokenizer.mask_token_id - - @property - def soft_token(self): - return self.tokenizer.unk_token - - @property - def soft_token_id(self): - return self.tokenizer.unk_token_id - - @property - def num_special_tokens_to_add(self): - if not hasattr(self, "_num_specials"): - self._num_specials = self.tokenizer.num_special_tokens_to_add() - return self._num_specials - - def get_token_type_ids(self, encoded_inputs): - token_type_ids = [0] * len(encoded_inputs["input_ids"]) - sep_token = getattr(self.tokenizer, "sep_token", -1) - if sep_token >= 0: - sep_index = np.where([x == sep_token for x in encoded_inputs["input_ids"]])[0] - for i, x in enumerate(sep_index[1:]): - pre_x = sep_index[i - 1] - sep_index[pre_x + 1 : x + 1] = [i + 1] * (x - pre_x) - return token_type_ids - - def tokenize_one_example(self, wrapped_example): - to_tokenize, not_to_tokenize = wrapped_example - - encode_inputs = defaultdict(list) - for part in to_tokenize: - if part["mask_ids"] == 1: - text = [self.mask_token_id] - - if part["text"] in self.special_tokens_maps.keys(): - to_replace = self.special_tokens_maps[part["text"]] - if to_replace is not None: - part["text"] = to_replace - else: - raise KeyError("This tokenizer doesn't specify {} token.".format(part["prompt"])) - - if "soft_token_ids" in part and part["soft_token_ids"] == 1: - text = [self.soft_token_id] - else: - text = self.tokenizer.encode(part["text"], add_special_tokens=False, return_token_type_ids=False)[ - "input_ids" - ] - - text_len = len(text) - encode_inputs["input_ids"].append(text) - for key in part: - if key not in ["text"]: - encode_inputs[key].append([part[key]] * text_len) - encode_inputs = self.truncate(inputs=encode_inputs) - encode_inputs.pop("shortenable_ids") - encode_inputs = self.concate_parts(encode_inputs) - encode_inputs = self.add_special_tokens(encode_inputs) - encode_inputs["attention_mask"] = [1] * len(encode_inputs["input_ids"]) - if self.create_token_type_ids: - 
encode_inputs["token_type_ids"] = self.get_token_type_ids(encode_inputs) - encode_inputs = self.padding( - encode_inputs, max_len=self.max_seq_length, pad_id_for_inputs=self.tokenizer.pad_token_id - ) - - return {**encode_inputs} - - -tokenizer_mapping = { - "roberta": MLMTokenizerWrapper, -} diff --git a/examples/few_shot/RGL/utils.py b/examples/few_shot/RGL/utils.py deleted file mode 100644 index f855145c444d..000000000000 --- a/examples/few_shot/RGL/utils.py +++ /dev/null @@ -1,81 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import os -import random - -import numpy as np -import paddle -from data import InputFeatures -from paddle.io import DataLoader -from paddle.optimizer.lr import LambdaDecay - -from paddlenlp.datasets import MapDataset - - -def set_seed(seed): - """set random seed""" - random.seed(seed) - np.random.seed(seed) - paddle.seed(seed) - - -def check_args(args): - """check output_dir and make it when not exist""" - if os.path.exists(args.output_dir): - if os.listdir(args.output_dir) and not args.overwrite_output: - raise ValueError("Path Configuration: output_dir {} exists!".format(args.output_dir)) - if not os.path.exists(args.output_dir): - os.makedirs(args.output_dir) - - args.dataset = args.dataset.lower() - - -def convert_example(example, template, tokenizer_wrapper, verbalizer=None): - if verbalizer is not None and hasattr(verbalizer, "wrap_one_example"): - example = verbalizer.wrap_one_example(example) - example = template.wrap_one_example(example) - encoded_inputs = InputFeatures(**tokenizer_wrapper.tokenize_one_example(example), **example[1]) - return encoded_inputs - - -def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): - if isinstance(dataset, list): - dataset = MapDataset(dataset) - assert isinstance(dataset, MapDataset) - - if trans_fn: - dataset = dataset.map(trans_fn) - - shuffle = True if mode == "train" else False - if mode == "train": - batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) - else: - batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) - - return DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) - - -class LinearSchedulerWarmup(LambdaDecay): - """ - Linear scheduler with warm up. 
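-
-    The learning-rate multiplier rises linearly from 0 to 1 over
-    ``warmup_steps`` and then decays linearly to 0 at ``max_steps``
-    (see lr_lambda below).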
- """ - - def __init__(self, learning_rate, warmup_steps, max_steps, last_epoch=-1, verbose=False): - def lr_lambda(current_step): - if current_step < warmup_steps: - return float(current_step) / float(max(1, warmup_steps)) - return max(0.0, float(max_steps - current_step) / float(max(1, max_steps - warmup_steps))) - - super().__init__(learning_rate, lr_lambda, last_epoch, verbose) diff --git a/examples/few_shot/RGL/verbalizer.py b/examples/few_shot/RGL/verbalizer.py deleted file mode 100644 index 0e741235dcc0..000000000000 --- a/examples/few_shot/RGL/verbalizer.py +++ /dev/null @@ -1,188 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -from abc import abstractmethod -from typing import Dict, List, Union - -import numpy as np -import paddle -import paddle.nn as nn -import paddle.nn.functional as F -from data import InputExample - -from paddlenlp.transformers import PretrainedTokenizer - - -class Verbalizer(nn.Layer): - """ - Base verbalizer class used to process the outputs and labels. - - Args: - tokenizer (paddlenlp.transformers.PretrainedTokenizer): - The tokenizer of pretrained models. - labels (list): - The sequence of labels in task. - - """ - - def __init__(self, tokenizer: PretrainedTokenizer = None, labels: List = None): - super().__init__() - assert labels is not None, "Label list for current task is not set yet." - self.tokenizer = tokenizer - self.labels = sorted(labels) - self._process_lock = False - - @property - def vocab(self): - if not hasattr(self, "_vocab"): - self._vocab = self.tokenizer.convert_ids_to_tokens(np.arange(self.vocab_size).tolist()) - return self._vocab - - @property - def vocab_size(self): - return self.tokenizer.vocab_size - - @property - def label_to_words(self): - if not hasattr(self, "_label_to_words"): - raise RuntimeError("Property label_to_words has not been set before used.") - return self._label_to_words - - @label_to_words.setter - def label_to_words(self, label_to_words: Union[List, Dict]): - if label_to_words is None: - return - if isinstance(label_to_words, dict): - new_keys = sorted(list(label_to_words.keys())) - assert new_keys == self.labels, "label_to_words {} does not match the predefined labels {}.".format( - new_keys, self.labels - ) - self._label_to_words = {k: label_to_words[k] for k in self.labels} - elif isinstance(label_to_words, list): - assert len(self.labels) == len( - label_to_words - ), "The lengths of label_to_words and predefined labels do not match." 
- self._label_to_words = {k: v for k, v in zip(self.labels, label_to_words)} - else: - raise TypeError("Unsupported type {} for label_to_words".format(type(label_to_words))) - self.process_label_words() - - @property - def labels_to_ids(self): - if not hasattr(self, "labels"): - raise RuntimeError("Property labels_to_ids has not been set before used.") - return {k: i for i, k in enumerate(self.labels)} - - @property - def ids_to_labels(self): - if not hasattr(self, "labels"): - raise RuntimeError("Property ids_to_labels has not been set before used.") - return {i: k for i, k in enumerate(self.labels)} - - @abstractmethod - def process_label_words( - self, - ): - """A hook to process verbalizer when it is set.""" - raise NotImplementedError - - @abstractmethod - def project(self, logits, **kwargs): - """ - Project the logits with shape ```[batch_size, vocab_size]``` into - label_word_logits with shape ```[batch_size, num_label_words]```. - """ - raise NotImplementedError - - @staticmethod - def aggregate(label_words_logits, atype="mean", ndim=2): - """ - Aggregate embeddings when multiple words are mapped to one label. - - Args: - label_words_logits (paddle.Tensor): - The logits of words which could be mapped to labels. - atype (str): - The aggregation strategy, including mean and first. - ndim (str): - The aggregated embeddings' number of dimensions. - - """ - if label_words_logits.ndim > ndim: - if atype == "mean": - return label_words_logits.mean(axis=-1) - elif atype == "max": - return label_words_logits.max(axis=-1) - elif atype == "first": - return label_words_logits[..., 0, :] - else: - raise ValueError("Unsupported aggreate type {}".format(atype)) - return label_words_logits - - def normalize(self, logits): - """Normalize the logits of every example.""" - new_logits = F.softmax(logits.reshape(logits.shape[0], -1), axis=-1) - return new_logits.reshape(*logits.shape) - - -class ManualVerbalizer(Verbalizer): - """ - Manual Verbalizer to map labels to words for hard prompt methods. - - Args: - tokenizer (paddlenlp.transformers.PretrainedTokenizer): - The tokenizer of pretrained models. - labels (list): - The sequence of all labels. - label_to_words (dict or list): - The dictionary or corresponding list to map labels to words. - prefix (str): - The prefix string of words, used in PLMs like RoBERTa, which is sensitive to the prefix. - """ - - def __init__(self, tokenizer, labels=None, label_to_words=None, prefix=""): - super().__init__(tokenizer=tokenizer, labels=labels) - self.tokenizer = tokenizer - self.labels = labels - self.prefix = prefix - self.label_to_words = label_to_words - - def process_label_words(self): - word_ids = [] - for label in self.labels: - word_ids.append( - self.tokenizer.encode( - self.prefix + self._label_to_words[label], add_special_tokens=False, return_token_type_ids=False - )["input_ids"] - ) - self.word_ids = paddle.to_tensor(word_ids, dtype="int64").squeeze() - self.label_to_words_ids = {k: v for k, v in zip(self.labels, word_ids)} - - def process_logits(self, logits, mask_ids=None, **kwargs): - if mask_ids is not None: - logits = logits[mask_ids == 1] - label_words_logits = logits.index_select(index=self.word_ids, axis=-1) - return label_words_logits - - def wrap_one_example(self, example): - """Process labels in InputExample According to the predefined verbalizer.""" - if isinstance(example, InputExample): - try: - example.label = self.labels_to_ids[example.cls_label] - except KeyError: - # Regression tasks. 
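-                # For regression, cls_label holds a numeric string;
-                # eval parses it into a float target.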
- example.label = eval(example.cls_label) - return example - else: - raise TypeError("InputExample") diff --git a/examples/few_shot/efl/README.md b/examples/few_shot/efl/README.md deleted file mode 100644 index f8656b690172..000000000000 --- a/examples/few_shot/efl/README.md +++ /dev/null @@ -1,85 +0,0 @@ -# EFL - - -[Entailment as Few-Shot Learner](https://arxiv.org/abs/2104.14690) - - -## 算法简介 - -Entailment as Few-Shot Learner(EFL)提出将 NLP Fine-tune 任务转换统一转换为 Entailment 二分类任务,为小样本场景下的任务求解提供了新的视角。EFL 的主要思想如下图所示,该算法也可以使用 `Template` 实现标签描述与数据文本的拼接,定义方式详见[Prompt API 文档](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/advanced_guide/prompt.md)。 - -![EFL](https://user-images.githubusercontent.com/25607475/204245126-bd94e87c-f25f-471e-af1c-d1e05f7a2897.png) - -## 快速开始 - -CLUE(Chinese Language Understanding Evaluation)作为中文语言理解权威测评榜单,在学术界和工业界都有着广泛影响。FewCLUE 是其设立的中文小样本学习测评子榜,旨在探索小样本学习最佳模型和中文实践。PaddleNLP 内置了 FewCLUE 数据集,可以直接用来进行 EFL 算法训练、评估、预测,并生成 FewCLUE 榜单的提交结果,参与 FewCLUE 竞赛。 - -### 代码结构说明 -``` -├── run_train.py # EFL 算法提示学习脚本 -├── data.py # 数据集构造、数据增强 -├── utils.py # FewCLUE 提交结果保存等工具函数 -└── prompt/ # FewCLUE 各数据集的 prompt 定义文件 -``` - -### 数据准备 - -读取 FewCLUE 数据集只需要 1 行代码,这部分代码在 `data.py` 脚本中。以情感分类数据集 `eprstmt` 为例: - -``` -from paddlenlp.datasets import load_dataset - -# 通过指定 "fewclue" 和数据集名字 name="eprstmt" 即可一键加载 FewCLUE 中的 eprstmt 数据集 -train_ds, dev_ds, public_test_ds = load_dataset("fewclue", name="eprstmt", splits=("train_0", "dev_0", "test_public")) -``` - -### 模型训练、评估、预测 - -通过如下命令,指定 GPU 0 卡, 在 FewCLUE 的 `eprstmt` 数据集上进行训练&评估 -``` -python -u -m paddle.distributed.launch --gpus "0" run_train.py \ - --output_dir checkpoint_eprstmt \ - --task_name eprstmt \ - --split_id few_all \ - --prompt_path prompt/eprstmt.json \ - --prompt_index 0 \ - --do_train \ - --do_eval \ - --do_test \ - --do_predict \ - --do_label \ - --max_steps 1000 \ - --learning_rate 3e-5 \ - --eval_steps 100 \ - --save_steps 100 \ - --logging_steps 5 \ - --per_device_train_batch_size 16 \ - --max_seq_length 128 \ - --load_best_model_at_end \ - --metric_for_best_model accuracy \ - --save_total_limit 1 -``` -参数含义说明 -- `task_name`: FewCLUE 中的数据集名字 -- `split_id`: 数据集编号,包括0, 1, 2, 3, 4 和 few_all -- `prompt_path`: prompt 定义文件名 -- `prompt_index`: 使用定义文件中第 `prompt_index` 个 prompt -- `augment_type`: 数据增强策略,可选 swap, delete, insert, substitute -- `num_augment`: 数据增强策略为每个样本生成的样本数量 -- `word_augment_percent`: 每个序列中数据增强词所占的比例 -- `pseudo_data_path`: 使用模型标注的伪标签数据文件路径 -- `do_label`: 是否使用训练后的模型给无标签数据标注伪标签 -- `do_test`: 是否在公开测试集上评估模型效果 -- `model_name_or_path`: 预训练模型名,默认为 `ernie-1.0-large-zh-cw` -- `use_rdrop`: 是否使用对比学习策略 R-Drop -- `alpha_rdrop`: R-Drop 损失值权重,默认为 0.5 -- `dropout`: 预训练模型的 dropout 参数值,用于 R-Drop 策略中参数配置 -- `export_type`: 模型导出格式,默认为 `paddle`,动态图转静态图 -- 更多配置参考 [Trainer 参数文档](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/trainer.md#trainingarguments-%E5%8F%82%E6%95%B0%E4%BB%8B%E7%BB%8D) 和 [PromptTrainer 参数文档](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/advanced_guide/prompt.md#prompttrainer%E5%8F%82%E6%95%B0%E5%88%97%E8%A1%A8) - -### 模型部署 - -Coming soon... - -## References -[1] Wang, Sinong, Han Fang, Madian Khabsa, Hanzi Mao, and Hao Ma. “Entailment as Few-Shot Learner.” ArXiv:2104.14690 [Cs], April 29, 2021. http://arxiv.org/abs/2104.14690. diff --git a/examples/few_shot/efl/data.py b/examples/few_shot/efl/data.py deleted file mode 100644 index b33ea7927166..000000000000 --- a/examples/few_shot/efl/data.py +++ /dev/null @@ -1,134 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. 
All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import json - -import numpy as np - -from paddlenlp.datasets import MapDataset, load_dataset - - -def extend_with_pseudo_data(data_ds, pseudo_path, labels_to_ids): - """ - Extend train dataset with pseudo labeled examples if exists. - """ - if pseudo_path is None: - return data_ds - with open(pseudo_path, "r", encoding="utf-8") as fp: - pseudo_data = [json.loads(x.strip()) for x in fp] - data_ds = MapDataset([x for x in data_ds] + pseudo_data) - return data_ds - - -def convert_efl(data_ds, label_words, orig_key, is_train=False, num_neg=5): - efl_data_ds = [] - label_list = sorted(label_words.keys()) - for example in data_ds: - label = label_words[example[orig_key]] if orig_key in example else None - sub_list = label_list - if is_train and label is not None and len(label_list) > num_neg: - rand_index = np.random.permutation(len(label_list)) - sub_list = [example[orig_key]] + [label_list[i] for i in rand_index[:num_neg]] - for key in sub_list: - new_example = example.copy() - cand = label_words[key] - new_example["candidate_label"] = cand - if label is not None: - new_example["labels"] = int(cand == label) - efl_data_ds.append(new_example) - return MapDataset(efl_data_ds) - - -def convert_chid(data_ds): - """ - Insert idioms into positions of `#idiom#` so that the task is converted - to binary classification. - """ - split_data_ds = [] - for example in data_ds: - fragments = example["content"].split("#idiom#") - label = example.get("answer", None) - for index, cand in enumerate(example["candidates"]): - new_example = {"content_pre": fragments[0], "content_post": fragments[1], "idiom": cand} - if label is not None: - new_example["label"] = str(int(index == label)) - split_data_ds.append(new_example) - return MapDataset(split_data_ds) - - -def convert_cluewsc(data_ds): - """ - Mark the pronoun and entity with special tokens. - """ - marked_data_ds = [] - for example in data_ds: - target, text = example["target"], list(example["text"]) - pronoun, p_index = target["span2_text"], target["span2_index"] - entity, e_index = target["span1_text"], target["span1_index"] - label = example.get("label", None) - if p_index > e_index: - text.insert(p_index, "_") - text.insert(p_index + len(pronoun) + 1, "_") - text.insert(e_index, "[") - text.insert(e_index + len(entity) + 1, "]") - else: - text.insert(e_index, "[") - text.insert(e_index + len(entity) + 1, "]") - text.insert(p_index, "_") - text.insert(p_index + len(pronoun) + 1, "_") - new_example = {"text": "".join(text), "pronoun": pronoun, "entity": entity} - if label is not None: - new_example["label"] = label - marked_data_ds.append(new_example) - return MapDataset(marked_data_ds) - - -def load_fewclue_dataset(args, verbalizer): - """ - Load fewclue datasets and convert them to the standard format of PET. 
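-
-    Returns the list [train, dev, public_test, test, unlabeled]; the
-    unlabeled split is None for cluewsc, which has no unlabeled data.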
- """ - split_id = args.split_id - splits = [f"train_{split_id}", f"dev_{split_id}", "test_public", "test"] - if args.task_name == "cluewsc": - train_ds, dev_ds, public_test_ds, test_ds = load_dataset("fewclue", name=args.task_name, splits=splits) - unlabeled_ds = None - else: - splits.append("unlabeled") - train_ds, dev_ds, public_test_ds, test_ds, unlabeled_ds = load_dataset( - "fewclue", name=args.task_name, splits=splits - ) - data_ds = [train_ds, dev_ds, public_test_ds, test_ds, unlabeled_ds] - - # Preprocess data for EFL. - if args.task_name == "chid": - for index, sub_data_ds in enumerate(data_ds): - data_ds[index] = convert_chid(sub_data_ds) - elif args.task_name == "cluewsc": - for index, sub_data_ds in enumerate(data_ds[:-1]): - data_ds[index] = convert_cluewsc(sub_data_ds) - - orig_key = "label" - if args.task_name == "tnews": - orig_key = "label_desc" - elif args.task_name == "iflytek": - orig_key = "label_des" - for index, sub_data_ds in enumerate(data_ds): - is_train = index == 0 - if sub_data_ds is not None: - data_ds[index] = convert_efl(sub_data_ds, args.label_words, orig_key, is_train) - - # Extend train dataset with pseudo-label data. - data_ds[0] = extend_with_pseudo_data(data_ds[0], args.pseudo_data_path, verbalizer.labels_to_ids) - - return data_ds diff --git a/examples/few_shot/efl/prompt/bustm.json b/examples/few_shot/efl/prompt/bustm.json deleted file mode 100644 index b44363510642..000000000000 --- a/examples/few_shot/efl/prompt/bustm.json +++ /dev/null @@ -1,8 +0,0 @@ -{ - "template": [ - {"text": "下边两个句子说的是{'text': 'candidate_label'}的事情。“{'text': 'sentence1'}”和“{'text': 'sentence2'}”"} - ], - "verbalizer": [ - {"0": "不同", "1": "相关"} - ] -} diff --git a/examples/few_shot/efl/prompt/chid.json b/examples/few_shot/efl/prompt/chid.json deleted file mode 100644 index b3c1e648e29a..000000000000 --- a/examples/few_shot/efl/prompt/chid.json +++ /dev/null @@ -1,8 +0,0 @@ -{ - "template": [ - {"text": "成语[{'text':'idiom'}]使用{'text': 'candidate_label'}的例子:{'text':'content_pre'}({'text': 'idiom'}){'text': 'content_post'}"} - ], - "verbalizer": [ - {"0": "错误", "1": "正确"} - ] -} diff --git a/examples/few_shot/efl/prompt/cluewsc.json b/examples/few_shot/efl/prompt/cluewsc.json deleted file mode 100644 index 1e736c43332d..000000000000 --- a/examples/few_shot/efl/prompt/cluewsc.json +++ /dev/null @@ -1,8 +0,0 @@ -{ - "template": [ - {"text": "{'text': 'text'}{'text': 'pronoun'}指的{'text': 'candidate_label'}{'text': 'entity'}"} - ], - "verbalizer": [ - {"false": "不是", "true": "是"} - ] -} diff --git a/examples/few_shot/efl/prompt/csl.json b/examples/few_shot/efl/prompt/csl.json deleted file mode 100644 index 6d19eee927f8..000000000000 --- a/examples/few_shot/efl/prompt/csl.json +++ /dev/null @@ -1,8 +0,0 @@ -{ - "template": [ - {"text": "给定以下几个词语:{'options': 'keyword', 'add_prompt': '[OPT],'}{'text': 'candidate_label'}扩写成“{'text': 'abst'}”"} - ], - "verbalizer": [ - {"0": "不能", "1": "可以"} - ] -} diff --git a/examples/few_shot/efl/prompt/csldcp.json b/examples/few_shot/efl/prompt/csldcp.json deleted file mode 100644 index 2dd84e36ca21..000000000000 --- a/examples/few_shot/efl/prompt/csldcp.json +++ /dev/null @@ -1,76 +0,0 @@ -{ - "template": [ - {"text": "这篇论文阐述了{'text': 'candidate_label'}。{'text': 'content'}"} - ], - "verbalizer": [ - [ - "材料科学与工程", - "作物学", - "口腔医学", - "药学", - "教育学", - "水利工程", - "理论经济学", - "食品科学与工程", - "畜牧学/兽医学", - "体育学", - "核科学与技术", - "力学", - "园艺学", - "水产", - "法学", - "地质学/地质资源与地质工程", - "石油与天然气工程", - "农林经济管理", - "信息与通信工程", - "图书馆、情报与档案管理", - "政治学", - "电气工程", - 
"海洋科学", - "民族学", - "航空宇航科学与技术", - "化学/化学工程与技术", - "哲学", - "公共卫生与预防医学", - "艺术学", - "农业工程", - "船舶与海洋工程", - "计算机科学与技术", - "冶金工程", - "交通运输工程", - "动力工程及工程热物理", - "纺织科学与工程", - "建筑学", - "环境科学与工程", - "公共管理", - "数学", - "物理学", - "林学/林业工程", - "心理学", - "历史学", - "工商管理", - "应用经济学", - "中医学/中药学", - "天文学", - "机械工程", - "土木工程", - "光学工程", - "地理学", - "农业资源利用", - "生物学/生物科学与工程", - "兵器科学与技术", - "矿业工程", - "大气科学", - "基础医学/临床医学", - "电子科学与技术", - "测绘科学与技术", - "控制科学与工程", - "军事学", - "中国语言文学", - "新闻传播学", - "社会学", - "地球物理学", - "植物保护" - ] - ] -} diff --git a/examples/few_shot/efl/prompt/eprstmt.json b/examples/few_shot/efl/prompt/eprstmt.json deleted file mode 100644 index 309146c5e7e5..000000000000 --- a/examples/few_shot/efl/prompt/eprstmt.json +++ /dev/null @@ -1,8 +0,0 @@ -{ - "template": [ - {"text": "这表达了{'text': 'candidate_label'}的情感。{'text':'sentence'}"} - ], - "verbalizer": [ - {"Negative": "不满意", "Positive": "满意"} - ] -} diff --git a/examples/few_shot/efl/prompt/iflytek.json b/examples/few_shot/efl/prompt/iflytek.json deleted file mode 100644 index 5199508e6f03..000000000000 --- a/examples/few_shot/efl/prompt/iflytek.json +++ /dev/null @@ -1,129 +0,0 @@ -{ - "template": [ - {"text": "这段文本的应用描述主题是{'text': 'candidate_label'}。{'text': 'sentence'}"} - ], - "verbalizer": [ - [ - "银行", - "社区服务", - "电商", - "支付", - "经营养成", - "卡牌", - "借贷", - "驾校", - "理财", - "职考", - "新闻", - "旅游资讯", - "公共交通", - "魔幻", - "医疗服务", - "影像剪辑", - "动作类", - "工具", - "体育竞技", - "小说", - "运动健身", - "相机", - "辅助工具", - "快递物流", - "高等教育", - "股票", - "菜谱", - "行车辅助", - "仙侠", - "亲子儿童", - "购物咨询", - "射击游戏", - "漫画", - "中小学", - "同城服务", - "成人教育", - "求职", - "电子产品", - "艺术", - "薅羊毛", - "约会社交", - "经营", - "兼职", - "短视频", - "音乐", - "英语", - "棋牌中心", - "摄影修图", - "养生保健", - "办公", - "政务", - "视频", - "论坛圈子", - "彩票", - "直播", - "其他", - "休闲益智", - "策略", - "即时通讯", - "汽车交易", - "违章", - "地图导航", - "民航", - "电台", - "语言(非英语)", - "搞笑", - "婚恋社交", - "社区超市", - "日常养车", - "杂志", - "视频教育", - "家政", - "影视娱乐", - "装修家居", - "体育咨讯", - "社交工具", - "餐饮店", - "美颜", - "问诊挂号", - "飞行空战", - "综合预定", - "电影票务", - "笔记", - "买房", - "外卖", - "母婴", - "打车", - "情侣社交", - "日程管理", - "租车", - "微博博客", - "百科", - "绘画", - "铁路", - "生活社交", - "租房", - "酒店", - "保险", - "问答交流", - "收款", - "MOBA", - "K歌", - "技术", - "减肥瘦身", - "工作社交", - "团购", - "记账", - "女性", - "公务员", - "二手", - "美妆美业", - "汽车咨询", - "行程管理", - "免费WIFI", - "教辅", - "成人", - "婚庆", - "民宿短租", - "出国" - ] - ] -} - diff --git a/examples/few_shot/efl/prompt/ocnli.json b/examples/few_shot/efl/prompt/ocnli.json deleted file mode 100644 index caa7fd2c5719..000000000000 --- a/examples/few_shot/efl/prompt/ocnli.json +++ /dev/null @@ -1,8 +0,0 @@ -{ - "template": [ - {"text": "“{'text': 'sentence1'}”和“{'text': 'sentence2'}”之间{'text': 'candidate_label'}。"} - ], - "verbalizer": [ - {"contradiction": "互相矛盾", "entailment": "相互包含", "neutral": "没有关系"} - ] -} diff --git a/examples/few_shot/efl/prompt/tnews.json b/examples/few_shot/efl/prompt/tnews.json deleted file mode 100644 index 4580cd766208..000000000000 --- a/examples/few_shot/efl/prompt/tnews.json +++ /dev/null @@ -1,24 +0,0 @@ -{ - "template": [ - {"text": "下边报道一条{'text': 'candidate_label'}新闻{'text':'sentence'}"} - ], - "verbalizer": [ - { - "news_story": "故事", - "news_entertainment": "明星", - "news_finance": "财经", - "news_sports": "体育", - "news_edu": "校园", - "news_game": "游戏", - "news_culture": "文化", - "news_tech": "科技", - "news_car": "汽车", - "news_travel": "旅行", - "news_world": "国际", - "news_agriculture": "农业", - "news_military": "军事", - "news_house": "房产", - "news_stock": "股票" - } - ] -} diff --git a/examples/few_shot/efl/run_train.py 
b/examples/few_shot/efl/run_train.py deleted file mode 100644 index 8dd47043d762..000000000000 --- a/examples/few_shot/efl/run_train.py +++ /dev/null @@ -1,164 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import os -import time -from dataclasses import dataclass, field -from functools import partial - -import paddle -from data import load_fewclue_dataset -from paddle.metric import Accuracy -from paddle.static import InputSpec -from utils import load_prompt_arguments, save_fewclue_prediction, save_pseudo_data - -from paddlenlp.prompt import ( - ManualTemplate, - ManualVerbalizer, - PromptModelForSequenceClassification, - PromptTrainer, - PromptTuningArguments, -) -from paddlenlp.trainer import PdArgumentParser -from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer -from paddlenlp.utils.log import logger - - -# yapf: disable -@dataclass -class DataArguments: - task_name: str = field(default="eprstmt", metadata={"help": "The task name in FewCLUE."}) - split_id: str = field(default="0", metadata={"help": "The split id of datasets, including 0, 1, 2, 3, 4, few_all."}) - prompt_path: str = field(default="prompt/eprstmt.json", metadata={"help": "Path to the defined prompts."}) - prompt_index: int = field(default=0, metadata={"help": "The index of defined prompt for training."}) - pseudo_data_path: str = field(default=None, metadata={"help": "Path to data with pseudo labels."}) - do_label: bool = field(default=False, metadata={"help": "Whether to label unsupervised data in unlabeled datasets"}) - do_test: bool = field(default=False, metadata={"help": "Whether to evaluate model on public test datasets."}) - - -@dataclass -class ModelArguments: - model_name_or_path: str = field(default="ernie-1.0-large-zh-cw", metadata={"help": "Build-in pretrained model name or the path to local model."}) - export_type: str = field(default='paddle', metadata={"help": "The type to export. Support `paddle` and `onnx`."}) - dropout: float = field(default=0.1, metadata={"help": "The dropout used for pretrained model."}) -# yapf: enable - - -def main(): - # Parse the arguments. - parser = PdArgumentParser((ModelArguments, DataArguments, PromptTuningArguments)) - model_args, data_args, training_args = parser.parse_args_into_dataclasses() - data_args = load_prompt_arguments(data_args) - training_args.print_config(model_args, "Model") - training_args.print_config(data_args, "Data") - paddle.set_device(training_args.device) - - # Load the pretrained language model. - tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path) - model = AutoModelForSequenceClassification.from_pretrained( - model_args.model_name_or_path, - num_labels=2, - hidden_dropout_prob=model_args.dropout, - attention_probs_dropout_prob=model_args.dropout, - ) - - # Define template for preprocess and verbalizer for postprocess. 
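-    # The template splices each example with a candidate label description;
-    # EFL then scores the pair with a binary entailment classifier.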
- template = ManualTemplate(data_args.prompt, tokenizer, training_args.max_seq_length) - logger.info("Using template: {}".format(template.prompt)) - - verbalizer = ManualVerbalizer(data_args.label_words, tokenizer) - ids_to_labels = {idx: label for idx, label in enumerate(verbalizer.labels)} - logger.info("Using verbalizer: {}".format(data_args.label_words)) - - # Load datasets. - train_ds, dev_ds, public_test_ds, test_ds, unlabeled_ds = load_fewclue_dataset(data_args, verbalizer=verbalizer) - - # Define the criterion. - criterion = paddle.nn.CrossEntropyLoss() - - # Initialize the prompt model with the above variables. - prompt_model = PromptModelForSequenceClassification( - model, template, None, freeze_plm=training_args.freeze_plm, freeze_dropout=training_args.freeze_dropout - ) - - # Define the metric function. - def compute_metrics(eval_preds, num_labels): - metric = Accuracy() - preds = paddle.to_tensor(eval_preds.predictions) - preds = paddle.nn.functional.softmax(preds, axis=1)[:, 1] - preds = preds.reshape([-1, num_labels]) - labels = paddle.to_tensor(eval_preds.label_ids) - labels = paddle.argmax(labels.reshape([-1, num_labels]), axis=1) - correct = metric.compute(preds, labels) - metric.update(correct) - acc = metric.accumulate() - return {"accuracy": acc} - - # Initialize the trainer. - compute_metrics = partial(compute_metrics, num_labels=len(verbalizer.labels)) - trainer = PromptTrainer( - model=prompt_model, - tokenizer=tokenizer, - args=training_args, - criterion=criterion, - train_dataset=train_ds, - eval_dataset=dev_ds, - callbacks=None, - compute_metrics=compute_metrics, - ) - - # Traininig. - if training_args.do_train: - train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint) - metrics = train_result.metrics - trainer.save_model() - trainer.log_metrics("train", metrics) - trainer.save_metrics("train", metrics) - trainer.save_state() - - time_stamp = time.strftime("%m%d-%H-%M-%S", time.localtime()) - - # Test. - if data_args.do_test and public_test_ds is not None: - test_ret = trainer.predict(public_test_ds) - trainer.log_metrics("test", test_ret.metrics) - - # Predict. - if training_args.do_predict and test_ds is not None: - pred_ret = trainer.predict(test_ds) - logger.info("Prediction done.") - predict_path = os.path.join(training_args.output_dir, "fewclue_submit_examples_" + time_stamp) - save_fewclue_prediction(predict_path, data_args.task_name, pred_ret, verbalizer, ids_to_labels) - - # Label unsupervised data. - if data_args.do_label and unlabeled_ds is not None: - label_ret = trainer.predict(unlabeled_ds) - logger.info("Labeling done.") - pseudo_path = os.path.join(training_args.output_dir, "pseudo_data_" + time_stamp + ".txt") - save_pseudo_data(pseudo_path, data_args.task_name, label_ret, verbalizer, ids_to_labels) - - # Export static model. 
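-    # Every InputSpec below keeps batch and sequence dims as None, so the
-    # exported static graph accepts variable-sized inputs at inference time.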
- if training_args.do_export: - input_spec = [ - InputSpec(shape=[None, None], dtype="int64"), # input_ids, - InputSpec(shape=[None, None], dtype="int64"), # token_type_ids - InputSpec(shape=[None, None], dtype="int64"), # position_ids - InputSpec(shape=[None, None, None, None], dtype="float32"), # attention_mask - ] - export_path = os.path.join(training_args.output_dir, "export") - trainer.export_model(export_path, input_spec=input_spec, export_type=model_args.export_type) - - -if __name__ == "__main__": - main() diff --git a/examples/few_shot/efl/utils.py b/examples/few_shot/efl/utils.py deleted file mode 100644 index dfc6463bb69d..000000000000 --- a/examples/few_shot/efl/utils.py +++ /dev/null @@ -1,252 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import json -import os -import pathlib - -import numpy as np -import paddle - -from paddlenlp.datasets import load_dataset - -LABEL_TO_STANDARD = { - "tnews": { - "news_story": "100", - "news_culture": "101", - "news_entertainment": "102", - "news_sports": "103", - "news_finance": "104", - "news_house": "106", - "news_car": "107", - "news_edu": "108", - "news_tech": "109", - "news_military": "110", - "news_travel": "112", - "news_world": "113", - "news_stock": "114", - "news_agriculture": "115", - "news_game": "116", - }, - "iflytek": { - "打车": 0, - "美颜": 100, - "影像剪辑": 101, - "摄影修图": 102, - "相机": 103, - "绘画": 104, - "二手": 105, - "电商": 106, - "团购": 107, - "外卖": 108, - "电影票务": 109, - "社区服务": 10, - "社区超市": 110, - "购物咨询": 111, - "笔记": 112, - "办公": 113, - "日程管理": 114, - "女性": 115, - "经营": 116, - "收款": 117, - "其他": 118, - "薅羊毛": 11, - "魔幻": 12, - "仙侠": 13, - "卡牌": 14, - "飞行空战": 15, - "射击游戏": 16, - "休闲益智": 17, - "动作类": 18, - "体育竞技": 19, - "地图导航": 1, - "棋牌中心": 20, - "经营养成": 21, - "策略": 22, - "MOBA": 23, - "辅助工具": 24, - "约会社交": 25, - "即时通讯": 26, - "工作社交": 27, - "论坛圈子": 28, - "婚恋社交": 29, - "免费WIFI": 2, - "情侣社交": 30, - "社交工具": 31, - "生活社交": 32, - "微博博客": 33, - "新闻": 34, - "漫画": 35, - "小说": 36, - "技术": 37, - "教辅": 38, - "问答交流": 39, - "租车": 3, - "搞笑": 40, - "杂志": 41, - "百科": 42, - "影视娱乐": 43, - "求职": 44, - "兼职": 45, - "视频": 46, - "短视频": 47, - "音乐": 48, - "直播": 49, - "同城服务": 4, - "电台": 50, - "K歌": 51, - "成人": 52, - "中小学": 53, - "职考": 54, - "公务员": 55, - "英语": 56, - "视频教育": 57, - "高等教育": 58, - "成人教育": 59, - "快递物流": 5, - "艺术": 60, - "语言(非英语)": 61, - "旅游资讯": 62, - "综合预定": 63, - "民航": 64, - "铁路": 65, - "酒店": 66, - "行程管理": 67, - "民宿短租": 68, - "出国": 69, - "婚庆": 6, - "工具": 70, - "亲子儿童": 71, - "母婴": 72, - "驾校": 73, - "违章": 74, - "汽车咨询": 75, - "汽车交易": 76, - "日常养车": 77, - "行车辅助": 78, - "租房": 79, - "家政": 7, - "买房": 80, - "装修家居": 81, - "电子产品": 82, - "问诊挂号": 83, - "养生保健": 84, - "医疗服务": 85, - "减肥瘦身": 86, - "美妆美业": 87, - "菜谱": 88, - "餐饮店": 89, - "公共交通": 8, - "体育咨讯": 90, - "运动健身": 91, - "支付": 92, - "保险": 93, - "股票": 94, - "借贷": 95, - "理财": 96, - "彩票": 97, - "记账": 98, - "银行": 99, - "政务": 9, - }, -} - - -def load_prompt_arguments(args): - """ - Load prompt and label words according to prompt index. 
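-
-    Empty or missing verbalizer entries fall back to the last non-empty
-    entry, so every template index resolves to a set of label words.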
- """ - with open(args.prompt_path, "r", encoding="utf-8") as fp: - configs = json.load(fp) - assert len(configs["verbalizer"]) == len(configs["template"]) - assert configs["verbalizer"][0] is not None - verbalizer = [configs["verbalizer"][0]] - last_verb_index = 0 - for index, verb in enumerate(configs["verbalizer"][1:]): - if verb is None or len(verb) == 0: - verbalizer.append(configs["verbalizer"][last_verb_index]) - else: - verbalizer.append(verb) - last_verb_index = index + 1 - configs["verbalizer"] = verbalizer - args.prompt = configs["template"][args.prompt_index]["text"] - label_words = configs["verbalizer"][args.prompt_index] - if isinstance(label_words, list): - label_words = {k: k for k in label_words} - args.label_words = label_words - return args - - -def save_pseudo_data(save_path, task_name, label_preds, verbalizer, labels): - """ - Combine unsupervised data and corresponding predicted labels and - save one example per line. - """ - if task_name == "cluewsc": - return None - - num_labels = len(labels) - data_ds = load_dataset("fewclue", name=task_name, splits="unlabeled") - preds = paddle.to_tensor(label_preds.predictions) - preds = paddle.nn.functional.softmax(preds, axis=1)[:, 1].numpy() - preds = preds.reshape([-1, num_labels]) - label_preds = np.argmax(preds, axis=1) - label_probs = np.max(preds, axis=1) - pseudo_data = [] - for index, example in enumerate(data_ds): - example["labels"] = labels[label_preds[index]] - example["prob"] = str(label_probs[index]) - pseudo_data.append(example) - save_data(pseudo_data, save_path) - - -def save_fewclue_prediction(save_path, task_name, label_preds, verbalizer, labels): - """ - Extract predicted labels and save as the format required by FewCLUE. - """ - num_labels = len(labels) - preds = paddle.to_tensor(label_preds.predictions) - preds = paddle.nn.functional.softmax(preds, axis=1)[:, 1] - preds = preds.reshape([-1, num_labels]) - if task_name == "chid": - batch_size = preds.shape[0] - preds = paddle.nn.functional.softmax(preds, axis=1)[:, 1] - preds = preds.reshape([batch_size // 7, 7]) - preds = paddle.nn.functional.softmax(preds, axis=1).numpy() - preds = np.argmax(preds, axis=1) - test_ds = load_dataset("fewclue", name=task_name, splits="test") - - ret_list = [] - maps = LABEL_TO_STANDARD.get(task_name, None) - for idx, example in enumerate(test_ds): - uid = example.get("id", idx) - if task_name in ["bustm", "csl"]: - ret_list.append({"id": uid, "label": str(preds[idx])}) - elif task_name == "chid": - ret_list.append({"id": uid, "answer": preds[idx]}) - elif task_name in ["cluewsc", "eprstmt", "ocnli", "csldcp"]: - ret_list.append({"id": uid, "label": labels[preds[idx]]}) - elif task_name in ["iflytek", "tnews"]: - ret_list.append({"id": uid, "label": str(maps[labels[preds[idx]]])}) - save_file = task_name if task_name in ["bustm", "csldcp", "eprstmt"] else task_name + "f" - save_data(ret_list, save_path, save_file + "_predict.json") - - -def save_data(data, save_path, save_file=None): - if save_file is not None: - pathlib.Path(save_path).mkdir(parents=True, exist_ok=True) - save_path = os.path.join(save_path, save_file) - with open(save_path, "w") as fp: - for example in data: - fp.write(json.dumps(example, ensure_ascii=False) + "\n") diff --git a/examples/few_shot/p-tuning/README.md b/examples/few_shot/p-tuning/README.md deleted file mode 100644 index 38d2f0af9c52..000000000000 --- a/examples/few_shot/p-tuning/README.md +++ /dev/null @@ -1,85 +0,0 @@ -# P-Tuning - -[GPT Understands, Too](https://arxiv.org/pdf/2103.10385.pdf) 
- -## 算法简介 - -P-tuning 引入可学习的连续型提示向量 prompt embeddings 参数, 让模型自己去学习最优的 prompt embedding, 而不再依赖人工去设置自然语言形式的提示(Prompt)信息。P-Tuning 算法的数据和模型定义如下图所示,对应于数据预处理模块 `SoftTemplate` 和标签词映射模块 `MaskedLMVerbalizer`,详细介绍及定义方法参见 [Prompt API 文档](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/advanced_guide/prompt.md)。 - -![p-tuning](https://user-images.githubusercontent.com/25607475/204214359-3036c6c6-f101-4a5f-958c-abe0e40c243a.png) - - -## 快速开始 - -CLUE(Chinese Language Understanding Evaluation)作为中文语言理解权威测评榜单,在学术界和工业界都有着广泛影响。FewCLUE 是其设立的中文小样本学习测评子榜,旨在探索小样本学习最佳模型和中文实践。PaddleNLP 内置了 FewCLUE 数据集,可以直接用来进行 PET 策略训练、评估、预测,并生成 FewCLUE 榜单的提交结果,参与 FewCLUE 竞赛。 -PaddleNLP 内置了 FewCLUE 数据集,可以直接用来进行 P-tuning 策略训练、评估、预测,并生成 FewCLUE 榜单的提交结果,参与 FewCLUE 竞赛。 - -### 代码结构及说明 -``` -├── run_train.py # P-Tuning 算法提示学习脚本 -├── data.py # 数据集构造、数据增强 -├── utils.py # FewCLUE 提交结果保存等工具函数 -└── prompt/ # FewCLUE 各数据集的 prompt 定义文件 -``` - -### 数据准备 - -读取 FewCLUE 数据集只需要 1 行代码,这部分代码在 `data.py` 脚本中。以情感分类数据集 `eprstmt` 为例: -``` -from paddlenlp.datasets import load_dataset - -# 通过指定 "fewclue" 和数据集名字 name="eprstmt" 即可一键加载 FewCLUE 中的 eprstmt 数据集 -train_ds, dev_ds, public_test_ds = load_dataset("fewclue", name="eprstmt", splits=("train_0", "dev_0", "test_public")) -``` - -### 模型训练、评估、预测 - -通过如下命令,指定 GPU 0 卡, 使用一个连续型提示向量在 FewCLUE 的 `eprstmt` 数据集上进行训练和评估。如果要使用多个可学习连续型提示向量,可修改 `./prompt/` 目录下相应的文件,修改 `soft` 的长度属性 `length` 即可。 -``` -python -u -m paddle.distributed.launch --gpus "0" run_train.py \ - --output_dir checkpoint_eprstmt \ - --task_name eprstmt \ - --split_id few_all \ - --prompt_path prompt/eprstmt.json \ - --prompt_index 0 \ - --do_train \ - --do_eval \ - --do_test \ - --do_predict \ - --do_label \ - --max_steps 1000 \ - --learning_rate 3e-5 \ - --eval_steps 100 \ - --save_steps 100 \ - --logging_steps 5 \ - --per_device_train_batch_size 16 \ - --max_seq_length 128 \ - --load_best_model_at_end \ - --metric_for_best_model accuracy \ - --save_total_limit 1 -``` - -参数含义说明 -- `task_name`: FewCLUE 中的数据集名字 -- `split_id`: 数据集编号,包括0, 1, 2, 3, 4 和 few_all -- `prompt_path`: prompt 定义文件名 -- `prompt_index`: 使用定义文件中第 `prompt_index` 个 prompt -- `augment_type`: 数据增强策略,可选 swap, delete, insert, substitute -- `num_augment`: 数据增强策略为每个样本生成的样本数量 -- `word_augment_percent`: 每个序列中数据增强词所占的比例 -- `pseudo_data_path`: 使用模型标注的伪标签数据文件路径 -- `do_label`: 是否使用训练后的模型给无标签数据标注伪标签 -- `do_test`: 是否在公开测试集上评估模型效果 -- `model_name_or_path`: 预训练模型名,默认为 `ernie-1.0-large-zh-cw` -- `use_rdrop`: 是否使用对比学习策略 R-Drop -- `alpha_rdrop`: R-Drop 损失值权重,默认为 0.5 -- `dropout`: 预训练模型的 dropout 参数值,用于 R-Drop 策略中参数配置 -- `export_type`: 模型导出格式,默认为 `paddle`,动态图转静态图 -- 更多配置参考 [Trainer 参数文档](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/trainer.md#trainingarguments-%E5%8F%82%E6%95%B0%E4%BB%8B%E7%BB%8D) 和 [PromptTrainer 参数文档](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/advanced_guide/prompt.md#prompttrainer%E5%8F%82%E6%95%B0%E5%88%97%E8%A1%A8) - -### 模型部署 - -Coming soon... - -## References -[1]X. Liu et al., “GPT Understands, Too,” arXiv:2103.10385 [cs], Mar. 2021, Accessed: Mar. 22, 2021. [Online]. Available: http://arxiv.org/abs/2103.10385 diff --git a/examples/few_shot/p-tuning/data.py b/examples/few_shot/p-tuning/data.py deleted file mode 100644 index 6f96ac02cdc8..000000000000 --- a/examples/few_shot/p-tuning/data.py +++ /dev/null @@ -1,202 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import json -from functools import partial - -import paddle - -from paddlenlp.dataaug import WordDelete, WordInsert, WordSubstitute, WordSwap -from paddlenlp.datasets import MapDataset, load_dataset - - -def extend_with_pseudo_data(data_ds, pseudo_path, labels_to_ids): - """ - Extend train dataset with pseudo labeled examples if exists. - """ - if pseudo_path is None: - return data_ds - with open(pseudo_path, "r", encoding="utf-8") as fp: - pseudo_data = [json.loads(x.strip()) for x in fp] - data_ds = MapDataset([x for x in data_ds] + pseudo_data) - return data_ds - - -def extend_with_data_augment(data_ds, aug_type, num_aug=10, percent=0.1, aug_base="mlm", example_keys=None): - """ - Extend train dataset with augmentation. - """ - if example_keys is None: - return data_ds - if aug_type is None or aug_type == "None": - return data_ds - if aug_type == "delete": - aug = WordDelete(create_n=num_aug, aug_percent=percent) - elif aug_type == "substitute": - aug = WordSubstitute(aug_base, create_n=num_aug, aug_percent=percent) - elif aug_type == "insert": - aug = WordInsert(aug_base, create_n=num_aug, aug_percent=percent) - elif aug_type == "swap": - aug = WordSwap(create_n=num_aug, aug_percent=percent) - else: - raise ValueError("Unsupported data augment strategy `{}`".format(aug_type)) - - aug_data = [] - for example in data_ds: - for key in example_keys: - text_aug = aug.augment(example[key]) - for text in text_aug: - new_example = example.copy() - example[key] = text - aug_data.append(new_example) - - data_ds = MapDataset([x for x in data_ds] + aug_data) - return data_ds - - -def convert_chid(data_ds): - """ - Insert idioms into positions of `#idiom#` so that the task is converted - to binary classification. - """ - split_data_ds = [] - for example in data_ds: - fragments = example["content"].split("#idiom#") - label = example.get("answer", None) - for index, cand in enumerate(example["candidates"]): - new_example = {"content_pre": fragments[0], "content_post": fragments[1], "idiom": cand} - if label is not None: - new_example["label"] = str(int(index == label)) - split_data_ds.append(new_example) - return MapDataset(split_data_ds) - - -def convert_csl(data_ds): - """ - Concatanate keywords and it can be replaced by keyword `options` in develop versioin. - """ - concat_data_ds = [] - for example in data_ds: - example["keyword"] = ",".join(example["keyword"]) - concat_data_ds.append(example) - return MapDataset(concat_data_ds) - - -def convert_cluewsc(data_ds): - """ - Mark the pronoun and entity with special tokens. 
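-    The entity span is wrapped in brackets and the pronoun span in
-    underscores, e.g. "ent" -> "[ent]" and "it" -> "_it_".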
- """ - marked_data_ds = [] - for example in data_ds: - target, text = example["target"], list(example["text"]) - pronoun, p_index = target["span2_text"], target["span2_index"] - entity, e_index = target["span1_text"], target["span1_index"] - label = example.get("label", None) - if p_index > e_index: - text.insert(p_index, "_") - text.insert(p_index + len(pronoun) + 1, "_") - text.insert(e_index, "[") - text.insert(e_index + len(entity) + 1, "]") - else: - text.insert(e_index, "[") - text.insert(e_index + len(entity) + 1, "]") - text.insert(p_index, "_") - text.insert(p_index + len(pronoun) + 1, "_") - new_example = {"text": "".join(text), "pronoun": pronoun, "entity": entity} - if label is not None: - new_example["label"] = label - marked_data_ds.append(new_example) - return MapDataset(marked_data_ds) - - -def convert_labels_to_ids(example, orig_key, labels_to_ids, pop_keys=None): - """ - Convert the keyword in datasets to `labels`. - """ - if orig_key in example: - example["label_ids"] = labels_to_ids[example.pop(orig_key)] - if pop_keys is not None: - for key in pop_keys: - if key in example: - example.pop(key) - return example - - -def convert_ids_to_words(example, token_ids): - """ - Convert label id to the first word in mapping from labels to words, - the length of which should coincide with that of `mask` in prompt. - """ - if "label_ids" in example: - labels = paddle.index_select(token_ids, paddle.to_tensor(example.pop("label_ids")), axis=0).squeeze(0) - example["labels"] = labels - return example - - -def load_fewclue_dataset(args, verbalizer, example_keys=None): - """ - Load fewclue datasets and convert them to the standard format of PET. - """ - split_id = args.split_id - splits = [f"train_{split_id}", f"dev_{split_id}", "test_public", "test"] - if args.task_name == "cluewsc": - train_ds, dev_ds, public_test_ds, test_ds = load_dataset("fewclue", name=args.task_name, splits=splits) - unlabeled_ds = None - else: - splits.append("unlabeled") - train_ds, dev_ds, public_test_ds, test_ds, unlabeled_ds = load_dataset( - "fewclue", name=args.task_name, splits=splits - ) - data_ds = [train_ds, dev_ds, public_test_ds, test_ds, unlabeled_ds] - - # Preprocess data for mask prediction task. - if args.task_name == "chid": - for index, sub_data_ds in enumerate(data_ds): - data_ds[index] = convert_chid(sub_data_ds) - elif args.task_name == "cluewsc": - for index, sub_data_ds in enumerate(data_ds[:-1]): - data_ds[index] = convert_cluewsc(sub_data_ds) - elif args.task_name == "csl": - for index, sub_data_ds in enumerate(data_ds): - data_ds[index] = convert_csl(sub_data_ds) - orig_key = "label" - pop_keys = ["id"] - if args.task_name == "tnews": - orig_key = "label_desc" - pop_keys = ["keywords", "label", "id"] - elif args.task_name == "iflytek": - orig_key = "label_des" - pop_keys = ["id", "label"] - elif args.task_name == "ocnli": - pop_keys = ["level", "label0", "label1", "label2", "label3", "label4", "genre", "prem_id", "id"] - convert_label = partial( - convert_labels_to_ids, orig_key=orig_key, labels_to_ids=verbalizer.labels_to_ids, pop_keys=pop_keys - ) - for index, sub_data_ds in enumerate(data_ds): - if sub_data_ds is not None: - data_ds[index] = sub_data_ds.map(convert_label) - - # Extend train dataset with data augmentation and pseudo-label data. 
- data_ds[0] = extend_with_data_augment( - data_ds[0], args.augment_type, args.num_augment, args.word_augment_percent, args.augment_method, example_keys - ) - data_ds[0] = extend_with_pseudo_data(data_ds[0], args.pseudo_data_path, verbalizer.labels_to_ids) - - dev_labels = [x["label_ids"] for x in data_ds[1]] - test_labels = [x["label_ids"] for x in data_ds[2]] - - convert_fn = partial(convert_ids_to_words, token_ids=verbalizer.token_ids[:, 0, :]) - data_ds[:3] = [x.map(convert_fn) for x in data_ds[:3]] - - return data_ds, (dev_labels, test_labels) diff --git a/examples/few_shot/p-tuning/prompt/bustm.json b/examples/few_shot/p-tuning/prompt/bustm.json deleted file mode 100644 index 345930ea51a9..000000000000 --- a/examples/few_shot/p-tuning/prompt/bustm.json +++ /dev/null @@ -1,8 +0,0 @@ -{ - "template": [ - {"text": "{'mask'}{'soft'}{'text': 'sentence1'}{'text': 'sentence2'}"} - ], - "verbalizer": [ - {"0": "不", "1": "很"} - ] -} diff --git a/examples/few_shot/p-tuning/prompt/chid.json b/examples/few_shot/p-tuning/prompt/chid.json deleted file mode 100644 index cc3b30195fa7..000000000000 --- a/examples/few_shot/p-tuning/prompt/chid.json +++ /dev/null @@ -1,8 +0,0 @@ -{ - "template": [ - {"text": "{'mask'}{'soft'}{'text':'content_pre'}{'text': 'idiom'}{'text': 'content_post'}"} - ], - "verbalizer": [ - {"0": "否", "1": "是"} - ] -} diff --git a/examples/few_shot/p-tuning/prompt/cluewsc.json b/examples/few_shot/p-tuning/prompt/cluewsc.json deleted file mode 100644 index c0ef7573441b..000000000000 --- a/examples/few_shot/p-tuning/prompt/cluewsc.json +++ /dev/null @@ -1,8 +0,0 @@ -{ - "template": [ - {"text": "{'mask'}{'mask'}{'soft'}{'text': 'text'}{'text': 'pronoun'}指的是{'text': 'entity'}"} - ], - "verbalizer": [ - {"false": "错误", "true": "正确"} - ] -} diff --git a/examples/few_shot/p-tuning/prompt/csl.json b/examples/few_shot/p-tuning/prompt/csl.json deleted file mode 100644 index 443ba172a2fe..000000000000 --- a/examples/few_shot/p-tuning/prompt/csl.json +++ /dev/null @@ -1,8 +0,0 @@ -{ - "template": [ - {"text": "{'mask'}{'soft'}本文关键词有{'text': 'keyword'}{'text': 'abst'}"} - ], - "verbalizer": [ - {"0": "不", "1": "很"} - ] -} diff --git a/examples/few_shot/p-tuning/prompt/csldcp.json b/examples/few_shot/p-tuning/prompt/csldcp.json deleted file mode 100644 index 5bb12c680f4e..000000000000 --- a/examples/few_shot/p-tuning/prompt/csldcp.json +++ /dev/null @@ -1,76 +0,0 @@ -{ - "template": [ - {"text": "{'mask'}{'mask'}{'soft'}{'text': 'content'}"} - ], - "verbalizer": [ - { - "材料科学与工程": "材料", - "作物学": "作物", - "口腔医学": "口腔", - "药学": "药学", - "教育学": "教育", - "水利工程": "水利", - "理论经济学": "理经", - "食品科学与工程": "食品", - "畜牧学/兽医学": "畜牧", - "体育学": "体育", - "核科学与技术": "核科", - "力学": "力学", - "园艺学": "园艺", - "水产": "水产", - "法学": "法学", - "地质学/地质资源与地质工程": "地质", - "石油与天然气工程": "石油", - "农林经济管理": "农林", - "信息与通信工程": "通信", - "图书馆、情报与档案管理": "图书", - "政治学": "政治", - "电气工程": "电气", - "海洋科学": "海洋", - "民族学": "民族", - "航空宇航科学与技术": "航空", - "化学/化学工程与技术": "化学", - "哲学": "哲学", - "公共卫生与预防医学": "卫生", - "艺术学": "艺术", - "农业工程": "农工", - "船舶与海洋工程": "船舶", - "计算机科学与技术": "计科", - "冶金工程": "冶金", - "交通运输工程": "交通", - "动力工程及工程热物理": "动力", - "纺织科学与工程": "纺织", - "建筑学": "建筑", - "环境科学与工程": "环境", - "公共管理": "公管", - "数学": "数学", - "物理学": "物理", - "林学/林业工程": "林学", - "心理学": "心理", - "历史学": "历史", - "工商管理": "工管", - "应用经济学": "应经", - "中医学/中药学": "中医", - "天文学": "天文", - "机械工程": "机械", - "土木工程": "土木", - "光学工程": "光学", - "地理学": "地理", - "农业资源利用": "农业", - "生物学/生物科学与工程": "生物", - "兵器科学与技术": "兵器", - "矿业工程": "矿业", - "大气科学": "大气", - "基础医学/临床医学": "基础", - "电子科学与技术": "电子", - "测绘科学与技术": "测绘", - 
"控制科学与工程": "控制", - "军事学": "军事", - "中国语言文学": "中文", - "新闻传播学": "新闻", - "社会学": "社会", - "地球物理学":"地球", - "植物保护":"植保" - } - ] -} diff --git a/examples/few_shot/p-tuning/prompt/eprstmt.json b/examples/few_shot/p-tuning/prompt/eprstmt.json deleted file mode 100644 index ea6941cdd963..000000000000 --- a/examples/few_shot/p-tuning/prompt/eprstmt.json +++ /dev/null @@ -1,8 +0,0 @@ -{ - "template": [ - {"text": "{'mask'}{'soft'}{'text':'sentence'}"} - ], - "verbalizer": [ - {"Negative": "不", "Positive": "很"} - ] -} diff --git a/examples/few_shot/p-tuning/prompt/iflytek.json b/examples/few_shot/p-tuning/prompt/iflytek.json deleted file mode 100644 index 198ce1994973..000000000000 --- a/examples/few_shot/p-tuning/prompt/iflytek.json +++ /dev/null @@ -1,129 +0,0 @@ -{ - "template": [ - {"text": "{'mask': None, 'length': 4}{'soft'}{'text': 'sentence'}"} - ], - "verbalizer": [ - { - "银行": "银行办理", - "社区服务": "社区服务", - "电商": "电商网购", - "支付": "支付交易", - "经营养成": "经营养成", - "卡牌": "卡牌游戏", - "借贷": "借贷借款", - "驾校": "驾校学车", - "理财": "投资理财", - "职考": "职业考试", - "新闻": "新闻资讯", - "旅游资讯": "旅游资讯", - "公共交通": "公共交通", - "魔幻": "魔幻游戏", - "医疗服务": "医疗服务", - "影像剪辑": "影像剪辑", - "动作类": "动作游戏", - "工具": "使用工具", - "体育竞技": "体育竞技", - "小说": "小说阅读", - "运动健身": "运动健身", - "相机": "相机拍照", - "辅助工具": "辅助工具", - "快递物流": "快递物流", - "高等教育": "高等教育", - "股票": "股票炒股", - "菜谱": "做菜菜谱", - "行车辅助": "行车帮助", - "仙侠": "仙侠小说", - "亲子儿童": "亲子儿童", - "购物咨询": "购物资讯", - "射击游戏": "射击游戏", - "漫画": "动漫漫画", - "中小学": "中学小学", - "同城服务": "同城跑腿", - "成人教育": "成人教育", - "求职": "面试求职", - "电子产品": "电子产品", - "艺术": "艺术学习", - "薅羊毛": "比价省钱", - "约会社交": "约会社交", - "经营": "经营管理", - "兼职": "兼职赚钱", - "短视频": "拍短视频", - "音乐": "音乐乐库", - "英语": "英语学习", - "棋牌中心": "棋牌中心", - "摄影修图": "摄影修图", - "养生保健": "养生保健", - "办公": "办公工具", - "政务": "政务服务", - "视频": "视频拍摄", - "论坛圈子": "论坛圈子", - "彩票": "彩票乐透", - "直播": "直播娱乐", - "其他": "其他类别", - "休闲益智": "休闲益智", - "策略": "策略游戏", - "即时通讯": "即时通讯", - "汽车交易": "汽车交易", - "违章": "违章罚款", - "地图导航": "地图导航", - "民航": "民用航空", - "电台": "电台播报", - "语言(非英语)": "小语种类", - "搞笑": "搞笑娱乐", - "婚恋社交": "婚恋社交", - "社区超市": "社区超市", - "日常养车": "日常养车", - "杂志": "杂志期刊", - "视频教育": "线上教育", - "家政": "家政服务", - "影视娱乐": "影视娱乐", - "装修家居": "装修家居", - "体育咨讯": "体育资讯", - "社交工具": "社交工具", - "餐饮店": "餐饮美食", - "美颜": "美颜相机", - "问诊挂号": "问诊挂号", - "飞行空战": "飞行空战", - "综合预定": "综合预定", - "电影票务": "电影票务", - "笔记": "笔记记录", - "买房": "买房购房", - "外卖": "外卖配送", - "母婴": "母婴产品", - "打车": "打车出行", - "情侣社交": "情侣社交", - "日程管理": "日程管理", - "租车": "租车出行", - "微博博客": "微博博客", - "百科": "知识百科", - "绘画": "绘画学习", - "铁路": "铁路交通", - "生活社交": "生活社交", - "租房": "租房房源", - "酒店": "酒店住宿", - "保险": "保险理赔", - "问答交流": "问答交流", - "收款": "收款交易", - "MOBA": "多人竞技", - "K歌": "唱歌K歌", - "技术": "技术学习", - "减肥瘦身": "减肥瘦身", - "工作社交": "工作社交", - "团购": "团购拼单", - "记账": "记录记账", - "女性": "女性生活", - "公务员": "公务员类", - "二手": "二手交易", - "美妆美业": "美妆美业", - "汽车咨询": "汽车资讯", - "行程管理": "行程管理", - "免费WIFI": "WIFI", - "教辅": "教育辅助", - "成人": "成人两性", - "婚庆": "婚庆结婚", - "民宿短租": "民宿短租", - "出国": "出国相关" - } - ] -} - diff --git a/examples/few_shot/p-tuning/prompt/ocnli.json b/examples/few_shot/p-tuning/prompt/ocnli.json deleted file mode 100644 index 796cb691f99d..000000000000 --- a/examples/few_shot/p-tuning/prompt/ocnli.json +++ /dev/null @@ -1,8 +0,0 @@ -{ - "template": [ - {"text": "{'mask'}{'mask'}{'soft'}{'text': 'sentence1'}{'text': 'sentence2'}"} - ], - "verbalizer": [ - {"contradiction": "不同", "entailment": "相似", "neutral": "无关"} - ] -} diff --git a/examples/few_shot/p-tuning/prompt/tnews.json b/examples/few_shot/p-tuning/prompt/tnews.json deleted file mode 100644 index 822c30badd52..000000000000 --- 
a/examples/few_shot/p-tuning/prompt/tnews.json +++ /dev/null @@ -1,24 +0,0 @@ -{ - "template": [ - {"text": "{'mask'}{'mask'}{'soft'}{'text':'sentence'}"} - ], - "verbalizer": [ - { - "news_story": "八卦", - "news_entertainment": "明星", - "news_finance": "财经", - "news_sports": "体育", - "news_edu": "校园", - "news_game": "游戏", - "news_culture": "文化", - "news_tech": "科技", - "news_car": "汽车", - "news_travel": "旅行", - "news_world": "国际", - "news_agriculture": "农业", - "news_military": "军事", - "news_house": "房子", - "news_stock": "股票" - } - ] -} diff --git a/examples/few_shot/p-tuning/run_train.py b/examples/few_shot/p-tuning/run_train.py deleted file mode 100644 index abe66b7bd3fa..000000000000 --- a/examples/few_shot/p-tuning/run_train.py +++ /dev/null @@ -1,175 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import os -import time -from dataclasses import dataclass, field -from functools import partial - -import paddle -from data import load_fewclue_dataset -from paddle.metric import Accuracy -from paddle.static import InputSpec -from utils import load_prompt_arguments, save_fewclue_prediction, save_pseudo_data - -from paddlenlp.prompt import ( - MaskedLMVerbalizer, - PromptModelForSequenceClassification, - PromptTrainer, - PromptTuningArguments, - SoftTemplate, -) -from paddlenlp.trainer import PdArgumentParser -from paddlenlp.transformers import AutoModelForMaskedLM, AutoTokenizer -from paddlenlp.utils.log import logger - - -# yapf: disable -@dataclass -class DataArguments: - task_name: str = field(default="eprstmt", metadata={"help": "The task name in FewCLUE."}) - split_id: str = field(default="0", metadata={"help": "The split id of datasets, including 0, 1, 2, 3, 4, few_all."}) - prompt_path: str = field(default="prompt/eprstmt.json", metadata={"help": "Path to the defined prompts."}) - prompt_index: int = field(default=0, metadata={"help": "The index of defined prompt for training."}) - augment_type: str = field(default=None, metadata={"help": "The strategy used for data augmentation, including `swap`, `delete`, `insert`, `subsitute`."}) - num_augment: str = field(default=5, metadata={"help": "Number of augmented data per example, which works when `augment_type` is set."}) - word_augment_percent: str = field(default=0.1, metadata={"help": "Percentage of augmented words in sequences, used for `swap`, `delete`, `insert`, `subsitute`."}) - augment_method: str = field(default="mlm", metadata={"help": "Strategy used for `insert` and `subsitute`."}) - pseudo_data_path: str = field(default=None, metadata={"help": "Path to data with pseudo labels."}) - do_label: bool = field(default=False, metadata={"help": "Whether to label unsupervised data in unlabeled datasets"}) - do_test: bool = field(default=False, metadata={"help": "Whether to evaluate model on public test datasets."}) - - -@dataclass -class ModelArguments: - model_name_or_path: str = field(default="ernie-1.0-large-zh-cw", metadata={"help": "Build-in pretrained model name 
or the path to local model."}) - export_type: str = field(default='paddle', metadata={"help": "The type to export. Support `paddle` and `onnx`."}) - dropout: float = field(default=0.1, metadata={"help": "The dropout used for pretrained model."}) -# yapf: enable - - -def main(): - # Parse the arguments. - parser = PdArgumentParser((ModelArguments, DataArguments, PromptTuningArguments)) - model_args, data_args, training_args = parser.parse_args_into_dataclasses() - data_args = load_prompt_arguments(data_args) - training_args.print_config(model_args, "Model") - training_args.print_config(data_args, "Data") - paddle.set_device(training_args.device) - - # Load the pretrained language model. - tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path) - model = AutoModelForMaskedLM.from_pretrained( - model_args.model_name_or_path, - hidden_dropout_prob=model_args.dropout, - attention_probs_dropout_prob=model_args.dropout, - ) - - # Define template for preprocess and verbalizer for postprocess. - template = SoftTemplate(data_args.prompt, tokenizer, training_args.max_seq_length, model.get_input_embeddings()) - logger.info("Using template: {}".format(template.prompt)) - - verbalizer = MaskedLMVerbalizer(data_args.label_words, tokenizer) - labels_to_ids = verbalizer.labels_to_ids - ids_to_labels = {idx: label for label, idx in labels_to_ids.items()} - logger.info("Using verbalizer: {}".format(data_args.label_words)) - - # Load datasets. - data_ds, label_list = load_fewclue_dataset(data_args, verbalizer=verbalizer, example_keys=template.example_keys) - train_ds, dev_ds, public_test_ds, test_ds, unlabeled_ds = data_ds - dev_labels, test_labels = label_list - - # Define the criterion. - criterion = paddle.nn.CrossEntropyLoss() - - # Initialize the prompt model with the above variables. - prompt_model = PromptModelForSequenceClassification( - model, template, verbalizer, freeze_plm=training_args.freeze_plm, freeze_dropout=training_args.freeze_dropout - ) - - # Define the metric function. - def compute_metrics(eval_preds, labels, verbalizer): - metric = Accuracy() - predictions = paddle.to_tensor(eval_preds.predictions) - predictions = verbalizer.aggregate_multiple_mask(predictions) - correct = metric.compute(predictions, paddle.to_tensor(labels)) - metric.update(correct) - acc = metric.accumulate() - return {"accuracy": acc} - - # Initialize the trainer. - dev_compute_metrics = partial(compute_metrics, labels=dev_labels, verbalizer=verbalizer) - trainer = PromptTrainer( - model=prompt_model, - tokenizer=tokenizer, - args=training_args, - criterion=criterion, - train_dataset=train_ds, - eval_dataset=dev_ds, - callbacks=None, - compute_metrics=dev_compute_metrics, - ) - - # Traininig. - if training_args.do_train: - train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint) - metrics = train_result.metrics - trainer.save_model() - trainer.log_metrics("train", metrics) - trainer.save_metrics("train", metrics) - trainer.save_state() - - time_stamp = time.strftime("%m%d-%H-%M-%S", time.localtime()) - - # Test. - if data_args.do_test and public_test_ds is not None: - test_compute_metrics = partial(compute_metrics, labels=test_labels, verbalizer=verbalizer) - trainer.compute_metrics = test_compute_metrics - test_ret = trainer.predict(public_test_ds) - trainer.log_metrics("test", test_ret.metrics) - - # Predict. 
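-    # Predictions on the blind test split are written in the FewCLUE
-    # submission format (see save_fewclue_prediction in utils.py).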
- if training_args.do_predict and test_ds is not None: - pred_ret = trainer.predict(test_ds) - logger.info("Prediction done.") - predict_path = os.path.join(training_args.output_dir, "fewclue_submit_examples_" + time_stamp) - save_fewclue_prediction(predict_path, data_args.task_name, pred_ret, verbalizer, ids_to_labels) - - # Label unsupervised data. - if data_args.do_label and unlabeled_ds is not None: - label_ret = trainer.predict(unlabeled_ds) - logger.info("Labeling done.") - pseudo_path = os.path.join(training_args.output_dir, "pseudo_data_" + time_stamp + ".txt") - save_pseudo_data(pseudo_path, data_args.task_name, label_ret, verbalizer, ids_to_labels) - - # Export static model. - if training_args.do_export: - template = prompt_model.template - template_keywords = template.extract_template_keywords(template.prompt) - input_spec = [ - InputSpec(shape=[None, None], dtype="int64"), # input_ids, - InputSpec(shape=[None, None], dtype="int64"), # token_type_ids - InputSpec(shape=[None, None], dtype="int64"), # position_ids - InputSpec(shape=[None, None, None, None], dtype="float32"), # attention_mask - InputSpec(shape=[None], dtype="int64"), # masked_positions - InputSpec(shape=[None, None], dtype="int64"), # soft_token_ids - ] - if "encoder" in template_keywords: - input_spec.append(InputSpec(shape=[None, None], dtype="int64")) # encoder_ids - export_path = os.path.join(training_args.output_dir, "export") - trainer.export_model(export_path, input_spec=input_spec, export_type=model_args.export_type) - - -if __name__ == "__main__": - main() diff --git a/examples/few_shot/p-tuning/utils.py b/examples/few_shot/p-tuning/utils.py deleted file mode 100644 index 989b4e6b81a8..000000000000 --- a/examples/few_shot/p-tuning/utils.py +++ /dev/null @@ -1,249 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
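A note on the `do_export` branch shown above: `trainer.export_model` saves a static graph that Paddle Inference can serve directly. The sketch below is illustrative only — the `checkpoints/export` directory is an assumed `--output_dir`, and the `inference.pdmodel` / `inference.pdiparams` file names mirror the waybill_ie deploy scripts later in this diff, so adjust them to whatever your export actually produces:

```python
# Minimal sketch: load an exported static graph with Paddle Inference.
# Paths and file names are assumptions; match them to your export output.
import paddle.inference as paddle_infer

config = paddle_infer.Config(
    "checkpoints/export/inference.pdmodel",    # network structure
    "checkpoints/export/inference.pdiparams",  # weights
)
predictor = paddle_infer.create_predictor(config)
# Input handles follow the InputSpec order used at export time.
print(predictor.get_input_names())
```

The same `Config` / `create_predictor` pattern appears in full in the deploy scripts further down in this diff.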
- -import json -import os -import pathlib - -import numpy as np -import paddle - -from paddlenlp.datasets import load_dataset - -LABEL_TO_STANDARD = { - "tnews": { - "news_story": "100", - "news_culture": "101", - "news_entertainment": "102", - "news_sports": "103", - "news_finance": "104", - "news_house": "106", - "news_car": "107", - "news_edu": "108", - "news_tech": "109", - "news_military": "110", - "news_travel": "112", - "news_world": "113", - "news_stock": "114", - "news_agriculture": "115", - "news_game": "116", - }, - "iflytek": { - "打车": 0, - "美颜": 100, - "影像剪辑": 101, - "摄影修图": 102, - "相机": 103, - "绘画": 104, - "二手": 105, - "电商": 106, - "团购": 107, - "外卖": 108, - "电影票务": 109, - "社区服务": 10, - "社区超市": 110, - "购物咨询": 111, - "笔记": 112, - "办公": 113, - "日程管理": 114, - "女性": 115, - "经营": 116, - "收款": 117, - "其他": 118, - "薅羊毛": 11, - "魔幻": 12, - "仙侠": 13, - "卡牌": 14, - "飞行空战": 15, - "射击游戏": 16, - "休闲益智": 17, - "动作类": 18, - "体育竞技": 19, - "地图导航": 1, - "棋牌中心": 20, - "经营养成": 21, - "策略": 22, - "MOBA": 23, - "辅助工具": 24, - "约会社交": 25, - "即时通讯": 26, - "工作社交": 27, - "论坛圈子": 28, - "婚恋社交": 29, - "免费WIFI": 2, - "情侣社交": 30, - "社交工具": 31, - "生活社交": 32, - "微博博客": 33, - "新闻": 34, - "漫画": 35, - "小说": 36, - "技术": 37, - "教辅": 38, - "问答交流": 39, - "租车": 3, - "搞笑": 40, - "杂志": 41, - "百科": 42, - "影视娱乐": 43, - "求职": 44, - "兼职": 45, - "视频": 46, - "短视频": 47, - "音乐": 48, - "直播": 49, - "同城服务": 4, - "电台": 50, - "K歌": 51, - "成人": 52, - "中小学": 53, - "职考": 54, - "公务员": 55, - "英语": 56, - "视频教育": 57, - "高等教育": 58, - "成人教育": 59, - "快递物流": 5, - "艺术": 60, - "语言(非英语)": 61, - "旅游资讯": 62, - "综合预定": 63, - "民航": 64, - "铁路": 65, - "酒店": 66, - "行程管理": 67, - "民宿短租": 68, - "出国": 69, - "婚庆": 6, - "工具": 70, - "亲子儿童": 71, - "母婴": 72, - "驾校": 73, - "违章": 74, - "汽车咨询": 75, - "汽车交易": 76, - "日常养车": 77, - "行车辅助": 78, - "租房": 79, - "家政": 7, - "买房": 80, - "装修家居": 81, - "电子产品": 82, - "问诊挂号": 83, - "养生保健": 84, - "医疗服务": 85, - "减肥瘦身": 86, - "美妆美业": 87, - "菜谱": 88, - "餐饮店": 89, - "公共交通": 8, - "体育咨讯": 90, - "运动健身": 91, - "支付": 92, - "保险": 93, - "股票": 94, - "借贷": 95, - "理财": 96, - "彩票": 97, - "记账": 98, - "银行": 99, - "政务": 9, - }, -} - - -def load_prompt_arguments(args): - """ - Load prompt and label words according to prompt index. - """ - with open(args.prompt_path, "r", encoding="utf-8") as fp: - configs = json.load(fp) - assert len(configs["verbalizer"]) == len(configs["template"]) - assert configs["verbalizer"][0] is not None - verbalizer = [configs["verbalizer"][0]] - last_verb_index = 0 - for index, verb in enumerate(configs["verbalizer"][1:]): - if verb is None or len(verb) == 0: - verbalizer.append(configs["verbalizer"][last_verb_index]) - else: - verbalizer.append(verb) - last_verb_index = index + 1 - configs["verbalizer"] = verbalizer - args.prompt = configs["template"][args.prompt_index]["text"] - label_words = configs["verbalizer"][args.prompt_index] - if isinstance(label_words, list): - label_words = {k: k for k in label_words} - args.label_words = label_words - return args - - -def save_pseudo_data(save_path, task_name, label_preds, verbalizer, labels): - """ - Combine unsupervised data and corresponding predicted labels and - save one example per line. 
- """ - if task_name == "cluewsc": - return None - - data_ds = load_dataset("fewclue", name=task_name, splits="unlabeled") - preds = paddle.to_tensor(label_preds.predictions) - preds = verbalizer.aggregate_multiple_mask(preds) - preds = paddle.nn.functional.softmax(preds, axis=1).numpy() - label_preds = np.argmax(preds, axis=1) - label_probs = np.max(preds, axis=1) - pseudo_data = [] - for index, example in enumerate(data_ds): - example["labels"] = labels[label_preds[index]] - example["prob"] = str(label_probs[index]) - pseudo_data.append(example) - save_data(pseudo_data, save_path) - - -def save_fewclue_prediction(save_path, task_name, label_preds, verbalizer, labels): - """ - Extract predicted labels and save as the format required by FewCLUE. - """ - preds = paddle.to_tensor(label_preds.predictions) - preds = verbalizer.aggregate_multiple_mask(preds) - if task_name == "chid": - batch_size = preds.shape[0] - preds = paddle.nn.functional.softmax(preds, axis=1)[:, 1] - preds = preds.reshape([batch_size // 7, 7]) - preds = paddle.nn.functional.softmax(preds, axis=1).numpy() - preds = np.argmax(preds, axis=1) - test_ds = load_dataset("fewclue", name=task_name, splits="test") - - ret_list = [] - maps = LABEL_TO_STANDARD.get(task_name, None) - for idx, example in enumerate(test_ds): - uid = example.get("id", idx) - if task_name in ["bustm", "csl"]: - ret_list.append({"id": uid, "label": str(preds[idx])}) - elif task_name == "chid": - ret_list.append({"id": uid, "answer": preds[idx]}) - elif task_name in ["cluewsc", "eprstmt", "ocnli", "csldcp"]: - ret_list.append({"id": uid, "label": labels[preds[idx]]}) - elif task_name in ["iflytek", "tnews"]: - ret_list.append({"id": uid, "label": str(maps[labels[preds[idx]]])}) - save_file = task_name if task_name in ["bustm", "csldcp", "eprstmt"] else task_name + "f" - save_data(ret_list, save_path, save_file + "_predict.json") - - -def save_data(data, save_path, save_file=None): - if save_file is not None: - pathlib.Path(save_path).mkdir(parents=True, exist_ok=True) - save_path = os.path.join(save_path, save_file) - with open(save_path, "w") as fp: - for example in data: - fp.write(json.dumps(example, ensure_ascii=False) + "\n") diff --git a/examples/few_shot/pet/README.md b/examples/few_shot/pet/README.md deleted file mode 100644 index 0499883d4707..000000000000 --- a/examples/few_shot/pet/README.md +++ /dev/null @@ -1,84 +0,0 @@ -# PET - -[Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference](https://arxiv.org/abs/2001.07676) - -## 算法简介 - -自然语言处理任务可以通过给预训练模型提供“任务描述”等方式来进行无监督学习,但效果一般低于有监督训练。而 Pattern-Exploiting Training (PET) 是一种半监督方法,通过将输入转换为完形填空形式的短语来帮助语言模型理解任务。然后用这些短语来给无标注数据打软标签。最后在得到的标注数据集上用有监督方法进行训练。在小样本设置下,PET 在部分任务上远超有监督学习和强半监督学习方法。以 PET 为代表的提示学习与微调学习的区别如下图所示,包括数据预处理模块 `Template` 和标签词映射模块 `Verbalizer`。详细介绍及定义方法参见 [Prompt API 文档](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/advanced_guide/prompt.md)。 - -![PET_and_FT](https://user-images.githubusercontent.com/25607475/192727706-0a17b5ef-db6b-46be-894d-0ee315306776.png) - - -## 快速开始 - -CLUE(Chinese Language Understanding Evaluation)作为中文语言理解权威测评榜单,在学术界和工业界都有着广泛影响。FewCLUE 是其设立的中文小样本学习测评子榜,旨在探索小样本学习最佳模型和中文实践。PaddleNLP 内置了 FewCLUE 数据集,可以直接用来进行 PET 算法训练、评估、预测,并生成 FewCLUE 榜单的提交结果,参与 FewCLUE 竞赛。 - -### 代码结构说明 -``` -├── run_train.py # PET 算法提示学习脚本 -├── data.py # 数据集构造、数据增强 -├── utils.py # FewCLUE 提交结果保存等工具函数 -└── prompt/ # FewCLUE 各数据集的 prompt 定义文件 -``` - -### 数据准备 - -读取 FewCLUE 数据集只需要 1 行代码,这部分代码在 `data.py` 脚本中。以情感分类数据集 `eprstmt` 为例: - -``` -from 
paddlenlp.datasets import load_dataset - -# 通过指定 "fewclue" 和数据集名字 name="eprstmt" 即可一键加载 FewCLUE 中的eprstmt 数据集 -train_ds, dev_ds, public_test_ds = load_dataset("fewclue", name="eprstmt", splits=("train_0", "dev_0", "test_public")) -``` - -### 模型训练、评估、预测 - -通过如下命令,指定 GPU 0 卡, 在 FewCLUE 的 `eprstmt` 数据集上进行训练&评估 -``` -python -u -m paddle.distributed.launch --gpus "0" run_train.py \ - --output_dir checkpoint_eprstmt \ - --task_name eprstmt \ - --split_id few_all \ - --prompt_path prompt/eprstmt.json \ - --prompt_index 0 \ - --do_train \ - --do_eval \ - --do_test \ - --do_predict \ - --do_label \ - --max_steps 1000 \ - --learning_rate 3e-5 \ - --eval_steps 100 \ - --save_steps 100 \ - --logging_steps 5 \ - --per_device_train_batch_size 16 \ - --max_seq_length 128 \ - --load_best_model_at_end \ - --metric_for_best_model accuracy \ - --save_total_limit 1 -``` -参数含义说明 -- `task_name`: FewCLUE 中的数据集名字 -- `split_id`: 数据集编号,包括0, 1, 2, 3, 4 和 few_all -- `prompt_path`: prompt 定义文件名 -- `prompt_index`: 使用定义文件中第 `prompt_index` 个 prompt -- `augment_type`: 数据增强策略,可选 swap, delete, insert, substitute -- `num_augment`: 数据增强策略为每个样本生成的样本数量 -- `word_augment_percent`: 每个序列中数据增强词所占的比例 -- `pseudo_data_path`: 使用模型标注的伪标签数据文件路径 -- `do_label`: 是否使用训练后的模型给无标签数据标注伪标签 -- `do_test`: 是否在公开测试集上评估模型效果 -- `model_name_or_path`: 预训练模型名,默认为 `ernie-1.0-large-zh-cw` -- `use_rdrop`: 是否使用对比学习策略 R-Drop -- `alpha_rdrop`: R-Drop 损失值权重,默认为 0.5 -- `dropout`: 预训练模型的 dropout 参数值,用于 R-Drop 策略中参数配置 -- `export_type`: 模型导出格式,默认为 `paddle`,动态图转静态图 -- 更多配置参考 [Trainer 参数文档](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/trainer.md#trainingarguments-%E5%8F%82%E6%95%B0%E4%BB%8B%E7%BB%8D) 和 [PromptTrainer 参数文档](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/docs/advanced_guide/prompt.md#prompttrainer%E5%8F%82%E6%95%B0%E5%88%97%E8%A1%A8) - -### 模型部署 - -Coming soon... - -## References -[1] Schick, Timo, and Hinrich Schütze. “Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference.” ArXiv:2001.07676 [Cs], January 25, 2021. http://arxiv.org/abs/2001.07676. diff --git a/examples/few_shot/pet/data.py b/examples/few_shot/pet/data.py deleted file mode 100644 index ba2cca683830..000000000000 --- a/examples/few_shot/pet/data.py +++ /dev/null @@ -1,191 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import json -from functools import partial - -import paddle - -from paddlenlp.dataaug import WordDelete, WordInsert, WordSubstitute, WordSwap -from paddlenlp.datasets import MapDataset, load_dataset - - -def extend_with_pseudo_data(data_ds, pseudo_path, labels_to_ids): - """ - Extend train dataset with pseudo labeled examples if exists. 
- """ - if pseudo_path is None: - return data_ds - with open(pseudo_path, "r", encoding="utf-8") as fp: - pseudo_data = [json.loads(x.strip()) for x in fp] - data_ds = MapDataset([x for x in data_ds] + pseudo_data) - return data_ds - - -def extend_with_data_augment(data_ds, aug_type, num_aug=10, percent=0.1, aug_base="mlm", example_keys=None): - """ - Extend train dataset with augmentation. - """ - if example_keys is None: - return data_ds - if aug_type is None or aug_type == "None": - return data_ds - if aug_type == "delete": - aug = WordDelete(create_n=num_aug, aug_percent=percent) - elif aug_type == "substitute": - aug = WordSubstitute(aug_base, create_n=num_aug, aug_percent=percent) - elif aug_type == "insert": - aug = WordInsert(aug_base, create_n=num_aug, aug_percent=percent) - elif aug_type == "swap": - aug = WordSwap(create_n=num_aug, aug_percent=percent) - else: - raise ValueError("Unsupported data augment strategy `{}`".format(aug_type)) - - aug_data = [] - for example in data_ds: - for key in example_keys: - text_aug = aug.augment(example[key]) - for text in text_aug: - new_example = example.copy() - example[key] = text - aug_data.append(new_example) - - data_ds = MapDataset([x for x in data_ds] + aug_data) - return data_ds - - -def convert_chid(data_ds): - """ - Insert idioms into positions of `#idiom#` so that the task is converted - to binary classification. - """ - split_data_ds = [] - for example in data_ds: - fragments = example["content"].split("#idiom#") - label = example.get("answer", None) - for index, cand in enumerate(example["candidates"]): - new_example = {"content_pre": fragments[0], "content_post": fragments[1], "idiom": cand} - if label is not None: - new_example["label"] = str(int(index == label)) - split_data_ds.append(new_example) - return MapDataset(split_data_ds) - - -def convert_csl(data_ds): - """ - Concatanate keywords and it can be replaced by keyword `options` in develop versioin. - """ - concat_data_ds = [] - for example in data_ds: - example["keyword"] = ",".join(example["keyword"]) - concat_data_ds.append(example) - return MapDataset(concat_data_ds) - - -def convert_cluewsc(data_ds): - """ - Mark the pronoun and entity with special tokens. - """ - marked_data_ds = [] - for example in data_ds: - target, text = example["target"], list(example["text"]) - pronoun, p_index = target["span2_text"], target["span2_index"] - entity, e_index = target["span1_text"], target["span1_index"] - label = example.get("label", None) - if p_index > e_index: - text.insert(p_index, "_") - text.insert(p_index + len(pronoun) + 1, "_") - text.insert(e_index, "[") - text.insert(e_index + len(entity) + 1, "]") - else: - text.insert(e_index, "[") - text.insert(e_index + len(entity) + 1, "]") - text.insert(p_index, "_") - text.insert(p_index + len(pronoun) + 1, "_") - new_example = {"text": "".join(text), "pronoun": pronoun, "entity": entity} - if label is not None: - new_example["label"] = label - marked_data_ds.append(new_example) - return MapDataset(marked_data_ds) - - -def convert_labels_to_ids(example, orig_key, labels_to_ids): - """ - Convert the keyword in datasets to `labels`. - """ - if orig_key in example: - example["label_ids"] = labels_to_ids[example.pop(orig_key)] - return example - - -def convert_ids_to_words(example, token_ids): - """ - Convert label id to the first word in mapping from labels to words, - the length of which should coincide with that of `mask` in prompt. 
- """ - if "label_ids" in example: - labels = paddle.index_select(token_ids, paddle.to_tensor(example.pop("label_ids")), axis=0).squeeze(0) - example["labels"] = labels - return example - - -def load_fewclue_dataset(args, verbalizer, example_keys=None): - """ - Load fewclue datasets and convert them to the standard format of PET. - """ - split_id = args.split_id - splits = [f"train_{split_id}", f"dev_{split_id}", "test_public", "test"] - if args.task_name == "cluewsc": - train_ds, dev_ds, public_test_ds, test_ds = load_dataset("fewclue", name=args.task_name, splits=splits) - unlabeled_ds = None - else: - splits.append("unlabeled") - train_ds, dev_ds, public_test_ds, test_ds, unlabeled_ds = load_dataset( - "fewclue", name=args.task_name, splits=splits - ) - data_ds = [train_ds, dev_ds, public_test_ds, test_ds, unlabeled_ds] - - # Preprocess data for mask prediction task. - if args.task_name == "chid": - for index, sub_data_ds in enumerate(data_ds): - data_ds[index] = convert_chid(sub_data_ds) - elif args.task_name == "cluewsc": - for index, sub_data_ds in enumerate(data_ds[:-1]): - data_ds[index] = convert_cluewsc(sub_data_ds) - elif args.task_name == "csl": - for index, sub_data_ds in enumerate(data_ds): - data_ds[index] = convert_csl(sub_data_ds) - orig_key = "label" - if args.task_name == "tnews": - orig_key = "label_desc" - elif args.task_name == "iflytek": - orig_key = "label_des" - convert_label = partial(convert_labels_to_ids, orig_key=orig_key, labels_to_ids=verbalizer.labels_to_ids) - for index, sub_data_ds in enumerate(data_ds): - if sub_data_ds is not None: - data_ds[index] = sub_data_ds.map(convert_label) - - # Extend train dataset with data augmentation and pseudo-label data. - data_ds[0] = extend_with_data_augment( - data_ds[0], args.augment_type, args.num_augment, args.word_augment_percent, args.augment_method, example_keys - ) - data_ds[0] = extend_with_pseudo_data(data_ds[0], args.pseudo_data_path, verbalizer.labels_to_ids) - - dev_labels = [x["label_ids"] for x in data_ds[1]] - test_labels = [x["label_ids"] for x in data_ds[2]] - - convert_fn = partial(convert_ids_to_words, token_ids=verbalizer.token_ids[:, 0, :]) - data_ds[:3] = [x.map(convert_fn) for x in data_ds[:3]] - - return data_ds, (dev_labels, test_labels) diff --git a/examples/few_shot/pet/prompt/bustm.json b/examples/few_shot/pet/prompt/bustm.json deleted file mode 100644 index ab377ea85708..000000000000 --- a/examples/few_shot/pet/prompt/bustm.json +++ /dev/null @@ -1,14 +0,0 @@ -{ - "template": [ - {"text": "下边两句话说的是一个事情吗?{'mask'}“{'text': 'sentence1'}”和“{'text': 'sentence2'}”"}, - {"text": "下边两个句子说的是{'mask'}{'mask'}的事情。“{'text': 'sentence1'}”和“{'text': 'sentence2'}”"}, - {"text": "“{'text': 'sentence1'}”和“{'text': 'sentence2'}”意思{'mask'}{'mask'}。"}, - {"text": "“{'text':'sentence1'}”和“{'text':'sentence2'}”描述的是{'mask'}{'mask'}的事情。"} - ], - "verbalizer": [ - {"0": "不", "1": "是"}, - {"0": "不同", "1": "相同"}, - {"0": "不同", "1": "一样"}, - {"0": "不同", "1": "相同"} - ] -} diff --git a/examples/few_shot/pet/prompt/chid.json b/examples/few_shot/pet/prompt/chid.json deleted file mode 100644 index 24dac2d41100..000000000000 --- a/examples/few_shot/pet/prompt/chid.json +++ /dev/null @@ -1,14 +0,0 @@ -{ - "template": [ - {"text": "{'text':'content_pre'}({'text': 'idiom'}){'text': 'content_post'}{'mask'}"}, - {"text": "{'text':'content_pre'}({'text': 'idiom'}){'text': 'content_post'}成语{'text':'idiom'}用在这个句子中{'mask'}合适。"}, - {"text": "选一个合适的词语填在括号里,你会选“{'text': 'idiom'}”吗?{'mask'}。“{'text':'content_pre'}(){'text': 
'content_post'}”"}, - {"text": "下边句中成语[{'text':'idiom'}]的理解正确吗?{'mask'}{'mask'}。“{'text':'content_pre'}({'text': 'idiom'}){'text': 'content_post'}”"} - ], - "verbalizer": [ - {"0": "否", "1": "是"}, - {"0": "不", "1": "很"}, - {"0": "不", "1": "会"}, - {"0": "错误", "1": "正确"} - ] -} \ No newline at end of file diff --git a/examples/few_shot/pet/prompt/cluewsc.json b/examples/few_shot/pet/prompt/cluewsc.json deleted file mode 100644 index 76badab27eb8..000000000000 --- a/examples/few_shot/pet/prompt/cluewsc.json +++ /dev/null @@ -1,12 +0,0 @@ -{ - "template": [ - {"text": "{'text': 'text'}{'text': 'pronoun'}指的{'mask'}是{'text': 'entity'}"}, - {"text": "{'text': 'text'}{'text': 'pronoun'}指的是{'text': 'entity'}。这里{'text': 'pronoun'}理解得对吗?{'mask'}"}, - {"text": "{'text': 'text'}{'text': 'pronoun'}{'mask'}{'mask'}地代表了{'text': 'entity'}"} - ], - "verbalizer": [ - {"false": "不", "true": "就"}, - {"false": "错", "true": "对"}, - {"false": "错误", "true": "正确"} - ] -} diff --git a/examples/few_shot/pet/prompt/csl.json b/examples/few_shot/pet/prompt/csl.json deleted file mode 100644 index c604c90d0ce5..000000000000 --- a/examples/few_shot/pet/prompt/csl.json +++ /dev/null @@ -1,14 +0,0 @@ -{ - "template": [ - {"text": "给定以下几个词语:{'text': 'keyword'}{'mask'}{'mask'}扩写成“{'text': 'abst'}”"}, - {"text": "{'text':'abst'}这段话中关键词包括{'text':'keyword', 'truncate': False}对吗?{'mask'}。"}, - {"text": "{'text':'keyword'}这几个词和下边这段话内容{'mask'}关。“{'text':'abst'}”"}, - {"text": "“{'text':'abst'}”本文的内容{'mask'}{'mask'}“{'text':'keyword'}”"} - ], - "verbalizer": [ - {"0": "不能", "1": "可以"}, - {"0": "错", "1": "对"}, - {"0": "无", "1": "有"}, - {"0": "不含", "1": "包括"} - ] -} diff --git a/examples/few_shot/pet/prompt/csldcp.json b/examples/few_shot/pet/prompt/csldcp.json deleted file mode 100644 index e0fcf846b7ed..000000000000 --- a/examples/few_shot/pet/prompt/csldcp.json +++ /dev/null @@ -1,82 +0,0 @@ -{ - "template": [ - {"text": "阅读下边一段{'mask'}{'mask'}学的资料:“{'text': 'content'}”"}, - {"text": "阅读下边这段{'mask'}{'mask'}方面的材料:“{'text': 'content'}”"}, - {"text": "阅读这段{'mask'}{'mask'}学的文献:“{'text': 'content'}”"}, - {"text": "阅读这段{'mask'}{'mask'}学的材料:“{'text': 'content'}”"} - ], - "verbalizer": [ - { - "材料科学与工程": "材料", - "作物学": "作物", - "口腔医学": "口腔", - "药学": "药学", - "教育学": "教育", - "水利工程": "水利", - "理论经济学": "理经", - "食品科学与工程": "食品", - "畜牧学/兽医学": "畜牧", - "体育学": "体育", - "核科学与技术": "核科", - "力学": "力学", - "园艺学": "园艺", - "水产": "水产", - "法学": "法学", - "地质学/地质资源与地质工程": "地质", - "石油与天然气工程": "石油", - "农林经济管理": "农林", - "信息与通信工程": "通信", - "图书馆、情报与档案管理": "图书", - "政治学": "政治", - "电气工程": "电气", - "海洋科学": "海洋", - "民族学": "民族", - "航空宇航科学与技术": "航空", - "化学/化学工程与技术": "化学", - "哲学": "哲学", - "公共卫生与预防医学": "卫生", - "艺术学": "艺术", - "农业工程": "农工", - "船舶与海洋工程": "船舶", - "计算机科学与技术": "计科", - "冶金工程": "冶金", - "交通运输工程": "交通", - "动力工程及工程热物理": "动力", - "纺织科学与工程": "纺织", - "建筑学": "建筑", - "环境科学与工程": "环境", - "公共管理": "公管", - "数学": "数学", - "物理学": "物理", - "林学/林业工程": "林学", - "心理学": "心理", - "历史学": "历史", - "工商管理": "工管", - "应用经济学": "应经", - "中医学/中药学": "中医", - "天文学": "天文", - "机械工程": "机械", - "土木工程": "土木", - "光学工程": "光学", - "地理学": "地理", - "农业资源利用": "农业", - "生物学/生物科学与工程": "生物", - "兵器科学与技术": "兵器", - "矿业工程": "矿业", - "大气科学": "大气", - "基础医学/临床医学": "基础", - "电子科学与技术": "电子", - "测绘科学与技术": "测绘", - "控制科学与工程": "控制", - "军事学": "军事", - "中国语言文学": "中文", - "新闻传播学": "新闻", - "社会学": "社会", - "地球物理学":"地球", - "植物保护":"植保" - }, - {}, - {}, - {} - ] -} diff --git a/examples/few_shot/pet/prompt/eprstmt.json b/examples/few_shot/pet/prompt/eprstmt.json deleted file mode 100644 index 84e408def087..000000000000 --- 
a/examples/few_shot/pet/prompt/eprstmt.json +++ /dev/null @@ -1,18 +0,0 @@ -{ - "template": [ - {"text": "{'text':'sentence'}我{'mask'}喜欢。"}, - {"text": "我{'mask'}喜欢。{'text':'sentence'}"}, - {"text": "{'mask'}{'mask'}推荐这件商品!{'text':'sentence'}"}, - {"text": "我对这个东西{'mask'}满意。{'text':'sentence'}"}, - {"text": "{'mask'}理想。{'text':'sentence'}"}, - {"text": "{'text':'sentence'}这句话表示我{'mask'}满意。"} - ], - "verbalizer": [ - {"Negative": "不", "Positive": "很"}, - {"Negative": "不", "Positive": "很"}, - {"Negative": "很不", "Positive": "非常"}, - {"Negative": "不", "Positive": "很"}, - {"Negative": "不", "Positive": "很"}, - {"Negative": "不", "Positive": "很"} - ] -} diff --git a/examples/few_shot/pet/prompt/iflytek.json b/examples/few_shot/pet/prompt/iflytek.json deleted file mode 100644 index 9bce98d3f57a..000000000000 --- a/examples/few_shot/pet/prompt/iflytek.json +++ /dev/null @@ -1,253 +0,0 @@ -{ - "template": [ - {"text": "下边介绍的是和{'mask': None, 'length': 4}相关的产品:{'text': 'sentence'}"}, - {"text": "搜索更多{'mask'}{'mask'}相关的应用程序。{'text': 'sentence'}"}, - {"text": "这段话跟什么有关?{'mask'}{'mask'}“{'text': 'sentence'}”"} - ], - "verbalizer": [ - { - "银行": "银行办理", - "社区服务": "社区服务", - "电商": "电商网购", - "支付": "支付交易", - "经营养成": "经营养成", - "卡牌": "卡牌游戏", - "借贷": "借贷借款", - "驾校": "驾校学车", - "理财": "投资理财", - "职考": "职业考试", - "新闻": "新闻资讯", - "旅游资讯": "旅游资讯", - "公共交通": "公共交通", - "魔幻": "魔幻游戏", - "医疗服务": "医疗服务", - "影像剪辑": "影像剪辑", - "动作类": "动作游戏", - "工具": "使用工具", - "体育竞技": "体育竞技", - "小说": "小说阅读", - "运动健身": "运动健身", - "相机": "相机拍照", - "辅助工具": "辅助工具", - "快递物流": "快递物流", - "高等教育": "高等教育", - "股票": "股票炒股", - "菜谱": "做菜菜谱", - "行车辅助": "行车帮助", - "仙侠": "仙侠小说", - "亲子儿童": "亲子儿童", - "购物咨询": "购物资讯", - "射击游戏": "射击游戏", - "漫画": "动漫漫画", - "中小学": "中学小学", - "同城服务": "同城跑腿", - "成人教育": "成人教育", - "求职": "面试求职", - "电子产品": "电子产品", - "艺术": "艺术学习", - "薅羊毛": "比价省钱", - "约会社交": "约会社交", - "经营": "经营管理", - "兼职": "兼职赚钱", - "短视频": "拍短视频", - "音乐": "音乐乐库", - "英语": "英语学习", - "棋牌中心": "棋牌中心", - "摄影修图": "摄影修图", - "养生保健": "养生保健", - "办公": "办公工具", - "政务": "政务服务", - "视频": "视频拍摄", - "论坛圈子": "论坛圈子", - "彩票": "彩票乐透", - "直播": "直播娱乐", - "其他": "其他类别", - "休闲益智": "休闲益智", - "策略": "策略游戏", - "即时通讯": "即时通讯", - "汽车交易": "汽车交易", - "违章": "违章罚款", - "地图导航": "地图导航", - "民航": "民用航空", - "电台": "电台播报", - "语言(非英语)": "小语种类", - "搞笑": "搞笑娱乐", - "婚恋社交": "婚恋社交", - "社区超市": "社区超市", - "日常养车": "日常养车", - "杂志": "杂志期刊", - "视频教育": "线上教育", - "家政": "家政服务", - "影视娱乐": "影视娱乐", - "装修家居": "装修家居", - "体育咨讯": "体育资讯", - "社交工具": "社交工具", - "餐饮店": "餐饮美食", - "美颜": "美颜相机", - "问诊挂号": "问诊挂号", - "飞行空战": "飞行空战", - "综合预定": "综合预定", - "电影票务": "电影票务", - "笔记": "笔记记录", - "买房": "买房购房", - "外卖": "外卖配送", - "母婴": "母婴产品", - "打车": "打车出行", - "情侣社交": "情侣社交", - "日程管理": "日程管理", - "租车": "租车出行", - "微博博客": "微博博客", - "百科": "知识百科", - "绘画": "绘画学习", - "铁路": "铁路交通", - "生活社交": "生活社交", - "租房": "租房房源", - "酒店": "酒店住宿", - "保险": "保险理赔", - "问答交流": "问答交流", - "收款": "收款交易", - "MOBA": "多人竞技", - "K歌": "唱歌K歌", - "技术": "技术学习", - "减肥瘦身": "减肥瘦身", - "工作社交": "工作社交", - "团购": "团购拼单", - "记账": "记录记账", - "女性": "女性生活", - "公务员": "公务员类", - "二手": "二手交易", - "美妆美业": "美妆美业", - "汽车咨询": "汽车资讯", - "行程管理": "行程管理", - "免费WIFI": "WIFI", - "教辅": "教育辅助", - "成人": "成人两性", - "婚庆": "婚庆结婚", - "民宿短租": "民宿短租", - "出国": "出国相关" - }, - { - "银行": "银行", - "社区服务": "社区", - "电商": "网购", - "支付": "付钱", - "经营养成": "养成", - "卡牌": "纸牌", - "借贷": "借钱", - "驾校": "学车", - "理财": "投资", - "职考": "考试", - "新闻": "新闻", - "旅游资讯": "旅游", - "公共交通": "交通", - "魔幻": "魔幻", - "医疗服务": "医疗", - "影像剪辑": "剪辑", - "动作类": "动作", - "工具": "工具", - "体育竞技": "体育", - "小说": "小说", - "运动健身": "运动", - "相机": "相机", - "辅助工具": "辅助", - "快递物流": "快递", - "高等教育": "教育", - "股票": "股票", 
- "菜谱": "菜谱", - "行车辅助": "帮助", - "仙侠": "仙侠", - "亲子儿童": "小孩", - "购物咨询": "购物", - "射击游戏": "射击", - "漫画": "漫画", - "中小学": "小学", - "同城服务": "跑腿", - "成人教育": "成人", - "求职": "面试", - "电子产品": "电子", - "艺术": "艺术", - "薅羊毛": "赚钱", - "约会社交": "约会", - "经营": "经营", - "兼职": "兼职", - "短视频": "短片", - "音乐": "音乐", - "英语": "英语", - "棋牌中心": "棋牌", - "摄影修图": "拍照", - "养生保健": "养生", - "办公": "办公", - "政务": "政务", - "视频": "视频", - "论坛圈子": "论坛", - "彩票": "彩票", - "直播": "直播", - "其他": "其他", - "休闲益智": "休闲", - "策略": "策略", - "即时通讯": "通讯", - "汽车交易": "买车", - "违章": "违章", - "地图导航": "地图", - "民航": "航空", - "电台": "电台", - "语言(非英语)": "语言", - "搞笑": "搞笑", - "婚恋社交": "婚恋", - "社区超市": "超市", - "日常养车": "养车", - "杂志": "杂志", - "视频教育": "线上", - "家政": "家政", - "影视娱乐": "影视", - "装修家居": "装修", - "体育咨讯": "资讯", - "社交工具": "交流", - "餐饮店": "美食", - "美颜": "美颜", - "问诊挂号": "挂号", - "飞行空战": "飞行", - "综合预定": "预定", - "电影票务": "票务", - "笔记": "笔记", - "买房": "买房", - "外卖": "外卖", - "母婴": "母婴", - "打车": "打车", - "情侣社交": "情侣", - "日程管理": "日程", - "租车": "租车", - "微博博客": "博客", - "百科": "百科", - "绘画": "绘画", - "铁路": "铁路", - "生活社交": "生活", - "租房": "租房", - "酒店": "酒店", - "保险": "保险", - "问答交流": "问答", - "收款": "收款", - "MOBA": "多人", - "K歌": "唱歌", - "技术": "技术", - "减肥瘦身": "减肥", - "工作社交": "工作", - "团购": "团购", - "记账": "记录", - "女性": "女性", - "公务员": "公务", - "二手": "二手", - "美妆美业": "美妆", - "汽车咨询": "汽车", - "行程管理": "行程", - "免费WIFI": "上网", - "教辅": "教辅", - "成人": "两性", - "婚庆": "婚庆", - "民宿短租": "民宿", - "出国": "出国" - }, - {} - ] -} - diff --git a/examples/few_shot/pet/prompt/ocnli.json b/examples/few_shot/pet/prompt/ocnli.json deleted file mode 100644 index 0519816f080b..000000000000 --- a/examples/few_shot/pet/prompt/ocnli.json +++ /dev/null @@ -1,12 +0,0 @@ -{ - "template": [ - {"text": "“{'text': 'sentence1'}”和“{'text': 'sentence2'}”之间的逻辑关系是{'mask'}{'mask'}"}, - {"text": "“{'text': 'sentence1'}”和“{'text': 'sentence2'}”说的是{'mask'}{'mask'}的东西。"}, - {"text": "下边两句话之间有什么逻辑关系?{'mask'}{'mask'}“{'text': 'sentence1'}”{'sep'}“{'text': 'sentence2'}”"} - ], - "verbalizer": [ - {"contradiction": "矛盾", "entailment": "蕴含", "neutral": "中立"}, - {"contradiction": "矛盾", "entailment": "蕴含", "neutral": "中立"}, - {"contradiction": "不同", "entailment": "类似", "neutral": "无关"} - ] -} \ No newline at end of file diff --git a/examples/few_shot/pet/prompt/tnews.json b/examples/few_shot/pet/prompt/tnews.json deleted file mode 100644 index 5a2c7449b8e1..000000000000 --- a/examples/few_shot/pet/prompt/tnews.json +++ /dev/null @@ -1,30 +0,0 @@ -{ - "template": [ - {"text": "阅读下边一则{'mask'}{'mask'}新闻:{'text':'sentence'}"}, - {"text": "阅读这篇标题为「{'text':'sentence'}」的文章,它讲的是{'mask'}{'mask'}。"}, - {"text": "下边这则新闻属于{'mask'}{'mask'}话题{'text':'sentence'}"}, - {"text": "下边这则新闻属于什么话题呢?{'mask'}{'mask'}{'text':'sentence'}"} - ], - "verbalizer": [ - { - "news_story": "八卦", - "news_entertainment": "明星", - "news_finance": "财经", - "news_sports": "体育", - "news_edu": "校园", - "news_game": "游戏", - "news_culture": "文化", - "news_tech": "科技", - "news_car": "汽车", - "news_travel": "旅行", - "news_world": "国际", - "news_agriculture": "农业", - "news_military": "军事", - "news_house": "房子", - "news_stock": "股票" - }, - {}, - {}, - {} - ] -} diff --git a/examples/few_shot/pet/run_train.py b/examples/few_shot/pet/run_train.py deleted file mode 100644 index 3bab91cfe712..000000000000 --- a/examples/few_shot/pet/run_train.py +++ /dev/null @@ -1,170 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import os
-import time
-from dataclasses import dataclass, field
-from functools import partial
-
-import paddle
-from data import load_fewclue_dataset
-from paddle.metric import Accuracy
-from paddle.static import InputSpec
-from utils import load_prompt_arguments, save_fewclue_prediction, save_pseudo_data
-
-from paddlenlp.prompt import (
-    ManualTemplate,
-    MaskedLMVerbalizer,
-    PromptModelForSequenceClassification,
-    PromptTrainer,
-    PromptTuningArguments,
-)
-from paddlenlp.trainer import PdArgumentParser
-from paddlenlp.transformers import AutoModelForMaskedLM, AutoTokenizer
-from paddlenlp.utils.log import logger
-
-
-# yapf: disable
-@dataclass
-class DataArguments:
-    task_name: str = field(default="eprstmt", metadata={"help": "The task name in FewCLUE."})
-    split_id: str = field(default="0", metadata={"help": "The split id of datasets, including 0, 1, 2, 3, 4, few_all."})
-    prompt_path: str = field(default="prompt/eprstmt.json", metadata={"help": "Path to the defined prompts."})
-    prompt_index: int = field(default=0, metadata={"help": "The index of defined prompt for training."})
-    augment_type: str = field(default=None, metadata={"help": "The strategy used for data augmentation, including `swap`, `delete`, `insert`, `substitute`."})
-    num_augment: int = field(default=5, metadata={"help": "Number of augmented examples per original example, which works when `augment_type` is set."})
-    word_augment_percent: float = field(default=0.1, metadata={"help": "Percentage of augmented words in sequences, used for `swap`, `delete`, `insert`, `substitute`."})
-    augment_method: str = field(default="mlm", metadata={"help": "Strategy used for `insert` and `substitute`."})
-    pseudo_data_path: str = field(default=None, metadata={"help": "Path to data with pseudo labels."})
-    do_label: bool = field(default=False, metadata={"help": "Whether to label unsupervised data in unlabeled datasets."})
-    do_test: bool = field(default=False, metadata={"help": "Whether to evaluate model on public test datasets."})
-
-
-@dataclass
-class ModelArguments:
-    model_name_or_path: str = field(default="ernie-1.0-large-zh-cw", metadata={"help": "Built-in pretrained model name or the path to local model."})
-    export_type: str = field(default='paddle', metadata={"help": "The type to export. Support `paddle` and `onnx`."})
-    dropout: float = field(default=0.1, metadata={"help": "The dropout used for pretrained model."})
-# yapf: enable
-
-
-def main():
-    # Parse the arguments.
-    parser = PdArgumentParser((ModelArguments, DataArguments, PromptTuningArguments))
-    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
-    data_args = load_prompt_arguments(data_args)
-    training_args.print_config(model_args, "Model")
-    training_args.print_config(data_args, "Data")
-    paddle.set_device(training_args.device)
-
-    # Load the pretrained language model.
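-    # (AutoModelForMaskedLM is used because PET scores label words at the
-    # masked positions; the dropout values are overridden via ModelArguments.)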
-    tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path)
-    model = AutoModelForMaskedLM.from_pretrained(
-        model_args.model_name_or_path,
-        hidden_dropout_prob=model_args.dropout,
-        attention_probs_dropout_prob=model_args.dropout,
-    )
-
-    # Define template for preprocess and verbalizer for postprocess.
-    template = ManualTemplate(data_args.prompt, tokenizer, training_args.max_seq_length)
-    logger.info("Using template: {}".format(template.prompt))
-
-    verbalizer = MaskedLMVerbalizer(data_args.label_words, tokenizer)
-    labels_to_ids = verbalizer.labels_to_ids
-    ids_to_labels = {idx: label for label, idx in labels_to_ids.items()}
-    logger.info("Using verbalizer: {}".format(data_args.label_words))
-
-    # Load datasets.
-    data_ds, label_list = load_fewclue_dataset(data_args, verbalizer=verbalizer, example_keys=template.example_keys)
-    train_ds, dev_ds, public_test_ds, test_ds, unlabeled_ds = data_ds
-    dev_labels, test_labels = label_list
-
-    # Define the criterion.
-    criterion = paddle.nn.CrossEntropyLoss()
-
-    # Initialize the prompt model with the above variables.
-    prompt_model = PromptModelForSequenceClassification(
-        model, template, verbalizer, freeze_plm=training_args.freeze_plm, freeze_dropout=training_args.freeze_dropout
-    )
-
-    # Define the metric function.
-    def compute_metrics(eval_preds, labels, verbalizer):
-        metric = Accuracy()
-        predictions = paddle.to_tensor(eval_preds.predictions)
-        predictions = verbalizer.aggregate_multiple_mask(predictions)
-        correct = metric.compute(predictions, paddle.to_tensor(labels))
-        metric.update(correct)
-        acc = metric.accumulate()
-        return {"accuracy": acc}
-
-    # Initialize the trainer.
-    dev_compute_metrics = partial(compute_metrics, labels=dev_labels, verbalizer=verbalizer)
-    trainer = PromptTrainer(
-        model=prompt_model,
-        tokenizer=tokenizer,
-        args=training_args,
-        criterion=criterion,
-        train_dataset=train_ds,
-        eval_dataset=dev_ds,
-        callbacks=None,
-        compute_metrics=dev_compute_metrics,
-    )
-
-    # Training.
-    if training_args.do_train:
-        train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
-        metrics = train_result.metrics
-        trainer.save_model()
-        trainer.log_metrics("train", metrics)
-        trainer.save_metrics("train", metrics)
-        trainer.save_state()
-
-    time_stamp = time.strftime("%m%d-%H-%M-%S", time.localtime())
-
-    # Test.
-    if data_args.do_test and public_test_ds is not None:
-        test_compute_metrics = partial(compute_metrics, labels=test_labels, verbalizer=verbalizer)
-        trainer.compute_metrics = test_compute_metrics
-        test_ret = trainer.predict(public_test_ds)
-        trainer.log_metrics("test", test_ret.metrics)
-
-    # Predict.
-    if training_args.do_predict and test_ds is not None:
-        pred_ret = trainer.predict(test_ds)
-        logger.info("Prediction done.")
-        predict_path = os.path.join(training_args.output_dir, "fewclue_submit_examples_" + time_stamp)
-        save_fewclue_prediction(predict_path, data_args.task_name, pred_ret, verbalizer, ids_to_labels)
-
-    # Label unsupervised data.
-    if data_args.do_label and unlabeled_ds is not None:
-        label_ret = trainer.predict(unlabeled_ds)
-        logger.info("Labeling done.")
-        pseudo_path = os.path.join(training_args.output_dir, "pseudo_data_" + time_stamp + ".txt")
-        save_pseudo_data(pseudo_path, data_args.task_name, label_ret, verbalizer, ids_to_labels)
-
-    # Export static model.
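-    # (The InputSpec list below pins the input signature for static-graph
-    # export; each spec corresponds to one feature that the template produces.)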
- if training_args.do_export: - input_spec = [ - InputSpec(shape=[None, None], dtype="int64"), # input_ids, - InputSpec(shape=[None, None], dtype="int64"), # token_type_ids - InputSpec(shape=[None, None], dtype="int64"), # position_ids - InputSpec(shape=[None, None, None, None], dtype="float32"), # attention_mask - InputSpec(shape=[None], dtype="int64"), # masked_positions - ] - export_path = os.path.join(training_args.output_dir, "export") - trainer.export_model(export_path, input_spec=input_spec, export_type=model_args.export_type) - - -if __name__ == "__main__": - main() diff --git a/examples/few_shot/pet/utils.py b/examples/few_shot/pet/utils.py deleted file mode 100644 index 989b4e6b81a8..000000000000 --- a/examples/few_shot/pet/utils.py +++ /dev/null @@ -1,249 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import json -import os -import pathlib - -import numpy as np -import paddle - -from paddlenlp.datasets import load_dataset - -LABEL_TO_STANDARD = { - "tnews": { - "news_story": "100", - "news_culture": "101", - "news_entertainment": "102", - "news_sports": "103", - "news_finance": "104", - "news_house": "106", - "news_car": "107", - "news_edu": "108", - "news_tech": "109", - "news_military": "110", - "news_travel": "112", - "news_world": "113", - "news_stock": "114", - "news_agriculture": "115", - "news_game": "116", - }, - "iflytek": { - "打车": 0, - "美颜": 100, - "影像剪辑": 101, - "摄影修图": 102, - "相机": 103, - "绘画": 104, - "二手": 105, - "电商": 106, - "团购": 107, - "外卖": 108, - "电影票务": 109, - "社区服务": 10, - "社区超市": 110, - "购物咨询": 111, - "笔记": 112, - "办公": 113, - "日程管理": 114, - "女性": 115, - "经营": 116, - "收款": 117, - "其他": 118, - "薅羊毛": 11, - "魔幻": 12, - "仙侠": 13, - "卡牌": 14, - "飞行空战": 15, - "射击游戏": 16, - "休闲益智": 17, - "动作类": 18, - "体育竞技": 19, - "地图导航": 1, - "棋牌中心": 20, - "经营养成": 21, - "策略": 22, - "MOBA": 23, - "辅助工具": 24, - "约会社交": 25, - "即时通讯": 26, - "工作社交": 27, - "论坛圈子": 28, - "婚恋社交": 29, - "免费WIFI": 2, - "情侣社交": 30, - "社交工具": 31, - "生活社交": 32, - "微博博客": 33, - "新闻": 34, - "漫画": 35, - "小说": 36, - "技术": 37, - "教辅": 38, - "问答交流": 39, - "租车": 3, - "搞笑": 40, - "杂志": 41, - "百科": 42, - "影视娱乐": 43, - "求职": 44, - "兼职": 45, - "视频": 46, - "短视频": 47, - "音乐": 48, - "直播": 49, - "同城服务": 4, - "电台": 50, - "K歌": 51, - "成人": 52, - "中小学": 53, - "职考": 54, - "公务员": 55, - "英语": 56, - "视频教育": 57, - "高等教育": 58, - "成人教育": 59, - "快递物流": 5, - "艺术": 60, - "语言(非英语)": 61, - "旅游资讯": 62, - "综合预定": 63, - "民航": 64, - "铁路": 65, - "酒店": 66, - "行程管理": 67, - "民宿短租": 68, - "出国": 69, - "婚庆": 6, - "工具": 70, - "亲子儿童": 71, - "母婴": 72, - "驾校": 73, - "违章": 74, - "汽车咨询": 75, - "汽车交易": 76, - "日常养车": 77, - "行车辅助": 78, - "租房": 79, - "家政": 7, - "买房": 80, - "装修家居": 81, - "电子产品": 82, - "问诊挂号": 83, - "养生保健": 84, - "医疗服务": 85, - "减肥瘦身": 86, - "美妆美业": 87, - "菜谱": 88, - "餐饮店": 89, - "公共交通": 8, - "体育咨讯": 90, - "运动健身": 91, - "支付": 92, - "保险": 93, - "股票": 94, - "借贷": 95, - "理财": 96, - "彩票": 97, - "记账": 98, - "银行": 99, - "政务": 9, - }, -} - - -def load_prompt_arguments(args): - 
""" - Load prompt and label words according to prompt index. - """ - with open(args.prompt_path, "r", encoding="utf-8") as fp: - configs = json.load(fp) - assert len(configs["verbalizer"]) == len(configs["template"]) - assert configs["verbalizer"][0] is not None - verbalizer = [configs["verbalizer"][0]] - last_verb_index = 0 - for index, verb in enumerate(configs["verbalizer"][1:]): - if verb is None or len(verb) == 0: - verbalizer.append(configs["verbalizer"][last_verb_index]) - else: - verbalizer.append(verb) - last_verb_index = index + 1 - configs["verbalizer"] = verbalizer - args.prompt = configs["template"][args.prompt_index]["text"] - label_words = configs["verbalizer"][args.prompt_index] - if isinstance(label_words, list): - label_words = {k: k for k in label_words} - args.label_words = label_words - return args - - -def save_pseudo_data(save_path, task_name, label_preds, verbalizer, labels): - """ - Combine unsupervised data and corresponding predicted labels and - save one example per line. - """ - if task_name == "cluewsc": - return None - - data_ds = load_dataset("fewclue", name=task_name, splits="unlabeled") - preds = paddle.to_tensor(label_preds.predictions) - preds = verbalizer.aggregate_multiple_mask(preds) - preds = paddle.nn.functional.softmax(preds, axis=1).numpy() - label_preds = np.argmax(preds, axis=1) - label_probs = np.max(preds, axis=1) - pseudo_data = [] - for index, example in enumerate(data_ds): - example["labels"] = labels[label_preds[index]] - example["prob"] = str(label_probs[index]) - pseudo_data.append(example) - save_data(pseudo_data, save_path) - - -def save_fewclue_prediction(save_path, task_name, label_preds, verbalizer, labels): - """ - Extract predicted labels and save as the format required by FewCLUE. - """ - preds = paddle.to_tensor(label_preds.predictions) - preds = verbalizer.aggregate_multiple_mask(preds) - if task_name == "chid": - batch_size = preds.shape[0] - preds = paddle.nn.functional.softmax(preds, axis=1)[:, 1] - preds = preds.reshape([batch_size // 7, 7]) - preds = paddle.nn.functional.softmax(preds, axis=1).numpy() - preds = np.argmax(preds, axis=1) - test_ds = load_dataset("fewclue", name=task_name, splits="test") - - ret_list = [] - maps = LABEL_TO_STANDARD.get(task_name, None) - for idx, example in enumerate(test_ds): - uid = example.get("id", idx) - if task_name in ["bustm", "csl"]: - ret_list.append({"id": uid, "label": str(preds[idx])}) - elif task_name == "chid": - ret_list.append({"id": uid, "answer": preds[idx]}) - elif task_name in ["cluewsc", "eprstmt", "ocnli", "csldcp"]: - ret_list.append({"id": uid, "label": labels[preds[idx]]}) - elif task_name in ["iflytek", "tnews"]: - ret_list.append({"id": uid, "label": str(maps[labels[preds[idx]]])}) - save_file = task_name if task_name in ["bustm", "csldcp", "eprstmt"] else task_name + "f" - save_data(ret_list, save_path, save_file + "_predict.json") - - -def save_data(data, save_path, save_file=None): - if save_file is not None: - pathlib.Path(save_path).mkdir(parents=True, exist_ok=True) - save_path = os.path.join(save_path, save_file) - with open(save_path, "w") as fp: - for example in data: - fp.write(json.dumps(example, ensure_ascii=False) + "\n") diff --git a/examples/information_extraction/DuIE/predict.sh b/examples/information_extraction/DuIE/predict.sh deleted file mode 100644 index dd4a1da7f4cd..000000000000 --- a/examples/information_extraction/DuIE/predict.sh +++ /dev/null @@ -1,14 +0,0 @@ -set -eux - -export CUDA_VISIBLE_DEVICES=0 -export BATCH_SIZE=64 -export 
CKPT=./checkpoints/model_90000.pdparams -export DATASET_FILE=./data/test1.json - -python run_duie.py \ - --do_predict \ - --init_checkpoint $CKPT \ - --predict_data_file $DATASET_FILE \ - --max_seq_length 128 \ - --batch_size $BATCH_SIZE - diff --git a/examples/information_extraction/waybill_ie/README.md b/examples/information_extraction/waybill_ie/README.md deleted file mode 100644 index c842c91c7242..000000000000 --- a/examples/information_extraction/waybill_ie/README.md +++ /dev/null @@ -1,102 +0,0 @@ -# 快递单信息抽取 (Waybill Information Extraction) - -## 简介 - -本示例将通过BiGRU-CRF和ERNIE + FC两类模型,演示如何从用户提供的快递单中,抽取姓名、电话、省、市、区、详细地址等内容,形成结构化信息。辅助物流行业从业者进行有效信息的提取,从而降低客户填单的成本。 - -## 快速开始 - -### 数据准备 - -执行以下命令,下载并解压示例数据集: - -```bash -python download.py --data_dir ./waybill_ie -``` - -数据示例如下: - -``` -1^B6^B6^B2^B0^B2^B0^B0^B0^B7^B7^B宣^B荣^B嗣^B甘^B肃^B省^B白^B银^B市^B会^B宁^B县^B河^B畔^B镇^B十^B字^B街^B金^B海^B超^B市^B西^B行^B5^B0^B米 T-B^BT-I^BT-I^BT-I^BT-I^BT-I^BT-I^BT-I^BT-I^BT-I^BT-I^BP-B^BP-I^BP-I^BA1-B^BA1-I^BA1-I^BA2-B^BA2-I^BA2-I^BA3-B^BA3-I^BA3-I^BA4-B^BA4-I^BA4-I^BA4-I^BA4-I^BA4-I^BA4-I^BA4-I^BA4-I^BA4-I^BA4-I^BA4-I^BA4-I^BA4-I^BA4-I -1^B3^B5^B5^B2^B6^B6^B4^B3^B0^B7^B姜^B骏^B炜^B云^B南^B省^B德^B宏^B傣^B族^B景^B颇^B族^B自^B治^B州^B盈^B江^B县^B平^B原^B镇^B蜜^B回^B路^B下^B段 T-B^BT-I^BT-I^BT-I^BT-I^BT-I^BT-I^BT-I^BT-I^BT-I^BT-I^BP-B^BP-I^BP-I^BA1-B^BA1-I^BA1-I^BA2-B^BA2-I^BA2-I^BA2-I^BA2-I^BA2-I^BA2-I^BA2-I^BA2-I^BA2-I^BA3-B^BA3-I^BA3-I^BA4-B^BA4-I^BA4-I^BA4-I^BA4-I^BA4-I^BA4-I^BA4-I -``` -数据集中以特殊字符"\t"分隔文本、标签,以特殊字符"\002"(示例中显示为"^B")分隔每个字。标签的定义如下: - -| 标签 | 定义 | 标签 | 定义 | -| -------- | -------- |-------- | -------- | -| P-B | 姓名起始位置 | P-I | 姓名中间位置或结束位置 | -| T-B | 电话起始位置 | T-I | 电话中间位置或结束位置 | -| A1-B | 省份起始位置 | A1-I | 省份中间位置或结束位置 | -| A2-B | 城市起始位置 | A2-I | 城市中间位置或结束位置 | -| A3-B | 县区起始位置 | A3-I | 县区中间位置或结束位置 | -| A4-B | 详细地址起始位置 | A4-I | 详细地址中间位置或结束位置 | -| O | 无关字符 | | | - -数据标注采用**BIO模式**。其中 B(begin) 表示一个标签类别的开头,比如 P-B 指的是姓名的开头;相应的,I(inside) 表示一个标签的延续。O表示Outside无关字符。更多标注模式介绍请参考[Inside–outside–beginning (tagging)](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)) - -### 启动训练 - -本项目提供了两种模型结构,一种是BiGRU+CRF结构,另一种是ERNIE+FC结构,前者显存占用小,推理速度快;后者能够在更快收敛并取得更高的精度,但推理速度较慢。 - -#### 启动BiGRU + CRF训练 - -```bash -export CUDA_VISIBLE_DEVICES=0 -python run_bigru_crf.py -``` - -#### 启动ERNIE + FC训练 - -```bash -export CUDA_VISIBLE_DEVICES=0 -python run_ernie.py -``` -##### 模型导出 -使用动态图训练结束之后,还可以将动态图参数导出成静态图参数,具体代码见export_model.py。静态图参数保存在output_path指定路径中。 运行方式: - -基于 `ERNIE` 的模型结构的导出方式 - -```bash -python export_ernie_model.py --params_path ernie_ckpt/model_80/model_state.pdparams --output_path=./output -``` - -基于 `ERNIE + CRF` 的模型结构的导出方式 - -```bash -python export_ernie_crf_model.py --params_path ernie_ckpt/model_80/model_state.pdparams --output_path=./output -``` - -基于 `BIGRU + CRF` 的模型结构的导出方式 - -```bash -python export_bigru_crf_model.py --params_path bigru_crf_ckpt/model_80/model_state.pdparams --output_path=./output -``` - -其中`params_path`是指动态图训练保存的参数路径,`output_path`是指静态图参数导出路径。 - -#### 模型部署 -导出模型之后,可以用于部署,deploy/python文件提供了python部署预测示例。运行方式: - -基于 `ERNIE` 的模型 - -```bash -python deploy/python/predict_ernie.py --model_dir ./output -``` - -基于 `ERNIE + CRF` 的模型 - -```bash -python deploy/python/predict_ernie_crf.py --model_dir ./output -``` - -基于 `BIGRU + CRF` 的模型 - -```bash -python deploy/python/predict_bigru_crf.py --model_dir ./output -``` - -## 更多详细教程请参考: - -[基于Bi-GRU+CRF的快递单信息抽取](https://aistudio.baidu.com/aistudio/projectdetail/1317771) - -[使用预训练模型ERNIE优化快递单信息抽取](https://aistudio.baidu.com/aistudio/projectdetail/1329361) diff --git 
a/examples/information_extraction/waybill_ie/data.py b/examples/information_extraction/waybill_ie/data.py deleted file mode 100644 index d276b4551074..000000000000 --- a/examples/information_extraction/waybill_ie/data.py +++ /dev/null @@ -1,79 +0,0 @@ -# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -from paddlenlp.datasets import MapDataset - - -def load_dict(dict_path): - vocab = {} - i = 0 - with open(dict_path, "r", encoding="utf-8") as fin: - for line in fin: - key = line.strip("\n") - vocab[key] = i - i += 1 - return vocab - - -def load_dataset(datafiles): - def read(data_path): - with open(data_path, "r", encoding="utf-8") as fp: - next(fp) # Skip header - for line in fp.readlines(): - words, labels = line.strip("\n").split("\t") - words = words.split("\002") - labels = labels.split("\002") - yield words, labels - - if isinstance(datafiles, str): - return MapDataset(list(read(datafiles))) - elif isinstance(datafiles, list) or isinstance(datafiles, tuple): - return [MapDataset(list(read(datafile))) for datafile in datafiles] - - -def parse_decodes(sentences, predictions, lengths, label_vocab): - """Parse the padding result - - Args: - sentences (list): the tagging sentences. - predictions (list): the prediction tags. - lengths (list): the valid length of each sentence. - label_vocab (dict): the label vocab. - - Returns: - outputs (list): the formatted output. - """ - predictions = [x for batch in predictions for x in batch] - lengths = [x for batch in lengths for x in batch] - id_label = dict(zip(label_vocab.values(), label_vocab.keys())) - - outputs = [] - for idx, end in enumerate(lengths): - sent = sentences[idx][:end] - tags = [id_label[x] for x in predictions[idx][:end]] - sent_out = [] - tags_out = [] - words = "" - for s, t in zip(sent, tags): - if t.endswith("-B") or t == "O": - if len(words): - sent_out.append(words) - tags_out.append(t.split("-")[0]) - words = s - else: - words += s - if len(sent_out) < len(tags_out): - sent_out.append(words) - outputs.append("".join([str((s, t)) for s, t in zip(sent_out, tags_out)])) - return outputs diff --git a/examples/information_extraction/waybill_ie/deploy/python/predict_bigru_crf.py b/examples/information_extraction/waybill_ie/deploy/python/predict_bigru_crf.py deleted file mode 100644 index 2578b69c4c6c..000000000000 --- a/examples/information_extraction/waybill_ie/deploy/python/predict_bigru_crf.py +++ /dev/null @@ -1,290 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
-# See the License for the specific language governing permissions and
-# limitations under the License.
-import argparse
-import os
-from functools import partial
-
-import paddle
-from paddle import inference
-
-from paddlenlp.data import Pad, Stack, Tuple
-from paddlenlp.datasets import load_dataset
-from paddlenlp.utils.log import logger
-
-parser = argparse.ArgumentParser(__doc__)
-parser.add_argument("--model_dir", type=str, default="./output", help="The path to parameters in static graph.")
-parser.add_argument(
-    "--data_dir", type=str, default="./waybill_ie/data", help="The folder where the dataset is located."
-)
-parser.add_argument("--batch_size", type=int, default=200, help="The number of sequences contained in a mini-batch.")
-parser.add_argument(
-    "--device",
-    default="gpu",
-    type=str,
-    choices=["cpu", "gpu"],
-    help="The device to run the model on; it must be either cpu or gpu.",
-)
-parser.add_argument(
-    "--use_tensorrt", default=False, type=eval, choices=[True, False], help="Whether to use TensorRT to speed up inference."
-)
-parser.add_argument(
-    "--precision", default="fp32", type=str, choices=["fp32", "fp16", "int8"], help="The TensorRT precision."
-)
-parser.add_argument("--cpu_threads", default=10, type=int, help="The number of threads to use when predicting on CPU.")
-parser.add_argument(
-    "--enable_mkldnn",
-    default=False,
-    type=eval,
-    choices=[True, False],
-    help="Whether to use MKL-DNN to speed up CPU inference.",
-)
-parser.add_argument(
-    "--benchmark", type=eval, default=False, help="Whether to log information about the environment and the run."
-)
-parser.add_argument("--save_log_path", type=str, default="./log_output/", help="The file path where logs are saved.")
-args = parser.parse_args()
-
-
-def load_dict(dict_path):
-    vocab = {}
-    i = 0
-    with open(dict_path, "r", encoding="utf-8") as fin:
-        for line in fin:
-            key = line.strip("\n")
-            vocab[key] = i
-            i += 1
-    return vocab
-
-
-def load_vocab(dict_path):
-    """Load vocab from file"""
-    vocab = {}
-    reverse = None
-    with open(dict_path, "r", encoding="utf8") as fin:
-        for i, line in enumerate(fin):
-            terms = line.strip("\n").split("\t")
-            if len(terms) == 2:
-                if reverse is None:
-                    reverse = True if terms[0].isdigit() else False
-                if reverse:
-                    value, key = terms
-                else:
-                    key, value = terms
-            elif len(terms) == 1:
-                key, value = terms[0], i
-            else:
-                raise ValueError("Error line: %s in file: %s" % (line, dict_path))
-            vocab[key] = value
-    return vocab
-
-
-def parse_decodes(sentences, predictions, lengths, label_vocab):
-    """Parse the padded prediction results
-
-    Args:
-        sentences (list): the tagging sentences.
-        predictions (list): the prediction tags.
-        lengths (list): the valid length of each sentence.
-        label_vocab (dict): the label vocab.
-
-    Returns:
-        outputs (list): the formatted output.
-    """
-    predictions = [x for batch in predictions for x in batch]
-    lengths = [x for batch in lengths for x in batch]
-    id_label = dict(zip(label_vocab.values(), label_vocab.keys()))
-
-    outputs = []
-    for idx, end in enumerate(lengths):
-        sent = sentences[idx][:end]
-        tags = [id_label[x] for x in predictions[idx][:end]]
-        sent_out = []
-        tags_out = []
-        words = ""
-        for s, t in zip(sent, tags):
-            if t.endswith("-B") or t == "O":
-                if len(words):
-                    sent_out.append(words)
-                tags_out.append(t.split("-")[0])
-                words = s
-            else:
-                words += s
-        if len(sent_out) < len(tags_out):
-            sent_out.append(words)
-        outputs.append("".join([str((s, t)) for s, t in zip(sent_out, tags_out)]))
-    return outputs
-
-
-def convert_tokens_to_ids(tokens, vocab, oov_token=None):
-    token_ids = []
-    oov_id = vocab.get(oov_token) if oov_token else None
-    for token in tokens:
-        token_id = vocab.get(token, oov_id)
-        token_ids.append(token_id)
-    return token_ids
-
-
-def convert_to_features(example, word_vocab):
-    tokens = example[0]
-    token_ids = convert_tokens_to_ids(tokens, word_vocab, "OOV")
-    return token_ids, len(token_ids)
-
-
-def read(data_path):
-    with open(data_path, "r", encoding="utf-8") as fp:
-        next(fp)  # Skip header
-        for line in fp.readlines():
-            words, labels = line.strip("\n").split("\t")
-            words = words.split("\002")
-            labels = labels.split("\002")
-            yield words, labels
-
-
-class Predictor(object):
-    def __init__(
-        self,
-        model_dir,
-        device="gpu",
-        batch_size=200,
-        use_tensorrt=False,
-        precision="fp32",
-        enable_mkldnn=False,
-        benchmark=False,
-        save_log_path="",
-    ):
-        self.batch_size = batch_size
-        model_file = os.path.join(model_dir, "inference.pdmodel")
-        param_file = os.path.join(model_dir, "inference.pdiparams")
-        if not os.path.exists(model_file):
-            raise ValueError("not find model file path {}".format(model_file))
-        if not os.path.exists(param_file):
-            raise ValueError("not find params file path {}".format(param_file))
-        config = paddle.inference.Config(model_file, param_file)
-        if device == "gpu":
-            # set GPU configs accordingly
-            # such as initialize the gpu memory, enable tensorrt
-            config.enable_use_gpu(100, 0)
-            precision_map = {
-                "fp16": inference.PrecisionType.Half,
-                "fp32": inference.PrecisionType.Float32,
-                "int8": inference.PrecisionType.Int8,
-            }
-            precision_mode = precision_map[precision]
-
-            if use_tensorrt:
-                config.enable_tensorrt_engine(
-                    max_batch_size=batch_size, min_subgraph_size=30, precision_mode=precision_mode
-                )
-        elif device == "cpu":
-            # set CPU configs accordingly,
-            # such as enable_mkldnn, set_cpu_math_library_num_threads
-            config.disable_gpu()
-            if enable_mkldnn:
-                # cache 10 different shapes for mkldnn to avoid memory leak
-                config.set_mkldnn_cache_capacity(10)
-                config.enable_mkldnn()
-            config.set_cpu_math_library_num_threads(args.cpu_threads)
-        elif device == "xpu":
-            # set XPU configs accordingly
-            config.enable_xpu(100)
-
-        config.switch_use_feed_fetch_ops(False)
-        self.predictor = paddle.inference.create_predictor(config)
-        self.input_handles = [self.predictor.get_input_handle(name) for name in self.predictor.get_input_names()]
-        self.output_handle = self.predictor.get_output_handle(self.predictor.get_output_names()[0])
-
-        if args.benchmark:
-            import auto_log
-
-            pid = os.getpid()
-            self.autolog = auto_log.AutoLogger(
-                model_name="bigru_crf",
-                model_precision=precision,
-                batch_size=self.batch_size,
-                data_shape="dynamic",
-                save_path=save_log_path,
-                inference_config=config,
-                pids=pid,
-                process_name=None,
-                gpu_ids=0,
-                time_keys=["preprocess_time", "inference_time", "postprocess_time"],
-                warmup=0,
-                logger=logger,
-            )
-
-    def predict(self, dataset, batchify_fn, word_vocab, label_vocab):
-        if args.benchmark:
-            self.autolog.times.start()
-        all_preds = []
-        all_lens = []
-        num_of_examples = len(dataset)
-        trans_func = partial(convert_to_features, word_vocab=word_vocab)
-        start_idx = 0
-        while start_idx < num_of_examples:
-            end_idx = start_idx + self.batch_size
-            end_idx = end_idx if end_idx < num_of_examples else num_of_examples
-            batch_data = [trans_func(example) for example in dataset[start_idx:end_idx]]
-
-            if args.benchmark:
-                self.autolog.times.stamp()
-            input_ids, lens = batchify_fn(batch_data)
-            self.input_handles[0].copy_from_cpu(input_ids)
-            self.input_handles[1].copy_from_cpu(lens)
-            self.predictor.run()
-            preds = self.output_handle.copy_to_cpu()
-
-            if args.benchmark:
-                self.autolog.times.stamp()
-            all_preds.append(preds)
-            all_lens.append(lens)
-
-            start_idx += self.batch_size
-
-        if args.benchmark:
-            self.autolog.times.end(stamp=True)
-        sentences = [example[0] for example in dataset.data]
-        results = parse_decodes(sentences, all_preds, all_lens, label_vocab)
-        return results
-
-
-if __name__ == "__main__":
-    test_ds = load_dataset(read, data_path=os.path.join(args.data_dir, "test.txt"), lazy=False)
-    label_vocab = load_dict(os.path.join(args.data_dir, "tag.dic"))
-    word_vocab = load_dict(os.path.join(args.data_dir, "word.dic"))
-
-    trans_func = partial(convert_to_features, word_vocab=word_vocab)
-
-    batchify_fn = lambda samples, fn=Tuple(
-        Pad(axis=0, pad_val=word_vocab.get("OOV", 0), dtype="int64"),  # token_ids
-        Stack(dtype="int64"),  # seq_len
-    ): fn(samples)
-
-    predictor = Predictor(
-        args.model_dir,
-        args.device,
-        args.batch_size,
-        args.use_tensorrt,
-        args.precision,
-        args.enable_mkldnn,
-        args.benchmark,
-        args.save_log_path,
-    )
-
-    results = predictor.predict(test_ds, batchify_fn, word_vocab, label_vocab)
-    print("\n".join(results))
-    if args.benchmark:
-        predictor.autolog.report()
diff --git a/examples/information_extraction/waybill_ie/deploy/python/predict_ernie.py b/examples/information_extraction/waybill_ie/deploy/python/predict_ernie.py
deleted file mode 100644
index bdd3ccfeba9b..000000000000
--- a/examples/information_extraction/waybill_ie/deploy/python/predict_ernie.py
+++ /dev/null
@@ -1,283 +0,0 @@
-# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import argparse
-import os
-from functools import partial
-
-import numpy as np
-import paddle
-from paddle import inference
-
-from paddlenlp.data import Pad, Stack, Tuple
-from paddlenlp.datasets import load_dataset
-from paddlenlp.transformers import AutoTokenizer
-from paddlenlp.utils.log import logger
-
-parser = argparse.ArgumentParser(__doc__)
-parser.add_argument("--model_dir", type=str, default="./output", help="The path to parameters in static graph.")
-parser.add_argument(
-    "--data_dir", type=str, default="./waybill_ie/data", help="The folder where the dataset is located."
-)
-parser.add_argument("--batch_size", type=int, default=200, help="The number of sequences contained in a mini-batch.")
-parser.add_argument(
-    "--device",
-    default="gpu",
-    type=str,
-    choices=["cpu", "gpu"],
-    help="The device to run the model on; it must be either cpu or gpu.",
-)
-parser.add_argument(
-    "--use_tensorrt", default=False, type=eval, choices=[True, False], help="Whether to use TensorRT to speed up inference."
-)
-parser.add_argument(
-    "--precision", default="fp32", type=str, choices=["fp32", "fp16", "int8"], help="The TensorRT precision."
-)
-parser.add_argument("--cpu_threads", default=10, type=int, help="The number of threads to use when predicting on CPU.")
-parser.add_argument(
-    "--enable_mkldnn",
-    default=False,
-    type=eval,
-    choices=[True, False],
-    help="Whether to use MKL-DNN to speed up CPU inference.",
-)
-parser.add_argument(
-    "--benchmark", type=eval, default=False, help="Whether to log information about the environment and the run."
-)
-parser.add_argument("--save_log_path", type=str, default="./log_output/", help="The file path where logs are saved.")
-args = parser.parse_args()
-
-
-def load_dict(dict_path):
-    vocab = {}
-    i = 0
-    with open(dict_path, "r", encoding="utf-8") as fin:
-        for line in fin:
-            key = line.strip("\n")
-            vocab[key] = i
-            i += 1
-    return vocab
-
-
-def load_vocab(dict_path):
-    """Load vocab from file"""
-    vocab = {}
-    reverse = None
-    with open(dict_path, "r", encoding="utf8") as fin:
-        for i, line in enumerate(fin):
-            terms = line.strip("\n").split("\t")
-            if len(terms) == 2:
-                if reverse is None:
-                    reverse = True if terms[0].isdigit() else False
-                if reverse:
-                    value, key = terms
-                else:
-                    key, value = terms
-            elif len(terms) == 1:
-                key, value = terms[0], i
-            else:
-                raise ValueError("Error line: %s in file: %s" % (line, dict_path))
-            vocab[key] = value
-    return vocab
-
-
-def parse_decodes(sentences, predictions, lengths, label_vocab):
-    """Parse the padded prediction results
-
-    Args:
-        sentences (list): the tagging sentences.
-        predictions (list): the prediction tags.
-        lengths (list): the valid length of each sentence.
-        label_vocab (dict): the label vocab.
-
-    Returns:
-        outputs (list): the formatted output.
- """ - predictions = [x for batch in predictions for x in batch] - lengths = [x for batch in lengths for x in batch] - id_label = dict(zip(label_vocab.values(), label_vocab.keys())) - - outputs = [] - for idx, end in enumerate(lengths): - sent = sentences[idx][:end] - tags = [id_label[x] for x in predictions[idx][:end]] - sent_out = [] - tags_out = [] - words = "" - for s, t in zip(sent, tags): - if t.endswith("-B") or t == "O": - if len(words): - sent_out.append(words) - tags_out.append(t.split("-")[0]) - words = s - else: - words += s - if len(sent_out) < len(tags_out): - sent_out.append(words) - outputs.append("".join([str((s, t)) for s, t in zip(sent_out, tags_out)])) - return outputs - - -def convert_to_features(example, tokenizer): - tokens = example[0] - tokenized_input = tokenizer(tokens, return_length=True, is_split_into_words="token") - # Token '[CLS]' and '[SEP]' will get label 'O' - return tokenized_input["input_ids"], tokenized_input["token_type_ids"], tokenized_input["seq_len"] - - -def read(data_path): - with open(data_path, "r", encoding="utf-8") as fp: - next(fp) # Skip header - for line in fp.readlines(): - words, labels = line.strip("\n").split("\t") - words = words.split("\002") - labels = labels.split("\002") - yield words, labels - - -class Predictor(object): - def __init__( - self, - model_dir, - device="gpu", - batch_size=200, - use_tensorrt=False, - precision="fp32", - enable_mkldnn=False, - benchmark=False, - save_log_path="", - ): - self.batch_size = batch_size - model_file = os.path.join(model_dir, "inference.pdmodel") - param_file = os.path.join(model_dir, "inference.pdiparams") - if not os.path.exists(model_file): - raise ValueError("not find model file path {}".format(model_file)) - if not os.path.exists(param_file): - raise ValueError("not find params file path {}".format(param_file)) - config = paddle.inference.Config(model_file, param_file) - if device == "gpu": - # set GPU configs accordingly - # such as initialize the gpu memory, enable tensorrt - config.enable_use_gpu(100, 0) - precision_map = { - "fp16": inference.PrecisionType.Half, - "fp32": inference.PrecisionType.Float32, - "int8": inference.PrecisionType.Int8, - } - precision_mode = precision_map[precision] - - if use_tensorrt: - config.enable_tensorrt_engine( - max_batch_size=batch_size, min_subgraph_size=30, precision_mode=precision_mode - ) - elif device == "cpu": - # set CPU configs accordingly, - # such as enable_mkldnn, set_cpu_math_library_num_threads - config.disable_gpu() - if enable_mkldnn: - # cache 10 different shapes for mkldnn to avoid memory leak - config.set_mkldnn_cache_capacity(10) - config.enable_mkldnn() - config.set_cpu_math_library_num_threads(args.cpu_threads) - elif device == "xpu": - # set XPU configs accordingly - config.enable_xpu(100) - - config.switch_use_feed_fetch_ops(False) - self.predictor = paddle.inference.create_predictor(config) - self.input_handles = [self.predictor.get_input_handle(name) for name in self.predictor.get_input_names()] - self.output_handle = self.predictor.get_output_handle(self.predictor.get_output_names()[0]) - - if args.benchmark: - import auto_log - - pid = os.getpid() - self.autolog = auto_log.AutoLogger( - model_name="ernie-3.0-medium-zh", - model_precision=precision, - batch_size=self.batch_size, - data_shape="dynamic", - save_path=save_log_path, - inference_config=config, - pids=pid, - process_name=None, - gpu_ids=0, - time_keys=["preprocess_time", "inference_time", "postprocess_time"], - warmup=0, - logger=logger, - ) - - def 
predict(self, dataset, batchify_fn, tokenizer, label_vocab): - if args.benchmark: - self.autolog.times.start() - all_preds = [] - all_lens = [] - num_of_examples = len(dataset) - trans_func = partial(convert_to_features, tokenizer=tokenizer) - start_idx = 0 - while start_idx < num_of_examples: - end_idx = start_idx + self.batch_size - end_idx = end_idx if end_idx < num_of_examples else num_of_examples - batch_data = [trans_func(example) for example in dataset[start_idx:end_idx]] - - if args.benchmark: - self.autolog.times.stamp() - input_ids, segment_ids, lens = batchify_fn(batch_data) - self.input_handles[0].copy_from_cpu(input_ids) - self.input_handles[1].copy_from_cpu(segment_ids) - self.predictor.run() - logits = self.output_handle.copy_to_cpu() - - if args.benchmark: - self.autolog.times.stamp() - preds = np.argmax(logits, axis=-1) - # Drop CLS prediction - preds = preds[:, 1:] - all_preds.append(preds) - all_lens.append(lens) - - start_idx += self.batch_size - - if args.benchmark: - self.autolog.times.end(stamp=True) - sentences = [example[0] for example in dataset.data] - results = parse_decodes(sentences, all_preds, all_lens, label_vocab) - return results - - -if __name__ == "__main__": - tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") - test_ds = load_dataset(read, data_path=os.path.join(args.data_dir, "test.txt"), lazy=False) - label_vocab = load_dict(os.path.join(args.data_dir, "tag.dic")) - - batchify_fn = lambda samples, fn=Tuple( - Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input_ids - Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # token_type_ids - Stack(dtype="int64"), # seq_len - ): fn(samples) - - predictor = Predictor( - args.model_dir, - args.device, - args.batch_size, - args.use_tensorrt, - args.precision, - args.enable_mkldnn, - args.benchmark, - args.save_log_path, - ) - - results = predictor.predict(test_ds, batchify_fn, tokenizer, label_vocab) - print("\n".join(results)) - if args.benchmark: - predictor.autolog.report() diff --git a/examples/information_extraction/waybill_ie/deploy/python/predict_ernie_crf.py b/examples/information_extraction/waybill_ie/deploy/python/predict_ernie_crf.py deleted file mode 100644 index 1158a49aafe2..000000000000 --- a/examples/information_extraction/waybill_ie/deploy/python/predict_ernie_crf.py +++ /dev/null @@ -1,263 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
-import argparse
-import os
-from functools import partial
-
-import paddle
-from paddle import inference
-
-from paddlenlp.data import Pad, Stack, Tuple
-from paddlenlp.datasets import load_dataset
-from paddlenlp.transformers import AutoTokenizer
-from paddlenlp.utils.log import logger
-
-# yapf: disable
-parser = argparse.ArgumentParser(__doc__)
-parser.add_argument("--model_dir", type=str, default='./output', help="The path to parameters in static graph.")
-parser.add_argument("--data_dir", type=str, default="./waybill_ie/data", help="The folder where the dataset is located.")
-parser.add_argument("--batch_size", type=int, default=200, help="The number of sequences contained in a mini-batch.")
-parser.add_argument("--device", default="gpu", type=str, choices=["cpu", "gpu"], help="The device to run the model on; it must be either cpu or gpu.")
-parser.add_argument('--use_tensorrt', default=False, type=eval, choices=[True, False], help='Whether to use TensorRT to speed up inference.')
-parser.add_argument("--precision", default="fp32", type=str, choices=["fp32", "fp16", "int8"], help='The TensorRT precision.')
-parser.add_argument('--cpu_threads', default=10, type=int, help='The number of threads to use when predicting on CPU.')
-parser.add_argument('--enable_mkldnn', default=False, type=eval, choices=[True, False], help='Whether to use MKL-DNN to speed up CPU inference.')
-parser.add_argument("--benchmark", type=eval, default=False, help="Whether to log information about the environment and the run.")
-parser.add_argument("--save_log_path", type=str, default="./log_output/", help="The file path where logs are saved.")
-args = parser.parse_args()
-# yapf: enable
-
-
-def load_dict(dict_path):
-    vocab = {}
-    i = 0
-    with open(dict_path, "r", encoding="utf-8") as fin:
-        for line in fin:
-            key = line.strip("\n")
-            vocab[key] = i
-            i += 1
-    return vocab
-
-
-def load_vocab(dict_path):
-    """Load vocab from file"""
-    vocab = {}
-    reverse = None
-    with open(dict_path, "r", encoding="utf8") as fin:
-        for i, line in enumerate(fin):
-            terms = line.strip("\n").split("\t")
-            if len(terms) == 2:
-                if reverse is None:
-                    reverse = True if terms[0].isdigit() else False
-                if reverse:
-                    value, key = terms
-                else:
-                    key, value = terms
-            elif len(terms) == 1:
-                key, value = terms[0], i
-            else:
-                raise ValueError("Error line: %s in file: %s" % (line, dict_path))
-            vocab[key] = value
-    return vocab
-
-
-def parse_decodes(sentences, predictions, lengths, label_vocab):
-    """Parse the padded prediction results
-
-    Args:
-        sentences (list): the tagging sentences.
-        predictions (list): the prediction tags.
-        lengths (list): the valid length of each sentence.
-        label_vocab (dict): the label vocab.
-
-    Returns:
-        outputs (list): the formatted output.
- """ - predictions = [x for batch in predictions for x in batch] - lengths = [x for batch in lengths for x in batch] - id_label = dict(zip(label_vocab.values(), label_vocab.keys())) - - outputs = [] - for idx, end in enumerate(lengths): - sent = sentences[idx][:end] - tags = [id_label[x] for x in predictions[idx][:end]] - sent_out = [] - tags_out = [] - words = "" - for s, t in zip(sent, tags): - if t.endswith("-B") or t == "O": - if len(words): - sent_out.append(words) - tags_out.append(t.split("-")[0]) - words = s - else: - words += s - if len(sent_out) < len(tags_out): - sent_out.append(words) - outputs.append("".join([str((s, t)) for s, t in zip(sent_out, tags_out)])) - return outputs - - -def convert_to_features(example, tokenizer): - tokens = example[0] - tokenized_input = tokenizer(tokens, return_length=True, is_split_into_words="token") - # Token '[CLS]' and '[SEP]' will get label 'O' - return tokenized_input["input_ids"], tokenized_input["token_type_ids"], tokenized_input["seq_len"] - - -def read(data_path): - with open(data_path, "r", encoding="utf-8") as fp: - next(fp) # Skip header - for line in fp.readlines(): - words, labels = line.strip("\n").split("\t") - words = words.split("\002") - labels = labels.split("\002") - yield words, labels - - -class Predictor(object): - def __init__( - self, - model_dir, - device="gpu", - batch_size=200, - use_tensorrt=False, - precision="fp32", - enable_mkldnn=False, - benchmark=False, - save_log_path="", - ): - self.batch_size = batch_size - model_file = os.path.join(model_dir, "inference.pdmodel") - param_file = os.path.join(model_dir, "inference.pdiparams") - if not os.path.exists(model_file): - raise ValueError("not find model file path {}".format(model_file)) - if not os.path.exists(param_file): - raise ValueError("not find params file path {}".format(param_file)) - config = paddle.inference.Config(model_file, param_file) - if device == "gpu": - # set GPU configs accordingly - # such as initialize the gpu memory, enable tensorrt - config.enable_use_gpu(100, 0) - precision_map = { - "fp16": inference.PrecisionType.Half, - "fp32": inference.PrecisionType.Float32, - "int8": inference.PrecisionType.Int8, - } - precision_mode = precision_map[precision] - - if use_tensorrt: - config.enable_tensorrt_engine( - max_batch_size=batch_size, min_subgraph_size=30, precision_mode=precision_mode - ) - elif device == "cpu": - # set CPU configs accordingly, - # such as enable_mkldnn, set_cpu_math_library_num_threads - config.disable_gpu() - if enable_mkldnn: - # cache 10 different shapes for mkldnn to avoid memory leak - config.set_mkldnn_cache_capacity(10) - config.enable_mkldnn() - config.set_cpu_math_library_num_threads(args.cpu_threads) - elif device == "xpu": - # set XPU configs accordingly - config.enable_xpu(100) - - config.switch_use_feed_fetch_ops(False) - self.predictor = paddle.inference.create_predictor(config) - self.input_handles = [self.predictor.get_input_handle(name) for name in self.predictor.get_input_names()] - self.output_handle = self.predictor.get_output_handle(self.predictor.get_output_names()[0]) - - if args.benchmark: - import auto_log - - pid = os.getpid() - self.autolog = auto_log.AutoLogger( - model_name="ernie-3.0-medium-zh", - model_precision=precision, - batch_size=self.batch_size, - data_shape="dynamic", - save_path=save_log_path, - inference_config=config, - pids=pid, - process_name=None, - gpu_ids=0, - time_keys=["preprocess_time", "inference_time", "postprocess_time"], - warmup=0, - logger=logger, - ) - - def 
predict(self, dataset, batchify_fn, tokenizer, label_vocab): - if args.benchmark: - self.autolog.times.start() - all_preds = [] - all_lens = [] - num_of_examples = len(dataset) - trans_func = partial(convert_to_features, tokenizer=tokenizer) - start_idx = 0 - while start_idx < num_of_examples: - end_idx = start_idx + self.batch_size - end_idx = end_idx if end_idx < num_of_examples else num_of_examples - batch_data = [trans_func(example) for example in dataset[start_idx:end_idx]] - - if args.benchmark: - self.autolog.times.stamp() - input_ids, segment_ids, lens = batchify_fn(batch_data) - self.input_handles[0].copy_from_cpu(input_ids) - self.input_handles[1].copy_from_cpu(segment_ids) - self.input_handles[2].copy_from_cpu(lens) - self.predictor.run() - preds = self.output_handle.copy_to_cpu() - - if args.benchmark: - self.autolog.times.stamp() - preds = [pred[1:] for pred in preds] - all_preds.append(preds) - all_lens.append(lens) - - start_idx += self.batch_size - - if args.benchmark: - self.autolog.times.end(stamp=True) - sentences = [example[0] for example in dataset.data] - results = parse_decodes(sentences, all_preds, all_lens, label_vocab) - return results - - -if __name__ == "__main__": - tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") - test_ds = load_dataset(read, data_path=os.path.join(args.data_dir, "test.txt"), lazy=False) - label_vocab = load_dict(os.path.join(args.data_dir, "tag.dic")) - - batchify_fn = lambda samples, fn=Tuple( - Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input_ids - Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # token_type_ids - Stack(dtype="int64"), # seq_len - ): fn(samples) - - predictor = Predictor( - args.model_dir, - args.device, - args.batch_size, - args.use_tensorrt, - args.precision, - args.enable_mkldnn, - args.benchmark, - args.save_log_path, - ) - - results = predictor.predict(test_ds, batchify_fn, tokenizer, label_vocab) - print("\n".join(results)) - if args.benchmark: - predictor.autolog.report() diff --git a/examples/information_extraction/waybill_ie/download.py b/examples/information_extraction/waybill_ie/download.py deleted file mode 100644 index a76b56b99aee..000000000000 --- a/examples/information_extraction/waybill_ie/download.py +++ /dev/null @@ -1,32 +0,0 @@ -# -*- coding: utf-8 -*- -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserve. -# -# Licensed under the Apache License, Version 2.0 (the 'License'); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an 'AS IS' BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
-
-import argparse
-import sys
-
-from paddle.utils.download import get_path_from_url
-
-URL = "https://bj.bcebos.com/paddlenlp/paddlenlp/datasets/waybill.tar.gz"
-
-
-def main(arguments):
-    parser = argparse.ArgumentParser()
-    parser.add_argument("-d", "--data_dir", help="directory to save data to", type=str, default="./")
-    args = parser.parse_args(arguments)
-    get_path_from_url(URL, args.data_dir)
-
-
-if __name__ == "__main__":
-    sys.exit(main(sys.argv[1:]))
diff --git a/examples/information_extraction/waybill_ie/export_bigru_crf_model.py b/examples/information_extraction/waybill_ie/export_bigru_crf_model.py
deleted file mode 100644
index b439dc30836a..000000000000
--- a/examples/information_extraction/waybill_ie/export_bigru_crf_model.py
+++ /dev/null
@@ -1,60 +0,0 @@
-# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import argparse
-import os
-
-import paddle
-from data import load_dict
-from model import BiGRUWithCRF
-
-parser = argparse.ArgumentParser()
-parser.add_argument(
-    "--params_path",
-    type=str,
-    required=True,
-    default="./checkpoint/model_900/model_state.pdparams",
-    help="The path to model parameters to be loaded.",
-)
-parser.add_argument(
-    "--output_path", type=str, default="./output", help="The path where the static graph model parameters will be saved."
-)
-parser.add_argument(
-    "--data_dir", type=str, default="./waybill_ie/data", help="The folder where the dataset is located."
-)
-args = parser.parse_args()
-
-if __name__ == "__main__":
-    # The number of labels should be in accordance with the training dataset.
-    label_vocab = load_dict(os.path.join(args.data_dir, "tag.dic"))
-    word_vocab = load_dict(os.path.join(args.data_dir, "word.dic"))
-
-    # Define the model network and its loss
-    model = BiGRUWithCRF(300, 256, len(word_vocab), len(label_vocab))
-    if args.params_path and os.path.isfile(args.params_path):
-        state_dict = paddle.load(args.params_path)
-        model.set_dict(state_dict)
-        print("Loaded parameters from %s" % args.params_path)
-    model.eval()
-
-    model = paddle.jit.to_static(
-        model,
-        input_spec=[
-            paddle.static.InputSpec(shape=[None, None], dtype="int64"),  # input_ids
-            paddle.static.InputSpec(shape=[None], dtype="int64"),  # lengths
-        ],
-    )
-
-    save_path = os.path.join(args.output_path, "inference")
-    paddle.jit.save(model, save_path)
diff --git a/examples/information_extraction/waybill_ie/export_ernie_crf_model.py b/examples/information_extraction/waybill_ie/export_ernie_crf_model.py
deleted file mode 100644
index 9f1d6839e54e..000000000000
--- a/examples/information_extraction/waybill_ie/export_ernie_crf_model.py
+++ /dev/null
@@ -1,55 +0,0 @@
-# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import argparse
-import os
-
-import paddle
-from data import load_dict
-from model import ErnieCrfForTokenClassification
-
-from paddlenlp.transformers import AutoModelForTokenClassification
-
-# fmt: off
-parser = argparse.ArgumentParser()
-parser.add_argument("--params_path", type=str, required=True, default="./checkpoint/model_900/model_state.pdparams", help="The path to model parameters to be loaded.")
-parser.add_argument("--output_path", type=str, default="./output", help="The path where the static graph model parameters will be saved.")
-parser.add_argument("--data_dir", type=str, default="./waybill_ie/data", help="The folder where the dataset is located.")
-args = parser.parse_args()
-# fmt: on
-
-if __name__ == "__main__":
-    # The number of labels should be in accordance with the training dataset.
-    label_vocab = load_dict(os.path.join(args.data_dir, "tag.dic"))
-
-    # Define the model network and its loss
-    ernie = AutoModelForTokenClassification.from_pretrained("ernie-3.0-medium-zh", num_labels=len(label_vocab))
-    model = ErnieCrfForTokenClassification(ernie)
-    if args.params_path and os.path.isfile(args.params_path):
-        state_dict = paddle.load(args.params_path)
-        model.set_dict(state_dict)
-        print("Loaded parameters from %s" % args.params_path)
-    model.eval()
-
-    model = paddle.jit.to_static(
-        model,
-        input_spec=[
-            paddle.static.InputSpec(shape=[None, None], dtype="int64"),  # input_ids
-            paddle.static.InputSpec(shape=[None, None], dtype="int64"),  # segment_ids
-            paddle.static.InputSpec(shape=[None], dtype="int64"),  # lengths
-        ],
-    )
-
-    save_path = os.path.join(args.output_path, "inference")
-    paddle.jit.save(model, save_path)
diff --git a/examples/information_extraction/waybill_ie/export_ernie_model.py b/examples/information_extraction/waybill_ie/export_ernie_model.py
deleted file mode 100644
index 2436a98b4af8..000000000000
--- a/examples/information_extraction/waybill_ie/export_ernie_model.py
+++ /dev/null
@@ -1,52 +0,0 @@
-# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
- -import argparse -import os - -import paddle -from data import load_dict - -from paddlenlp.transformers import AutoModelForTokenClassification - -# fmt: off -parser = argparse.ArgumentParser() -parser.add_argument("--params_path", type=str, required=True, default="./checkpoint/model_900/model_state.pdparams", help="The path to model parameters to be loaded.") -parser.add_argument("--output_path", type=str, default="./output", help="The path of model parameter in static graph to be saved.") -parser.add_argument("--data_dir", type=str, default="./waybill_ie/data", help="The folder where the dataset is located.") -args = parser.parse_args() -# fmt: on - -if __name__ == "__main__": - # The number of labels should be in accordance with the training dataset. - label_vocab = load_dict(os.path.join(args.data_dir, "tag.dic")) - - model = AutoModelForTokenClassification.from_pretrained("ernie-3.0-medium-zh", num_labels=len(label_vocab)) - - if args.params_path and os.path.isfile(args.params_path): - state_dict = paddle.load(args.params_path) - model.set_dict(state_dict) - print("Loaded parameters from %s" % args.params_path) - model.eval() - - model = paddle.jit.to_static( - model, - input_spec=[ - paddle.static.InputSpec(shape=[None, None], dtype="int64"), # input_ids - paddle.static.InputSpec(shape=[None, None], dtype="int64"), # segment_ids - ], - ) - - save_path = os.path.join(args.output_path, "inference") - paddle.jit.save(model, save_path) diff --git a/examples/information_extraction/waybill_ie/model.py b/examples/information_extraction/waybill_ie/model.py deleted file mode 100644 index d6d1e8dfb36f..000000000000 --- a/examples/information_extraction/waybill_ie/model.py +++ /dev/null @@ -1,76 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
- -import paddle -import paddle.nn as nn - -from paddlenlp.embeddings import TokenEmbedding -from paddlenlp.layers.crf import LinearChainCrf, LinearChainCrfLoss -from paddlenlp.utils.tools import compare_version - -if compare_version(paddle.version.full_version, "2.2.0") >= 0: - # paddle.text.ViterbiDecoder is supported by paddle after version 2.2.0 - from paddle.text import ViterbiDecoder -else: - from paddlenlp.layers.crf import ViterbiDecoder - - -class BiGRUWithCRF(nn.Layer): - def __init__(self, emb_size, hidden_size, word_num, label_num, use_w2v_emb=False): - super(BiGRUWithCRF, self).__init__() - if use_w2v_emb: - self.word_emb = TokenEmbedding(extended_vocab_path="./data/word.dic", unknown_token="OOV") - else: - self.word_emb = nn.Embedding(word_num, emb_size) - self.gru = nn.GRU(emb_size, hidden_size, num_layers=2, direction="bidirect") - # We need `label_num + 2` for appending BOS and EOS tag - self.fc = nn.Linear(hidden_size * 2, label_num + 2) - self.crf = LinearChainCrf(label_num) - self.crf_loss = LinearChainCrfLoss(self.crf) - self.viterbi_decoder = ViterbiDecoder(self.crf.transitions) - - def forward(self, inputs, lengths, labels=None): - embs = self.word_emb(inputs) - output, _ = self.gru(embs) - emission = self.fc(output) - if labels is not None: - loss = self.crf_loss(emission, lengths, labels) - return loss - else: - _, prediction = self.viterbi_decoder(emission, lengths) - return prediction - - -class ErnieCrfForTokenClassification(nn.Layer): - def __init__(self, ernie, crf_lr=100): - super().__init__() - self.num_labels = ernie.num_labels - self.ernie = ernie # allow ernie to be config - self.crf = LinearChainCrf(self.num_labels, crf_lr=crf_lr, with_start_stop_tag=False) - self.crf_loss = LinearChainCrfLoss(self.crf) - self.viterbi_decoder = ViterbiDecoder(self.crf.transitions, False) - - def forward( - self, input_ids, token_type_ids=None, lengths=None, position_ids=None, attention_mask=None, labels=None - ): - logits = self.ernie( - input_ids, token_type_ids=token_type_ids, attention_mask=attention_mask, position_ids=position_ids - ) - - if labels is not None: - loss = self.crf_loss(logits, lengths, labels) - return loss - else: - _, prediction = self.viterbi_decoder(logits, lengths) - return prediction diff --git a/examples/information_extraction/waybill_ie/run_bigru_crf.py b/examples/information_extraction/waybill_ie/run_bigru_crf.py deleted file mode 100644 index f458d36de5b3..000000000000 --- a/examples/information_extraction/waybill_ie/run_bigru_crf.py +++ /dev/null @@ -1,149 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
-
-import argparse
-import os
-from functools import partial
-
-import paddle
-from data import load_dataset, load_dict, parse_decodes
-from model import BiGRUWithCRF
-
-from paddlenlp.data import Pad, Stack, Tuple
-from paddlenlp.metrics import ChunkEvaluator
-
-parser = argparse.ArgumentParser()
-
-# yapf: disable
-parser.add_argument("--save_dir", default='./bigru_crf_ckpt', type=str, help="The output directory where the model checkpoints will be written.")
-parser.add_argument("--epochs", default=10, type=int, help="Total number of training epochs to perform.")
-parser.add_argument("--batch_size", default=200, type=int, help="Batch size per GPU/CPU for training.")
-parser.add_argument("--device", default="gpu", type=str, choices=["cpu", "gpu"], help="The device to train the model on; it must be either cpu or gpu.")
-parser.add_argument("--data_dir", default='./waybill_ie/data', type=str, help="The folder where the dataset is located.")
-
-args = parser.parse_args()
-# yapf: enable
-
-
-def convert_tokens_to_ids(tokens, vocab, oov_token=None):
-    token_ids = []
-    oov_id = vocab.get(oov_token) if oov_token else None
-    for token in tokens:
-        token_id = vocab.get(token, oov_id)
-        token_ids.append(token_id)
-    return token_ids
-
-
-def convert_to_features(example, word_vocab, label_vocab):
-    tokens, labels = example
-    token_ids = convert_tokens_to_ids(tokens, word_vocab, "OOV")
-    label_ids = convert_tokens_to_ids(labels, label_vocab, "O")
-    return token_ids, len(token_ids), label_ids
-
-
-@paddle.no_grad()
-def evaluate(model, metric, data_loader):
-    model.eval()
-    metric.reset()
-    for token_ids, lengths, label_ids in data_loader:
-        preds = model(token_ids, lengths)
-        n_infer, n_label, n_correct = metric.compute(lengths, preds, label_ids)
-        metric.update(n_infer.numpy(), n_label.numpy(), n_correct.numpy())
-    precision, recall, f1_score = metric.accumulate()
-    print("[EVAL] Precision: %f - Recall: %f - F1: %f" % (precision, recall, f1_score))
-    model.train()
-
-
-@paddle.no_grad()
-def predict(model, data_loader, ds, label_vocab):
-    all_preds = []
-    all_lens = []
-    for token_ids, lengths, label_ids in data_loader:
-        preds = model(token_ids, lengths)
-        all_preds.append(preds.numpy())
-        all_lens.append(lengths)
-    sentences = [example[0] for example in ds.data]
-    results = parse_decodes(sentences, all_preds, all_lens, label_vocab)
-    return results
-
-
-if __name__ == "__main__":
-    paddle.set_device(args.device)
-
-    # Create datasets and data loaders.
-    train_ds, dev_ds, test_ds = load_dataset(
-        datafiles=(
-            os.path.join(args.data_dir, "train.txt"),
-            os.path.join(args.data_dir, "dev.txt"),
-            os.path.join(args.data_dir, "test.txt"),
-        )
-    )
-
-    label_vocab = load_dict(os.path.join(args.data_dir, "tag.dic"))
-    word_vocab = load_dict(os.path.join(args.data_dir, "word.dic"))
-
-    trans_func = partial(convert_to_features, word_vocab=word_vocab, label_vocab=label_vocab)
-    train_ds.map(trans_func)
-    dev_ds.map(trans_func)
-    test_ds.map(trans_func)
-
-    batchify_fn = lambda samples, fn=Tuple(
-        Pad(axis=0, pad_val=word_vocab.get("OOV", 0), dtype="int32"),  # token_ids
-        Stack(dtype="int64"),  # seq_len
-        Pad(axis=0, pad_val=label_vocab.get("O", 0), dtype="int64"),  # label_ids
-    ): fn(samples)
-
-    train_loader = paddle.io.DataLoader(
-        dataset=train_ds,
-        batch_size=args.batch_size,
-        shuffle=True,
-        drop_last=True,
-        return_list=True,
-        collate_fn=batchify_fn,
-    )
-
-    dev_loader = paddle.io.DataLoader(
-        dataset=dev_ds, batch_size=args.batch_size, drop_last=True, return_list=True, collate_fn=batchify_fn
-    )
-
-    test_loader = paddle.io.DataLoader(
-        dataset=test_ds, batch_size=args.batch_size, drop_last=True, return_list=True, collate_fn=batchify_fn
-    )
-
-    # Define the model network and its loss
-    model = BiGRUWithCRF(300, 256, len(word_vocab), len(label_vocab))
-
-    optimizer = paddle.optimizer.Adam(learning_rate=0.001, parameters=model.parameters())
-    metric = ChunkEvaluator(label_list=label_vocab.keys(), suffix=True)
-
-    step = 0
-    for epoch in range(args.epochs):
-        for token_ids, lengths, label_ids in train_loader:
-            loss = model(token_ids, lengths, label_ids)
-            loss = loss.mean()
-            loss.backward()
-            optimizer.step()
-            optimizer.clear_grad()
-            step += 1
-            print("[TRAIN] Epoch:%d - Step:%d - Loss: %f" % (epoch, step, loss))
-        evaluate(model, metric, dev_loader)
-        paddle.save(model.state_dict(), os.path.join(args.save_dir, "model_%d" % step, "model_state.pdparams"))
-
-    preds = predict(model, test_loader, test_ds, label_vocab)
-    file_path = "bigru_crf_results.txt"
-    with open(file_path, "w", encoding="utf8") as fout:
-        fout.write("\n".join(preds))
-    # Print some examples
-    print("The results have been saved into: %s, some examples are shown below: " % file_path)
-    print("\n".join(preds[:10]))
diff --git a/examples/information_extraction/waybill_ie/run_ernie.py b/examples/information_extraction/waybill_ie/run_ernie.py
deleted file mode 100644
index d21baad79a77..000000000000
--- a/examples/information_extraction/waybill_ie/run_ernie.py
+++ /dev/null
@@ -1,166 +0,0 @@
-# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import argparse
-import os
-from functools import partial
-
-import paddle
-from data import load_dataset, load_dict, parse_decodes
-
-from paddlenlp.data import Pad, Stack, Tuple
-from paddlenlp.metrics import ChunkEvaluator
-from paddlenlp.transformers import AutoModelForTokenClassification, AutoTokenizer
-
-# fmt: off
-parser = argparse.ArgumentParser()
-parser.add_argument("--save_dir", default="./ernie_ckpt", type=str, help="The output directory where the model checkpoints will be written.")
-parser.add_argument("--epochs", default=10, type=int, help="Total number of training epochs to perform.")
-parser.add_argument("--batch_size", default=200, type=int, help="Batch size per GPU/CPU for training.")
-parser.add_argument("--device", default="gpu", type=str, choices=["cpu", "gpu"], help="The device to train the model on; it must be either cpu or gpu.")
-parser.add_argument("--data_dir", default="./waybill_ie/data", type=str, help="The folder where the dataset is located.")
-args = parser.parse_args()
-# fmt: on
-
-
-def convert_to_features(example, tokenizer, label_vocab):
-    tokens, labels = example
-    tokenized_input = tokenizer(tokens, return_length=True, is_split_into_words="token")
-    # Token '[CLS]' and '[SEP]' will get label 'O'
-    labels = ["O"] + labels + ["O"]
-    tokenized_input["labels"] = [label_vocab[x] for x in labels]
-    return (
-        tokenized_input["input_ids"],
-        tokenized_input["token_type_ids"],
-        tokenized_input["seq_len"],
-        tokenized_input["labels"],
-    )
-
-
-@paddle.no_grad()
-def evaluate(model, metric, data_loader):
-    model.eval()
-    metric.reset()
-    for input_ids, seg_ids, lens, labels in data_loader:
-        logits = model(input_ids, seg_ids)
-        preds = paddle.argmax(logits, axis=-1)
-        n_infer, n_label, n_correct = metric.compute(lens, preds, labels)
-        metric.update(n_infer.numpy(), n_label.numpy(), n_correct.numpy())
-    precision, recall, f1_score = metric.accumulate()
-    print("[EVAL] Precision: %f - Recall: %f - F1: %f" % (precision, recall, f1_score))
-    model.train()
-
-
-@paddle.no_grad()
-def predict(model, data_loader, ds, label_vocab):
-    all_preds = []
-    all_lens = []
-    for input_ids, seg_ids, lens, labels in data_loader:
-        logits = model(input_ids, seg_ids)
-        preds = paddle.argmax(logits, axis=-1)
-        # Drop CLS prediction
-        preds = [pred[1:] for pred in preds.numpy()]
-        all_preds.append(preds)
-        all_lens.append(lens)
-    sentences = [example[0] for example in ds.data]
-    results = parse_decodes(sentences, all_preds, all_lens, label_vocab)
-    return results
-
-
-def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None):
-    if trans_fn:
-        dataset = dataset.map(trans_fn)
-
-    shuffle = True if mode == "train" else False
-    if mode == "train":
-        batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle)
-    else:
-        batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle)
-
-    return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True)
-
-
-if __name__ == "__main__":
-    paddle.set_device(args.device)
-    rank = paddle.distributed.get_rank()
-    trainer_num = paddle.distributed.get_world_size()
-    if trainer_num > 1:
-        paddle.distributed.init_parallel_env()
-    # Create dataset, tokenizer and dataloader.
-    train_ds, dev_ds, test_ds = load_dataset(
-        datafiles=(
-            os.path.join(args.data_dir, "train.txt"),
-            os.path.join(args.data_dir, "dev.txt"),
-            os.path.join(args.data_dir, "test.txt"),
-        )
-    )
-
-    label_vocab = load_dict(os.path.join(args.data_dir, "tag.dic"))
-    tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh")
-
-    trans_func = partial(convert_to_features, tokenizer=tokenizer, label_vocab=label_vocab)
-
-    train_ds.map(trans_func)
-    dev_ds.map(trans_func)
-    test_ds.map(trans_func)
-
-    ignore_label = -1
-
-    def batchify_fn(samples):
-        fn = Tuple(
-            Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int32"),  # input_ids
-            Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int32"),  # token_type_ids
-            Stack(dtype="int64"),  # seq_len
-            Pad(axis=0, pad_val=ignore_label, dtype="int64"),  # labels; ignore_label marks padding so the loss skips it
-        )
-        return fn(samples)
-
-    train_loader = create_dataloader(
-        dataset=train_ds, mode="train", batch_size=args.batch_size, batchify_fn=batchify_fn
-    )
-
-    dev_loader = create_dataloader(dataset=dev_ds, mode="dev", batch_size=args.batch_size, batchify_fn=batchify_fn)
-
-    test_loader = create_dataloader(dataset=test_ds, mode="test", batch_size=args.batch_size, batchify_fn=batchify_fn)
-
-    # Define the model network and its loss
-    model = AutoModelForTokenClassification.from_pretrained("ernie-3.0-medium-zh", num_labels=len(label_vocab))
-    if trainer_num > 1:
-        model = paddle.DataParallel(model)
-    metric = ChunkEvaluator(label_list=label_vocab.keys(), suffix=True)
-    loss_fn = paddle.nn.loss.CrossEntropyLoss(ignore_index=ignore_label)
-    optimizer = paddle.optimizer.AdamW(learning_rate=2e-5, parameters=model.parameters())
-
-    step = 0
-    for epoch in range(args.epochs):
-        for input_ids, token_type_ids, length, labels in train_loader:
-            logits = model(input_ids, token_type_ids)
-            loss = paddle.mean(loss_fn(logits, labels))
-            loss.backward()
-            optimizer.step()
-            optimizer.clear_grad()
-            step += 1
-            print("[TRAIN] Epoch:%d - Step:%d - Loss: %f" % (epoch, step, loss))
-        evaluate(model, metric, dev_loader)
-        model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model
-        model_to_save.save_pretrained(os.path.join(args.save_dir, "model_%d" % step))
-
-    if rank == 0:
-        preds = predict(model, test_loader, test_ds, label_vocab)
-        file_path = "ernie_results.txt"
-        with open(file_path, "w", encoding="utf8") as fout:
-            fout.write("\n".join(preds))
-        # Print some examples
-        print("The results have been saved in the file: %s, some examples are shown below: " % file_path)
-        print("\n".join(preds[:10]))
diff --git a/examples/information_extraction/waybill_ie/run_ernie_crf.py b/examples/information_extraction/waybill_ie/run_ernie_crf.py
deleted file mode 100644
index b9d03b77e643..000000000000
--- a/examples/information_extraction/waybill_ie/run_ernie_crf.py
+++ /dev/null
@@ -1,147 +0,0 @@
-# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import argparse
-import os
-from functools import partial
-
-import paddle
-from data import load_dataset, load_dict, parse_decodes
-from model import ErnieCrfForTokenClassification
-
-from paddlenlp.data import Pad, Stack, Tuple
-from paddlenlp.metrics import ChunkEvaluator
-from paddlenlp.transformers import AutoModelForTokenClassification, AutoTokenizer
-
-# fmt: off
-parser = argparse.ArgumentParser()
-parser.add_argument("--save_dir", default="./ernie_crf_ckpt", type=str, help="The output directory where the model checkpoints will be written.")
-parser.add_argument("--epochs", default=10, type=int, help="Total number of training epochs to perform.")
-parser.add_argument("--batch_size", default=200, type=int, help="Batch size per GPU/CPU for training.")
-parser.add_argument("--device", default="gpu", type=str, choices=["cpu", "gpu"], help="The device to train the model on; it must be either cpu or gpu.")
-parser.add_argument("--data_dir", default="./waybill_ie/data", type=str, help="The folder where the dataset is located.")
-args = parser.parse_args()
-# fmt: on
-
-
-def convert_to_features(example, tokenizer, label_vocab):
-    tokens, labels = example
-    tokenized_input = tokenizer(tokens, return_length=True, is_split_into_words="token")
-    # Token '[CLS]' and '[SEP]' will get label 'O'
-    labels = ["O"] + labels + ["O"]
-    tokenized_input["labels"] = [label_vocab[x] for x in labels]
-    return (
-        tokenized_input["input_ids"],
-        tokenized_input["token_type_ids"],
-        tokenized_input["seq_len"],
-        tokenized_input["labels"],
-    )
-
-
-@paddle.no_grad()
-def evaluate(model, metric, data_loader):
-    model.eval()
-    metric.reset()
-    for input_ids, seg_ids, lens, labels in data_loader:
-        preds = model(input_ids, seg_ids, lengths=lens)
-        n_infer, n_label, n_correct = metric.compute(lens, preds, labels)
-        metric.update(n_infer.numpy(), n_label.numpy(), n_correct.numpy())
-    precision, recall, f1_score = metric.accumulate()
-    print("[EVAL] Precision: %f - Recall: %f - F1: %f" % (precision, recall, f1_score))
-    model.train()
-
-
-@paddle.no_grad()
-def predict(model, data_loader, ds, label_vocab):
-    all_preds = []
-    all_lens = []
-    for input_ids, seg_ids, lens, labels in data_loader:
-        preds = model(input_ids, seg_ids, lengths=lens)
-        # Drop CLS prediction
-        preds = [pred[1:] for pred in preds.numpy()]
-        all_preds.append(preds)
-        all_lens.append(lens)
-    sentences = [example[0] for example in ds.data]
-    results = parse_decodes(sentences, all_preds, all_lens, label_vocab)
-    return results
-
-
-if __name__ == "__main__":
-    paddle.set_device(args.device)
-
-    # Create dataset, tokenizer and dataloader.
-    train_ds, dev_ds, test_ds = load_dataset(
-        datafiles=(
-            os.path.join(args.data_dir, "train.txt"),
-            os.path.join(args.data_dir, "dev.txt"),
-            os.path.join(args.data_dir, "test.txt"),
-        )
-    )
-
-    label_vocab = load_dict(os.path.join(args.data_dir, "tag.dic"))
-    tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh")
-
-    trans_func = partial(convert_to_features, tokenizer=tokenizer, label_vocab=label_vocab)
-
-    train_ds.map(trans_func)
-    dev_ds.map(trans_func)
-    test_ds.map(trans_func)
-
-    def batchify_fn(samples):
-        fn = Tuple(
-            Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int32"),  # input_ids
-            Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int32"),  # token_type_ids
-            Stack(dtype="int64"),  # seq_len
-            Pad(axis=0, pad_val=label_vocab.get("O", 0), dtype="int64"),  # labels
-        )
-        return fn(samples)
-
-    train_loader = paddle.io.DataLoader(
-        dataset=train_ds, batch_size=args.batch_size, return_list=True, collate_fn=batchify_fn
-    )
-    dev_loader = paddle.io.DataLoader(
-        dataset=dev_ds, batch_size=args.batch_size, return_list=True, collate_fn=batchify_fn
-    )
-    test_loader = paddle.io.DataLoader(
-        dataset=test_ds, batch_size=args.batch_size, return_list=True, collate_fn=batchify_fn
-    )
-
-    # Define the model network and its loss
-    ernie = AutoModelForTokenClassification.from_pretrained("ernie-3.0-medium-zh", num_labels=len(label_vocab))
-    model = ErnieCrfForTokenClassification(ernie)
-
-    metric = ChunkEvaluator(label_list=label_vocab.keys(), suffix=True)
-    optimizer = paddle.optimizer.AdamW(learning_rate=2e-5, parameters=model.parameters())
-
-    step = 0
-    for epoch in range(args.epochs):
-        for input_ids, token_type_ids, lengths, labels in train_loader:
-            loss = model(input_ids, token_type_ids, lengths=lengths, labels=labels)
-            avg_loss = paddle.mean(loss)
-            avg_loss.backward()
-            optimizer.step()
-            optimizer.clear_grad()
-            step += 1
-            print("[TRAIN] Epoch:%d - Step:%d - Loss: %f" % (epoch, step, avg_loss))
-        evaluate(model, metric, dev_loader)
-
-        paddle.save(model.state_dict(), os.path.join(args.save_dir, "model_%d" % step, "model_state.pdparams"))
-
-    preds = predict(model, test_loader, test_ds, label_vocab)
-    file_path = "ernie_crf_results.txt"
-    with open(file_path, "w", encoding="utf8") as fout:
-        fout.write("\n".join(preds))
-    # Print some examples
-    print("The results have been saved in the file: %s, some examples are shown below: " % file_path)
-    print("\n".join(preds[:10]))
diff --git a/examples/machine_translation/seq2seq/README.md b/examples/machine_translation/seq2seq/README.md
deleted file mode 100644
index 2f271dfb6b4f..000000000000
--- a/examples/machine_translation/seq2seq/README.md
+++ /dev/null
@@ -1,104 +0,0 @@
-# Machine Translation using Seq2Seq with Attention
-
-The following is a brief directory layout of this example with a description of each file:
-
-```
-.
-├── deploy                  # inference deployment directory
-│   └── python
-│       └── infer.py        # runs inference with the exported model
-├── README.md               # this document
-├── args.py                 # argument configuration for training, prediction and model export
-├── data.py                 # data loading
-├── train.py                # main training script
-├── predict.py              # main prediction script
-├── export_model.py         # exports the inference model
-└── seq2seq_attn.py         # the attention-based translation model
-```
-
-## Introduction
-
-Sequence to Sequence (Seq2Seq) uses an encoder-decoder architecture: the encoder encodes the source sequence into a vector, and the decoder then decodes that vector into the target sequence. Seq2Seq is widely used in machine translation, dialogue systems, automatic document summarization, image captioning, and similar tasks.
-
-This directory contains a classic Seq2Seq example: machine translation with an attention-based model. A Seq2Seq translation model mimics how a human performs translation: first parse the source sentence and understand its meaning, then write a target-language sentence that expresses that meaning. For more on the principles and the mathematics behind machine translation, we recommend the PaddlePaddle tutorial [machine translation case study](https://www.paddlepaddle.org.cn/documentation/docs/zh/user_guides/nlp_case/machine_translation/README.cn.html).
-
-## Model overview
-
-On the encoder side, this model uses a multi-layer LSTM-based RNN encoder; on the decoder side, it uses an RNN decoder with an attention mechanism. At prediction time, beam search is used to generate the translated target sentence (a minimal sketch of the attention step is shown right after this README).
-
-## Data
-
-This tutorial uses the English-to-Vietnamese portion of the [IWSLT'15 English-Vietnamese data](https://nlp.stanford.edu/projects/nmt/) as the training corpus, the tst2012 data as the dev set, and the tst2013 data as the test set.
-
-### Getting the data
-If no path is provided when the dataset is initialized, the data is downloaded automatically to the `IWSLT15/` directory under `paddlenlp.utils.env.DATA_HOME`; on Linux, for example, the default location is `~/.paddlenlp/datasets/IWSLT15`.
-
-## Training
-
-Run the following command to train the attention-based Seq2Seq machine translation model:
-
-```sh
-python train.py \
-    --num_layers 2 \
-    --hidden_size 512 \
-    --batch_size 128 \
-    --dropout 0.2 \
-    --init_scale 0.1 \
-    --max_grad_norm 5.0 \
-    --device gpu \
-    --model_path ./attention_models
-```
-
-See `args.py` for a detailed description of each argument. The training script saves the model once after each epoch finishes.
-
-**NOTE:** To resume training, `init_from_ckpt` only needs the checkpoint file name without its extension, e.g. `--init_from_ckpt=attention_models/5`; the script then automatically loads both the model parameters `attention_models/5.pdparams` and the optimizer state `attention_models/5.pdopt`.
-
-## Prediction
-
-After training, the saved model (specified via `--init_from_ckpt`) can be used to decode the test set with beam search. The generated translations are written to the path given by `--infer_output_file`. The prediction command is:
-
-```sh
-python predict.py \
-     --num_layers 2 \
-     --hidden_size 512 \
-     --batch_size 128 \
-     --dropout 0.2 \
-     --init_scale 0.1 \
-     --max_grad_norm 5.0 \
-     --init_from_ckpt attention_models/9 \
-     --infer_output_file infer_output.txt \
-     --beam_size 10 \
-     --device gpu
-```
-
-See `args.py` for a detailed description of each argument; note that the model hyperparameters used at prediction time must match those used for training.
-
-## Evaluation
-Using the result of the 10th epoch and decoding with beam search at a beam size of 10: after generating the translations, the `predict.py` script calls `paddlenlp.metrics.BLEU` to compute the BLEU metric of the output; the final BLEU score is 0.24329954822714048.
-
-## Exporting the inference model
-The `export_path` argument specified here is the filename prefix of the exported inference model files. The suffixes (`pdiparams`, `pdiparams.info`, `pdmodel`) are appended on save.
-```shell
-python export_model.py \
-    --num_layers 2 \
-    --hidden_size 512 \
-    --batch_size 128 \
-    --dropout 0.2 \
-    --init_scale 0.1 \
-    --max_grad_norm 5.0 \
-    --init_from_ckpt attention_models/9.pdparams \
-    --beam_size 10 \
-    --export_path ./infer_model/model
-```
-
-## Inference with the prediction engine
-Then run prediction on the (labeled) test set of the IWSLT15 dataset as follows, based on Paddle's [Python inference API](https://www.paddlepaddle.org.cn/documentation/docs/zh/guides/05_inference_deployment/inference/python_infer_cn.html):
-
-```shell
-cd deploy/python
-python infer.py \
-    --export_path ../../infer_model/model \
-    --device gpu \
-    --batch_size 128 \
-    --infer_output_file infer_output.txt
-```
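
To make the model overview above concrete, here is a minimal, self-contained sketch of the dot-product attention step such a decoder performs at each time step. This is an illustration under stated assumptions, not the example's actual implementation (which lived in `seq2seq_attn.py` above and uses learned projections inside the RNN decoder); the function name `attention_step` and the toy shapes are hypothetical.

```python
import paddle
import paddle.nn.functional as F


def attention_step(decoder_state, encoder_out):
    # Alignment scores of the current decoder state against every
    # encoder position: [batch, src_len]
    scores = paddle.matmul(encoder_out, decoder_state.unsqueeze(-1)).squeeze(-1)
    weights = F.softmax(scores, axis=-1)
    # Context vector: attention-weighted sum of encoder states: [batch, hidden]
    context = paddle.matmul(weights.unsqueeze(1), encoder_out).squeeze(1)
    return context, weights


# Toy shapes only; in the real model these come from the LSTM encoder/decoder.
encoder_out = paddle.randn([2, 5, 8])  # [batch, src_len, hidden]
decoder_state = paddle.randn([2, 8])   # [batch, hidden]
context, weights = attention_step(decoder_state, encoder_out)
print(context.shape, weights.shape)    # [2, 8] [2, 5]
```

The context vector is concatenated with the decoder state to predict the next target token; at evaluation time `predict.py` scores the resulting translations with `paddlenlp.metrics.BLEU`, which (in the PaddleNLP versions this example targeted) accumulates candidate/reference pairs via `add_inst` and reports the corpus score via `score()`.
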
diff --git a/examples/machine_translation/seq2seq/args.py b/examples/machine_translation/seq2seq/args.py
deleted file mode 100644
index 317917ab1189..000000000000
--- a/examples/machine_translation/seq2seq/args.py
+++ /dev/null
@@ -1,61 +0,0 @@
-# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserve.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import argparse
-
-
-def parse_args():
-    parser = argparse.ArgumentParser(description=__doc__)
-
-    parser.add_argument("--learning_rate", type=float, default=0.001, help="learning rate for optimizer")
-
-    parser.add_argument("--num_layers", type=int, default=1, help="number of layers in the encoder and the decoder")
-
-    parser.add_argument("--hidden_size", type=int, default=100, help="hidden size of encoder and decoder")
-
-    parser.add_argument("--batch_size", type=int, help="batch size of each step")
-
-    parser.add_argument("--max_epoch", type=int, default=12, help="maximum number of training epochs")
-
-    parser.add_argument("--max_len", type=int, default=50, help="max length for source and target sentence")
-
-    parser.add_argument("--dropout", type=float, default=0.2, help="dropout probability")
-
-    parser.add_argument("--init_scale", type=float, default=0.0, help="initialization scale for parameters")
-
-    parser.add_argument("--max_grad_norm", type=float, default=5.0, help="max grad norm for global norm clip")
-
-    parser.add_argument("--log_freq", type=int, default=100, help="The frequency to print training logs")
-
-    parser.add_argument("--model_path", type=str, default="model", help="the path where the model is saved")
-
-    parser.add_argument("--infer_output_file", type=str, default="infer_output", help="file name for inference output")
-
-    parser.add_argument("--beam_size", type=int, default=10, help="beam size for beam search decoding")
-
-    parser.add_argument(
-        "--device", default="gpu", choices=["gpu", "cpu", "xpu"], help="Device selected for inference."
-    )
-
-    parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.")
-
-    parser.add_argument(
-        "--export_path",
-        type=str,
-        default=None,
-        help="The output file prefix used to save the exported inference model.",
-    )
-
-    args = parser.parse_args()
-    return args
diff --git a/examples/machine_translation/seq2seq/data.py b/examples/machine_translation/seq2seq/data.py
deleted file mode 100644
index 3e4f44901a42..000000000000
--- a/examples/machine_translation/seq2seq/data.py
+++ /dev/null
@@ -1,113 +0,0 @@
-# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
- -from functools import partial - -import numpy as np -import paddle - -from paddlenlp.data import Pad, SamplerHelper, Vocab -from paddlenlp.datasets import load_dataset - - -def create_train_loader(args): - batch_size = args.batch_size - max_len = args.max_len - - train_ds, dev_ds = load_dataset("iwslt15", splits=("train", "dev")) - src_vocab = Vocab.load_vocabulary(**train_ds.vocab_info["en"]) - tgt_vocab = Vocab.load_vocabulary(**train_ds.vocab_info["vi"]) - bos_id = src_vocab[src_vocab.bos_token] - eos_id = src_vocab[src_vocab.eos_token] - pad_id = eos_id - - def convert_example(example): - source = example["en"].split()[:max_len] - target = example["vi"].split()[:max_len] - - source = src_vocab.to_indices(source) - target = tgt_vocab.to_indices(target) - - return source, target - - key = lambda x, data_source: len(data_source[x][0]) - - # Truncate and convert example to ids - train_ds = train_ds.map(convert_example, lazy=False) - dev_ds = dev_ds.map(convert_example, lazy=False) - - train_batch_sampler = ( - SamplerHelper(train_ds).shuffle().sort(key=key, buffer_size=batch_size * 20).batch(batch_size=batch_size) - ) - - dev_batch_sampler = SamplerHelper(dev_ds).sort(key=key, buffer_size=batch_size * 20).batch(batch_size=batch_size) - - train_loader = paddle.io.DataLoader( - train_ds, - batch_sampler=train_batch_sampler, - collate_fn=partial(prepare_train_input, bos_id=bos_id, eos_id=eos_id, pad_id=pad_id), - ) - - dev_loader = paddle.io.DataLoader( - dev_ds, - batch_sampler=dev_batch_sampler, - collate_fn=partial(prepare_train_input, bos_id=bos_id, eos_id=eos_id, pad_id=pad_id), - ) - - return train_loader, dev_loader, len(src_vocab), len(tgt_vocab), pad_id - - -def create_infer_loader(args): - batch_size = args.batch_size - test_ds = load_dataset("iwslt15", splits="test") - src_vocab = Vocab.load_vocabulary(**test_ds.vocab_info["en"]) - tgt_vocab = Vocab.load_vocabulary(**test_ds.vocab_info["vi"]) - bos_id = src_vocab[src_vocab.bos_token] - eos_id = src_vocab[src_vocab.eos_token] - pad_id = eos_id - - def convert_example(example): - source = example["en"].split() - target = example["vi"].split() - - source = src_vocab.to_indices(source) - target = tgt_vocab.to_indices(target) - - return source, target - - test_ds.map(convert_example) - test_batch_sampler = SamplerHelper(test_ds).batch(batch_size=batch_size) - - test_loader = paddle.io.DataLoader( - test_ds, - batch_sampler=test_batch_sampler, - collate_fn=partial(prepare_infer_input, bos_id=bos_id, eos_id=eos_id, pad_id=pad_id), - ) - return test_loader, len(src_vocab), len(tgt_vocab), bos_id, eos_id - - -def prepare_infer_input(insts, bos_id, eos_id, pad_id): - insts = [([bos_id] + inst[0] + [eos_id], [bos_id] + inst[1] + [eos_id]) for inst in insts] - src, src_length = Pad(pad_val=pad_id, ret_length=True)([inst[0] for inst in insts]) - return src, src_length - - -def prepare_train_input(insts, bos_id, eos_id, pad_id): - # Add eos token id and bos token id. - insts = [([bos_id] + inst[0] + [eos_id], [bos_id] + inst[1] + [eos_id]) for inst in insts] - # Pad sequence using eos id. 
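-    # `tgt[:, :-1]` below is the decoder input (it keeps <bos> and drops the
-    # last token) and `tgt[:, 1:]` is the one-step-shifted label; `tgt_mask`
-    # zeroes out the loss contributed by padding positions.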
- src, src_length = Pad(pad_val=pad_id, ret_length=True)([inst[0] for inst in insts]) - tgt, tgt_length = Pad(pad_val=pad_id, ret_length=True, dtype="int64")([inst[1] for inst in insts]) - tgt_mask = (tgt[:, :-1] != pad_id).astype("float32") - return src, src_length, tgt[:, :-1], tgt[:, 1:, np.newaxis], tgt_mask diff --git a/examples/machine_translation/seq2seq/deploy/python/infer.py b/examples/machine_translation/seq2seq/deploy/python/infer.py deleted file mode 100644 index de6f80e07785..000000000000 --- a/examples/machine_translation/seq2seq/deploy/python/infer.py +++ /dev/null @@ -1,99 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import io -import sys - -sys.path.append("../../") - -import numpy as np # noqa: E402 -import paddle # noqa: E402 -from args import parse_args # noqa: E402 -from data import create_infer_loader # noqa: E402 -from predict import post_process_seq # noqa: E402 - -from paddlenlp.data import Vocab # noqa: E402 -from paddlenlp.datasets import load_dataset # noqa: E402 -from paddlenlp.metrics import BLEU # noqa: E402 - - -class Predictor(object): - def __init__(self, predictor, input_handles, output_handles): - self.predictor = predictor - self.input_handles = input_handles - self.output_handles = output_handles - - @classmethod - def create_predictor(cls, args): - config = paddle.inference.Config(args.export_path + ".pdmodel", args.export_path + ".pdiparams") - if args.device == "gpu": - # set GPU configs accordingly - config.enable_use_gpu(100, 0) - elif args.device == "cpu": - # set CPU configs accordingly, - # such as enable_mkldnn, set_cpu_math_library_num_threads - config.disable_gpu() - elif args.device == "xpu": - # set XPU configs accordingly - config.enable_xpu(100) - config.switch_use_feed_fetch_ops(False) - predictor = paddle.inference.create_predictor(config) - input_handles = [predictor.get_input_handle(name) for name in predictor.get_input_names()] - output_handles = [predictor.get_output_handle(name) for name in predictor.get_output_names()] - return cls(predictor, input_handles, output_handles) - - def predict_batch(self, data): - for input_field, input_handle in zip(data, self.input_handles): - input_handle.copy_from_cpu(input_field.numpy() if isinstance(input_field, paddle.Tensor) else input_field) - self.predictor.run() - output = [output_handle.copy_to_cpu() for output_handle in self.output_handles] - return output - - def predict(self, dataloader, infer_output_file, trg_idx2word, bos_id, eos_id): - cand_list = [] - with io.open(infer_output_file, "w", encoding="utf-8") as f: - for data in dataloader(): - finished_seq = self.predict_batch(data)[0] - finished_seq = finished_seq[:, :, np.newaxis] if len(finished_seq.shape) == 2 else finished_seq - finished_seq = np.transpose(finished_seq, [0, 2, 1]) - for ins in finished_seq: - for beam_idx, beam in enumerate(ins): - id_list = post_process_seq(beam, bos_id, eos_id) - word_list = [trg_idx2word[id] for id in id_list] - sequence 
= " ".join(word_list) + "\n"
-                        f.write(sequence)
-                        cand_list.append(word_list)
-                        break
-
-        test_ds = load_dataset("iwslt15", splits="test")
-        bleu = BLEU()
-        for i, data in enumerate(test_ds):
-            ref = data["vi"].split()
-            bleu.add_inst(cand_list[i], [ref])
-        print("BLEU score is %s." % bleu.score())
-
-
-def main():
-    args = parse_args()
-
-    predictor = Predictor.create_predictor(args)
-    test_loader, src_vocab_size, tgt_vocab_size, bos_id, eos_id = create_infer_loader(args)
-    tgt_vocab = Vocab.load_vocabulary(**test_loader.dataset.vocab_info["vi"])
-    trg_idx2word = tgt_vocab.idx_to_token
-
-    predictor.predict(test_loader, args.infer_output_file, trg_idx2word, bos_id, eos_id)
-
-
-if __name__ == "__main__":
-    main()
diff --git a/examples/machine_translation/seq2seq/export_model.py b/examples/machine_translation/seq2seq/export_model.py
deleted file mode 100644
index 79b05c1dccb5..000000000000
--- a/examples/machine_translation/seq2seq/export_model.py
+++ /dev/null
@@ -1,57 +0,0 @@
-# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import paddle
-from args import parse_args
-from data import create_infer_loader
-from seq2seq_attn import Seq2SeqAttnInferModel
-
-
-def main():
-    args = parse_args()
-    _, src_vocab_size, tgt_vocab_size, bos_id, eos_id = create_infer_loader(args)
-
-    # Build model and load trained parameters
-    model = Seq2SeqAttnInferModel(
-        src_vocab_size,
-        tgt_vocab_size,
-        args.hidden_size,
-        args.hidden_size,
-        args.num_layers,
-        args.dropout,
-        bos_id=bos_id,
-        eos_id=eos_id,
-        beam_size=args.beam_size,
-        max_out_len=256,
-    )
-
-    # Load the trained model
-    model.set_state_dict(paddle.load(args.init_from_ckpt))
-
-    # Switch to eval mode
-    model.eval()
-    # Convert to static graph with specific input description
-    model = paddle.jit.to_static(
-        model,
-        input_spec=[
-            paddle.static.InputSpec(shape=[None, None], dtype="int64"),  # src
-            paddle.static.InputSpec(shape=[None], dtype="int64"),  # src length
-        ],
-    )
-    # Save converted static graph model
-    paddle.jit.save(model, args.export_path)
-
-
-if __name__ == "__main__":
-    main()
diff --git a/examples/machine_translation/seq2seq/predict.py b/examples/machine_translation/seq2seq/predict.py
deleted file mode 100644
index 0da32d69d057..000000000000
--- a/examples/machine_translation/seq2seq/predict.py
+++ /dev/null
@@ -1,92 +0,0 @@
-# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
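-
-# Beam-search decoding for the attention Seq2Seq model via the high-level
-# paddle.Model API, followed by BLEU evaluation against the IWSLT15 test set.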
-
-import io
-
-import numpy as np
-import paddle
-from args import parse_args
-from data import create_infer_loader
-from seq2seq_attn import Seq2SeqAttnInferModel
-
-from paddlenlp.data import Vocab
-from paddlenlp.metrics import BLEU
-
-
-def post_process_seq(seq, bos_idx, eos_idx, output_bos=False, output_eos=False):
-    """
-    Post-process the decoded sequence.
-    """
-    eos_pos = len(seq) - 1
-    for i, idx in enumerate(seq):
-        if idx == eos_idx:
-            eos_pos = i
-            break
-    seq = [idx for idx in seq[: eos_pos + 1] if (output_bos or idx != bos_idx) and (output_eos or idx != eos_idx)]
-    return seq
-
-
-def do_predict(args):
-    paddle.set_device(args.device)
-
-    test_loader, src_vocab_size, tgt_vocab_size, bos_id, eos_id = create_infer_loader(args)
-    tgt_vocab = Vocab.load_vocabulary(**test_loader.dataset.vocab_info["vi"])
-
-    model = paddle.Model(
-        Seq2SeqAttnInferModel(
-            src_vocab_size,
-            tgt_vocab_size,
-            args.hidden_size,
-            args.hidden_size,
-            args.num_layers,
-            args.dropout,
-            bos_id=bos_id,
-            eos_id=eos_id,
-            beam_size=args.beam_size,
-            max_out_len=256,
-        )
-    )
-
-    model.prepare()
-
-    # Load the trained model
-    assert args.init_from_ckpt, "Please set init_from_ckpt to load the trained model."
-    model.load(args.init_from_ckpt)
-
-    cand_list = []
-    with io.open(args.infer_output_file, "w", encoding="utf-8") as f:
-        for data in test_loader():
-            with paddle.no_grad():
-                finished_seq = model.predict_batch(inputs=data)[0]
-            finished_seq = finished_seq[:, :, np.newaxis] if len(finished_seq.shape) == 2 else finished_seq
-            finished_seq = np.transpose(finished_seq, [0, 2, 1])
-            for ins in finished_seq:
-                for beam_idx, beam in enumerate(ins):
-                    id_list = post_process_seq(beam, bos_id, eos_id)
-                    word_list = [tgt_vocab.to_tokens(id) for id in id_list]
-                    sequence = " ".join(word_list) + "\n"
-                    f.write(sequence)
-                    cand_list.append(word_list)
-                    break
-
-    bleu = BLEU()
-    for i, data in enumerate(test_loader.dataset.data):
-        ref = data["vi"].split()
-        bleu.add_inst(cand_list[i], [ref])
-    print("BLEU score is %s." % bleu.score())
-
-
-if __name__ == "__main__":
-    args = parse_args()
-    do_predict(args)
diff --git a/examples/machine_translation/seq2seq/seq2seq_attn.py b/examples/machine_translation/seq2seq/seq2seq_attn.py
deleted file mode 100644
index 5bbcf62b77c5..000000000000
--- a/examples/machine_translation/seq2seq/seq2seq_attn.py
+++ /dev/null
@@ -1,254 +0,0 @@
-# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
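-
-# Model pieces defined below: a multi-layer LSTM encoder, a multiplicative
-# attention layer over the encoder outputs, a decoder built from stacked
-# LSTM cells with input feeding, and an inference model that wraps the
-# decoder cell in a beam-search decoder.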
- -import paddle -import paddle.nn as nn -import paddle.nn.functional as F -import paddle.nn.initializer as I - - -class CrossEntropyCriterion(nn.Layer): - def __init__(self): - super(CrossEntropyCriterion, self).__init__() - - def forward(self, predict, label, trg_mask): - cost = F.cross_entropy(input=predict, label=label, soft_label=False, reduction="none") - cost = paddle.squeeze(cost, axis=[2]) - masked_cost = cost * trg_mask - batch_mean_cost = paddle.mean(masked_cost, axis=[0]) - seq_cost = paddle.sum(batch_mean_cost) - - return seq_cost - - -class Seq2SeqEncoder(nn.Layer): - def __init__(self, vocab_size, embed_dim, hidden_size, num_layers, dropout_prob=0.0, init_scale=0.1): - super(Seq2SeqEncoder, self).__init__() - self.embedder = nn.Embedding( - vocab_size, - embed_dim, - weight_attr=paddle.ParamAttr(initializer=I.Uniform(low=-init_scale, high=init_scale)), - ) - - self.lstm = nn.LSTM( - input_size=embed_dim, - hidden_size=hidden_size, - num_layers=num_layers, - direction="forward", - dropout=dropout_prob if num_layers > 1 else 0.0, - ) - - def forward(self, sequence, sequence_length): - inputs = self.embedder(sequence) - encoder_output, encoder_state = self.lstm(inputs, sequence_length=sequence_length) - - return encoder_output, encoder_state - - -class AttentionLayer(nn.Layer): - def __init__(self, hidden_size, bias=False, init_scale=0.1): - super(AttentionLayer, self).__init__() - self.input_proj = nn.Linear( - hidden_size, - hidden_size, - weight_attr=paddle.ParamAttr(initializer=I.Uniform(low=-init_scale, high=init_scale)), - bias_attr=bias, - ) - self.output_proj = nn.Linear( - hidden_size + hidden_size, - hidden_size, - weight_attr=paddle.ParamAttr(initializer=I.Uniform(low=-init_scale, high=init_scale)), - bias_attr=bias, - ) - - def forward(self, hidden, encoder_output, encoder_padding_mask): - encoder_output = self.input_proj(encoder_output) - attn_scores = paddle.matmul(paddle.unsqueeze(hidden, [1]), encoder_output, transpose_y=True) - - if encoder_padding_mask is not None: - attn_scores = paddle.add(attn_scores, encoder_padding_mask) - - attn_scores = F.softmax(attn_scores) - attn_out = paddle.squeeze(paddle.matmul(attn_scores, encoder_output), [1]) - attn_out = paddle.concat([attn_out, hidden], 1) - attn_out = self.output_proj(attn_out) - return attn_out - - -class Seq2SeqDecoderCell(nn.RNNCellBase): - def __init__(self, num_layers, input_size, hidden_size, dropout_prob=0.0): - super(Seq2SeqDecoderCell, self).__init__() - if dropout_prob > 0.0: - self.dropout = nn.Dropout(dropout_prob) - else: - self.dropout = None - - self.lstm_cells = nn.LayerList( - [ - nn.LSTMCell(input_size=input_size + hidden_size if i == 0 else hidden_size, hidden_size=hidden_size) - for i in range(num_layers) - ] - ) - - self.attention_layer = AttentionLayer(hidden_size) - - def forward(self, step_input, states, encoder_output, encoder_padding_mask=None): - lstm_states, input_feed = states - new_lstm_states = [] - step_input = paddle.concat([step_input, input_feed], 1) - for i, lstm_cell in enumerate(self.lstm_cells): - out, new_lstm_state = lstm_cell(step_input, lstm_states[i]) - if self.dropout: - step_input = self.dropout(out) - else: - step_input = out - - new_lstm_states.append(new_lstm_state) - out = self.attention_layer(step_input, encoder_output, encoder_padding_mask) - return out, [new_lstm_states, out] - - -class Seq2SeqDecoder(nn.Layer): - def __init__(self, vocab_size, embed_dim, hidden_size, num_layers, dropout_prob=0.0, init_scale=0.1): - super(Seq2SeqDecoder, self).__init__() - 
self.embedder = nn.Embedding( - vocab_size, - embed_dim, - weight_attr=paddle.ParamAttr(initializer=I.Uniform(low=-init_scale, high=init_scale)), - ) - self.lstm_attention = nn.RNN( - Seq2SeqDecoderCell(num_layers, embed_dim, hidden_size, dropout_prob), is_reverse=False, time_major=False - ) - self.output_layer = nn.Linear( - hidden_size, - vocab_size, - weight_attr=paddle.ParamAttr(initializer=I.Uniform(low=-init_scale, high=init_scale)), - bias_attr=False, - ) - - def forward(self, trg, decoder_initial_states, encoder_output, encoder_padding_mask): - inputs = self.embedder(trg) - - decoder_output, _ = self.lstm_attention( - inputs, - initial_states=decoder_initial_states, - encoder_output=encoder_output, - encoder_padding_mask=encoder_padding_mask, - ) - predict = self.output_layer(decoder_output) - - return predict - - -class Seq2SeqAttnModel(nn.Layer): - def __init__( - self, - src_vocab_size, - trg_vocab_size, - embed_dim, - hidden_size, - num_layers, - dropout_prob=0.0, - eos_id=1, - init_scale=0.1, - ): - super(Seq2SeqAttnModel, self).__init__() - self.hidden_size = hidden_size - self.eos_id = eos_id - self.num_layers = num_layers - self.INF = 1e9 - self.encoder = Seq2SeqEncoder(src_vocab_size, embed_dim, hidden_size, num_layers, dropout_prob, init_scale) - self.decoder = Seq2SeqDecoder(trg_vocab_size, embed_dim, hidden_size, num_layers, dropout_prob, init_scale) - - def forward(self, src, src_length, trg): - encoder_output, encoder_final_state = self.encoder(src, src_length) - - # Transfer shape of encoder_final_states to [num_layers, 2, batch_size, hidden_size] - encoder_final_states = [(encoder_final_state[0][i], encoder_final_state[1][i]) for i in range(self.num_layers)] - # Construct decoder initial states: use input_feed and the shape is - # [[h,c] * num_layers, input_feed], consistent with Seq2SeqDecoderCell.states - decoder_initial_states = [ - encoder_final_states, - self.decoder.lstm_attention.cell.get_initial_states(batch_ref=encoder_output, shape=[self.hidden_size]), - ] - # Build attention mask to avoid paying attention on padddings - src_mask = (src != self.eos_id).astype(paddle.get_default_dtype()) - encoder_padding_mask = (src_mask - 1.0) * self.INF - encoder_padding_mask = paddle.unsqueeze(encoder_padding_mask, [1]) - - predict = self.decoder(trg, decoder_initial_states, encoder_output, encoder_padding_mask) - return predict - - -class Seq2SeqAttnInferModel(Seq2SeqAttnModel): - def __init__( - self, - src_vocab_size, - trg_vocab_size, - embed_dim, - hidden_size, - num_layers, - dropout_prob=0.0, - bos_id=0, - eos_id=1, - beam_size=4, - max_out_len=256, - ): - args = dict(locals()) - args.pop("self") - args.pop("__class__", None) - self.bos_id = args.pop("bos_id") - self.beam_size = args.pop("beam_size") - self.max_out_len = args.pop("max_out_len") - self.num_layers = num_layers - super(Seq2SeqAttnInferModel, self).__init__(**args) - # Dynamic decoder for inference - self.beam_search_decoder = nn.BeamSearchDecoder( - self.decoder.lstm_attention.cell, - start_token=bos_id, - end_token=eos_id, - beam_size=beam_size, - embedding_fn=self.decoder.embedder, - output_fn=self.decoder.output_layer, - ) - - def forward(self, src, src_length): - encoder_output, encoder_final_state = self.encoder(src, src_length) - - encoder_final_state = [(encoder_final_state[0][i], encoder_final_state[1][i]) for i in range(self.num_layers)] - - # Initial decoder initial states - decoder_initial_states = [ - encoder_final_state, - 
self.decoder.lstm_attention.cell.get_initial_states(batch_ref=encoder_output, shape=[self.hidden_size]), - ] - # Build attention mask to avoid paying attention on paddings - src_mask = (src != self.eos_id).astype(paddle.get_default_dtype()) - - encoder_padding_mask = (src_mask - 1.0) * self.INF - encoder_padding_mask = paddle.unsqueeze(encoder_padding_mask, [1]) - - # Tile the batch dimension with beam_size - encoder_output = nn.BeamSearchDecoder.tile_beam_merge_with_batch(encoder_output, self.beam_size) - encoder_padding_mask = nn.BeamSearchDecoder.tile_beam_merge_with_batch(encoder_padding_mask, self.beam_size) - - # Dynamic decoding with beam search - seq_output, _ = nn.dynamic_decode( - decoder=self.beam_search_decoder, - inits=decoder_initial_states, - max_step_num=self.max_out_len, - encoder_output=encoder_output, - encoder_padding_mask=encoder_padding_mask, - ) - return seq_output diff --git a/examples/machine_translation/seq2seq/train.py b/examples/machine_translation/seq2seq/train.py deleted file mode 100644 index fec0708040f5..000000000000 --- a/examples/machine_translation/seq2seq/train.py +++ /dev/null @@ -1,64 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import paddle -import paddle.nn as nn -from args import parse_args -from data import create_train_loader -from seq2seq_attn import CrossEntropyCriterion, Seq2SeqAttnModel - -from paddlenlp.metrics import Perplexity - - -def do_train(args): - paddle.set_device(args.device) - - # Define dataloader - train_loader, eval_loader, src_vocab_size, tgt_vocab_size, eos_id = create_train_loader(args) - - model = paddle.Model( - Seq2SeqAttnModel( - src_vocab_size, tgt_vocab_size, args.hidden_size, args.hidden_size, args.num_layers, args.dropout, eos_id - ) - ) - - grad_clip = nn.ClipGradByGlobalNorm(args.max_grad_norm) - optimizer = paddle.optimizer.Adam( - learning_rate=args.learning_rate, parameters=model.parameters(), grad_clip=grad_clip - ) - - ppl_metric = Perplexity() - model.prepare(optimizer, CrossEntropyCriterion(), ppl_metric) - - print(args) - if args.init_from_ckpt: - model.load(args.init_from_ckpt) - print("Loaded checkpoint from %s" % args.init_from_ckpt) - - benchmark_logger = paddle.callbacks.ProgBarLogger(log_freq=args.log_freq, verbose=3) - - model.fit( - train_data=train_loader, - eval_data=eval_loader, - epochs=args.max_epoch, - eval_freq=1, - save_freq=1, - save_dir=args.model_path, - callbacks=[benchmark_logger], - ) - - -if __name__ == "__main__": - args = parse_args() - do_train(args) diff --git a/examples/machine_translation/transformer/tls/distributed_utils.py b/examples/machine_translation/transformer/tls/distributed_utils.py deleted file mode 100644 index 26d6c0ca8d90..000000000000 --- a/examples/machine_translation/transformer/tls/distributed_utils.py +++ /dev/null @@ -1,19 +0,0 @@ -import paddle -import paddle.distributed as dist - - -def all_gather_tokens(data): - """Gathers num of tokens from all nodes. 
-    `data` should be a tensor of num of tokens.
-    """
-    if dist.get_world_size() < 2:
-        return data
-    if not hasattr(all_gather_tokens, "_in_buffer") or all_gather_tokens._in_buffer is None:
-        all_gather_tokens._in_buffer = data
-        all_gather_tokens._out_buffers = []
-    in_buffer = all_gather_tokens._in_buffer
-    out_buffers = all_gather_tokens._out_buffers
-
-    dist.all_gather(out_buffers, in_buffer)
-
-    return paddle.add_n(out_buffers)
diff --git a/examples/model_compression/distill_lstm/README.md b/examples/model_compression/distill_lstm/README.md
deleted file mode 100644
index 56e63c435339..000000000000
--- a/examples/model_compression/distill_lstm/README.md
+++ /dev/null
@@ -1,194 +0,0 @@
-# Distilling Knowledge From Fine-tuned BERT into Bi-LSTM
-
-以下是本例的简要目录结构及说明:
-```
-.
-├── small.py          # 小模型结构以及对小模型单独训练的脚本
-├── bert_distill.py   # 用教师模型BERT蒸馏学生模型的蒸馏脚本
-├── data.py           # 定义了dataloader等数据读取接口
-├── utils.py          # 定义了将样本转成id的转换接口
-├── args.py           # 参数配置脚本
-└── README.md         # 文档,本文件
-```
-
-## 简介
-本目录下的实验是将特定任务下BERT模型的知识蒸馏到基于Bi-LSTM的小模型中,主要参考论文 [Distilling Task-Specific Knowledge from BERT into Simple Neural Networks](https://arxiv.org/abs/1903.12136)实现。
-
-在模型蒸馏中,较大的模型(在本例中是BERT)通常被称为教师模型,较小的模型(在本例中是Bi-LSTM)通常被称为学生模型。知识的蒸馏通常是通过让学生模型学习蒸馏相关的损失函数实现。在本实验中,损失函数是均方误差损失函数,传入函数的两个参数分别是学生模型的输出和教师模型的输出。
-
-在[论文](https://arxiv.org/abs/1903.12136)的模型蒸馏阶段,作者为了能让教师模型表达出更多的知识供学生模型学习,对训练数据进行了数据增强。作者使用了三种数据增强方式,分别是:
-
-1. Masking,即以一定的概率将原数据中的word token替换成`[MASK]`;
-
-2. POS-guided word replacement,即以一定的概率将原数据中的词用与其有相同POS tag的词替换;
-
-3. n-gram sampling,即以一定的概率,从每条数据中采样n-gram,其中n的范围可通过人工设置。
-
-通过数据增强,可以产生更多无标签的训练数据,在训练过程中,学生模型可借助教师模型的"暗知识",在更大的数据集上进行训练,产生更好的蒸馏效果。需要指出的是,本实验只使用了第1和第3种数据增强方式。
-在英文数据集任务上,本文使用了Google News语料[预训练的Word Embedding](https://code.google.com/archive/p/word2vec/)初始化小模型的Embedding层。
-
-本实验分为三个训练过程:在特定任务上对BERT的fine-tuning、在特定任务上对基于Bi-LSTM的小模型的训练(用于评价蒸馏效果)、将BERT模型的知识蒸馏到基于Bi-LSTM的小模型上。
-
-## 数据、预训练模型介绍及获取
-
-本实验使用GLUE中的SST-2、QQP以及中文情感分类数据集ChnSentiCorp中的训练集作为训练语料,用数据集中的验证集评估模型的效果。运行本目录下的实验,数据集会被自动下载到`paddlenlp.utils.env.DATA_HOME` 路径下。例如在linux系统下,对于GLUE中的QQP数据集,默认存储路径是`~/.paddlenlp/datasets/glue/QQP`,对于ChnSentiCorp数据集,则会下载到 `~/.paddlenlp/datasets/chnsenticorp`。
-
-对于BERT的fine-tuning任务,本实验中使用了预训练模型`bert-base-uncased`、`bert-wwm-ext-chinese`、`bert-base-chinese`。同样,这几个模型在训练时会被自动下载到`paddlenlp.utils.env.MODEL_HOME`路径下。例如,对于`bert-base-uncased`模型,在linux系统下,会被下载到`~/.paddlenlp/models/bert-base-uncased`下。
-
-在中文数据集上,小模型训练的输入利用jieba分词,其词表同本repo下[文本分类项目](../../text_classification/rnn)的词表,可通过运行以下命令进行下载:
-
-```shell
-wget https://bj.bcebos.com/paddlenlp/data/senta_word_dict.txt
-```
-
-为了节省显存和运行时间,可以对ChnSentiCorp中未出现的词先进行过滤,并将最后的词表文件名和词表大小配置在下面的参数中。
-
-
-## 蒸馏实验过程
-### 训练BERT fine-tuning模型
-训练BERT的fine-tuning模型,可参考本repo下examples中的[glue目录](../../benchmark/glue)。关于glue的更多详细说明,可见glue目录下的README文档。
-
-以GLUE的SST-2任务为例,调用BERT fine-tune的训练脚本,配置如下的参数,训练SST-2任务:
-
-```shell
-cd ../../benchmark/glue
-export CUDA_VISIBLE_DEVICES=0
-export TASK_NAME=SST-2
-python -u ./run_glue.py \
-    --model_type bert \
-    --model_name_or_path bert-base-uncased \
-    --task_name $TASK_NAME \
-    --max_seq_length 128 \
-    --batch_size 128 \
-    --learning_rate 3e-5 \
-    --num_train_epochs 3 \
-    --logging_steps 10 \
-    --save_steps 10 \
-    --output_dir ../model_compression/distill_lstm/pretrained_models/$TASK_NAME/ \
-    --device gpu
-```
-
-如果需要训练基于ChnSentiCorp数据集的BERT finetuning模型,可以进入[文本分类目录](../../text_classification/pretrained_models)下,将预训练模型改成BERT,并基于bert-base-chinese和bert-wwm-ext-chinese模型进行fine-tuning训练。
-
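-蒸馏时,学生模型的训练目标是交叉熵损失与均方误差损失的加权和,下面给出该目标的一个最小示意(与后文 `bert_distill.py` 中的实现一致,其中 `alpha` 为交叉熵损失的权重,默认为0,即完全拟合教师模型输出的logits;`student_logits`、`teacher_logits`、`labels` 均为示意用的占位变量):
-
-```python
-import paddle.nn as nn
-
-ce_loss = nn.CrossEntropyLoss()
-mse_loss = nn.MSELoss()
-
-# alpha=0 时退化为只向教师模型输出的logits学习
-loss = alpha * ce_loss(student_logits, labels) + (1 - alpha) * mse_loss(student_logits, teacher_logits)
-```
-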
-训练完成之后,可将训练效果最好的模型保存在本项目下的`pretrained_models/$TASK_NAME/`下。模型目录下有`model_config.json`, `model_state.pdparams`, `tokenizer_config.json`及`vocab.txt`这几个文件。
-
-
-### 训练小模型
-
-运行下面的脚本,可以分别基于ChnSentiCorp、SST-2、QQP数据集对基于BiLSTM的小模型进行训练。
-
-
-```shell
-CUDA_VISIBLE_DEVICES=0 python small.py \
-    --task_name chnsenticorp \
-    --max_epoch 20 \
-    --vocab_size 1256608 \
-    --batch_size 64 \
-    --model_name bert-wwm-ext-chinese \
-    --optimizer adam \
-    --lr 3e-4 \
-    --dropout_prob 0.2 \
-    --vocab_path senta_word_dict.txt \
-    --save_steps 10000 \
-    --output_dir small_models/chnsenticorp/
-```
-
-```shell
-CUDA_VISIBLE_DEVICES=0 python small.py \
-    --task_name sst-2 \
-    --vocab_size 30522 \
-    --max_epoch 10 \
-    --batch_size 64 \
-    --lr 1.0 \
-    --dropout_prob 0.4 \
-    --output_dir small_models/SST-2 \
-    --save_steps 10000 \
-    --embedding_name w2v.google_news.target.word-word.dim300.en
-```
-
-```shell
-CUDA_VISIBLE_DEVICES=0 python small.py \
-    --task_name qqp \
-    --vocab_size 30522 \
-    --max_epoch 35 \
-    --batch_size 256 \
-    --lr 2.0 \
-    --dropout_prob 0.4 \
-    --output_dir small_models/QQP \
-    --save_steps 10000 \
-    --embedding_name w2v.google_news.target.word-word.dim300.en
-```
-
-### 蒸馏模型
-这一步是将教师模型BERT的知识蒸馏到基于BiLSTM的学生模型中,可以运行下面的命令,分别基于ChnSentiCorp、SST-2、QQP数据集对基于BiLSTM的学生模型进行蒸馏。
-
-```shell
-CUDA_VISIBLE_DEVICES=0 python bert_distill.py \
-    --task_name chnsenticorp \
-    --vocab_size 1256608 \
-    --max_epoch 6 \
-    --lr 1.0 \
-    --dropout_prob 0.1 \
-    --batch_size 64 \
-    --model_name bert-wwm-ext-chinese \
-    --teacher_dir pretrained_models/chnsenticorp/best_bert_wwm_ext_model_880 \
-    --vocab_path senta_word_dict.txt \
-    --output_dir distilled_models/chnsenticorp \
-    --save_steps 10000
-```
-
-```shell
-CUDA_VISIBLE_DEVICES=0 python bert_distill.py \
-    --task_name sst-2 \
-    --vocab_size 30522 \
-    --max_epoch 6 \
-    --lr 1.0 \
-    --dropout_prob 0.2 \
-    --batch_size 128 \
-    --model_name bert-base-uncased \
-    --output_dir distilled_models/SST-2 \
-    --teacher_dir pretrained_models/SST-2/best_model_610 \
-    --save_steps 10000 \
-    --embedding_name w2v.google_news.target.word-word.dim300.en
-```
-
-```shell
-CUDA_VISIBLE_DEVICES=0 python bert_distill.py \
-    --task_name qqp \
-    --vocab_size 30522 \
-    --max_epoch 6 \
-    --lr 1.0 \
-    --dropout_prob 0.2 \
-    --batch_size 256 \
-    --model_name bert-base-uncased \
-    --n_iter 10 \
-    --output_dir distilled_models/QQP \
-    --teacher_dir pretrained_models/QQP/best_model_17000 \
-    --save_steps 10000 \
-    --embedding_name w2v.google_news.target.word-word.dim300.en
-```
-
-各参数的具体说明请参阅 `args.py`,注意在训练不同任务时,需要调整对应的超参数。
-
-
-## 蒸馏实验结果
-本蒸馏实验基于GLUE的SST-2、QQP、中文情感分类ChnSentiCorp数据集。实验效果均使用每个数据集的验证集(dev)进行评价,评价指标是准确率(acc),其中QQP还包含F1值。利用基于BERT的教师模型去蒸馏基于Bi-LSTM的学生模型,对比Bi-LSTM小模型单独训练,在SST-2、QQP、ChnSentiCorp(中文情感分类)任务上分别有3.3%、1.9%、1.4%的提升。
-
-| Model             | SST-2(dev acc)    | QQP(dev acc/f1)            | ChnSentiCorp(dev acc) | ChnSentiCorp(dev acc) |
-| ----------------- | ----------------- | -------------------------- | --------------------- | --------------------- |
-| Teacher model     | bert-base-uncased | bert-base-uncased          | bert-base-chinese     | bert-wwm-ext-chinese  |
-| BERT-base         | 0.930046          | 0.905813(acc)/0.873472(f1) | 0.951667              | 0.955000              |
-| Bi-LSTM           | 0.854358          | 0.856616(acc)/0.799682(f1) | 0.920000              | 0.920000              |
-| Distilled Bi-LSTM | 0.887615          | 0.875216(acc)/0.831254(f1) | 0.932500              | 0.934167              |
-
-## 参考文献
-
-Tang R, Lu Y, Liu L, Mou L, Vechtomova O, Lin J. [Distilling Task-Specific Knowledge from BERT into Simple Neural Networks](https://arxiv.org/abs/1903.12136)[J]. arXiv preprint arXiv:1903.12136, 2019.
diff --git a/examples/model_compression/distill_lstm/args.py b/examples/model_compression/distill_lstm/args.py
deleted file mode 100644
index 07fd4b1bb191..000000000000
--- a/examples/model_compression/distill_lstm/args.py
+++ /dev/null
@@ -1,108 +0,0 @@
-# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserve.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import os
-import argparse
-
-from paddlenlp.utils.env import MODEL_HOME
-
-
-def parse_args():
-    parser = argparse.ArgumentParser(description=__doc__)
-
-    parser.add_argument("--task_name", type=str, default="sst-2", help="Task name.")
-
-    parser.add_argument(
-        "--optimizer", type=str, default="adadelta", help="Optimizer to use, only supports [adam|adadelta]."
-    )
-
-    parser.add_argument("--lr", type=float, default=1.0, help="Learning rate for optimizer.")
-
-    parser.add_argument("--num_layers", type=int, default=1, help="Layers number of LSTM.")
-
-    parser.add_argument("--emb_dim", type=int, default=300, help="Embedding dim.")
-
-    parser.add_argument("--output_dim", type=int, default=2, help="Number of classifications.")
-
-    parser.add_argument("--hidden_size", type=int, default=300, help="Hidden size of LSTM")
-
-    parser.add_argument("--batch_size", type=int, default=64, help="Batch size of training.")
-
-    parser.add_argument("--max_epoch", type=int, default=12, help="Max number of epochs for training.")
-
-    parser.add_argument("--max_seq_length", type=int, default=128, help="Max length for sentence.")
-
-    parser.add_argument(
-        "--n_iter", type=int, default=20, help="Number of iterations for one sample in data augmentation."
-    )
-
-    parser.add_argument("--dropout_prob", type=float, default=0.0, help="Drop probability.")
-
-    parser.add_argument("--init_scale", type=float, default=0.1, help="Init scale for parameter")
-
-    parser.add_argument("--log_freq", type=int, default=10, help="The frequency to print evaluation logs.")
-
-    parser.add_argument("--save_steps", type=int, default=100, help="The frequency to save checkpoints.")
-
-    parser.add_argument("--padding_idx", type=int, default=0, help="The padding index of embedding.")
-
-    parser.add_argument(
-        "--model_name",
-        type=str,
-        default="bert-base-uncased",
-        help="Teacher model's name. Maybe its tokenizer would be loaded and used by small model.",
-    )
-
-    parser.add_argument("--teacher_dir", type=str, help="Teacher model's directory.")
-
-    parser.add_argument(
-        "--vocab_path",
-        type=str,
-        default=os.path.join(MODEL_HOME, "bert-base-uncased", "bert-base-uncased-vocab.txt"),
-        help="Student model's vocab path.",
-    )
-
-    parser.add_argument("--output_dir", type=str, default="models", help="Directory to save models.")
-
-    parser.add_argument(
-        "--init_from_ckpt", type=str, default=None, help="The path of layer and optimizer to be loaded."
-    )
-
-    parser.add_argument(
-        "--whole_word_mask",
-        action="store_true",
-        help="If True, use whole word masking method in data augmentation in distilling.",
-    )
-
-    parser.add_argument("--embedding_name", type=str, default=None, help="The name of pretrained word embedding.")
-
-    parser.add_argument("--vocab_size", type=int, default=10000, help="Student model's vocab size.")
-
-    parser.add_argument(
-        "--alpha", type=float, default=0.0, help="Weight balance between cross entropy loss and mean square loss."
-    )
-
-    parser.add_argument(
-        "--seed",
-        type=int,
-        default=2021,
-        help="Random seed for model parameter initialization, data augmentation and so on.",
-    )
-
-    parser.add_argument(
-        "--device", default="gpu", choices=["gpu", "cpu", "xpu"], help="Device selected for inference."
-    )
-
-    args = parser.parse_args()
-    return args
diff --git a/examples/model_compression/distill_lstm/bert_distill.py b/examples/model_compression/distill_lstm/bert_distill.py
deleted file mode 100644
index 9f253a31b8f5..000000000000
--- a/examples/model_compression/distill_lstm/bert_distill.py
+++ /dev/null
@@ -1,172 +0,0 @@
-# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserve.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import os
-import time
-
-import paddle
-import paddle.nn as nn
-from args import parse_args
-from data import create_distill_loader
-from paddle.metric import Accuracy
-from small import BiLSTM
-
-from paddlenlp.metrics import AccuracyAndF1
-from paddlenlp.transformers import BertForSequenceClassification
-
-METRIC_CLASSES = {"sst-2": Accuracy, "qqp": AccuracyAndF1, "chnsenticorp": Accuracy}
-
-
-class TeacherModel(object):
-    def __init__(self, teacher_dir):
-        self.model = BertForSequenceClassification.from_pretrained(teacher_dir)
-        self.model.eval()
-
-
-def evaluate(task_name, model, metric, data_loader):
-    model.eval()
-    metric.reset()
-    for i, batch in enumerate(data_loader):
-        if task_name == "qqp":
-            _, _, student_input_ids_1, seq_len_1, student_input_ids_2, seq_len_2, labels = batch
-            logits = model(student_input_ids_1, seq_len_1, student_input_ids_2, seq_len_2)
-        else:
-            _, _, student_input_ids, seq_len, labels = batch
-            logits = model(student_input_ids, seq_len)
-
-        correct = metric.compute(logits, labels)
-        metric.update(correct)
-    res = metric.accumulate()
-    if isinstance(metric, AccuracyAndF1):
-        print(
-            "acc: %s, precision: %s, recall: %s, f1: %s, acc and f1: %s, "
-            % (
-                res[0],
-                res[1],
-                res[2],
-                res[3],
-                res[4],
-            ),
-            end="",
-        )
-    else:
-        print("acc: %s, " % (res), end="")
-    model.train()
-
-
-def do_train(args):
-    paddle.set_device(args.device)
-    train_data_loader, dev_data_loader = create_distill_loader(
-        args.task_name,
-        model_name=args.model_name,
-        vocab_path=args.vocab_path,
-        batch_size=args.batch_size,
-        max_seq_length=args.max_seq_length,
-        n_iter=args.n_iter,
-        whole_word_mask=args.whole_word_mask,
-        seed=args.seed,
-    )
-
-    model = BiLSTM(
-        args.emb_dim,
-        args.hidden_size,
-        args.vocab_size,
-        args.output_dim,
-        args.vocab_path,
-        args.padding_idx,
-        args.num_layers,
-        args.dropout_prob,
-        args.init_scale,
-        args.embedding_name,
-    )
-
-    if args.optimizer == "adadelta":
-        optimizer = paddle.optimizer.Adadelta(learning_rate=args.lr, rho=0.95, parameters=model.parameters())
-    else:
-        optimizer = paddle.optimizer.Adam(learning_rate=args.lr, parameters=model.parameters())
-
-    ce_loss = nn.CrossEntropyLoss()
-    mse_loss = nn.MSELoss()
-
-    metric_class = METRIC_CLASSES[args.task_name]
-    metric = metric_class()
-
-    teacher = TeacherModel(args.teacher_dir)
-
-    print("Start to distill student model.")
-
-    if args.init_from_ckpt:
-        model.set_state_dict(paddle.load(args.init_from_ckpt + ".pdparams"))
-        optimizer.set_state_dict(paddle.load(args.init_from_ckpt + ".pdopt"))
-        print("Loaded checkpoint from %s" % args.init_from_ckpt)
-
-    global_step = 0
-    tic_train = time.time()
-    for epoch in range(args.max_epoch):
-        model.train()
-        for i, batch in enumerate(train_data_loader):
-            global_step += 1
-            if args.task_name == "qqp":
-                (
-                    bert_input_ids,
-                    bert_segment_ids,
-                    student_input_ids_1,
-                    seq_len_1,
-                    student_input_ids_2,
-                    seq_len_2,
-                    labels,
-                ) = batch
-            else:
-                bert_input_ids, bert_segment_ids, student_input_ids, seq_len, labels = batch
-
-            # Calculate teacher model's forward.
-            with paddle.no_grad():
-                teacher_logits = teacher.model(bert_input_ids, bert_segment_ids)
-
-            # Calculate student model's forward.
-            if args.task_name == "qqp":
-                logits = model(student_input_ids_1, seq_len_1, student_input_ids_2, seq_len_2)
-            else:
-                logits = model(student_input_ids, seq_len)
-
-            loss = args.alpha * ce_loss(logits, labels) + (1 - args.alpha) * mse_loss(logits, teacher_logits)
-
-            loss.backward()
-            optimizer.step()
-            optimizer.clear_grad()
-
-            if global_step % args.log_freq == 0:
-                print(
-                    "global step %d, epoch: %d, batch: %d, loss: %f, speed: %.4f step/s"
-                    % (global_step, epoch, i, loss, args.log_freq / (time.time() - tic_train))
-                )
-                tic_eval = time.time()
-                evaluate(args.task_name, model, metric, dev_data_loader)
-                print("eval done total : %s s" % (time.time() - tic_eval))
-                tic_train = time.time()
-
-            if global_step % args.save_steps == 0:
-                paddle.save(
-                    model.state_dict(), os.path.join(args.output_dir, "step_" + str(global_step) + ".pdparams")
-                )
-                paddle.save(
-                    optimizer.state_dict(), os.path.join(args.output_dir, "step_" + str(global_step) + ".pdopt")
-                )
-
-
-if __name__ == "__main__":
-    args = parse_args()
-    print(args)
-    paddle.seed(args.seed)
-    do_train(args)
diff --git a/examples/model_compression/distill_lstm/data.py b/examples/model_compression/distill_lstm/data.py
deleted file mode 100644
index dec2b358260b..000000000000
--- a/examples/model_compression/distill_lstm/data.py
+++ /dev/null
@@ -1,322 +0,0 @@
-# -*- coding: utf-8 -*-
-# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserve.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
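-
-# Data pipeline for distillation: loads GLUE/ChnSentiCorp, applies the Masking
-# and n-gram sampling augmentation described in the README, and batches paired
-# inputs for the teacher (BERT) and the student (Bi-LSTM).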
-
-from functools import partial
-
-import jieba
-import numpy as np
-import paddle
-from utils import (
-    convert_example_for_distill,
-    convert_example_for_lstm,
-    convert_pair_example,
-)
-
-from paddlenlp.data import Pad, Stack, Tuple, Vocab
-from paddlenlp.datasets import load_dataset
-from paddlenlp.transformers import BertTokenizer
-
-
-def load_vocab(vocab_file):
-    """Loads a vocabulary file into a dictionary."""
-    vocab = {}
-    with open(vocab_file, "r", encoding="utf-8") as reader:
-        tokens = reader.readlines()
-    for index, token in enumerate(tokens):
-        token = token.rstrip("\n").split("\t")[0]
-        vocab[token] = index
-    return vocab
-
-
-def ngram_sampling(words, words_2=None, p_ng=0.25, ngram_range=(2, 6)):
-    if np.random.rand() < p_ng:
-        ngram_len = np.random.randint(ngram_range[0], ngram_range[1] + 1)
-        ngram_len = min(ngram_len, len(words))
-        start = np.random.randint(0, len(words) - ngram_len + 1)
-        words = words[start : start + ngram_len]
-        if words_2:
-            words_2 = words_2[start : start + ngram_len]
-    return words if not words_2 else (words, words_2)
-
-
-def flatten(list_of_list):
-    final_list = []
-    for each_list in list_of_list:
-        final_list += each_list
-    return final_list
-
-
-def apply_data_augmentation(
-    data, task_name, tokenizer, n_iter=20, p_mask=0.1, p_ng=0.25, ngram_range=(2, 6), whole_word_mask=False, seed=0
-):
-    """
-    Data Augmentation contains Masking and n-gram sampling. Tokenization and
-    Masking are performed at the same time, so that the masked token can be
-    directly replaced by `mask_token`, after which sampling is performed.
-    """
-
-    def _data_augmentation(data, tokenized_list, whole_word_mask=whole_word_mask):
-        # 1. Masking
-        words = []
-        if not whole_word_mask:
-            words = [tokenizer.mask_token if np.random.rand() < p_mask else word for word in tokenized_list]
-        else:
-            for word in data.split():
-                words += [[tokenizer.mask_token]] if np.random.rand() < p_mask else [tokenizer.tokenize(word)]
-        # 2. N-gram sampling
-        words = ngram_sampling(words, p_ng=p_ng, ngram_range=ngram_range)
-        words = flatten(words) if isinstance(words[0], list) else words
-        return words
-
-    np.random.seed(seed)
-    new_data = []
-    for example in data:
-        if task_name == "qqp":
-            data_list = tokenizer.tokenize(example["sentence1"])
-            data_list_2 = tokenizer.tokenize(example["sentence2"])
-            new_data.append({"sentence1": data_list, "sentence2": data_list_2, "labels": example["labels"]})
-        else:
-            data_list = tokenizer.tokenize(example["sentence"])
-            new_data.append({"sentence": data_list, "labels": example["labels"]})
-
-    for example in data:
-        # Re-tokenize each example here: after the loop above, `data_list`
-        # would otherwise still point at the last example's tokens.
-        if task_name == "qqp":
-            data_list = tokenizer.tokenize(example["sentence1"])
-            data_list_2 = tokenizer.tokenize(example["sentence2"])
-        else:
-            data_list = tokenizer.tokenize(example["sentence"])
-        for _ in range(n_iter):
-            if task_name == "qqp":
-                words = _data_augmentation(example["sentence1"], data_list)
-                words_2 = _data_augmentation(example["sentence2"], data_list_2)
-                new_data.append({"sentence1": words, "sentence2": words_2, "labels": example["labels"]})
-            else:
-                words = _data_augmentation(example["sentence"], data_list)
-                new_data.append({"sentence": words, "labels": example["labels"]})
-    return new_data
-
-
-def apply_data_augmentation_for_cn(
-    data, tokenizer, vocab, n_iter=20, p_mask=0.1, p_ng=0.25, ngram_range=(2, 10), seed=0
-):
-    """
-    Because BERT and jieba have different `tokenize` functions, this returns
-    the jieba-tokenized text, the BERT-tokenized text and example['label']
-    for each example in data.
- jieba tokenization and Masking are performed at the same time, so that the - masked token can be directly replaced by `mask_token`, and other tokens - could be tokenized by BERT's tokenizer, from which tokenized example for - student model and teacher model would get at the same time. - """ - np.random.seed(seed) - new_data = [] - - for example in data: - if not example["text"]: - continue - text_tokenized = list(jieba.cut(example["text"])) - lstm_tokens = text_tokenized - bert_tokens = tokenizer.tokenize(example["text"]) - new_data.append({"lstm_tokens": lstm_tokens, "bert_tokens": bert_tokens, "label": example["label"]}) - for _ in range(n_iter): - # 1. Masking - lstm_tokens, bert_tokens = [], [] - for word in text_tokenized: - if np.random.rand() < p_mask: - lstm_tokens.append([vocab.unk_token]) - bert_tokens.append([tokenizer.unk_token]) - else: - lstm_tokens.append([word]) - bert_tokens.append(tokenizer.tokenize(word)) - # 2. N-gram sampling - lstm_tokens, bert_tokens = ngram_sampling(lstm_tokens, bert_tokens, p_ng, ngram_range) - lstm_tokens, bert_tokens = flatten(lstm_tokens), flatten(bert_tokens) - if lstm_tokens and bert_tokens: - new_data.append({"lstm_tokens": lstm_tokens, "bert_tokens": bert_tokens, "label": example["label"]}) - return new_data - - -def create_data_loader_for_small_model( - task_name, vocab_path, model_name=None, batch_size=64, max_seq_length=128, shuffle=True -): - """Data loader for bi-lstm, not bert.""" - if task_name == "chnsenticorp": - train_ds, dev_ds = load_dataset(task_name, splits=["train", "dev"]) - else: - train_ds, dev_ds = load_dataset("glue", task_name, splits=["train", "dev"]) - if task_name == "chnsenticorp": - vocab = Vocab.load_vocabulary( - vocab_path, - unk_token="[UNK]", - pad_token="[PAD]", - bos_token=None, - eos_token=None, - ) - pad_val = vocab["[PAD]"] - - else: - vocab = BertTokenizer.from_pretrained(model_name) - pad_val = vocab.pad_token_id - - trans_fn = partial( - convert_example_for_lstm, task_name=task_name, vocab=vocab, max_seq_length=max_seq_length, is_test=False - ) - - batchify_fn = lambda samples, fn=Tuple( - Pad(axis=0, pad_val=pad_val), Stack(dtype="int64"), Stack(dtype="int64") # input_ids # seq len # label - ): fn(samples) - - train_ds = train_ds.map(trans_fn, lazy=True) - dev_ds = dev_ds.map(trans_fn, lazy=True) - - train_data_loader, dev_data_loader = create_dataloader(train_ds, dev_ds, batch_size, batchify_fn, shuffle) - - return train_data_loader, dev_data_loader - - -def create_distill_loader( - task_name, - model_name, - vocab_path, - batch_size=64, - max_seq_length=128, - shuffle=True, - n_iter=20, - whole_word_mask=False, - seed=0, -): - """ - Returns batch data for bert and small model. - Bert and small model have different input representations. 
- """ - tokenizer = BertTokenizer.from_pretrained(model_name) - if task_name == "chnsenticorp": - train_ds, dev_ds = load_dataset(task_name, splits=["train", "dev"]) - vocab = Vocab.load_vocabulary( - vocab_path, - unk_token="[UNK]", - pad_token="[PAD]", - bos_token=None, - eos_token=None, - ) - pad_val = vocab["[PAD]"] - data_aug_fn = partial( - apply_data_augmentation_for_cn, tokenizer=tokenizer, vocab=vocab, n_iter=n_iter, seed=seed - ) - else: - train_ds, dev_ds = load_dataset("glue", task_name, splits=["train", "dev"]) - vocab = tokenizer - pad_val = tokenizer.pad_token_id - data_aug_fn = partial( - apply_data_augmentation, - task_name=task_name, - tokenizer=tokenizer, - n_iter=n_iter, - whole_word_mask=whole_word_mask, - seed=seed, - ) - train_ds = train_ds.map(data_aug_fn, batched=True) - print("Data augmentation has been applied.") - - trans_fn = partial( - convert_example_for_distill, - task_name=task_name, - tokenizer=tokenizer, - label_list=train_ds.label_list, - max_seq_length=max_seq_length, - vocab=vocab, - ) - - trans_fn_dev = partial( - convert_example_for_distill, - task_name=task_name, - tokenizer=tokenizer, - label_list=train_ds.label_list, - max_seq_length=max_seq_length, - vocab=vocab, - is_tokenized=False, - ) - - if task_name == "qqp": - batchify_fn = lambda samples, fn=Tuple( - Pad(axis=0, pad_val=tokenizer.pad_token_id), # bert input - Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # bert segment - Pad(axis=0, pad_val=pad_val), # small input_ids - Stack(dtype="int64"), # small seq len - Pad(axis=0, pad_val=pad_val), # small input_ids - Stack(dtype="int64"), # small seq len - Stack(dtype="int64"), # small label - ): fn(samples) - else: - batchify_fn = lambda samples, fn=Tuple( - Pad(axis=0, pad_val=tokenizer.pad_token_id), # bert input - Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # bert segment - Pad(axis=0, pad_val=pad_val), # small input_ids - Stack(dtype="int64"), # small seq len - Stack(dtype="int64"), # small label - ): fn(samples) - - train_ds = train_ds.map(trans_fn, lazy=True) - dev_ds = dev_ds.map(trans_fn_dev, lazy=True) - train_data_loader, dev_data_loader = create_dataloader(train_ds, dev_ds, batch_size, batchify_fn, shuffle) - return train_data_loader, dev_data_loader - - -def create_pair_loader_for_small_model( - task_name, model_name, vocab_path, batch_size=64, max_seq_length=128, shuffle=True, is_test=False -): - """Only support QQP now.""" - tokenizer = BertTokenizer.from_pretrained(model_name) - train_ds, dev_ds = load_dataset("glue", task_name, splits=["train", "dev"]) - vocab = Vocab.load_vocabulary( - vocab_path, - unk_token="[UNK]", - pad_token="[PAD]", - bos_token=None, - eos_token=None, - ) - - trans_func = partial( - convert_pair_example, - task_name=task_name, - vocab=tokenizer, - is_tokenized=False, - max_seq_length=max_seq_length, - is_test=is_test, - ) - train_ds = train_ds.map(trans_func, lazy=True) - dev_ds = dev_ds.map(trans_func, lazy=True) - - batchify_fn = lambda samples, fn=Tuple( - Pad(axis=0, pad_val=vocab["[PAD]"]), # input - Stack(), # length - Pad(axis=0, pad_val=vocab["[PAD]"]), # input - Stack(), # length - Stack(dtype="int64" if train_ds.label_list else "float32"), # label - ): fn(samples) - - train_data_loader, dev_data_loader = create_dataloader(train_ds, dev_ds, batch_size, batchify_fn, shuffle) - return train_data_loader, dev_data_loader - - -def create_dataloader(train_ds, dev_ds, batch_size, batchify_fn, shuffle=True): - train_batch_sampler = paddle.io.DistributedBatchSampler(train_ds, 
batch_size=batch_size, shuffle=shuffle) - - dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=batch_size, shuffle=False) - - train_data_loader = paddle.io.DataLoader( - dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True - ) - - dev_data_loader = paddle.io.DataLoader( - dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=batchify_fn, num_workers=0, return_list=True - ) - - return train_data_loader, dev_data_loader diff --git a/examples/model_compression/distill_lstm/small.py b/examples/model_compression/distill_lstm/small.py deleted file mode 100644 index 92681bd03991..000000000000 --- a/examples/model_compression/distill_lstm/small.py +++ /dev/null @@ -1,211 +0,0 @@ -# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserve. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import os -import time - -import paddle -import paddle.nn as nn -import paddle.nn.initializer as I -from args import parse_args -from data import create_data_loader_for_small_model, create_pair_loader_for_small_model -from paddle.metric import Accuracy - -from paddlenlp.embeddings import TokenEmbedding -from paddlenlp.metrics import AccuracyAndF1 - -METRIC_CLASSES = {"sst-2": Accuracy, "qqp": AccuracyAndF1, "chnsenticorp": Accuracy} - - -class BiLSTM(nn.Layer): - def __init__( - self, - embed_dim, - hidden_size, - vocab_size, - output_dim, - vocab_path, - padding_idx=0, - num_layers=1, - dropout_prob=0.0, - init_scale=0.1, - embedding_name=None, - ): - super(BiLSTM, self).__init__() - if embedding_name is not None: - self.embedder = TokenEmbedding( - embedding_name, extended_vocab_path=vocab_path, keep_extended_vocab_only=True - ) - embed_dim = self.embedder.embedding_dim - else: - self.embedder = nn.Embedding(vocab_size, embed_dim, padding_idx) - - self.lstm = nn.LSTM(embed_dim, hidden_size, num_layers, "bidirectional", dropout=dropout_prob) - - self.fc = nn.Linear( - hidden_size * 2, - hidden_size, - weight_attr=paddle.ParamAttr(initializer=I.Uniform(low=-init_scale, high=init_scale)), - ) - - self.fc_1 = nn.Linear( - hidden_size * 8, - hidden_size, - weight_attr=paddle.ParamAttr(initializer=I.Uniform(low=-init_scale, high=init_scale)), - ) - - self.output_layer = nn.Linear( - hidden_size, - output_dim, - weight_attr=paddle.ParamAttr(initializer=I.Uniform(low=-init_scale, high=init_scale)), - ) - - def forward(self, x_1, seq_len_1, x_2=None, seq_len_2=None): - x_embed_1 = self.embedder(x_1) - lstm_out_1, (hidden_1, _) = self.lstm(x_embed_1, sequence_length=seq_len_1) - out_1 = paddle.concat((hidden_1[-2, :, :], hidden_1[-1, :, :]), axis=1) - if x_2 is not None: - x_embed_2 = self.embedder(x_2) - lstm_out_2, (hidden_2, _) = self.lstm(x_embed_2, sequence_length=seq_len_2) - out_2 = paddle.concat((hidden_2[-2, :, :], hidden_2[-1, :, :]), axis=1) - out = paddle.concat(x=[out_1, out_2, out_1 + out_2, paddle.abs(out_1 - out_2)], axis=1) - out = paddle.tanh(self.fc_1(out)) - else: - out = paddle.tanh(self.fc(out_1)) - logits = 
self.output_layer(out) - - return logits - - -def evaluate(task_name, model, loss_fct, metric, data_loader): - model.eval() - metric.reset() - for batch in data_loader: - if task_name == "qqp": - input_ids_1, seq_len_1, input_ids_2, seq_len_2, labels = batch - logits = model(input_ids_1, seq_len_1, input_ids_2, seq_len_2) - else: - input_ids, seq_len, labels = batch - logits = model(input_ids, seq_len) - loss = loss_fct(logits, labels) - correct = metric.compute(logits, labels) - metric.update(correct) - res = metric.accumulate() - if isinstance(metric, AccuracyAndF1): - print( - "eval loss: %f, acc: %s, precision: %s, recall: %s, f1: %s, acc and f1: %s, " - % ( - loss.numpy(), - res[0], - res[1], - res[2], - res[3], - res[4], - ), - end="", - ) - else: - print("eval loss: %f, acc: %s, " % (loss.numpy(), res), end="") - model.train() - return res[0] if isinstance(metric, AccuracyAndF1) else res - - -def do_train(args): - paddle.set_device(args.device) - metric_class = METRIC_CLASSES[args.task_name] - metric = metric_class() - if args.task_name == "qqp": - train_data_loader, dev_data_loader = create_pair_loader_for_small_model( - task_name=args.task_name, - vocab_path=args.vocab_path, - model_name=args.model_name, - batch_size=args.batch_size, - ) - else: - train_data_loader, dev_data_loader = create_data_loader_for_small_model( - task_name=args.task_name, - vocab_path=args.vocab_path, - model_name=args.model_name if args.task_name == "sst-2" else None, - batch_size=args.batch_size, - ) - - model = BiLSTM( - args.emb_dim, - args.hidden_size, - args.vocab_size, - args.output_dim, - args.vocab_path, - args.padding_idx, - args.num_layers, - args.dropout_prob, - args.init_scale, - args.embedding_name, - ) - - loss_fct = nn.CrossEntropyLoss() - - if args.optimizer == "adadelta": - optimizer = paddle.optimizer.Adadelta(learning_rate=args.lr, rho=0.95, parameters=model.parameters()) - else: - optimizer = paddle.optimizer.Adam(learning_rate=args.lr, parameters=model.parameters()) - - if args.init_from_ckpt: - model.set_state_dict(paddle.load(args.init_from_ckpt + ".pdparams")) - optimizer.set_state_dict(paddle.load(args.init_from_ckpt + ".pdopt")) - print("Loaded checkpoint from %s" % args.init_from_ckpt) - - global_step = 0 - tic_train = time.time() - for epoch in range(args.max_epoch): - for i, batch in enumerate(train_data_loader): - global_step += 1 - if args.task_name == "qqp": - input_ids_1, seq_len_1, input_ids_2, seq_len_2, labels = batch - logits = model(input_ids_1, seq_len_1, input_ids_2, seq_len_2) - else: - input_ids, seq_len, labels = batch - logits = model(input_ids, seq_len) - - loss = loss_fct(logits, labels) - - loss.backward() - optimizer.step() - optimizer.clear_grad() - - if global_step % args.log_freq == 0: - with paddle.no_grad(): - print( - "global step %d, epoch: %d, batch: %d, loss: %f, speed: %.4f step/s" - % (global_step, epoch, i, loss, args.log_freq / (time.time() - tic_train)) - ) - tic_eval = time.time() - - evaluate(args.task_name, model, loss_fct, metric, dev_data_loader) - print("eval done total : %s s" % (time.time() - tic_eval)) - tic_train = time.time() - - if global_step % args.save_steps == 0: - paddle.save( - model.state_dict(), os.path.join(args.output_dir, "step_" + str(global_step) + ".pdparams") - ) - paddle.save( - optimizer.state_dict(), os.path.join(args.output_dir, "step_" + str(global_step) + ".pdopt") - ) - - -if __name__ == "__main__": - args = parse_args() - print(args) - paddle.seed(args.seed) - do_train(args) diff --git 
a/examples/model_compression/distill_lstm/utils.py b/examples/model_compression/distill_lstm/utils.py deleted file mode 100644 index 0243d97a5a64..000000000000 --- a/examples/model_compression/distill_lstm/utils.py +++ /dev/null @@ -1,117 +0,0 @@ -# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserve. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import jieba - -import numpy as np - - -def convert_example_for_lstm(example, task_name, vocab, is_tokenized=False, max_seq_length=128, is_test=False): - """convert a example for lstm's input""" - input_ids = [] - if task_name == "chnsenticorp": - if is_tokenized: - lstm_tokens = example["lstm_tokens"][:max_seq_length] - input_ids = [vocab[token] for token in lstm_tokens] - else: - tokenized_text = list(jieba.cut(example["text"]))[:max_seq_length] - input_ids = vocab[tokenized_text] - else: - if is_tokenized: - tokens = example["sentence"][:max_seq_length] - else: - tokens = vocab.tokenize(example["sentence"])[:max_seq_length] - input_ids = vocab.convert_tokens_to_ids(tokens) - - valid_length = np.array(len(input_ids), dtype="int64") - if not is_test: - label = ( - np.array(example["label"], dtype="int64") - if task_name == "chnsenticorp" - else np.array(example["labels"], dtype="int64") - ) - return input_ids, valid_length, label - return input_ids, valid_length - - -def convert_pair_example(example, task_name, vocab, is_tokenized=True, max_seq_length=128, is_test=False): - seq1 = convert_example_for_lstm( - {"sentence": example["sentence1"], "labels": example["labels"]}, - task_name, - vocab, - is_tokenized, - max_seq_length, - is_test, - )[:2] - - seq2 = convert_example_for_lstm( - {"sentence": example["sentence2"], "labels": example["labels"]}, - task_name, - vocab, - is_tokenized, - max_seq_length, - is_test, - ) - pair_features = seq1 + seq2 - - return pair_features - - -def convert_example_for_distill( - example, task_name, tokenizer, label_list, max_seq_length, vocab, is_tokenized=True, is_test=False -): - bert_features = convert_example_for_bert( - example, - tokenizer=tokenizer, - label_list=label_list, - is_tokenized=is_tokenized, - max_seq_length=max_seq_length, - is_test=is_test, - ) - if task_name == "qqp": - small_features = convert_pair_example(example, task_name, vocab, is_tokenized, max_seq_length, is_test) - else: - small_features = convert_example_for_lstm(example, task_name, vocab, is_tokenized, max_seq_length, is_test) - return bert_features[:2] + small_features - - -def convert_example_for_bert(example, tokenizer, label_list, is_tokenized=False, max_seq_length=512, is_test=False): - """convert a example for bert's input""" - if not is_test: - # `label_list == None` is for regression task - label_dtype = "int64" if label_list else "float32" - # Get the label - label = example["labels"] if "labels" in example else example["label"] - label = np.array([label], dtype=label_dtype) - # Convert raw text to feature - if "sentence1" in example: - example = tokenizer( - example["sentence1"], - text_pair=example["sentence2"], - 
-            max_seq_len=max_seq_length,
-            is_split_into_words=is_tokenized,
-        )
-    else:
-        if "sentence" in example:
-            text = example["sentence"]
-        elif "text" in example:
-            text = example["text"]
-        else:
-            text = example["bert_tokens"]
-        example = tokenizer(text, max_seq_len=max_seq_length, is_split_into_words=is_tokenized)
-
-    if not is_test:
-        return example["input_ids"], example["token_type_ids"], label
-    else:
-        return example["input_ids"], example["token_type_ids"]
diff --git a/examples/model_interpretation/README.md b/examples/model_interpretation/README.md
deleted file mode 100644
index ab3adf4eddab..000000000000
--- a/examples/model_interpretation/README.md
+++ /dev/null
@@ -1,255 +0,0 @@
-NLP Interpretability Evaluation
-===
-Deep learning models have achieved great success on many NLP tasks, yet they are often used as black boxes: their internal prediction mechanism is opaque to the user. This undermines trust in model predictions and raises the bar for deployment, especially in sensitive domains such as healthcare and law. Moreover, when a model performs poorly or lacks robustness, the lack of insight into its internals makes it hard to improve. Recently the interpretability of deep learning models has been drawing more and more attention, but the evaluation of interpretability is still immature. This module provides evaluation data for 3 NLP tasks together with the corresponding metrics, aiming to evaluate model interpretability. It covers the following:
-
-  1. A more complete interpretability evaluation framework, with evaluation data and the corresponding metrics.
-  2. Three typical rationale-extraction methods, based on attention (attention-based), gradients (gradient-based) and linear surrogate models (LIME), verified experimentally on common architectures such as LSTM and Transformer (RoBERTa-base and RoBERTa-large), so that the effects of architectural complexity and of parameter scale on interpretability can be examined separately.
-  3. A fairly complete evaluation report per model, covering the model's own accuracy as well as its results on the 3 interpretability metrics.
-
-Interpretability evaluation framework
----
-### Evaluation data
-We provide Chinese and English datasets for three NLP tasks: sentiment analysis, similarity matching, and reading comprehension. For each dataset, rationale data and perturbed data are annotated manually.
-
-    Rationale data: the evidence (from a human point of view) that the model's prediction should rely on, made up of a subset of the input words. Our annotation standard covers 3 dimensions: sufficiency, concision, and understandability.
-    Perturbed data: built to evaluate how consistent the rationales stay under perturbation. The perturbations are constructed from the angles of robustness, sensitivity, and generalizability; data built under the "sensitivity" and "generalizability" dimensions may change the rationale.
-
-#### Sample data (from the Chinese sentiment analysis task):
- -#### 数据规模 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
任务英文模型中文模型
规模证据平均长度比例证据平均数量规模证据平均长度比例证据平均数量
情感分析1,49919.20%2.11,64630.10%1.4
相似度任务1,65952.20%1.01,62970.50%1.0
阅读理解1,50710.20%1.01,7629.60%1.0
- -### 评估指标 -__合理性__:评估模型预测依赖的证据与人工标注证据的拟合度,我们这里使用macro-F1作为评估指标,其中模型预测依赖证据可以由本模块提供的证据分析方法(位于/model_interpretation/task/目录下)给出。
-
-$$P_i=\frac{|S_i^p\cap S_i^g|}{|S_i^p|},\qquad R_i=\frac{|S_i^p\cap S_i^g|}{|S_i^g|},\qquad \text{macro-F1}=\frac{1}{N}\sum_{i=1}^{N}\frac{2P_iR_i}{P_i+R_i}$$
-
-where $S_i^p$ and $S_i^g$ are the predicted rationale and the human-annotated rationale for the $i$-th input, and $N$ is the number of instances in the dataset.<br>
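For reference, a minimal sketch of this computation, assuming rationales are given as sets of token indices into `sent_token` (the function name `plausibility_macro_f1` is illustrative, not part of the module):

```
# Minimal sketch of the plausibility metric (token-level macro-F1).
# Each rationale is a set of token indices into `sent_token`;
# `plausibility_macro_f1` is an illustrative name, not part of this module.


def plausibility_macro_f1(pred_rationales, gold_rationales):
    """Average token-level F1 between predicted and gold rationales."""
    f1_sum = 0.0
    for pred, gold in zip(pred_rationales, gold_rationales):
        overlap = len(set(pred) & set(gold))
        if overlap == 0:
            continue  # F1 is 0 when there is no overlap
        precision = overlap / len(set(pred))
        recall = overlap / len(set(gold))
        f1_sum += 2 * precision * recall / (precision + recall)
    return f1_sum / len(gold_rationales)


# Two instances: predicted vs. gold token indices.
print(plausibility_macro_f1([{0, 1, 4}, {2}], [{1, 4}, {2, 3}]))
```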
-
-__Consistency__: measures how consistently words are ranked by importance within an (original input, perturbed input) pair. A rationale-extraction method assigns every input word an importance score, which induces a ranking over all words of the input. We use MAP (mean average precision), a metric from search ranking, to compute the consistency of the two rankings. Two ways of computing MAP are given in the two formulas below:<br>
-Formula 1 (currently used):<br>
-
-$$\mathrm{MAP}(X^o, X^d)=\frac{1}{|X^d|}\sum_{j=1}^{|X^d|}\frac{\sum_{k=1}^{j} G\left(x_k^d,\, X_{1:j}^o\right)}{j}$$
-
-Formula 2:<br>
-
-$$\mathrm{MAP}(X^o, X^d)=\frac{1}{|X^d|}\sum_{j=1}^{|X^d|} G\left(x_j^d,\, X_{1:j}^o\right)\cdot\frac{\sum_{k=1}^{j} G\left(x_k^d,\, X_{1:j}^o\right)}{j}$$
-
-where $X^o$ and $X^d$ are the word rankings (by importance) of the original input and of the perturbed input, $|X^d|$ is the number of words in $X^d$, and $X_{1:j}^o$ denotes the $j$ most important words of $X^o$. The function $G(x, Y)$ checks whether word $x$ appears in list $Y$: $G(x, Y)=1$ if it does, and $0$ otherwise. A higher MAP means the two rankings agree more closely.<br>
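For reference, a minimal sketch of Formula 1, assuming each ranking is a list of words sorted by decreasing importance (`map_consistency` is an illustrative name):

```
# Minimal sketch of consistency (Formula 1). Rankings are lists of words
# sorted by decreasing importance; `map_consistency` is an illustrative name.


def map_consistency(rank_ori, rank_disturb):
    """MAP between the original and the perturbed importance rankings."""
    total = 0.0
    for j in range(1, len(rank_disturb) + 1):
        top_j_ori = set(rank_ori[:j])
        # Fraction of the perturbed top-j words also in the original top-j.
        hits = sum(1 for w in rank_disturb[:j] if w in top_j_ori)
        total += hits / j
    return total / len(rank_disturb)


print(map_consistency(["good", "movie", "very"], ["good", "very", "movie"]))
```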
-
-__Faithfulness__: measures whether the extracted rationale is faithful, i.e. whether the model really bases its prediction on the rationale it reports. We evaluate it from two angles: sufficiency and comprehensiveness. Sufficiency means the rationale contains all the information the prediction needs (i.e. $y_{r_i}=y_{x_i}$, where $r_i$ is the rationale of input $x_i$ and $y_x$ is the model's prediction on input $x$). Comprehensiveness means the rationale leaves nothing predictive behind: once it is removed, the prediction should change (i.e. $y_{x_i\backslash r_i}\neq y_{x_i}$, where $x_i\backslash r_i$ is input $x_i$ with rationale $r_i$ removed). Based on these two dimensions we propose a new metric, New-P, computed as follows:<br>
-
-$$P_{suf}=\frac{1}{N}\sum_{i=1}^{N}\mathbb{I}\left(y_{r_i}=y_{x_i}\right),\qquad P_{com}=\frac{1}{N}\sum_{i=1}^{N}\mathbb{I}\left(y_{x_i\backslash r_i}\neq y_{x_i}\right)$$
-
-$$\text{New-P}=\frac{2\cdot P_{suf}\cdot P_{com}}{P_{suf}+P_{com}}$$
-
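For reference, a minimal sketch of New-P following the formulas above, assuming each instance records the model's prediction on the full input, on the rationale alone, and on the input with the rationale removed (`new_p` is an illustrative name):

```
# Minimal sketch of New-P, matching the formulas above. Each element of
# `results` is (y_x, y_r, y_x_minus_r): the prediction on the full input,
# on the rationale alone, and on the input with the rationale removed.
# `new_p` is an illustrative name, not part of this module.


def new_p(results):
    n = len(results)
    p_suf = sum(1 for y_x, y_r, _ in results if y_r == y_x) / n
    p_com = sum(1 for y_x, _, y_rest in results if y_rest != y_x) / n
    if p_suf + p_com == 0:
        return 0.0
    return 2 * p_suf * p_com / (p_suf + p_com)


# Three instances with labels (full, rationale-only, rationale-removed).
print(new_p([(1, 1, 0), (0, 0, 0), (1, 1, 0)]))
```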
-
-### Rationale-extraction methods
-Rationale extraction, as the name suggests, extracts from the input the words that are critical to the model's prediction; such techniques are also known as post-hoc explanation methods.
-This module provides 3 typical rationale-extraction methods: attention-based, gradient-based, and linear-model-based explanation methods:<br>
-
-Attention-based ([Jain and Wallace, 2019](https://arxiv.org/pdf/1902.10186.pdf)):
-
-    Uses attention scores as word importance. How the scores are obtained depends on the model architecture; extraction code is provided for both the LSTM and the Transformer setups, see the saliency_map directory under each task.
-
-Gradient-based ([Sundararajan et al., 2017](https://arxiv.org/pdf/1703.01365.pdf)):
-
-    Derives each word's importance from gradients. We use the integrated-gradients formulation (a sketch follows this list); for details see the saliency_map directory or the paper [Axiomatic attribution for deep networks](https://arxiv.org/pdf/1703.01365.pdf).
-
-Linear-based ([Ribeiro et al., 2016](https://arxiv.org/pdf/1602.04938.pdf)):
-
-    Fits a linear model that locally imitates the model under test; the weight the linear model learns for a word is taken as that word's importance for the prediction. See ["Why should I trust you?" Explaining the predictions of any classifier](https://arxiv.org/pdf/1602.04938.pdf) for details.
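For reference, a self-contained sketch of the integrated-gradients idea on a toy Paddle model; the tiny classifier and every name here are illustrative, the actual extraction code lives under each task's saliency_map directory:

```
# Toy integrated-gradients sketch in Paddle. The tiny classifier and all
# names are illustrative; the real code lives under saliency_map.
import paddle

vocab_size, emb_dim, num_classes = 100, 8, 2
embedder = paddle.nn.Embedding(vocab_size, emb_dim)
classifier = paddle.nn.Linear(emb_dim, num_classes)


def forward_from_embeddings(emb):
    # Mean-pool the word embeddings, then classify.
    return classifier(emb.mean(axis=1))


token_ids = paddle.to_tensor([[5, 17, 42]])
emb = embedder(token_ids).detach()   # (1, seq_len, emb_dim), held constant
baseline = paddle.zeros_like(emb)    # all-zero baseline embedding

steps, target = 20, 1
grads_sum = paddle.zeros_like(emb)
for k in range(1, steps + 1):
    alpha = k / steps
    # Interpolate between the baseline and the actual embeddings.
    point = (baseline + alpha * (emb - baseline)).detach()
    point.stop_gradient = False
    score = forward_from_embeddings(point)[0, target]
    grads_sum += paddle.grad(score, point)[0]

# Riemann approximation of the path integral; summing over the embedding
# dimension gives one importance score per input word.
ig = ((emb - baseline) * grads_sum / steps).sum(axis=-1)
print(ig)
```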
-
-### The evaluated models for the three tasks
-To verify how model complexity and parameter scale affect interpretability, for each task we provide both an LSTM-based model (a simple architecture) and Transformer-based pre-trained models (a complex architecture); for the pre-trained models, base and large versions are both provided.<br>
-Model code location: /model_interpretation/task/{task}/, where {task} is one of ["senti", "similarity", "mrc"]; senti stands for sentiment analysis, similarity for similarity matching, and mrc for reading comprehension.<br>
-How to run the models and the required environment are described under "Using the module" below.
-
-
-## Using the module
-### Environment
-The code requires a Linux host, Python 3.8 (recommended; older versions are untested) and PaddlePaddle 2.1 or later.
-
-### Recommended environment
-
-* CentOS 7.5
-* Python 3.8.12
-* PaddlePaddle 2.1.0
-* PaddleNLP 2.2.4
-
-Beyond that, GPU-capable hardware is required.
-
-### PaddlePaddle
-
-The GPU build of PaddlePaddle is required.
-
-```
-# GPU build
-pip3 install paddlepaddle-gpu
-```
-
-For more on installing and using PaddlePaddle, see the [official docs](https://www.paddlepaddle.org.cn/#quick-start).
-
-### Third-party Python libraries
-Besides PaddlePaddle and its dependencies, the code relies on other third-party Python libraries, listed in requirements.txt at the repository root.
-
-They can be installed in one go with pip:
-
-```pip3 install -r requirements.txt```
-
-## Data preparation
-### Model training data
-#### Sentiment analysis:
-
-ChnSentiCorp is recommended for Chinese and SST-2 for English; the Chinese and English sentiment models shipped with this module are trained on them. To train on different data, edit /model_interpretation/task/senti/pretrained_models/train.py (RoBERTa) and /model_interpretation/task/senti/rnn/train.py (LSTM).
-
-[//]: # (Datasets are cached under /home/work/.paddlenlp/datasets/)
-
-#### Similarity:
-
-LCQMC is recommended for Chinese and QQP for English; the similarity models shipped with this module are trained on them. To train on different data, edit /model_interpretation/task/similarity/pretrained_models/train_pointwise.py (RoBERTa) and /model_interpretation/task/similarity/simnet/train.py (LSTM).
-
-#### Reading comprehension (Chinese and English):
-
-[DuReader_Checklist](https://dataset-bj.cdn.bcebos.com/lic2021/dureader_checklist.dataset.tar.gz) is recommended for Chinese and [SQuAD 2.0](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json) for English. Place the reading-comprehension training data under /model_interpretation/task/mrc/data.
-
-### Downloading the pre-trained models
-
-Model files are cached automatically by the paddlenlp framework.
-
-### Other downloads
-Run download.sh to download them automatically.
-
-### Evaluation data
-Sample evaluation data lives under /model_interpretation/data/; each line is one JSON record.
-#### Sentiment analysis record format:
-    id: record id, the key identifying this record;
-    context: the raw text;
-    sent_token: the standard tokenization of the raw text; note that gold rationales are annotated against this tokenization, so predicted rationales must follow it too;
-    sample_type: whether the record is original (ori) or perturbed (disturb);
-    rel_ids: ids of the perturbed records linked to this original record (original records only);
-
-#### Similarity record format:
-    id: record id, the key identifying this record;
-    query (sentence1 in English data): the raw text of sentence 1;
-    title (sentence2 in English data): the raw text of sentence 2;
-    text_q_seg: the standard tokenization of sentence 1 (gold rationales follow it; predictions must too);
-    text_t_seg: the standard tokenization of sentence 2 (gold rationales follow it; predictions must too);
-    sample_type: whether the record is original (ori) or perturbed (disturb);
-    rel_ids: ids of the perturbed records linked to this original record (original records only);
-
-#### Reading comprehension record format:
-    id: record id, the key identifying this record;
-    title: article title;
-    context: article body;
-    question: the question about the article;
-    sent_token: the standard tokenization of the raw text (gold rationales follow it; predictions must too);
-    sample_type: whether the record is original (ori) or perturbed (disturb);
-    rel_ids: ids of the perturbed records linked to this original record (original records only);
-## Running the models
-### Prediction:
-
-    model_interpretation/task/{task}/run_inter_all.sh (generates all results)
-    model_interpretation/task/{task}/run_inter.sh (generates the result for a single configuration; the configuration selects the evaluated model, the rationale-extraction method and the language)
-
-(Note: {task} is one of ["senti", "similarity", "mrc"]; senti stands for sentiment analysis, similarity for similarity matching, and mrc for reading comprehension.)
-
-### Rationale extraction:
-    cd model_interpretation/rationale_extraction
-    ./generate.sh
-
-### Interpretability evaluation:
-#### Plausibility:
-    model_interpretation/evaluation/plausibility/run_f1.sh
-#### Consistency:
-    model_interpretation/evaluation/consistency/run_map.sh
-#### Faithfulness:
-    model_interpretation/evaluation/faithfulness/run_newp.sh
-
-### Evaluation report
-A sample report for Chinese sentiment analysis:
-
-| Model + extraction method | Acc | Macro-F1 | MAP | New_P |
-| :--- | ---: | ---: | ---: | ---: |
-| LSTM + IG | 56.8 | 36.8 | 59.8 | 91.4 |
-| RoBERTa-base + IG | 62.4 | 36.4 | 48.7 | 48.9 |
-| RoBERTa-large + IG | 65.3 | 38.3 | 41.9 | 37.8 |
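To make the record formats above concrete, here is a small sketch that loads one JSON-lines evaluation file and pairs each original record with its perturbed counterparts through `rel_ids`; the path and names are illustrative:

```
# Sketch: load an evaluation file (one JSON record per line) and pair each
# original record with its perturbed counterparts via rel_ids. The path is
# illustrative; see /model_interpretation/data/ for the actual files.
import json


def load_pairs(path):
    records = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            records[rec["id"]] = rec
    pairs = []
    for rec in records.values():
        if rec.get("sample_type") == "ori":
            for rel_id in rec.get("rel_ids", []):
                if rel_id in records:
                    pairs.append((rec, records[rel_id]))
    return pairs


for ori, disturb in load_pairs("model_interpretation/data/mrc_ch"):
    print(ori["question"], "->", disturb["question"])
```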
diff --git a/examples/model_interpretation/data/mrc_ch b/examples/model_interpretation/data/mrc_ch deleted file mode 100644 index 09a10298751f..000000000000 --- a/examples/model_interpretation/data/mrc_ch +++ /dev/null @@ -1,100 +0,0 @@ -{"id": 1, "title": "地瓜是红薯吗", "context": "地瓜不是红薯。地瓜一般生吃或者凉拌,外形是纺锤型的,有明显的瓣状结构,内里的肉是白色的,有清淡的药香味,生吃又脆又甜,常食用可以预防肝癌、胃癌,营养价值非常高。红薯是粗粮,也叫番薯山芋。它是一种属管状花目,旋花科一年生的草本植物,富含丰富的矿物质和维生素,而且非常耐饱。", "question": "地瓜和红薯一样吗", "sent_token": ["地", "瓜", "不", "是", "红", "薯", "。", "地", "瓜", "一", "般", "生", "吃", "或", "者", "凉", "拌", ",", "外", "形", "是", "纺", "锤", "型", "的", ",", "有", "明", "显", "的", "瓣", "状", "结", "构", ",", "内", "里", "的", "肉", "是", "白", "色", "的", ",", "有", "清", "淡", "的", "药", "香", "味", ",", "生", "吃", "又", "脆", "又", "甜", ",", "常", "食", "用", "可", "以", "预", "防", "肝", "癌", "、", "胃", "癌", ",", "营", "养", "价", "值", "非", "常", "高", "。", "红", "薯", "是", "粗", "粮", ",", "也", "叫", "番", "薯", "山", "芋", "。", "它", "是", "一", "种", "属", "管", "状", "花", "目", ",", "旋", "花", "科", "一", "年", "生", "的", "草", "本", "植", "物", ",", "富", "含", "丰", "富", "的", "矿", "物", "质", "和", "维", "生", "素", ",", "而", "且", "非", "常", "耐", "饱", "。", "地", "瓜", "是", "红", "薯", "吗"], "sample_type": "ori", "rel_ids": [1763]} -{"id": 5, "title": "已满多少岁的人犯贩卖毒品罪应负刑事责任", "context": "根据《刑法》第十七条:已满十六周岁的人犯罪,应当负刑事责任。已满十四周岁不满十六周岁的人,犯故意杀人、故意伤害致人重伤或者死亡、强奸、抢劫、贩卖毒品、放火、爆炸、投放危险物质罪的,应当负刑事责任。", "question": "已满几周岁的人贩卖毒品罪应当负刑事责任", "sent_token": ["根", "据", "《", "刑", "法", "》", "第", "十", "七", "条", ":", "已", "满", "十", "六", "周", "岁", "的", "人", "犯", "罪", ",", "应", "当", "负", "刑", "事", "责", "任", "。", "已", "满", "十", "四", "周", "岁", "不", "满", "十", "六", "周", "岁", "的", "人", ",", "犯", "故", "意", "杀", "人", "、", "故", "意", "伤", "害", "致", "人", "重", "伤", "或", "者", "死", "亡", "、", "强", "奸", "、", "抢", "劫", "、", "贩", "卖", "毒", "品", "、", "放", "火", "、", "爆", "炸", "、", "投", "放", "危", "险", "物", "质", "罪", "的", ",", "应", "当", "负", "刑", "事", "责", "任", "。", "已", "满", "多", "少", "岁", "的", "人", "犯", "贩", "卖", "毒", "品", "罪", "应", "负", "刑", "事", "责", "任"], "sample_type": "ori", "rel_ids": [1767]} -{"id": 10, "title": "读研跟考研有什么区别", "context": "考研和读研的区别在于概念和意义不同。考研是指考生通过考试来得到研究生的入学资格,而考生并不是硕士研究生;而读研是指学生在高校攻读硕士研究生的过程,学生身份已经是硕士研究生。这二者并不等同,而是有先后关系,也就是说考生只有通过考研,才能成为硕士研究生,然后在规定的学习时间内读研。", "question": "考研跟读研有什么区别", "sent_token": ["考", "研", "和", "读", "研", "的", "区", "别", "在", "于", "概", "念", "和", "意", "义", "不", "同", "。", "考", "研", "是", "指", "考", "生", "通", "过", "考", "试", "来", "得", "到", "研", "究", "生", "的", "入", "学", "资", "格", ",", "而", "考", "生", "并", "不", "是", "硕", "士", "研", "究", "生", ";", "而", "读", "研", "是", "指", "学", "生", "在", "高", "校", "攻", "读", "硕", "士", "研", "究", "生", "的", "过", "程", ",", "学", "生", "身", "份", "已", "经", "是", "硕", "士", "研", "究", "生", "。", "这", "二", "者", "并", "不", "等", "同", ",", "而", "是", "有", "先", "后", "关", "系", ",", "也", "就", "是", "说", "考", "生", "只", "有", "通", "过", "考", "研", ",", "才", "能", "成", "为", "硕", "士", "研", "究", "生", ",", "然", "后", "在", "规", "定", "的", "学", "习", "时", "间", "内", "读", "研", "。", "读", "研", "跟", "考", "研", "有", "什", "么", "区", "别"], "sample_type": "ori", "rel_ids": [1772]} -{"id": 12, "title": "多效唑能和磷酸二氢钾一起用吗", "context": "多效唑能和磷酸二氢钾一起用。多效唑是植物的生长调节剂,主要是控制作物疯长的。而磷酸二氢钾属于叶面肥,施用后可促使作物的叶色更加浓绿,根系发达,药效完全不同,也并不排斥,可以混合使用。不过要注意施用时要严格按照说明施加,不可过量,否则会阻碍生长。", "question": "磷酸二氢钾能和多效唑一起用吗", "sent_token": ["多", "效", "唑", "能", "和", "磷", "酸", "二", "氢", "钾", "一", "起", "用", "。", "多", "效", "唑", "是", "植", "物", "的", "生", "长", "调", "节", "剂", ",", "主", "要", "是", "控", "制", "作", "物", "疯", "长", "的", "。", "而", "磷", "酸", "二", "氢", "钾", "属", "于", "叶", "面", "肥", ",", "施", "用", "后", "可", "促", "使", 
"作", "物", "的", "叶", "色", "更", "加", "浓", "绿", ",", "根", "系", "发", "达", ",", "药", "效", "完", "全", "不", "同", ",", "也", "并", "不", "排", "斥", ",", "可", "以", "混", "合", "使", "用", "。", "不", "过", "要", "注", "意", "施", "用", "时", "要", "严", "格", "按", "照", "说", "明", "施", "加", ",", "不", "可", "过", "量", ",", "否", "则", "会", "阻", "碍", "生", "长", "。", "多", "效", "唑", "能", "和", "磷", "酸", "二", "氢", "钾", "一", "起", "用", "吗"], "sample_type": "ori", "rel_ids": [1774]} -{"id": 14, "title": "猫能吃蛋黄吗", "context": "猫咪是可以吃蛋黄的。这里特定煮熟的白水蛋,猫咪不能吃生鸡蛋,因为生鸡蛋中有细菌,常见的是沙门氏菌,容易引起猫腹泻脱水,而且饲喂猫咪最好的只饲喂蛋黄。虽然可以吃蛋黄,但是需要掌握好量,一般一周最多吃两三次就可了。蛋黄中也含有丰富的胆固醇,易引发猫咪患脂肪肝和高脂血病。", "question": "猫咪可以吃生蛋黄吗", "sent_token": ["猫", "咪", "是", "可", "以", "吃", "蛋", "黄", "的", "。", "这", "里", "特", "定", "煮", "熟", "的", "白", "水", "蛋", ",", "猫", "咪", "不", "能", "吃", "生", "鸡", "蛋", ",", "因", "为", "生", "鸡", "蛋", "中", "有", "细", "菌", ",", "常", "见", "的", "是", "沙", "门", "氏", "菌", ",", "容", "易", "引", "起", "猫", "腹", "泻", "脱", "水", ",", "而", "且", "饲", "喂", "猫", "咪", "最", "好", "的", "只", "饲", "喂", "蛋", "黄", "。", "虽", "然", "可", "以", "吃", "蛋", "黄", ",", "但", "是", "需", "要", "掌", "握", "好", "量", ",", "一", "般", "一", "周", "最", "多", "吃", "两", "三", "次", "就", "可", "了", "。", "蛋", "黄", "中", "也", "含", "有", "丰", "富", "的", "胆", "固", "醇", ",", "易", "引", "发", "猫", "咪", "患", "脂", "肪", "肝", "和", "高", "脂", "血", "病", "。", "猫", "能", "吃", "蛋", "黄", "吗"], "sample_type": "ori", "rel_ids": [1776]} -{"id": 18, "title": "最近深圳限行吗", "context": "现在由于疫情的影响,深圳市不限行的了,但是没有必要尽量还是少出门,出门也要做好一系列的防护措施才可以。因为虽然目前国内疫情形势有所缓和,但是这并不意味着疫情的结束,国外疫情形势还是很严峻的,境外输入案例较多。", "question": "最近深圳没有限行吗", "sent_token": ["现", "在", "由", "于", "疫", "情", "的", "影", "响", ",", "深", "圳", "市", "不", "限", "行", "的", "了", ",", "但", "是", "没", "有", "必", "要", "尽", "量", "还", "是", "少", "出", "门", ",", "出", "门", "也", "要", "做", "好", "一", "系", "列", "的", "防", "护", "措", "施", "才", "可", "以", "。", "因", "为", "虽", "然", "目", "前", "国", "内", "疫", "情", "形", "势", "有", "所", "缓", "和", ",", "但", "是", "这", "并", "不", "意", "味", "着", "疫", "情", "的", "结", "束", ",", "国", "外", "疫", "情", "形", "势", "还", "是", "很", "严", "峻", "的", ",", "境", "外", "输", "入", "案", "例", "较", "多", "。", "最", "近", "深", "圳", "限", "行", "吗"], "sample_type": "ori", "rel_ids": [1780]} -{"id": 19, "title": "合同签字不盖章有效吗", "context": "可能有效可能无效。只有签字没有公章的合同是否有法律效力要根据具体情况分析:如果合同是由单位的委托代理人在其权限范围内、或单位的法定代表人签的字,则合同有效。", "question": "合同不签字不盖章有效吗", "sent_token": ["可", "能", "有", "效", "可", "能", "无", "效", "。", "只", "有", "签", "字", "没", "有", "公", "章", "的", "合", "同", "是", "否", "有", "法", "律", "效", "力", "要", "根", "据", "具", "体", "情", "况", "分", "析", ":", "如", "果", "合", "同", "是", "由", "单", "位", "的", "委", "托", "代", "理", "人", "在", "其", "权", "限", "范", "围", "内", "、", "或", "单", "位", "的", "法", "定", "代", "表", "人", "签", "的", "字", ",", "则", "合", "同", "有", "效", "。", "合", "同", "签", "字", "不", "盖", "章", "有", "效", "吗"], "sample_type": "ori", "rel_ids": [1781]} -{"id": 27, "title": "", "context": "吴三桂(1612年-1678年10月2日),字长伯,一字月所,明朝辽东人,明末清初著名政治军事人物,吴周政权建立者吴周太祖。", "question": "吴三贵什么朝代", "sent_token": ["吴", "三", "桂", "(", "1612", "年", "-", "1678", "年", "10", "月", "2", "日", ")", ",", "字", "长", "伯", ",", "一", "字", "月", "所", ",", "明", "朝", "辽", "东", "人", ",", "明", "末", "清", "初", "著", "名", "政", "治", "军", "事", "人", "物", ",", "吴", "周", "政", "权", "建", "立", "者", "吴", "周", "太", "祖", "。"], "sample_type": "ori", "rel_ids": [1789]} -{"id": 34, "title": "狗狗为什么互相闻屁股", "context": "相互闻屁股是狗狗打招呼的一种方式。狗狗的嗅觉很敏感,它们可以用相互闻屁股来了解狗狗的配偶状况、饮食习惯等,因为狗狗的屁股后面有两个肛门腺,在肛门腺里面涵盖了很多的信息素。处在发情期的狗狗也会通过闻屁股来挑选自己的配偶。", "question": "狗狗为什么总是闻屁股", "sent_token": ["相", "互", "闻", "屁", "股", "是", "狗", "狗", "打", "招", 
"呼", "的", "一", "种", "方", "式", "。", "狗", "狗", "的", "嗅", "觉", "很", "敏", "感", ",", "它", "们", "可", "以", "用", "相", "互", "闻", "屁", "股", "来", "了", "解", "狗", "狗", "的", "配", "偶", "状", "况", "、", "饮", "食", "习", "惯", "等", ",", "因", "为", "狗", "狗", "的", "屁", "股", "后", "面", "有", "两", "个", "肛", "门", "腺", ",", "在", "肛", "门", "腺", "里", "面", "涵", "盖", "了", "很", "多", "的", "信", "息", "素", "。", "处", "在", "发", "情", "期", "的", "狗", "狗", "也", "会", "通", "过", "闻", "屁", "股", "来", "挑", "选", "自", "己", "的", "配", "偶", "。", "狗", "狗", "为", "什", "么", "互", "相", "闻", "屁", "股"], "sample_type": "ori", "rel_ids": [1796]} -{"id": 36, "title": "出租房隔音差怎么解决", "context": "可以在窗户上贴一层隔音膜,在粘贴过程中要注意,不要出现气泡,以免影响隔音效果。若想要隔音效果更好点,还可以购买一些密封条安装在窗户缝隙处,这也能起到更好的隔音效果。另外,室内使用的家具可以更换成木质的,这样同样能起到一定的吸音效果。", "question": "出租房隔音不好怎么解决", "sent_token": ["可", "以", "在", "窗", "户", "上", "贴", "一", "层", "隔", "音", "膜", ",", "在", "粘", "贴", "过", "程", "中", "要", "注", "意", ",", "不", "要", "出", "现", "气", "泡", ",", "以", "免", "影", "响", "隔", "音", "效", "果", "。", "若", "想", "要", "隔", "音", "效", "果", "更", "好", "点", ",", "还", "可", "以", "购", "买", "一", "些", "密", "封", "条", "安", "装", "在", "窗", "户", "缝", "隙", "处", ",", "这", "也", "能", "起", "到", "更", "好", "的", "隔", "音", "效", "果", "。", "另", "外", ",", "室", "内", "使", "用", "的", "家", "具", "可", "以", "更", "换", "成", "木", "质", "的", ",", "这", "样", "同", "样", "能", "起", "到", "一", "定", "的", "吸", "音", "效", "果", "。", "出", "租", "房", "隔", "音", "差", "怎", "么", "解", "决"], "sample_type": "ori", "rel_ids": [1798]} -{"id": 40, "title": "鬼迷心窍(李宗盛演唱歌曲)_百度百科", "context": "《鬼迷心窍》是1992年黄日华、周海媚主演台湾电视剧《末代皇孙》的主题曲,是由李宗盛作词、作曲、演唱,收录于1992年影视剧音乐合辑《滚石九大天王之十二出好戏》当中。", "question": "鬼迷心窍原唱", "sent_token": ["《", "鬼", "迷", "心", "窍", "》", "是", "1992", "年", "黄", "日", "华", "、", "周", "海", "媚", "主", "演", "台", "湾", "电", "视", "剧", "《", "末", "代", "皇", "孙", "》", "的", "主", "题", "曲", ",", "是", "由", "李", "宗", "盛", "作", "词", "、", "作", "曲", "、", "演", "唱", ",", "收", "录", "于", "1992", "年", "影", "视", "剧", "音", "乐", "合", "辑", "《", "滚", "石", "九", "大", "天", "王", "之", "十", "二", "出", "好", "戏", "》", "当", "中", "。", "鬼", "迷", "心", "窍", "(", "李", "宗", "盛", "演", "唱", "歌", "曲", ")", "_", "百", "度", "百", "科"], "sample_type": "ori", "rel_ids": [1802]} -{"id": 41, "title": "", "context": "白龙马,名著小说《西游记》中的重要角色。本是西海龙王三太子,因纵火烧毁玉帝赏赐的明珠而被西海龙王上天告忤逆,要被斩首。后因南海观世菩萨出面才免于死罪,被贬到蛇盘山鹰愁涧等待唐僧取经。之后又误吃唐僧所骑的白马,被菩萨点化,变身为白龙。", "question": "白龙马的真正身份", "sent_token": ["白", "龙", "马", ",", "名", "著", "小", "说", "《", "西", "游", "记", "》", "中", "的", "重", "要", "角", "色", "。", "本", "是", "西", "海", "龙", "王", "三", "太", "子", ",", "因", "纵", "火", "烧", "毁", "玉", "帝", "赏", "赐", "的", "明", "珠", "而", "被", "西", "海", "龙", "王", "上", "天", "告", "忤", "逆", ",", "要", "被", "斩", "首", "。", "后", "因", "南", "海", "观", "世", "菩", "萨", "出", "面", "才", "免", "于", "死", "罪", ",", "被", "贬", "到", "蛇", "盘", "山", "鹰", "愁", "涧", "等", "待", "唐", "僧", "取", "经", "。", "之", "后", "又", "误", "吃", "唐", "僧", "所", "骑", "的", "白", "马", ",", "被", "菩", "萨", "点", "化", ",", "变", "身", "为", "白", "龙", "。"], "sample_type": "ori", "rel_ids": [1803]} -{"id": 43, "title": "", "context": "《湮灭》是由派拉蒙影业出品的科幻惊悚片,由亚历克斯·加兰执导,娜塔莉·波特曼、詹妮弗·杰森·李、吉娜·罗德里格兹、泰莎·汤普森联合主演。该片于2018年2月23日在美国上映。影片根据杰夫·梵德米尔所著《遗落的南境》三部曲的首部同名小说改编,讲述了生物学家莉娜为了自己的丈夫,她自愿加入了科学考察探险小队,去研究美国领土一块被检疫隔离的生态灾害区域的故事。", "question": "湮灭什么类型", "sent_token": ["《", "湮", "灭", "》", "是", "由", "派", "拉", "蒙", "影", "业", "出", "品", "的", "科", "幻", "惊", "悚", "片", ",", "由", "亚", "历", "克", "斯", "·", "加", "兰", "执", "导", ",", "娜", "塔", "莉", "·", "波", "特", "曼", "、", "詹", "妮", "弗", "·", "杰", "森", "·", "李", "、", "吉", "娜", "·", "罗", "德", "里", "格", "兹", "、", "泰", "莎", "·", "汤", "普", "森", 
"联", "合", "主", "演", "。", "该", "片", "于", "2018", "年", "2", "月", "23", "日", "在", "美", "国", "上", "映", "。", "影", "片", "根", "据", "杰", "夫", "·", "梵", "德", "米", "尔", "所", "著", "《", "遗", "落", "的", "南", "境", "》", "三", "部", "曲", "的", "首", "部", "同", "名", "小", "说", "改", "编", ",", "讲", "述", "了", "生", "物", "学", "家", "莉", "娜", "为", "了", "自", "己", "的", "丈", "夫", ",", "她", "自", "愿", "加", "入", "了", "科", "学", "考", "察", "探", "险", "小", "队", ",", "去", "研", "究", "美", "国", "领", "土", "一", "块", "被", "检", "疫", "隔", "离", "的", "生", "态", "灾", "害", "区", "域", "的", "故", "事", "。"], "sample_type": "ori", "rel_ids": [1805]} -{"id": 45, "title": "", "context": "网球运动的起源及演变可以用四句话来概括:网球孕育在法国,诞生在英国,开始普及和形成高潮在美国,现盛行全世界。", "question": "网球起源于哪国?", "sent_token": ["网", "球", "运", "动", "的", "起", "源", "及", "演", "变", "可", "以", "用", "四", "句", "话", "来", "概", "括", ":", "网", "球", "孕", "育", "在", "法", "国", ",", "诞", "生", "在", "英", "国", ",", "开", "始", "普", "及", "和", "形", "成", "高", "潮", "在", "美", "国", ",", "现", "盛", "行", "全", "世", "界", "。"], "sample_type": "ori", "rel_ids": [1807]} -{"id": 48, "title": "单人挑战巫女大蛇悲鸣需要多少体力_单人挑战巫女大蛇悲鸣需要体力", "context": "阴阳师巫女大蛇悲鸣单人通关需要12点体力组队通关的话只需要8点体力,挑战巫女大蛇悲鸣的体力消耗是普通御魂副本的2倍。奖励掉落5星与6星御魂,经验强化狗粮4星青吉鬼。在御魂副本1-10层原本掉落的基础上,巫女大蛇·悲鸣新增了蚌精、幽谷响、轮入道、蝠翼、狂骨这5种御魂的掉落,每日掉落御魂种类增加到5。", "question": "阴阳师 组队挑战大蛇悲鸣需要多少体力", "sent_token": ["阴", "阳", "师", "巫", "女", "大", "蛇", "悲", "鸣", "单", "人", "通", "关", "需", "要", "12", "点", "体", "力", "组", "队", "通", "关", "的", "话", "只", "需", "要", "8", "点", "体", "力", ",", "挑", "战", "巫", "女", "大", "蛇", "悲", "鸣", "的", "体", "力", "消", "耗", "是", "普", "通", "御", "魂", "副", "本", "的", "2", "倍", "。", "奖", "励", "掉", "落", "5", "星", "与", "6", "星", "御", "魂", ",", "经", "验", "强", "化", "狗", "粮", "4", "星", "青", "吉", "鬼", "。", "在", "御", "魂", "副", "本", "1", "-", "10", "层", "原", "本", "掉", "落", "的", "基", "础", "上", ",", "巫", "女", "大", "蛇", "·", "悲", "鸣", "新", "增", "了", "蚌", "精", "、", "幽", "谷", "响", "、", "轮", "入", "道", "、", "蝠", "翼", "、", "狂", "骨", "这", "5", "种", "御", "魂", "的", "掉", "落", ",", "每", "日", "掉", "落", "御", "魂", "种", "类", "增", "加", "到", "5", "。", "单", "人", "挑", "战", "巫", "女", "大", "蛇", "悲", "鸣", "需", "要", "多", "少", "体", "力", "_", "单", "人", "挑", "战", "巫", "女", "大", "蛇", "悲", "鸣", "需", "要", "体", "力"], "sample_type": "ori", "rel_ids": [1810]} -{"id": 53, "title": "", "context": "人类的心脏位于胸腔中部偏左,体积约相当于一个拳头大小,重量约350克。女性的心脏通常要比男性的体积小且重量轻。人的心脏外形像桃子,位于横膈之上,两肺间而偏左。", "question": "人类心脏多少斤", "sent_token": ["人", "类", "的", "心", "脏", "位", "于", "胸", "腔", "中", "部", "偏", "左", ",", "体", "积", "约", "相", "当", "于", "一", "个", "拳", "头", "大", "小", ",", "重", "量", "约", "350", "克", "。", "女", "性", "的", "心", "脏", "通", "常", "要", "比", "男", "性", "的", "体", "积", "小", "且", "重", "量", "轻", "。", "人", "的", "心", "脏", "外", "形", "像", "桃", "子", ",", "位", "于", "横", "膈", "之", "上", ",", "两", "肺", "间", "而", "偏", "左", "。"], "sample_type": "ori", "rel_ids": [1815]} -{"id": 54, "title": "紫菜变成紫色还能吃吗-有来医生", "context": "如果紫菜变成紫色的情况下,主要考虑还是紫菜受潮引起的,紫菜受潮以后容易滋生细菌,营养物质也会丧失,口感也会变差,一般情况下,建议不要食用,以免导致消化道的不良反应。紫菜中含有的营养物质是很丰富的,含有丰富的锌元素和铁元素,每天适当的吃一点,可以预防缺铁性贫血,可以预防缺锌引起的反复性口腔溃疡,可以增进食欲。", "question": "海苔回潮了还能吃吗", "sent_token": ["如", "果", "紫", "菜", "变", "成", "紫", "色", "的", "情", "况", "下", ",", "主", "要", "考", "虑", "还", "是", "紫", "菜", "受", "潮", "引", "起", "的", ",", "紫", "菜", "受", "潮", "以", "后", "容", "易", "滋", "生", "细", "菌", ",", "营", "养", "物", "质", "也", "会", "丧", "失", ",", "口", "感", "也", "会", "变", "差", ",", "一", "般", "情", "况", "下", ",", "建", "议", "不", "要", "食", "用", ",", "以", "免", "导", "致", "消", "化", "道", "的", "不", "良", "反", "应", "。", "紫", "菜", "中", "含", "有", "的", "营", "养", "物", "质", "是", "很", "丰", 
"富", "的", ",", "含", "有", "丰", "富", "的", "锌", "元", "素", "和", "铁", "元", "素", ",", "每", "天", "适", "当", "的", "吃", "一", "点", ",", "可", "以", "预", "防", "缺", "铁", "性", "贫", "血", ",", "可", "以", "预", "防", "缺", "锌", "引", "起", "的", "反", "复", "性", "口", "腔", "溃", "疡", ",", "可", "以", "增", "进", "食", "欲", "。", "紫", "菜", "变", "成", "紫", "色", "还", "能", "吃", "吗", "-", "有", "来", "医", "生"], "sample_type": "ori", "rel_ids": [1816]} -{"id": 68, "title": "", "context": "穿上盔甲后,托尼变身成了复仇者联盟中惩恶扬善的钢铁侠。复仇者联盟2:奥创纪元钢铁侠是美国演员小罗伯特·唐尼演的。小罗伯特唐尼的电影钢铁侠扮演者小罗伯特·。", "question": "谁演过钢铁侠", "sent_token": ["穿", "上", "盔", "甲", "后", ",", "托", "尼", "变", "身", "成", "了", "复", "仇", "者", "联", "盟", "中", "惩", "恶", "扬", "善", "的", "钢", "铁", "侠", "。", "复", "仇", "者", "联", "盟", "2", ":", "奥", "创", "纪", "元", "钢", "铁", "侠", "是", "美", "国", "演", "员", "小", "罗", "伯", "特", "·", "唐", "尼", "演", "的", "。", "小", "罗", "伯", "特", "唐", "尼", "的", "电", "影", "钢", "铁", "侠", "扮", "演", "者", "小", "罗", "伯", "特", "·", "。"], "sample_type": "ori", "rel_ids": [1830]} -{"id": 69, "title": "人间正道是沧桑是什么意思_酷知经验网", "context": "天若有情天亦老,人间正道是沧桑:上句借用李贺《金铜仙人辞汉歌》中诗句,原诗说的是汉武帝时制作的极贵重的宝物金铜仙人像,在三国时被魏明帝由长安迁往洛阳的传说。原句的意思是,对于这样的人间恨事,天若有情,也要因悲伤而衰老。", "question": "人间正道是沧桑上一句", "sent_token": ["天", "若", "有", "情", "天", "亦", "老", ",", "人", "间", "正", "道", "是", "沧", "桑", ":", "上", "句", "借", "用", "李", "贺", "《", "金", "铜", "仙", "人", "辞", "汉", "歌", "》", "中", "诗", "句", ",", "原", "诗", "说", "的", "是", "汉", "武", "帝", "时", "制", "作", "的", "极", "贵", "重", "的", "宝", "物", "金", "铜", "仙", "人", "像", ",", "在", "三", "国", "时", "被", "魏", "明", "帝", "由", "长", "安", "迁", "往", "洛", "阳", "的", "传", "说", "。", "原", "句", "的", "意", "思", "是", ",", "对", "于", "这", "样", "的", "人", "间", "恨", "事", ",", "天", "若", "有", "情", ",", "也", "要", "因", "悲", "伤", "而", "衰", "老", "。", "人", "间", "正", "道", "是", "沧", "桑", "是", "什", "么", "意", "思", "_", "酷", "知", "经", "验", "网"], "sample_type": "ori", "rel_ids": [1831]} -{"id": 72, "title": "", "context": "《艺妓回忆录》根据美国作家阿瑟-高顿的同名小说改编。于2005年12月1日上映,由章子怡·巩俐·杨紫琼等共同演绎。是一部时长约140分钟的电影。全篇充满着古典美,时代背景从1929年开始延续到二战结束,女主人公回忆了自己从小拼命挣扎、历尽荣辱的人生经历。", "question": "艺妓回忆录多长时间", "sent_token": ["《", "艺", "妓", "回", "忆", "录", "》", "根", "据", "美", "国", "作", "家", "阿", "瑟", "-", "高", "顿", "的", "同", "名", "小", "说", "改", "编", "。", "于", "2005", "年", "12", "月", "1", "日", "上", "映", ",", "由", "章", "子", "怡", "·", "巩", "俐", "·", "杨", "紫", "琼", "等", "共", "同", "演", "绎", "。", "是", "一", "部", "时", "长", "约", "140", "分", "钟", "的", "电", "影", "。", "全", "篇", "充", "满", "着", "古", "典", "美", ",", "时", "代", "背", "景", "从", "1929", "年", "开", "始", "延", "续", "到", "二", "战", "结", "束", ",", "女", "主", "人", "公", "回", "忆", "了", "自", "己", "从", "小", "拼", "命", "挣", "扎", "、", "历", "尽", "荣", "辱", "的", "人", "生", "经", "历", "。"], "sample_type": "ori", "rel_ids": [1834]} -{"id": 77, "title": "痛风挂哪个科室比较好?_39健康问答_39健康网", "context": "痛风属于代谢风湿性疾病,目前主要是在风湿免疫科治疗,所以患者需要挂风湿免疫科。风湿免疫科在绝大多数三级甲等医院都有独立的科室。由于这个科是一个新兴学科,在很多县级医院还没有成立,患者可以到内分泌科就诊,挂内分泌科。如果这两个科都没有患者,可以到骨科就诊,因为痛风首发表现是急性痛风性关节炎,骨科大夫对痛风也有一定的了解。", "question": "痛风属于什么类型疾病", "sent_token": ["痛", "风", "属", "于", "代", "谢", "风", "湿", "性", "疾", "病", ",", "目", "前", "主", "要", "是", "在", "风", "湿", "免", "疫", "科", "治", "疗", ",", "所", "以", "患", "者", "需", "要", "挂", "风", "湿", "免", "疫", "科", "。", "风", "湿", "免", "疫", "科", "在", "绝", "大", "多", "数", "三", "级", "甲", "等", "医", "院", "都", "有", "独", "立", "的", "科", "室", "。", "由", "于", "这", "个", "科", "是", "一", "个", "新", "兴", "学", "科", ",", "在", "很", "多", "县", "级", "医", "院", "还", "没", "有", "成", "立", ",", "患", "者", "可", "以", "到", "内", "分", "泌", "科", "就", "诊", ",", "挂", "内", "分", "泌", "科", "。", "如", "果", "这", "两", "个", 
"科", "都", "没", "有", "患", "者", ",", "可", "以", "到", "骨", "科", "就", "诊", ",", "因", "为", "痛", "风", "首", "发", "表", "现", "是", "急", "性", "痛", "风", "性", "关", "节", "炎", ",", "骨", "科", "大", "夫", "对", "痛", "风", "也", "有", "一", "定", "的", "了", "解", "。", "痛", "风", "挂", "哪", "个", "科", "室", "比", "较", "好", "?", "_", "39", "健", "康", "问", "答", "_", "39", "健", "康", "网"], "sample_type": "ori", "rel_ids": [1839]} -{"id": 82, "title": "阴阳师武士之灵生前被谁所杀_游侠网", "context": "从武士之灵的传记中可以得知,武士之灵生前是被茨木童子所击杀。该问题来自游戏内的逢魔密信,正确回答问题之后就有机会获得包括金币、体力、勾玉和结界卡在内的多种游戏内道具物资奖励。", "question": "武士之灵生前被谁所杀", "sent_token": ["从", "武", "士", "之", "灵", "的", "传", "记", "中", "可", "以", "得", "知", ",", "武", "士", "之", "灵", "生", "前", "是", "被", "茨", "木", "童", "子", "所", "击", "杀", "。", "该", "问", "题", "来", "自", "游", "戏", "内", "的", "逢", "魔", "密", "信", ",", "正", "确", "回", "答", "问", "题", "之", "后", "就", "有", "机", "会", "获", "得", "包", "括", "金", "币", "、", "体", "力", "、", "勾", "玉", "和", "结", "界", "卡", "在", "内", "的", "多", "种", "游", "戏", "内", "道", "具", "物", "资", "奖", "励", "。", "阴", "阳", "师", "武", "士", "之", "灵", "生", "前", "被", "谁", "所", "杀", "_", "游", "侠", "网"], "sample_type": "ori", "rel_ids": [1844]} -{"id": 88, "title": "中医肾主什么-有来医生", "context": "根据中医基础理论,肾主水、主纳气、主二便、主藏精。肾主水,是指全身的水液代谢都是在肾阳的气化温煦作用下,从而分布到全身,然后再通过呼吸、二便将代谢废物排除体外。肾主纳气,是指肾能够使人体维持正常的呼吸深度。肾主二便,人的大小便需要在肾的作用下,才能够正常的排泄,否则就会出现异常的改变,比如大小便失禁、大便稀薄等情况。肾主藏精,是指五脏六腑化生的精气,最后都是储存在肾脏,反过来肾脏所藏的精气,又能够推动各脏腑的功能。", "question": "肾主什么", "sent_token": ["根", "据", "中", "医", "基", "础", "理", "论", ",", "肾", "主", "水", "、", "主", "纳", "气", "、", "主", "二", "便", "、", "主", "藏", "精", "。", "肾", "主", "水", ",", "是", "指", "全", "身", "的", "水", "液", "代", "谢", "都", "是", "在", "肾", "阳", "的", "气", "化", "温", "煦", "作", "用", "下", ",", "从", "而", "分", "布", "到", "全", "身", ",", "然", "后", "再", "通", "过", "呼", "吸", "、", "二", "便", "将", "代", "谢", "废", "物", "排", "除", "体", "外", "。", "肾", "主", "纳", "气", ",", "是", "指", "肾", "能", "够", "使", "人", "体", "维", "持", "正", "常", "的", "呼", "吸", "深", "度", "。", "肾", "主", "二", "便", ",", "人", "的", "大", "小", "便", "需", "要", "在", "肾", "的", "作", "用", "下", ",", "才", "能", "够", "正", "常", "的", "排", "泄", ",", "否", "则", "就", "会", "出", "现", "异", "常", "的", "改", "变", ",", "比", "如", "大", "小", "便", "失", "禁", "、", "大", "便", "稀", "薄", "等", "情", "况", "。", "肾", "主", "藏", "精", ",", "是", "指", "五", "脏", "六", "腑", "化", "生", "的", "精", "气", ",", "最", "后", "都", "是", "储", "存", "在", "肾", "脏", ",", "反", "过", "来", "肾", "脏", "所", "藏", "的", "精", "气", ",", "又", "能", "够", "推", "动", "各", "脏", "腑", "的", "功", "能", "。", "中", "医", "肾", "主", "什", "么", "-", "有", "来", "医", "生"], "sample_type": "ori", "rel_ids": [1850]} -{"id": 91, "title": "1963年属什么生肖年_十二生肖_卜易居", "context": "1963年属什么生肖年,葵卯兔年,属兔之人举止文雅,谈吐随和,为人恭良谦逊,与人交往如慕春风,学习能力超群,敏捷果断,安贫乐道。虽性子柔弱,但韧性极强,绝境之中能力惊人,缺点则是难以坚持原则,随波逐流。", "question": "1963年属什么生肖", "sent_token": ["1963", "年", "属", "什", "么", "生", "肖", "年", ",", "葵", "卯", "兔", "年", ",", "属", "兔", "之", "人", "举", "止", "文", "雅", ",", "谈", "吐", "随", "和", ",", "为", "人", "恭", "良", "谦", "逊", ",", "与", "人", "交", "往", "如", "慕", "春", "风", ",", "学", "习", "能", "力", "超", "群", ",", "敏", "捷", "果", "断", ",", "安", "贫", "乐", "道", "。", "虽", "性", "子", "柔", "弱", ",", "但", "韧", "性", "极", "强", ",", "绝", "境", "之", "中", "能", "力", "惊", "人", ",", "缺", "点", "则", "是", "难", "以", "坚", "持", "原", "则", ",", "随", "波", "逐", "流", "。", "1963", "年", "属", "什", "么", "生", "肖", "年", "_", "十", "二", "生", "肖", "_", "卜", "易", "居"], "sample_type": "ori", "rel_ids": [1853]} -{"id": 92, "title": "食管和食道一样吗-有来医生", "context": 
"食管和食道是没有区别的,食管是医学上的称谓,而食道是民间的一种说法。两者都指从咽喉部到胃贲门之间的管道。食管可以分为颈段和胸段,而胸段又分为胸上段、胸中段和胸下段。食管本身有3个生理性的狭窄,这也是某些食管疾病发生的基础。常见的食管疾病包括食管炎、食管息肉、食管癌、食管狭窄、胃食管反流症、巴雷特食管等。可以通过消化道造影以及胃镜来进一步明确。", "question": "食管跟食道一样吗", "sent_token": ["食", "管", "和", "食", "道", "是", "没", "有", "区", "别", "的", ",", "食", "管", "是", "医", "学", "上", "的", "称", "谓", ",", "而", "食", "道", "是", "民", "间", "的", "一", "种", "说", "法", "。", "两", "者", "都", "指", "从", "咽", "喉", "部", "到", "胃", "贲", "门", "之", "间", "的", "管", "道", "。", "食", "管", "可", "以", "分", "为", "颈", "段", "和", "胸", "段", ",", "而", "胸", "段", "又", "分", "为", "胸", "上", "段", "、", "胸", "中", "段", "和", "胸", "下", "段", "。", "食", "管", "本", "身", "有", "3", "个", "生", "理", "性", "的", "狭", "窄", ",", "这", "也", "是", "某", "些", "食", "管", "疾", "病", "发", "生", "的", "基", "础", "。", "常", "见", "的", "食", "管", "疾", "病", "包", "括", "食", "管", "炎", "、", "食", "管", "息", "肉", "、", "食", "管", "癌", "、", "食", "管", "狭", "窄", "、", "胃", "食", "管", "反", "流", "症", "、", "巴", "雷", "特", "食", "管", "等", "。", "可", "以", "通", "过", "消", "化", "道", "造", "影", "以", "及", "胃", "镜", "来", "进", "一", "步", "明", "确", "。", "食", "管", "和", "食", "道", "一", "样", "吗", "-", "有", "来", "医", "生"], "sample_type": "ori", "rel_ids": [1854]} -{"id": 101, "title": "农历六月二十四是什么星座-星座乐", "context": "农历六月二十四是狮子座。狮子座,火象星座,位于黄道十二宫之第五宫,出生日期为阳历7月23日-8月22日。狮子座是英雄主义者,他们乐观,乐于助人,喜欢帮助弱势群体。他们天生自带光环,特立独行,做事豪爽大气,讲话淡定从容,从不扭扭捏捏畏畏缩缩。而且心思细腻,做事完整准确,善于将自己的优点发挥到极致。", "question": "农历六月二十四是什么星座", "sent_token": ["农", "历", "六", "月", "二", "十", "四", "是", "狮", "子", "座", "。", "狮", "子", "座", ",", "火", "象", "星", "座", ",", "位", "于", "黄", "道", "十", "二", "宫", "之", "第", "五", "宫", ",", "出", "生", "日", "期", "为", "阳", "历", "7", "月", "23", "日", "-", "8", "月", "22", "日", "。", "狮", "子", "座", "是", "英", "雄", "主", "义", "者", ",", "他", "们", "乐", "观", ",", "乐", "于", "助", "人", ",", "喜", "欢", "帮", "助", "弱", "势", "群", "体", "。", "他", "们", "天", "生", "自", "带", "光", "环", ",", "特", "立", "独", "行", ",", "做", "事", "豪", "爽", "大", "气", ",", "讲", "话", "淡", "定", "从", "容", ",", "从", "不", "扭", "扭", "捏", "捏", "畏", "畏", "缩", "缩", "。", "而", "且", "心", "思", "细", "腻", ",", "做", "事", "完", "整", "准", "确", ",", "善", "于", "将", "自", "己", "的", "优", "点", "发", "挥", "到", "极", "致", "。", "农", "历", "六", "月", "二", "十", "四", "是", "什", "么", "星", "座", "-", "星", "座", "乐"], "sample_type": "ori", "rel_ids": [1863]} -{"id": 105, "title": "", "context": "非法持有海洛因10克以上就构成非法持有毒品罪非法持有毒品罪,是指明知是鸦片、海洛因、甲基苯丙胺或者其他毒品,而非法持有且数量较大的行为。非法持有毒品达到一定数量才构成犯罪。", "question": "海洛因几克属于犯罪", "sent_token": ["非", "法", "持", "有", "海", "洛", "因", "10", "克", "以", "上", "就", "构", "成", "非", "法", "持", "有", "毒", "品", "罪", "非", "法", "持", "有", "毒", "品", "罪", ",", "是", "指", "明", "知", "是", "鸦", "片", "、", "海", "洛", "因", "、", "甲", "基", "苯", "丙", "胺", "或", "者", "其", "他", "毒", "品", ",", "而", "非", "法", "持", "有", "且", "数", "量", "较", "大", "的", "行", "为", "。", "非", "法", "持", "有", "毒", "品", "达", "到", "一", "定", "数", "量", "才", "构", "成", "犯", "罪", "。"], "sample_type": "ori", "rel_ids": [1867]} -{"id": 115, "title": "地方志书每几年左右编修一次_高三网", "context": "地方志书每20年左右编修一次。每一轮地方志书编修工作完成后,负责地方志工作的机构在编纂地方综合年鉴、搜集资料以及向社会提供咨询服务的同时,启动新一轮地方志书的续修工作。", "question": "地方质数没几年编修一次", "sent_token": ["地", "方", "志", "书", "每", "20", "年", "左", "右", "编", "修", "一", "次", "。", "每", "一", "轮", "地", "方", "志", "书", "编", "修", "工", "作", "完", "成", "后", ",", "负", "责", "地", "方", "志", "工", "作", "的", "机", "构", "在", "编", "纂", "地", "方", "综", "合", "年", "鉴", "、", "搜", "集", "资", "料", "以", "及", "向", "社", "会", "提", "供", "咨", "询", "服", "务", "的", "同", "时", ",", "启", "动", "新", "一", "轮", "地", "方", "志", "书", "的", "续", "修", "工", "作", "。", "地", "方", "志", "书", "每", "几", "年", 
"左", "右", "编", "修", "一", "次", "_", "高", "三", "网"], "sample_type": "ori", "rel_ids": [1877]} -{"id": 117, "title": "", "context": "《正气歌》是南宋诗人文天祥在狱中写的一首五言古诗。诗的开头即点出浩然正气存乎天地之间,至时穷之际,必然会显示出来。随后连用十二个典故,都是历史上有名的人物,他们的所作所为凛然显示出浩然正气的力量。接下来八句说明浩然正气贯日月,立天地,为三纲之命,道义之根。最后联系到自己的命运,自己虽然兵败被俘,处在极其恶劣的牢狱之中,但是由于自己一身正气,各种邪气和疾病都不能侵犯自己,因此自己能够坦然面对自己的命运。全诗感情深沉、气壮山河、直抒胸臆、毫无雕饰,充分体现了作者崇高的民族气节和强烈的爱国主义精神。", "question": "正气歌》的作者是", "sent_token": ["《", "正", "气", "歌", "》", "是", "南", "宋", "诗", "人", "文", "天", "祥", "在", "狱", "中", "写", "的", "一", "首", "五", "言", "古", "诗", "。", "诗", "的", "开", "头", "即", "点", "出", "浩", "然", "正", "气", "存", "乎", "天", "地", "之", "间", ",", "至", "时", "穷", "之", "际", ",", "必", "然", "会", "显", "示", "出", "来", "。", "随", "后", "连", "用", "十", "二", "个", "典", "故", ",", "都", "是", "历", "史", "上", "有", "名", "的", "人", "物", ",", "他", "们", "的", "所", "作", "所", "为", "凛", "然", "显", "示", "出", "浩", "然", "正", "气", "的", "力", "量", "。", "接", "下", "来", "八", "句", "说", "明", "浩", "然", "正", "气", "贯", "日", "月", ",", "立", "天", "地", ",", "为", "三", "纲", "之", "命", ",", "道", "义", "之", "根", "。", "最", "后", "联", "系", "到", "自", "己", "的", "命", "运", ",", "自", "己", "虽", "然", "兵", "败", "被", "俘", ",", "处", "在", "极", "其", "恶", "劣", "的", "牢", "狱", "之", "中", ",", "但", "是", "由", "于", "自", "己", "一", "身", "正", "气", ",", "各", "种", "邪", "气", "和", "疾", "病", "都", "不", "能", "侵", "犯", "自", "己", ",", "因", "此", "自", "己", "能", "够", "坦", "然", "面", "对", "自", "己", "的", "命", "运", "。", "全", "诗", "感", "情", "深", "沉", "、", "气", "壮", "山", "河", "、", "直", "抒", "胸", "臆", "、", "毫", "无", "雕", "饰", ",", "充", "分", "体", "现", "了", "作", "者", "崇", "高", "的", "民", "族", "气", "节", "和", "强", "烈", "的", "爱", "国", "主", "义", "精", "神", "。"], "sample_type": "ori", "rel_ids": [1879]} -{"id": 121, "title": "狗狗皮肤上长小脓包怎么回事", "context": "狗狗身上长脓包,是因为真菌感染或是寄生虫感染所致。如不及时处理脓包,会导致扩散全身,甚至溃烂。建议方法:戴上手套,把狗狗身上长脓包的地方挤一挤;然后用碘伏直接喷在患处;如有脓血可用医用纱布给它包在患处,等药效吸收后,取掉纱布;碘伏具有抗菌、消炎的作用,一天可以喷两三次;处理完狗狗伤口后用肥皂洗手。狗狗洗澡要用狗狗专门的沐浴露;洗后立即做吹干处理;定时用狗狗专用梳子,清理身上多余的杂毛;尽量带狗狗去干净的地方玩,回家后把狗狗的脚用抹布抹一次;多注意狗舍卫生,定时做消毒处理。", "question": "狗狗身上长小脓包是怎么回事", "sent_token": ["狗", "狗", "身", "上", "长", "脓", "包", ",", "是", "因", "为", "真", "菌", "感", "染", "或", "是", "寄", "生", "虫", "感", "染", "所", "致", "。", "如", "不", "及", "时", "处", "理", "脓", "包", ",", "会", "导", "致", "扩", "散", "全", "身", ",", "甚", "至", "溃", "烂", "。", "建", "议", "方", "法", ":", "戴", "上", "手", "套", ",", "把", "狗", "狗", "身", "上", "长", "脓", "包", "的", "地", "方", "挤", "一", "挤", ";", "然", "后", "用", "碘", "伏", "直", "接", "喷", "在", "患", "处", ";", "如", "有", "脓", "血", "可", "用", "医", "用", "纱", "布", "给", "它", "包", "在", "患", "处", ",", "等", "药", "效", "吸", "收", "后", ",", "取", "掉", "纱", "布", ";", "碘", "伏", "具", "有", "抗", "菌", "、", "消", "炎", "的", "作", "用", ",", "一", "天", "可", "以", "喷", "两", "三", "次", ";", "处", "理", "完", "狗", "狗", "伤", "口", "后", "用", "肥", "皂", "洗", "手", "。", "狗", "狗", "洗", "澡", "要", "用", "狗", "狗", "专", "门", "的", "沐", "浴", "露", ";", "洗", "后", "立", "即", "做", "吹", "干", "处", "理", ";", "定", "时", "用", "狗", "狗", "专", "用", "梳", "子", ",", "清", "理", "身", "上", "多", "余", "的", "杂", "毛", ";", "尽", "量", "带", "狗", "狗", "去", "干", "净", "的", "地", "方", "玩", ",", "回", "家", "后", "把", "狗", "狗", "的", "脚", "用", "抹", "布", "抹", "一", "次", ";", "多", "注", "意", "狗", "舍", "卫", "生", ",", "定", "时", "做", "消", "毒", "处", "理", "。", "狗", "狗", "皮", "肤", "上", "长", "小", "脓", "包", "怎", "么", "回", "事"], "sample_type": "ori", "rel_ids": [1883]} -{"id": 123, "title": "", "context": "新梓学校成立于2007年9月,是一所公办九年一贯制学校,座落在龙岗街道新生社区,紧邻水岸新都花园,交通十分便利。校园占地27500平方米,建筑面积16285平方米。", "question": "新梓学校地址", "sent_token": ["新", "梓", "学", "校", "成", "立", "于", "2007", "年", "9", 
"月", ",", "是", "一", "所", "公", "办", "九", "年", "一", "贯", "制", "学", "校", ",", "座", "落", "在", "龙", "岗", "街", "道", "新", "生", "社", "区", ",", "紧", "邻", "水", "岸", "新", "都", "花", "园", ",", "交", "通", "十", "分", "便", "利", "。", "校", "园", "占", "地", "27500", "平", "方", "米", ",", "建", "筑", "面", "积", "16285", "平", "方", "米", "。"], "sample_type": "ori", "rel_ids": [1885]} -{"id": 124, "title": "敷面膜脸痒是缺水吗?教你正确的认识_皮肤", "context": "当我们在洗完澡的时候,或者是敷面膜发现皮肤有一种痒痒的感觉,如果你确定面膜的质量是没有问题的,并且也确定你对这款面膜的物质没有过敏的情况下,皮肤出现痒的感觉,那可能的原因就是由于皮肤缺水。因为你的皮肤太缺水了,在给皮肤补水的时候就会出现一种痒的情况严重的时候,甚至会有刺痛的感觉。会让人觉得很不舒服,水分充足后会缓解。", "question": "脸痒是缺水吗", "sent_token": ["当", "我", "们", "在", "洗", "完", "澡", "的", "时", "候", ",", "或", "者", "是", "敷", "面", "膜", "发", "现", "皮", "肤", "有", "一", "种", "痒", "痒", "的", "感", "觉", ",", "如", "果", "你", "确", "定", "面", "膜", "的", "质", "量", "是", "没", "有", "问", "题", "的", ",", "并", "且", "也", "确", "定", "你", "对", "这", "款", "面", "膜", "的", "物", "质", "没", "有", "过", "敏", "的", "情", "况", "下", ",", "皮", "肤", "出", "现", "痒", "的", "感", "觉", ",", "那", "可", "能", "的", "原", "因", "就", "是", "由", "于", "皮", "肤", "缺", "水", "。", "因", "为", "你", "的", "皮", "肤", "太", "缺", "水", "了", ",", "在", "给", "皮", "肤", "补", "水", "的", "时", "候", "就", "会", "出", "现", "一", "种", "痒", "的", "情", "况", "严", "重", "的", "时", "候", ",", "甚", "至", "会", "有", "刺", "痛", "的", "感", "觉", "。", "会", "让", "人", "觉", "得", "很", "不", "舒", "服", ",", "水", "分", "充", "足", "后", "会", "缓", "解", "。", "敷", "面", "膜", "脸", "痒", "是", "缺", "水", "吗", "?", "教", "你", "正", "确", "的", "认", "识", "_", "皮", "肤"], "sample_type": "ori", "rel_ids": [1886]} -{"id": 126, "title": "无痛人流和药流哪个伤害比较小-有来医生", "context": "无痛人工流产手术和药物流产手术,相对比来说,还是药物流产伤害比较大。因为药物流产,阴道流血时间会比人工流产的阴道流血时间要长,一般人工流产,阴道流血时间不超过7天,而药物流产阴道流血的时间往往在15-20天左右才会干净。一直在有流血的状况下,宫口就是开放的,阴道又跟外界相通,跟宫颈又相通,这样造成细菌侵入感染的机会就会增加,所以容易导致生殖道的感染。另外,药物流产造成不全流产的可能性会大一些,需要做清宫手术。这样就可以想象出药物流产会比无痛人流伤害更大一些。", "question": "无痛人流和药流哪个伤害比较小", "sent_token": ["无", "痛", "人", "工", "流", "产", "手", "术", "和", "药", "物", "流", "产", "手", "术", ",", "相", "对", "比", "来", "说", ",", "还", "是", "药", "物", "流", "产", "伤", "害", "比", "较", "大", "。", "因", "为", "药", "物", "流", "产", ",", "阴", "道", "流", "血", "时", "间", "会", "比", "人", "工", "流", "产", "的", "阴", "道", "流", "血", "时", "间", "要", "长", ",", "一", "般", "人", "工", "流", "产", ",", "阴", "道", "流", "血", "时", "间", "不", "超", "过", "7", "天", ",", "而", "药", "物", "流", "产", "阴", "道", "流", "血", "的", "时", "间", "往", "往", "在", "15", "-", "20", "天", "左", "右", "才", "会", "干", "净", "。", "一", "直", "在", "有", "流", "血", "的", "状", "况", "下", ",", "宫", "口", "就", "是", "开", "放", "的", ",", "阴", "道", "又", "跟", "外", "界", "相", "通", ",", "跟", "宫", "颈", "又", "相", "通", ",", "这", "样", "造", "成", "细", "菌", "侵", "入", "感", "染", "的", "机", "会", "就", "会", "增", "加", ",", "所", "以", "容", "易", "导", "致", "生", "殖", "道", "的", "感", "染", "。", "另", "外", ",", "药", "物", "流", "产", "造", "成", "不", "全", "流", "产", "的", "可", "能", "性", "会", "大", "一", "些", ",", "需", "要", "做", "清", "宫", "手", "术", "。", "这", "样", "就", "可", "以", "想", "象", "出", "药", "物", "流", "产", "会", "比", "无", "痛", "人", "流", "伤", "害", "更", "大", "一", "些", "。", "无", "痛", "人", "流", "和", "药", "流", "哪", "个", "伤", "害", "比", "较", "小", "-", "有", "来", "医", "生"], "sample_type": "ori", "rel_ids": [1888]} -{"id": 128, "title": "长期吃葡萄籽的副作用?_39健康问答_39健康网", "context": "长期吃葡萄籽不会有副作用,不用担心,葡萄籽中含有丰富的花青素,有美容养颜的功效。葡萄籽含有丰富的多种氨基酸、维生素及矿物质等,原花青素含量最高,有促进血液循环、保护视力、抗氧化去除自由基、降低血、保护心血管的作用,可以用于保健、美容。", "question": "葡萄籽能长期吃吗?有什么副作用?", "sent_token": ["长", "期", "吃", "葡", "萄", "籽", "不", "会", "有", "副", "作", "用", ",", "不", "用", "担", "心", ",", "葡", "萄", "籽", "中", "含", "有", "丰", "富", "的", "花", "青", "素", ",", 
"有", "美", "容", "养", "颜", "的", "功", "效", "。", "葡", "萄", "籽", "含", "有", "丰", "富", "的", "多", "种", "氨", "基", "酸", "、", "维", "生", "素", "及", "矿", "物", "质", "等", ",", "原", "花", "青", "素", "含", "量", "最", "高", ",", "有", "促", "进", "血", "液", "循", "环", "、", "保", "护", "视", "力", "、", "抗", "氧", "化", "去", "除", "自", "由", "基", "、", "降", "低", "血", "、", "保", "护", "心", "血", "管", "的", "作", "用", ",", "可", "以", "用", "于", "保", "健", "、", "美", "容", "。", "长", "期", "吃", "葡", "萄", "籽", "的", "副", "作", "用", "?", "_", "39", "健", "康", "问", "答", "_", "39", "健", "康", "网"], "sample_type": "ori", "rel_ids": [1890]} -{"id": 132, "title": "红花哪里产的最好?_39健康问答_39健康网", "context": "红花在中国很多地方都是有种植的,比如河南,江苏,四川,河北等等。但是在众多产地中河南的商丘生产的红花应该是最好的了。红花有一种特殊的气味,特别香,味道稍微有点苦。红花是一种很好的植物,对人体有很好的保健作用。高血压患者可以服用一些,红花是有一定的降压作用的,另外还可以促进人体血液的循环,降低血脂。", "question": "世界上哪里的红花最好", "sent_token": ["红", "花", "在", "中", "国", "很", "多", "地", "方", "都", "是", "有", "种", "植", "的", ",", "比", "如", "河", "南", ",", "江", "苏", ",", "四", "川", ",", "河", "北", "等", "等", "。", "但", "是", "在", "众", "多", "产", "地", "中", "河", "南", "的", "商", "丘", "生", "产", "的", "红", "花", "应", "该", "是", "最", "好", "的", "了", "。", "红", "花", "有", "一", "种", "特", "殊", "的", "气", "味", ",", "特", "别", "香", ",", "味", "道", "稍", "微", "有", "点", "苦", "。", "红", "花", "是", "一", "种", "很", "好", "的", "植", "物", ",", "对", "人", "体", "有", "很", "好", "的", "保", "健", "作", "用", "。", "高", "血", "压", "患", "者", "可", "以", "服", "用", "一", "些", ",", "红", "花", "是", "有", "一", "定", "的", "降", "压", "作", "用", "的", ",", "另", "外", "还", "可", "以", "促", "进", "人", "体", "血", "液", "的", "循", "环", ",", "降", "低", "血", "脂", "。", "红", "花", "哪", "里", "产", "的", "最", "好", "?", "_", "39", "健", "康", "问", "答", "_", "39", "健", "康", "网"], "sample_type": "ori", "rel_ids": [1894]} -{"id": 135, "title": "", "context": "梳妆台指用来化妆的家具装饰。梳妆台一词,在现代家居中,已经被业主、客户、家居设计师广泛用到,现在泛指家具梳妆台。梳妆台尺寸标准的是总高度为1500mm左右,宽为700mm到1200mm,这样的梳妆台尺寸是大小正合适的,在家庭装修之前的前期准备时,就应该确定好梳妆台尺寸大小,同时梳妆台尺寸也要和房间的格调和风格统一起来。", "question": "梳妆台整体高度一般是多少", "sent_token": ["梳", "妆", "台", "指", "用", "来", "化", "妆", "的", "家", "具", "装", "饰", "。", "梳", "妆", "台", "一", "词", ",", "在", "现", "代", "家", "居", "中", ",", "已", "经", "被", "业", "主", "、", "客", "户", "、", "家", "居", "设", "计", "师", "广", "泛", "用", "到", ",", "现", "在", "泛", "指", "家", "具", "梳", "妆", "台", "。", "梳", "妆", "台", "尺", "寸", "标", "准", "的", "是", "总", "高", "度", "为", "1500mm", "左", "右", ",", "宽", "为", "700mm", "到", "1200mm", ",", "这", "样", "的", "梳", "妆", "台", "尺", "寸", "是", "大", "小", "正", "合", "适", "的", ",", "在", "家", "庭", "装", "修", "之", "前", "的", "前", "期", "准", "备", "时", ",", "就", "应", "该", "确", "定", "好", "梳", "妆", "台", "尺", "寸", "大", "小", ",", "同", "时", "梳", "妆", "台", "尺", "寸", "也", "要", "和", "房", "间", "的", "格", "调", "和", "风", "格", "统", "一", "起", "来", "。"], "sample_type": "ori", "rel_ids": [1897]} -{"id": 137, "title": "感冒能不能吃燕窝_妈妈网小百科", "context": "在感冒的时候尽量不要吃燕窝,虽然燕窝比较滋补,但是在感冒期间吃燕窝的话,并不利于感冒的恢复。在感冒期间应该吃得清淡一些,补充身体需要的水分,如果没有食欲的话可以多喝一些粥。在感冒期间可能吃药物的话,也不能够起到很好的效果,但是也要坚持吃药。", "question": "感冒可以吃燕窝吗?有效果吗?", "sent_token": ["在", "感", "冒", "的", "时", "候", "尽", "量", "不", "要", "吃", "燕", "窝", ",", "虽", "然", "燕", "窝", "比", "较", "滋", "补", ",", "但", "是", "在", "感", "冒", "期", "间", "吃", "燕", "窝", "的", "话", ",", "并", "不", "利", "于", "感", "冒", "的", "恢", "复", "。", "在", "感", "冒", "期", "间", "应", "该", "吃", "得", "清", "淡", "一", "些", ",", "补", "充", "身", "体", "需", "要", "的", "水", "分", ",", "如", "果", "没", "有", "食", "欲", "的", "话", "可", "以", "多", "喝", "一", "些", "粥", "。", "在", "感", "冒", "期", "间", "可", "能", "吃", "药", "物", "的", "话", ",", "也", "不", "能", "够", "起", "到", "很", "好", "的", "效", "果", ",", "但", "是", "也", "要", 
"坚", "持", "吃", "药", "。", "感", "冒", "能", "不", "能", "吃", "燕", "窝", "_", "妈", "妈", "网", "小", "百", "科"], "sample_type": "ori", "rel_ids": [1899]} -{"id": 138, "title": "房颤会引起脑梗吗-有来医生", "context": "房颤会引起脑血管疾病,在医学上不叫脑梗叫脑栓塞,脑梗是脑血管本身病变引起的脑供血不足的情况,而脑栓塞是由于房颤心脏上形成了附壁血栓,当血栓的栓子脱落之后,就有可能堵塞在脑血管形成了脑拴塞,也是一种脑缺血的表现。治疗方法可以应用改善循环和营养神经的药物治疗,必须应用阿司匹林和氯吡格雷口服抗血小板聚集治疗,对于心房纤颤的患者,要控制心室率,应用阿司匹林和氯吡格雷等口服抗血小板聚集治疗,预防心脏附壁血栓的形成。", "question": "房颤会引起脑梗吗", "sent_token": ["房", "颤", "会", "引", "起", "脑", "血", "管", "疾", "病", ",", "在", "医", "学", "上", "不", "叫", "脑", "梗", "叫", "脑", "栓", "塞", ",", "脑", "梗", "是", "脑", "血", "管", "本", "身", "病", "变", "引", "起", "的", "脑", "供", "血", "不", "足", "的", "情", "况", ",", "而", "脑", "栓", "塞", "是", "由", "于", "房", "颤", "心", "脏", "上", "形", "成", "了", "附", "壁", "血", "栓", ",", "当", "血", "栓", "的", "栓", "子", "脱", "落", "之", "后", ",", "就", "有", "可", "能", "堵", "塞", "在", "脑", "血", "管", "形", "成", "了", "脑", "拴", "塞", ",", "也", "是", "一", "种", "脑", "缺", "血", "的", "表", "现", "。", "治", "疗", "方", "法", "可", "以", "应", "用", "改", "善", "循", "环", "和", "营", "养", "神", "经", "的", "药", "物", "治", "疗", ",", "必", "须", "应", "用", "阿", "司", "匹", "林", "和", "氯", "吡", "格", "雷", "口", "服", "抗", "血", "小", "板", "聚", "集", "治", "疗", ",", "对", "于", "心", "房", "纤", "颤", "的", "患", "者", ",", "要", "控", "制", "心", "室", "率", ",", "应", "用", "阿", "司", "匹", "林", "和", "氯", "吡", "格", "雷", "等", "口", "服", "抗", "血", "小", "板", "聚", "集", "治", "疗", ",", "预", "防", "心", "脏", "附", "壁", "血", "栓", "的", "形", "成", "。", "房", "颤", "会", "引", "起", "脑", "梗", "吗", "-", "有", "来", "医", "生"], "sample_type": "ori", "rel_ids": [1900]} -{"id": 144, "title": "二十天的婴儿能看多远_妈妈网小百科", "context": "20天的宝宝能够看到的距离大概是15厘米-20厘米左右,一般能够看到18厘米左右的事物。宝宝刚出生的时候视力极其差,有的甚至没有睁开眼,可以说基本什么都看不清楚,视力比较好的新生儿,也只能感受到光和影或大致的轮廓。", "question": "二十天的宝宝能看多远?", "sent_token": ["20", "天", "的", "宝", "宝", "能", "够", "看", "到", "的", "距", "离", "大", "概", "是", "15", "厘", "米", "-", "20", "厘", "米", "左", "右", ",", "一", "般", "能", "够", "看", "到", "18", "厘", "米", "左", "右", "的", "事", "物", "。", "宝", "宝", "刚", "出", "生", "的", "时", "候", "视", "力", "极", "其", "差", ",", "有", "的", "甚", "至", "没", "有", "睁", "开", "眼", ",", "可", "以", "说", "基", "本", "什", "么", "都", "看", "不", "清", "楚", ",", "视", "力", "比", "较", "好", "的", "新", "生", "儿", ",", "也", "只", "能", "感", "受", "到", "光", "和", "影", "或", "大", "致", "的", "轮", "廓", "。", "二", "十", "天", "的", "婴", "儿", "能", "看", "多", "远", "_", "妈", "妈", "网", "小", "百", "科"], "sample_type": "ori", "rel_ids": [1906]} -{"id": 156, "title": "4价宫颈疫苗多少钱-有来医生", "context": "4价宫颈癌疫苗有国产疫苗和进口疫苗,国产疫苗价格比较便宜,预防宫颈癌的疫苗只有4价疫苗,具体价格不同地区以及不同生产厂家生产的疫苗,所定价格也不一样。在北京4价宫颈癌疫苗,价格大概是800元左右,总共需要接种三针,需要在半年内接种完,分别在第一个月,第2个月和第6个月各接种一针次,接种年龄是20-45周岁,建议咨询当地疾病预防控制机构,所进疫苗的具体价格比较准确。比如江苏省从2019年开始,所有有价疫苗都是零差价出售,每接种一针次,收取20元材料费和注射费,目前接种宫颈癌疫苗,应该先预约才可以接种。", "question": "国产宫颈疫苗有几价", "sent_token": ["4", "价", "宫", "颈", "癌", "疫", "苗", "有", "国", "产", "疫", "苗", "和", "进", "口", "疫", "苗", ",", "国", "产", "疫", "苗", "价", "格", "比", "较", "便", "宜", ",", "预", "防", "宫", "颈", "癌", "的", "疫", "苗", "只", "有", "4", "价", "疫", "苗", ",", "具", "体", "价", "格", "不", "同", "地", "区", "以", "及", "不", "同", "生", "产", "厂", "家", "生", "产", "的", "疫", "苗", ",", "所", "定", "价", "格", "也", "不", "一", "样", "。", "在", "北", "京", "4", "价", "宫", "颈", "癌", "疫", "苗", ",", "价", "格", "大", "概", "是", "800", "元", "左", "右", ",", "总", "共", "需", "要", "接", "种", "三", "针", ",", "需", "要", "在", "半", "年", "内", "接", "种", "完", ",", "分", "别", "在", "第", "一", "个", "月", ",", "第", "2", "个", "月", "和", "第", "6", "个", "月", "各", "接", "种", "一", "针", "次", ",", "接", "种", "年", "龄", "是", "20", "-", "45", "周", "岁", ",", "建", "议", "咨", "询", "当", "地", "疾", 
"病", "预", "防", "控", "制", "机", "构", ",", "所", "进", "疫", "苗", "的", "具", "体", "价", "格", "比", "较", "准", "确", "。", "比", "如", "江", "苏", "省", "从", "2019", "年", "开", "始", ",", "所", "有", "有", "价", "疫", "苗", "都", "是", "零", "差", "价", "出", "售", ",", "每", "接", "种", "一", "针", "次", ",", "收", "取", "20", "元", "材", "料", "费", "和", "注", "射", "费", ",", "目", "前", "接", "种", "宫", "颈", "癌", "疫", "苗", ",", "应", "该", "先", "预", "约", "才", "可", "以", "接", "种", "。", "4", "价", "宫", "颈", "疫", "苗", "多", "少", "钱", "-", "有", "来", "医", "生"], "sample_type": "ori", "rel_ids": [1918]} -{"id": 183, "title": "hiit是什么", "context": "hiit是高强度间歇训练,主要是通过进行多组高强度的间隙,和低强度的动作组合训练,这种训练方式能够在短时间内高速燃烧脂肪,非常适合锻炼时间较少或无法长时间坚持锻炼的人。", "question": "什么是HIIT", "sent_token": ["hiit", "是", "高", "强", "度", "间", "歇", "训", "练", ",", "主", "要", "是", "通", "过", "进", "行", "多", "组", "高", "强", "度", "的", "间", "隙", ",", "和", "低", "强", "度", "的", "动", "作", "组", "合", "训", "练", ",", "这", "种", "训", "练", "方", "式", "能", "够", "在", "短", "时", "间", "内", "高", "速", "燃", "烧", "脂", "肪", ",", "非", "常", "适", "合", "锻", "炼", "时", "间", "较", "少", "或", "无", "法", "长", "时", "间", "坚", "持", "锻", "炼", "的", "人", "。", "hiit", "是", "什", "么"], "sample_type": "ori", "rel_ids": [1945]} -{"id": 187, "title": "民生信用卡的客服电话多少?-其他问题知识问答-我爱卡", "context": "民生银行的信用卡的24小时客服电话为400-669-5568,持卡人在办卡或用卡的过程中,有任何疑问,都可以拨打民生银行信用卡客服电话,通过人工客服,来进行咨询。同时,持卡人也可以通过客服电话,办理信用卡激活、修改密码、更改账单日等业务。", "question": "民生信用卡客服", "sent_token": ["民", "生", "银", "行", "的", "信", "用", "卡", "的", "24", "小", "时", "客", "服", "电", "话", "为", "400", "-", "669", "-", "5568", ",", "持", "卡", "人", "在", "办", "卡", "或", "用", "卡", "的", "过", "程", "中", ",", "有", "任", "何", "疑", "问", ",", "都", "可", "以", "拨", "打", "民", "生", "银", "行", "信", "用", "卡", "客", "服", "电", "话", ",", "通", "过", "人", "工", "客", "服", ",", "来", "进", "行", "咨", "询", "。", "同", "时", ",", "持", "卡", "人", "也", "可", "以", "通", "过", "客", "服", "电", "话", ",", "办", "理", "信", "用", "卡", "激", "活", "、", "修", "改", "密", "码", "、", "更", "改", "账", "单", "日", "等", "业", "务", "。", "民", "生", "信", "用", "卡", "的", "客", "服", "电", "话", "多", "少", "?", "-", "其", "他", "问", "题", "知", "识", "问", "答", "-", "我", "爱", "卡"], "sample_type": "ori", "rel_ids": [1949]} -{"id": 194, "title": "", "context": "法令纹位於鼻翼两侧往下延伸至嘴的附近,也称寿带,法令若垂长,亦为长寿之象徵。不过女性多半不喜欢脸上出现法令纹,因为这意味脸部皮肤松弛,是老化的迹象。", "question": "哪里是法令纹?", "sent_token": ["法", "令", "纹", "位", "於", "鼻", "翼", "两", "侧", "往", "下", "延", "伸", "至", "嘴", "的", "附", "近", ",", "也", "称", "寿", "带", ",", "法", "令", "若", "垂", "长", ",", "亦", "为", "长", "寿", "之", "象", "徵", "。", "不", "过", "女", "性", "多", "半", "不", "喜", "欢", "脸", "上", "出", "现", "法", "令", "纹", ",", "因", "为", "这", "意", "味", "脸", "部", "皮", "肤", "松", "弛", ",", "是", "老", "化", "的", "迹", "象", "。"], "sample_type": "ori", "rel_ids": [1956]} -{"id": 204, "title": "婴儿轻微肠炎能自愈吗_妈妈网小百科", "context": "婴儿轻微肠炎不能自愈。肠炎是一种炎症,其发病的原因与胃肠道失调有关联。婴儿胃肠道菌群出现了失调的异常,就会引发肠炎的出现。尽管是比较轻微的肠炎,但还是有炎症的存在。婴儿轻微肠炎需要就医进行治疗,需要吃药促使炎症的消除。", "question": "婴儿轻度肠炎能自愈吗", "sent_token": ["婴", "儿", "轻", "微", "肠", "炎", "不", "能", "自", "愈", "。", "肠", "炎", "是", "一", "种", "炎", "症", ",", "其", "发", "病", "的", "原", "因", "与", "胃", "肠", "道", "失", "调", "有", "关", "联", "。", "婴", "儿", "胃", "肠", "道", "菌", "群", "出", "现", "了", "失", "调", "的", "异", "常", ",", "就", "会", "引", "发", "肠", "炎", "的", "出", "现", "。", "尽", "管", "是", "比", "较", "轻", "微", "的", "肠", "炎", ",", "但", "还", "是", "有", "炎", "症", "的", "存", "在", "。", "婴", "儿", "轻", "微", "肠", "炎", "需", "要", "就", "医", "进", "行", "治", "疗", ",", "需", "要", "吃", "药", "促", "使", "炎", "症", "的", "消", "除", "。", "婴", "儿", "轻", "微", "肠", "炎", "能", "自", "愈", "吗", "_", "妈", "妈", "网", "小", "百", 
"科"], "sample_type": "ori", "rel_ids": [1966]} -{"id": 215, "title": "", "context": "珍珠鸟作者简介冯骥才,当代作家,1942年生于天津,原籍浙江慈溪市人。从小喜爱美术、文学和球类活动。曾当过专业篮球运动员,从事过绘画。", "question": "冯骥才什么时候出生", "sent_token": ["珍", "珠", "鸟", "作", "者", "简", "介", "冯", "骥", "才", ",", "当", "代", "作", "家", ",", "1942", "年", "生", "于", "天", "津", ",", "原", "籍", "浙", "江", "慈", "溪", "市", "人", "。", "从", "小", "喜", "爱", "美", "术", "、", "文", "学", "和", "球", "类", "活", "动", "。", "曾", "当", "过", "专", "业", "篮", "球", "运", "动", "员", ",", "从", "事", "过", "绘", "画", "。"], "sample_type": "ori", "rel_ids": [1977]} -{"id": 221, "title": "哺乳期可以吃维生素b2吗_有问必答_快速问医生", "context": "你好,口腔溃疡一般都是由于维生素缺乏导致的,与口腔炎症和上火也有关,可以服用维生素b2和维生素c治疗。用西瓜皮煮水喝,可以清热去火。局部用口腔溃疡散或者用维生素c研磨成粉末涂抹,都可以有效缓解疼痛。孕妇正常也要补充维生素的,服用维生素b2没有问题的。平时一定要多吃新鲜蔬菜水果,补充维生素,注意口腔卫生,早晚刷牙,饭后用温水漱口,每天早上起床用淡盐水漱口。", "question": "哺乳期能吃维生素b2片吗", "sent_token": ["你", "好", ",", "口", "腔", "溃", "疡", "一", "般", "都", "是", "由", "于", "维", "生", "素", "缺", "乏", "导", "致", "的", ",", "与", "口", "腔", "炎", "症", "和", "上", "火", "也", "有", "关", ",", "可", "以", "服", "用", "维", "生", "素", "b2", "和", "维", "生", "素", "c", "治", "疗", "。", "用", "西", "瓜", "皮", "煮", "水", "喝", ",", "可", "以", "清", "热", "去", "火", "。", "局", "部", "用", "口", "腔", "溃", "疡", "散", "或", "者", "用", "维", "生", "素", "c", "研", "磨", "成", "粉", "末", "涂", "抹", ",", "都", "可", "以", "有", "效", "缓", "解", "疼", "痛", "。", "孕", "妇", "正", "常", "也", "要", "补", "充", "维", "生", "素", "的", ",", "服", "用", "维", "生", "素", "b2", "没", "有", "问", "题", "的", "。", "平", "时", "一", "定", "要", "多", "吃", "新", "鲜", "蔬", "菜", "水", "果", ",", "补", "充", "维", "生", "素", ",", "注", "意", "口", "腔", "卫", "生", ",", "早", "晚", "刷", "牙", ",", "饭", "后", "用", "温", "水", "漱", "口", ",", "每", "天", "早", "上", "起", "床", "用", "淡", "盐", "水", "漱", "口", "。", "哺", "乳", "期", "可", "以", "吃", "维", "生", "素", "b2", "吗", "_", "有", "问", "必", "答", "_", "快", "速", "问", "医", "生"], "sample_type": "ori", "rel_ids": [1983]} -{"id": 231, "title": "6岁儿童吃几颗肠虫清,吃肠虫清需要忌口吗_孕育常识_亲子宝典库_", "context": "肠虫清是六岁儿童就可以服用的一次吃两片,是吃饱饭后吃,肠虫清的主要是驱虫的药物,一般在晚上睡前服用的是比较好的,服药期间要多喝开水,多吃清淡易消化的食物,忌辛辣刺激性食物和油腻煎炸的食物,注意保暖避免着凉。", "question": "6岁儿童吃几颗肠虫清", "sent_token": ["肠", "虫", "清", "是", "六", "岁", "儿", "童", "就", "可", "以", "服", "用", "的", "一", "次", "吃", "两", "片", ",", "是", "吃", "饱", "饭", "后", "吃", ",", "肠", "虫", "清", "的", "主", "要", "是", "驱", "虫", "的", "药", "物", ",", "一", "般", "在", "晚", "上", "睡", "前", "服", "用", "的", "是", "比", "较", "好", "的", ",", "服", "药", "期", "间", "要", "多", "喝", "开", "水", ",", "多", "吃", "清", "淡", "易", "消", "化", "的", "食", "物", ",", "忌", "辛", "辣", "刺", "激", "性", "食", "物", "和", "油", "腻", "煎", "炸", "的", "食", "物", ",", "注", "意", "保", "暖", "避", "免", "着", "凉", "。", "6", "岁", "儿", "童", "吃", "几", "颗", "肠", "虫", "清", ",", "吃", "肠", "虫", "清", "需", "要", "忌", "口", "吗", "_", "孕", "育", "常", "识", "_", "亲", "子", "宝", "典", "库", "_"], "sample_type": "ori", "rel_ids": [1993]} -{"id": 241, "title": "隔阂意味着是什么意思", "context": "隔阂意味着很多意思,通常隔阂就意味着可能双方之间沟通有问题,比如有些夫妻或者是男女朋友之间吵架,两个人一起冷战,两个人由于没有沟通,双方之间的误会和矛盾就会越来越多了,也有可能是两个人总是以争吵的方式来解决问题,像这样的话就达不到有效的沟通,两个人两个人越不沟通,双方之间的矛盾和争吵就会越来越多,这个时候就会产生深深的隔阂。也有可能是双峰之间的价值观完全不同,比如对待某些问题的时候,有些人比较理性,但是有些人会比较感性,这个时候价值观不同的话就非常容易产生隔阂。", "question": "隔阂什么意思", "sent_token": ["隔", "阂", "意", "味", "着", "很", "多", "意", "思", ",", "通", "常", "隔", "阂", "就", "意", "味", "着", "可", "能", "双", "方", "之", "间", "沟", "通", "有", "问", "题", ",", "比", "如", "有", "些", "夫", "妻", "或", "者", "是", "男", "女", "朋", "友", "之", "间", "吵", "架", ",", "两", "个", "人", "一", "起", "冷", "战", ",", "两", "个", "人", "由", "于", "没", "有", "沟", "通", ",", "双", "方", "之", "间", "的", "误", "会", "和", "矛", "盾", "就", "会", "越", "来", "越", "多", "了", ",", 
"也", "有", "可", "能", "是", "两", "个", "人", "总", "是", "以", "争", "吵", "的", "方", "式", "来", "解", "决", "问", "题", ",", "像", "这", "样", "的", "话", "就", "达", "不", "到", "有", "效", "的", "沟", "通", ",", "两", "个", "人", "两", "个", "人", "越", "不", "沟", "通", ",", "双", "方", "之", "间", "的", "矛", "盾", "和", "争", "吵", "就", "会", "越", "来", "越", "多", ",", "这", "个", "时", "候", "就", "会", "产", "生", "深", "深", "的", "隔", "阂", "。", "也", "有", "可", "能", "是", "双", "峰", "之", "间", "的", "价", "值", "观", "完", "全", "不", "同", ",", "比", "如", "对", "待", "某", "些", "问", "题", "的", "时", "候", ",", "有", "些", "人", "比", "较", "理", "性", ",", "但", "是", "有", "些", "人", "会", "比", "较", "感", "性", ",", "这", "个", "时", "候", "价", "值", "观", "不", "同", "的", "话", "就", "非", "常", "容", "易", "产", "生", "隔", "阂", "。", "隔", "阂", "意", "味", "着", "是", "什", "么", "意", "思"], "sample_type": "ori", "rel_ids": [2003]} -{"id": 242, "title": "小儿癫痫病能彻底治愈的吗_有问必答_快速问医生", "context": "你好,很高兴为你服务,目前小儿癫痫是可以治愈的,不同的癫痫类型以及患者的实际病情不同,其适合的治疗方法也是不尽相同的。现在常见的小儿癫痫治疗都是采用中医为基础的治疗方法,这样对患儿的伤害较小,而西医则有很大的副作用,好吧", "question": "小儿癫痫能治愈吗", "sent_token": ["你", "好", ",", "很", "高", "兴", "为", "你", "服", "务", ",", "目", "前", "小", "儿", "癫", "痫", "是", "可", "以", "治", "愈", "的", ",", "不", "同", "的", "癫", "痫", "类", "型", "以", "及", "患", "者", "的", "实", "际", "病", "情", "不", "同", ",", "其", "适", "合", "的", "治", "疗", "方", "法", "也", "是", "不", "尽", "相", "同", "的", "。", "现", "在", "常", "见", "的", "小", "儿", "癫", "痫", "治", "疗", "都", "是", "采", "用", "中", "医", "为", "基", "础", "的", "治", "疗", "方", "法", ",", "这", "样", "对", "患", "儿", "的", "伤", "害", "较", "小", ",", "而", "西", "医", "则", "有", "很", "大", "的", "副", "作", "用", ",", "好", "吧", "小", "儿", "癫", "痫", "病", "能", "彻", "底", "治", "愈", "的", "吗", "_", "有", "问", "必", "答", "_", "快", "速", "问", "医", "生"], "sample_type": "ori", "rel_ids": [2004]} -{"id": 250, "title": "脑内多发腔隙性脑梗死严重吗_39健康问答_39健康网", "context": "脑内多发腔隙性脑梗死,部分软化灶形成,一般不严重,是细枝血管梗塞,引起小灶脑组织坏死,脑组织软化灶,其他部位的脑组织会替代坏死部位的脑组织功能,所以一般没有不适的症状。注意控制血压,清淡饮食,控制血脂,血粘度,精神放松,解除思想顾虑,多做室外文娱体育活动,精神愉快,多接受紫外线照射,多喝开水,会有利于康复。可以根据情况使用疏通血管的药物。", "question": "多发腔隙性脑梗死吃什么中药", "sent_token": ["脑", "内", "多", "发", "腔", "隙", "性", "脑", "梗", "死", ",", "部", "分", "软", "化", "灶", "形", "成", ",", "一", "般", "不", "严", "重", ",", "是", "细", "枝", "血", "管", "梗", "塞", ",", "引", "起", "小", "灶", "脑", "组", "织", "坏", "死", ",", "脑", "组", "织", "软", "化", "灶", ",", "其", "他", "部", "位", "的", "脑", "组", "织", "会", "替", "代", "坏", "死", "部", "位", "的", "脑", "组", "织", "功", "能", ",", "所", "以", "一", "般", "没", "有", "不", "适", "的", "症", "状", "。", "注", "意", "控", "制", "血", "压", ",", "清", "淡", "饮", "食", ",", "控", "制", "血", "脂", ",", "血", "粘", "度", ",", "精", "神", "放", "松", ",", "解", "除", "思", "想", "顾", "虑", ",", "多", "做", "室", "外", "文", "娱", "体", "育", "活", "动", ",", "精", "神", "愉", "快", ",", "多", "接", "受", "紫", "外", "线", "照", "射", ",", "多", "喝", "开", "水", ",", "会", "有", "利", "于", "康", "复", "。", "可", "以", "根", "据", "情", "况", "使", "用", "疏", "通", "血", "管", "的", "药", "物", "。", "脑", "内", "多", "发", "腔", "隙", "性", "脑", "梗", "死", "严", "重", "吗", "_", "39", "健", "康", "问", "答", "_", "39", "健", "康", "网"], "sample_type": "ori", "rel_ids": [2012]} -{"id": 1763, "title": "地瓜是红薯吗", "context": "地瓜不是红薯。地瓜一般生吃或者凉拌,外形是纺锤型的,有明显的瓣状结构,内里的肉是白色的,有清淡的药香味,生吃又脆又甜,常食用可以预防肝癌、胃癌,营养价值非常高。红薯是粗粮,也叫番薯山芋。它是一种属管状花目,旋花科一年生的草本植物,富含丰富的矿物质和维生素,而且非常耐饱。", "question": "马铃薯和红苕指的是同一个物种吗", "sent_token": ["地", "瓜", "不", "是", "红", "薯", "。", "地", "瓜", "一", "般", "生", "吃", "或", "者", "凉", "拌", ",", "外", "形", "是", "纺", "锤", "型", "的", ",", "有", "明", "显", "的", "瓣", "状", "结", "构", ",", "内", "里", "的", "肉", "是", "白", "色", "的", ",", "有", "清", "淡", "的", "药", "香", "味", ",", "生", "吃", 
"又", "脆", "又", "甜", ",", "常", "食", "用", "可", "以", "预", "防", "肝", "癌", "、", "胃", "癌", ",", "营", "养", "价", "值", "非", "常", "高", "。", "红", "薯", "是", "粗", "粮", ",", "也", "叫", "番", "薯", "山", "芋", "。", "它", "是", "一", "种", "属", "管", "状", "花", "目", ",", "旋", "花", "科", "一", "年", "生", "的", "草", "本", "植", "物", ",", "富", "含", "丰", "富", "的", "矿", "物", "质", "和", "维", "生", "素", ",", "而", "且", "非", "常", "耐", "饱", "。", "地", "瓜", "是", "红", "薯", "吗"], "sample_type": "disturb"} -{"id": 1767, "title": "已满多少岁的人犯贩卖毒品罪应负刑事责任", "context": "根据《刑法》第十七条:已满十六周岁的人犯罪,应当负刑事责任。已满十四周岁不满十六周岁的人,犯故意杀人、故意伤害致人重伤或者死亡、强奸、抢劫、贩卖毒品、放火、爆炸、投放危险物质罪的,应当负刑事责任。", "question": "贩卖毒品需要负刑事责任的人要满几周岁", "sent_token": ["根", "据", "《", "刑", "法", "》", "第", "十", "七", "条", ":", "已", "满", "十", "六", "周", "岁", "的", "人", "犯", "罪", ",", "应", "当", "负", "刑", "事", "责", "任", "。", "已", "满", "十", "四", "周", "岁", "不", "满", "十", "六", "周", "岁", "的", "人", ",", "犯", "故", "意", "杀", "人", "、", "故", "意", "伤", "害", "致", "人", "重", "伤", "或", "者", "死", "亡", "、", "强", "奸", "、", "抢", "劫", "、", "贩", "卖", "毒", "品", "、", "放", "火", "、", "爆", "炸", "、", "投", "放", "危", "险", "物", "质", "罪", "的", ",", "应", "当", "负", "刑", "事", "责", "任", "。", "已", "满", "多", "少", "岁", "的", "人", "犯", "贩", "卖", "毒", "品", "罪", "应", "负", "刑", "事", "责", "任"], "sample_type": "disturb"} -{"id": 1772, "title": "读研跟考研有什么区别", "context": "考研和读研的区别在于概念和意义不同。考研是指考生通过考试来得到研究生的入学资格,而考生并不是硕士研究生;而读研是指学生在高校攻读硕士研究生的过程,学生身份已经是硕士研究生。这二者并不等同,而是有先后关系,也就是说考生只有通过考研,才能成为硕士研究生,然后在规定的学习时间内读研。", "question": "考取研究生跟攻读研究生,具体什么区别?", "sent_token": ["考", "研", "和", "读", "研", "的", "区", "别", "在", "于", "概", "念", "和", "意", "义", "不", "同", "。", "考", "研", "是", "指", "考", "生", "通", "过", "考", "试", "来", "得", "到", "研", "究", "生", "的", "入", "学", "资", "格", ",", "而", "考", "生", "并", "不", "是", "硕", "士", "研", "究", "生", ";", "而", "读", "研", "是", "指", "学", "生", "在", "高", "校", "攻", "读", "硕", "士", "研", "究", "生", "的", "过", "程", ",", "学", "生", "身", "份", "已", "经", "是", "硕", "士", "研", "究", "生", "。", "这", "二", "者", "并", "不", "等", "同", ",", "而", "是", "有", "先", "后", "关", "系", ",", "也", "就", "是", "说", "考", "生", "只", "有", "通", "过", "考", "研", ",", "才", "能", "成", "为", "硕", "士", "研", "究", "生", ",", "然", "后", "在", "规", "定", "的", "学", "习", "时", "间", "内", "读", "研", "。", "读", "研", "跟", "考", "研", "有", "什", "么", "区", "别"], "sample_type": "disturb"} -{"id": 1774, "title": "多效唑能和磷酸二氢钾一起用吗", "context": "多效唑能和磷酸二氢钾一起用。多效唑是植物的生长调节剂,主要是控制作物疯长的。而磷酸二氢钾属于叶面肥,施用后可促使作物的叶色更加浓绿,根系发达,药效完全不同,也并不排斥,可以混合使用。不过要注意施用时要严格按照说明施加,不可过量,否则会阻碍生长。", "question": "磷酸一钾能和氯丁唑一起用OK吗", "sent_token": ["多", "效", "唑", "能", "和", "磷", "酸", "二", "氢", "钾", "一", "起", "用", "。", "多", "效", "唑", "是", "植", "物", "的", "生", "长", "调", "节", "剂", ",", "主", "要", "是", "控", "制", "作", "物", "疯", "长", "的", "。", "而", "磷", "酸", "二", "氢", "钾", "属", "于", "叶", "面", "肥", ",", "施", "用", "后", "可", "促", "使", "作", "物", "的", "叶", "色", "更", "加", "浓", "绿", ",", "根", "系", "发", "达", ",", "药", "效", "完", "全", "不", "同", ",", "也", "并", "不", "排", "斥", ",", "可", "以", "混", "合", "使", "用", "。", "不", "过", "要", "注", "意", "施", "用", "时", "要", "严", "格", "按", "照", "说", "明", "施", "加", ",", "不", "可", "过", "量", ",", "否", "则", "会", "阻", "碍", "生", "长", "。", "多", "效", "唑", "能", "和", "磷", "酸", "二", "氢", "钾", "一", "起", "用", "吗"], "sample_type": "disturb"} -{"id": 1776, "title": "猫能吃蛋黄吗", "context": "猫咪是可以吃蛋黄的。这里特定煮熟的白水蛋,猫咪不能吃生鸡蛋,因为生鸡蛋中有细菌,常见的是沙门氏菌,容易引起猫腹泻脱水,而且饲喂猫咪最好的只饲喂蛋黄。虽然可以吃蛋黄,但是需要掌握好量,一般一周最多吃两三次就可了。蛋黄中也含有丰富的胆固醇,易引发猫咪患脂肪肝和高脂血病。", "question": "小猫咪可以吃蛋黄吗,生的", "sent_token": ["猫", "咪", "是", "可", "以", "吃", "蛋", "黄", "的", "。", "这", "里", "特", "定", "煮", "熟", "的", "白", "水", "蛋", ",", 
"猫", "咪", "不", "能", "吃", "生", "鸡", "蛋", ",", "因", "为", "生", "鸡", "蛋", "中", "有", "细", "菌", ",", "常", "见", "的", "是", "沙", "门", "氏", "菌", ",", "容", "易", "引", "起", "猫", "腹", "泻", "脱", "水", ",", "而", "且", "饲", "喂", "猫", "咪", "最", "好", "的", "只", "饲", "喂", "蛋", "黄", "。", "虽", "然", "可", "以", "吃", "蛋", "黄", ",", "但", "是", "需", "要", "掌", "握", "好", "量", ",", "一", "般", "一", "周", "最", "多", "吃", "两", "三", "次", "就", "可", "了", "。", "蛋", "黄", "中", "也", "含", "有", "丰", "富", "的", "胆", "固", "醇", ",", "易", "引", "发", "猫", "咪", "患", "脂", "肪", "肝", "和", "高", "脂", "血", "病", "。", "猫", "能", "吃", "蛋", "黄", "吗"], "sample_type": "disturb"} -{"id": 1780, "title": "最近深圳限行吗", "context": "现在由于疫情的影响,深圳市不限行的了,但是没有必要尽量还是少出门,出门也要做好一系列的防护措施才可以。因为虽然目前国内疫情形势有所缓和,但是这并不意味着疫情的结束,国外疫情形势还是很严峻的,境外输入案例较多。", "question": "近期深圳没有限行吗", "sent_token": ["现", "在", "由", "于", "疫", "情", "的", "影", "响", ",", "深", "圳", "市", "不", "限", "行", "的", "了", ",", "但", "是", "没", "有", "必", "要", "尽", "量", "还", "是", "少", "出", "门", ",", "出", "门", "也", "要", "做", "好", "一", "系", "列", "的", "防", "护", "措", "施", "才", "可", "以", "。", "因", "为", "虽", "然", "目", "前", "国", "内", "疫", "情", "形", "势", "有", "所", "缓", "和", ",", "但", "是", "这", "并", "不", "意", "味", "着", "疫", "情", "的", "结", "束", ",", "国", "外", "疫", "情", "形", "势", "还", "是", "很", "严", "峻", "的", ",", "境", "外", "输", "入", "案", "例", "较", "多", "。", "最", "近", "深", "圳", "限", "行", "吗"], "sample_type": "disturb"} -{"id": 1781, "title": "合同签字不盖章有效吗", "context": "可能有效可能无效。只有签字没有公章的合同是否有法律效力要根据具体情况分析:如果合同是由单位的委托代理人在其权限范围内、或单位的法定代表人签的字,则合同有效。", "question": "一没有签字,二没有盖章的合同,还有法律效用吗", "sent_token": ["可", "能", "有", "效", "可", "能", "无", "效", "。", "只", "有", "签", "字", "没", "有", "公", "章", "的", "合", "同", "是", "否", "有", "法", "律", "效", "力", "要", "根", "据", "具", "体", "情", "况", "分", "析", ":", "如", "果", "合", "同", "是", "由", "单", "位", "的", "委", "托", "代", "理", "人", "在", "其", "权", "限", "范", "围", "内", "、", "或", "单", "位", "的", "法", "定", "代", "表", "人", "签", "的", "字", ",", "则", "合", "同", "有", "效", "。", "合", "同", "签", "字", "不", "盖", "章", "有", "效", "吗"], "sample_type": "disturb"} -{"id": 1789, "title": "", "context": "吴三桂(1612年-1678年10月2日),字长伯,一字月所,明朝辽东人,明末清初著名政治军事人物,吴周政权建立者吴周太祖。", "question": "平西王吴三贵什么朝代", "sent_token": ["吴", "三", "桂", "(", "1612", "年", "-", "1678", "年", "10", "月", "2", "日", ")", ",", "字", "长", "伯", ",", "一", "字", "月", "所", ",", "明", "朝", "辽", "东", "人", ",", "明", "末", "清", "初", "著", "名", "政", "治", "军", "事", "人", "物", ",", "吴", "周", "政", "权", "建", "立", "者", "吴", "周", "太", "祖", "。"], "sample_type": "disturb"} -{"id": 1796, "title": "狗狗为什么互相闻屁股", "context": "相互闻屁股是狗狗打招呼的一种方式。狗狗的嗅觉很敏感,它们可以用相互闻屁股来了解狗狗的配偶状况、饮食习惯等,因为狗狗的屁股后面有两个肛门腺,在肛门腺里面涵盖了很多的信息素。处在发情期的狗狗也会通过闻屁股来挑选自己的配偶。", "question": "狗狗为何总是闻屁股", "sent_token": ["相", "互", "闻", "屁", "股", "是", "狗", "狗", "打", "招", "呼", "的", "一", "种", "方", "式", "。", "狗", "狗", "的", "嗅", "觉", "很", "敏", "感", ",", "它", "们", "可", "以", "用", "相", "互", "闻", "屁", "股", "来", "了", "解", "狗", "狗", "的", "配", "偶", "状", "况", "、", "饮", "食", "习", "惯", "等", ",", "因", "为", "狗", "狗", "的", "屁", "股", "后", "面", "有", "两", "个", "肛", "门", "腺", ",", "在", "肛", "门", "腺", "里", "面", "涵", "盖", "了", "很", "多", "的", "信", "息", "素", "。", "处", "在", "发", "情", "期", "的", "狗", "狗", "也", "会", "通", "过", "闻", "屁", "股", "来", "挑", "选", "自", "己", "的", "配", "偶", "。", "狗", "狗", "为", "什", "么", "互", "相", "闻", "屁", "股"], "sample_type": "disturb"} -{"id": 1798, "title": "出租房隔音差怎么解决", "context": "可以在窗户上贴一层隔音膜,在粘贴过程中要注意,不要出现气泡,以免影响隔音效果。若想要隔音效果更好点,还可以购买一些密封条安装在窗户缝隙处,这也能起到更好的隔音效果。另外,室内使用的家具可以更换成木质的,这样同样能起到一定的吸音效果。", "question": "出租房隔音不好如何解决", "sent_token": ["可", "以", "在", "窗", "户", 
"上", "贴", "一", "层", "隔", "音", "膜", ",", "在", "粘", "贴", "过", "程", "中", "要", "注", "意", ",", "不", "要", "出", "现", "气", "泡", ",", "以", "免", "影", "响", "隔", "音", "效", "果", "。", "若", "想", "要", "隔", "音", "效", "果", "更", "好", "点", ",", "还", "可", "以", "购", "买", "一", "些", "密", "封", "条", "安", "装", "在", "窗", "户", "缝", "隙", "处", ",", "这", "也", "能", "起", "到", "更", "好", "的", "隔", "音", "效", "果", "。", "另", "外", ",", "室", "内", "使", "用", "的", "家", "具", "可", "以", "更", "换", "成", "木", "质", "的", ",", "这", "样", "同", "样", "能", "起", "到", "一", "定", "的", "吸", "音", "效", "果", "。", "出", "租", "房", "隔", "音", "差", "怎", "么", "解", "决"], "sample_type": "disturb"} -{"id": 1802, "title": "鬼迷心窍(李宗盛演唱歌曲)_百度百科", "context": "《鬼迷心窍》是1992年黄日华、周海媚主演台湾电视剧《末代皇孙》的主题曲,是由李宗盛作词、作曲、演唱,收录于1992年影视剧音乐合辑《滚石九大天王之十二出好戏》当中。1993年,李宗盛凭借该曲获得第一届新加坡醉心金曲奖最佳作词奖", "question": "谁是鬼迷心窍的原唱", "sent_token": ["《", "鬼", "迷", "心", "窍", "》", "是", "1992", "年", "黄", "日", "华", "、", "周", "海", "媚", "主", "演", "台", "湾", "电", "视", "剧", "《", "末", "代", "皇", "孙", "》", "的", "主", "题", "曲", ",", "是", "由", "李", "宗", "盛", "作", "词", "、", "作", "曲", "、", "演", "唱", ",", "收", "录", "于", "1992", "年", "影", "视", "剧", "音", "乐", "合", "辑", "《", "滚", "石", "九", "大", "天", "王", "之", "十", "二", "出", "好", "戏", "》", "当", "中", "。", "1993", "年", ",", "李", "宗", "盛", "凭", "借", "该", "曲", "获", "得", "第", "一", "届", "新", "加", "坡", "醉", "心", "金", "曲", "奖", "最", "佳", "作", "词", "奖", "鬼", "迷", "心", "窍", "(", "李", "宗", "盛", "演", "唱", "歌", "曲", ")", "_", "百", "度", "百", "科"], "sample_type": "disturb"} -{"id": 1803, "title": "", "context": "白龙马,名著小说《西游记》中的重要角色。本是西海龙王三太子,因纵火烧毁玉帝赏赐的明珠而被西海龙王上天告忤逆,要被斩首。后因南海观世菩萨出面才免于死罪,被贬到蛇盘山鹰愁涧等待唐僧取经。之后又误吃唐僧所骑的白马,被菩萨点化,变身为白龙。", "question": "西游记中的白龙马,它的原始身份是什么", "sent_token": ["白", "龙", "马", ",", "名", "著", "小", "说", "《", "西", "游", "记", "》", "中", "的", "重", "要", "角", "色", "。", "本", "是", "西", "海", "龙", "王", "三", "太", "子", ",", "因", "纵", "火", "烧", "毁", "玉", "帝", "赏", "赐", "的", "明", "珠", "而", "被", "西", "海", "龙", "王", "上", "天", "告", "忤", "逆", ",", "要", "被", "斩", "首", "。", "后", "因", "南", "海", "观", "世", "菩", "萨", "出", "面", "才", "免", "于", "死", "罪", ",", "被", "贬", "到", "蛇", "盘", "山", "鹰", "愁", "涧", "等", "待", "唐", "僧", "取", "经", "。", "之", "后", "又", "误", "吃", "唐", "僧", "所", "骑", "的", "白", "马", ",", "被", "菩", "萨", "点", "化", ",", "变", "身", "为", "白", "龙", "。"], "sample_type": "disturb"} -{"id": 1805, "title": "", "context": "《湮灭》是由派拉蒙影业出品的科幻惊悚片,这部电影集合了科幻、悬疑、惊悚等时下流行的元素,由亚历克斯·加兰执导,娜塔莉·波特曼、詹妮弗·杰森·李、吉娜·罗德里格兹、泰莎·汤普森联合主演。该片于2018年2月23日在美国上映。影片根据杰夫·梵德米尔所著《遗落的南境》三部曲的首部同名小说改编,讲述了生物学家莉娜为了自己的丈夫,她自愿加入了科学考察探险小队,去研究美国领土一块被检疫隔离的生态灾害区域的故事。", "question": "湮灭是什么类型的电影", "sent_token": ["《", "湮", "灭", "》", "是", "由", "派", "拉", "蒙", "影", "业", "出", "品", "的", "科", "幻", "惊", "悚", "片", ",", "这", "部", "电", "影", "集", "合", "了", "科", "幻", "、", "悬", "疑", "、", "惊", "悚", "等", "时", "下", "流", "行", "的", "元", "素", ",", "由", "亚", "历", "克", "斯", "·", "加", "兰", "执", "导", ",", "娜", "塔", "莉", "·", "波", "特", "曼", "、", "詹", "妮", "弗", "·", "杰", "森", "·", "李", "、", "吉", "娜", "·", "罗", "德", "里", "格", "兹", "、", "泰", "莎", "·", "汤", "普", "森", "联", "合", "主", "演", "。", "该", "片", "于", "2018", "年", "2", "月", "23", "日", "在", "美", "国", "上", "映", "。", "影", "片", "根", "据", "杰", "夫", "·", "梵", "德", "米", "尔", "所", "著", "《", "遗", "落", "的", "南", "境", "》", "三", "部", "曲", "的", "首", "部", "同", "名", "小", "说", "改", "编", ",", "讲", "述", "了", "生", "物", "学", "家", "莉", "娜", "为", "了", "自", "己", "的", "丈", "夫", ",", "她", "自", "愿", "加", "入", "了", "科", "学", "考", "察", "探", "险", "小", "队", ",", "去", "研", "究", "美", "国", "领", "土", "一", "块", "被", "检", "疫", "隔", "离", "的", "生", "态", "灾", "害", 
"区", "域", "的", "故", "事", "。"], "sample_type": "disturb"} -{"id": 1807, "title": "", "context": "网球与高尔夫、保龄球、桌球并成为世界四大绅士运动,他的起源可以追溯到12-13世纪的法国。网球运动的起源及演变可以用四句话来概括:网球孕育在法国,诞生在英国,开始普及和形成高潮在美国,现盛行全世界。", "question": "网球发源于哪国", "sent_token": ["网", "球", "与", "高", "尔", "夫", "、", "保", "龄", "球", "、", "桌", "球", "并", "成", "为", "世", "界", "四", "大", "绅", "士", "运", "动", ",", "他", "的", "起", "源", "可", "以", "追", "溯", "到", "12", "-", "13", "世", "纪", "的", "法", "国", "。", "网", "球", "运", "动", "的", "起", "源", "及", "演", "变", "可", "以", "用", "四", "句", "话", "来", "概", "括", ":", "网", "球", "孕", "育", "在", "法", "国", ",", "诞", "生", "在", "英", "国", ",", "开", "始", "普", "及", "和", "形", "成", "高", "潮", "在", "美", "国", ",", "现", "盛", "行", "全", "世", "界", "。"], "sample_type": "disturb"} -{"id": 1810, "title": "单人挑战巫女大蛇悲鸣需要多少体力_单人挑战巫女大蛇悲鸣需要体力", "context": "玩家挑战巫女大蛇悲鸣的体力消耗是普通御魂副本的2倍。阴阳师巫女大蛇悲鸣单人通关需要12点体力组队通关的话只需要8点体力,挑战巫女大蛇悲鸣的体力消耗是普通御魂副本的2倍。奖励掉落5星与6星御魂,经验强化狗粮4星青吉鬼。在御魂副本1-10层原本掉落的基础上,巫女大蛇·悲鸣新增了蚌精、幽谷响、轮入道、蝠翼、狂骨这5种御魂的掉落,每日掉落御魂种类增加到5。", "question": "阴阳师 组队挑战大蛇悲鸣需要多少体力", "sent_token": ["玩", "家", "挑", "战", "巫", "女", "大", "蛇", "悲", "鸣", "的", "体", "力", "消", "耗", "是", "普", "通", "御", "魂", "副", "本", "的", "2", "倍", "。", "阴", "阳", "师", "巫", "女", "大", "蛇", "悲", "鸣", "单", "人", "通", "关", "需", "要", "12", "点", "体", "力", "组", "队", "通", "关", "的", "话", "只", "需", "要", "8", "点", "体", "力", ",", "挑", "战", "巫", "女", "大", "蛇", "悲", "鸣", "的", "体", "力", "消", "耗", "是", "普", "通", "御", "魂", "副", "本", "的", "2", "倍", "。", "奖", "励", "掉", "落", "5", "星", "与", "6", "星", "御", "魂", ",", "经", "验", "强", "化", "狗", "粮", "4", "星", "青", "吉", "鬼", "。", "在", "御", "魂", "副", "本", "1", "-", "10", "层", "原", "本", "掉", "落", "的", "基", "础", "上", ",", "巫", "女", "大", "蛇", "·", "悲", "鸣", "新", "增", "了", "蚌", "精", "、", "幽", "谷", "响", "、", "轮", "入", "道", "、", "蝠", "翼", "、", "狂", "骨", "这", "5", "种", "御", "魂", "的", "掉", "落", ",", "每", "日", "掉", "落", "御", "魂", "种", "类", "增", "加", "到", "5", "。", "单", "人", "挑", "战", "巫", "女", "大", "蛇", "悲", "鸣", "需", "要", "多", "少", "体", "力", "_", "单", "人", "挑", "战", "巫", "女", "大", "蛇", "悲", "鸣", "需", "要", "体", "力"], "sample_type": "disturb"} -{"id": 1815, "title": "", "context": "心脏是脊椎动物身体中最重要的一个器官,人类的心脏位于胸腔中部偏左,体积约相当于一个拳头大小,重量约350克。女性的心脏通常要比男性的体积小且重量轻。人的心脏外形像桃子,位于横膈之上,两肺间而偏左。", "question": "人类心脏有多重", "sent_token": ["心", "脏", "是", "脊", "椎", "动", "物", "身", "体", "中", "最", "重", "要", "的", "一", "个", "器", "官", ",", "人", "类", "的", "心", "脏", "位", "于", "胸", "腔", "中", "部", "偏", "左", ",", "体", "积", "约", "相", "当", "于", "一", "个", "拳", "头", "大", "小", ",", "重", "量", "约", "350", "克", "。", "女", "性", "的", "心", "脏", "通", "常", "要", "比", "男", "性", "的", "体", "积", "小", "且", "重", "量", "轻", "。", "人", "的", "心", "脏", "外", "形", "像", "桃", "子", ",", "位", "于", "横", "膈", "之", "上", ",", "两", "肺", "间", "而", "偏", "左", "。"], "sample_type": "disturb"} -{"id": 1816, "title": "紫菜变成紫色还能吃吗-有来医生", "context": "如果紫菜变成紫色的情况下,主要考虑还是紫菜受潮引起的,紫菜受潮以后容易滋生细菌,营养物质也会丧失,口感也会变差,一般情况下,建议不要食用,以免导致消化道的不良反应。紫菜中含有的营养物质是很丰富的,含有丰富的锌元素和铁元素,每天适当的吃一点,可以预防缺铁性贫血,可以预防缺锌引起的反复性口腔溃疡,可以增进食欲。", "question": "海苔回潮了还能吃不", "sent_token": ["如", "果", "紫", "菜", "变", "成", "紫", "色", "的", "情", "况", "下", ",", "主", "要", "考", "虑", "还", "是", "紫", "菜", "受", "潮", "引", "起", "的", ",", "紫", "菜", "受", "潮", "以", "后", "容", "易", "滋", "生", "细", "菌", ",", "营", "养", "物", "质", "也", "会", "丧", "失", ",", "口", "感", "也", "会", "变", "差", ",", "一", "般", "情", "况", "下", ",", "建", "议", "不", "要", "食", "用", ",", "以", "免", "导", "致", "消", "化", "道", "的", "不", "良", "反", "应", "。", "紫", "菜", "中", "含", "有", "的", "营", "养", "物", "质", "是", "很", "丰", "富", "的", ",", "含", "有", "丰", "富", "的", "锌", 
"元", "素", "和", "铁", "元", "素", ",", "每", "天", "适", "当", "的", "吃", "一", "点", ",", "可", "以", "预", "防", "缺", "铁", "性", "贫", "血", ",", "可", "以", "预", "防", "缺", "锌", "引", "起", "的", "反", "复", "性", "口", "腔", "溃", "疡", ",", "可", "以", "增", "进", "食", "欲", "。", "紫", "菜", "变", "成", "紫", "色", "还", "能", "吃", "吗", "-", "有", "来", "医", "生"], "sample_type": "disturb"} -{"id": 1830, "title": "", "context": "钢铁侠是由美国漫威电影工作室出品的一部科幻冒险电影,改编自同名系列漫画。穿上盔甲后,托尼变身成了复仇者联盟中惩恶扬善的钢铁侠。复仇者联盟2:奥创纪元钢铁侠是美国演员小罗伯特·唐尼演的。小罗伯特唐尼的电影钢铁侠扮演者小罗伯特·。", "question": "谁演过钢铁侠", "sent_token": ["钢", "铁", "侠", "是", "由", "美", "国", "漫", "威", "电", "影", "工", "作", "室", "出", "品", "的", "一", "部", "科", "幻", "冒", "险", "电", "影", ",", "改", "编", "自", "同", "名", "系", "列", "漫", "画", "。", "穿", "上", "盔", "甲", "后", ",", "托", "尼", "变", "身", "成", "了", "复", "仇", "者", "联", "盟", "中", "惩", "恶", "扬", "善", "的", "钢", "铁", "侠", "。", "复", "仇", "者", "联", "盟", "2", ":", "奥", "创", "纪", "元", "钢", "铁", "侠", "是", "美", "国", "演", "员", "小", "罗", "伯", "特", "·", "唐", "尼", "演", "的", "。", "小", "罗", "伯", "特", "唐", "尼", "的", "电", "影", "钢", "铁", "侠", "扮", "演", "者", "小", "罗", "伯", "特", "·", "。"], "sample_type": "disturb"} -{"id": 1831, "title": "人间正道是沧桑是什么意思_酷知经验网", "context": "天若有情天亦老,人间正道是沧桑:上句借用李贺《金铜仙人辞汉歌》中诗句,原诗说的是汉武帝时制作的极贵重的宝物金铜仙人像,在三国时被魏明帝由长安迁往洛阳的传说。原句的意思是,对于这样的人间恨事,天若有情,也要因悲伤而衰老。人间正道,社会发展的正常规律。沧桑,沧海(大海)变为桑田,多指巨大的变化,这里比喻的是革命的道路艰难曲折。", "question": "人间正道是沧桑前面是什么", "sent_token": ["天", "若", "有", "情", "天", "亦", "老", ",", "人", "间", "正", "道", "是", "沧", "桑", ":", "上", "句", "借", "用", "李", "贺", "《", "金", "铜", "仙", "人", "辞", "汉", "歌", "》", "中", "诗", "句", ",", "原", "诗", "说", "的", "是", "汉", "武", "帝", "时", "制", "作", "的", "极", "贵", "重", "的", "宝", "物", "金", "铜", "仙", "人", "像", ",", "在", "三", "国", "时", "被", "魏", "明", "帝", "由", "长", "安", "迁", "往", "洛", "阳", "的", "传", "说", "。", "原", "句", "的", "意", "思", "是", ",", "对", "于", "这", "样", "的", "人", "间", "恨", "事", ",", "天", "若", "有", "情", ",", "也", "要", "因", "悲", "伤", "而", "衰", "老", "。", "人", "间", "正", "道", ",", "社", "会", "发", "展", "的", "正", "常", "规", "律", "。", "沧", "桑", ",", "沧", "海", "(", "大", "海", ")", "变", "为", "桑", "田", ",", "多", "指", "巨", "大", "的", "变", "化", ",", "这", "里", "比", "喻", "的", "是", "革", "命", "的", "道", "路", "艰", "难", "曲", "折", "。", "人", "间", "正", "道", "是", "沧", "桑", "是", "什", "么", "意", "思", "_", "酷", "知", "经", "验", "网"], "sample_type": "disturb"} -{"id": 1834, "title": "", "context": "《艺妓回忆录》根据美国作家阿瑟-高顿的同名小说改编。于2005年12月1日上映,由章子怡·巩俐·杨紫琼等共同演绎。是一部时长约140分钟的电影。全篇充满着古典美,时代背景从1929年开始延续到二战结束,女主人公回忆了自己从小拼命挣扎、历尽荣辱的人生经历。该片获得2006年第78届奥斯卡金像奖最佳摄影、最佳艺术指导、最佳服装设计三项奖项。", "question": "艺伎回忆录片长有多久", "sent_token": ["《", "艺", "妓", "回", "忆", "录", "》", "根", "据", "美", "国", "作", "家", "阿", "瑟", "-", "高", "顿", "的", "同", "名", "小", "说", "改", "编", "。", "于", "2005", "年", "12", "月", "1", "日", "上", "映", ",", "由", "章", "子", "怡", "·", "巩", "俐", "·", "杨", "紫", "琼", "等", "共", "同", "演", "绎", "。", "是", "一", "部", "时", "长", "约", "140", "分", "钟", "的", "电", "影", "。", "全", "篇", "充", "满", "着", "古", "典", "美", ",", "时", "代", "背", "景", "从", "1929", "年", "开", "始", "延", "续", "到", "二", "战", "结", "束", ",", "女", "主", "人", "公", "回", "忆", "了", "自", "己", "从", "小", "拼", "命", "挣", "扎", "、", "历", "尽", "荣", "辱", "的", "人", "生", "经", "历", "。", "该", "片", "获", "得", "2006", "年", "第", "78", "届", "奥", "斯", "卡", "金", "像", "奖", "最", "佳", "摄", "影", "、", "最", "佳", "艺", "术", "指", "导", "、", "最", "佳", "服", "装", "设", "计", "三", "项", "奖", "项", "。"], "sample_type": "disturb"} -{"id": 1839, "title": "痛风挂哪个科室比较好?_39健康问答_39健康网", "context": 
"痛风属于代谢风湿性疾病,目前主要是在风湿免疫科治疗,所以患者需要挂风湿免疫科。风湿免疫科在绝大多数三级甲等医院都有独立的科室。由于这个科是一个新兴学科,在很多县级医院还没有成立,患者可以到内分泌科就诊,挂内分泌科。如果这两个科都没有患者,可以到骨科就诊,因为痛风首发表现是急性痛风性关节炎,骨科大夫对痛风也有一定的了解。", "question": "痛风属于什么类型疾病呢", "sent_token": ["痛", "风", "属", "于", "代", "谢", "风", "湿", "性", "疾", "病", ",", "目", "前", "主", "要", "是", "在", "风", "湿", "免", "疫", "科", "治", "疗", ",", "所", "以", "患", "者", "需", "要", "挂", "风", "湿", "免", "疫", "科", "。", "风", "湿", "免", "疫", "科", "在", "绝", "大", "多", "数", "三", "级", "甲", "等", "医", "院", "都", "有", "独", "立", "的", "科", "室", "。", "由", "于", "这", "个", "科", "是", "一", "个", "新", "兴", "学", "科", ",", "在", "很", "多", "县", "级", "医", "院", "还", "没", "有", "成", "立", ",", "患", "者", "可", "以", "到", "内", "分", "泌", "科", "就", "诊", ",", "挂", "内", "分", "泌", "科", "。", "如", "果", "这", "两", "个", "科", "都", "没", "有", "患", "者", ",", "可", "以", "到", "骨", "科", "就", "诊", ",", "因", "为", "痛", "风", "首", "发", "表", "现", "是", "急", "性", "痛", "风", "性", "关", "节", "炎", ",", "骨", "科", "大", "夫", "对", "痛", "风", "也", "有", "一", "定", "的", "了", "解", "。", "痛", "风", "挂", "哪", "个", "科", "室", "比", "较", "好", "?", "_", "39", "健", "康", "问", "答", "_", "39", "健", "康", "网"], "sample_type": "disturb"} -{"id": 1844, "title": "阴阳师武士之灵生前被谁所杀_游侠网", "context": "由武士死后的灵魂化成。生前一直为主人效忠,最后献出了生命。从武士之灵的传记中可以得知,武士之灵生前是被茨木童子所击杀。该问题来自游戏内的逢魔密信,正确回答问题之后就有机会获得包括金币、体力、勾玉和结界卡在内的多种游戏内道具物资奖励。", "question": "武士之灵生前被谁所杀", "sent_token": ["由", "武", "士", "死", "后", "的", "灵", "魂", "化", "成", "。", "生", "前", "一", "直", "为", "主", "人", "效", "忠", ",", "最", "后", "献", "出", "了", "生", "命", "。", "从", "武", "士", "之", "灵", "的", "传", "记", "中", "可", "以", "得", "知", ",", "武", "士", "之", "灵", "生", "前", "是", "被", "茨", "木", "童", "子", "所", "击", "杀", "。", "该", "问", "题", "来", "自", "游", "戏", "内", "的", "逢", "魔", "密", "信", ",", "正", "确", "回", "答", "问", "题", "之", "后", "就", "有", "机", "会", "获", "得", "包", "括", "金", "币", "、", "体", "力", "、", "勾", "玉", "和", "结", "界", "卡", "在", "内", "的", "多", "种", "游", "戏", "内", "道", "具", "物", "资", "奖", "励", "。", "阴", "阳", "师", "武", "士", "之", "灵", "生", "前", "被", "谁", "所", "杀", "_", "游", "侠", "网"], "sample_type": "disturb"} -{"id": 1850, "title": "中医肾主什么-有来医生", "context": "根据中医基础理论,肾主水、主纳气、主二便、主藏精。人体的生长的生命过程与肾中精气的盛衰有着密切的关系,肾主水,是指全身的水液代谢都是在肾阳的气化温煦作用下,从而分布到全身,然后再通过呼吸、二便将代谢废物排除体外。肾主纳气,是指肾能够使人体维持正常的呼吸深度。肾主二便,人的大小便需要在肾的作用下,才能够正常的排泄,否则就会出现异常的改变,比如大小便失禁、大便稀薄等情况。肾主藏精,是指五脏六腑化生的精气,最后都是储存在肾脏,反过来肾脏所藏的精气,又能够推动各脏腑的功能。", "question": "肾主什么", "sent_token": ["根", "据", "中", "医", "基", "础", "理", "论", ",", "肾", "主", "水", "、", "主", "纳", "气", "、", "主", "二", "便", "、", "主", "藏", "精", "。", "人", "体", "的", "生", "长", "的", "生", "命", "过", "程", "与", "肾", "中", "精", "气", "的", "盛", "衰", "有", "着", "密", "切", "的", "关", "系", ",", "肾", "主", "水", ",", "是", "指", "全", "身", "的", "水", "液", "代", "谢", "都", "是", "在", "肾", "阳", "的", "气", "化", "温", "煦", "作", "用", "下", ",", "从", "而", "分", "布", "到", "全", "身", ",", "然", "后", "再", "通", "过", "呼", "吸", "、", "二", "便", "将", "代", "谢", "废", "物", "排", "除", "体", "外", "。", "肾", "主", "纳", "气", ",", "是", "指", "肾", "能", "够", "使", "人", "体", "维", "持", "正", "常", "的", "呼", "吸", "深", "度", "。", "肾", "主", "二", "便", ",", "人", "的", "大", "小", "便", "需", "要", "在", "肾", "的", "作", "用", "下", ",", "才", "能", "够", "正", "常", "的", "排", "泄", ",", "否", "则", "就", "会", "出", "现", "异", "常", "的", "改", "变", ",", "比", "如", "大", "小", "便", "失", "禁", "、", "大", "便", "稀", "薄", "等", "情", "况", "。", "肾", "主", "藏", "精", ",", "是", "指", "五", "脏", "六", "腑", "化", "生", "的", "精", "气", ",", "最", "后", "都", "是", "储", "存", "在", "肾", "脏", ",", "反", "过", "来", "肾", "脏", "所", "藏", "的", "精", "气", ",", "又", "能", "够", "推", "动", "各", "脏", "腑", "的", "功", "能", "。", 
"中", "医", "肾", "主", "什", "么", "-", "有", "来", "医", "生"], "sample_type": "disturb"} -{"id": 1853, "title": "1963年属什么生肖年_十二生肖_卜易居", "context": "1963年中苏公开论战、美国黑人民权运动兴起、肯尼迪遇刺等事件震动世界。1963年属什么生肖年,葵卯兔年,属兔之人举止文雅,谈吐随和,为人恭良谦逊,与人交往如慕春风,学习能力超群,敏捷果断,安贫乐道。虽性子柔弱,但韧性极强,绝境之中能力惊人,缺点则是难以坚持原则,随波逐流。", "question": "1963年属什么生肖", "sent_token": ["1963", "年", "中", "苏", "公", "开", "论", "战", "、", "美", "国", "黑", "人", "民", "权", "运", "动", "兴", "起", "、", "肯", "尼", "迪", "遇", "刺", "等", "事", "件", "震", "动", "世", "界", "。", "1963", "年", "属", "什", "么", "生", "肖", "年", ",", "葵", "卯", "兔", "年", ",", "属", "兔", "之", "人", "举", "止", "文", "雅", ",", "谈", "吐", "随", "和", ",", "为", "人", "恭", "良", "谦", "逊", ",", "与", "人", "交", "往", "如", "慕", "春", "风", ",", "学", "习", "能", "力", "超", "群", ",", "敏", "捷", "果", "断", ",", "安", "贫", "乐", "道", "。", "虽", "性", "子", "柔", "弱", ",", "但", "韧", "性", "极", "强", ",", "绝", "境", "之", "中", "能", "力", "惊", "人", ",", "缺", "点", "则", "是", "难", "以", "坚", "持", "原", "则", ",", "随", "波", "逐", "流", "。", "1963", "年", "属", "什", "么", "生", "肖", "年", "_", "十", "二", "生", "肖", "_", "卜", "易", "居"], "sample_type": "disturb"} -{"id": 1854, "title": "食管和食道一样吗-有来医生", "context": "食管和食道是没有区别的,食管是医学上的称谓,而食道是民间的一种说法。两者都指从咽喉部到胃贲门之间的管道。食管是距门齿15cm处为食管的入口处,经过胸腔之后通过贲门口也就是膈肌孔与胃相连。食管可以分为颈段和胸段,而胸段又分为胸上段、胸中段和胸下段。食管本身有3个生理性的狭窄,这也是某些食管疾病发生的基础。常见的食管疾病包括食管炎、食管息肉、食管癌、食管狭窄、胃食管反流症、巴雷特食管等。可以通过消化道造影以及胃镜来进一步明确。", "question": "食管跟食道有什么不同", "sent_token": ["食", "管", "和", "食", "道", "是", "没", "有", "区", "别", "的", ",", "食", "管", "是", "医", "学", "上", "的", "称", "谓", ",", "而", "食", "道", "是", "民", "间", "的", "一", "种", "说", "法", "。", "两", "者", "都", "指", "从", "咽", "喉", "部", "到", "胃", "贲", "门", "之", "间", "的", "管", "道", "。", "食", "管", "是", "距", "门", "齿", "15cm", "处", "为", "食", "管", "的", "入", "口", "处", ",", "经", "过", "胸", "腔", "之", "后", "通", "过", "贲", "门", "口", "也", "就", "是", "膈", "肌", "孔", "与", "胃", "相", "连", "。", "食", "管", "可", "以", "分", "为", "颈", "段", "和", "胸", "段", ",", "而", "胸", "段", "又", "分", "为", "胸", "上", "段", "、", "胸", "中", "段", "和", "胸", "下", "段", "。", "食", "管", "本", "身", "有", "3", "个", "生", "理", "性", "的", "狭", "窄", ",", "这", "也", "是", "某", "些", "食", "管", "疾", "病", "发", "生", "的", "基", "础", "。", "常", "见", "的", "食", "管", "疾", "病", "包", "括", "食", "管", "炎", "、", "食", "管", "息", "肉", "、", "食", "管", "癌", "、", "食", "管", "狭", "窄", "、", "胃", "食", "管", "反", "流", "症", "、", "巴", "雷", "特", "食", "管", "等", "。", "可", "以", "通", "过", "消", "化", "道", "造", "影", "以", "及", "胃", "镜", "来", "进", "一", "步", "明", "确", "。", "食", "管", "和", "食", "道", "一", "样", "吗", "-", "有", "来", "医", "生"], "sample_type": "disturb"} -{"id": 1863, "title": "农历六月二十四是什么星座-星座乐", "context": "农历六月二十四是狮子座。狮子座,火象星座,位于黄道十二宫之第五宫,出生日期为阳历7月23日-8月22日。狮子座是英雄主义者,他们乐观,乐于助人,喜欢帮助弱势群体。他们天生自带光环,特立独行,做事豪爽大气,讲话淡定从容,从不扭扭捏捏畏畏缩缩。而且心思细腻,做事完整准确,善于将自己的优点发挥到极致。", "question": "星座查询:中国阴历六月二十四", "sent_token": ["农", "历", "六", "月", "二", "十", "四", "是", "狮", "子", "座", "。", "狮", "子", "座", ",", "火", "象", "星", "座", ",", "位", "于", "黄", "道", "十", "二", "宫", "之", "第", "五", "宫", ",", "出", "生", "日", "期", "为", "阳", "历", "7", "月", "23", "日", "-", "8", "月", "22", "日", "。", "狮", "子", "座", "是", "英", "雄", "主", "义", "者", ",", "他", "们", "乐", "观", ",", "乐", "于", "助", "人", ",", "喜", "欢", "帮", "助", "弱", "势", "群", "体", "。", "他", "们", "天", "生", "自", "带", "光", "环", ",", "特", "立", "独", "行", ",", "做", "事", "豪", "爽", "大", "气", ",", "讲", "话", "淡", "定", "从", "容", ",", "从", "不", "扭", "扭", "捏", "捏", "畏", "畏", "缩", "缩", "。", "而", "且", "心", "思", "细", "腻", ",", "做", "事", "完", "整", "准", "确", ",", "善", "于", "将", "自", "己", "的", "优", "点", "发", "挥", "到", "极", "致", "。", "农", "历", "六", "月", 
"二", "十", "四", "是", "什", "么", "星", "座", "-", "星", "座", "乐"], "sample_type": "disturb"} -{"id": 1867, "title": "", "context": "非法持有海洛因10克以上就构成非法持有毒品罪非法持有毒品罪,是指明知是鸦片、海洛因、甲基苯丙胺或者其他毒品,而非法持有且数量较大的行为。非法持有毒品达到一定数量才构成犯罪。", "question": "携带多少克吗啡类毒品,就已经算犯罪了", "sent_token": ["非", "法", "持", "有", "海", "洛", "因", "10", "克", "以", "上", "就", "构", "成", "非", "法", "持", "有", "毒", "品", "罪", "非", "法", "持", "有", "毒", "品", "罪", ",", "是", "指", "明", "知", "是", "鸦", "片", "、", "海", "洛", "因", "、", "甲", "基", "苯", "丙", "胺", "或", "者", "其", "他", "毒", "品", ",", "而", "非", "法", "持", "有", "且", "数", "量", "较", "大", "的", "行", "为", "。", "非", "法", "持", "有", "毒", "品", "达", "到", "一", "定", "数", "量", "才", "构", "成", "犯", "罪", "。"], "sample_type": "disturb"} -{"id": 1877, "title": "地方志书每几年左右编修一次_高三网", "context": "地方志书每20年左右编修一次。每一轮地方志书编修工作完成后,负责地方志工作的机构在编纂地方综合年鉴、搜集资料以及向社会提供咨询服务的同时,启动新一轮地方志书的续修工作。", "question": "那种用来记述地方情况的史志,一般都是多少年修一次", "sent_token": ["地", "方", "志", "书", "每", "20", "年", "左", "右", "编", "修", "一", "次", "。", "每", "一", "轮", "地", "方", "志", "书", "编", "修", "工", "作", "完", "成", "后", ",", "负", "责", "地", "方", "志", "工", "作", "的", "机", "构", "在", "编", "纂", "地", "方", "综", "合", "年", "鉴", "、", "搜", "集", "资", "料", "以", "及", "向", "社", "会", "提", "供", "咨", "询", "服", "务", "的", "同", "时", ",", "启", "动", "新", "一", "轮", "地", "方", "志", "书", "的", "续", "修", "工", "作", "。", "地", "方", "志", "书", "每", "几", "年", "左", "右", "编", "修", "一", "次", "_", "高", "三", "网"], "sample_type": "disturb"} -{"id": 1879, "title": "", "context": "《正气歌》是南宋诗人文天祥在狱中写的一首五言古诗。表达了作者忠君爱国、为国捐躯,忧国之痛和愿意以死明志、为国捐躯的豪情壮志的思想感情。诗的开头即点出浩然正气存乎天地之间,至时穷之际,必然会显示出来。随后连用十二个典故,都是历史上有名的人物,他们的所作所为凛然显示出浩然正气的力量。接下来八句说明浩然正气贯日月,立天地,为三纲之命,道义之根。最后联系到自己的命运,自己虽然兵败被俘,处在极其恶劣的牢狱之中,但是由于自己一身正气,各种邪气和疾病都不能侵犯自己,因此自己能够坦然面对自己的命运。全诗感情深沉、气壮山河、直抒胸臆、毫无雕饰,充分体现了作者崇高的民族气节和强烈的爱国主义精神。", "question": "正气歌》的作者是", "sent_token": ["《", "正", "气", "歌", "》", "是", "南", "宋", "诗", "人", "文", "天", "祥", "在", "狱", "中", "写", "的", "一", "首", "五", "言", "古", "诗", "。", "表", "达", "了", "作", "者", "忠", "君", "爱", "国", "、", "为", "国", "捐", "躯", ",", "忧", "国", "之", "痛", "和", "愿", "意", "以", "死", "明", "志", "、", "为", "国", "捐", "躯", "的", "豪", "情", "壮", "志", "的", "思", "想", "感", "情", "。", "诗", "的", "开", "头", "即", "点", "出", "浩", "然", "正", "气", "存", "乎", "天", "地", "之", "间", ",", "至", "时", "穷", "之", "际", ",", "必", "然", "会", "显", "示", "出", "来", "。", "随", "后", "连", "用", "十", "二", "个", "典", "故", ",", "都", "是", "历", "史", "上", "有", "名", "的", "人", "物", ",", "他", "们", "的", "所", "作", "所", "为", "凛", "然", "显", "示", "出", "浩", "然", "正", "气", "的", "力", "量", "。", "接", "下", "来", "八", "句", "说", "明", "浩", "然", "正", "气", "贯", "日", "月", ",", "立", "天", "地", ",", "为", "三", "纲", "之", "命", ",", "道", "义", "之", "根", "。", "最", "后", "联", "系", "到", "自", "己", "的", "命", "运", ",", "自", "己", "虽", "然", "兵", "败", "被", "俘", ",", "处", "在", "极", "其", "恶", "劣", "的", "牢", "狱", "之", "中", ",", "但", "是", "由", "于", "自", "己", "一", "身", "正", "气", ",", "各", "种", "邪", "气", "和", "疾", "病", "都", "不", "能", "侵", "犯", "自", "己", ",", "因", "此", "自", "己", "能", "够", "坦", "然", "面", "对", "自", "己", "的", "命", "运", "。", "全", "诗", "感", "情", "深", "沉", "、", "气", "壮", "山", "河", "、", "直", "抒", "胸", "臆", "、", "毫", "无", "雕", "饰", ",", "充", "分", "体", "现", "了", "作", "者", "崇", "高", "的", "民", "族", "气", "节", "和", "强", "烈", "的", "爱", "国", "主", "义", "精", "神", "。"], "sample_type": "disturb"} -{"id": 1883, "title": "狗狗皮肤上长小脓包怎么回事", "context": 
"狗狗身上长脓包,是因为真菌感染或是寄生虫感染所致。如不及时处理脓包,会导致扩散全身,甚至溃烂。建议方法:戴上手套,把狗狗身上长脓包的地方挤一挤;然后用碘伏直接喷在患处;如有脓血可用医用纱布给它包在患处,等药效吸收后,取掉纱布;碘伏具有抗菌、消炎的作用,一天可以喷两三次;处理完狗狗伤口后用肥皂洗手。狗狗洗澡要用狗狗专门的沐浴露;洗后立即做吹干处理;定时用狗狗专用梳子,清理身上多余的杂毛;尽量带狗狗去干净的地方玩,回家后把狗狗的脚用抹布抹一次;多注意狗舍卫生,定时做消毒处理。宠物皮肤疾病也是会有一定传染性的,所以一定要进行定期消毒,选用专门的宠物消毒液,每周消毒1-2次,能有效预防传染", "question": "狗狗身上长小脓包是怎么回事", "sent_token": ["狗", "狗", "身", "上", "长", "脓", "包", ",", "是", "因", "为", "真", "菌", "感", "染", "或", "是", "寄", "生", "虫", "感", "染", "所", "致", "。", "如", "不", "及", "时", "处", "理", "脓", "包", ",", "会", "导", "致", "扩", "散", "全", "身", ",", "甚", "至", "溃", "烂", "。", "建", "议", "方", "法", ":", "戴", "上", "手", "套", ",", "把", "狗", "狗", "身", "上", "长", "脓", "包", "的", "地", "方", "挤", "一", "挤", ";", "然", "后", "用", "碘", "伏", "直", "接", "喷", "在", "患", "处", ";", "如", "有", "脓", "血", "可", "用", "医", "用", "纱", "布", "给", "它", "包", "在", "患", "处", ",", "等", "药", "效", "吸", "收", "后", ",", "取", "掉", "纱", "布", ";", "碘", "伏", "具", "有", "抗", "菌", "、", "消", "炎", "的", "作", "用", ",", "一", "天", "可", "以", "喷", "两", "三", "次", ";", "处", "理", "完", "狗", "狗", "伤", "口", "后", "用", "肥", "皂", "洗", "手", "。", "狗", "狗", "洗", "澡", "要", "用", "狗", "狗", "专", "门", "的", "沐", "浴", "露", ";", "洗", "后", "立", "即", "做", "吹", "干", "处", "理", ";", "定", "时", "用", "狗", "狗", "专", "用", "梳", "子", ",", "清", "理", "身", "上", "多", "余", "的", "杂", "毛", ";", "尽", "量", "带", "狗", "狗", "去", "干", "净", "的", "地", "方", "玩", ",", "回", "家", "后", "把", "狗", "狗", "的", "脚", "用", "抹", "布", "抹", "一", "次", ";", "多", "注", "意", "狗", "舍", "卫", "生", ",", "定", "时", "做", "消", "毒", "处", "理", "。", "宠", "物", "皮", "肤", "疾", "病", "也", "是", "会", "有", "一", "定", "传", "染", "性", "的", ",", "所", "以", "一", "定", "要", "进", "行", "定", "期", "消", "毒", ",", "选", "用", "专", "门", "的", "宠", "物", "消", "毒", "液", ",", "每", "周", "消", "毒", "1", "-", "2", "次", ",", "能", "有", "效", "预", "防", "传", "染", "狗", "狗", "皮", "肤", "上", "长", "小", "脓", "包", "怎", "么", "回", "事"], "sample_type": "disturb"} -{"id": 1885, "title": "", "context": "新梓学校成立于2007年9月,是一所公办九年一贯制学校,座落在龙岗街道新生社区,紧邻水岸新都花园,交通十分便利。校园占地27500平方米,建筑面积16285平方米。学校设计办学规模36班,学生人数1800人", "question": "新梓学校地址", "sent_token": ["新", "梓", "学", "校", "成", "立", "于", "2007", "年", "9", "月", ",", "是", "一", "所", "公", "办", "九", "年", "一", "贯", "制", "学", "校", ",", "座", "落", "在", "龙", "岗", "街", "道", "新", "生", "社", "区", ",", "紧", "邻", "水", "岸", "新", "都", "花", "园", ",", "交", "通", "十", "分", "便", "利", "。", "校", "园", "占", "地", "27500", "平", "方", "米", ",", "建", "筑", "面", "积", "16285", "平", "方", "米", "。", "学", "校", "设", "计", "办", "学", "规", "模", "36", "班", ",", "学", "生", "人", "数", "1800", "人"], "sample_type": "disturb"} -{"id": 1886, "title": "敷面膜脸痒是缺水吗?教你正确的认识_皮肤", "context": "当我们在洗完澡的时候,或者是敷面膜发现皮肤有一种痒痒的感觉,如果你确定面膜的质量是没有问题的,并且也确定你对这款面膜的物质没有过敏的情况下,皮肤出现痒的感觉,那可能的原因就是由于皮肤缺水。因为你的皮肤太缺水了,在给皮肤补水的时候就会出现一种痒的情况严重的时候,甚至会有刺痛的感觉。会让人觉得很不舒服,水分充足后会缓解。", "question": "脸痒是缺水么", "sent_token": ["当", "我", "们", "在", "洗", "完", "澡", "的", "时", "候", ",", "或", "者", "是", "敷", "面", "膜", "发", "现", "皮", "肤", "有", "一", "种", "痒", "痒", "的", "感", "觉", ",", "如", "果", "你", "确", "定", "面", "膜", "的", "质", "量", "是", "没", "有", "问", "题", "的", ",", "并", "且", "也", "确", "定", "你", "对", "这", "款", "面", "膜", "的", "物", "质", "没", "有", "过", "敏", "的", "情", "况", "下", ",", "皮", "肤", "出", "现", "痒", "的", "感", "觉", ",", "那", "可", "能", "的", "原", "因", "就", "是", "由", "于", "皮", "肤", "缺", "水", "。", "因", "为", "你", "的", "皮", "肤", "太", "缺", "水", "了", ",", "在", "给", "皮", "肤", "补", "水", "的", "时", "候", "就", "会", "出", "现", "一", "种", "痒", "的", "情", "况", "严", "重", "的", "时", "候", ",", "甚", "至", "会", "有", "刺", "痛", "的", "感", "觉", "。", "会", "让", "人", "觉", "得", 
"很", "不", "舒", "服", ",", "水", "分", "充", "足", "后", "会", "缓", "解", "。", "敷", "面", "膜", "脸", "痒", "是", "缺", "水", "吗", "?", "教", "你", "正", "确", "的", "认", "识", "_", "皮", "肤"], "sample_type": "disturb"} -{"id": 1888, "title": "无痛人流和药流哪个伤害比较小-有来医生", "context": "无痛人工流产手术和药物流产手术,相对比来说,还是药物流产伤害比较大。因为药物流产,阴道流血时间会比人工流产的阴道流血时间要长,一般人工流产,阴道流血时间不超过7天,而药物流产阴道流血的时间往往在15-20天左右才会干净。一直在有流血的状况下,宫口就是开放的,阴道又跟外界相通,跟宫颈又相通,这样造成细菌侵入感染的机会就会增加,所以容易导致生殖道的感染。另外,药物流产造成不全流产的可能性会大一些,需要做清宫手术。这样就可以想象出药物流产会比无痛人流伤害更大一些。人流手术都是属于微创无痛性质的,具有无痛、创伤极小,出血少、手术时间短,无需住院,手术后即可回家,不影响工作和生活等优势。", "question": "无痛人流和药流哪个伤害比较小", "sent_token": ["无", "痛", "人", "工", "流", "产", "手", "术", "和", "药", "物", "流", "产", "手", "术", ",", "相", "对", "比", "来", "说", ",", "还", "是", "药", "物", "流", "产", "伤", "害", "比", "较", "大", "。", "因", "为", "药", "物", "流", "产", ",", "阴", "道", "流", "血", "时", "间", "会", "比", "人", "工", "流", "产", "的", "阴", "道", "流", "血", "时", "间", "要", "长", ",", "一", "般", "人", "工", "流", "产", ",", "阴", "道", "流", "血", "时", "间", "不", "超", "过", "7", "天", ",", "而", "药", "物", "流", "产", "阴", "道", "流", "血", "的", "时", "间", "往", "往", "在", "15", "-", "20", "天", "左", "右", "才", "会", "干", "净", "。", "一", "直", "在", "有", "流", "血", "的", "状", "况", "下", ",", "宫", "口", "就", "是", "开", "放", "的", ",", "阴", "道", "又", "跟", "外", "界", "相", "通", ",", "跟", "宫", "颈", "又", "相", "通", ",", "这", "样", "造", "成", "细", "菌", "侵", "入", "感", "染", "的", "机", "会", "就", "会", "增", "加", ",", "所", "以", "容", "易", "导", "致", "生", "殖", "道", "的", "感", "染", "。", "另", "外", ",", "药", "物", "流", "产", "造", "成", "不", "全", "流", "产", "的", "可", "能", "性", "会", "大", "一", "些", ",", "需", "要", "做", "清", "宫", "手", "术", "。", "这", "样", "就", "可", "以", "想", "象", "出", "药", "物", "流", "产", "会", "比", "无", "痛", "人", "流", "伤", "害", "更", "大", "一", "些", "。", "人", "流", "手", "术", "都", "是", "属", "于", "微", "创", "无", "痛", "性", "质", "的", ",", "具", "有", "无", "痛", "、", "创", "伤", "极", "小", ",", "出", "血", "少", "、", "手", "术", "时", "间", "短", ",", "无", "需", "住", "院", ",", "手", "术", "后", "即", "可", "回", "家", ",", "不", "影", "响", "工", "作", "和", "生", "活", "等", "优", "势", "。", "无", "痛", "人", "流", "和", "药", "流", "哪", "个", "伤", "害", "比", "较", "小", "-", "有", "来", "医", "生"], "sample_type": "disturb"} -{"id": 1890, "title": "长期吃葡萄籽的副作用?_39健康问答_39健康网", "context": "长期吃葡萄籽不会有副作用,不用担心,葡萄籽中含有丰富的花青素,有美容养颜的功效。葡萄籽含有丰富的多种氨基酸、维生素及矿物质等,原花青素含量最高,有促进血液循环、保护视力、抗氧化去除自由基、降低血、保护心血管的作用,可以用于保健、美容。", "question": "葡萄籽能长期吃么?有什么副作用呀?", "sent_token": ["长", "期", "吃", "葡", "萄", "籽", "不", "会", "有", "副", "作", "用", ",", "不", "用", "担", "心", ",", "葡", "萄", "籽", "中", "含", "有", "丰", "富", "的", "花", "青", "素", ",", "有", "美", "容", "养", "颜", "的", "功", "效", "。", "葡", "萄", "籽", "含", "有", "丰", "富", "的", "多", "种", "氨", "基", "酸", "、", "维", "生", "素", "及", "矿", "物", "质", "等", ",", "原", "花", "青", "素", "含", "量", "最", "高", ",", "有", "促", "进", "血", "液", "循", "环", "、", "保", "护", "视", "力", "、", "抗", "氧", "化", "去", "除", "自", "由", "基", "、", "降", "低", "血", "、", "保", "护", "心", "血", "管", "的", "作", "用", ",", "可", "以", "用", "于", "保", "健", "、", "美", "容", "。", "长", "期", "吃", "葡", "萄", "籽", "的", "副", "作", "用", "?", "_", "39", "健", "康", "问", "答", "_", "39", "健", "康", "网"], "sample_type": "disturb"} -{"id": 1894, "title": "红花哪里产的最好?_39健康问答_39健康网", "context": "红花在中国很多地方都是有种植的,比如河南,江苏,四川,河北等等。但是在众多产地中河南的商丘生产的红花应该是最好的了。红花有一种特殊的气味,特别香,味道稍微有点苦。红花是一种很好的植物,对人体有很好的保健作用。高血压患者可以服用一些,红花是有一定的降压作用的,另外还可以促进人体血液的循环,降低血脂。", "question": "最好的刺红花生产自哪里", "sent_token": ["红", "花", "在", "中", "国", "很", "多", "地", "方", "都", "是", "有", "种", "植", "的", ",", "比", "如", "河", "南", ",", "江", "苏", ",", "四", "川", ",", "河", "北", "等", "等", "。", "但", "是", "在", 
"众", "多", "产", "地", "中", "河", "南", "的", "商", "丘", "生", "产", "的", "红", "花", "应", "该", "是", "最", "好", "的", "了", "。", "红", "花", "有", "一", "种", "特", "殊", "的", "气", "味", ",", "特", "别", "香", ",", "味", "道", "稍", "微", "有", "点", "苦", "。", "红", "花", "是", "一", "种", "很", "好", "的", "植", "物", ",", "对", "人", "体", "有", "很", "好", "的", "保", "健", "作", "用", "。", "高", "血", "压", "患", "者", "可", "以", "服", "用", "一", "些", ",", "红", "花", "是", "有", "一", "定", "的", "降", "压", "作", "用", "的", ",", "另", "外", "还", "可", "以", "促", "进", "人", "体", "血", "液", "的", "循", "环", ",", "降", "低", "血", "脂", "。", "红", "花", "哪", "里", "产", "的", "最", "好", "?", "_", "39", "健", "康", "问", "答", "_", "39", "健", "康", "网"], "sample_type": "disturb"} -{"id": 1897, "title": "", "context": "梳妆台指用来化妆的家具装饰。梳妆台一词,在现代家居中,已经被业主、客户、家居设计师广泛用到,现在泛指家具梳妆台。梳妆台尺寸标准的是总高度为1500mm左右,宽为700mm到1200mm,这样的梳妆台尺寸是大小正合适的,在家庭装修之前的前期准备时,就应该确定好梳妆台尺寸大小,同时梳妆台尺寸也要和房间的格调和风格统一起来。每个人都有自己不同的审美眼光,所以在外观选择上只要是个人喜欢就行,但梳妆台的外表最好选择用油漆刷过的,这样容易清理,不至于化妆品渗透到梳妆台内,影响梳妆台的外观", "question": "梳妆台整体高度一般是多少", "sent_token": ["梳", "妆", "台", "指", "用", "来", "化", "妆", "的", "家", "具", "装", "饰", "。", "梳", "妆", "台", "一", "词", ",", "在", "现", "代", "家", "居", "中", ",", "已", "经", "被", "业", "主", "、", "客", "户", "、", "家", "居", "设", "计", "师", "广", "泛", "用", "到", ",", "现", "在", "泛", "指", "家", "具", "梳", "妆", "台", "。", "梳", "妆", "台", "尺", "寸", "标", "准", "的", "是", "总", "高", "度", "为", "1500mm", "左", "右", ",", "宽", "为", "700mm", "到", "1200mm", ",", "这", "样", "的", "梳", "妆", "台", "尺", "寸", "是", "大", "小", "正", "合", "适", "的", ",", "在", "家", "庭", "装", "修", "之", "前", "的", "前", "期", "准", "备", "时", ",", "就", "应", "该", "确", "定", "好", "梳", "妆", "台", "尺", "寸", "大", "小", ",", "同", "时", "梳", "妆", "台", "尺", "寸", "也", "要", "和", "房", "间", "的", "格", "调", "和", "风", "格", "统", "一", "起", "来", "。", "每", "个", "人", "都", "有", "自", "己", "不", "同", "的", "审", "美", "眼", "光", ",", "所", "以", "在", "外", "观", "选", "择", "上", "只", "要", "是", "个", "人", "喜", "欢", "就", "行", ",", "但", "梳", "妆", "台", "的", "外", "表", "最", "好", "选", "择", "用", "油", "漆", "刷", "过", "的", ",", "这", "样", "容", "易", "清", "理", ",", "不", "至", "于", "化", "妆", "品", "渗", "透", "到", "梳", "妆", "台", "内", ",", "影", "响", "梳", "妆", "台", "的", "外", "观"], "sample_type": "disturb"} -{"id": 1899, "title": "感冒能不能吃燕窝_妈妈网小百科", "context": "在感冒的时候尽量不要吃燕窝,燕窝性平味甘,归肺胃肾经,功能养阴润燥,益气补中,填精补髓。虽然燕窝比较滋补,但是在感冒期间吃燕窝的话,并不利于感冒的恢复。在感冒期间应该吃得清淡一些,补充身体需要的水分,如果没有食欲的话可以多喝一些粥。在感冒期间可能吃药物的话,也不能够起到很好的效果,但是也要坚持吃药。", "question": "感冒可以吃燕窝吗?有效果吗?", "sent_token": ["在", "感", "冒", "的", "时", "候", "尽", "量", "不", "要", "吃", "燕", "窝", ",", "燕", "窝", "性", "平", "味", "甘", ",", "归", "肺", "胃", "肾", "经", ",", "功", "能", "养", "阴", "润", "燥", ",", "益", "气", "补", "中", ",", "填", "精", "补", "髓", "。", "虽", "然", "燕", "窝", "比", "较", "滋", "补", ",", "但", "是", "在", "感", "冒", "期", "间", "吃", "燕", "窝", "的", "话", ",", "并", "不", "利", "于", "感", "冒", "的", "恢", "复", "。", "在", "感", "冒", "期", "间", "应", "该", "吃", "得", "清", "淡", "一", "些", ",", "补", "充", "身", "体", "需", "要", "的", "水", "分", ",", "如", "果", "没", "有", "食", "欲", "的", "话", "可", "以", "多", "喝", "一", "些", "粥", "。", "在", "感", "冒", "期", "间", "可", "能", "吃", "药", "物", "的", "话", ",", "也", "不", "能", "够", "起", "到", "很", "好", "的", "效", "果", ",", "但", "是", "也", "要", "坚", "持", "吃", "药", "。", "感", "冒", "能", "不", "能", "吃", "燕", "窝", "_", "妈", "妈", "网", "小", "百", "科"], "sample_type": "disturb"} -{"id": 1900, "title": "房颤会引起脑梗吗-有来医生", "context": 
"房颤会引起脑血管疾病,在医学上不叫脑梗叫脑栓塞,脑梗是脑血管本身病变引起的脑供血不足的情况,而脑栓塞是由于房颤心脏上形成了附壁血栓,当血栓的栓子脱落之后,就有可能堵塞在脑血管形成了脑拴塞,也是一种脑缺血的表现。治疗方法可以应用改善循环和营养神经的药物治疗,必须应用阿司匹林和氯吡格雷口服抗血小板聚集治疗,对于心房纤颤的患者,要控制心室率,应用阿司匹林和氯吡格雷等口服抗血小板聚集治疗,预防心脏附壁血栓的形成。", "question": "房颤不会引起脑梗吗", "sent_token": ["房", "颤", "会", "引", "起", "脑", "血", "管", "疾", "病", ",", "在", "医", "学", "上", "不", "叫", "脑", "梗", "叫", "脑", "栓", "塞", ",", "脑", "梗", "是", "脑", "血", "管", "本", "身", "病", "变", "引", "起", "的", "脑", "供", "血", "不", "足", "的", "情", "况", ",", "而", "脑", "栓", "塞", "是", "由", "于", "房", "颤", "心", "脏", "上", "形", "成", "了", "附", "壁", "血", "栓", ",", "当", "血", "栓", "的", "栓", "子", "脱", "落", "之", "后", ",", "就", "有", "可", "能", "堵", "塞", "在", "脑", "血", "管", "形", "成", "了", "脑", "拴", "塞", ",", "也", "是", "一", "种", "脑", "缺", "血", "的", "表", "现", "。", "治", "疗", "方", "法", "可", "以", "应", "用", "改", "善", "循", "环", "和", "营", "养", "神", "经", "的", "药", "物", "治", "疗", ",", "必", "须", "应", "用", "阿", "司", "匹", "林", "和", "氯", "吡", "格", "雷", "口", "服", "抗", "血", "小", "板", "聚", "集", "治", "疗", ",", "对", "于", "心", "房", "纤", "颤", "的", "患", "者", ",", "要", "控", "制", "心", "室", "率", ",", "应", "用", "阿", "司", "匹", "林", "和", "氯", "吡", "格", "雷", "等", "口", "服", "抗", "血", "小", "板", "聚", "集", "治", "疗", ",", "预", "防", "心", "脏", "附", "壁", "血", "栓", "的", "形", "成", "。", "房", "颤", "会", "引", "起", "脑", "梗", "吗", "-", "有", "来", "医", "生"], "sample_type": "disturb"} -{"id": 1906, "title": "二十天的婴儿能看多远_妈妈网小百科", "context": "20天的宝宝能够看到的距离大概是15厘米-20厘米左右,一般能够看到18厘米左右的事物。宝宝刚出生的时候视力极其差,有的甚至没有睁开眼,可以说基本什么都看不清楚,视力比较好的新生儿,也只能感受到光和影或大致的轮廓。随着宝宝的眼球、视神经和大脑的不断发育,他们看到的景物会越来越清楚,视野也会不断扩大,在出生6-8个月后,宝宝眼中的世界,就基本和成人一样了。", "question": "二十天的宝宝能看多远?", "sent_token": ["20", "天", "的", "宝", "宝", "能", "够", "看", "到", "的", "距", "离", "大", "概", "是", "15", "厘", "米", "-", "20", "厘", "米", "左", "右", ",", "一", "般", "能", "够", "看", "到", "18", "厘", "米", "左", "右", "的", "事", "物", "。", "宝", "宝", "刚", "出", "生", "的", "时", "候", "视", "力", "极", "其", "差", ",", "有", "的", "甚", "至", "没", "有", "睁", "开", "眼", ",", "可", "以", "说", "基", "本", "什", "么", "都", "看", "不", "清", "楚", ",", "视", "力", "比", "较", "好", "的", "新", "生", "儿", ",", "也", "只", "能", "感", "受", "到", "光", "和", "影", "或", "大", "致", "的", "轮", "廓", "。", "随", "着", "宝", "宝", "的", "眼", "球", "、", "视", "神", "经", "和", "大", "脑", "的", "不", "断", "发", "育", ",", "他", "们", "看", "到", "的", "景", "物", "会", "越", "来", "越", "清", "楚", ",", "视", "野", "也", "会", "不", "断", "扩", "大", ",", "在", "出", "生", "6", "-", "8", "个", "月", "后", ",", "宝", "宝", "眼", "中", "的", "世", "界", ",", "就", "基", "本", "和", "成", "人", "一", "样", "了", "。", "二", "十", "天", "的", "婴", "儿", "能", "看", "多", "远", "_", "妈", "妈", "网", "小", "百", "科"], "sample_type": "disturb"} -{"id": 1918, "title": "4价宫颈疫苗多少钱-有来医生", "context": "4价宫颈癌疫苗有国产疫苗和进口疫苗,国产疫苗价格比较便宜,预防宫颈癌的疫苗只有4价疫苗,具体价格不同地区以及不同生产厂家生产的疫苗,所定价格也不一样。在北京4价宫颈癌疫苗,价格大概是800元左右,总共需要接种三针,需要在半年内接种完,分别在第一个月,第2个月和第6个月各接种一针次,接种年龄是20-45周岁,建议咨询当地疾病预防控制机构,所进疫苗的具体价格比较准确。比如江苏省从2019年开始,所有有价疫苗都是零差价出售,每接种一针次,收取20元材料费和注射费,目前接种宫颈癌疫苗,应该先预约才可以接种。", "question": "中国自己生产的HPV疫苗都有哪些", "sent_token": ["4", "价", "宫", "颈", "癌", "疫", "苗", "有", "国", "产", "疫", "苗", "和", "进", "口", "疫", "苗", ",", "国", "产", "疫", "苗", "价", "格", "比", "较", "便", "宜", ",", "预", "防", "宫", "颈", "癌", "的", "疫", "苗", "只", "有", "4", "价", "疫", "苗", ",", "具", "体", "价", "格", "不", "同", "地", "区", "以", "及", "不", "同", "生", "产", "厂", "家", "生", "产", "的", "疫", "苗", ",", "所", "定", "价", "格", "也", "不", "一", "样", "。", "在", "北", "京", "4", "价", "宫", "颈", "癌", "疫", "苗", ",", "价", "格", "大", "概", "是", "800", "元", "左", "右", ",", "总", "共", "需", "要", "接", "种", "三", "针", ",", "需", "要", "在", "半", "年", "内", "接", "种", "完", 
",", "分", "别", "在", "第", "一", "个", "月", ",", "第", "2", "个", "月", "和", "第", "6", "个", "月", "各", "接", "种", "一", "针", "次", ",", "接", "种", "年", "龄", "是", "20", "-", "45", "周", "岁", ",", "建", "议", "咨", "询", "当", "地", "疾", "病", "预", "防", "控", "制", "机", "构", ",", "所", "进", "疫", "苗", "的", "具", "体", "价", "格", "比", "较", "准", "确", "。", "比", "如", "江", "苏", "省", "从", "2019", "年", "开", "始", ",", "所", "有", "有", "价", "疫", "苗", "都", "是", "零", "差", "价", "出", "售", ",", "每", "接", "种", "一", "针", "次", ",", "收", "取", "20", "元", "材", "料", "费", "和", "注", "射", "费", ",", "目", "前", "接", "种", "宫", "颈", "癌", "疫", "苗", ",", "应", "该", "先", "预", "约", "才", "可", "以", "接", "种", "。", "4", "价", "宫", "颈", "疫", "苗", "多", "少", "钱", "-", "有", "来", "医", "生"], "sample_type": "disturb"} -{"id": 1945, "title": "hiit是什么", "context": "hiit是高强度间歇训练,主要是通过进行多组高强度的间隙,和低强度的动作组合训练,这种训练方式能够在短时间内高速燃烧脂肪,简单说就是中间有休息的高强度训练,非常适合锻炼时间较少或无法长时间坚持锻炼的人。", "question": "什么是HIIT", "sent_token": ["hiit", "是", "高", "强", "度", "间", "歇", "训", "练", ",", "主", "要", "是", "通", "过", "进", "行", "多", "组", "高", "强", "度", "的", "间", "隙", ",", "和", "低", "强", "度", "的", "动", "作", "组", "合", "训", "练", ",", "这", "种", "训", "练", "方", "式", "能", "够", "在", "短", "时", "间", "内", "高", "速", "燃", "烧", "脂", "肪", ",", "简", "单", "说", "就", "是", "中", "间", "有", "休", "息", "的", "高", "强", "度", "训", "练", ",", "非", "常", "适", "合", "锻", "炼", "时", "间", "较", "少", "或", "无", "法", "长", "时", "间", "坚", "持", "锻", "炼", "的", "人", "。", "hiit", "是", "什", "么"], "sample_type": "disturb"} -{"id": 1949, "title": "民生信用卡的客服电话多少?-其他问题知识问答-我爱卡", "context": "民生银行是中国大陆第一家由民间资本设立的全国性商业银行,成立于1996年1月12日。民生银行的信用卡的24小时客服电话为400-669-5568,持卡人在办卡或用卡的过程中,有任何疑问,都可以拨打民生银行信用卡客服电话,通过人工客服,来进行咨询。同时,持卡人也可以通过客服电话,办理信用卡激活、修改密码、更改账单日等业务。", "question": "民生信用卡客服", "sent_token": ["民", "生", "银", "行", "是", "中", "国", "大", "陆", "第", "一", "家", "由", "民", "间", "资", "本", "设", "立", "的", "全", "国", "性", "商", "业", "银", "行", ",", "成", "立", "于", "1996", "年", "1", "月", "12", "日", "。", "民", "生", "银", "行", "的", "信", "用", "卡", "的", "24", "小", "时", "客", "服", "电", "话", "为", "400", "-", "669", "-", "5568", ",", "持", "卡", "人", "在", "办", "卡", "或", "用", "卡", "的", "过", "程", "中", ",", "有", "任", "何", "疑", "问", ",", "都", "可", "以", "拨", "打", "民", "生", "银", "行", "信", "用", "卡", "客", "服", "电", "话", ",", "通", "过", "人", "工", "客", "服", ",", "来", "进", "行", "咨", "询", "。", "同", "时", ",", "持", "卡", "人", "也", "可", "以", "通", "过", "客", "服", "电", "话", ",", "办", "理", "信", "用", "卡", "激", "活", "、", "修", "改", "密", "码", "、", "更", "改", "账", "单", "日", "等", "业", "务", "。", "民", "生", "信", "用", "卡", "的", "客", "服", "电", "话", "多", "少", "?", "-", "其", "他", "问", "题", "知", "识", "问", "答", "-", "我", "爱", "卡"], "sample_type": "disturb"} -{"id": 1956, "title": "", "context": "法令纹位於鼻翼两侧往下延伸至嘴的附近,也称寿带,是典型的皮肤组织老化,造成肌肤表面凹陷的现象。法令若垂长,亦为长寿之象徵。不过女性多半不喜欢脸上出现法令纹,因为这意味脸部皮肤松弛,是老化的迹象。", "question": "哪里是法令纹?", "sent_token": ["法", "令", "纹", "位", "於", "鼻", "翼", "两", "侧", "往", "下", "延", "伸", "至", "嘴", "的", "附", "近", ",", "也", "称", "寿", "带", ",", "是", "典", "型", "的", "皮", "肤", "组", "织", "老", "化", ",", "造", "成", "肌", "肤", "表", "面", "凹", "陷", "的", "现", "象", "。", "法", "令", "若", "垂", "长", ",", "亦", "为", "长", "寿", "之", "象", "徵", "。", "不", "过", "女", "性", "多", "半", "不", "喜", "欢", "脸", "上", "出", "现", "法", "令", "纹", ",", "因", "为", "这", "意", "味", "脸", "部", "皮", "肤", "松", "弛", ",", "是", "老", "化", "的", "迹", "象", "。"], "sample_type": "disturb"} -{"id": 1966, "title": "婴儿轻微肠炎能自愈吗_妈妈网小百科", "context": "婴儿轻微肠炎不能自愈。肠炎是一种炎症,其发病的原因与胃肠道失调有关联。临床表现主要有腹痛、腹泻、稀水便或黏液脓血便。婴儿胃肠道菌群出现了失调的异常,就会引发肠炎的出现。尽管是比较轻微的肠炎,但还是有炎症的存在。婴儿轻微肠炎需要就医进行治疗,需要吃药促使炎症的消除。", 
"question": "婴儿轻度肠炎能自愈吗", "sent_token": ["婴", "儿", "轻", "微", "肠", "炎", "不", "能", "自", "愈", "。", "肠", "炎", "是", "一", "种", "炎", "症", ",", "其", "发", "病", "的", "原", "因", "与", "胃", "肠", "道", "失", "调", "有", "关", "联", "。", "临", "床", "表", "现", "主", "要", "有", "腹", "痛", "、", "腹", "泻", "、", "稀", "水", "便", "或", "黏", "液", "脓", "血", "便", "。", "婴", "儿", "胃", "肠", "道", "菌", "群", "出", "现", "了", "失", "调", "的", "异", "常", ",", "就", "会", "引", "发", "肠", "炎", "的", "出", "现", "。", "尽", "管", "是", "比", "较", "轻", "微", "的", "肠", "炎", ",", "但", "还", "是", "有", "炎", "症", "的", "存", "在", "。", "婴", "儿", "轻", "微", "肠", "炎", "需", "要", "就", "医", "进", "行", "治", "疗", ",", "需", "要", "吃", "药", "促", "使", "炎", "症", "的", "消", "除", "。", "婴", "儿", "轻", "微", "肠", "炎", "能", "自", "愈", "吗", "_", "妈", "妈", "网", "小", "百", "科"], "sample_type": "disturb"} -{"id": 1977, "title": "", "context": "珍珠鸟作者简介冯骥才,当代作家,1942年生于天津,兄妹六人,排行第三,为长子。原籍浙江慈溪市人。从小喜爱美术、文学和球类活动。曾当过专业篮球运动员,从事过绘画。", "question": "冯骥才什么时候出生", "sent_token": ["珍", "珠", "鸟", "作", "者", "简", "介", "冯", "骥", "才", ",", "当", "代", "作", "家", ",", "1942", "年", "生", "于", "天", "津", ",", "兄", "妹", "六", "人", ",", "排", "行", "第", "三", ",", "为", "长", "子", "。", "原", "籍", "浙", "江", "慈", "溪", "市", "人", "。", "从", "小", "喜", "爱", "美", "术", "、", "文", "学", "和", "球", "类", "活", "动", "。", "曾", "当", "过", "专", "业", "篮", "球", "运", "动", "员", ",", "从", "事", "过", "绘", "画", "。"], "sample_type": "disturb"} -{"id": 1983, "title": "哺乳期可以吃维生素b2吗_有问必答_快速问医生", "context": "你好,口腔溃疡一般都是由于维生素缺乏导致的,与口腔炎症和上火也有关,可以服用维生素b2和维生素c治疗。用西瓜皮煮水喝,可以清热去火。局部用口腔溃疡散或者用维生素c研磨成粉末涂抹,都可以有效缓解疼痛。孕妇正常也要补充维生素的,服用维生素b2没有问题的。平时一定要多吃新鲜蔬菜水果,补充维生素,注意口腔卫生,早晚刷牙,饭后用温水漱口,每天早上起床用淡盐水漱口。", "question": "哺乳期间,能吃维生素b2吗", "sent_token": ["你", "好", ",", "口", "腔", "溃", "疡", "一", "般", "都", "是", "由", "于", "维", "生", "素", "缺", "乏", "导", "致", "的", ",", "与", "口", "腔", "炎", "症", "和", "上", "火", "也", "有", "关", ",", "可", "以", "服", "用", "维", "生", "素", "b2", "和", "维", "生", "素", "c", "治", "疗", "。", "用", "西", "瓜", "皮", "煮", "水", "喝", ",", "可", "以", "清", "热", "去", "火", "。", "局", "部", "用", "口", "腔", "溃", "疡", "散", "或", "者", "用", "维", "生", "素", "c", "研", "磨", "成", "粉", "末", "涂", "抹", ",", "都", "可", "以", "有", "效", "缓", "解", "疼", "痛", "。", "孕", "妇", "正", "常", "也", "要", "补", "充", "维", "生", "素", "的", ",", "服", "用", "维", "生", "素", "b2", "没", "有", "问", "题", "的", "。", "平", "时", "一", "定", "要", "多", "吃", "新", "鲜", "蔬", "菜", "水", "果", ",", "补", "充", "维", "生", "素", ",", "注", "意", "口", "腔", "卫", "生", ",", "早", "晚", "刷", "牙", ",", "饭", "后", "用", "温", "水", "漱", "口", ",", "每", "天", "早", "上", "起", "床", "用", "淡", "盐", "水", "漱", "口", "。", "哺", "乳", "期", "可", "以", "吃", "维", "生", "素", "b2", "吗", "_", "有", "问", "必", "答", "_", "快", "速", "问", "医", "生"], "sample_type": "disturb"} -{"id": 1993, "title": "6岁儿童吃几颗肠虫清,吃肠虫清需要忌口吗_孕育常识_亲子宝典库_", "context": "肠虫清一般指阿苯达唑。阿苯达唑是一种咪唑衍生物类广谱驱肠虫药物。是六岁儿童就可以服用的一次吃两片,是吃饱饭后吃,肠虫清的主要是驱虫的药物,一般在晚上睡前服用的是比较好的,服药期间要多喝开水,多吃清淡易消化的食物,忌辛辣刺激性食物和油腻煎炸的食物,注意保暖避免着凉。", "question": "6岁儿童吃几颗肠虫清", "sent_token": ["肠", "虫", "清", "一", "般", "指", "阿", "苯", "达", "唑", "。", "阿", "苯", "达", "唑", "是", "一", "种", "咪", "唑", "衍", "生", "物", "类", "广", "谱", "驱", "肠", "虫", "药", "物", "。", "是", "六", "岁", "儿", "童", "就", "可", "以", "服", "用", "的", "一", "次", "吃", "两", "片", ",", "是", "吃", "饱", "饭", "后", "吃", ",", "肠", "虫", "清", "的", "主", "要", "是", "驱", "虫", "的", "药", "物", ",", "一", "般", "在", "晚", "上", "睡", "前", "服", "用", "的", "是", "比", "较", "好", "的", ",", "服", "药", "期", "间", "要", "多", "喝", "开", "水", ",", "多", "吃", "清", "淡", "易", "消", "化", "的", "食", "物", ",", "忌", "辛", "辣", "刺", "激", "性", "食", "物", "和", "油", "腻", "煎", "炸", "的", "食", 
"物", ",", "注", "意", "保", "暖", "避", "免", "着", "凉", "。", "6", "岁", "儿", "童", "吃", "几", "颗", "肠", "虫", "清", ",", "吃", "肠", "虫", "清", "需", "要", "忌", "口", "吗", "_", "孕", "育", "常", "识", "_", "亲", "子", "宝", "典", "库", "_"], "sample_type": "disturb"} -{"id": 2003, "title": "隔阂意味着是什么意思", "context": "隔阂是一个汉语词汇,一指彼此情意沟通的障碍或是情意不通,思想有距离,彼此之间有间隔,又指阻隔、隔绝。隔阂意味着很多意思,通常隔阂就意味着可能双方之间沟通有问题,比如有些夫妻或者是男女朋友之间吵架,两个人一起冷战,两个人由于没有沟通,双方之间的误会和矛盾就会越来越多了,也有可能是两个人总是以争吵的方式来解决问题,像这样的话就达不到有效的沟通,两个人两个人越不沟通,双方之间的矛盾和争吵就会越来越多,这个时候就会产生深深的隔阂。也有可能是双峰之间的价值观完全不同,比如对待某些问题的时候,有些人比较理性,但是有些人会比较感性,这个时候价值观不同的话就非常容易产生隔阂。", "question": "隔阂什么意思", "sent_token": ["隔", "阂", "是", "一", "个", "汉", "语", "词", "汇", ",", "一", "指", "彼", "此", "情", "意", "沟", "通", "的", "障", "碍", "或", "是", "情", "意", "不", "通", ",", "思", "想", "有", "距", "离", ",", "彼", "此", "之", "间", "有", "间", "隔", ",", "又", "指", "阻", "隔", "、", "隔", "绝", "。", "隔", "阂", "意", "味", "着", "很", "多", "意", "思", ",", "通", "常", "隔", "阂", "就", "意", "味", "着", "可", "能", "双", "方", "之", "间", "沟", "通", "有", "问", "题", ",", "比", "如", "有", "些", "夫", "妻", "或", "者", "是", "男", "女", "朋", "友", "之", "间", "吵", "架", ",", "两", "个", "人", "一", "起", "冷", "战", ",", "两", "个", "人", "由", "于", "没", "有", "沟", "通", ",", "双", "方", "之", "间", "的", "误", "会", "和", "矛", "盾", "就", "会", "越", "来", "越", "多", "了", ",", "也", "有", "可", "能", "是", "两", "个", "人", "总", "是", "以", "争", "吵", "的", "方", "式", "来", "解", "决", "问", "题", ",", "像", "这", "样", "的", "话", "就", "达", "不", "到", "有", "效", "的", "沟", "通", ",", "两", "个", "人", "两", "个", "人", "越", "不", "沟", "通", ",", "双", "方", "之", "间", "的", "矛", "盾", "和", "争", "吵", "就", "会", "越", "来", "越", "多", ",", "这", "个", "时", "候", "就", "会", "产", "生", "深", "深", "的", "隔", "阂", "。", "也", "有", "可", "能", "是", "双", "峰", "之", "间", "的", "价", "值", "观", "完", "全", "不", "同", ",", "比", "如", "对", "待", "某", "些", "问", "题", "的", "时", "候", ",", "有", "些", "人", "比", "较", "理", "性", ",", "但", "是", "有", "些", "人", "会", "比", "较", "感", "性", ",", "这", "个", "时", "候", "价", "值", "观", "不", "同", "的", "话", "就", "非", "常", "容", "易", "产", "生", "隔", "阂", "。", "隔", "阂", "意", "味", "着", "是", "什", "么", "意", "思"], "sample_type": "disturb"} -{"id": 2004, "title": "小儿癫痫病能彻底治愈的吗_有问必答_快速问医生", "context": "你好,很高兴为你服务,目前小儿癫痫是可以治愈的,不同的癫痫类型以及患者的实际病情不同,其适合的治疗方法也是不尽相同的。现在常见的小儿癫痫治疗都是采用中医为基础的治疗方法,这样对患儿的伤害较小,而西医则有很大的副作用,好吧", "question": "能彻底治愈羊儿风吗", "sent_token": ["你", "好", ",", "很", "高", "兴", "为", "你", "服", "务", ",", "目", "前", "小", "儿", "癫", "痫", "是", "可", "以", "治", "愈", "的", ",", "不", "同", "的", "癫", "痫", "类", "型", "以", "及", "患", "者", "的", "实", "际", "病", "情", "不", "同", ",", "其", "适", "合", "的", "治", "疗", "方", "法", "也", "是", "不", "尽", "相", "同", "的", "。", "现", "在", "常", "见", "的", "小", "儿", "癫", "痫", "治", "疗", "都", "是", "采", "用", "中", "医", "为", "基", "础", "的", "治", "疗", "方", "法", ",", "这", "样", "对", "患", "儿", "的", "伤", "害", "较", "小", ",", "而", "西", "医", "则", "有", "很", "大", "的", "副", "作", "用", ",", "好", "吧", "小", "儿", "癫", "痫", "病", "能", "彻", "底", "治", "愈", "的", "吗", "_", "有", "问", "必", "答", "_", "快", "速", "问", "医", "生"], "sample_type": "disturb"} -{"id": 2012, "title": "脑内多发腔隙性脑梗死严重吗_39健康问答_39健康网", "context": "脑内多发腔隙性脑梗死,部分软化灶形成,一般不严重,是细枝血管梗塞,引起小灶脑组织坏死,脑组织软化灶,其他部位的脑组织会替代坏死部位的脑组织功能,所以一般没有不适的症状。属于脑梗死症型中症状最轻微的,也是唯一一种能够通过可靠用药、饮食调节、康复锻炼、控制血压和血脂等综合性治疗措施达到彻底治愈的脑梗死。注意控制血压,清淡饮食,控制血脂,血粘度,精神放松,解除思想顾虑,多做室外文娱体育活动,精神愉快,多接受紫外线照射,多喝开水,会有利于康复。可以根据情况使用疏通血管的药物。", "question": "多发腔隙性脑梗死吃什么中药", "sent_token": ["脑", "内", "多", "发", "腔", "隙", "性", "脑", "梗", "死", ",", "部", "分", "软", "化", "灶", "形", "成", ",", "一", "般", "不", "严", "重", ",", "是", "细", "枝", "血", "管", "梗", "塞", ",", "引", "起", "小", "灶", "脑", 
"组", "织", "坏", "死", ",", "脑", "组", "织", "软", "化", "灶", ",", "其", "他", "部", "位", "的", "脑", "组", "织", "会", "替", "代", "坏", "死", "部", "位", "的", "脑", "组", "织", "功", "能", ",", "所", "以", "一", "般", "没", "有", "不", "适", "的", "症", "状", "。", "属", "于", "脑", "梗", "死", "症", "型", "中", "症", "状", "最", "轻", "微", "的", ",", "也", "是", "唯", "一", "一", "种", "能", "够", "通", "过", "可", "靠", "用", "药", "、", "饮", "食", "调", "节", "、", "康", "复", "锻", "炼", "、", "控", "制", "血", "压", "和", "血", "脂", "等", "综", "合", "性", "治", "疗", "措", "施", "达", "到", "彻", "底", "治", "愈", "的", "脑", "梗", "死", "。", "注", "意", "控", "制", "血", "压", ",", "清", "淡", "饮", "食", ",", "控", "制", "血", "脂", ",", "血", "粘", "度", ",", "精", "神", "放", "松", ",", "解", "除", "思", "想", "顾", "虑", ",", "多", "做", "室", "外", "文", "娱", "体", "育", "活", "动", ",", "精", "神", "愉", "快", ",", "多", "接", "受", "紫", "外", "线", "照", "射", ",", "多", "喝", "开", "水", ",", "会", "有", "利", "于", "康", "复", "。", "可", "以", "根", "据", "情", "况", "使", "用", "疏", "通", "血", "管", "的", "药", "物", "。", "脑", "内", "多", "发", "腔", "隙", "性", "脑", "梗", "死", "严", "重", "吗", "_", "39", "健", "康", "问", "答", "_", "39", "健", "康", "网"], "sample_type": "disturb"} diff --git a/examples/model_interpretation/data/mrc_en b/examples/model_interpretation/data/mrc_en deleted file mode 100644 index d95bef1dedbd..000000000000 --- a/examples/model_interpretation/data/mrc_en +++ /dev/null @@ -1,100 +0,0 @@ -{"id": 1, "title": "", "context": "The English name \" Normans \" comes from the French words Normans / Normanz , plural of Normant , modern French normand , which is itself borrowed from Old Low Franconian Nortmann \" Northman \" or directly from Old Norse Norðmaðr , Latinized variously as Nortmannus , Normannus , or Nordmannus ( recorded in Medieval Latin , 9th century ) to mean \" Norseman , Viking \" .", "question": "What is the original meaning of the word Norman ?", "sent_token": ["The", "English", "name", "\"", "Normans", "\"", "comes", "from", "the", "French", "words", "Normans", "/", "Normanz", ",", "plural", "of", "Normant", ",", "modern", "French", "normand", ",", "which", "is", "itself", "borrowed", "from", "Old", "Low", "Franconian", "Nortmann", "\"", "Northman", "\"", "or", "directly", "from", "Old", "Norse", "Norðmaðr", ",", "Latinized", "variously", "as", "Nortmannus", ",", "Normannus", ",", "or", "Nordmannus", "(", "recorded", "in", "Medieval", "Latin", ",", "9th", "century", ")", "to", "mean", "\"", "Norseman", ",", "Viking", "\"", "."], "sample_type": "ori", "rel_ids": [1508]} -{"id": 2, "title": "", "context": "The English name \" Normans \" comes from the French words Normans / Normanz , plural of Normant , modern French normand , which is itself borrowed from Old Low Franconian Nortmann \" Northman \" or directly from Old Norse Norðmaðr , Latinized variously as Nortmannus , Normannus , or Nordmannus ( recorded in Medieval Latin , 9th century ) to mean \" Norseman , Viking \" .", "question": "When was the Latin version of the word Norman first recorded ?", "sent_token": ["The", "English", "name", "\"", "Normans", "\"", "comes", "from", "the", "French", "words", "Normans", "/", "Normanz", ",", "plural", "of", "Normant", ",", "modern", "French", "normand", ",", "which", "is", "itself", "borrowed", "from", "Old", "Low", "Franconian", "Nortmann", "\"", "Northman", "\"", "or", "directly", "from", "Old", "Norse", "Norðmaðr", ",", "Latinized", "variously", "as", "Nortmannus", ",", "Normannus", ",", "or", "Nordmannus", "(", "recorded", "in", "Medieval", "Latin", ",", "9th", "century", ")", "to", "mean", "\"", "Norseman", ",", 
"Viking", "\"", "."], "sample_type": "ori", "rel_ids": [1509]} -{"id": 3, "title": "", "context": "The descendants of Rollo 's Vikings and their Frankish wives would replace the Norse religion and Old Norse language with Catholicism ( Christianity ) and the Gallo - Romance language of the local people , blending their maternal Frankish heritage with Old Norse traditions and customs to synthesize a unique \" Norman \" culture in the north of France . The Norman language was forged by the adoption of the indigenous langue d'oïl branch of Romance by a Norse - speaking ruling class , and it developed into the regional language that survives today .", "question": "What was the Norman religion ?", "sent_token": ["The", "descendants", "of", "Rollo", "'s", "Vikings", "and", "their", "Frankish", "wives", "would", "replace", "the", "Norse", "religion", "and", "Old", "Norse", "language", "with", "Catholicism", "(", "Christianity", ")", "and", "the", "Gallo", "-", "Romance", "language", "of", "the", "local", "people", ",", "blending", "their", "maternal", "Frankish", "heritage", "with", "Old", "Norse", "traditions", "and", "customs", "to", "synthesize", "a", "unique", "\"", "Norman", "\"", "culture", "in", "the", "north", "of", "France", ".", "The", "Norman", "language", "was", "forged", "by", "the", "adoption", "of", "the", "indigenous", "langue", "d'oïl", "branch", "of", "Romance", "by", "a", "Norse", "-", "speaking", "ruling", "class", ",", "and", "it", "developed", "into", "the", "regional", "language", "that", "survives", "today", "."], "sample_type": "ori", "rel_ids": [1510]} -{"id": 4, "title": "", "context": "The descendants of Rollo 's Vikings and their Frankish wives would replace the Norse religion and Old Norse language with Catholicism ( Christianity ) and the Gallo - Romance language of the local people , blending their maternal Frankish heritage with Old Norse traditions and customs to synthesize a unique \" Norman \" culture in the north of France . The Norman language was forged by the adoption of the indigenous langue d'oïl branch of Romance by a Norse - speaking ruling class , and it developed into the regional language that survives today .", "question": "What part of France were the Normans located ?", "sent_token": ["The", "descendants", "of", "Rollo", "'s", "Vikings", "and", "their", "Frankish", "wives", "would", "replace", "the", "Norse", "religion", "and", "Old", "Norse", "language", "with", "Catholicism", "(", "Christianity", ")", "and", "the", "Gallo", "-", "Romance", "language", "of", "the", "local", "people", ",", "blending", "their", "maternal", "Frankish", "heritage", "with", "Old", "Norse", "traditions", "and", "customs", "to", "synthesize", "a", "unique", "\"", "Norman", "\"", "culture", "in", "the", "north", "of", "France", ".", "The", "Norman", "language", "was", "forged", "by", "the", "adoption", "of", "the", "indigenous", "langue", "d'oïl", "branch", "of", "Romance", "by", "a", "Norse", "-", "speaking", "ruling", "class", ",", "and", "it", "developed", "into", "the", "regional", "language", "that", "survives", "today", "."], "sample_type": "ori", "rel_ids": [1511]} -{"id": 5, "title": "", "context": "One of the first Norman mercenaries to serve as a Byzantine general was Hervé in the 1050s . By then however , there were already Norman mercenaries serving as far away as Trebizond and Georgia . They were based at Malatya and Edessa , under the Byzantine duke of Antioch , Isaac Komnenos . In the 1060s , Robert Crispin led the Normans of Edessa against the Turks . 
Roussel de Bailleul even tried to carve out an independent state in Asia Minor with support from the local population , but he was stopped by the Byzantine general Alexius Komnenos .", "question": "When did Herve serve as a Byzantine general ?", "sent_token": ["One", "of", "the", "first", "Norman", "mercenaries", "to", "serve", "as", "a", "Byzantine", "general", "was", "Hervé", "in", "the", "1050s", ".", "By", "then", "however", ",", "there", "were", "already", "Norman", "mercenaries", "serving", "as", "far", "away", "as", "Trebizond", "and", "Georgia", ".", "They", "were", "based", "at", "Malatya", "and", "Edessa", ",", "under", "the", "Byzantine", "duke", "of", "Antioch", ",", "Isaac", "Komnenos", ".", "In", "the", "1060s", ",", "Robert", "Crispin", "led", "the", "Normans", "of", "Edessa", "against", "the", "Turks", ".", "Roussel", "de", "Bailleul", "even", "tried", "to", "carve", "out", "an", "independent", "state", "in", "Asia", "Minor", "with", "support", "from", "the", "local", "population", ",", "but", "he", "was", "stopped", "by", "the", "Byzantine", "general", "Alexius", "Komnenos", "."], "sample_type": "ori", "rel_ids": [1512]} -{"id": 6, "title": "", "context": "One of the first Norman mercenaries to serve as a Byzantine general was Hervé in the 1050s . By then however , there were already Norman mercenaries serving as far away as Trebizond and Georgia . They were based at Malatya and Edessa , under the Byzantine duke of Antioch , Isaac Komnenos . In the 1060s , Robert Crispin led the Normans of Edessa against the Turks . Roussel de Bailleul even tried to carve out an independent state in Asia Minor with support from the local population , but he was stopped by the Byzantine general Alexius Komnenos .", "question": "When did Robert Crispin go up against the Turks ?", "sent_token": ["One", "of", "the", "first", "Norman", "mercenaries", "to", "serve", "as", "a", "Byzantine", "general", "was", "Hervé", "in", "the", "1050s", ".", "By", "then", "however", ",", "there", "were", "already", "Norman", "mercenaries", "serving", "as", "far", "away", "as", "Trebizond", "and", "Georgia", ".", "They", "were", "based", "at", "Malatya", "and", "Edessa", ",", "under", "the", "Byzantine", "duke", "of", "Antioch", ",", "Isaac", "Komnenos", ".", "In", "the", "1060s", ",", "Robert", "Crispin", "led", "the", "Normans", "of", "Edessa", "against", "the", "Turks", ".", "Roussel", "de", "Bailleul", "even", "tried", "to", "carve", "out", "an", "independent", "state", "in", "Asia", "Minor", "with", "support", "from", "the", "local", "population", ",", "but", "he", "was", "stopped", "by", "the", "Byzantine", "general", "Alexius", "Komnenos", "."], "sample_type": "ori", "rel_ids": [1513]} -{"id": 7, "title": "", "context": "One of the first Norman mercenaries to serve as a Byzantine general was Hervé in the 1050s . By then however , there were already Norman mercenaries serving as far away as Trebizond and Georgia . They were based at Malatya and Edessa , under the Byzantine duke of Antioch , Isaac Komnenos . In the 1060s , Robert Crispin led the Normans of Edessa against the Turks . 
Roussel de Bailleul even tried to carve out an independent state in Asia Minor with support from the local population , but he was stopped by the Byzantine general Alexius Komnenos .", "question": "Who ruined Roussel de Bailleul 's plans for an independent state ?", "sent_token": ["One", "of", "the", "first", "Norman", "mercenaries", "to", "serve", "as", "a", "Byzantine", "general", "was", "Hervé", "in", "the", "1050s", ".", "By", "then", "however", ",", "there", "were", "already", "Norman", "mercenaries", "serving", "as", "far", "away", "as", "Trebizond", "and", "Georgia", ".", "They", "were", "based", "at", "Malatya", "and", "Edessa", ",", "under", "the", "Byzantine", "duke", "of", "Antioch", ",", "Isaac", "Komnenos", ".", "In", "the", "1060s", ",", "Robert", "Crispin", "led", "the", "Normans", "of", "Edessa", "against", "the", "Turks", ".", "Roussel", "de", "Bailleul", "even", "tried", "to", "carve", "out", "an", "independent", "state", "in", "Asia", "Minor", "with", "support", "from", "the", "local", "population", ",", "but", "he", "was", "stopped", "by", "the", "Byzantine", "general", "Alexius", "Komnenos", "."], "sample_type": "ori", "rel_ids": [1514]} -{"id": 8, "title": "", "context": "The further decline of Byzantine state - of - affairs paved the road to a third attack in 1185 , when a large Norman army invaded Dyrrachium , owing to the betrayal of high Byzantine officials . Some time later , Dyrrachium — one of the most important naval bases of the Adriatic — fell again to Byzantine hands .", "question": "When did the Normans attack Dyrrachium ?", "sent_token": ["The", "further", "decline", "of", "Byzantine", "state", "-", "of", "-", "affairs", "paved", "the", "road", "to", "a", "third", "attack", "in", "1185", ",", "when", "a", "large", "Norman", "army", "invaded", "Dyrrachium", ",", "owing", "to", "the", "betrayal", "of", "high", "Byzantine", "officials", ".", "Some", "time", "later", ",", "Dyrrachium", "—", "one", "of", "the", "most", "important", "naval", "bases", "of", "the", "Adriatic", "—", "fell", "again", "to", "Byzantine", "hands", "."], "sample_type": "ori", "rel_ids": [1515]} -{"id": 9, "title": "", "context": "The further decline of Byzantine state - of - affairs paved the road to a third attack in 1185 , when a large Norman army invaded Dyrrachium , owing to the betrayal of high Byzantine officials . Some time later , Dyrrachium — one of the most important naval bases of the Adriatic — fell again to Byzantine hands .", "question": "What was the naval base called ?", "sent_token": ["The", "further", "decline", "of", "Byzantine", "state", "-", "of", "-", "affairs", "paved", "the", "road", "to", "a", "third", "attack", "in", "1185", ",", "when", "a", "large", "Norman", "army", "invaded", "Dyrrachium", ",", "owing", "to", "the", "betrayal", "of", "high", "Byzantine", "officials", ".", "Some", "time", "later", ",", "Dyrrachium", "—", "one", "of", "the", "most", "important", "naval", "bases", "of", "the", "Adriatic", "—", "fell", "again", "to", "Byzantine", "hands", "."], "sample_type": "ori", "rel_ids": [1516]} -{"id": 10, "title": "", "context": "The further decline of Byzantine state - of - affairs paved the road to a third attack in 1185 , when a large Norman army invaded Dyrrachium , owing to the betrayal of high Byzantine officials . 
Some time later , Dyrrachium — one of the most important naval bases of the Adriatic — fell again to Byzantine hands .", "question": "Where was Dyrrachium located ?", "sent_token": ["The", "further", "decline", "of", "Byzantine", "state", "-", "of", "-", "affairs", "paved", "the", "road", "to", "a", "third", "attack", "in", "1185", ",", "when", "a", "large", "Norman", "army", "invaded", "Dyrrachium", ",", "owing", "to", "the", "betrayal", "of", "high", "Byzantine", "officials", ".", "Some", "time", "later", ",", "Dyrrachium", "—", "one", "of", "the", "most", "important", "naval", "bases", "of", "the", "Adriatic", "—", "fell", "again", "to", "Byzantine", "hands", "."], "sample_type": "ori", "rel_ids": [1517]} -{"id": 11, "title": "", "context": "One of the claimants of the English throne opposing William the Conqueror , Edgar Atheling , eventually fled to Scotland . King Malcolm III of Scotland married Edgar 's sister Margaret , and came into opposition to William who had already disputed Scotland 's southern borders . William invaded Scotland in 1072 , riding as far as Abernethy where he met up with his fleet of ships . Malcolm submitted , paid homage to William and surrendered his son Duncan as a hostage , beginning a series of arguments as to whether the Scottish Crown owed allegiance to the King of England .", "question": "Who was Margaret 's brother ?", "sent_token": ["One", "of", "the", "claimants", "of", "the", "English", "throne", "opposing", "William", "the", "Conqueror", ",", "Edgar", "Atheling", ",", "eventually", "fled", "to", "Scotland", ".", "King", "Malcolm", "III", "of", "Scotland", "married", "Edgar", "'s", "sister", "Margaret", ",", "and", "came", "into", "opposition", "to", "William", "who", "had", "already", "disputed", "Scotland", "'s", "southern", "borders", ".", "William", "invaded", "Scotland", "in", "1072", ",", "riding", "as", "far", "as", "Abernethy", "where", "he", "met", "up", "with", "his", "fleet", "of", "ships", ".", "Malcolm", "submitted", ",", "paid", "homage", "to", "William", "and", "surrendered", "his", "son", "Duncan", "as", "a", "hostage", ",", "beginning", "a", "series", "of", "arguments", "as", "to", "whether", "the", "Scottish", "Crown", "owed", "allegiance", "to", "the", "King", "of", "England", "."], "sample_type": "ori", "rel_ids": [1518]} -{"id": 12, "title": "", "context": "One of the claimants of the English throne opposing William the Conqueror , Edgar Atheling , eventually fled to Scotland . King Malcolm III of Scotland married Edgar 's sister Margaret , and came into opposition to William who had already disputed Scotland 's southern borders . William invaded Scotland in 1072 , riding as far as Abernethy where he met up with his fleet of ships . 
Malcolm submitted , paid homage to William and surrendered his son Duncan as a hostage , beginning a series of arguments as to whether the Scottish Crown owed allegiance to the King of England .", "question": "Who was Margaret 's husband ?", "sent_token": ["One", "of", "the", "claimants", "of", "the", "English", "throne", "opposing", "William", "the", "Conqueror", ",", "Edgar", "Atheling", ",", "eventually", "fled", "to", "Scotland", ".", "King", "Malcolm", "III", "of", "Scotland", "married", "Edgar", "'s", "sister", "Margaret", ",", "and", "came", "into", "opposition", "to", "William", "who", "had", "already", "disputed", "Scotland", "'s", "southern", "borders", ".", "William", "invaded", "Scotland", "in", "1072", ",", "riding", "as", "far", "as", "Abernethy", "where", "he", "met", "up", "with", "his", "fleet", "of", "ships", ".", "Malcolm", "submitted", ",", "paid", "homage", "to", "William", "and", "surrendered", "his", "son", "Duncan", "as", "a", "hostage", ",", "beginning", "a", "series", "of", "arguments", "as", "to", "whether", "the", "Scottish", "Crown", "owed", "allegiance", "to", "the", "King", "of", "England", "."], "sample_type": "ori", "rel_ids": [1519]} -{"id": 13, "title": "", "context": "One of the claimants of the English throne opposing William the Conqueror , Edgar Atheling , eventually fled to Scotland . King Malcolm III of Scotland married Edgar 's sister Margaret , and came into opposition to William who had already disputed Scotland 's southern borders . William invaded Scotland in 1072 , riding as far as Abernethy where he met up with his fleet of ships . Malcolm submitted , paid homage to William and surrendered his son Duncan as a hostage , beginning a series of arguments as to whether the Scottish Crown owed allegiance to the King of England .", "question": "When was Scotland invaded by William ?", "sent_token": ["One", "of", "the", "claimants", "of", "the", "English", "throne", "opposing", "William", "the", "Conqueror", ",", "Edgar", "Atheling", ",", "eventually", "fled", "to", "Scotland", ".", "King", "Malcolm", "III", "of", "Scotland", "married", "Edgar", "'s", "sister", "Margaret", ",", "and", "came", "into", "opposition", "to", "William", "who", "had", "already", "disputed", "Scotland", "'s", "southern", "borders", ".", "William", "invaded", "Scotland", "in", "1072", ",", "riding", "as", "far", "as", "Abernethy", "where", "he", "met", "up", "with", "his", "fleet", "of", "ships", ".", "Malcolm", "submitted", ",", "paid", "homage", "to", "William", "and", "surrendered", "his", "son", "Duncan", "as", "a", "hostage", ",", "beginning", "a", "series", "of", "arguments", "as", "to", "whether", "the", "Scottish", "Crown", "owed", "allegiance", "to", "the", "King", "of", "England", "."], "sample_type": "ori", "rel_ids": [1520]} -{"id": 14, "title": "", "context": "One of the claimants of the English throne opposing William the Conqueror , Edgar Atheling , eventually fled to Scotland . King Malcolm III of Scotland married Edgar 's sister Margaret , and came into opposition to William who had already disputed Scotland 's southern borders . William invaded Scotland in 1072 , riding as far as Abernethy where he met up with his fleet of ships . 
Malcolm submitted , paid homage to William and surrendered his son Duncan as a hostage , beginning a series of arguments as to whether the Scottish Crown owed allegiance to the King of England .", "question": "Who was the hostage ?", "sent_token": ["One", "of", "the", "claimants", "of", "the", "English", "throne", "opposing", "William", "the", "Conqueror", ",", "Edgar", "Atheling", ",", "eventually", "fled", "to", "Scotland", ".", "King", "Malcolm", "III", "of", "Scotland", "married", "Edgar", "'s", "sister", "Margaret", ",", "and", "came", "into", "opposition", "to", "William", "who", "had", "already", "disputed", "Scotland", "'s", "southern", "borders", ".", "William", "invaded", "Scotland", "in", "1072", ",", "riding", "as", "far", "as", "Abernethy", "where", "he", "met", "up", "with", "his", "fleet", "of", "ships", ".", "Malcolm", "submitted", ",", "paid", "homage", "to", "William", "and", "surrendered", "his", "son", "Duncan", "as", "a", "hostage", ",", "beginning", "a", "series", "of", "arguments", "as", "to", "whether", "the", "Scottish", "Crown", "owed", "allegiance", "to", "the", "King", "of", "England", "."], "sample_type": "ori", "rel_ids": [1521]} -{"id": 15, "title": "", "context": "Even before the Norman Conquest of England , the Normans had come into contact with Wales . Edward the Confessor had set up the aforementioned Ralph as earl of Hereford and charged him with defending the Marches and warring with the Welsh . In these original ventures , the Normans failed to make any headway into Wales .", "question": "Where was Ralph earl of ?", "sent_token": ["Even", "before", "the", "Norman", "Conquest", "of", "England", ",", "the", "Normans", "had", "come", "into", "contact", "with", "Wales", ".", "Edward", "the", "Confessor", "had", "set", "up", "the", "aforementioned", "Ralph", "as", "earl", "of", "Hereford", "and", "charged", "him", "with", "defending", "the", "Marches", "and", "warring", "with", "the", "Welsh", ".", "In", "these", "original", "ventures", ",", "the", "Normans", "failed", "to", "make", "any", "headway", "into", "Wales", "."], "sample_type": "ori", "rel_ids": [1522]} -{"id": 16, "title": "", "context": "Even before the Norman Conquest of England , the Normans had come into contact with Wales . Edward the Confessor had set up the aforementioned Ralph as earl of Hereford and charged him with defending the Marches and warring with the Welsh . In these original ventures , the Normans failed to make any headway into Wales .", "question": "Who was Ralph in charge of being at war with ?", "sent_token": ["Even", "before", "the", "Norman", "Conquest", "of", "England", ",", "the", "Normans", "had", "come", "into", "contact", "with", "Wales", ".", "Edward", "the", "Confessor", "had", "set", "up", "the", "aforementioned", "Ralph", "as", "earl", "of", "Hereford", "and", "charged", "him", "with", "defending", "the", "Marches", "and", "warring", "with", "the", "Welsh", ".", "In", "these", "original", "ventures", ",", "the", "Normans", "failed", "to", "make", "any", "headway", "into", "Wales", "."], "sample_type": "ori", "rel_ids": [1523]} -{"id": 17, "title": "", "context": "Even before the Norman Conquest of England , the Normans had come into contact with Wales . Edward the Confessor had set up the aforementioned Ralph as earl of Hereford and charged him with defending the Marches and warring with the Welsh . 
In these original ventures , the Normans failed to make any headway into Wales .", "question": "Who made Ralph earl ?", "sent_token": ["Even", "before", "the", "Norman", "Conquest", "of", "England", ",", "the", "Normans", "had", "come", "into", "contact", "with", "Wales", ".", "Edward", "the", "Confessor", "had", "set", "up", "the", "aforementioned", "Ralph", "as", "earl", "of", "Hereford", "and", "charged", "him", "with", "defending", "the", "Marches", "and", "warring", "with", "the", "Welsh", ".", "In", "these", "original", "ventures", ",", "the", "Normans", "failed", "to", "make", "any", "headway", "into", "Wales", "."], "sample_type": "ori", "rel_ids": [1524]} -{"id": 18, "title": "", "context": "Subsequent to the Conquest , however , the Marches came completely under the dominance of William 's most trusted Norman barons , including Bernard de Neufmarché , Roger of Montgomery in Shropshire and Hugh Lupus in Cheshire . These Normans began a long period of slow conquest during which almost all of Wales was at some point subject to Norman interference . Norman words , such as baron ( barwn ) , first entered Welsh at that time .", "question": "What country was under the control of Norman barons ?", "sent_token": ["Subsequent", "to", "the", "Conquest", ",", "however", ",", "the", "Marches", "came", "completely", "under", "the", "dominance", "of", "William", "'s", "most", "trusted", "Norman", "barons", ",", "including", "Bernard", "de", "Neufmarché", ",", "Roger", "of", "Montgomery", "in", "Shropshire", "and", "Hugh", "Lupus", "in", "Cheshire", ".", "These", "Normans", "began", "a", "long", "period", "of", "slow", "conquest", "during", "which", "almost", "all", "of", "Wales", "was", "at", "some", "point", "subject", "to", "Norman", "interference", ".", "Norman", "words", ",", "such", "as", "baron", "(", "barwn", ")", ",", "first", "entered", "Welsh", "at", "that", "time", "."], "sample_type": "ori", "rel_ids": [1525]} -{"id": 19, "title": "", "context": "The legendary religious zeal of the Normans was exercised in religious wars long before the First Crusade carved out a Norman principality in Antioch . They were major foreign participants in the Reconquista in Iberia . In 1018 , Roger de Tosny travelled to the Iberian Peninsula to carve out a state for himself from Moorish lands , but failed . In 1064 , during the War of Barbastro , William of Montreuil led the papal army and took a huge booty .", "question": "What year did Roger de Tosny fail to accomplish what he set out to do ?", "sent_token": ["The", "legendary", "religious", "zeal", "of", "the", "Normans", "was", "exercised", "in", "religious", "wars", "long", "before", "the", "First", "Crusade", "carved", "out", "a", "Norman", "principality", "in", "Antioch", ".", "They", "were", "major", "foreign", "participants", "in", "the", "Reconquista", "in", "Iberia", ".", "In", "1018", ",", "Roger", "de", "Tosny", "travelled", "to", "the", "Iberian", "Peninsula", "to", "carve", "out", "a", "state", "for", "himself", "from", "Moorish", "lands", ",", "but", "failed", ".", "In", "1064", ",", "during", "the", "War", "of", "Barbastro", ",", "William", "of", "Montreuil", "led", "the", "papal", "army", "and", "took", "a", "huge", "booty", "."], "sample_type": "ori", "rel_ids": [1526]} -{"id": 20, "title": "", "context": "The legendary religious zeal of the Normans was exercised in religious wars long before the First Crusade carved out a Norman principality in Antioch . They were major foreign participants in the Reconquista in Iberia . 
In 1018 , Roger de Tosny travelled to the Iberian Peninsula to carve out a state for himself from Moorish lands , but failed . In 1064 , during the War of Barbastro , William of Montreuil led the papal army and took a huge booty .", "question": "Who was in charge of the papal army in the War of Barbastro ?", "sent_token": ["The", "legendary", "religious", "zeal", "of", "the", "Normans", "was", "exercised", "in", "religious", "wars", "long", "before", "the", "First", "Crusade", "carved", "out", "a", "Norman", "principality", "in", "Antioch", ".", "They", "were", "major", "foreign", "participants", "in", "the", "Reconquista", "in", "Iberia", ".", "In", "1018", ",", "Roger", "de", "Tosny", "travelled", "to", "the", "Iberian", "Peninsula", "to", "carve", "out", "a", "state", "for", "himself", "from", "Moorish", "lands", ",", "but", "failed", ".", "In", "1064", ",", "during", "the", "War", "of", "Barbastro", ",", "William", "of", "Montreuil", "led", "the", "papal", "army", "and", "took", "a", "huge", "booty", "."], "sample_type": "ori", "rel_ids": [1527]} -{"id": 21, "title": "", "context": "In 1096 , Crusaders passing by the siege of Amalfi were joined by Bohemond of Taranto and his nephew Tancred with an army of Italo - Normans . Bohemond was the de facto leader of the Crusade during its passage through Asia Minor . After the successful Siege of Antioch in 1097 , Bohemond began carving out an independent principality around that city . Tancred was instrumental in the conquest of Jerusalem and he worked for the expansion of the Crusader kingdom in Transjordan and the region of Galilee.[citation needed ]", "question": "When did the Siege of Antioch take place ?", "sent_token": ["In", "1096", ",", "Crusaders", "passing", "by", "the", "siege", "of", "Amalfi", "were", "joined", "by", "Bohemond", "of", "Taranto", "and", "his", "nephew", "Tancred", "with", "an", "army", "of", "Italo", "-", "Normans", ".", "Bohemond", "was", "the", "de", "facto", "leader", "of", "the", "Crusade", "during", "its", "passage", "through", "Asia", "Minor", ".", "After", "the", "successful", "Siege", "of", "Antioch", "in", "1097", ",", "Bohemond", "began", "carving", "out", "an", "independent", "principality", "around", "that", "city", ".", "Tancred", "was", "instrumental", "in", "the", "conquest", "of", "Jerusalem", "and", "he", "worked", "for", "the", "expansion", "of", "the", "Crusader", "kingdom", "in", "Transjordan", "and", "the", "region", "of", "Galilee.[citation", "needed", "]"], "sample_type": "ori", "rel_ids": [1528]} -{"id": 22, "title": "", "context": "In 1096 , Crusaders passing by the siege of Amalfi were joined by Bohemond of Taranto and his nephew Tancred with an army of Italo - Normans . Bohemond was the de facto leader of the Crusade during its passage through Asia Minor . After the successful Siege of Antioch in 1097 , Bohemond began carving out an independent principality around that city . 
Tancred was instrumental in the conquest of Jerusalem and he worked for the expansion of the Crusader kingdom in Transjordan and the region of Galilee.[citation needed ]", "question": "What was the name of Bohemond 's nephew ?", "sent_token": ["In", "1096", ",", "Crusaders", "passing", "by", "the", "siege", "of", "Amalfi", "were", "joined", "by", "Bohemond", "of", "Taranto", "and", "his", "nephew", "Tancred", "with", "an", "army", "of", "Italo", "-", "Normans", ".", "Bohemond", "was", "the", "de", "facto", "leader", "of", "the", "Crusade", "during", "its", "passage", "through", "Asia", "Minor", ".", "After", "the", "successful", "Siege", "of", "Antioch", "in", "1097", ",", "Bohemond", "began", "carving", "out", "an", "independent", "principality", "around", "that", "city", ".", "Tancred", "was", "instrumental", "in", "the", "conquest", "of", "Jerusalem", "and", "he", "worked", "for", "the", "expansion", "of", "the", "Crusader", "kingdom", "in", "Transjordan", "and", "the", "region", "of", "Galilee.[citation", "needed", "]"], "sample_type": "ori", "rel_ids": [1529]} -{"id": 23, "title": "", "context": "In 1096 , Crusaders passing by the siege of Amalfi were joined by Bohemond of Taranto and his nephew Tancred with an army of Italo - Normans . Bohemond was the de facto leader of the Crusade during its passage through Asia Minor . After the successful Siege of Antioch in 1097 , Bohemond began carving out an independent principality around that city . Tancred was instrumental in the conquest of Jerusalem and he worked for the expansion of the Crusader kingdom in Transjordan and the region of Galilee.[citation needed ]", "question": "What major conquest did Tancred play a roll in ?", "sent_token": ["In", "1096", ",", "Crusaders", "passing", "by", "the", "siege", "of", "Amalfi", "were", "joined", "by", "Bohemond", "of", "Taranto", "and", "his", "nephew", "Tancred", "with", "an", "army", "of", "Italo", "-", "Normans", ".", "Bohemond", "was", "the", "de", "facto", "leader", "of", "the", "Crusade", "during", "its", "passage", "through", "Asia", "Minor", ".", "After", "the", "successful", "Siege", "of", "Antioch", "in", "1097", ",", "Bohemond", "began", "carving", "out", "an", "independent", "principality", "around", "that", "city", ".", "Tancred", "was", "instrumental", "in", "the", "conquest", "of", "Jerusalem", "and", "he", "worked", "for", "the", "expansion", "of", "the", "Crusader", "kingdom", "in", "Transjordan", "and", "the", "region", "of", "Galilee.[citation", "needed", "]"], "sample_type": "ori", "rel_ids": [1530]} -{"id": 24, "title": "", "context": "The conquest of Cyprus by the Anglo - Norman forces of the Third Crusade opened a new chapter in the history of the island , which would be under Western European domination for the following 380 years . 
Although not part of a planned operation , the conquest had much more permanent results than initially expected .", "question": "How long did Western Europe control Cyprus ?", "sent_token": ["The", "conquest", "of", "Cyprus", "by", "the", "Anglo", "-", "Norman", "forces", "of", "the", "Third", "Crusade", "opened", "a", "new", "chapter", "in", "the", "history", "of", "the", "island", ",", "which", "would", "be", "under", "Western", "European", "domination", "for", "the", "following", "380", "years", ".", "Although", "not", "part", "of", "a", "planned", "operation", ",", "the", "conquest", "had", "much", "more", "permanent", "results", "than", "initially", "expected", "."], "sample_type": "ori", "rel_ids": [1531]} -{"id": 25, "title": "", "context": "Between 1402 and 1405 , the expedition led by the Norman noble Jean de Bethencourt and the Poitevine Gadifer de la Salle conquered the Canarian islands of Lanzarote , Fuerteventura and El Hierro off the Atlantic coast of Africa . Their troops were gathered in Normandy , Gascony and were later reinforced by Castilian colonists .", "question": "What continent are the Canarian Islands off the coast of ?", "sent_token": ["Between", "1402", "and", "1405", ",", "the", "expedition", "led", "by", "the", "Norman", "noble", "Jean", "de", "Bethencourt", "and", "the", "Poitevine", "Gadifer", "de", "la", "Salle", "conquered", "the", "Canarian", "islands", "of", "Lanzarote", ",", "Fuerteventura", "and", "El", "Hierro", "off", "the", "Atlantic", "coast", "of", "Africa", ".", "Their", "troops", "were", "gathered", "in", "Normandy", ",", "Gascony", "and", "were", "later", "reinforced", "by", "Castilian", "colonists", "."], "sample_type": "ori", "rel_ids": [1532]} -{"id": 26, "title": "", "context": "Bethencourt took the title of King of the Canary Islands , as vassal to Henry III of Castile . In 1418 , Jean 's nephew Maciot de Bethencourt sold the rights to the islands to Enrique Pérez de Guzmán , 2nd Count de Niebla .", "question": "Who became the King of the Canary Islands ?", "sent_token": ["Bethencourt", "took", "the", "title", "of", "King", "of", "the", "Canary", "Islands", ",", "as", "vassal", "to", "Henry", "III", "of", "Castile", ".", "In", "1418", ",", "Jean", "'s", "nephew", "Maciot", "de", "Bethencourt", "sold", "the", "rights", "to", "the", "islands", "to", "Enrique", "Pérez", "de", "Guzmán", ",", "2nd", "Count", "de", "Niebla", "."], "sample_type": "ori", "rel_ids": [1533]} -{"id": 27, "title": "", "context": "Bethencourt took the title of King of the Canary Islands , as vassal to Henry III of Castile . In 1418 , Jean 's nephew Maciot de Bethencourt sold the rights to the islands to Enrique Pérez de Guzmán , 2nd Count de Niebla .", "question": "Who bought the rights ?", "sent_token": ["Bethencourt", "took", "the", "title", "of", "King", "of", "the", "Canary", "Islands", ",", "as", "vassal", "to", "Henry", "III", "of", "Castile", ".", "In", "1418", ",", "Jean", "'s", "nephew", "Maciot", "de", "Bethencourt", "sold", "the", "rights", "to", "the", "islands", "to", "Enrique", "Pérez", "de", "Guzmán", ",", "2nd", "Count", "de", "Niebla", "."], "sample_type": "ori", "rel_ids": [1534]} -{"id": 28, "title": "", "context": "Bethencourt took the title of King of the Canary Islands , as vassal to Henry III of Castile . 
In 1418 , Jean 's nephew Maciot de Bethencourt sold the rights to the islands to Enrique Pérez de Guzmán , 2nd Count de Niebla .", "question": "Who sold the rights ?", "sent_token": ["Bethencourt", "took", "the", "title", "of", "King", "of", "the", "Canary", "Islands", ",", "as", "vassal", "to", "Henry", "III", "of", "Castile", ".", "In", "1418", ",", "Jean", "'s", "nephew", "Maciot", "de", "Bethencourt", "sold", "the", "rights", "to", "the", "islands", "to", "Enrique", "Pérez", "de", "Guzmán", ",", "2nd", "Count", "de", "Niebla", "."], "sample_type": "ori", "rel_ids": [1535]} -{"id": 29, "title": "", "context": "The customary law of Normandy was developed between the 10th and 13th centuries and survives today through the legal systems of Jersey and Guernsey in the Channel Islands . Norman customary law was transcribed in two customaries in Latin by two judges for use by them and their colleagues : These are the Très ancien coutumier ( Very ancient customary ) , authored between 1200 and 1245 ; and the Grand coutumier de Normandie ( Great customary of Normandy , originally Summa de legibus Normanniae in curia laïcali ) , authored between 1235 and 1245 .", "question": "Where are Jersey and Guernsey", "sent_token": ["The", "customary", "law", "of", "Normandy", "was", "developed", "between", "the", "10th", "and", "13th", "centuries", "and", "survives", "today", "through", "the", "legal", "systems", "of", "Jersey", "and", "Guernsey", "in", "the", "Channel", "Islands", ".", "Norman", "customary", "law", "was", "transcribed", "in", "two", "customaries", "in", "Latin", "by", "two", "judges", "for", "use", "by", "them", "and", "their", "colleagues", ":", "These", "are", "the", "Très", "ancien", "coutumier", "(", "Very", "ancient", "customary", ")", ",", "authored", "between", "1200", "and", "1245", ";", "and", "the", "Grand", "coutumier", "de", "Normandie", "(", "Great", "customary", "of", "Normandy", ",", "originally", "Summa", "de", "legibus", "Normanniae", "in", "curia", "laïcali", ")", ",", "authored", "between", "1235", "and", "1245", "."], "sample_type": "ori", "rel_ids": [1536]} -{"id": 30, "title": "", "context": "The customary law of Normandy was developed between the 10th and 13th centuries and survives today through the legal systems of Jersey and Guernsey in the Channel Islands . 
Norman customary law was transcribed in two customaries in Latin by two judges for use by them and their colleagues : These are the Très ancien coutumier ( Very ancient customary ) , authored between 1200 and 1245 ; and the Grand coutumier de Normandie ( Great customary of Normandy , originally Summa de legibus Normanniae in curia laïcali ) , authored between 1235 and 1245 .", "question": "How many customaries does Norman customary law have ?", "sent_token": ["The", "customary", "law", "of", "Normandy", "was", "developed", "between", "the", "10th", "and", "13th", "centuries", "and", "survives", "today", "through", "the", "legal", "systems", "of", "Jersey", "and", "Guernsey", "in", "the", "Channel", "Islands", ".", "Norman", "customary", "law", "was", "transcribed", "in", "two", "customaries", "in", "Latin", "by", "two", "judges", "for", "use", "by", "them", "and", "their", "colleagues", ":", "These", "are", "the", "Très", "ancien", "coutumier", "(", "Very", "ancient", "customary", ")", ",", "authored", "between", "1200", "and", "1245", ";", "and", "the", "Grand", "coutumier", "de", "Normandie", "(", "Great", "customary", "of", "Normandy", ",", "originally", "Summa", "de", "legibus", "Normanniae", "in", "curia", "laïcali", ")", ",", "authored", "between", "1235", "and", "1245", "."], "sample_type": "ori", "rel_ids": [1537]} -{"id": 31, "title": "", "context": "Norman architecture typically stands out as a new stage in the architectural history of the regions they subdued . They spread a unique Romanesque idiom to England and Italy , and the encastellation of these regions with keeps in their north French style fundamentally altered the military landscape . Their style was characterised by rounded arches , particularly over windows and doorways , and massive proportions .", "question": "What is the Norman architecture idiom ?", "sent_token": ["Norman", "architecture", "typically", "stands", "out", "as", "a", "new", "stage", "in", "the", "architectural", "history", "of", "the", "regions", "they", "subdued", ".", "They", "spread", "a", "unique", "Romanesque", "idiom", "to", "England", "and", "Italy", ",", "and", "the", "encastellation", "of", "these", "regions", "with", "keeps", "in", "their", "north", "French", "style", "fundamentally", "altered", "the", "military", "landscape", ".", "Their", "style", "was", "characterised", "by", "rounded", "arches", ",", "particularly", "over", "windows", "and", "doorways", ",", "and", "massive", "proportions", "."], "sample_type": "ori", "rel_ids": [1538]} -{"id": 32, "title": "", "context": "Norman architecture typically stands out as a new stage in the architectural history of the regions they subdued . They spread a unique Romanesque idiom to England and Italy , and the encastellation of these regions with keeps in their north French style fundamentally altered the military landscape . 
Their style was characterised by rounded arches , particularly over windows and doorways , and massive proportions .", "question": "What kind of arches does Norman architecture have ?", "sent_token": ["Norman", "architecture", "typically", "stands", "out", "as", "a", "new", "stage", "in", "the", "architectural", "history", "of", "the", "regions", "they", "subdued", ".", "They", "spread", "a", "unique", "Romanesque", "idiom", "to", "England", "and", "Italy", ",", "and", "the", "encastellation", "of", "these", "regions", "with", "keeps", "in", "their", "north", "French", "style", "fundamentally", "altered", "the", "military", "landscape", ".", "Their", "style", "was", "characterised", "by", "rounded", "arches", ",", "particularly", "over", "windows", "and", "doorways", ",", "and", "massive", "proportions", "."], "sample_type": "ori", "rel_ids": [1539]} -{"id": 33, "title": "", "context": "In England , the period of Norman architecture immediately succeeds that of the Anglo - Saxon and precedes the Early Gothic . In southern Italy , the Normans incorporated elements of Islamic , Lombard , and Byzantine building techniques into their own , initiating a unique style known as Norman - Arab architecture within the Kingdom of Sicily .", "question": "What architecture type came after Norman in England ?", "sent_token": ["In", "England", ",", "the", "period", "of", "Norman", "architecture", "immediately", "succeeds", "that", "of", "the", "Anglo", "-", "Saxon", "and", "precedes", "the", "Early", "Gothic", ".", "In", "southern", "Italy", ",", "the", "Normans", "incorporated", "elements", "of", "Islamic", ",", "Lombard", ",", "and", "Byzantine", "building", "techniques", "into", "their", "own", ",", "initiating", "a", "unique", "style", "known", "as", "Norman", "-", "Arab", "architecture", "within", "the", "Kingdom", "of", "Sicily", "."], "sample_type": "ori", "rel_ids": [1540]} -{"id": 34, "title": "", "context": "In England , the period of Norman architecture immediately succeeds that of the Anglo - Saxon and precedes the Early Gothic . In southern Italy , the Normans incorporated elements of Islamic , Lombard , and Byzantine building techniques into their own , initiating a unique style known as Norman - Arab architecture within the Kingdom of Sicily .", "question": "What architecture type came before Norman in England ?", "sent_token": ["In", "England", ",", "the", "period", "of", "Norman", "architecture", "immediately", "succeeds", "that", "of", "the", "Anglo", "-", "Saxon", "and", "precedes", "the", "Early", "Gothic", ".", "In", "southern", "Italy", ",", "the", "Normans", "incorporated", "elements", "of", "Islamic", ",", "Lombard", ",", "and", "Byzantine", "building", "techniques", "into", "their", "own", ",", "initiating", "a", "unique", "style", "known", "as", "Norman", "-", "Arab", "architecture", "within", "the", "Kingdom", "of", "Sicily", "."], "sample_type": "ori", "rel_ids": [1541]} -{"id": 35, "title": "", "context": "In England , the period of Norman architecture immediately succeeds that of the Anglo - Saxon and precedes the Early Gothic . 
In southern Italy , the Normans incorporated elements of Islamic , Lombard , and Byzantine building techniques into their own , initiating a unique style known as Norman - Arab architecture within the Kingdom of Sicily .", "question": "What place had the Norman Arab architectural style ?", "sent_token": ["In", "England", ",", "the", "period", "of", "Norman", "architecture", "immediately", "succeeds", "that", "of", "the", "Anglo", "-", "Saxon", "and", "precedes", "the", "Early", "Gothic", ".", "In", "southern", "Italy", ",", "the", "Normans", "incorporated", "elements", "of", "Islamic", ",", "Lombard", ",", "and", "Byzantine", "building", "techniques", "into", "their", "own", ",", "initiating", "a", "unique", "style", "known", "as", "Norman", "-", "Arab", "architecture", "within", "the", "Kingdom", "of", "Sicily", "."], "sample_type": "ori", "rel_ids": [1542]} -{"id": 36, "title": "", "context": "The French Wars of Religion in the 16th century and French Revolution in the 18th successively destroyed much of what existed in the way of the architectural and artistic remnant of this Norman creativity . The former , with their violence , caused the wanton destruction of many Norman edifices ; the latter , with its assault on religion , caused the purposeful destruction of religious objects of any type , and its destabilisation of society resulted in rampant pillaging .", "question": "When were the French wars of religion ?", "sent_token": ["The", "French", "Wars", "of", "Religion", "in", "the", "16th", "century", "and", "French", "Revolution", "in", "the", "18th", "successively", "destroyed", "much", "of", "what", "existed", "in", "the", "way", "of", "the", "architectural", "and", "artistic", "remnant", "of", "this", "Norman", "creativity", ".", "The", "former", ",", "with", "their", "violence", ",", "caused", "the", "wanton", "destruction", "of", "many", "Norman", "edifices", ";", "the", "latter", ",", "with", "its", "assault", "on", "religion", ",", "caused", "the", "purposeful", "destruction", "of", "religious", "objects", "of", "any", "type", ",", "and", "its", "destabilisation", "of", "society", "resulted", "in", "rampant", "pillaging", "."], "sample_type": "ori", "rel_ids": [1543]} -{"id": 37, "title": "", "context": "By far the most famous work of Norman art is the Bayeux Tapestry , which is not a tapestry but a work of embroidery . It was commissioned by Odo , the Bishop of Bayeux and first Earl of Kent , employing natives from Kent who were learned in the Nordic traditions imported in the previous half century by the Danish Vikings .", "question": "What kind of needlework was used in the creation of the Bayeux Tapestry ?", "sent_token": ["By", "far", "the", "most", "famous", "work", "of", "Norman", "art", "is", "the", "Bayeux", "Tapestry", ",", "which", "is", "not", "a", "tapestry", "but", "a", "work", "of", "embroidery", ".", "It", "was", "commissioned", "by", "Odo", ",", "the", "Bishop", "of", "Bayeux", "and", "first", "Earl", "of", "Kent", ",", "employing", "natives", "from", "Kent", "who", "were", "learned", "in", "the", "Nordic", "traditions", "imported", "in", "the", "previous", "half", "century", "by", "the", "Danish", "Vikings", "."], "sample_type": "ori", "rel_ids": [1544]} -{"id": 38, "title": "", "context": "By far the most famous work of Norman art is the Bayeux Tapestry , which is not a tapestry but a work of embroidery . 
It was commissioned by Odo , the Bishop of Bayeux and first Earl of Kent , employing natives from Kent who were learned in the Nordic traditions imported in the previous half century by the Danish Vikings .", "question": "What is Norman art 's most well known piece ?", "sent_token": ["By", "far", "the", "most", "famous", "work", "of", "Norman", "art", "is", "the", "Bayeux", "Tapestry", ",", "which", "is", "not", "a", "tapestry", "but", "a", "work", "of", "embroidery", ".", "It", "was", "commissioned", "by", "Odo", ",", "the", "Bishop", "of", "Bayeux", "and", "first", "Earl", "of", "Kent", ",", "employing", "natives", "from", "Kent", "who", "were", "learned", "in", "the", "Nordic", "traditions", "imported", "in", "the", "previous", "half", "century", "by", "the", "Danish", "Vikings", "."], "sample_type": "ori", "rel_ids": [1545]} -{"id": 39, "title": "", "context": "By far the most famous work of Norman art is the Bayeux Tapestry , which is not a tapestry but a work of embroidery . It was commissioned by Odo , the Bishop of Bayeux and first Earl of Kent , employing natives from Kent who were learned in the Nordic traditions imported in the previous half century by the Danish Vikings .", "question": "Who commissioned the Tapestry ?", "sent_token": ["By", "far", "the", "most", "famous", "work", "of", "Norman", "art", "is", "the", "Bayeux", "Tapestry", ",", "which", "is", "not", "a", "tapestry", "but", "a", "work", "of", "embroidery", ".", "It", "was", "commissioned", "by", "Odo", ",", "the", "Bishop", "of", "Bayeux", "and", "first", "Earl", "of", "Kent", ",", "employing", "natives", "from", "Kent", "who", "were", "learned", "in", "the", "Nordic", "traditions", "imported", "in", "the", "previous", "half", "century", "by", "the", "Danish", "Vikings", "."], "sample_type": "ori", "rel_ids": [1546]} -{"id": 40, "title": "", "context": "At Saint Evroul , a tradition of singing had developed and the choir achieved fame in Normandy . Under the Norman abbot Robert de Grantmesnil , several monks of Saint - Evroul fled to southern Italy , where they were patronised by Robert Guiscard and established a Latin monastery at Sant'Eufemia . There they continued the tradition of singing .", "question": "Where did the monks flee to ?", "sent_token": ["At", "Saint", "Evroul", ",", "a", "tradition", "of", "singing", "had", "developed", "and", "the", "choir", "achieved", "fame", "in", "Normandy", ".", "Under", "the", "Norman", "abbot", "Robert", "de", "Grantmesnil", ",", "several", "monks", "of", "Saint", "-", "Evroul", "fled", "to", "southern", "Italy", ",", "where", "they", "were", "patronised", "by", "Robert", "Guiscard", "and", "established", "a", "Latin", "monastery", "at", "Sant'Eufemia", ".", "There", "they", "continued", "the", "tradition", "of", "singing", "."], "sample_type": "ori", "rel_ids": [1547]} -{"id": 41, "title": "", "context": "At Saint Evroul , a tradition of singing had developed and the choir achieved fame in Normandy . Under the Norman abbot Robert de Grantmesnil , several monks of Saint - Evroul fled to southern Italy , where they were patronised by Robert Guiscard and established a Latin monastery at Sant'Eufemia . 
There they continued the tradition of singing .", "question": "What monastery did the Saint - Evroul monks establish in Italy ?", "sent_token": ["At", "Saint", "Evroul", ",", "a", "tradition", "of", "singing", "had", "developed", "and", "the", "choir", "achieved", "fame", "in", "Normandy", ".", "Under", "the", "Norman", "abbot", "Robert", "de", "Grantmesnil", ",", "several", "monks", "of", "Saint", "-", "Evroul", "fled", "to", "southern", "Italy", ",", "where", "they", "were", "patronised", "by", "Robert", "Guiscard", "and", "established", "a", "Latin", "monastery", "at", "Sant'Eufemia", ".", "There", "they", "continued", "the", "tradition", "of", "singing", "."], "sample_type": "ori", "rel_ids": [1548]} -{"id": 42, "title": "", "context": "At Saint Evroul , a tradition of singing had developed and the choir achieved fame in Normandy . Under the Norman abbot Robert de Grantmesnil , several monks of Saint - Evroul fled to southern Italy , where they were patronised by Robert Guiscard and established a Latin monastery at Sant'Eufemia . There they continued the tradition of singing .", "question": "Who patronized the monks in Italy ?", "sent_token": ["At", "Saint", "Evroul", ",", "a", "tradition", "of", "singing", "had", "developed", "and", "the", "choir", "achieved", "fame", "in", "Normandy", ".", "Under", "the", "Norman", "abbot", "Robert", "de", "Grantmesnil", ",", "several", "monks", "of", "Saint", "-", "Evroul", "fled", "to", "southern", "Italy", ",", "where", "they", "were", "patronised", "by", "Robert", "Guiscard", "and", "established", "a", "Latin", "monastery", "at", "Sant'Eufemia", ".", "There", "they", "continued", "the", "tradition", "of", "singing", "."], "sample_type": "ori", "rel_ids": [1549]} -{"id": 43, "title": "", "context": "At Saint Evroul , a tradition of singing had developed and the choir achieved fame in Normandy . Under the Norman abbot Robert de Grantmesnil , several monks of Saint - Evroul fled to southern Italy , where they were patronised by Robert Guiscard and established a Latin monastery at Sant'Eufemia . There they continued the tradition of singing .", "question": "What tradition were the Saint - Evroul monks known for ?", "sent_token": ["At", "Saint", "Evroul", ",", "a", "tradition", "of", "singing", "had", "developed", "and", "the", "choir", "achieved", "fame", "in", "Normandy", ".", "Under", "the", "Norman", "abbot", "Robert", "de", "Grantmesnil", ",", "several", "monks", "of", "Saint", "-", "Evroul", "fled", "to", "southern", "Italy", ",", "where", "they", "were", "patronised", "by", "Robert", "Guiscard", "and", "established", "a", "Latin", "monastery", "at", "Sant'Eufemia", ".", "There", "they", "continued", "the", "tradition", "of", "singing", "."], "sample_type": "ori", "rel_ids": [1550]} -{"id": 44, "title": "", "context": "Computational complexity theory is a branch of the theory of computation in theoretical computer science that focuses on classifying computational problems according to their inherent difficulty , and relating those classes to each other . 
A computational problem is understood to be a task that is in principle amenable to being solved by a computer , which is equivalent to stating that the problem may be solved by mechanical application of mathematical steps , such as an algorithm .", "question": "What branch of theoretical computer science deals with broadly classifying computational problems by difficulty and class of relationship ?", "sent_token": ["Computational", "complexity", "theory", "is", "a", "branch", "of", "the", "theory", "of", "computation", "in", "theoretical", "computer", "science", "that", "focuses", "on", "classifying", "computational", "problems", "according", "to", "their", "inherent", "difficulty", ",", "and", "relating", "those", "classes", "to", "each", "other", ".", "A", "computational", "problem", "is", "understood", "to", "be", "a", "task", "that", "is", "in", "principle", "amenable", "to", "being", "solved", "by", "a", "computer", ",", "which", "is", "equivalent", "to", "stating", "that", "the", "problem", "may", "be", "solved", "by", "mechanical", "application", "of", "mathematical", "steps", ",", "such", "as", "an", "algorithm", "."], "sample_type": "ori", "rel_ids": [1551]} -{"id": 45, "title": "", "context": "Computational complexity theory is a branch of the theory of computation in theoretical computer science that focuses on classifying computational problems according to their inherent difficulty , and relating those classes to each other . A computational problem is understood to be a task that is in principle amenable to being solved by a computer , which is equivalent to stating that the problem may be solved by mechanical application of mathematical steps , such as an algorithm .", "question": "By what main attribute are computational problems classified utilizing computational complexity theory ?", "sent_token": ["Computational", "complexity", "theory", "is", "a", "branch", "of", "the", "theory", "of", "computation", "in", "theoretical", "computer", "science", "that", "focuses", "on", "classifying", "computational", "problems", "according", "to", "their", "inherent", "difficulty", ",", "and", "relating", "those", "classes", "to", "each", "other", ".", "A", "computational", "problem", "is", "understood", "to", "be", "a", "task", "that", "is", "in", "principle", "amenable", "to", "being", "solved", "by", "a", "computer", ",", "which", "is", "equivalent", "to", "stating", "that", "the", "problem", "may", "be", "solved", "by", "mechanical", "application", "of", "mathematical", "steps", ",", "such", "as", "an", "algorithm", "."], "sample_type": "ori", "rel_ids": [1552]} -{"id": 46, "title": "", "context": "Computational complexity theory is a branch of the theory of computation in theoretical computer science that focuses on classifying computational problems according to their inherent difficulty , and relating those classes to each other . 
A computational problem is understood to be a task that is in principle amenable to being solved by a computer , which is equivalent to stating that the problem may be solved by mechanical application of mathematical steps , such as an algorithm .", "question": "What is the term for a task that generally lends itself to being solved by a computer ?", "sent_token": ["Computational", "complexity", "theory", "is", "a", "branch", "of", "the", "theory", "of", "computation", "in", "theoretical", "computer", "science", "that", "focuses", "on", "classifying", "computational", "problems", "according", "to", "their", "inherent", "difficulty", ",", "and", "relating", "those", "classes", "to", "each", "other", ".", "A", "computational", "problem", "is", "understood", "to", "be", "a", "task", "that", "is", "in", "principle", "amenable", "to", "being", "solved", "by", "a", "computer", ",", "which", "is", "equivalent", "to", "stating", "that", "the", "problem", "may", "be", "solved", "by", "mechanical", "application", "of", "mathematical", "steps", ",", "such", "as", "an", "algorithm", "."], "sample_type": "ori", "rel_ids": [1553]} -{"id": 47, "title": "", "context": "To further highlight the difference between a problem and an instance , consider the following instance of the decision version of the traveling salesman problem : Is there a route of at most 2000 kilometres passing through all of Germany 's 15 largest cities ? The quantitative answer to this particular problem instance is of little use for solving other instances of the problem , such as asking for a round trip through all sites in Milan whose total length is at most 10 km . For this reason , complexity theory addresses computational problems and not particular problem instances .", "question": "By how many kilometers does the traveling salesman problem seek to classify a route between the 15 largest cities in Germany ?", "sent_token": ["To", "further", "highlight", "the", "difference", "between", "a", "problem", "and", "an", "instance", ",", "consider", "the", "following", "instance", "of", "the", "decision", "version", "of", "the", "traveling", "salesman", "problem", ":", "Is", "there", "a", "route", "of", "at", "most", "2000", "kilometres", "passing", "through", "all", "of", "Germany", "'s", "15", "largest", "cities", "?", "The", "quantitative", "answer", "to", "this", "particular", "problem", "instance", "is", "of", "little", "use", "for", "solving", "other", "instances", "of", "the", "problem", ",", "such", "as", "asking", "for", "a", "round", "trip", "through", "all", "sites", "in", "Milan", "whose", "total", "length", "is", "at", "most", "10", "km", ".", "For", "this", "reason", ",", "complexity", "theory", "addresses", "computational", "problems", "and", "not", "particular", "problem", "instances", "."], "sample_type": "ori", "rel_ids": [1554]} -{"id": 48, "title": "", "context": "To further highlight the difference between a problem and an instance , consider the following instance of the decision version of the traveling salesman problem : Is there a route of at most 2000 kilometres passing through all of Germany 's 15 largest cities ? The quantitative answer to this particular problem instance is of little use for solving other instances of the problem , such as asking for a round trip through all sites in Milan whose total length is at most 10 km . 
For this reason , complexity theory addresses computational problems and not particular problem instances .", "question": "What is one example of an instance that the quantitative answer to the traveling salesman problem fails to answer ?", "sent_token": ["To", "further", "highlight", "the", "difference", "between", "a", "problem", "and", "an", "instance", ",", "consider", "the", "following", "instance", "of", "the", "decision", "version", "of", "the", "traveling", "salesman", "problem", ":", "Is", "there", "a", "route", "of", "at", "most", "2000", "kilometres", "passing", "through", "all", "of", "Germany", "'s", "15", "largest", "cities", "?", "The", "quantitative", "answer", "to", "this", "particular", "problem", "instance", "is", "of", "little", "use", "for", "solving", "other", "instances", "of", "the", "problem", ",", "such", "as", "asking", "for", "a", "round", "trip", "through", "all", "sites", "in", "Milan", "whose", "total", "length", "is", "at", "most", "10", "km", ".", "For", "this", "reason", ",", "complexity", "theory", "addresses", "computational", "problems", "and", "not", "particular", "problem", "instances", "."], "sample_type": "ori", "rel_ids": [1555]} -{"id": 49, "title": "", "context": "To further highlight the difference between a problem and an instance , consider the following instance of the decision version of the traveling salesman problem : Is there a route of at most 2000 kilometres passing through all of Germany 's 15 largest cities ? The quantitative answer to this particular problem instance is of little use for solving other instances of the problem , such as asking for a round trip through all sites in Milan whose total length is at most 10 km . For this reason , complexity theory addresses computational problems and not particular problem instances .", "question": "What does computational complexity theory most specifically seek to answer ?", "sent_token": ["To", "further", "highlight", "the", "difference", "between", "a", "problem", "and", "an", "instance", ",", "consider", "the", "following", "instance", "of", "the", "decision", "version", "of", "the", "traveling", "salesman", "problem", ":", "Is", "there", "a", "route", "of", "at", "most", "2000", "kilometres", "passing", "through", "all", "of", "Germany", "'s", "15", "largest", "cities", "?", "The", "quantitative", "answer", "to", "this", "particular", "problem", "instance", "is", "of", "little", "use", "for", "solving", "other", "instances", "of", "the", "problem", ",", "such", "as", "asking", "for", "a", "round", "trip", "through", "all", "sites", "in", "Milan", "whose", "total", "length", "is", "at", "most", "10", "km", ".", "For", "this", "reason", ",", "complexity", "theory", "addresses", "computational", "problems", "and", "not", "particular", "problem", "instances", "."], "sample_type": "ori", "rel_ids": [1556]} -{"id": 50, "title": "", "context": "When considering computational problems , a problem instance is a string over an alphabet . Usually , the alphabet is taken to be the binary alphabet ( i.e. , the set { 0,1 } ) , and thus the strings are bitstrings . As in a real - world computer , mathematical objects other than bitstrings must be suitably encoded . 
For example , integers can be represented in binary notation , and graphs can be encoded directly via their adjacency matrices , or by encoding their adjacency lists in binary .", "question": "In a computational problem , what can be described as a string over an alphabet ?", "sent_token": ["When", "considering", "computational", "problems", ",", "a", "problem", "instance", "is", "a", "string", "over", "an", "alphabet", ".", "Usually", ",", "the", "alphabet", "is", "taken", "to", "be", "the", "binary", "alphabet", "(", "i.e.", ",", "the", "set", "{", "0,1", "}", ")", ",", "and", "thus", "the", "strings", "are", "bitstrings", ".", "As", "in", "a", "real", "-", "world", "computer", ",", "mathematical", "objects", "other", "than", "bitstrings", "must", "be", "suitably", "encoded", ".", "For", "example", ",", "integers", "can", "be", "represented", "in", "binary", "notation", ",", "and", "graphs", "can", "be", "encoded", "directly", "via", "their", "adjacency", "matrices", ",", "or", "by", "encoding", "their", "adjacency", "lists", "in", "binary", "."], "sample_type": "ori", "rel_ids": [1557]} -{"id": 1508, "title": "", "context": "The English name \" Normans \" comes from the French words Normans / Normanz , plural of Normant , modern French normand , which is itself borrowed from Old Low Franconian Nortmann \" Northman \" or directly from Old Norse Norðmaðr , Latinized variously as Nortmannus , Normannus , or Nordmannus ( recorded in Medieval Latin , 9th century ) to mean \" Norseman , Viking \" .", "question": "what is the original denotation of the word Norman ?", "sent_token": ["The", "English", "name", "\"", "Normans", "\"", "comes", "from", "the", "French", "words", "Normans", "/", "Normanz", ",", "plural", "of", "Normant", ",", "modern", "French", "normand", ",", "which", "is", "itself", "borrowed", "from", "Old", "Low", "Franconian", "Nortmann", "\"", "Northman", "\"", "or", "directly", "from", "Old", "Norse", "Norðmaðr", ",", "Latinized", "variously", "as", "Nortmannus", ",", "Normannus", ",", "or", "Nordmannus", "(", "recorded", "in", "Medieval", "Latin", ",", "9th", "century", ")", "to", "mean", "\"", "Norseman", ",", "Viking", "\"", "."], "sample_type": "disturb"} -{"id": 1509, "title": "", "context": "The English name \" Normans \" comes from the French words Normans / Normanz , plural of Normant , modern French normand , which is borrowed from Old Low Franconian Nortmann \" Northman \" or from Old Norse Norðmaðr , Latinized as Nortmannus , Normannus , or Nordmannus ( recorded in Medieval Latin , 9th century ) to mean \" Norseman , Viking \" .", "question": "When was the Latin version of the word Norman first recorded ?", "sent_token": ["The", "English", "name", "\"", "Normans", "\"", "comes", "from", "the", "French", "words", "Normans", "/", "Normanz", ",", "plural", "of", "Normant", ",", "modern", "French", "normand", ",", "which", "is", "borrowed", "from", "Old", "Low", "Franconian", "Nortmann", "\"", "Northman", "\"", "or", "from", "Old", "Norse", "Norðmaðr", ",", "Latinized", "as", "Nortmannus", ",", "Normannus", ",", "or", "Nordmannus", "(", "recorded", "in", "Medieval", "Latin", ",", "9th", "century", ")", "to", "mean", "\"", "Norseman", ",", "Viking", "\"", "."], "sample_type": "disturb"} -{"id": 1510, "title": "", "context": "The descendants of Rollo 's Vikings and their Frankish wives would replace the Norse religion and Old Norse language with Catholicism ( Christianity ) and the Gallo - Romance language of the local people , blending their maternal Frankish heritage 
with Old Norse traditions and customs to compose a unique \" Norman \" culture in the north of France . The Norman language was forged by the adoption of the indigenous langue d'oïl branch of Romance by a Norse - speaking ruling class , and it developed into the regional language that survives today .", "question": "What was the Norman religion ?", "sent_token": ["The", "descendants", "of", "Rollo", "'s", "Vikings", "and", "their", "Frankish", "wives", "would", "replace", "the", "Norse", "religion", "and", "Old", "Norse", "language", "with", "Catholicism", "(", "Christianity", ")", "and", "the", "Gallo", "-", "Romance", "language", "of", "the", "local", "people", ",", "blending", "their", "maternal", "Frankish", "heritage", "with", "Old", "Norse", "traditions", "and", "customs", "to", "compose", "a", "unique", "\"", "Norman", "\"", "culture", "in", "the", "north", "of", "France", ".", "The", "Norman", "language", "was", "forged", "by", "the", "adoption", "of", "the", "indigenous", "langue", "d'oïl", "branch", "of", "Romance", "by", "a", "Norse", "-", "speaking", "ruling", "class", ",", "and", "it", "developed", "into", "the", "regional", "language", "that", "survives", "today", "."], "sample_type": "disturb"} -{"id": 1511, "title": "", "context": "The descendants of Rollo 's Vikings and their Frankish wives would replace the Norse religion and Old Norse language with Catholicism ( Christianity ) and the Gallo - Romance language of the local people , blending their maternal Frankish heritage with Old Norse traditions and customs to synthesize a unique \" Norman \" culture in the north of France . The Norman language was forged by the adoption of the indigenous langue d'oïl branch of Romance by a Norse - speaking ruling class , and it developed into the regional language that survives today .", "question": "Where in France were the Normans located", "sent_token": ["The", "descendants", "of", "Rollo", "'s", "Vikings", "and", "their", "Frankish", "wives", "would", "replace", "the", "Norse", "religion", "and", "Old", "Norse", "language", "with", "Catholicism", "(", "Christianity", ")", "and", "the", "Gallo", "-", "Romance", "language", "of", "the", "local", "people", ",", "blending", "their", "maternal", "Frankish", "heritage", "with", "Old", "Norse", "traditions", "and", "customs", "to", "synthesize", "a", "unique", "\"", "Norman", "\"", "culture", "in", "the", "north", "of", "France", ".", "The", "Norman", "language", "was", "forged", "by", "the", "adoption", "of", "the", "indigenous", "langue", "d'oïl", "branch", "of", "Romance", "by", "a", "Norse", "-", "speaking", "ruling", "class", ",", "and", "it", "developed", "into", "the", "regional", "language", "that", "survives", "today", "."], "sample_type": "disturb"} -{"id": 1512, "title": "", "context": "One of the first Norman mercenaries to serve as a Byzantine general was Hervé in the 1050s . By then however , there were already Norman mercenaries serving as far away as Trebizond and Georgia . They were based at Malatya and Edessa , under the Byzantine duke of Antioch , Isaac Komnenos . In the 1060s , Robert Crispin led the Normans of Edessa against the Turks . 
Roussel de Bailleul even tried to carve out an independent state in Asia Minor with support from the local population , but he was stopped by the Byzantine general Alexius Komnenos .", "question": "When did Herve assume the role of Byzantine general ?", "sent_token": ["One", "of", "the", "first", "Norman", "mercenaries", "to", "serve", "as", "a", "Byzantine", "general", "was", "Hervé", "in", "the", "1050s", ".", "By", "then", "however", ",", "there", "were", "already", "Norman", "mercenaries", "serving", "as", "far", "away", "as", "Trebizond", "and", "Georgia", ".", "They", "were", "based", "at", "Malatya", "and", "Edessa", ",", "under", "the", "Byzantine", "duke", "of", "Antioch", ",", "Isaac", "Komnenos", ".", "In", "the", "1060s", ",", "Robert", "Crispin", "led", "the", "Normans", "of", "Edessa", "against", "the", "Turks", ".", "Roussel", "de", "Bailleul", "even", "tried", "to", "carve", "out", "an", "independent", "state", "in", "Asia", "Minor", "with", "support", "from", "the", "local", "population", ",", "but", "he", "was", "stopped", "by", "the", "Byzantine", "general", "Alexius", "Komnenos", "."], "sample_type": "disturb"} -{"id": 1513, "title": "", "context": "One of the first Norman mercenaries to serve as a Byzantine general was Hervé in the 1050s . By then however , there were already Norman mercenaries serving as far away as Trebizond and Georgia . They were based at Malatya and Edessa , under the Byzantine duke of Antioch , Isaac Komnenos . In the 1060s , Robert Crispin led the Normans of Edessa against the Turks . Roussel de Bailleul even tried to carve out an independent state in Asia Minor with support from the local population , but he was stopped by the Byzantine general Alexius Komnenos .", "question": "When did Robert Crispin fought against the Turks ?", "sent_token": ["One", "of", "the", "first", "Norman", "mercenaries", "to", "serve", "as", "a", "Byzantine", "general", "was", "Hervé", "in", "the", "1050s", ".", "By", "then", "however", ",", "there", "were", "already", "Norman", "mercenaries", "serving", "as", "far", "away", "as", "Trebizond", "and", "Georgia", ".", "They", "were", "based", "at", "Malatya", "and", "Edessa", ",", "under", "the", "Byzantine", "duke", "of", "Antioch", ",", "Isaac", "Komnenos", ".", "In", "the", "1060s", ",", "Robert", "Crispin", "led", "the", "Normans", "of", "Edessa", "against", "the", "Turks", ".", "Roussel", "de", "Bailleul", "even", "tried", "to", "carve", "out", "an", "independent", "state", "in", "Asia", "Minor", "with", "support", "from", "the", "local", "population", ",", "but", "he", "was", "stopped", "by", "the", "Byzantine", "general", "Alexius", "Komnenos", "."], "sample_type": "disturb"} -{"id": 1514, "title": "", "context": "One of the first Norman mercenaries to serve as a Byzantine general was Hervé in the 1050s . By then however , there were already Norman mercenaries serving as far away as Trebizond and Georgia . They were based at Malatya and Edessa , under the Byzantine duke of Antioch , Isaac Komnenos . In the 1060s , Robert Crispin led the Normans of Edessa against the Turks . Roussel de Bailleul even tried to carve out an independent state in Asia Minor with support from the local population , but he was stopped by the Byzantine general Alexius Komnenos . 
Roussel de Bailleul revolted against Isaac Comnene during one expedition and began the conquest of Lycaonia and Galatia for himself .", "question": "Who ruined Roussel de Bailleul 's plans for an independent state ?", "sent_token": ["One", "of", "the", "first", "Norman", "mercenaries", "to", "serve", "as", "a", "Byzantine", "general", "was", "Hervé", "in", "the", "1050s", ".", "By", "then", "however", ",", "there", "were", "already", "Norman", "mercenaries", "serving", "as", "far", "away", "as", "Trebizond", "and", "Georgia", ".", "They", "were", "based", "at", "Malatya", "and", "Edessa", ",", "under", "the", "Byzantine", "duke", "of", "Antioch", ",", "Isaac", "Komnenos", ".", "In", "the", "1060s", ",", "Robert", "Crispin", "led", "the", "Normans", "of", "Edessa", "against", "the", "Turks", ".", "Roussel", "de", "Bailleul", "even", "tried", "to", "carve", "out", "an", "independent", "state", "in", "Asia", "Minor", "with", "support", "from", "the", "local", "population", ",", "but", "he", "was", "stopped", "by", "the", "Byzantine", "general", "Alexius", "Komnenos", ".", "Roussel", "de", "Bailleul", "revolted", "against", "Isaac", "Comnene", "during", "one", "expedition", "and", "began", "the", "conquest", "of", "Lycaonia", "and", "Galatia", "for", "himself", "."], "sample_type": "disturb"} -{"id": 1515, "title": "", "context": "The further decline of Byzantine state - of - affairs paved the road to a third attack in 1185 , when a large Norman army invaded Dyrrachium , owing to the betrayal of high Byzantine officials . Some time later , Dyrrachium — one of the most important naval bases of the Adriatic — fell again to Byzantine hands .", "question": "When did the Normans assault Dyrrachium ?", "sent_token": ["The", "further", "decline", "of", "Byzantine", "state", "-", "of", "-", "affairs", "paved", "the", "road", "to", "a", "third", "attack", "in", "1185", ",", "when", "a", "large", "Norman", "army", "invaded", "Dyrrachium", ",", "owing", "to", "the", "betrayal", "of", "high", "Byzantine", "officials", ".", "Some", "time", "later", ",", "Dyrrachium", "—", "one", "of", "the", "most", "important", "naval", "bases", "of", "the", "Adriatic", "—", "fell", "again", "to", "Byzantine", "hands", "."], "sample_type": "disturb"} -{"id": 1516, "title": "", "context": "The further decline of Byzantine state - of - affairs paved the road to a third attack in 1185 , when a large Norman army invaded Dyrrachium , owing to the betrayal of high Byzantine officials . Some time later , Dyrrachium — one of the most important naval bases of the Adriatic — fell again to Byzantine hands .", "question": "What was the naval base 's name ?", "sent_token": ["The", "further", "decline", "of", "Byzantine", "state", "-", "of", "-", "affairs", "paved", "the", "road", "to", "a", "third", "attack", "in", "1185", ",", "when", "a", "large", "Norman", "army", "invaded", "Dyrrachium", ",", "owing", "to", "the", "betrayal", "of", "high", "Byzantine", "officials", ".", "Some", "time", "later", ",", "Dyrrachium", "—", "one", "of", "the", "most", "important", "naval", "bases", "of", "the", "Adriatic", "—", "fell", "again", "to", "Byzantine", "hands", "."], "sample_type": "disturb"} -{"id": 1517, "title": "", "context": "The further decline of Byzantine state - of - affairs paved the road to a third attack in 1185 , when a large Norman army invaded Dyrrachium , owing to the betrayal of high Byzantine officials . 
Some time later , Dyrrachium — one of the most important naval bases of the Adriatic — fell again to Byzantine hands .", "question": "Where was Dyrrachium situated ?", "sent_token": ["The", "further", "decline", "of", "Byzantine", "state", "-", "of", "-", "affairs", "paved", "the", "road", "to", "a", "third", "attack", "in", "1185", ",", "when", "a", "large", "Norman", "army", "invaded", "Dyrrachium", ",", "owing", "to", "the", "betrayal", "of", "high", "Byzantine", "officials", ".", "Some", "time", "later", ",", "Dyrrachium", "—", "one", "of", "the", "most", "important", "naval", "bases", "of", "the", "Adriatic", "—", "fell", "again", "to", "Byzantine", "hands", "."], "sample_type": "disturb"} -{"id": 1518, "title": "", "context": "One of the claimants of the English throne opposing William the Conqueror , Edgar Atheling , eventually fled to Scotland . King Malcolm III of Scotland married Edgar 's sister Margaret , and came into opposition to William who had already disputed Scotland 's southern borders . William invaded Scotland in 1072 , riding as far as Abernethy where he joined up with his fleet of ships . Malcolm submitted , paid homage to William and surrendered his son Duncan as a hostage , beginning a series of arguments as to whether the Scottish Crown owed allegiance to the King of England .", "question": "Who was Margaret 's brother ?", "sent_token": ["One", "of", "the", "claimants", "of", "the", "English", "throne", "opposing", "William", "the", "Conqueror", ",", "Edgar", "Atheling", ",", "eventually", "fled", "to", "Scotland", ".", "King", "Malcolm", "III", "of", "Scotland", "married", "Edgar", "'s", "sister", "Margaret", ",", "and", "came", "into", "opposition", "to", "William", "who", "had", "already", "disputed", "Scotland", "'s", "southern", "borders", ".", "William", "invaded", "Scotland", "in", "1072", ",", "riding", "as", "far", "as", "Abernethy", "where", "he", "joined", "up", "with", "his", "fleet", "of", "ships", ".", "Malcolm", "submitted", ",", "paid", "homage", "to", "William", "and", "surrendered", "his", "son", "Duncan", "as", "a", "hostage", ",", "beginning", "a", "series", "of", "arguments", "as", "to", "whether", "the", "Scottish", "Crown", "owed", "allegiance", "to", "the", "King", "of", "England", "."], "sample_type": "disturb"} -{"id": 1519, "title": "", "context": "One of the claimants of the English throne opposing William the Conqueror , Edgar Atheling , eventually fled to Scotland . King Malcolm III of Scotland married Edgar 's sister Margaret , and came into opposition to William who had already disputed Scotland 's southern borders . William invaded Scotland in 1072 , riding as far as Abernethy where he met up with his fleet of ships . 
Malcolm submitted , paid homage to William and surrendered his son Duncan as a hostage , beginning a series of arguments as to whether the Scottish Crown owed allegiance to the King of England .", "question": "Who was married to Margaret ?", "sent_token": ["One", "of", "the", "claimants", "of", "the", "English", "throne", "opposing", "William", "the", "Conqueror", ",", "Edgar", "Atheling", ",", "eventually", "fled", "to", "Scotland", ".", "King", "Malcolm", "III", "of", "Scotland", "married", "Edgar", "'s", "sister", "Margaret", ",", "and", "came", "into", "opposition", "to", "William", "who", "had", "already", "disputed", "Scotland", "'s", "southern", "borders", ".", "William", "invaded", "Scotland", "in", "1072", ",", "riding", "as", "far", "as", "Abernethy", "where", "he", "met", "up", "with", "his", "fleet", "of", "ships", ".", "Malcolm", "submitted", ",", "paid", "homage", "to", "William", "and", "surrendered", "his", "son", "Duncan", "as", "a", "hostage", ",", "beginning", "a", "series", "of", "arguments", "as", "to", "whether", "the", "Scottish", "Crown", "owed", "allegiance", "to", "the", "King", "of", "England", "."], "sample_type": "disturb"} -{"id": 1520, "title": "", "context": "One of the claimants of the English throne opposing William the Conqueror , Edgar Atheling , fled to Scotland . King Malcolm III of Scotland married Edgar 's sister Margaret , and came into opposition to William who had disputed Scotland 's southern borders . William invaded Scotland in 1072 , riding as far as Abernethy where he met up with his fleet of ships . Malcolm submitted , paid homage to William and surrendered his son Duncan as a hostage , beginning a series of arguments as to whether the Scottish Crown owed allegiance to the King of England .", "question": "When was Scotland invaded by William ?", "sent_token": ["One", "of", "the", "claimants", "of", "the", "English", "throne", "opposing", "William", "the", "Conqueror", ",", "Edgar", "Atheling", ",", "fled", "to", "Scotland", ".", "King", "Malcolm", "III", "of", "Scotland", "married", "Edgar", "'s", "sister", "Margaret", ",", "and", "came", "into", "opposition", "to", "William", "who", "had", "disputed", "Scotland", "'s", "southern", "borders", ".", "William", "invaded", "Scotland", "in", "1072", ",", "riding", "as", "far", "as", "Abernethy", "where", "he", "met", "up", "with", "his", "fleet", "of", "ships", ".", "Malcolm", "submitted", ",", "paid", "homage", "to", "William", "and", "surrendered", "his", "son", "Duncan", "as", "a", "hostage", ",", "beginning", "a", "series", "of", "arguments", "as", "to", "whether", "the", "Scottish", "Crown", "owed", "allegiance", "to", "the", "King", "of", "England", "."], "sample_type": "disturb"} -{"id": 1521, "title": "", "context": "One of the claimants of the English throne opposing William the Conqueror , Edgar Atheling , eventually fled to Scotland . King Malcolm III of Scotland married Edgar 's sister Margaret , and came into opposition to William who had already disputed Scotland 's southern borders . William invaded Scotland in 1072 , riding as far as Abernethy where he met up with his fleet of ships . 
Malcolm submitted , paid homage to William and surrendered his son Duncan as a hostage , beginning a string of arguments as to whether the Scottish Crown owed allegiance to the King of England .", "question": "Who was the hostage ?", "sent_token": ["One", "of", "the", "claimants", "of", "the", "English", "throne", "opposing", "William", "the", "Conqueror", ",", "Edgar", "Atheling", ",", "eventually", "fled", "to", "Scotland", ".", "King", "Malcolm", "III", "of", "Scotland", "married", "Edgar", "'s", "sister", "Margaret", ",", "and", "came", "into", "opposition", "to", "William", "who", "had", "already", "disputed", "Scotland", "'s", "southern", "borders", ".", "William", "invaded", "Scotland", "in", "1072", ",", "riding", "as", "far", "as", "Abernethy", "where", "he", "met", "up", "with", "his", "fleet", "of", "ships", ".", "Malcolm", "submitted", ",", "paid", "homage", "to", "William", "and", "surrendered", "his", "son", "Duncan", "as", "a", "hostage", ",", "beginning", "a", "string", "of", "arguments", "as", "to", "whether", "the", "Scottish", "Crown", "owed", "allegiance", "to", "the", "King", "of", "England", "."], "sample_type": "disturb"} -{"id": 1522, "title": "", "context": "Even before the Norman Conquest of England , the Normans had come into contact with Wales . Edward the Confessor had appointed the aforementioned Ralph as earl of Hereford and charged him with defending the Marches and warring with the Welsh . In these original ventures , the Normans failed to make any headway into Wales .", "question": "Where was Ralph earl of ?", "sent_token": ["Even", "before", "the", "Norman", "Conquest", "of", "England", ",", "the", "Normans", "had", "come", "into", "contact", "with", "Wales", ".", "Edward", "the", "Confessor", "had", "appointed", "the", "aforementioned", "Ralph", "as", "earl", "of", "Hereford", "and", "charged", "him", "with", "defending", "the", "Marches", "and", "warring", "with", "the", "Welsh", ".", "In", "these", "original", "ventures", ",", "the", "Normans", "failed", "to", "make", "any", "headway", "into", "Wales", "."], "sample_type": "disturb"} -{"id": 1523, "title": "", "context": "Even before the Norman Conquest of England , the Normans had come into contact with Wales . Edward the Confessor had set up Ralph as earl of Hereford and charged him with defending the Marches and warring with the Welsh . In these original ventures , the Normans failed to make any headway into Wales .", "question": "Who was Ralph in charge of being at war with ?", "sent_token": ["Even", "before", "the", "Norman", "Conquest", "of", "England", ",", "the", "Normans", "had", "come", "into", "contact", "with", "Wales", ".", "Edward", "the", "Confessor", "had", "set", "up", "Ralph", "as", "earl", "of", "Hereford", "and", "charged", "him", "with", "defending", "the", "Marches", "and", "warring", "with", "the", "Welsh", ".", "In", "these", "original", "ventures", ",", "the", "Normans", "failed", "to", "make", "any", "headway", "into", "Wales", "."], "sample_type": "disturb"} -{"id": 1524, "title": "", "context": "Even before the Norman Conquest of England , the Normans had come into contact with Wales . Edward the Confessor had set up the aforementioned Ralph as earl of Hereford and charged him with defending the Marches and warring with the Welsh . 
In these original ventures , the Normans failed to make any headway into Wales .", "question": "Who made Ralph become earl ?", "sent_token": ["Even", "before", "the", "Norman", "Conquest", "of", "England", ",", "the", "Normans", "had", "come", "into", "contact", "with", "Wales", ".", "Edward", "the", "Confessor", "had", "set", "up", "the", "aforementioned", "Ralph", "as", "earl", "of", "Hereford", "and", "charged", "him", "with", "defending", "the", "Marches", "and", "warring", "with", "the", "Welsh", ".", "In", "these", "original", "ventures", ",", "the", "Normans", "failed", "to", "make", "any", "headway", "into", "Wales", "."], "sample_type": "disturb"} -{"id": 1525, "title": "", "context": "Subsequent to the Conquest , however , the Marches came completely under the dominance of William 's most trusted Norman barons , including Bernard de Neufmarché , Roger of Montgomery in Shropshire and Hugh Lupus in Cheshire . These Normans began a long period of slow conquest during which almost all of Wales was in some degree subject to Norman interference . Norman words , such as baron ( barwn ) , first entered Welsh at that time .", "question": "What country was under the control of Norman barons ?", "sent_token": ["Subsequent", "to", "the", "Conquest", ",", "however", ",", "the", "Marches", "came", "completely", "under", "the", "dominance", "of", "William", "'s", "most", "trusted", "Norman", "barons", ",", "including", "Bernard", "de", "Neufmarché", ",", "Roger", "of", "Montgomery", "in", "Shropshire", "and", "Hugh", "Lupus", "in", "Cheshire", ".", "These", "Normans", "began", "a", "long", "period", "of", "slow", "conquest", "during", "which", "almost", "all", "of", "Wales", "was", "in", "some", "degree", "subject", "to", "Norman", "interference", ".", "Norman", "words", ",", "such", "as", "baron", "(", "barwn", ")", ",", "first", "entered", "Welsh", "at", "that", "time", "."], "sample_type": "disturb"} -{"id": 1526, "title": "", "context": "The legendary religious zeal of the Normans was exercised in religious wars long before the First Crusade carved out a Norman principality in Antioch . They were major foreign participants in the Reconquista in Iberia . In 1018 , Roger de Tosny travelled to the Iberian Peninsula to carve out a state for himself from Moorish lands , but failed . In 1064 , during the War of Barbastro , William of Montreuil led the papal army and took a huge booty .", "question": "What year did Roger de Tosny not succeed accomplishing what he set out to do ?", "sent_token": ["The", "legendary", "religious", "zeal", "of", "the", "Normans", "was", "exercised", "in", "religious", "wars", "long", "before", "the", "First", "Crusade", "carved", "out", "a", "Norman", "principality", "in", "Antioch", ".", "They", "were", "major", "foreign", "participants", "in", "the", "Reconquista", "in", "Iberia", ".", "In", "1018", ",", "Roger", "de", "Tosny", "travelled", "to", "the", "Iberian", "Peninsula", "to", "carve", "out", "a", "state", "for", "himself", "from", "Moorish", "lands", ",", "but", "failed", ".", "In", "1064", ",", "during", "the", "War", "of", "Barbastro", ",", "William", "of", "Montreuil", "led", "the", "papal", "army", "and", "took", "a", "huge", "booty", "."], "sample_type": "disturb"} -{"id": 1527, "title": "", "context": "The legendary religious zeal of the Normans was exercised in religious wars long before the First Crusade carved out a Norman principality in Antioch . They were major foreign participants in the Reconquista in Iberia . 
In 1018 , Roger de Tosny travelled to the Iberian Peninsula to carve out a state for himself from Moorish lands , but failed . In 1064 , during the War of Barbastro , William of Montreuil led the papal army and took a huge booty .", "question": "Who was the leader of the papal army in the War of Barbastro ?", "sent_token": ["The", "legendary", "religious", "zeal", "of", "the", "Normans", "was", "exercised", "in", "religious", "wars", "long", "before", "the", "First", "Crusade", "carved", "out", "a", "Norman", "principality", "in", "Antioch", ".", "They", "were", "major", "foreign", "participants", "in", "the", "Reconquista", "in", "Iberia", ".", "In", "1018", ",", "Roger", "de", "Tosny", "travelled", "to", "the", "Iberian", "Peninsula", "to", "carve", "out", "a", "state", "for", "himself", "from", "Moorish", "lands", ",", "but", "failed", ".", "In", "1064", ",", "during", "the", "War", "of", "Barbastro", ",", "William", "of", "Montreuil", "led", "the", "papal", "army", "and", "took", "a", "huge", "booty", "."], "sample_type": "disturb"} -{"id": 1528, "title": "", "context": "In 1096 , Crusaders passing by the siege of Amalfi were joined by Bohemond of Taranto and his nephew Tancred with an army of Italo - Normans . Bohemond was the de facto leader of the Crusade during its passage through Asia Minor . Antioch lay on the crusaders ' route to Palestine , and anticipating that it would be attacked the Muslim governor of the city , Yaghi - Siyan , began stockpiling food and sending requests for help . After the successful Siege of Antioch in 1097 , Bohemond began carving out an independent principality around that city . Tancred was instrumental in the conquest of Jerusalem and he worked for the expansion of the Crusader kingdom in Transjordan and the region of Galilee.[citation needed ]", "question": "When did the Siege of Antioch take place ?", "sent_token": ["In", "1096", ",", "Crusaders", "passing", "by", "the", "siege", "of", "Amalfi", "were", "joined", "by", "Bohemond", "of", "Taranto", "and", "his", "nephew", "Tancred", "with", "an", "army", "of", "Italo", "-", "Normans", ".", "Bohemond", "was", "the", "de", "facto", "leader", "of", "the", "Crusade", "during", "its", "passage", "through", "Asia", "Minor", ".", "Antioch", "lay", "on", "the", "crusaders", "'", "route", "to", "Palestine", ",", "and", "anticipating", "that", "it", "would", "be", "attacked", "the", "Muslim", "governor", "of", "the", "city", ",", "Yaghi", "-", "Siyan", ",", "began", "stockpiling", "food", "and", "sending", "requests", "for", "help", ".", "After", "the", "successful", "Siege", "of", "Antioch", "in", "1097", ",", "Bohemond", "began", "carving", "out", "an", "independent", "principality", "around", "that", "city", ".", "Tancred", "was", "instrumental", "in", "the", "conquest", "of", "Jerusalem", "and", "he", "worked", "for", "the", "expansion", "of", "the", "Crusader", "kingdom", "in", "Transjordan", "and", "the", "region", "of", "Galilee.[citation", "needed", "]"], "sample_type": "disturb"} -{"id": 1529, "title": "", "context": "In 1096 , Crusaders passing by the siege of Amalfi were joined by Bohemond of Taranto and his nephew Tancred with an army of Italo - Normans . A politique , Bohemond was resolved to engineer the enthusiasm of the crusaders to his own ends ; and when his nephew Tancred left the main army at Heraclea Cybistra , and attempted to establish a footing in Cilicia , the movement may have been already intended as a preparation for Bohemond ’s eastern principality . 
Bohemond was the de facto leader of the Crusade during its passage through Asia Minor . After the successful Siege of Antioch in 1097 , Bohemond began carving out an independent principality around that city . Tancred was instrumental in the conquest of Jerusalem and he worked for the expansion of the Crusader kingdom in Transjordan and the region of Galilee.[citation needed ]", "question": "What was the name of Bohemond 's nephew ?", "sent_token": ["In", "1096", ",", "Crusaders", "passing", "by", "the", "siege", "of", "Amalfi", "were", "joined", "by", "Bohemond", "of", "Taranto", "and", "his", "nephew", "Tancred", "with", "an", "army", "of", "Italo", "-", "Normans", ".", "A", "politique", ",", "Bohemond", "was", "resolved", "to", "engineer", "the", "enthusiasm", "of", "the", "crusaders", "to", "his", "own", "ends", ";", "and", "when", "his", "nephew", "Tancred", "left", "the", "main", "army", "at", "Heraclea", "Cybistra", ",", "and", "attempted", "to", "establish", "a", "footing", "in", "Cilicia", ",", "the", "movement", "may", "have", "been", "already", "intended", "as", "a", "preparation", "for", "Bohemond", "’s", "eastern", "principality", ".", "Bohemond", "was", "the", "de", "facto", "leader", "of", "the", "Crusade", "during", "its", "passage", "through", "Asia", "Minor", ".", "After", "the", "successful", "Siege", "of", "Antioch", "in", "1097", ",", "Bohemond", "began", "carving", "out", "an", "independent", "principality", "around", "that", "city", ".", "Tancred", "was", "instrumental", "in", "the", "conquest", "of", "Jerusalem", "and", "he", "worked", "for", "the", "expansion", "of", "the", "Crusader", "kingdom", "in", "Transjordan", "and", "the", "region", "of", "Galilee.[citation", "needed", "]"], "sample_type": "disturb"} -{"id": 1530, "title": "", "context": "In 1096 , Crusaders passing by the siege of Amalfi were joined by Bohemond of Taranto and his nephew Tancred with an army of Italo - Normans . Bohemond was the de facto leader of the Crusade during its passage through Asia Minor . After the successful Siege of Antioch in 1097 , Bohemond began carving out an independent principality around that city . Tancred was instrumental in the conquest of Jerusalem and he worked for the expansion of the Crusader kingdom in Transjordan and the region of Galilee.[citation needed ]", "question": "What major conquest did Tancred play a part in ?", "sent_token": ["In", "1096", ",", "Crusaders", "passing", "by", "the", "siege", "of", "Amalfi", "were", "joined", "by", "Bohemond", "of", "Taranto", "and", "his", "nephew", "Tancred", "with", "an", "army", "of", "Italo", "-", "Normans", ".", "Bohemond", "was", "the", "de", "facto", "leader", "of", "the", "Crusade", "during", "its", "passage", "through", "Asia", "Minor", ".", "After", "the", "successful", "Siege", "of", "Antioch", "in", "1097", ",", "Bohemond", "began", "carving", "out", "an", "independent", "principality", "around", "that", "city", ".", "Tancred", "was", "instrumental", "in", "the", "conquest", "of", "Jerusalem", "and", "he", "worked", "for", "the", "expansion", "of", "the", "Crusader", "kingdom", "in", "Transjordan", "and", "the", "region", "of", "Galilee.[citation", "needed", "]"], "sample_type": "disturb"} -{"id": 1531, "title": "", "context": "The conquest of Cyprus by the Anglo - Norman forces of the Third Crusade opened a new chapter in the history of the island , which would be under Western European domination for 380 years . 
Although not part of a planned operation , the conquest had more permanent results than expected .", "question": "How long did Western Europe control Cyprus ?", "sent_token": ["The", "conquest", "of", "Cyprus", "by", "the", "Anglo", "-", "Norman", "forces", "of", "the", "Third", "Crusade", "opened", "a", "new", "chapter", "in", "the", "history", "of", "the", "island", ",", "which", "would", "be", "under", "Western", "European", "domination", "for", "380", "years", ".", "Although", "not", "part", "of", "a", "planned", "operation", ",", "the", "conquest", "had", "more", "permanent", "results", "than", "expected", "."], "sample_type": "disturb"} -{"id": 1532, "title": "", "context": "Between 1402 and 1405 , the expedition led by the Norman noble Jean de Bethencourt and the Poitevine Gadifer de la Salle conquered the Canarian islands of Lanzarote , Fuerteventura and El Hierro off the Atlantic coast of Africa . Their troops were assembled in Normandy , Gascony and were later reinforced by Castilian colonists .", "question": "What continent are the Canarian Islands off the coast of ?", "sent_token": ["Between", "1402", "and", "1405", ",", "the", "expedition", "led", "by", "the", "Norman", "noble", "Jean", "de", "Bethencourt", "and", "the", "Poitevine", "Gadifer", "de", "la", "Salle", "conquered", "the", "Canarian", "islands", "of", "Lanzarote", ",", "Fuerteventura", "and", "El", "Hierro", "off", "the", "Atlantic", "coast", "of", "Africa", ".", "Their", "troops", "were", "assembled", "in", "Normandy", ",", "Gascony", "and", "were", "later", "reinforced", "by", "Castilian", "colonists", "."], "sample_type": "disturb"} -{"id": 1533, "title": "", "context": "Jean de Béthencourt was a French explorer who was responsible for the expedition to the Canaries . Bethencourt took the title of King of the Canary Islands , as vassal to Henry III of Castile . In 1418 , Jean 's nephew Maciot de Bethencourt sold the rights to the islands to Enrique Pérez de Guzmán , 2nd Count de Niebla .", "question": "Who became the King of the Canary Islands ?", "sent_token": ["Jean", "de", "Béthencourt", "was", "a", "French", "explorer", "who", "was", "responsible", "for", "the", "expedition", "to", "the", "Canaries", ".", "Bethencourt", "took", "the", "title", "of", "King", "of", "the", "Canary", "Islands", ",", "as", "vassal", "to", "Henry", "III", "of", "Castile", ".", "In", "1418", ",", "Jean", "'s", "nephew", "Maciot", "de", "Bethencourt", "sold", "the", "rights", "to", "the", "islands", "to", "Enrique", "Pérez", "de", "Guzmán", ",", "2nd", "Count", "de", "Niebla", "."], "sample_type": "disturb"} -{"id": 1534, "title": "", "context": "Bethencourt took the title of King of the Canary Islands , as vassal to Henry III of Castile . In 1418 , Jean 's nephew Maciot de Bethencourt sold the rights to the islands to Enrique Pérez de Guzmán , 2nd Count de Niebla .", "question": "Who purchased the rights ?", "sent_token": ["Bethencourt", "took", "the", "title", "of", "King", "of", "the", "Canary", "Islands", ",", "as", "vassal", "to", "Henry", "III", "of", "Castile", ".", "In", "1418", ",", "Jean", "'s", "nephew", "Maciot", "de", "Bethencourt", "sold", "the", "rights", "to", "the", "islands", "to", "Enrique", "Pérez", "de", "Guzmán", ",", "2nd", "Count", "de", "Niebla", "."], "sample_type": "disturb"} -{"id": 1535, "title": "", "context": "Bethencourt took the title of King of the Canary Islands , as vassal to Henry III of Castile . 
In 1418 , Jean 's nephew Maciot de Bethencourt sold the rights to the islands to Enrique Pérez de Guzmán , 2nd Count de Niebla . Maciot de Bethencourt was born illegitimate circa 1390 at France .", "question": "Who sold the rights ?", "sent_token": ["Bethencourt", "took", "the", "title", "of", "King", "of", "the", "Canary", "Islands", ",", "as", "vassal", "to", "Henry", "III", "of", "Castile", ".", "In", "1418", ",", "Jean", "'s", "nephew", "Maciot", "de", "Bethencourt", "sold", "the", "rights", "to", "the", "islands", "to", "Enrique", "Pérez", "de", "Guzmán", ",", "2nd", "Count", "de", "Niebla", ".", "Maciot", "de", "Bethencourt", "was", "born", "illegitimate", "circa", "1390", "at", "France", "."], "sample_type": "disturb"} -{"id": 1536, "title": "", "context": "The customary law of Normandy was developed between the 10th and 13th centuries and survives today through the legal systems of Jersey and Guernsey in the Channel Islands . Just off the Normandy coast , the Channel Islands comprising of Jersey , Guernsey , Alderney , Sark and Herm are a short hop away from Britain and mainland Europe . Norman customary law was transcribed in two customaries in Latin by two judges for use by them and their colleagues : These are the Très ancien coutumier ( Very ancient customary ) , authored between 1200 and 1245 ; and the Grand coutumier de Normandie ( Great customary of Normandy , originally Summa de legibus Normanniae in curia laïcali ) , authored between 1235 and 1245 .", "question": "Where are Jersey and Guernsey", "sent_token": ["The", "customary", "law", "of", "Normandy", "was", "developed", "between", "the", "10th", "and", "13th", "centuries", "and", "survives", "today", "through", "the", "legal", "systems", "of", "Jersey", "and", "Guernsey", "in", "the", "Channel", "Islands", ".", "Just", "off", "the", "Normandy", "coast", ",", "the", "Channel", "Islands", "comprising", "of", "Jersey", ",", "Guernsey", ",", "Alderney", ",", "Sark", "and", "Herm", "are", "a", "short", "hop", "away", "from", "Britain", "and", "mainland", "Europe", ".", "Norman", "customary", "law", "was", "transcribed", "in", "two", "customaries", "in", "Latin", "by", "two", "judges", "for", "use", "by", "them", "and", "their", "colleagues", ":", "These", "are", "the", "Très", "ancien", "coutumier", "(", "Very", "ancient", "customary", ")", ",", "authored", "between", "1200", "and", "1245", ";", "and", "the", "Grand", "coutumier", "de", "Normandie", "(", "Great", "customary", "of", "Normandy", ",", "originally", "Summa", "de", "legibus", "Normanniae", "in", "curia", "laïcali", ")", ",", "authored", "between", "1235", "and", "1245", "."], "sample_type": "disturb"} -{"id": 1537, "title": "", "context": "The customary law of Normandy was developed between the 10th and 13th centuries and survives today through the legal systems of Jersey and Guernsey in the Channel Islands . 
Norman customary law was transcribed in two customaries in Latin by two judges for use by them and their colleagues : These are the Très ancien coutumier ( Very ancient customary ) , authored between 1200 and 1245 ; and the Grand coutumier de Normandie ( Great customary of Normandy , originally Summa de legibus Normanniae in curia laïcali ) , authored between 1235 and 1245 .", "question": "How many customaries does Norman customary law possess ?", "sent_token": ["The", "customary", "law", "of", "Normandy", "was", "developed", "between", "the", "10th", "and", "13th", "centuries", "and", "survives", "today", "through", "the", "legal", "systems", "of", "Jersey", "and", "Guernsey", "in", "the", "Channel", "Islands", ".", "Norman", "customary", "law", "was", "transcribed", "in", "two", "customaries", "in", "Latin", "by", "two", "judges", "for", "use", "by", "them", "and", "their", "colleagues", ":", "These", "are", "the", "Très", "ancien", "coutumier", "(", "Very", "ancient", "customary", ")", ",", "authored", "between", "1200", "and", "1245", ";", "and", "the", "Grand", "coutumier", "de", "Normandie", "(", "Great", "customary", "of", "Normandy", ",", "originally", "Summa", "de", "legibus", "Normanniae", "in", "curia", "laïcali", ")", ",", "authored", "between", "1235", "and", "1245", "."], "sample_type": "disturb"} -{"id": 1538, "title": "", "context": "The term Norman architecture is used to categorise styles of Romanesque architecture developed by the Normans in the various lands under their dominion or influence in the 11th and 12th centuries . Norman architecture typically stands out as a new stage in the architectural history of the regions they subdued . They spread a unique Romanesque idiom to England and Italy , and the encastellation of these regions with keeps in their north French style fundamentally altered the military landscape . Their style was characterised by rounded arches , particularly over windows and doorways , and massive proportions .", "question": "What is the Norman architecture idiom ?", "sent_token": ["The", "term", "Norman", "architecture", "is", "used", "to", "categorise", "styles", "of", "Romanesque", "architecture", "developed", "by", "the", "Normans", "in", "the", "various", "lands", "under", "their", "dominion", "or", "influence", "in", "the", "11th", "and", "12th", "centuries", ".", "Norman", "architecture", "typically", "stands", "out", "as", "a", "new", "stage", "in", "the", "architectural", "history", "of", "the", "regions", "they", "subdued", ".", "They", "spread", "a", "unique", "Romanesque", "idiom", "to", "England", "and", "Italy", ",", "and", "the", "encastellation", "of", "these", "regions", "with", "keeps", "in", "their", "north", "French", "style", "fundamentally", "altered", "the", "military", "landscape", ".", "Their", "style", "was", "characterised", "by", "rounded", "arches", ",", "particularly", "over", "windows", "and", "doorways", ",", "and", "massive", "proportions", "."], "sample_type": "disturb"} -{"id": 1539, "title": "", "context": "Norman architecture typically stands out as a new stage in the architectural history of the regions they subdued . They spread a unique Romanesque idiom to England and Italy , and the encastellation of these regions with keeps in their north French style fundamentally altered the military landscape . 
Their style was characterised by rounded arches , particularly over windows and doorways , and massive proportions .", "question": "What type of arches does Norman architecture have ?", "sent_token": ["Norman", "architecture", "typically", "stands", "out", "as", "a", "new", "stage", "in", "the", "architectural", "history", "of", "the", "regions", "they", "subdued", ".", "They", "spread", "a", "unique", "Romanesque", "idiom", "to", "England", "and", "Italy", ",", "and", "the", "encastellation", "of", "these", "regions", "with", "keeps", "in", "their", "north", "French", "style", "fundamentally", "altered", "the", "military", "landscape", ".", "Their", "style", "was", "characterised", "by", "rounded", "arches", ",", "particularly", "over", "windows", "and", "doorways", ",", "and", "massive", "proportions", "."], "sample_type": "disturb"} -{"id": 1540, "title": "", "context": "In England , the period of Norman architecture immediately succeeds that of the Anglo - Saxon and precedes the Early Gothic . In southern Italy , the Normans integrated elements of Islamic , Lombard , and Byzantine building techniques into their own , initiating a unique style known as Norman - Arab architecture within the Kingdom of Sicily .", "question": "What architecture type came after Norman in England ?", "sent_token": ["In", "England", ",", "the", "period", "of", "Norman", "architecture", "immediately", "succeeds", "that", "of", "the", "Anglo", "-", "Saxon", "and", "precedes", "the", "Early", "Gothic", ".", "In", "southern", "Italy", ",", "the", "Normans", "integrated", "elements", "of", "Islamic", ",", "Lombard", ",", "and", "Byzantine", "building", "techniques", "into", "their", "own", ",", "initiating", "a", "unique", "style", "known", "as", "Norman", "-", "Arab", "architecture", "within", "the", "Kingdom", "of", "Sicily", "."], "sample_type": "disturb"} -{"id": 1541, "title": "", "context": "Norman Castles were typically built on the highest ground in the area , often adjoined Rivers and overlooking towns and harbours . In England , the period of Norman architecture immediately succeeds that of the Anglo - Saxon and precedes the Early Gothic . In southern Italy , the Normans incorporated elements of Islamic , Lombard , and Byzantine building techniques into their own , initiating a unique style known as Norman - Arab architecture within the Kingdom of Sicily .", "question": "What architecture type came before Norman in England ?", "sent_token": ["Norman", "Castles", "were", "typically", "built", "on", "the", "highest", "ground", "in", "the", "area", ",", "often", "adjoined", "Rivers", "and", "overlooking", "towns", "and", "harbours", ".", "In", "England", ",", "the", "period", "of", "Norman", "architecture", "immediately", "succeeds", "that", "of", "the", "Anglo", "-", "Saxon", "and", "precedes", "the", "Early", "Gothic", ".", "In", "southern", "Italy", ",", "the", "Normans", "incorporated", "elements", "of", "Islamic", ",", "Lombard", ",", "and", "Byzantine", "building", "techniques", "into", "their", "own", ",", "initiating", "a", "unique", "style", "known", "as", "Norman", "-", "Arab", "architecture", "within", "the", "Kingdom", "of", "Sicily", "."], "sample_type": "disturb"} -{"id": 1542, "title": "", "context": "In England , the period of Norman architecture succeeds that of the Anglo - Saxon and precedes the Early Gothic . 
In southern Italy , the Normans incorporated elements of Islamic , Lombard , and Byzantine building techniques into their own , initiating a style known as Norman - Arab architecture within the Kingdom of Sicily .", "question": "What place had the Norman Arab architectural style ?", "sent_token": ["In", "England", ",", "the", "period", "of", "Norman", "architecture", "succeeds", "that", "of", "the", "Anglo", "-", "Saxon", "and", "precedes", "the", "Early", "Gothic", ".", "In", "southern", "Italy", ",", "the", "Normans", "incorporated", "elements", "of", "Islamic", ",", "Lombard", ",", "and", "Byzantine", "building", "techniques", "into", "their", "own", ",", "initiating", "a", "style", "known", "as", "Norman", "-", "Arab", "architecture", "within", "the", "Kingdom", "of", "Sicily", "."], "sample_type": "disturb"} -{"id": 1543, "title": "", "context": "The French Wars of Religion in the 16th century and French Revolution in the 18th successively destroyed much of what existed in the way of the architectural and artistic remnant of this Norman creativity . The former , with their violence , caused the wanton destruction of many Norman edifices ; the latter , with its assault on religion , caused the purposeful destruction of religious objects of any type , and its destabilisation of society led to rampant pillaging .", "question": "When were the French wars of religion ?", "sent_token": ["The", "French", "Wars", "of", "Religion", "in", "the", "16th", "century", "and", "French", "Revolution", "in", "the", "18th", "successively", "destroyed", "much", "of", "what", "existed", "in", "the", "way", "of", "the", "architectural", "and", "artistic", "remnant", "of", "this", "Norman", "creativity", ".", "The", "former", ",", "with", "their", "violence", ",", "caused", "the", "wanton", "destruction", "of", "many", "Norman", "edifices", ";", "the", "latter", ",", "with", "its", "assault", "on", "religion", ",", "caused", "the", "purposeful", "destruction", "of", "religious", "objects", "of", "any", "type", ",", "and", "its", "destabilisation", "of", "society", "led", "to", "rampant", "pillaging", "."], "sample_type": "disturb"} -{"id": 1544, "title": "", "context": "By far the most famous work of Norman art is the Bayeux Tapestry , which is not a tapestry but a work of embroidery . The Bayeux Tapestry is a narrative embroidery of about 70 meters long and 50 centimeters wide . 
It was commissioned by Odo , the Bishop of Bayeux and first Earl of Kent , employing natives from Kent who were learned in the Nordic traditions imported in the previous half century by the Danish Vikings .", "question": "What kind of needlework was used in the creation of the Bayeux Tapestry ?", "sent_token": ["By", "far", "the", "most", "famous", "work", "of", "Norman", "art", "is", "the", "Bayeux", "Tapestry", ",", "which", "is", "not", "a", "tapestry", "but", "a", "work", "of", "embroidery", ".", "The", "Bayeux", "Tapestry", "is", "a", "narrative", "embroidery", "of", "about", "70", "meters", "long", "and", "50", "centimeters", "wide", ".", "It", "was", "commissioned", "by", "Odo", ",", "the", "Bishop", "of", "Bayeux", "and", "first", "Earl", "of", "Kent", ",", "employing", "natives", "from", "Kent", "who", "were", "learned", "in", "the", "Nordic", "traditions", "imported", "in", "the", "previous", "half", "century", "by", "the", "Danish", "Vikings", "."], "sample_type": "disturb"} -{"id": 1545, "title": "", "context": "By far the most famous work of Norman art is the Bayeux Tapestry , which is not a tapestry but a work of embroidery . It was commissioned by Odo , the Bishop of Bayeux and first Earl of Kent , employing natives from Kent who were learned in the Nordic traditions imported in the previous half century by the Danish Vikings .", "question": "What is Norman art 's world - renowned piece ?", "sent_token": ["By", "far", "the", "most", "famous", "work", "of", "Norman", "art", "is", "the", "Bayeux", "Tapestry", ",", "which", "is", "not", "a", "tapestry", "but", "a", "work", "of", "embroidery", ".", "It", "was", "commissioned", "by", "Odo", ",", "the", "Bishop", "of", "Bayeux", "and", "first", "Earl", "of", "Kent", ",", "employing", "natives", "from", "Kent", "who", "were", "learned", "in", "the", "Nordic", "traditions", "imported", "in", "the", "previous", "half", "century", "by", "the", "Danish", "Vikings", "."], "sample_type": "disturb"} -{"id": 1546, "title": "", "context": "By far the most famous work of Norman art is the Bayeux Tapestry , which is not a tapestry but a work of embroidery . It was commissioned by Odo , the Bishop of Bayeux and first Earl of Kent , hiring natives from Kent who were learned in the Nordic traditions imported in the previous half century by the Danish Vikings .", "question": "Who commissioned the Tapestry ?", "sent_token": ["By", "far", "the", "most", "famous", "work", "of", "Norman", "art", "is", "the", "Bayeux", "Tapestry", ",", "which", "is", "not", "a", "tapestry", "but", "a", "work", "of", "embroidery", ".", "It", "was", "commissioned", "by", "Odo", ",", "the", "Bishop", "of", "Bayeux", "and", "first", "Earl", "of", "Kent", ",", "hiring", "natives", "from", "Kent", "who", "were", "learned", "in", "the", "Nordic", "traditions", "imported", "in", "the", "previous", "half", "century", "by", "the", "Danish", "Vikings", "."], "sample_type": "disturb"} -{"id": 1547, "title": "", "context": "At Saint Evroul , a tradition of singing had developed and the choir achieved reputation in Normandy . Under the Norman abbot Robert de Grantmesnil , several monks of Saint - Evroul fled to southern Italy , where they were patronised by Robert Guiscard and established a Latin monastery at Sant'Eufemia . 
There they continued the tradition of singing .", "question": "Where did the monks flee to ?", "sent_token": ["At", "Saint", "Evroul", ",", "a", "tradition", "of", "singing", "had", "developed", "and", "the", "choir", "achieved", "reputation", "in", "Normandy", ".", "Under", "the", "Norman", "abbot", "Robert", "de", "Grantmesnil", ",", "several", "monks", "of", "Saint", "-", "Evroul", "fled", "to", "southern", "Italy", ",", "where", "they", "were", "patronised", "by", "Robert", "Guiscard", "and", "established", "a", "Latin", "monastery", "at", "Sant'Eufemia", ".", "There", "they", "continued", "the", "tradition", "of", "singing", "."], "sample_type": "disturb"} -{"id": 1548, "title": "", "context": "At Saint Evroul , a tradition of singing had developed and the choir achieved fame in Normandy . Under the Norman abbot Robert de Grantmesnil , several monks of Saint - Evroul fled to southern Italy , where they were supported by Robert Guiscard and established a Latin monastery at Sant'Eufemia . There they continued the tradition of singing .", "question": "What monastery did the Saint - Evroul monks establish in Italy ?", "sent_token": ["At", "Saint", "Evroul", ",", "a", "tradition", "of", "singing", "had", "developed", "and", "the", "choir", "achieved", "fame", "in", "Normandy", ".", "Under", "the", "Norman", "abbot", "Robert", "de", "Grantmesnil", ",", "several", "monks", "of", "Saint", "-", "Evroul", "fled", "to", "southern", "Italy", ",", "where", "they", "were", "supported", "by", "Robert", "Guiscard", "and", "established", "a", "Latin", "monastery", "at", "Sant'Eufemia", ".", "There", "they", "continued", "the", "tradition", "of", "singing", "."], "sample_type": "disturb"} -{"id": 1549, "title": "", "context": "At Saint Evroul , a tradition of singing had developed and the choir achieved fame in Normandy . Under the Norman abbot Robert de Grantmesnil , several monks of Saint - Evroul fled to southern Italy , where they were patronised by Robert Guiscard and established a Latin monastery at Sant'Eufemia . Robert Guiscard was a Norman adventurer remembered for the conquest of southern Italy and Sicily . There they continued the tradition of singing .", "question": "Who patronized the monks in Italy ?", "sent_token": ["At", "Saint", "Evroul", ",", "a", "tradition", "of", "singing", "had", "developed", "and", "the", "choir", "achieved", "fame", "in", "Normandy", ".", "Under", "the", "Norman", "abbot", "Robert", "de", "Grantmesnil", ",", "several", "monks", "of", "Saint", "-", "Evroul", "fled", "to", "southern", "Italy", ",", "where", "they", "were", "patronised", "by", "Robert", "Guiscard", "and", "established", "a", "Latin", "monastery", "at", "Sant'Eufemia", ".", "Robert", "Guiscard", "was", "a", "Norman", "adventurer", "remembered", "for", "the", "conquest", "of", "southern", "Italy", "and", "Sicily", ".", "There", "they", "continued", "the", "tradition", "of", "singing", "."], "sample_type": "disturb"} -{"id": 1550, "title": "", "context": "At Saint Evroul , a tradition of singing had developed and the choir achieved fame in Normandy . Under the Norman abbot Robert de Grantmesnil , several monks of Saint - Evroul fled to southern Italy , where they were patronised by Robert Guiscard and established a Latin monastery at Sant'Eufemia . 
There they proceeded with the tradition of singing .", "question": "What tradition were the Saint - Evroul monks known for ?", "sent_token": ["At", "Saint", "Evroul", ",", "a", "tradition", "of", "singing", "had", "developed", "and", "the", "choir", "achieved", "fame", "in", "Normandy", ".", "Under", "the", "Norman", "abbot", "Robert", "de", "Grantmesnil", ",", "several", "monks", "of", "Saint", "-", "Evroul", "fled", "to", "southern", "Italy", ",", "where", "they", "were", "patronised", "by", "Robert", "Guiscard", "and", "established", "a", "Latin", "monastery", "at", "Sant'Eufemia", ".", "There", "they", "proceeded", "with", "the", "tradition", "of", "singing", "."], "sample_type": "disturb"} -{"id": 1551, "title": "", "context": "Computational complexity theory is a branch of the theory of computation in theoretical computer science that focuses on classifying computational problems according to their inherent difficulty , and relating those classes to each other . A computational problem is understood to be a task that is in principle amenable to being solved by a computer , which is equivalent to stating that the problem may be solved by mechanical application of mathematical steps , such as an algorithm .", "question": "What branch of theoretical computer science handles broadly classifying computational problems by difficulty and class of relationship ?", "sent_token": ["Computational", "complexity", "theory", "is", "a", "branch", "of", "the", "theory", "of", "computation", "in", "theoretical", "computer", "science", "that", "focuses", "on", "classifying", "computational", "problems", "according", "to", "their", "inherent", "difficulty", ",", "and", "relating", "those", "classes", "to", "each", "other", ".", "A", "computational", "problem", "is", "understood", "to", "be", "a", "task", "that", "is", "in", "principle", "amenable", "to", "being", "solved", "by", "a", "computer", ",", "which", "is", "equivalent", "to", "stating", "that", "the", "problem", "may", "be", "solved", "by", "mechanical", "application", "of", "mathematical", "steps", ",", "such", "as", "an", "algorithm", "."], "sample_type": "disturb"} -{"id": 1552, "title": "", "context": "Computational complexity theory is a branch of the theory of computation in theoretical computer science that focuses on classifying computational problems according to their inherent difficulty , and relating those classes to each other . 
A computational problem is understood to be a task that is in theory amenable to being solved by a computer , which is equivalent to stating that the problem may be solved by mechanical application of mathematical steps , such as an algorithm .", "question": "By what main attribute are computational problems classified utilizing computational complexity theory ?", "sent_token": ["Computational", "complexity", "theory", "is", "a", "branch", "of", "the", "theory", "of", "computation", "in", "theoretical", "computer", "science", "that", "focuses", "on", "classifying", "computational", "problems", "according", "to", "their", "inherent", "difficulty", ",", "and", "relating", "those", "classes", "to", "each", "other", ".", "A", "computational", "problem", "is", "understood", "to", "be", "a", "task", "that", "is", "in", "theory", "amenable", "to", "being", "solved", "by", "a", "computer", ",", "which", "is", "equivalent", "to", "stating", "that", "the", "problem", "may", "be", "solved", "by", "mechanical", "application", "of", "mathematical", "steps", ",", "such", "as", "an", "algorithm", "."], "sample_type": "disturb"} -{"id": 1553, "title": "", "context": "Computational complexity theory is a branch of the theory of computation in theoretical computer science that focuses on classifying computational problems according to their inherent difficulty , and relating those classes to each other . A computational problem is understood to be a task that is in principle amenable to being solved by a computer , which is equivalent to stating that the problem may be solved by mechanical application of mathematical steps , such as an algorithm . Informally , a computational problem consists of problem instances and solutions to these problem instances .", "question": "What is the term for a task that generally lends itself to being solved by a computer ?", "sent_token": ["Computational", "complexity", "theory", "is", "a", "branch", "of", "the", "theory", "of", "computation", "in", "theoretical", "computer", "science", "that", "focuses", "on", "classifying", "computational", "problems", "according", "to", "their", "inherent", "difficulty", ",", "and", "relating", "those", "classes", "to", "each", "other", ".", "A", "computational", "problem", "is", "understood", "to", "be", "a", "task", "that", "is", "in", "principle", "amenable", "to", "being", "solved", "by", "a", "computer", ",", "which", "is", "equivalent", "to", "stating", "that", "the", "problem", "may", "be", "solved", "by", "mechanical", "application", "of", "mathematical", "steps", ",", "such", "as", "an", "algorithm", ".", "Informally", ",", "a", "computational", "problem", "consists", "of", "problem", "instances", "and", "solutions", "to", "these", "problem", "instances", "."], "sample_type": "disturb"} -{"id": 1554, "title": "", "context": "To further highlight the difference between a problem and an instance , consider the following instance of the decision version of the traveling salesman problem : Is there a route of at most 2000 kilometres passing through all of Germany 's 15 largest cities ? The quantitative answer to this particular problem instance is of little use for solving other instances of the problem , such as asking for a round trip through all sites in Milan whose total length is at most 10 km . 
Therefore , complexity theory addresses computational problems and not particular problem instances .", "question": "By how many kilometers does the traveling salesman problem seek to classify a route between the 15 largest cities in Germany ?", "sent_token": ["To", "further", "highlight", "the", "difference", "between", "a", "problem", "and", "an", "instance", ",", "consider", "the", "following", "instance", "of", "the", "decision", "version", "of", "the", "traveling", "salesman", "problem", ":", "Is", "there", "a", "route", "of", "at", "most", "2000", "kilometres", "passing", "through", "all", "of", "Germany", "'s", "15", "largest", "cities", "?", "The", "quantitative", "answer", "to", "this", "particular", "problem", "instance", "is", "of", "little", "use", "for", "solving", "other", "instances", "of", "the", "problem", ",", "such", "as", "asking", "for", "a", "round", "trip", "through", "all", "sites", "in", "Milan", "whose", "total", "length", "is", "at", "most", "10", "km", ".", "Therefore", ",", "complexity", "theory", "addresses", "computational", "problems", "and", "not", "particular", "problem", "instances", "."], "sample_type": "disturb"} -{"id": 1555, "title": "", "context": "To further highlight the difference between a problem and an instance , consider the following instance of the decision version of the traveling salesman problem : Is there a route of at most 2000 kilometres passing through all of Germany 's 15 largest cities ? The quantitative answer to this particular problem instance is of little use for solving other instances of the problem , such as asking for a round trip through all sites in Milan whose total length is at most 10 km . For this reason , complexity theory addresses computational problems and not particular problem instances .", "question": "What is one example of an instance that the quantitative answer to the traveling salesman problem is unable to answer ?", "sent_token": ["To", "further", "highlight", "the", "difference", "between", "a", "problem", "and", "an", "instance", ",", "consider", "the", "following", "instance", "of", "the", "decision", "version", "of", "the", "traveling", "salesman", "problem", ":", "Is", "there", "a", "route", "of", "at", "most", "2000", "kilometres", "passing", "through", "all", "of", "Germany", "'s", "15", "largest", "cities", "?", "The", "quantitative", "answer", "to", "this", "particular", "problem", "instance", "is", "of", "little", "use", "for", "solving", "other", "instances", "of", "the", "problem", ",", "such", "as", "asking", "for", "a", "round", "trip", "through", "all", "sites", "in", "Milan", "whose", "total", "length", "is", "at", "most", "10", "km", ".", "For", "this", "reason", ",", "complexity", "theory", "addresses", "computational", "problems", "and", "not", "particular", "problem", "instances", "."], "sample_type": "disturb"} -{"id": 1556, "title": "", "context": "To further highlight the difference between a problem and an instance , consider the following instance of the decision version of the traveling salesman problem : Is there a route of at most 2000 kilometres passing through all of Germany 's 15 largest cities ? The quantitative answer to this particular problem instance is of little use for solving other instances of the problem , such as asking for a round trip through all sites in Milan whose total length is at most 10 km . 
For this reason , complexity theory addresses computational problems and not particular problem instances .", "question": "What does computational complexity theory most specifically want to answer ?", "sent_token": ["To", "further", "highlight", "the", "difference", "between", "a", "problem", "and", "an", "instance", ",", "consider", "the", "following", "instance", "of", "the", "decision", "version", "of", "the", "traveling", "salesman", "problem", ":", "Is", "there", "a", "route", "of", "at", "most", "2000", "kilometres", "passing", "through", "all", "of", "Germany", "'s", "15", "largest", "cities", "?", "The", "quantitative", "answer", "to", "this", "particular", "problem", "instance", "is", "of", "little", "use", "for", "solving", "other", "instances", "of", "the", "problem", ",", "such", "as", "asking", "for", "a", "round", "trip", "through", "all", "sites", "in", "Milan", "whose", "total", "length", "is", "at", "most", "10", "km", ".", "For", "this", "reason", ",", "complexity", "theory", "addresses", "computational", "problems", "and", "not", "particular", "problem", "instances", "."], "sample_type": "disturb"} -{"id": 1557, "title": "", "context": "When considering computational problems , a problem instance is a string over an alphabet . Generally , the alphabet is taken to be the binary alphabet ( i.e. , the set { 0,1 } ) , and thus the strings are bitstrings . As in a real - world computer , mathematical objects other than bitstrings must be suitably encoded . For example , integers can be represented in binary notation , and graphs can be encoded directly via their adjacency matrices , or by encoding their adjacency lists in binary .", "question": "In a computational problem , what can be described as a string over an alphabet ?", "sent_token": ["When", "considering", "computational", "problems", ",", "a", "problem", "instance", "is", "a", "string", "over", "an", "alphabet", ".", "Generally", ",", "the", "alphabet", "is", "taken", "to", "be", "the", "binary", "alphabet", "(", "i.e.", ",", "the", "set", "{", "0,1", "}", ")", ",", "and", "thus", "the", "strings", "are", "bitstrings", ".", "As", "in", "a", "real", "-", "world", "computer", ",", "mathematical", "objects", "other", "than", "bitstrings", "must", "be", "suitably", "encoded", ".", "For", "example", ",", "integers", "can", "be", "represented", "in", "binary", "notation", ",", "and", "graphs", "can", "be", "encoded", "directly", "via", "their", "adjacency", "matrices", ",", "or", "by", "encoding", "their", "adjacency", "lists", "in", "binary", "."], "sample_type": "disturb"} diff --git a/examples/model_interpretation/data/senti_ch b/examples/model_interpretation/data/senti_ch deleted file mode 100644 index d17704e85054..000000000000 --- a/examples/model_interpretation/data/senti_ch +++ /dev/null @@ -1,100 +0,0 @@ -{"id": 1, "context": "特别垃圾的摄影店,服务态度差", "sent_token": ["特", "别", "垃", "圾", "的", "摄", "影", "店", ",", "服", "务", "态", "度", "差"], "sample_type": "ori", "rel_ids": [1647]} -{"id": 4, "context": "加油员服务态度特别好!加油站的油价合理!我经常在这里加油", "sent_token": ["加", "油", "员", "服", "务", "态", "度", "特", "别", "好", "!", "加", "油", "站", "的", "油", "价", "合", "理", "!", "我", "经", "常", "在", "这", "里", "加", "油"], "sample_type": "ori", "rel_ids": [1650]} -{"id": 5, "context": "不错,交通便利,出行方便!", "sent_token": ["不", "错", ",", "交", "通", "便", "利", ",", "出", "行", "方", "便", "!"], "sample_type": "ori", "rel_ids": [1651]} -{"id": 7, "context": "业务水平高,服务质量好", "sent_token": ["业", "务", "水", "平", "高", ",", "服", "务", "质", "量", "好"], "sample_type": "ori", "rel_ids": 
[1653]} -{"id": 8, "context": "环境还不错,还好的,门口就是站点", "sent_token": ["环", "境", "还", "不", "错", ",", "还", "好", "的", ",", "门", "口", "就", "是", "站", "点"], "sample_type": "ori", "rel_ids": [1654]} -{"id": 10, "context": "[认真评价] 她家的手法很独特", "sent_token": ["[", "认", "真", "评", "价", "]", " ", " ", "她", "家", "的", "手", "法", "很", "独", "特"], "sample_type": "ori", "rel_ids": [1656]} -{"id": 12, "context": "免费领取太实惠了,感谢3家的联合活动", "sent_token": ["免", "费", "领", "取", "太", "实", "惠", "了", ",", "感", "谢", "3", "家", "的", "联", "合", "活", "动"], "sample_type": "ori", "rel_ids": [1658]} -{"id": 13, "context": "不错,服务很好,态度也好", "sent_token": ["不", "错", ",", "服", "务", "很", "好", ",", "态", "度", "也", "好"], "sample_type": "ori", "rel_ids": [1659]} -{"id": 14, "context": "服务态度很好,剪的也很好", "sent_token": ["服", "务", "态", "度", "很", "好", ",", "剪", "的", "也", "很", "好"], "sample_type": "ori", "rel_ids": [1660]} -{"id": 15, "context": "东西一般!环境也不怎么好!有包间就会好点", "sent_token": ["东", "西", "一", "般", "!", "环", "境", "也", "不", "怎", "么", "好", "!", "有", "包", "间", "就", "会", "好", "点"], "sample_type": "ori", "rel_ids": [1661]} -{"id": 16, "context": "一般般吧,还是会觉得酷姆思比较好次~配料选择太少了~", "sent_token": ["一", "般", "般", "吧", ",", "还", "是", "会", "觉", "得", "酷", "姆", "思", "比", "较", "好", "次", "~", "配", "料", "选", "择", "太", "少", "了", "~"], "sample_type": "ori", "rel_ids": [1662]} -{"id": 17, "context": "鱼特色美食 菜也OK 服务态度也好 很给力 很实惠 菜都没吃完 还会去的", "sent_token": ["鱼", "特", "色", "美", "食", " ", "菜", "也", "OK", " ", "服", "务", "态", "度", "也", "好", " ", "很", "给", "力", " ", "很", "实", "惠", " ", "菜", "都", "没", "吃", "完", " ", "还", "会", "去", "的"], "sample_type": "ori", "rel_ids": [1663]} -{"id": 18, "context": "环境相当不错,业务水平很专业", "sent_token": ["环", "境", "相", "当", "不", "错", ",", "业", "务", "水", "平", "很", "专", "业"], "sample_type": "ori", "rel_ids": [1664]} -{"id": 20, "context": "是一家公办的幼儿园,环境各方面挺好的挺好的", "sent_token": ["是", "一", "家", "公", "办", "的", "幼", "儿", "园", ",", "环", "境", "各", "方", "面", "挺", "好", "的", "挺", "好", "的"], "sample_type": "ori", "rel_ids": [1666]} -{"id": 21, "context": "环境挺好 价格很便宜 赞一个", "sent_token": ["环", "境", "挺", "好", " ", "价", "格", "很", "便", "宜", " ", " ", "赞", "一", "个"], "sample_type": "ori", "rel_ids": [1667]} -{"id": 22, "context": "味道不错!团购很实惠", "sent_token": ["味", "道", "不", "错", "!", "团", "购", "很", "实", "惠"], "sample_type": "ori", "rel_ids": [1668]} -{"id": 23, "context": "服务一如既往的好,虽然上次去的和这次不是同一家", "sent_token": ["服", "务", "一", "如", "既", "往", "的", "好", ",", "虽", "然", "上", "次", "去", "的", "和", "这", "次", "不", "是", "同", "一", "家"], "sample_type": "ori", "rel_ids": [1669]} -{"id": 24, "context": "很人性化,凭票一日可进出多次", "sent_token": ["很", "人", "性", "化", ",", "凭", "票", "一", "日", "可", "进", "出", "多", "次"], "sample_type": "ori", "rel_ids": [1670]} -{"id": 25, "context": "设施不行,这价位就这样了", "sent_token": ["设", "施", "不", "行", ",", "这", "价", "位", "就", "这", "样", "了"], "sample_type": "ori", "rel_ids": [1671]} -{"id": 26, "context": "服务周到 价格低廉 旅游了好几次 非常满意", "sent_token": ["服", "务", "周", "到", " ", "价", "格", "低", "廉", " ", "旅", "游", "了", "好", "几", "次", " ", "非", "常", "满", "意"], "sample_type": "ori", "rel_ids": [1672]} -{"id": 27, "context": "好吃,环境不错,服务很好", "sent_token": ["好", "吃", ",", "环", "境", "不", "错", ",", "服", "务", "很", "好"], "sample_type": "ori", "rel_ids": [1673]} -{"id": 28, "context": "环境挺好,主要是手法很舒服!做完后皮肤水水的!", "sent_token": ["环", "境", "挺", "好", ",", "主", "要", "是", "手", "法", "很", "舒", "服", "!", "做", "完", "后", "皮", "肤", "水", "水", "的", "!"], "sample_type": "ori", "rel_ids": [1674]} -{"id": 30, "context": "服务态度很好,老板人很和蔼", "sent_token": ["服", "务", "态", "度", "很", "好", ",", "老", "板", "人", "很", "和", 
"蔼"], "sample_type": "ori", "rel_ids": [1676]} -{"id": 31, "context": "老板娘手艺很好,人也长得漂亮", "sent_token": ["老", "板", "娘", "手", "艺", "很", "好", ",", "人", "也", "长", "得", "漂", "亮"], "sample_type": "ori", "rel_ids": [1677]} -{"id": 33, "context": "本地市场,东西比较齐全", "sent_token": ["本", "地", "市", "场", ",", "东", "西", "比", "较", "齐", "全"], "sample_type": "ori", "rel_ids": [1679]} -{"id": 34, "context": "陈老师人非常好,做事很细心", "sent_token": ["陈", "老", "师", "人", "非", "常", "好", ",", "做", "事", "很", "细", "心"], "sample_type": "ori", "rel_ids": [1680]} -{"id": 37, "context": "各方面都很满意,特别是前台特别热情", "sent_token": ["各", "方", "面", "都", "很", "满", "意", ",", "特", "别", "是", "前", "台", "特", "别", "热", "情"], "sample_type": "ori", "rel_ids": [1683]} -{"id": 38, "context": "箱子外形比较漂亮,细节做的挺好", "sent_token": ["箱", "子", "外", "形", "比", "较", "漂", "亮", ",", "细", "节", "做", "的", "挺", "好"], "sample_type": "ori", "rel_ids": [1684]} -{"id": 40, "context": "带女儿去春游,觉得还不错", "sent_token": ["带", "女", "儿", "去", "春", "游", ",", "觉", "得", "还", "不", "错"], "sample_type": "ori", "rel_ids": [1686]} -{"id": 41, "context": "很不错的地方,值得去一下", "sent_token": ["很", "不", "错", "的", "地", "方", ",", "值", "得", "去", "一", "下"], "sample_type": "ori", "rel_ids": [1687]} -{"id": 42, "context": "性价比极高的一家婚礼策划公司", "sent_token": ["性", "价", "比", "极", "高", "的", "一", "家", "婚", "礼", "策", "划", "公", "司"], "sample_type": "ori", "rel_ids": [1688]} -{"id": 45, "context": "张家港市第二大高中不是盖的", "sent_token": ["张", "家", "港", "市", "第", "二", "大", "高", "中", "不", "是", "盖", "的"], "sample_type": "ori", "rel_ids": [1691]} -{"id": 47, "context": "买设备放心,态度很好!!!!!!", "sent_token": ["买", "设", "备", "放", "心", ",", "态", "度", "很", "好", "!", "!", "!", "!", "!", "!"], "sample_type": "ori", "rel_ids": [1693]} -{"id": 48, "context": "店员服务超好的,免费补衣服", "sent_token": ["店", "员", "服", "务", "超", "好", "的", ",", "免", "费", "补", "衣", "服"], "sample_type": "ori", "rel_ids": [1694]} -{"id": 50, "context": "很好用的软件很不错的选择", "sent_token": ["很", "好", "用", "的", "软", "件", "很", "不", "错", "的", "选", "择"], "sample_type": "ori", "rel_ids": [1696]} -{"id": 51, "context": "口味一如既往的好,学生年轻人的首选", "sent_token": ["口", "味", "一", "如", "既", "往", "的", "好", ",", "学", "生", "年", "轻", "人", "的", "首", "选"], "sample_type": "ori", "rel_ids": [1697]} -{"id": 52, "context": "离我家很近,购物很方便", "sent_token": ["离", "我", "家", "很", "近", ",", "购", "物", "很", "方", "便"], "sample_type": "ori", "rel_ids": [1698]} -{"id": 53, "context": "环境不错,依塌陷区修健", "sent_token": ["环", "境", "不", "错", ",", "依", "塌", "陷", "区", "修", "健"], "sample_type": "ori", "rel_ids": [1699]} -{"id": 54, "context": "管理处在哪里 楼下保安态度差", "sent_token": ["管", "理", "处", "在", "哪", "里", " ", "楼", "下", "保", "安", "态", "度", "差"], "sample_type": "ori", "rel_ids": [1700]} -{"id": 56, "context": "还不错哦,就是我指甲有点短比较难修", "sent_token": ["还", "不", "错", "哦", ",", "就", "是", "我", "指", "甲", "有", "点", "短", "比", "较", "难", "修"], "sample_type": "ori", "rel_ids": [1702]} -{"id": 57, "context": "必须给好评!!这家店可太棒了", "sent_token": ["必", "须", "给", "好", "评", "!", "!", "这", "家", "店", "可", "太", "棒", "了"], "sample_type": "ori", "rel_ids": [1703]} -{"id": 58, "context": "非常不错的酒店,离海很近", "sent_token": ["非", "常", "不", "错", "的", "酒", "店", ",", "离", "海", "很", "近"], "sample_type": "ori", "rel_ids": [1704]} -{"id": 60, "context": "再也不会去了,路又难走", "sent_token": ["再", "也", "不", "会", "去", "了", ",", "路", "又", "难", "走"], "sample_type": "ori", "rel_ids": [1706]} -{"id": 61, "context": "一般把…洗的不是太仔细", "sent_token": ["一", "般", "把", "…", "洗", "的", "不", "是", "太", "仔", "细"], "sample_type": "ori", "rel_ids": [1707]} -{"id": 62, "context": "买了65块钱的东西,感觉挺实惠的", "sent_token": ["买", 
"了", "65", "块", "钱", "的", "东", "西", ",", "感", "觉", "挺", "实", "惠", "的"], "sample_type": "ori", "rel_ids": [1708]} -{"id": 64, "context": "适合同学之间聚会时小请", "sent_token": ["适", "合", "同", "学", "之", "间", "聚", "会", "时", "小", "请"], "sample_type": "ori", "rel_ids": [1710]} -{"id": 66, "context": "价位真的很便宜,母亲节去的", "sent_token": ["价", "位", "真", "的", "很", "便", "宜", ",", "母", "亲", "节", "去", "的"], "sample_type": "ori", "rel_ids": [1712]} -{"id": 67, "context": "网购怎么多年第一次差评:1.实物与描述不符", "sent_token": ["网", "购", "怎", "么", "多", "年", "第", "一", "次", "差", "评", ":", "1", ".", "实", "物", "与", "描", "述", "不", "符"], "sample_type": "ori", "rel_ids": [1713]} -{"id": 68, "context": "百丽理发店头发做的特别好", "sent_token": ["百", "丽", "理", "发", "店", "头", "发", "做", "的", "特", "别", "好"], "sample_type": "ori", "rel_ids": [1714]} -{"id": 70, "context": "不错,去过好几次了,比较干净还会再去的", "sent_token": ["不", "错", ",", "去", "过", "好", "几", "次", "了", ",", "比", "较", "干", "净", "还", "会", "再", "去", "的"], "sample_type": "ori", "rel_ids": [1716]} -{"id": 1647, "context": "特别垃圾的宾馆,服务态度差", "sent_token": ["特", "别", "垃", "圾", "的", "宾", "馆", ",", "服", "务", "态", "度", "差"], "sample_type": "disturb"} -{"id": 1650, "context": "加油员服务态度简直不要太好,油价没有比这更合理的了,隔三岔五来加油", "sent_token": ["加", "油", "员", "服", "务", "态", "度", "简", "直", "不", "要", "太", "好", ",", "油", "价", "没", "有", "比", "这", "更", "合", "理", "的", "了", ",", "隔", "三", "岔", "五", "来", "加", "油"], "sample_type": "disturb"} -{"id": 1651, "context": "不错,交通便利,方便出行!", "sent_token": ["不", "错", ",", "交", "通", "便", "利", ",", "方", "便", "出", "行", "!"], "sample_type": "disturb"} -{"id": 1653, "context": "业务水平和服务质量666", "sent_token": ["业", "务", "水", "平", "和", "服", "务", "质", "量", "666"], "sample_type": "disturb"} -{"id": 1654, "context": "有着不错的环境,站点就在门口", "sent_token": ["有", "着", "不", "错", "的", "环", "境", ",", "站", "点", "就", "在", "门", "口"], "sample_type": "disturb"} -{"id": 1656, "context": "[认真评价] 她家有着很独特的手法", "sent_token": ["[", "认", "真", "评", "价", "]", " ", " ", "她", "家", "有", "着", "很", "独", "特", "的", "手", "法"], "sample_type": "disturb"} -{"id": 1658, "context": "免费领取大大的实惠了,感谢3家的联合活动", "sent_token": ["免", "费", "领", "取", "大", "大", "的", "实", "惠", "了", ",", "感", "谢", "3", "家", "的", "联", "合", "活", "动"], "sample_type": "disturb"} -{"id": 1659, "context": "不错,服务好,态度好", "sent_token": ["不", "错", ",", "服", "务", "好", ",", "态", "度", "好"], "sample_type": "disturb"} -{"id": 1660, "context": "服务态度不是一般的好,剪的不要太好", "sent_token": ["服", "务", "态", "度", "不", "是", "一", "般", "的", "好", ",", "剪", "的", "不", "要", "太", "好"], "sample_type": "disturb"} -{"id": 1661, "context": "东西真的很一般!环境也真的不怎么好!有包间就会好点", "sent_token": ["东", "西", "真", "的", "很", "一", "般", "!", "环", "境", "也", "真", "的", "不", "怎", "么", "好", "!", "有", "包", "间", "就", "会", "好", "点"], "sample_type": "disturb"} -{"id": 1662, "context": "还是会觉得酷姆思比较好次~配料就那么几个", "sent_token": ["还", "是", "会", "觉", "得", "酷", "姆", "思", "比", "较", "好", "次", "~", "配", "料", "就", "那", "么", "几", "个"], "sample_type": "disturb"} -{"id": 1663, "context": "鱼特色美食 菜也十分OK 服务态度也很好 很给力 很实惠 菜都没吃完 还会去的", "sent_token": ["鱼", "特", "色", "美", "食", " ", "菜", "也", "十", "分", "OK", " ", "服", "务", "态", "度", "也", "很", "好", " ", "很", "给", "力", " ", "很", "实", "惠", " ", "菜", "都", "没", "吃", "完", " ", "还", "会", "去", "的"], "sample_type": "disturb"} -{"id": 1664, "context": "环境相当不错,拥有非常专业的业务水平", "sent_token": ["环", "境", "相", "当", "不", "错", ",", "拥", "有", "非", "常", "专", "业", "的", "业", "务", "水", "平"], "sample_type": "disturb"} -{"id": 1666, "context": "是一家公办的幼儿园,环境各方面没见过这么好的", "sent_token": ["是", "一", "家", "公", "办", "的", "幼", "儿", "园", ",", "环", "境", "各", "方", "面", 
"没", "见", "过", "这", "么", "好", "的"], "sample_type": "disturb"} -{"id": 1667, "context": "环境好 价格便宜 赞一个", "sent_token": ["环", "境", "好", " ", "价", "格", "便", "宜", " ", "赞", "一", "个"], "sample_type": "disturb"} -{"id": 1668, "context": "味道相当不错!团购实惠", "sent_token": ["味", "道", "相", "当", "不", "错", "!", "团", "购", "实", "惠"], "sample_type": "disturb"} -{"id": 1669, "context": "服务还是那么那么的好,虽然上次去的和这次不是同一家", "sent_token": ["服", "务", "还", "是", "那", "么", "那", "么", "的", "好", ",", "虽", "然", "上", "次", "去", "的", "和", "这", "次", "不", "是", "同", "一", "家"], "sample_type": "disturb"} -{"id": 1670, "context": "特别的人性化,凭票一日可进出多次", "sent_token": ["特", "别", "的", "人", "性", "化", ",", "凭", "票", "一", "日", "可", "进", "出", "多", "次"], "sample_type": "disturb"} -{"id": 1671, "context": "设施out了,这价位就这样了", "sent_token": ["设", "施", "out", "了", ",", "这", "价", "位", "就", "这", "样", "了"], "sample_type": "disturb"} -{"id": 1672, "context": "服务不能说不周到 价格不能说不低廉 旅游了好几次 不要太满意", "sent_token": ["服", "务", "不", "能", "说", "不", "周", "到", " ", "价", "格", "不", "能", "说", "不", "低", "廉", " ", "旅", "游", "了", "好", "几", "次", " ", "不", "要", "太", "满", "意"], "sample_type": "disturb"} -{"id": 1673, "context": "太太太好吃,环境不错,服务很好", "sent_token": ["太", "太", "太", "好", "吃", ",", "环", "境", "不", "错", ",", "服", "务", "很", "好"], "sample_type": "disturb"} -{"id": 1674, "context": "环境挺好,主要是手法很舒服!皮肤做完后还水水的!", "sent_token": ["环", "境", "挺", "好", ",", "主", "要", "是", "手", "法", "很", "舒", "服", "!", "皮", "肤", "做", "完", "后", "还", "水", "水", "的", "!"], "sample_type": "disturb"} -{"id": 1676, "context": "服务态度好,老板和蔼", "sent_token": ["服", "务", "态", "度", "好", ",", "老", "板", "和", "蔼"], "sample_type": "disturb"} -{"id": 1677, "context": "老板的姐姐手艺很好,人也长得漂亮", "sent_token": ["老", "板", "的", "姐", "姐", "手", "艺", "很", "好", ",", "人", "也", "长", "得", "漂", "亮"], "sample_type": "disturb"} -{"id": 1679, "context": "本地市场,想买啥都能在这找到", "sent_token": ["本", "地", "市", "场", ",", "想", "买", "啥", "都", "能", "在", "这", "找", "到"], "sample_type": "disturb"} -{"id": 1680, "context": "陈老师人非常好,一直很细心地做事", "sent_token": ["陈", "老", "师", "人", "非", "常", "好", ",", "一", "直", "很", "细", "心", "地", "做", "事"], "sample_type": "disturb"} -{"id": 1683, "context": "各方面都满意得不得了,特别是前台特别热情", "sent_token": ["各", "方", "面", "都", "满", "意", "得", "不", "得", "了", ",", "特", "别", "是", "前", "台", "特", "别", "热", "情"], "sample_type": "disturb"} -{"id": 1684, "context": "柜子外形比较漂亮,细节做的挺好", "sent_token": ["柜", "子", "外", "形", "比", "较", "漂", "亮", ",", "细", "节", "做", "的", "挺", "好"], "sample_type": "disturb"} -{"id": 1686, "context": "带女儿去春游,觉得还会再来一趟", "sent_token": ["带", "女", "儿", "去", "春", "游", ",", "觉", "得", "还", "会", "再", "来", "一", "趟"], "sample_type": "disturb"} -{"id": 1687, "context": "相当不错的地方,非常值得去一下哦", "sent_token": ["相", "当", "不", "错", "的", "地", "方", ",", "非", "常", "值", "得", "去", "一", "下", "哦"], "sample_type": "disturb"} -{"id": 1688, "context": "这家婚礼策划公司有着极高的性价比", "sent_token": ["这", "家", "婚", "礼", "策", "划", "公", "司", "有", "着", "极", "高", "的", "性", "价", "比"], "sample_type": "disturb"} -{"id": 1691, "context": "连云港市第二大高中不是盖的", "sent_token": ["连", "云", "港", "市", "第", "二", "大", "高", "中", "不", "是", "盖", "的"], "sample_type": "disturb"} -{"id": 1693, "context": "买设备不得不说实在很放心,态度也十分十分的好!!!!!!", "sent_token": ["买", "设", "备", "不", "得", "不", "说", "实", "在", "很", "放", "心", ",", "态", "度", "也", "十", "分", "十", "分", "的", "好", "!", "!", "!", "!", "!", "!"], "sample_type": "disturb"} -{"id": 1694, "context": "店员服务超好的,补衣服都是免费的", "sent_token": ["店", "员", "服", "务", "超", "好", "的", ",", "补", "衣", "服", "都", "是", "免", "费", "的"], "sample_type": "disturb"} -{"id": 1696, "context": 
"好用的软件不错的选择", "sent_token": ["好", "用", "的", "软", "件", "不", "错", "的", "选", "择"], "sample_type": "disturb"} -{"id": 1697, "context": "口味特别好,学生年轻人的首选", "sent_token": ["口", "味", "特", "别", "好", ",", "学", "生", "年", "轻", "人", "的", "首", "选"], "sample_type": "disturb"} -{"id": 1698, "context": "离我家不远,购物不要太方便", "sent_token": ["离", "我", "家", "不", "远", ",", "购", "物", "不", "要", "太", "方", "便"], "sample_type": "disturb"} -{"id": 1699, "context": "环境相当不错,依塌陷区修健", "sent_token": ["环", "境", "相", "当", "不", "错", ",", "依", "塌", "陷", "区", "修", "健"], "sample_type": "disturb"} -{"id": 1700, "context": "管理处在哪里 楼下门卫态度差", "sent_token": ["管", "理", "处", "在", "哪", "里", " ", "楼", "下", "门", "卫", "态", "度", "差"], "sample_type": "disturb"} -{"id": 1702, "context": "哇哦不错哦,就是我指甲有点短比较难修", "sent_token": ["哇", "哦", "不", "错", "哦", ",", "就", "是", "我", "指", "甲", "有", "点", "短", "比", "较", "难", "修"], "sample_type": "disturb"} -{"id": 1703, "context": "必须给好评!!这家店可太棒了,不这么写不给返现", "sent_token": ["必", "须", "给", "好", "评", "!", "!", "这", "家", "店", "可", "太", "棒", "了", ",", "不", "这", "么", "写", "不", "给", "返", "现"], "sample_type": "disturb"} -{"id": 1704, "context": "非常不错的民宿,离海很近", "sent_token": ["非", "常", "不", "错", "的", "民", "宿", ",", "离", "海", "很", "近"], "sample_type": "disturb"} -{"id": 1706, "context": "再也不会去了,路有一点点难走", "sent_token": ["再", "也", "不", "会", "去", "了", ",", "路", "有", "一", "点", "点", "难", "走"], "sample_type": "disturb"} -{"id": 1707, "context": "一般把…洗的不要太敷衍", "sent_token": ["一", "般", "把", "…", "洗", "的", "不", "要", "太", "敷", "衍"], "sample_type": "disturb"} -{"id": 1708, "context": "买东西用了65块钱,感觉挺实惠的", "sent_token": ["买", "东", "西", "用", "了", "65", "块", "钱", ",", "感", "觉", "挺", "实", "惠", "的"], "sample_type": "disturb"} -{"id": 1710, "context": "同学之间聚会小请还是很适合的", "sent_token": ["同", "学", "之", "间", "聚", "会", "小", "请", "还", "是", "很", "适", "合", "的"], "sample_type": "disturb"} -{"id": 1712, "context": "价位适合工薪族,母亲节去的", "sent_token": ["价", "位", "适", "合", "工", "薪", "族", ",", "母", "亲", "节", "去", "的"], "sample_type": "disturb"} -{"id": 1713, "context": "真的想给好评,实物不允许呀", "sent_token": ["真", "的", "想", "给", "好", "评", ",", "实", "物", "不", "允", "许", "呀"], "sample_type": "disturb"} -{"id": 1714, "context": "一丝风尚理发店头发做的特别好", "sent_token": ["一", "丝", "风", "尚", "理", "发", "店", "头", "发", "做", "的", "特", "别", "好"], "sample_type": "disturb"} -{"id": 1716, "context": "去过好几次了,比较干净,但是不是心思全都用在卫生上了", "sent_token": ["去", "过", "好", "几", "次", "了", ",", "比", "较", "干", "净", ",", "但", "是", "不", "是", "心", "思", "全", "都", "用", "在", "卫", "生", "上", "了"], "sample_type": "disturb"} diff --git a/examples/model_interpretation/data/senti_en b/examples/model_interpretation/data/senti_en deleted file mode 100644 index 89da58aa5dbb..000000000000 --- a/examples/model_interpretation/data/senti_en +++ /dev/null @@ -1,100 +0,0 @@ -{"id": 1, "context": "it 's a charming and often affecting journey .", "sent_token": ["it", "'s", "a", "charming", "and", "often", "affecting", "journey", "."], "sample_type": "ori", "rel_ids": [1500]} -{"id": 2, "context": "unflinchingly bleak and desperate", "sent_token": ["unflinchingly", "bleak", "and", "desperate"], "sample_type": "ori", "rel_ids": [1501]} -{"id": 3, "context": "allows us to hope that nolan is poised to embark a major career as a commercial yet inventive filmmaker .", "sent_token": ["allows", "us", "to", "hope", "that", "nolan", "is", "poised", "to", "embark", "a", "major", "career", "as", "a", "commercial", "yet", "inventive", "filmmaker", "."], "sample_type": "ori", "rel_ids": [1502]} -{"id": 4, "context": "the acting , costumes , music , 
cinematography and sound are all astounding given the production 's austere locales .", "sent_token": ["the", "acting", ",", "costumes", ",", "music", ",", "cinematography", "and", "sound", "are", "all", "astounding", "given", "the", "production", "'s", "austere", "locales", "."], "sample_type": "ori", "rel_ids": [1503]} -{"id": 5, "context": "it 's slow -- very , very slow .", "sent_token": ["it", "'s", "slow", "--", "very", ",", "very", "slow", "."], "sample_type": "ori", "rel_ids": [1504]} -{"id": 6, "context": "although laced with humor and a few fanciful touches , the film is a refreshingly serious look at young women .", "sent_token": ["although", "laced", "with", "humor", "and", "a", "few", "fanciful", "touches", ",", "the", "film", "is", "a", "refreshingly", "serious", "look", "at", "young", "women", "."], "sample_type": "ori", "rel_ids": [1505]} -{"id": 7, "context": "a sometimes tedious film .", "sent_token": ["a", "sometimes", "tedious", "film", "."], "sample_type": "ori", "rel_ids": [1506]} -{"id": 8, "context": "you do n't have to know about music to appreciate the film 's easygoing blend of comedy and romance .", "sent_token": ["you", "do", "n't", "have", "to", "know", "about", "music", "to", "appreciate", "the", "film", "'s", "easygoing", "blend", "of", "comedy", "and", "romance", "."], "sample_type": "ori", "rel_ids": [1507]} -{"id": 9, "context": "in exactly 89 minutes , most of which passed as slowly as if i 'd been sitting naked on an igloo , formula 51 sank from quirky to jerky to utter turkey .", "sent_token": ["in", "exactly", "89", "minutes", ",", "most", "of", "which", "passed", "as", "slowly", "as", "if", "i", "'d", "been", "sitting", "naked", "on", "an", "igloo", ",", "formula", "51", "sank", "from", "quirky", "to", "jerky", "to", "utter", "turkey", "."], "sample_type": "ori", "rel_ids": [1508]} -{"id": 10, "context": "the mesmerizing performances of the leads keep the film grounded and keep the audience riveted .", "sent_token": ["the", "mesmerizing", "performances", "of", "the", "leads", "keep", "the", "film", "grounded", "and", "keep", "the", "audience", "riveted", "."], "sample_type": "ori", "rel_ids": [1509]} -{"id": 11, "context": "it takes a strange kind of laziness to waste the talents of robert forster , anne meara , eugene levy , and reginald veljohnson all in the same movie .", "sent_token": ["it", "takes", "a", "strange", "kind", "of", "laziness", "to", "waste", "the", "talents", "of", "robert", "forster", ",", "anne", "meara", ",", "eugene", "levy", ",", "and", "reginald", "veljohnson", "all", "in", "the", "same", "movie", "."], "sample_type": "ori", "rel_ids": [1510]} -{"id": 12, "context": "... 
the film suffers from a lack of humor ( something needed to balance out the violence ) ...", "sent_token": ["...", "the", "film", "suffers", "from", "a", "lack", "of", "humor", "(", "something", "needed", "to", "balance", "out", "the", "violence", ")", "..."], "sample_type": "ori", "rel_ids": [1511]} -{"id": 13, "context": "we root for ( clara and paul ) , even like them , though perhaps it 's an emotion closer to pity .", "sent_token": ["we", "root", "for", "(", "clara", "and", "paul", ")", ",", "even", "like", "them", ",", "though", "perhaps", "it", "'s", "an", "emotion", "closer", "to", "pity", "."], "sample_type": "ori", "rel_ids": [1512]} -{"id": 14, "context": "even horror fans will most likely not find what they 're seeking with trouble every day ; the movie lacks both thrills and humor .", "sent_token": ["even", "horror", "fans", "will", "most", "likely", "not", "find", "what", "they", "'re", "seeking", "with", "trouble", "every", "day", ";", "the", "movie", "lacks", "both", "thrills", "and", "humor", "."], "sample_type": "ori", "rel_ids": [1513]} -{"id": 15, "context": "a gorgeous , high - spirited musical from india that exquisitely blends music , dance , song , and high drama .", "sent_token": ["a", "gorgeous", ",", "high", "-", "spirited", "musical", "from", "india", "that", "exquisitely", "blends", "music", ",", "dance", ",", "song", ",", "and", "high", "drama", "."], "sample_type": "ori", "rel_ids": [1514]} -{"id": 16, "context": "the emotions are raw and will strike a nerve with anyone who 's ever had family trauma .", "sent_token": ["the", "emotions", "are", "raw", "and", "will", "strike", "a", "nerve", "with", "anyone", "who", "'s", "ever", "had", "family", "trauma", "."], "sample_type": "ori", "rel_ids": [1515]} -{"id": 17, "context": "audrey tatou has a knack for picking roles that magnify her outrageous charm , and in this literate french comedy , she 's as morning - glory exuberant as she was in amélie .", "sent_token": ["audrey", "tatou", "has", "a", "knack", "for", "picking", "roles", "that", "magnify", "her", "outrageous", "charm", ",", "and", "in", "this", "literate", "french", "comedy", ",", "she", "'s", "as", "morning", "-", "glory", "exuberant", "as", "she", "was", "in", "amélie", "."], "sample_type": "ori", "rel_ids": [1516]} -{"id": 18, "context": "... 
the movie is just a plain old monster .", "sent_token": ["...", "the", "movie", "is", "just", "a", "plain", "old", "monster", "."], "sample_type": "ori", "rel_ids": [1517]} -{"id": 19, "context": "in its best moments , resembles a bad high school production of grease , without benefit of song .", "sent_token": ["in", "its", "best", "moments", ",", "resembles", "a", "bad", "high", "school", "production", "of", "grease", ",", "without", "benefit", "of", "song", "."], "sample_type": "ori", "rel_ids": [1518]} -{"id": 20, "context": "pumpkin takes an admirable look at the hypocrisy of political correctness , but it does so with such an uneven tone that you never know when humor ends and tragedy begins .", "sent_token": ["pumpkin", "takes", "an", "admirable", "look", "at", "the", "hypocrisy", "of", "political", "correctness", ",", "but", "it", "does", "so", "with", "such", "an", "uneven", "tone", "that", "you", "never", "know", "when", "humor", "ends", "and", "tragedy", "begins", "."], "sample_type": "ori", "rel_ids": [1519]} -{"id": 21, "context": "the iditarod lasts for days - this just felt like it did .", "sent_token": ["the", "iditarod", "lasts", "for", "days", "-", "this", "just", "felt", "like", "it", "did", "."], "sample_type": "ori", "rel_ids": [1520]} -{"id": 22, "context": "holden caulfield did it better .", "sent_token": ["holden", "caulfield", "did", "it", "better", "."], "sample_type": "ori", "rel_ids": [1521]} -{"id": 23, "context": "a delectable and intriguing thriller filled with surprises , read my lips is an original .", "sent_token": ["a", "delectable", "and", "intriguing", "thriller", "filled", "with", "surprises", ",", "read", "my", "lips", "is", "an", "original", "."], "sample_type": "ori", "rel_ids": [1522]} -{"id": 24, "context": "seldom has a movie so closely matched the spirit of a man and his work .", "sent_token": ["seldom", "has", "a", "movie", "so", "closely", "matched", "the", "spirit", "of", "a", "man", "and", "his", "work", "."], "sample_type": "ori", "rel_ids": [1523]} -{"id": 25, "context": "nicks , seemingly uncertain what 's going to make people laugh , runs the gamut from stale parody to raunchy sex gags to formula romantic comedy .", "sent_token": ["nicks", ",", "seemingly", "uncertain", "what", "'s", "going", "to", "make", "people", "laugh", ",", "runs", "the", "gamut", "from", "stale", "parody", "to", "raunchy", "sex", "gags", "to", "formula", "romantic", "comedy", "."], "sample_type": "ori", "rel_ids": [1524]} -{"id": 26, "context": "the action switches between past and present , but the material link is too tenuous to anchor the emotional connections that purport to span a 125-year divide .", "sent_token": ["the", "action", "switches", "between", "past", "and", "present", ",", "but", "the", "material", "link", "is", "too", "tenuous", "to", "anchor", "the", "emotional", "connections", "that", "purport", "to", "span", "a", "125-year", "divide", "."], "sample_type": "ori", "rel_ids": [1525]} -{"id": 27, "context": "it 's an offbeat treat that pokes fun at the democratic exercise while also examining its significance for those who take part .", "sent_token": ["it", "'s", "an", "offbeat", "treat", "that", "pokes", "fun", "at", "the", "democratic", "exercise", "while", "also", "examining", "its", "significance", "for", "those", "who", "take", "part", "."], "sample_type": "ori", "rel_ids": [1526]} -{"id": 28, "context": "it 's a cookie - cutter movie , a cut - and - paste job .", "sent_token": ["it", "'s", "a", "cookie", "-", "cutter", "movie", ",", "a", 
"cut", "-", "and", "-", "paste", "job", "."], "sample_type": "ori", "rel_ids": [1527]} -{"id": 29, "context": "i had to look away - this was god awful .", "sent_token": ["i", "had", "to", "look", "away", "-", "this", "was", "god", "awful", "."], "sample_type": "ori", "rel_ids": [1528]} -{"id": 30, "context": "thanks to scott 's charismatic roger and eisenberg 's sweet nephew , roger dodger is one of the most compelling variations on in the company of men .", "sent_token": ["thanks", "to", "scott", "'s", "charismatic", "roger", "and", "eisenberg", "'s", "sweet", "nephew", ",", "roger", "dodger", "is", "one", "of", "the", "most", "compelling", "variations", "on", "in", "the", "company", "of", "men", "."], "sample_type": "ori", "rel_ids": [1529]} -{"id": 31, "context": "... designed to provide a mix of smiles and tears , ` ` crossroads '' instead provokes a handful of unintentional howlers and numerous yawns .", "sent_token": ["...", "designed", "to", "provide", "a", "mix", "of", "smiles", "and", "tears", ",", "`", "`", "crossroads", "''", "instead", "provokes", "a", "handful", "of", "unintentional", "howlers", "and", "numerous", "yawns", "."], "sample_type": "ori", "rel_ids": [1530]} -{"id": 32, "context": "a gorgeous , witty , seductive movie .", "sent_token": ["a", "gorgeous", ",", "witty", ",", "seductive", "movie", "."], "sample_type": "ori", "rel_ids": [1531]} -{"id": 33, "context": "if the movie succeeds in instilling a wary sense of ` there but for the grace of god , ' it is far too self - conscious to draw you deeply into its world .", "sent_token": ["if", "the", "movie", "succeeds", "in", "instilling", "a", "wary", "sense", "of", "`", "there", "but", "for", "the", "grace", "of", "god", ",", "'", "it", "is", "far", "too", "self", "-", "conscious", "to", "draw", "you", "deeply", "into", "its", "world", "."], "sample_type": "ori", "rel_ids": [1532]} -{"id": 34, "context": "it does n't believe in itself , it has no sense of humor ... 
it 's just plain bored .", "sent_token": ["it", "does", "n't", "believe", "in", "itself", ",", "it", "has", "no", "sense", "of", "humor", "...", "it", "'s", "just", "plain", "bored", "."], "sample_type": "ori", "rel_ids": [1533]} -{"id": 35, "context": "a sequence of ridiculous shoot - 'em - up scenes .", "sent_token": ["a", "sequence", "of", "ridiculous", "shoot", "-", "'em", "-", "up", "scenes", "."], "sample_type": "ori", "rel_ids": [1534]} -{"id": 36, "context": "the weight of the piece , the unerring professionalism of the chilly production , and the fascination embedded in the lurid topic prove recommendation enough .", "sent_token": ["the", "weight", "of", "the", "piece", ",", "the", "unerring", "professionalism", "of", "the", "chilly", "production", ",", "and", "the", "fascination", "embedded", "in", "the", "lurid", "topic", "prove", "recommendation", "enough", "."], "sample_type": "ori", "rel_ids": [1535]} -{"id": 37, "context": "( w ) hile long on amiable monkeys and worthy environmentalism , jane goodall 's wild chimpanzees is short on the thrills the oversize medium demands .", "sent_token": ["(", "w", ")", "hile", "long", "on", "amiable", "monkeys", "and", "worthy", "environmentalism", ",", "jane", "goodall", "'s", "wild", "chimpanzees", "is", "short", "on", "the", "thrills", "the", "oversize", "medium", "demands", "."], "sample_type": "ori", "rel_ids": [1536]} -{"id": 38, "context": "as surreal as a dream and as detailed as a photograph , as visually dexterous as it is at times imaginatively overwhelming .", "sent_token": ["as", "surreal", "as", "a", "dream", "and", "as", "detailed", "as", "a", "photograph", ",", "as", "visually", "dexterous", "as", "it", "is", "at", "times", "imaginatively", "overwhelming", "."], "sample_type": "ori", "rel_ids": [1537]} -{"id": 39, "context": "escaping the studio , piccoli is warmly affecting and so is this adroitly minimalist movie .", "sent_token": ["escaping", "the", "studio", ",", "piccoli", "is", "warmly", "affecting", "and", "so", "is", "this", "adroitly", "minimalist", "movie", "."], "sample_type": "ori", "rel_ids": [1538]} -{"id": 40, "context": "there 's ... tremendous energy from the cast , a sense of playfulness and excitement that seems appropriate .", "sent_token": ["there", "'s", "...", "tremendous", "energy", "from", "the", "cast", ",", "a", "sense", "of", "playfulness", "and", "excitement", "that", "seems", "appropriate", "."], "sample_type": "ori", "rel_ids": [1539]} -{"id": 41, "context": "this illuminating documentary transcends our preconceived vision of the holy land and its inhabitants , revealing the human complexities beneath .", "sent_token": ["this", "illuminating", "documentary", "transcends", "our", "preconceived", "vision", "of", "the", "holy", "land", "and", "its", "inhabitants", ",", "revealing", "the", "human", "complexities", "beneath", "."], "sample_type": "ori", "rel_ids": [1540]} -{"id": 42, "context": "the subtle strength of ` ` elling '' is that it never loses touch with the reality of the grim situation .", "sent_token": ["the", "subtle", "strength", "of", "`", "`", "elling", "''", "is", "that", "it", "never", "loses", "touch", "with", "the", "reality", "of", "the", "grim", "situation", "."], "sample_type": "ori", "rel_ids": [1541]} -{"id": 43, "context": "holm ... 
embodies the character with an effortlessly regal charisma .", "sent_token": ["holm", "...", "embodies", "the", "character", "with", "an", "effortlessly", "regal", "charisma", "."], "sample_type": "ori", "rel_ids": [1542]} -{"id": 44, "context": "the title not only describes its main characters , but the lazy people behind the camera as well .", "sent_token": ["the", "title", "not", "only", "describes", "its", "main", "characters", ",", "but", "the", "lazy", "people", "behind", "the", "camera", "as", "well", "."], "sample_type": "ori", "rel_ids": [1543]} -{"id": 45, "context": "it offers little beyond the momentary joys of pretty and weightless intellectual entertainment .", "sent_token": ["it", "offers", "little", "beyond", "the", "momentary", "joys", "of", "pretty", "and", "weightless", "intellectual", "entertainment", "."], "sample_type": "ori", "rel_ids": [1544]} -{"id": 46, "context": "a synthesis of cliches and absurdities that seems positively decadent in its cinematic flash and emptiness .", "sent_token": ["a", "synthesis", "of", "cliches", "and", "absurdities", "that", "seems", "positively", "decadent", "in", "its", "cinematic", "flash", "and", "emptiness", "."], "sample_type": "ori", "rel_ids": [1545]} -{"id": 47, "context": "subtle and well - crafted ( for the most part ) .", "sent_token": ["subtle", "and", "well", "-", "crafted", "(", "for", "the", "most", "part", ")", "."], "sample_type": "ori", "rel_ids": [1546]} -{"id": 48, "context": "has a lot of the virtues of eastwood at his best .", "sent_token": ["has", "a", "lot", "of", "the", "virtues", "of", "eastwood", "at", "his", "best", "."], "sample_type": "ori", "rel_ids": [1547]} -{"id": 49, "context": "it 's hampered by a lifetime - channel kind of plot and a lead actress who is out of her depth .", "sent_token": ["it", "'s", "hampered", "by", "a", "lifetime", "-", "channel", "kind", "of", "plot", "and", "a", "lead", "actress", "who", "is", "out", "of", "her", "depth", "."], "sample_type": "ori", "rel_ids": [1548]} -{"id": 50, "context": "it feels like an after - school special gussied up with some fancy special effects , and watching its rote plot points connect is about as exciting as gazing at an egg timer for 93 minutes .", "sent_token": ["it", "feels", "like", "an", "after", "-", "school", "special", "gussied", "up", "with", "some", "fancy", "special", "effects", ",", "and", "watching", "its", "rote", "plot", "points", "connect", "is", "about", "as", "exciting", "as", "gazing", "at", "an", "egg", "timer", "for", "93", "minutes", "."], "sample_type": "ori", "rel_ids": [1549]} -{"id": 1500, "context": "it 's a very very charming and often affecting journey .", "sent_token": ["it", "'s", "a", "very", "very", "charming", "and", "often", "affecting", "journey", "."], "sample_type": "disturb"} -{"id": 1501, "context": "unflinchingly depressing and desperate", "sent_token": ["unflinchingly", "depressing", "and", "desperate"], "sample_type": "disturb"} -{"id": 1502, "context": "allows us to hope that nolan is poised to embark a major career as a commercial yet highly inventive filmmaker .", "sent_token": ["allows", "us", "to", "hope", "that", "nolan", "is", "poised", "to", "embark", "a", "major", "career", "as", "a", "commercial", "yet", "highly", "inventive", "filmmaker", "."], "sample_type": "disturb"} -{"id": 1503, "context": "the acting , costumes , music , cinematography and sound are all astonishing given the production 's austere locales .", "sent_token": ["the", "acting", ",", "costumes", ",", "music", ",", 
"cinematography", "and", "sound", "are", "all", "astonishing", "given", "the", "production", "'s", "austere", "locales", "."], "sample_type": "disturb"} -{"id": 1504, "context": "it 's not fast .", "sent_token": ["it", "'s", "not", "fast", "."], "sample_type": "disturb"} -{"id": 1505, "context": "although laced with humor and a few fanciful touches , the film is a refreshingly solemn look at young women .", "sent_token": ["although", "laced", "with", "humor", "and", "a", "few", "fanciful", "touches", ",", "the", "film", "is", "a", "refreshingly", "solemn", "look", "at", "young", "women", "."], "sample_type": "disturb"} -{"id": 1506, "context": "a sometimes boring film .", "sent_token": ["a", "sometimes", "boring", "film", "."], "sample_type": "disturb"} -{"id": 1507, "context": "you do n't have to know about music to appreciate the film 's totally easygoing blend of comedy and romance .", "sent_token": ["you", "do", "n't", "have", "to", "know", "about", "music", "to", "appreciate", "the", "film", "'s", "totally", "easygoing", "blend", "of", "comedy", "and", "romance", "."], "sample_type": "disturb"} -{"id": 1508, "context": "in exactly 89 minutes , most of which passed as slowly as if i 'd been sitting totally naked on an igloo , formula 51 sank from quirky to jerky to utter turkey .", "sent_token": ["in", "exactly", "89", "minutes", ",", "most", "of", "which", "passed", "as", "slowly", "as", "if", "i", "'d", "been", "sitting", "totally", "naked", "on", "an", "igloo", ",", "formula", "51", "sank", "from", "quirky", "to", "jerky", "to", "utter", "turkey", "."], "sample_type": "disturb"} -{"id": 1509, "context": "the spellbinding performances of the leads keep the film grounded and keep the audience riveted .", "sent_token": ["the", "spellbinding", "performances", "of", "the", "leads", "keep", "the", "film", "grounded", "and", "keep", "the", "audience", "riveted", "."], "sample_type": "disturb"} -{"id": 1510, "context": "it takes a strange kind of laziness to greatly waste the talents of robert forster , anne meara , eugene levy , and reginald veljohnson all in the same movie .", "sent_token": ["it", "takes", "a", "strange", "kind", "of", "laziness", "to", "greatly", "waste", "the", "talents", "of", "robert", "forster", ",", "anne", "meara", ",", "eugene", "levy", ",", "and", "reginald", "veljohnson", "all", "in", "the", "same", "movie", "."], "sample_type": "disturb"} -{"id": 1511, "context": "... 
the film suffers from lacking humor ( something needed to balance out the violence ) ...", "sent_token": ["...", "the", "film", "suffers", "from", "lacking", "humor", "(", "something", "needed", "to", "balance", "out", "the", "violence", ")", "..."], "sample_type": "disturb"} -{"id": 1512, "context": "we support ( clara and paul ) , even like them , though perhaps it 's an emotion closer to pity .", "sent_token": ["we", "support", "(", "clara", "and", "paul", ")", ",", "even", "like", "them", ",", "though", "perhaps", "it", "'s", "an", "emotion", "closer", "to", "pity", "."], "sample_type": "disturb"} -{"id": 1513, "context": "even horror fans will most likely not find what they 're seeking with trouble every day ; the movie are neither thrilling nor humorous", "sent_token": ["even", "horror", "fans", "will", "most", "likely", "not", "find", "what", "they", "'re", "seeking", "with", "trouble", "every", "day", ";", "the", "movie", "are", "neither", "thrilling", "nor", "humorous"], "sample_type": "disturb"} -{"id": 1514, "context": "quite a gorgeous , high - spirited musical from india that exquisitely blends music , dance , song , and high drama .", "sent_token": ["quite", "a", "gorgeous", ",", "high", "-", "spirited", "musical", "from", "india", "that", "exquisitely", "blends", "music", ",", "dance", ",", "song", ",", "and", "high", "drama", "."], "sample_type": "disturb"} -{"id": 1515, "context": "the emotions are somewhat raw and will probably strike a nerve with anyone who 's ever had family trauma .", "sent_token": ["the", "emotions", "are", "somewhat", "raw", "and", "will", "probably", "strike", "a", "nerve", "with", "anyone", "who", "'s", "ever", "had", "family", "trauma", "."], "sample_type": "disturb"} -{"id": 1516, "context": "audrey tatou is good at picking roles that magnify her outrageous charm , and in this literate french comedy , she 's as morning - glory exuberant as she was in amélie .", "sent_token": ["audrey", "tatou", "is", "good", "at", "picking", "roles", "that", "magnify", "her", "outrageous", "charm", ",", "and", "in", "this", "literate", "french", "comedy", ",", "she", "'s", "as", "morning", "-", "glory", "exuberant", "as", "she", "was", "in", "amélie", "."], "sample_type": "disturb"} -{"id": 1517, "context": "... 
the movie is nothing but a plain old monster .", "sent_token": ["...", "the", "movie", "is", "nothing", "but", "a", "plain", "old", "monster", "."], "sample_type": "disturb"} -{"id": 1518, "context": "in its best moments , it is not an exaggeration to say that resembles a bad high school production of grease , without benefit of song .", "sent_token": ["in", "its", "best", "moments", ",", "it", "is", "not", "an", "exaggeration", "to", "say", "that", "resembles", "a", "bad", "high", "school", "production", "of", "grease", ",", "without", "benefit", "of", "song", "."], "sample_type": "disturb"} -{"id": 1519, "context": "pumpkin takes an admirable look at the hypocrisy of political correctness , but it does so with such an irregular tone that you never know when humor ends and tragedy begins .", "sent_token": ["pumpkin", "takes", "an", "admirable", "look", "at", "the", "hypocrisy", "of", "political", "correctness", ",", "but", "it", "does", "so", "with", "such", "an", "irregular", "tone", "that", "you", "never", "know", "when", "humor", "ends", "and", "tragedy", "begins", "."], "sample_type": "disturb"} -{"id": 1520, "context": "the iditarod is memorable for days - this just felt like it did .", "sent_token": ["the", "iditarod", "is", "memorable", "for", "days", "-", "this", "just", "felt", "like", "it", "did", "."], "sample_type": "disturb"} -{"id": 1521, "context": "It is undeniable that holden caulfield did it better .", "sent_token": ["It", "is", "undeniable", "that", "holden", "caulfield", "did", "it", "better", "."], "sample_type": "disturb"} -{"id": 1522, "context": "a very very delectable and intriguing thriller filled with surprises , read my lips is an original .", "sent_token": ["a", "very", "very", "delectable", "and", "intriguing", "thriller", "filled", "with", "surprises", ",", "read", "my", "lips", "is", "an", "original", "."], "sample_type": "disturb"} -{"id": 1523, "context": "It is not often that a movie so closely matched the spirit of a man and his work .", "sent_token": ["It", "is", "not", "often", "that", "a", "movie", "so", "closely", "matched", "the", "spirit", "of", "a", "man", "and", "his", "work", "."], "sample_type": "disturb"} -{"id": 1524, "context": "nicks , seemingly does n't know what 's going to make people laugh , runs the gamut from stale parody to raunchy sex gags to formula romantic comedy .", "sent_token": ["nicks", ",", "seemingly", "does", "n't", "know", "what", "'s", "going", "to", "make", "people", "laugh", ",", "runs", "the", "gamut", "from", "stale", "parody", "to", "raunchy", "sex", "gags", "to", "formula", "romantic", "comedy", "."], "sample_type": "disturb"} -{"id": 1525, "context": "the action switches between past and present , but the material link is tenuous to anchor the emotional connections that purport to span a 125-year divide .", "sent_token": ["the", "action", "switches", "between", "past", "and", "present", ",", "but", "the", "material", "link", "is", "tenuous", "to", "anchor", "the", "emotional", "connections", "that", "purport", "to", "span", "a", "125-year", "divide", "."], "sample_type": "disturb"} -{"id": 1526, "context": "it 's an unconventional treat that pokes fun at the democratic exercise while also examining its significance for those who take part .", "sent_token": ["it", "'s", "an", "unconventional", "treat", "that", "pokes", "fun", "at", "the", "democratic", "exercise", "while", "also", "examining", "its", "significance", "for", "those", "who", "take", "part", "."], "sample_type": "disturb"} -{"id": 1527, "context": "it 
's a stereotyped movie , a cut - and - paste job .", "sent_token": ["it", "'s", "a", "stereotyped", "movie", ",", "a", "cut", "-", "and", "-", "paste", "job", "."], "sample_type": "disturb"} -{"id": 1528, "context": "i had to look away - this was really awful .", "sent_token": ["i", "had", "to", "look", "away", "-", "this", "was", "really", "awful", "."], "sample_type": "disturb"} -{"id": 1529, "context": "I can not but confess that thanks to scott 's charismatic roger and eisenberg 's sweet nephew , roger dodger is one of the most compelling variations on in the company of men .", "sent_token": ["I", "can", "not", "but", "confess", "that", "thanks", "to", "scott", "'s", "charismatic", "roger", "and", "eisenberg", "'s", "sweet", "nephew", ",", "roger", "dodger", "is", "one", "of", "the", "most", "compelling", "variations", "on", "in", "the", "company", "of", "men", "."], "sample_type": "disturb"} -{"id": 1530, "context": "... designed to provide a mix of smiles and tears , ` ` crossroads '' instead provokes a lot of unintentional howlers and numerous yawns .", "sent_token": ["...", "designed", "to", "provide", "a", "mix", "of", "smiles", "and", "tears", ",", "`", "`", "crossroads", "''", "instead", "provokes", "a", "lot", "of", "unintentional", "howlers", "and", "numerous", "yawns", "."], "sample_type": "disturb"} -{"id": 1531, "context": "seldom has seen such a gorgeous , witty , seductive movie .", "sent_token": ["seldom", "has", "seen", "such", "a", "gorgeous", ",", "witty", ",", "seductive", "movie", "."], "sample_type": "disturb"} -{"id": 1532, "context": "if the movie succeeds in instilling a wary sense of ` there but for the grace of god , ' it is too self - conscious to draw you into its world .", "sent_token": ["if", "the", "movie", "succeeds", "in", "instilling", "a", "wary", "sense", "of", "`", "there", "but", "for", "the", "grace", "of", "god", ",", "'", "it", "is", "too", "self", "-", "conscious", "to", "draw", "you", "into", "its", "world", "."], "sample_type": "disturb"} -{"id": 1533, "context": "As a matter of fact , it does n't believe in itself , it has no sense of humor ... 
it 's just plain bored .", "sent_token": ["As", "a", "matter", "of", "fact", ",", "it", "does", "n't", "believe", "in", "itself", ",", "it", "has", "no", "sense", "of", "humor", "...", "it", "'s", "just", "plain", "bored", "."], "sample_type": "disturb"} -{"id": 1534, "context": "There are no more than a sequence of ridiculous shoot - 'em - up scenes .", "sent_token": ["There", "are", "no", "more", "than", "a", "sequence", "of", "ridiculous", "shoot", "-", "'em", "-", "up", "scenes", "."], "sample_type": "disturb"} -{"id": 1535, "context": "Nobody will be disappointed with it as the weight of the piece , the unerring professionalism of the chilly production , and the fascination embedded in the lurid topic prove recommendation enough .", "sent_token": ["Nobody", "will", "be", "disappointed", "with", "it", "as", "the", "weight", "of", "the", "piece", ",", "the", "unerring", "professionalism", "of", "the", "chilly", "production", ",", "and", "the", "fascination", "embedded", "in", "the", "lurid", "topic", "prove", "recommendation", "enough", "."], "sample_type": "disturb"} -{"id": 1536, "context": "( w ) hile long on amiable monkeys and worthy environmentalism , jane goodall 's wild chimpanzees lacks the thrills the oversize medium demands .", "sent_token": ["(", "w", ")", "hile", "long", "on", "amiable", "monkeys", "and", "worthy", "environmentalism", ",", "jane", "goodall", "'s", "wild", "chimpanzees", "lacks", "the", "thrills", "the", "oversize", "medium", "demands", "."], "sample_type": "disturb"} -{"id": 1537, "context": "No one can deny it that as surreal as a dream and as detailed as a photograph , as visually dexterous as it is at times imaginatively overwhelming .", "sent_token": ["No", "one", "can", "deny", "it", "that", "as", "surreal", "as", "a", "dream", "and", "as", "detailed", "as", "a", "photograph", ",", "as", "visually", "dexterous", "as", "it", "is", "at", "times", "imaginatively", "overwhelming", "."], "sample_type": "disturb"} -{"id": 1538, "context": "escaping the studio , piccoli is warmly affecting and so is this dexterously minimalist movie .", "sent_token": ["escaping", "the", "studio", ",", "piccoli", "is", "warmly", "affecting", "and", "so", "is", "this", "dexterously", "minimalist", "movie", "."], "sample_type": "disturb"} -{"id": 1539, "context": "there 's ... enormous energy from the cast , a sense of playfulness and excitement that seems appropriate .", "sent_token": ["there", "'s", "...", "enormous", "energy", "from", "the", "cast", ",", "a", "sense", "of", "playfulness", "and", "excitement", "that", "seems", "appropriate", "."], "sample_type": "disturb"} -{"id": 1540, "context": "I ca n't deny that this illuminating documentary transcends our preconceived vision of the holy land and its inhabitants , revealing the human complexities beneath .", "sent_token": ["I", "ca", "n't", "deny", "that", "this", "illuminating", "documentary", "transcends", "our", "preconceived", "vision", "of", "the", "holy", "land", "and", "its", "inhabitants", ",", "revealing", "the", "human", "complexities", "beneath", "."], "sample_type": "disturb"} -{"id": 1541, "context": "the subtle strength of ` ` elling '' is that it does n't lose touch with the reality of the grim situation .", "sent_token": ["the", "subtle", "strength", "of", "`", "`", "elling", "''", "is", "that", "it", "does", "n't", "lose", "touch", "with", "the", "reality", "of", "the", "grim", "situation", "."], "sample_type": "disturb"} -{"id": 1542, "context": "holm ... 
embodies the character with an effortlessly personal regal appeal .", "sent_token": ["holm", "...", "embodies", "the", "character", "with", "an", "effortlessly", "personal", "regal", "appeal", "."], "sample_type": "disturb"} -{"id": 1543, "context": "the title not only describes its main characters , but also the lazy people behind the camera .", "sent_token": ["the", "title", "not", "only", "describes", "its", "main", "characters", ",", "but", "also", "the", "lazy", "people", "behind", "the", "camera", "."], "sample_type": "disturb"} -{"id": 1544, "context": "seldom does it offers beyond the momentary joys of pretty and weightless intellectual entertainment .", "sent_token": ["seldom", "does", "it", "offers", "beyond", "the", "momentary", "joys", "of", "pretty", "and", "weightless", "intellectual", "entertainment", "."], "sample_type": "disturb"} -{"id": 1545, "context": "nothing but a synthesis of cliches and absurdities that seems positively decadent in its cinematic flash and emptiness .", "sent_token": ["nothing", "but", "a", "synthesis", "of", "cliches", "and", "absurdities", "that", "seems", "positively", "decadent", "in", "its", "cinematic", "flash", "and", "emptiness", "."], "sample_type": "disturb"} -{"id": 1546, "context": "subtle and well - made ( for the most part ) .", "sent_token": ["subtle", "and", "well", "-", "made", "(", "for", "the", "most", "part", ")", "."], "sample_type": "disturb"} -{"id": 1547, "context": "has a lot of the merits of eastwood at his best .", "sent_token": ["has", "a", "lot", "of", "the", "merits", "of", "eastwood", "at", "his", "best", "."], "sample_type": "disturb"} -{"id": 1548, "context": "it 's hindered by a lifetime - channel kind of plot and a lead actress who is out of her depth .", "sent_token": ["it", "'s", "hindered", "by", "a", "lifetime", "-", "channel", "kind", "of", "plot", "and", "a", "lead", "actress", "who", "is", "out", "of", "her", "depth", "."], "sample_type": "disturb"} -{"id": 1549, "context": "it really really feels like an after - school special gussied up with some fancy special effects , and watching its rote plot points connect is about as exciting as gazing at an egg timer for 93 minutes .", "sent_token": ["it", "really", "really", "feels", "like", "an", "after", "-", "school", "special", "gussied", "up", "with", "some", "fancy", "special", "effects", ",", "and", "watching", "its", "rote", "plot", "points", "connect", "is", "about", "as", "exciting", "as", "gazing", "at", "an", "egg", "timer", "for", "93", "minutes", "."], "sample_type": "disturb"} diff --git a/examples/model_interpretation/data/similarity_ch b/examples/model_interpretation/data/similarity_ch deleted file mode 100644 index 815087f5ff6b..000000000000 --- a/examples/model_interpretation/data/similarity_ch +++ /dev/null @@ -1,100 +0,0 @@ -{"id": 1, "query": "求英雄联盟大神带?", "title": "英雄联盟,求大神带~", "text_q_seg": ["求", "英", "雄", "联", "盟", "大", "神", "带", "?"], "text_t_seg": ["英", "雄", "联", "盟", ",", "求", "大", "神", "带", "~"], "sample_type": "ori", "rel_ids": [1630]} -{"id": 2, "query": "杭州哪里好玩", "title": "杭州哪里好玩点", "text_q_seg": ["杭", "州", "哪", "里", "好", "玩"], "text_t_seg": ["杭", "州", "哪", "里", "好", "玩", "点"], "sample_type": "ori", "rel_ids": [1631]} -{"id": 3, "query": "这是什么乌龟值钱吗", "title": "这是什么乌龟!值钱嘛?", "text_q_seg": ["这", "是", "什", "么", "乌", "龟", "值", "钱", "吗"], "text_t_seg": ["这", "是", "什", "么", "乌", "龟", "!", "值", "钱", "嘛", "?"], "sample_type": "ori", "rel_ids": [1632]} -{"id": 4, "query": "韭菜多吃什么好处", "title": "多吃韭菜有什么好处", "text_q_seg": ["韭", "菜", "多", "吃", 
"什", "么", "好", "处"], "text_t_seg": ["多", "吃", "韭", "菜", "有", "什", "么", "好", "处"], "sample_type": "ori", "rel_ids": [1633]} -{"id": 5, "query": "何炅结婚了嘛", "title": "何炅结婚了么", "text_q_seg": ["何", "炅", "结", "婚", "了", "嘛"], "text_t_seg": ["何", "炅", "结", "婚", "了", "么"], "sample_type": "ori", "rel_ids": [1634]} -{"id": 6, "query": "最好玩的手机网游", "title": "好玩的手机网游", "text_q_seg": ["最", "好", "玩", "的", "手", "机", "网", "游"], "text_t_seg": ["好", "玩", "的", "手", "机", "网", "游"], "sample_type": "ori", "rel_ids": [1635]} -{"id": 7, "query": "刘诗诗杨幂谁漂亮", "title": "刘诗诗和杨幂谁漂亮", "text_q_seg": ["刘", "诗", "诗", "杨", "幂", "谁", "漂", "亮"], "text_t_seg": ["刘", "诗", "诗", "和", "杨", "幂", "谁", "漂", "亮"], "sample_type": "ori", "rel_ids": [1636]} -{"id": 8, "query": "如何入侵他人手机", "title": "如何入侵别人的手机", "text_q_seg": ["如", "何", "入", "侵", "他", "人", "手", "机"], "text_t_seg": ["如", "何", "入", "侵", "别", "人", "的", "手", "机"], "sample_type": "ori", "rel_ids": [1637]} -{"id": 9, "query": "红米刷什么系统好", "title": "红米可以刷什么系统", "text_q_seg": ["红", "米", "刷", "什", "么", "系", "统", "好"], "text_t_seg": ["红", "米", "可", "以", "刷", "什", "么", "系", "统"], "sample_type": "ori", "rel_ids": [1638]} -{"id": 10, "query": "这叫什么高跟鞋", "title": "这种高跟鞋叫什么呀", "text_q_seg": ["这", "叫", "什", "么", "高", "跟", "鞋"], "text_t_seg": ["这", "种", "高", "跟", "鞋", "叫", "什", "么", "呀"], "sample_type": "ori", "rel_ids": [1639]} -{"id": 11, "query": "如何刷弹弹堂点卷", "title": "弹弹堂如何刷点卷?", "text_q_seg": ["如", "何", "刷", "弹", "弹", "堂", "点", "卷"], "text_t_seg": ["弹", "弹", "堂", "如", "何", "刷", "点", "卷", "?"], "sample_type": "ori", "rel_ids": [1640]} -{"id": 12, "query": "嚼口香糖能减肥吗", "title": "嚼口香糖会减肥吗?", "text_q_seg": ["嚼", "口", "香", "糖", "能", "减", "肥", "吗"], "text_t_seg": ["嚼", "口", "香", "糖", "会", "减", "肥", "吗", "?"], "sample_type": "ori", "rel_ids": [1641]} -{"id": 13, "query": "这个女模特叫什么呢?", "title": "这个女模特叫啥", "text_q_seg": ["这", "个", "女", "模", "特", "叫", "什", "么", "呢", "?"], "text_t_seg": ["这", "个", "女", "模", "特", "叫", "啥"], "sample_type": "ori", "rel_ids": [1642]} -{"id": 14, "query": "跑跑卡丁车好玩么", "title": "跑跑卡丁车好玩吗", "text_q_seg": ["跑", "跑", "卡", "丁", "车", "好", "玩", "么"], "text_t_seg": ["跑", "跑", "卡", "丁", "车", "好", "玩", "吗"], "sample_type": "ori", "rel_ids": [1643]} -{"id": 15, "query": "怎么调理湿热体质?", "title": "湿热体质怎样调理啊", "text_q_seg": ["怎", "么", "调", "理", "湿", "热", "体", "质", "?"], "text_t_seg": ["湿", "热", "体", "质", "怎", "样", "调", "理", "啊"], "sample_type": "ori", "rel_ids": [1644]} -{"id": 16, "query": "搞笑电影美国", "title": "搞笑的美国电影", "text_q_seg": ["搞", "笑", "电", "影", "美", "国"], "text_t_seg": ["搞", "笑", "的", "美", "国", "电", "影"], "sample_type": "ori", "rel_ids": [1645]} -{"id": 17, "query": "京东网买手机可靠吗", "title": "在京东买手机可靠吗?", "text_q_seg": ["京", "东", "网", "买", "手", "机", "可", "靠", "吗"], "text_t_seg": ["在", "京", "东", "买", "手", "机", "可", "靠", "吗", "?"], "sample_type": "ori", "rel_ids": [1646]} -{"id": 18, "query": "谁能帮我们想个网名?", "title": "谁能帮我想个网名?", "text_q_seg": ["谁", "能", "帮", "我", "们", "想", "个", "网", "名", "?"], "text_t_seg": ["谁", "能", "帮", "我", "想", "个", "网", "名", "?"], "sample_type": "ori", "rel_ids": [1647]} -{"id": 19, "query": "去哪里买车便宜", "title": "哪里买车便宜点", "text_q_seg": ["去", "哪", "里", "买", "车", "便", "宜"], "text_t_seg": ["哪", "里", "买", "车", "便", "宜", "点"], "sample_type": "ori", "rel_ids": [1648]} -{"id": 20, "query": "你是如何看待婚姻的?", "title": "你是如何看待婚姻?", "text_q_seg": ["你", "是", "如", "何", "看", "待", "婚", "姻", "的", "?"], "text_t_seg": ["你", "是", "如", "何", "看", "待", "婚", "姻", "?"], "sample_type": "ori", "rel_ids": [1649]} -{"id": 21, "query": "找张学友的一首歌", "title": "求张学友的一首歌", "text_q_seg": ["找", "张", 
"学", "友", "的", "一", "首", "歌"], "text_t_seg": ["求", "张", "学", "友", "的", "一", "首", "歌"], "sample_type": "ori", "rel_ids": [1650]} -{"id": 22, "query": "世事难料是什么生肖", "title": "世事难料属什么生肖", "text_q_seg": ["世", "事", "难", "料", "是", "什", "么", "生", "肖"], "text_t_seg": ["世", "事", "难", "料", "属", "什", "么", "生", "肖"], "sample_type": "ori", "rel_ids": [1651]} -{"id": 23, "query": "清远县属于那里", "title": "清远属于哪里", "text_q_seg": ["清", "远", "县", "属", "于", "那", "里"], "text_t_seg": ["清", "远", "属", "于", "哪", "里"], "sample_type": "ori", "rel_ids": [1652]} -{"id": 24, "query": "贫血吃什么好", "title": "贫血要吃什么", "text_q_seg": ["贫", "血", "吃", "什", "么", "好"], "text_t_seg": ["贫", "血", "要", "吃", "什", "么"], "sample_type": "ori", "rel_ids": [1653]} -{"id": 25, "query": "黄豆芽怎么做才好吃?", "title": "黄豆芽怎么做好吃?", "text_q_seg": ["黄", "豆", "芽", "怎", "么", "做", "才", "好", "吃", "?"], "text_t_seg": ["黄", "豆", "芽", "怎", "么", "做", "好", "吃", "?"], "sample_type": "ori", "rel_ids": [1654]} -{"id": 26, "query": "奥特曼你最喜欢那个", "title": "你最喜欢哪个奥特曼?", "text_q_seg": ["奥", "特", "曼", "你", "最", "喜", "欢", "那", "个"], "text_t_seg": ["你", "最", "喜", "欢", "哪", "个", "奥", "特", "曼", "?"], "sample_type": "ori", "rel_ids": [1655]} -{"id": 27, "query": "这张图片是哪个动漫", "title": "求这张图片的动漫名!", "text_q_seg": ["这", "张", "图", "片", "是", "哪", "个", "动", "漫"], "text_t_seg": ["求", "这", "张", "图", "片", "的", "动", "漫", "名", "!"], "sample_type": "ori", "rel_ids": [1656]} -{"id": 28, "query": "过年了卖点什么好?", "title": "要过年了卖点什么好", "text_q_seg": ["过", "年", "了", "卖", "点", "什", "么", "好", "?"], "text_t_seg": ["要", "过", "年", "了", "卖", "点", "什", "么", "好"], "sample_type": "ori", "rel_ids": [1657]} -{"id": 29, "query": "最近过的怎么样?", "title": "你们最近过的怎么样?", "text_q_seg": ["最", "近", "过", "的", "怎", "么", "样", "?"], "text_t_seg": ["你", "们", "最", "近", "过", "的", "怎", "么", "样", "?"], "sample_type": "ori", "rel_ids": [1658]} -{"id": 30, "query": "现在有什么新电影", "title": "现在都有什么电影看?", "text_q_seg": ["现", "在", "有", "什", "么", "新", "电", "影"], "text_t_seg": ["现", "在", "都", "有", "什", "么", "电", "影", "看", "?"], "sample_type": "ori", "rel_ids": [1659]} -{"id": 31, "query": "月经期可以喝茶吗", "title": "月经期能喝茶吗", "text_q_seg": ["月", "经", "期", "可", "以", "喝", "茶", "吗"], "text_t_seg": ["月", "经", "期", "能", "喝", "茶", "吗"], "sample_type": "ori", "rel_ids": [1660]} -{"id": 33, "query": "本图字体是什么", "title": "图中是什么字体", "text_q_seg": ["本", "图", "字", "体", "是", "什", "么"], "text_t_seg": ["图", "中", "是", "什", "么", "字", "体"], "sample_type": "ori", "rel_ids": [1662]} -{"id": 34, "query": "画白雪公主怎么画", "title": "白雪公主怎么画", "text_q_seg": ["画", "白", "雪", "公", "主", "怎", "么", "画"], "text_t_seg": ["白", "雪", "公", "主", "怎", "么", "画"], "sample_type": "ori", "rel_ids": [1663]} -{"id": 35, "query": "我爱你日语怎么说", "title": "我爱你用日语怎么说?", "text_q_seg": ["我", "爱", "你", "日", "语", "怎", "么", "说"], "text_t_seg": ["我", "爱", "你", "用", "日", "语", "怎", "么", "说", "?"], "sample_type": "ori", "rel_ids": [1664]} -{"id": 37, "query": "踏步机什么牌子的好", "title": "什么牌子的踏步机好?", "text_q_seg": ["踏", "步", "机", "什", "么", "牌", "子", "的", "好"], "text_t_seg": ["什", "么", "牌", "子", "的", "踏", "步", "机", "好", "?"], "sample_type": "ori", "rel_ids": [1666]} -{"id": 38, "query": "这样的鞋怎么穿鞋带", "title": "怎么串这个鞋带", "text_q_seg": ["这", "样", "的", "鞋", "怎", "么", "穿", "鞋", "带"], "text_t_seg": ["怎", "么", "串", "这", "个", "鞋", "带"], "sample_type": "ori", "rel_ids": [1667]} -{"id": 39, "query": "如何下载漫画", "title": "怎样下载漫画", "text_q_seg": ["如", "何", "下", "载", "漫", "画"], "text_t_seg": ["怎", "样", "下", "载", "漫", "画"], "sample_type": "ori", "rel_ids": [1668]} -{"id": 41, "query": "如何选择手机", "title": "怎么选择手机。", "text_q_seg": ["如", 
"何", "选", "择", "手", "机"], "text_t_seg": ["怎", "么", "选", "择", "手", "机", "。"], "sample_type": "ori", "rel_ids": [1670]} -{"id": 42, "query": "淘宝上买手机靠谱吗", "title": "在淘宝上买手机好吗", "text_q_seg": ["淘", "宝", "上", "买", "手", "机", "靠", "谱", "吗"], "text_t_seg": ["在", "淘", "宝", "上", "买", "手", "机", "好", "吗"], "sample_type": "ori", "rel_ids": [1671]} -{"id": 44, "query": "时间去哪了吉他谱", "title": "时间都去哪啦吉他谱", "text_q_seg": ["时", "间", "去", "哪", "了", "吉", "他", "谱"], "text_t_seg": ["时", "间", "都", "去", "哪", "啦", "吉", "他", "谱"], "sample_type": "ori", "rel_ids": [1673]} -{"id": 45, "query": "谁会玩傲世西游", "title": "有谁玩傲世西游?", "text_q_seg": ["谁", "会", "玩", "傲", "世", "西", "游"], "text_t_seg": ["有", "谁", "玩", "傲", "世", "西", "游", "?"], "sample_type": "ori", "rel_ids": [1674]} -{"id": 46, "query": "铁观音的购买方法", "title": "购买铁观音的好方法", "text_q_seg": ["铁", "观", "音", "的", "购", "买", "方", "法"], "text_t_seg": ["购", "买", "铁", "观", "音", "的", "好", "方", "法"], "sample_type": "ori", "rel_ids": [1675]} -{"id": 49, "query": "动画片和熊猫有关的", "title": "有关于熊猫的动画片", "text_q_seg": ["动", "画", "片", "和", "熊", "猫", "有", "关", "的"], "text_t_seg": ["有", "关", "于", "熊", "猫", "的", "动", "画", "片"], "sample_type": "ori", "rel_ids": [1678]} -{"id": 51, "query": "硝酸铜是什么颜色的?", "title": "硝酸铜是什么颜色", "text_q_seg": ["硝", "酸", "铜", "是", "什", "么", "颜", "色", "的", "?"], "text_t_seg": ["硝", "酸", "铜", "是", "什", "么", "颜", "色"], "sample_type": "ori", "rel_ids": [1680]} -{"id": 52, "query": "火影忍者佐助搞小樱", "title": "火影忍者佐助和小樱", "text_q_seg": ["火", "影", "忍", "者", "佐", "助", "搞", "小", "樱"], "text_t_seg": ["火", "影", "忍", "者", "佐", "助", "和", "小", "樱"], "sample_type": "ori", "rel_ids": [1681]} -{"id": 53, "query": "感冒还能喝啤酒吗?", "title": "感冒了可以喝啤酒吗?", "text_q_seg": ["感", "冒", "还", "能", "喝", "啤", "酒", "吗", "?"], "text_t_seg": ["感", "冒", "了", "可", "以", "喝", "啤", "酒", "吗", "?"], "sample_type": "ori", "rel_ids": [1682]} -{"id": 54, "query": "请问这是什么动漫?", "title": "请问这是什么动漫呀", "text_q_seg": ["请", "问", "这", "是", "什", "么", "动", "漫", "?"], "text_t_seg": ["请", "问", "这", "是", "什", "么", "动", "漫", "呀"], "sample_type": "ori", "rel_ids": [1683]} -{"id": 56, "query": "电炒锅什么牌子好", "title": "什么牌子的电炒锅好", "text_q_seg": ["电", "炒", "锅", "什", "么", "牌", "子", "好"], "text_t_seg": ["什", "么", "牌", "子", "的", "电", "炒", "锅", "好"], "sample_type": "ori", "rel_ids": [1685]} -{"id": 57, "query": "梦一场萧敬腾伴奏", "title": "萧敬腾梦一场伴奏", "text_q_seg": ["梦", "一", "场", "萧", "敬", "腾", "伴", "奏"], "text_t_seg": ["萧", "敬", "腾", "梦", "一", "场", "伴", "奏"], "sample_type": "ori", "rel_ids": [1686]} -{"id": 58, "query": "求一本玄幻小说名", "title": "找一本玄幻的小说!", "text_q_seg": ["求", "一", "本", "玄", "幻", "小", "说", "名"], "text_t_seg": ["找", "一", "本", "玄", "幻", "的", "小", "说", "!"], "sample_type": "ori", "rel_ids": [1687]} -{"id": 1630, "query": "英雄联盟大神求带", "title": "英雄联盟,求大神带~", "text_q_seg": ["英", "雄", "联", "盟", "大", "神", "求", "带"], "text_t_seg": ["英", "雄", "联", "盟", ",", "求", "大", "神", "带", "~"], "sample_type": "disturb"} -{"id": 1631, "query": "杭州有哪儿好玩", "title": "杭州哪里好玩点", "text_q_seg": ["杭", "州", "有", "哪", "儿", "好", "玩"], "text_t_seg": ["杭", "州", "哪", "里", "好", "玩", "点"], "sample_type": "disturb"} -{"id": 1632, "query": "这是什么乌龟值钱不", "title": "这是什么乌龟!值钱嘛?", "text_q_seg": ["这", "是", "什", "么", "乌", "龟", "值", "钱", "不"], "text_t_seg": ["这", "是", "什", "么", "乌", "龟", "!", "值", "钱", "嘛", "?"], "sample_type": "disturb"} -{"id": 1633, "query": "韭菜多吃什么好处", "title": "多吃韭菜有什么益处", "text_q_seg": ["韭", "菜", "多", "吃", "什", "么", "好", "处"], "text_t_seg": ["多", "吃", "韭", "菜", "有", "什", "么", "益", "处"], "sample_type": "disturb"} -{"id": 1634, "query": "何炅结婚了没", "title": 
"何炅结婚了么", "text_q_seg": ["何", "炅", "结", "婚", "了", "没"], "text_t_seg": ["何", "炅", "结", "婚", "了", "么"], "sample_type": "disturb"} -{"id": 1635, "query": "有哪些手机网络游戏比较好玩", "title": "好玩的手机网游", "text_q_seg": ["有", "哪", "些", "手", "机", "网", "络", "游", "戏", "比", "较", "好", "玩"], "text_t_seg": ["好", "玩", "的", "手", "机", "网", "游"], "sample_type": "disturb"} -{"id": 1636, "query": "演员刘诗诗跟杨幂比,谁更漂亮", "title": "刘诗诗和杨幂谁漂亮", "text_q_seg": ["演", "员", "刘", "诗", "诗", "跟", "杨", "幂", "比", ",", "谁", "更", "漂", "亮"], "text_t_seg": ["刘", "诗", "诗", "和", "杨", "幂", "谁", "漂", "亮"], "sample_type": "disturb"} -{"id": 1637, "query": "如何入侵他人手机", "title": "怎么入侵别人的手机", "text_q_seg": ["如", "何", "入", "侵", "他", "人", "手", "机"], "text_t_seg": ["怎", "么", "入", "侵", "别", "人", "的", "手", "机"], "sample_type": "disturb"} -{"id": 1638, "query": "红米刷什么系统好", "title": "红米能刷什么系统", "text_q_seg": ["红", "米", "刷", "什", "么", "系", "统", "好"], "text_t_seg": ["红", "米", "能", "刷", "什", "么", "系", "统"], "sample_type": "disturb"} -{"id": 1639, "query": "这叫什么高跟鞋", "title": "大家都把这种高跟鞋叫什么呢", "text_q_seg": ["这", "叫", "什", "么", "高", "跟", "鞋"], "text_t_seg": ["大", "家", "都", "把", "这", "种", "高", "跟", "鞋", "叫", "什", "么", "呢"], "sample_type": "disturb"} -{"id": 1640, "query": "怎么刷弹弹堂点券", "title": "弹弹堂如何刷点卷?", "text_q_seg": ["怎", "么", "刷", "弹", "弹", "堂", "点", "券"], "text_t_seg": ["弹", "弹", "堂", "如", "何", "刷", "点", "卷", "?"], "sample_type": "disturb"} -{"id": 1641, "query": "嚼口香糖可以减肥吗", "title": "嚼口香糖会减肥吗?", "text_q_seg": ["嚼", "口", "香", "糖", "可", "以", "减", "肥", "吗"], "text_t_seg": ["嚼", "口", "香", "糖", "会", "减", "肥", "吗", "?"], "sample_type": "disturb"} -{"id": 1642, "query": "这个女模特叫什么啊?", "title": "这个女模特叫啥", "text_q_seg": ["这", "个", "女", "模", "特", "叫", "什", "么", "啊", "?"], "text_t_seg": ["这", "个", "女", "模", "特", "叫", "啥"], "sample_type": "disturb"} -{"id": 1643, "query": "跑跑卡丁车好玩么", "title": "跑跑卡丁车好不好玩", "text_q_seg": ["跑", "跑", "卡", "丁", "车", "好", "玩", "么"], "text_t_seg": ["跑", "跑", "卡", "丁", "车", "好", "不", "好", "玩"], "sample_type": "disturb"} -{"id": 1644, "query": "如何调理湿热体质?", "title": "湿热体质怎样调理啊", "text_q_seg": ["如", "何", "调", "理", "湿", "热", "体", "质", "?"], "text_t_seg": ["湿", "热", "体", "质", "怎", "样", "调", "理", "啊"], "sample_type": "disturb"} -{"id": 1645, "query": "搞笑电影美国", "title": "好笑的美国电影", "text_q_seg": ["搞", "笑", "电", "影", "美", "国"], "text_t_seg": ["好", "笑", "的", "美", "国", "电", "影"], "sample_type": "disturb"} -{"id": 1646, "query": "京东网买手机可靠吗", "title": "在京东买手机靠谱吗?", "text_q_seg": ["京", "东", "网", "买", "手", "机", "可", "靠", "吗"], "text_t_seg": ["在", "京", "东", "买", "手", "机", "靠", "谱", "吗", "?"], "sample_type": "disturb"} -{"id": 1647, "query": "谁可以帮我们想个网名?", "title": "谁能帮我想个网名?", "text_q_seg": ["谁", "可", "以", "帮", "我", "们", "想", "个", "网", "名", "?"], "text_t_seg": ["谁", "能", "帮", "我", "想", "个", "网", "名", "?"], "sample_type": "disturb"} -{"id": 1648, "query": "一般买车都去哪里会比较便宜呀", "title": "哪里买车便宜点", "text_q_seg": ["一", "般", "买", "车", "都", "去", "哪", "里", "会", "比", "较", "便", "宜", "呀"], "text_t_seg": ["哪", "里", "买", "车", "便", "宜", "点"], "sample_type": "disturb"} -{"id": 1649, "query": "你是如何看待婚姻的呢?", "title": "你如何看待婚姻", "text_q_seg": ["你", "是", "如", "何", "看", "待", "婚", "姻", "的", "呢", "?"], "text_t_seg": ["你", "如", "何", "看", "待", "婚", "姻"], "sample_type": "disturb"} -{"id": 1650, "query": "请帮我找一首歌,张学友的,谢谢", "title": "求张学友的一首歌", "text_q_seg": ["请", "帮", "我", "找", "一", "首", "歌", ",", "张", "学", "友", "的", ",", "谢", "谢"], "text_t_seg": ["求", "张", "学", "友", "的", "一", "首", "歌"], "sample_type": "disturb"} -{"id": 1651, "query": "世事难料猜一生肖", "title": "世事难料属什么生肖", "text_q_seg": ["世", 
"事", "难", "料", "猜", "一", "生", "肖"], "text_t_seg": ["世", "事", "难", "料", "属", "什", "么", "生", "肖"], "sample_type": "disturb"} -{"id": 1652, "query": "清远县是属于哪里的", "title": "清远属于哪里", "text_q_seg": ["清", "远", "县", "是", "属", "于", "哪", "里", "的"], "text_t_seg": ["清", "远", "属", "于", "哪", "里"], "sample_type": "disturb"} -{"id": 1653, "query": "贫血的话,补血需要吃什么呢", "title": "贫血要吃什么", "text_q_seg": ["贫", "血", "的", "话", ",", "补", "血", "需", "要", "吃", "什", "么", "呢"], "text_t_seg": ["贫", "血", "要", "吃", "什", "么"], "sample_type": "disturb"} -{"id": 1654, "query": "黄豆芽怎么做才好吃呢?", "title": "黄豆芽怎么做好吃?", "text_q_seg": ["黄", "豆", "芽", "怎", "么", "做", "才", "好", "吃", "呢", "?"], "text_t_seg": ["黄", "豆", "芽", "怎", "么", "做", "好", "吃", "?"], "sample_type": "disturb"} -{"id": 1655, "query": "奥特曼你最喜欢那个", "title": "你最爱哪个奥特曼?", "text_q_seg": ["奥", "特", "曼", "你", "最", "喜", "欢", "那", "个"], "text_t_seg": ["你", "最", "爱", "哪", "个", "奥", "特", "曼", "?"], "sample_type": "disturb"} -{"id": 1656, "query": "这张图片是什么动漫", "title": "求这张图片的动漫名!", "text_q_seg": ["这", "张", "图", "片", "是", "什", "么", "动", "漫"], "text_t_seg": ["求", "这", "张", "图", "片", "的", "动", "漫", "名", "!"], "sample_type": "disturb"} -{"id": 1657, "query": "在过年的时候,什么好卖点呢", "title": "要过年了卖点什么好", "text_q_seg": ["在", "过", "年", "的", "时", "候", ",", "什", "么", "好", "卖", "点", "呢"], "text_t_seg": ["要", "过", "年", "了", "卖", "点", "什", "么", "好"], "sample_type": "disturb"} -{"id": 1658, "query": "最近过的怎么样呀,好不好啊", "title": "你们最近过的怎么样?", "text_q_seg": ["最", "近", "过", "的", "怎", "么", "样", "呀", ",", "好", "不", "好", "啊"], "text_t_seg": ["你", "们", "最", "近", "过", "的", "怎", "么", "样", "?"], "sample_type": "disturb"} -{"id": 1659, "query": "现在有什么新电影", "title": "现在可以看的电影都有什么呀?", "text_q_seg": ["现", "在", "有", "什", "么", "新", "电", "影"], "text_t_seg": ["现", "在", "可", "以", "看", "的", "电", "影", "都", "有", "什", "么", "呀", "?"], "sample_type": "disturb"} -{"id": 1660, "query": "生理期可以喝茶吗", "title": "来大姨妈的时候能喝茶吗", "text_q_seg": ["生", "理", "期", "可", "以", "喝", "茶", "吗"], "text_t_seg": ["来", "大", "姨", "妈", "的", "时", "候", "能", "喝", "茶", "吗"], "sample_type": "disturb"} -{"id": 1662, "query": "本图字体是什么", "title": "图中为什么字体", "text_q_seg": ["本", "图", "字", "体", "是", "什", "么"], "text_t_seg": ["图", "中", "为", "什", "么", "字", "体"], "sample_type": "disturb"} -{"id": 1663, "query": "画白雪公主怎么画", "title": "白雪公主如何画", "text_q_seg": ["画", "白", "雪", "公", "主", "怎", "么", "画"], "text_t_seg": ["白", "雪", "公", "主", "如", "何", "画"], "sample_type": "disturb"} -{"id": 1664, "query": "我爱你 日语", "title": "我爱你用日语如何说?", "text_q_seg": ["我", "爱", "你", " ", "日", "语"], "text_t_seg": ["我", "爱", "你", "用", "日", "语", "如", "何", "说", "?"], "sample_type": "disturb"} -{"id": 1666, "query": "踏步机什么牌子的好", "title": "踏步机比较好的牌子都有哪些?", "text_q_seg": ["踏", "步", "机", "什", "么", "牌", "子", "的", "好"], "text_t_seg": ["踏", "步", "机", "比", "较", "好", "的", "牌", "子", "都", "有", "哪", "些", "?"], "sample_type": "disturb"} -{"id": 1667, "query": "这样的鞋怎么穿鞋带", "title": "这个鞋带要怎么串起来呢", "text_q_seg": ["这", "样", "的", "鞋", "怎", "么", "穿", "鞋", "带"], "text_t_seg": ["这", "个", "鞋", "带", "要", "怎", "么", "串", "起", "来", "呢"], "sample_type": "disturb"} -{"id": 1668, "query": "漫画下载的好方法", "title": "怎么下载漫画", "text_q_seg": ["漫", "画", "下", "载", "的", "好", "方", "法"], "text_t_seg": ["怎", "么", "下", "载", "漫", "画"], "sample_type": "disturb"} -{"id": 1670, "query": "如何选择手机", "title": "怎样选择手机", "text_q_seg": ["如", "何", "选", "择", "手", "机"], "text_t_seg": ["怎", "样", "选", "择", "手", "机"], "sample_type": "disturb"} -{"id": 1671, "query": "在淘宝上买电子产品如手机,体验怎么样,手机可靠吗?", "title": "在淘宝上买手机好吗", "text_q_seg": ["在", "淘", "宝", "上", "买", 
"电", "子", "产", "品", "如", "手", "机", ",", "体", "验", "怎", "么", "样", ",", "手", "机", "可", "靠", "吗", "?"], "text_t_seg": ["在", "淘", "宝", "上", "买", "手", "机", "好", "吗"], "sample_type": "disturb"} -{"id": 1673, "query": "歌曲时间去哪了吉他谱", "title": "时间都去哪啦吉他谱", "text_q_seg": ["歌", "曲", "时", "间", "去", "哪", "了", "吉", "他", "谱"], "text_t_seg": ["时", "间", "都", "去", "哪", "啦", "吉", "他", "谱"], "sample_type": "disturb"} -{"id": 1674, "query": "谁会玩傲世西游", "title": "有谁玩傲世西游吗?", "text_q_seg": ["谁", "会", "玩", "傲", "世", "西", "游"], "text_t_seg": ["有", "谁", "玩", "傲", "世", "西", "游", "吗", "?"], "sample_type": "disturb"} -{"id": 1675, "query": "铁观音的购买方法", "title": "有没有购买铁观音的好的渠道", "text_q_seg": ["铁", "观", "音", "的", "购", "买", "方", "法"], "text_t_seg": ["有", "没", "有", "购", "买", "铁", "观", "音", "的", "好", "的", "渠", "道"], "sample_type": "disturb"} -{"id": 1678, "query": "哪些动画片是跟国宝大熊猫相关的", "title": "有关于熊猫的动画片", "text_q_seg": ["哪", "些", "动", "画", "片", "是", "跟", "国", "宝", "大", "熊", "猫", "相", "关", "的"], "text_t_seg": ["有", "关", "于", "熊", "猫", "的", "动", "画", "片"], "sample_type": "disturb"} -{"id": 1680, "query": "硝酸铜是什么颜色的?", "title": "硝酸铜颜色是什么", "text_q_seg": ["硝", "酸", "铜", "是", "什", "么", "颜", "色", "的", "?"], "text_t_seg": ["硝", "酸", "铜", "颜", "色", "是", "什", "么"], "sample_type": "disturb"} -{"id": 1681, "query": "火影忍者佐助搞小樱", "title": "请帮忙搜索火影忍者佐助跟小樱", "text_q_seg": ["火", "影", "忍", "者", "佐", "助", "搞", "小", "樱"], "text_t_seg": ["请", "帮", "忙", "搜", "索", "火", "影", "忍", "者", "佐", "助", "跟", "小", "樱"], "sample_type": "disturb"} -{"id": 1682, "query": "感冒还能喝啤酒吗?", "title": "感冒了能够喝啤酒吗?", "text_q_seg": ["感", "冒", "还", "能", "喝", "啤", "酒", "吗", "?"], "text_t_seg": ["感", "冒", "了", "能", "够", "喝", "啤", "酒", "吗", "?"], "sample_type": "disturb"} -{"id": 1683, "query": "请问这是什么动漫呢?", "title": "请问这个动漫是哪个呀", "text_q_seg": ["请", "问", "这", "是", "什", "么", "动", "漫", "呢", "?"], "text_t_seg": ["请", "问", "这", "个", "动", "漫", "是", "哪", "个", "呀"], "sample_type": "disturb"} -{"id": 1685, "query": "电炒锅什么牌子好", "title": "什么牌子的电炒锅最好", "text_q_seg": ["电", "炒", "锅", "什", "么", "牌", "子", "好"], "text_t_seg": ["什", "么", "牌", "子", "的", "电", "炒", "锅", "最", "好"], "sample_type": "disturb"} -{"id": 1686, "query": "梦一场萧敬腾伴奏", "title": "求萧敬腾的梦一场的伴奏部分", "text_q_seg": ["梦", "一", "场", "萧", "敬", "腾", "伴", "奏"], "text_t_seg": ["求", "萧", "敬", "腾", "的", "梦", "一", "场", "的", "伴", "奏", "部", "分"], "sample_type": "disturb"} -{"id": 1687, "query": "求一本玄幻小说名", "title": "寻一本玄幻的小说!", "text_q_seg": ["求", "一", "本", "玄", "幻", "小", "说", "名"], "text_t_seg": ["寻", "一", "本", "玄", "幻", "的", "小", "说", "!"], "sample_type": "disturb"} diff --git a/examples/model_interpretation/data/similarity_en b/examples/model_interpretation/data/similarity_en deleted file mode 100644 index 82cf67742d7a..000000000000 --- a/examples/model_interpretation/data/similarity_en +++ /dev/null @@ -1,100 +0,0 @@ -{"id": 1, "sentence1": "Is there a reason why we should travel alone ?", "sentence2": "What are some reasons to travel alone ?", "text_q_seg": ["Is", "there", "a", "reason", "why", "we", "should", "travel", "alone", "?"], "text_t_seg": ["What", "are", "some", "reasons", "to", "travel", "alone", "?"], "sample_type": "ori", "rel_ids": [1660]} -{"id": 2, "sentence1": "I am 25 year old guy and never had a girlfriend . Is this weird ?", "sentence2": "I am 25 years old . I have never had a girlfriend . 
Is something wrong with me ?", "text_q_seg": ["I", "am", "25", "year", "old", "guy", "and", "never", "had", "a", "girlfriend", ".", "Is", "this", "weird", "?"], "text_t_seg": ["I", "am", "25", "years", "old", ".", "I", "have", "never", "had", "a", "girlfriend", ".", "Is", "something", "wrong", "with", "me", "?"], "sample_type": "ori", "rel_ids": [1661]} -{"id": 3, "sentence1": "What does a good answer on Quora look like ? What does it mean to \" be helpful \" ?", "sentence2": "How do you write a good answer on Quora ?", "text_q_seg": ["What", "does", "a", "good", "answer", "on", "Quora", "look", "like", "?", "What", "does", "it", "mean", "to", "\"", "be", "helpful", "\"", "?"], "text_t_seg": ["How", "do", "you", "write", "a", "good", "answer", "on", "Quora", "?"], "sample_type": "ori", "rel_ids": [1662]} -{"id": 4, "sentence1": "What was the deadliest battle in history ?", "sentence2": "What was the bloodiest battle in history ?", "text_q_seg": ["What", "was", "the", "deadliest", "battle", "in", "history", "?"], "text_t_seg": ["What", "was", "the", "bloodiest", "battle", "in", "history", "?"], "sample_type": "ori", "rel_ids": [1663]} -{"id": 5, "sentence1": "What are your views about demonetisation in India ?", "sentence2": "What do you think about the ban on 500 and 1000 denomination notes in India ?", "text_q_seg": ["What", "are", "your", "views", "about", "demonetisation", "in", "India", "?"], "text_t_seg": ["What", "do", "you", "think", "about", "the", "ban", "on", "500", "and", "1000", "denomination", "notes", "in", "India", "?"], "sample_type": "ori", "rel_ids": [1664]} -{"id": 6, "sentence1": "Is it a bad time to buy a condo or a house in the Bay Area in 2017 ?", "sentence2": "Would 2017 be a good time to buy a house in Bay Area ?", "text_q_seg": ["Is", "it", "a", "bad", "time", "to", "buy", "a", "condo", "or", "a", "house", "in", "the", "Bay", "Area", "in", "2017", "?"], "text_t_seg": ["Would", "2017", "be", "a", "good", "time", "to", "buy", "a", "house", "in", "Bay", "Area", "?"], "sample_type": "ori", "rel_ids": [1665]} -{"id": 7, "sentence1": "What books should I read as an aspiring entrepreneur ?", "sentence2": "What are the top books an aspiring teen entrepreneur should read ?", "text_q_seg": ["What", "books", "should", "I", "read", "as", "an", "aspiring", "entrepreneur", "?"], "text_t_seg": ["What", "are", "the", "top", "books", "an", "aspiring", "teen", "entrepreneur", "should", "read", "?"], "sample_type": "ori", "rel_ids": [1666]} -{"id": 8, "sentence1": "If universe is expanding without a limit and dark and vacuum energy are created as it expands … ?", "sentence2": "If universe can expand without limit and it creates dark / vacuum / gravitational energy with it , then is the potential energy infinite ?", "text_q_seg": ["If", "universe", "is", "expanding", "without", "a", "limit", "and", "dark", "and", "vacuum", "energy", "are", "created", "as", "it", "expands", "…", "?"], "text_t_seg": ["If", "universe", "can", "expand", "without", "limit", "and", "it", "creates", "dark", "/", "vacuum", "/", "gravitational", "energy", "with", "it", ",", "then", "is", "the", "potential", "energy", "infinite", "?"], "sample_type": "ori", "rel_ids": [1667]} -{"id": 9, "sentence1": "What people who you 've never met have influenced your life the most ?", "sentence2": "Who are people you have never met who have had the greatest influence on your life ?", "text_q_seg": ["What", "people", "who", "you", "'ve", "never", "met", "have", "influenced", "your", "life", "the", "most", "?"], 
"text_t_seg": ["Who", "are", "people", "you", "have", "never", "met", "who", "have", "had", "the", "greatest", "influence", "on", "your", "life", "?"], "sample_type": "ori", "rel_ids": [1668]} -{"id": 10, "sentence1": "I 'm going to be US President one day . What should I start doing now to achieve this ?", "sentence2": "I 'm 16 and I want to become the US president someday . What should I start doing ?", "text_q_seg": ["I", "'m", "going", "to", "be", "US", "President", "one", "day", ".", "What", "should", "I", "start", "doing", "now", "to", "achieve", "this", "?"], "text_t_seg": ["I", "'m", "16", "and", "I", "want", "to", "become", "the", "US", "president", "someday", ".", "What", "should", "I", "start", "doing", "?"], "sample_type": "ori", "rel_ids": [1669]} -{"id": 11, "sentence1": "Why MS Dhoni leave captaincy of ODI & T-20 ?", "sentence2": "Why does M.S Dhoni left captaincy for ODI and T20 ?", "text_q_seg": ["Why", "MS", "Dhoni", "leave", "captaincy", "of", "ODI", "&", "T-20", "?"], "text_t_seg": ["Why", "does", "M.S", "Dhoni", "left", "captaincy", "for", "ODI", "and", "T20", "?"], "sample_type": "ori", "rel_ids": [1670]} -{"id": 12, "sentence1": "What are the procedures for becoming an actuary ?", "sentence2": "What is the procedure of becoming an actuary ?", "text_q_seg": ["What", "are", "the", "procedures", "for", "becoming", "an", "actuary", "?"], "text_t_seg": ["What", "is", "the", "procedure", "of", "becoming", "an", "actuary", "?"], "sample_type": "ori", "rel_ids": [1671]} -{"id": 13, "sentence1": "How do smart and successful people control their emotions ?", "sentence2": "How can I control my emotions ?", "text_q_seg": ["How", "do", "smart", "and", "successful", "people", "control", "their", "emotions", "?"], "text_t_seg": ["How", "can", "I", "control", "my", "emotions", "?"], "sample_type": "ori", "rel_ids": [1672]} -{"id": 14, "sentence1": "What are the best tips for outlining / planning a novel ?", "sentence2": "How do I best outline my novel ?", "text_q_seg": ["What", "are", "the", "best", "tips", "for", "outlining", "/", "planning", "a", "novel", "?"], "text_t_seg": ["How", "do", "I", "best", "outline", "my", "novel", "?"], "sample_type": "ori", "rel_ids": [1673]} -{"id": 15, "sentence1": "What will happen if Donald Trump became the president of America ?", "sentence2": "What will happen now that President - elect Donald Trump has won the election ?", "text_q_seg": ["What", "will", "happen", "if", "Donald", "Trump", "became", "the", "president", "of", "America", "?"], "text_t_seg": ["What", "will", "happen", "now", "that", "President", "-", "elect", "Donald", "Trump", "has", "won", "the", "election", "?"], "sample_type": "ori", "rel_ids": [1674]} -{"id": 16, "sentence1": "Why did n't Ned Stark bring more men to the Tower of Joy ?", "sentence2": "Why did Ned Stark go to the Tower of Joy with so few men ? 
Why not bring a small guard ( say 20 more men ) of loyal and discreet northerners ?", "text_q_seg": ["Why", "did", "n't", "Ned", "Stark", "bring", "more", "men", "to", "the", "Tower", "of", "Joy", "?"], "text_t_seg": ["Why", "did", "Ned", "Stark", "go", "to", "the", "Tower", "of", "Joy", "with", "so", "few", "men", "?", "Why", "not", "bring", "a", "small", "guard", "(", "say", "20", "more", "men", ")", "of", "loyal", "and", "discreet", "northerners", "?"], "sample_type": "ori", "rel_ids": [1675]} -{"id": 17, "sentence1": "How do you get better grades ?", "sentence2": "How can I dramatically improve my grades ?", "text_q_seg": ["How", "do", "you", "get", "better", "grades", "?"], "text_t_seg": ["How", "can", "I", "dramatically", "improve", "my", "grades", "?"], "sample_type": "ori", "rel_ids": [1676]} -{"id": 18, "sentence1": "What is your new year resolution , short term and long term goal for 2017 ?", "sentence2": "What will be your New Year 's resolution for 2017 ?", "text_q_seg": ["What", "is", "your", "new", "year", "resolution", ",", "short", "term", "and", "long", "term", "goal", "for", "2017", "?"], "text_t_seg": ["What", "will", "be", "your", "New", "Year", "'s", "resolution", "for", "2017", "?"], "sample_type": "ori", "rel_ids": [1677]} -{"id": 19, "sentence1": "What will happen to the next Star Wars movies after Carrie Fisher 's death ?", "sentence2": "What will Carrie Fisher 's death mean for the next Star Wars movies ?", "text_q_seg": ["What", "will", "happen", "to", "the", "next", "Star", "Wars", "movies", "after", "Carrie", "Fisher", "'s", "death", "?"], "text_t_seg": ["What", "will", "Carrie", "Fisher", "'s", "death", "mean", "for", "the", "next", "Star", "Wars", "movies", "?"], "sample_type": "ori", "rel_ids": [1678]} -{"id": 20, "sentence1": "What is an analogy for a smooth ER ?", "sentence2": "What is an analogy for smooth ER ?", "text_q_seg": ["What", "is", "an", "analogy", "for", "a", "smooth", "ER", "?"], "text_t_seg": ["What", "is", "an", "analogy", "for", "smooth", "ER", "?"], "sample_type": "ori", "rel_ids": [1679]} -{"id": 21, "sentence1": "What is the best business to start in Bangalore ?", "sentence2": "What is the best business in Bangalore to start up with ?", "text_q_seg": ["What", "is", "the", "best", "business", "to", "start", "in", "Bangalore", "?"], "text_t_seg": ["What", "is", "the", "best", "business", "in", "Bangalore", "to", "start", "up", "with", "?"], "sample_type": "ori", "rel_ids": [1680]} -{"id": 22, "sentence1": "Why does gst bill so important ?", "sentence2": "What is the effect of GST bill on a common man ?", "text_q_seg": ["Why", "does", "gst", "bill", "so", "important", "?"], "text_t_seg": ["What", "is", "the", "effect", "of", "GST", "bill", "on", "a", "common", "man", "?"], "sample_type": "ori", "rel_ids": [1681]} -{"id": 23, "sentence1": "Which aircraft was superior - the Douglas DC8 or the Boeing 707 ?", "sentence2": "Was the Douglas DC8 a superior aircraft to the Boeing 707 ?", "text_q_seg": ["Which", "aircraft", "was", "superior", "-", "the", "Douglas", "DC8", "or", "the", "Boeing", "707", "?"], "text_t_seg": ["Was", "the", "Douglas", "DC8", "a", "superior", "aircraft", "to", "the", "Boeing", "707", "?"], "sample_type": "ori", "rel_ids": [1682]} -{"id": 24, "sentence1": "How can I expand my IQ ?", "sentence2": "What can I do to increase my IQ ?", "text_q_seg": ["How", "can", "I", "expand", "my", "IQ", "?"], "text_t_seg": ["What", "can", "I", "do", "to", "increase", "my", "IQ", "?"], "sample_type": "ori", "rel_ids": [1683]} -{"id": 25, 
"sentence1": "What does it mean when a girl take a day to reply to your text ?", "sentence2": "What does it mean when girls reply to a text a day after ?", "text_q_seg": ["What", "does", "it", "mean", "when", "a", "girl", "take", "a", "day", "to", "reply", "to", "your", "text", "?"], "text_t_seg": ["What", "does", "it", "mean", "when", "girls", "reply", "to", "a", "text", "a", "day", "after", "?"], "sample_type": "ori", "rel_ids": [1684]} -{"id": 26, "sentence1": "How can I stop myself from watching too much of porn ?", "sentence2": "How shall I stop watching porn ?", "text_q_seg": ["How", "can", "I", "stop", "myself", "from", "watching", "too", "much", "of", "porn", "?"], "text_t_seg": ["How", "shall", "I", "stop", "watching", "porn", "?"], "sample_type": "ori", "rel_ids": [1685]} -{"id": 27, "sentence1": "What will be the effect of banning 500 and 1000 Rs notes on real estate sector in India ? Can we expect sharp fall in prices in short / long term ?", "sentence2": "What will the real estate look like now after the 500 and 1000 scraping ?", "text_q_seg": ["What", "will", "be", "the", "effect", "of", "banning", "500", "and", "1000", "Rs", "notes", "on", "real", "estate", "sector", "in", "India", "?", "Can", "we", "expect", "sharp", "fall", "in", "prices", "in", "short", "/", "long", "term", "?"], "text_t_seg": ["What", "will", "the", "real", "estate", "look", "like", "now", "after", "the", "500", "and", "1000", "scraping", "?"], "sample_type": "ori", "rel_ids": [1686]} -{"id": 28, "sentence1": "Is it worth it to pay for PhD from my pocket ?", "sentence2": "Is it foolish to pay for your PhD out of your own pocket ?", "text_q_seg": ["Is", "it", "worth", "it", "to", "pay", "for", "PhD", "from", "my", "pocket", "?"], "text_t_seg": ["Is", "it", "foolish", "to", "pay", "for", "your", "PhD", "out", "of", "your", "own", "pocket", "?"], "sample_type": "ori", "rel_ids": [1687]} -{"id": 29, "sentence1": "What is the maximum file size that can be uploaded in Whatsapp ?", "sentence2": "What is the maximum file size on WhatsApp ?", "text_q_seg": ["What", "is", "the", "maximum", "file", "size", "that", "can", "be", "uploaded", "in", "Whatsapp", "?"], "text_t_seg": ["What", "is", "the", "maximum", "file", "size", "on", "WhatsApp", "?"], "sample_type": "ori", "rel_ids": [1688]} -{"id": 30, "sentence1": "What are the best ways to learn to cook ?", "sentence2": "How do I learn to cook ?", "text_q_seg": ["What", "are", "the", "best", "ways", "to", "learn", "to", "cook", "?"], "text_t_seg": ["How", "do", "I", "learn", "to", "cook", "?"], "sample_type": "ori", "rel_ids": [1689]} -{"id": 31, "sentence1": "What was first word spoken by human ?", "sentence2": "What is the first word ever spoken ?", "text_q_seg": ["What", "was", "first", "word", "spoken", "by", "human", "?"], "text_t_seg": ["What", "is", "the", "first", "word", "ever", "spoken", "?"], "sample_type": "ori", "rel_ids": [1690]} -{"id": 32, "sentence1": "Should I give my JEE Main exam offline or online ?", "sentence2": "Which mode is best for JEE MAIN 2017 online exam or offline ?", "text_q_seg": ["Should", "I", "give", "my", "JEE", "Main", "exam", "offline", "or", "online", "?"], "text_t_seg": ["Which", "mode", "is", "best", "for", "JEE", "MAIN", "2017", "online", "exam", "or", "offline", "?"], "sample_type": "ori", "rel_ids": [1691]} -{"id": 33, "sentence1": "Is literally infinite number of unique human DNAs possible ?", "sentence2": "What is the maximum number of genetically unique individuals that human genome allows ?", "text_q_seg": ["Is", 
"literally", "infinite", "number", "of", "unique", "human", "DNAs", "possible", "?"], "text_t_seg": ["What", "is", "the", "maximum", "number", "of", "genetically", "unique", "individuals", "that", "human", "genome", "allows", "?"], "sample_type": "ori", "rel_ids": [1692]} -{"id": 34, "sentence1": "What is motive of Mulayam Singh Yadav behind expelling Akhilesh Yadav from Samajwadi party ?", "sentence2": "Why did Mulayam Singh Yadav expel Akhilesh Yadav from the Samajwadi Party for 6 years ?", "text_q_seg": ["What", "is", "motive", "of", "Mulayam", "Singh", "Yadav", "behind", "expelling", "Akhilesh", "Yadav", "from", "Samajwadi", "party", "?"], "text_t_seg": ["Why", "did", "Mulayam", "Singh", "Yadav", "expel", "Akhilesh", "Yadav", "from", "the", "Samajwadi", "Party", "for", "6", "years", "?"], "sample_type": "ori", "rel_ids": [1693]} -{"id": 35, "sentence1": "Why do we need to philosophize ?", "sentence2": "Why do we need to philosophize with others ?", "text_q_seg": ["Why", "do", "we", "need", "to", "philosophize", "?"], "text_t_seg": ["Why", "do", "we", "need", "to", "philosophize", "with", "others", "?"], "sample_type": "ori", "rel_ids": [1694]} -{"id": 36, "sentence1": "Is there any way to recover e - mails that were deleted from a Gmail account ?", "sentence2": "Is there any way to retrieve my deleted emails from my Gmail account ?", "text_q_seg": ["Is", "there", "any", "way", "to", "recover", "e", "-", "mails", "that", "were", "deleted", "from", "a", "Gmail", "account", "?"], "text_t_seg": ["Is", "there", "any", "way", "to", "retrieve", "my", "deleted", "emails", "from", "my", "Gmail", "account", "?"], "sample_type": "ori", "rel_ids": [1695]} -{"id": 37, "sentence1": "How do I find my own gmail accounts list ?", "sentence2": "How can you find all of your Gmail accounts ?", "text_q_seg": ["How", "do", "I", "find", "my", "own", "gmail", "accounts", "list", "?"], "text_t_seg": ["How", "can", "you", "find", "all", "of", "your", "Gmail", "accounts", "?"], "sample_type": "ori", "rel_ids": [1696]} -{"id": 38, "sentence1": "Where can I get sparkling and well maintained cleaning service in Sydney ?", "sentence2": "Where can I get cleaning services in Sydney ?", "text_q_seg": ["Where", "can", "I", "get", "sparkling", "and", "well", "maintained", "cleaning", "service", "in", "Sydney", "?"], "text_t_seg": ["Where", "can", "I", "get", "cleaning", "services", "in", "Sydney", "?"], "sample_type": "ori", "rel_ids": [1697]} -{"id": 39, "sentence1": "Can Fast and Furious 7 gross $ 1 billion worldwide ?", "sentence2": "Will Furious 7 be the first movie in the franchise to gross a billion dollars ?", "text_q_seg": ["Can", "Fast", "and", "Furious", "7", "gross", "$", "1", "billion", "worldwide", "?"], "text_t_seg": ["Will", "Furious", "7", "be", "the", "first", "movie", "in", "the", "franchise", "to", "gross", "a", "billion", "dollars", "?"], "sample_type": "ori", "rel_ids": [1698]} -{"id": 40, "sentence1": "Which is the best book for learning language c++ ?", "sentence2": "What is a good book for learning the basics of C++ programming ?", "text_q_seg": ["Which", "is", "the", "best", "book", "for", "learning", "language", "c++", "?"], "text_t_seg": ["What", "is", "a", "good", "book", "for", "learning", "the", "basics", "of", "C++", "programming", "?"], "sample_type": "ori", "rel_ids": [1699]} -{"id": 41, "sentence1": "What will be Barack Obama 's legacy ?", "sentence2": "Based on what we know now , what will Barack Obama 's historical legacy be ?", "text_q_seg": ["What", "will", "be", "Barack", "Obama", 
"'s", "legacy", "?"], "text_t_seg": ["Based", "on", "what", "we", "know", "now", ",", "what", "will", "Barack", "Obama", "'s", "historical", "legacy", "be", "?"], "sample_type": "ori", "rel_ids": [1700]} -{"id": 42, "sentence1": "Why do so many people hate Hilary Clinton ?", "sentence2": "What are the reasons that people dislike Hillary Clinton ?", "text_q_seg": ["Why", "do", "so", "many", "people", "hate", "Hilary", "Clinton", "?"], "text_t_seg": ["What", "are", "the", "reasons", "that", "people", "dislike", "Hillary", "Clinton", "?"], "sample_type": "ori", "rel_ids": [1701]} -{"id": 43, "sentence1": "How do l see who viewed my videos on Instagram ?", "sentence2": "How can I see who viewed my video on Instagram but did n't like my video ?", "text_q_seg": ["How", "do", "l", "see", "who", "viewed", "my", "videos", "on", "Instagram", "?"], "text_t_seg": ["How", "can", "I", "see", "who", "viewed", "my", "video", "on", "Instagram", "but", "did", "n't", "like", "my", "video", "?"], "sample_type": "ori", "rel_ids": [1702]} -{"id": 44, "sentence1": "Why is that the sky is so blue ?", "sentence2": "Why is the sky is blue ?", "text_q_seg": ["Why", "is", "that", "the", "sky", "is", "so", "blue", "?"], "text_t_seg": ["Why", "is", "the", "sky", "is", "blue", "?"], "sample_type": "ori", "rel_ids": [1703]} -{"id": 45, "sentence1": "How can I learn English well in a short time ?", "sentence2": "How can I learn English in a short time ?", "text_q_seg": ["How", "can", "I", "learn", "English", "well", "in", "a", "short", "time", "?"], "text_t_seg": ["How", "can", "I", "learn", "English", "in", "a", "short", "time", "?"], "sample_type": "ori", "rel_ids": [1704]} -{"id": 46, "sentence1": "How can I stop eating junk and processed food addiction and stay healthy ?", "sentence2": "How do I stop my cravings for junk food ?", "text_q_seg": ["How", "can", "I", "stop", "eating", "junk", "and", "processed", "food", "addiction", "and", "stay", "healthy", "?"], "text_t_seg": ["How", "do", "I", "stop", "my", "cravings", "for", "junk", "food", "?"], "sample_type": "ori", "rel_ids": [1705]} -{"id": 47, "sentence1": "What are the movies one should see ?", "sentence2": "What are the greatest movies I have to see ?", "text_q_seg": ["What", "are", "the", "movies", "one", "should", "see", "?"], "text_t_seg": ["What", "are", "the", "greatest", "movies", "I", "have", "to", "see", "?"], "sample_type": "ori", "rel_ids": [1706]} -{"id": 48, "sentence1": "What is an accurate way to calculate your IQ ?", "sentence2": "What 's the most accurate way to test my IQ ?", "text_q_seg": ["What", "is", "an", "accurate", "way", "to", "calculate", "your", "IQ", "?"], "text_t_seg": ["What", "'s", "the", "most", "accurate", "way", "to", "test", "my", "IQ", "?"], "sample_type": "ori", "rel_ids": [1707]} -{"id": 49, "sentence1": "Is our PM Modi doing the correct thing with 500 and 1000 Rs notes ?", "sentence2": "What do you think about ban on Rs . 500 and Rs . 
1000 currency notes ?", "text_q_seg": ["Is", "our", "PM", "Modi", "doing", "the", "correct", "thing", "with", "500", "and", "1000", "Rs", "notes", "?"], "text_t_seg": ["What", "do", "you", "think", "about", "ban", "on", "Rs", ".", "500", "and", "Rs", ".", "1000", "currency", "notes", "?"], "sample_type": "ori", "rel_ids": [1708]} -{"id": 50, "sentence1": "Why is the firm 's marginal cost curve equal supply curve ?", "sentence2": "How can supply curve tell about marginal cost ?", "text_q_seg": ["Why", "is", "the", "firm", "'s", "marginal", "cost", "curve", "equal", "supply", "curve", "?"], "text_t_seg": ["How", "can", "supply", "curve", "tell", "about", "marginal", "cost", "?"], "sample_type": "ori", "rel_ids": [1709]} -{"id": 1660, "sentence1": "Is there any reason that we should travel alone ?", "sentence2": "What are some reasons to travel alone ?", "text_q_seg": ["Is", "there", "any", "reason", "that", "we", "should", "travel", "alone", "?"], "text_t_seg": ["What", "are", "some", "reasons", "to", "travel", "alone", "?"], "sample_type": "disturb"} -{"id": 1661, "sentence1": "I am 25 year old guy and never had a girlfriend . Is this odd ?", "sentence2": "I am 25 years old . I have never had a girlfriend . Is something wrong with me ?", "text_q_seg": ["I", "am", "25", "year", "old", "guy", "and", "never", "had", "a", "girlfriend", ".", "Is", "this", "odd", "?"], "text_t_seg": ["I", "am", "25", "years", "old", ".", "I", "have", "never", "had", "a", "girlfriend", ".", "Is", "something", "wrong", "with", "me", "?"], "sample_type": "disturb"} -{"id": 1662, "sentence1": "what is a good answer on Quora that is helpful ?", "sentence2": "How do you write a good answer on Quora ?", "text_q_seg": ["what", "is", "a", "good", "answer", "on", "Quora", "that", "is", "helpful", "?"], "text_t_seg": ["How", "do", "you", "write", "a", "good", "answer", "on", "Quora", "?"], "sample_type": "disturb"} -{"id": 1663, "sentence1": "What was the most fatal battle in history ?", "sentence2": "What was the bloodiest battle in history ?", "text_q_seg": ["What", "was", "the", "most", "fatal", "battle", "in", "history", "?"], "text_t_seg": ["What", "was", "the", "bloodiest", "battle", "in", "history", "?"], "sample_type": "disturb"} -{"id": 1664, "sentence1": "What are your opions on demonetisation in India ?", "sentence2": "What do you think about the ban on 500 and 1000 denomination notes in India ?", "text_q_seg": ["What", "are", "your", "opions", "on", "demonetisation", "in", "India", "?"], "text_t_seg": ["What", "do", "you", "think", "about", "the", "ban", "on", "500", "and", "1000", "denomination", "notes", "in", "India", "?"], "sample_type": "disturb"} -{"id": 1665, "sentence1": "Is it a bad time to buy a condo or a house in the Bay Area in 2017 ?", "sentence2": "Is 2017 a good time to buy a house in Bay Area ?", "text_q_seg": ["Is", "it", "a", "bad", "time", "to", "buy", "a", "condo", "or", "a", "house", "in", "the", "Bay", "Area", "in", "2017", "?"], "text_t_seg": ["Is", "2017", "a", "good", "time", "to", "buy", "a", "house", "in", "Bay", "Area", "?"], "sample_type": "disturb"} -{"id": 1666, "sentence1": "What books should an aspiring entrepreneur read ?", "sentence2": "What are the top books an aspiring teen entrepreneur should read ?", "text_q_seg": ["What", "books", "should", "an", "aspiring", "entrepreneur", "read", "?"], "text_t_seg": ["What", "are", "the", "top", "books", "an", "aspiring", "teen", "entrepreneur", "should", "read", "?"], "sample_type": "disturb"} -{"id": 1667, "sentence1": "If universe is 
expanding infinitely and dark and vacuum energy are created as it expands … ?", "sentence2": "If universe can expand without limit and it creates dark / vacuum / gravitational energy with it , then is the potential energy infinite ?", "text_q_seg": ["If", "universe", "is", "expanding", "infinitely", "and", "dark", "and", "vacuum", "energy", "are", "created", "as", "it", "expands", "…", "?"], "text_t_seg": ["If", "universe", "can", "expand", "without", "limit", "and", "it", "creates", "dark", "/", "vacuum", "/", "gravitational", "energy", "with", "it", ",", "then", "is", "the", "potential", "energy", "infinite", "?"], "sample_type": "disturb"} -{"id": 1668, "sentence1": "Who 's the greatest influencer on your life that you have never met ?", "sentence2": "Who are people you have never met who have had the greatest influence on your life ?", "text_q_seg": ["Who", "'s", "the", "greatest", "influencer", "on", "your", "life", "that", "you", "have", "never", "met", "?"], "text_t_seg": ["Who", "are", "people", "you", "have", "never", "met", "who", "have", "had", "the", "greatest", "influence", "on", "your", "life", "?"], "sample_type": "disturb"} -{"id": 1669, "sentence1": "I 'm going to be US President in the future . What should I start doing now to achieve this ?", "sentence2": "I 'm 16 and I want to become the US president someday . What should I start doing ?", "text_q_seg": ["I", "'m", "going", "to", "be", "US", "President", "in", "the", "future", ".", "What", "should", "I", "start", "doing", "now", "to", "achieve", "this", "?"], "text_t_seg": ["I", "'m", "16", "and", "I", "want", "to", "become", "the", "US", "president", "someday", ".", "What", "should", "I", "start", "doing", "?"], "sample_type": "disturb"} -{"id": 1670, "sentence1": "For what reason did MS Dhoni leave captaincy of ODI & T-20 ?", "sentence2": "Why does M.S Dhoni left captaincy for ODI and T20 ?", "text_q_seg": ["For", "what", "reason", "did", "MS", "Dhoni", "leave", "captaincy", "of", "ODI", "&", "T-20", "?"], "text_t_seg": ["Why", "does", "M.S", "Dhoni", "left", "captaincy", "for", "ODI", "and", "T20", "?"], "sample_type": "disturb"} -{"id": 1671, "sentence1": "How to become an actuary ?", "sentence2": "What is the procedure of becoming an actuary ?", "text_q_seg": ["How", "to", "become", "an", "actuary", "?"], "text_t_seg": ["What", "is", "the", "procedure", "of", "becoming", "an", "actuary", "?"], "sample_type": "disturb"} -{"id": 1672, "sentence1": "Are there any smart ways to control emotions ?", "sentence2": "How can I control my emotions ?", "text_q_seg": ["Are", "there", "any", "smart", "ways", "to", "control", "emotions", "?"], "text_t_seg": ["How", "can", "I", "control", "my", "emotions", "?"], "sample_type": "disturb"} -{"id": 1673, "sentence1": "What are the best methods for outlining / planning a novel ?", "sentence2": "How do I best outline my novel ?", "text_q_seg": ["What", "are", "the", "best", "methods", "for", "outlining", "/", "planning", "a", "novel", "?"], "text_t_seg": ["How", "do", "I", "best", "outline", "my", "novel", "?"], "sample_type": "disturb"} -{"id": 1674, "sentence1": "What will happen if Donald Trump was elected the president of US ?", "sentence2": "What will happen now that President - elect Donald Trump has won the election ?", "text_q_seg": ["What", "will", "happen", "if", "Donald", "Trump", "was", "elected", "the", "president", "of", "US", "?"], "text_t_seg": ["What", "will", "happen", "now", "that", "President", "-", "elect", "Donald", "Trump", "has", "won", "the", "election", "?"], 
"sample_type": "disturb"} -{"id": 1675, "sentence1": "Why did Ned Stark bring very few men to the Tower of Joy ?", "sentence2": "Why did Ned Stark go to the Tower of Joy with so few men ? Why not bring a small guard ( say 20 more men ) of loyal and discreet northerners ?", "text_q_seg": ["Why", "did", "Ned", "Stark", "bring", "very", "few", "men", "to", "the", "Tower", "of", "Joy", "?"], "text_t_seg": ["Why", "did", "Ned", "Stark", "go", "to", "the", "Tower", "of", "Joy", "with", "so", "few", "men", "?", "Why", "not", "bring", "a", "small", "guard", "(", "say", "20", "more", "men", ")", "of", "loyal", "and", "discreet", "northerners", "?"], "sample_type": "disturb"} -{"id": 1676, "sentence1": "How do you get better grades ?", "sentence2": "How can I improve my grades ?", "text_q_seg": ["How", "do", "you", "get", "better", "grades", "?"], "text_t_seg": ["How", "can", "I", "improve", "my", "grades", "?"], "sample_type": "disturb"} -{"id": 1677, "sentence1": "What is your new year resolution , short term and long term goal for 2017 ?", "sentence2": "what will be your goals to reach in 2017", "text_q_seg": ["What", "is", "your", "new", "year", "resolution", ",", "short", "term", "and", "long", "term", "goal", "for", "2017", "?"], "text_t_seg": ["what", "will", "be", "your", "goals", "to", "reach", "in", "2017"], "sample_type": "disturb"} -{"id": 1678, "sentence1": "What will happen to the next Star Wars movies after Carrie Fisher 's death ?", "sentence2": "What will Carrie Fisher 's death mean for later Star Wars movies ?", "text_q_seg": ["What", "will", "happen", "to", "the", "next", "Star", "Wars", "movies", "after", "Carrie", "Fisher", "'s", "death", "?"], "text_t_seg": ["What", "will", "Carrie", "Fisher", "'s", "death", "mean", "for", "later", "Star", "Wars", "movies", "?"], "sample_type": "disturb"} -{"id": 1679, "sentence1": "Can you give me an analogy for a smooth ER ?", "sentence2": "What is an analogy for smooth ER ?", "text_q_seg": ["Can", "you", "give", "me", "an", "analogy", "for", "a", "smooth", "ER", "?"], "text_t_seg": ["What", "is", "an", "analogy", "for", "smooth", "ER", "?"], "sample_type": "disturb"} -{"id": 1680, "sentence1": "What is the best business to launch in Bangalore ?", "sentence2": "What is the best business in Bangalore to start up with ?", "text_q_seg": ["What", "is", "the", "best", "business", "to", "launch", "in", "Bangalore", "?"], "text_t_seg": ["What", "is", "the", "best", "business", "in", "Bangalore", "to", "start", "up", "with", "?"], "sample_type": "disturb"} -{"id": 1681, "sentence1": "Why does gst bill so important ?", "sentence2": "What is the impact of GST bill on a common man ?", "text_q_seg": ["Why", "does", "gst", "bill", "so", "important", "?"], "text_t_seg": ["What", "is", "the", "impact", "of", "GST", "bill", "on", "a", "common", "man", "?"], "sample_type": "disturb"} -{"id": 1682, "sentence1": "Which aircraft was better - the Douglas DC8 or the Boeing 707 ?", "sentence2": "Was the Douglas DC8 a superior aircraft to the Boeing 707 ?", "text_q_seg": ["Which", "aircraft", "was", "better", "-", "the", "Douglas", "DC8", "or", "the", "Boeing", "707", "?"], "text_t_seg": ["Was", "the", "Douglas", "DC8", "a", "superior", "aircraft", "to", "the", "Boeing", "707", "?"], "sample_type": "disturb"} -{"id": 1683, "sentence1": "How can I expand my IQ ?", "sentence2": "Are there any ways to increase my IQ ?", "text_q_seg": ["How", "can", "I", "expand", "my", "IQ", "?"], "text_t_seg": ["Are", "there", "any", "ways", "to", "increase", "my", "IQ", "?"], 
"sample_type": "disturb"} -{"id": 1684, "sentence1": "What does it mean when a girl take a day to reply to your text ?", "sentence2": "What does it imply when girls reply to a text a day after ?", "text_q_seg": ["What", "does", "it", "mean", "when", "a", "girl", "take", "a", "day", "to", "reply", "to", "your", "text", "?"], "text_t_seg": ["What", "does", "it", "imply", "when", "girls", "reply", "to", "a", "text", "a", "day", "after", "?"], "sample_type": "disturb"} -{"id": 1685, "sentence1": "How can I stop myself from watching too much of porn ?", "sentence2": "How shall I quit watching porn ?", "text_q_seg": ["How", "can", "I", "stop", "myself", "from", "watching", "too", "much", "of", "porn", "?"], "text_t_seg": ["How", "shall", "I", "quit", "watching", "porn", "?"], "sample_type": "disturb"} -{"id": 1686, "sentence1": "What will be the consequence of banning 500 and 1000 Rs notes on real estate sector in India ? Can we expect sharp fall in prices in short / long term ?", "sentence2": "What will the real estate look like now after the 500 and 1000 scraping ?", "text_q_seg": ["What", "will", "be", "the", "consequence", "of", "banning", "500", "and", "1000", "Rs", "notes", "on", "real", "estate", "sector", "in", "India", "?", "Can", "we", "expect", "sharp", "fall", "in", "prices", "in", "short", "/", "long", "term", "?"], "text_t_seg": ["What", "will", "the", "real", "estate", "look", "like", "now", "after", "the", "500", "and", "1000", "scraping", "?"], "sample_type": "disturb"} -{"id": 1687, "sentence1": "Is it worthwhile to pay for PhD from my pocket ?", "sentence2": "Is it foolish to pay for your PhD out of your own pocket ?", "text_q_seg": ["Is", "it", "worthwhile", "to", "pay", "for", "PhD", "from", "my", "pocket", "?"], "text_t_seg": ["Is", "it", "foolish", "to", "pay", "for", "your", "PhD", "out", "of", "your", "own", "pocket", "?"], "sample_type": "disturb"} -{"id": 1688, "sentence1": "What is the maximum file size that is allowed to be uploaded in Whatsapp ?", "sentence2": "What is the maximum file size on WhatsApp ?", "text_q_seg": ["What", "is", "the", "maximum", "file", "size", "that", "is", "allowed", "to", "be", "uploaded", "in", "Whatsapp", "?"], "text_t_seg": ["What", "is", "the", "maximum", "file", "size", "on", "WhatsApp", "?"], "sample_type": "disturb"} -{"id": 1689, "sentence1": "What are the best ways to learn to cook ?", "sentence2": "How can I learn to cook", "text_q_seg": ["What", "are", "the", "best", "ways", "to", "learn", "to", "cook", "?"], "text_t_seg": ["How", "can", "I", "learn", "to", "cook"], "sample_type": "disturb"} -{"id": 1690, "sentence1": "What was the first word uttered by human ?", "sentence2": "What is the first word ever spoken ?", "text_q_seg": ["What", "was", "the", "first", "word", "uttered", "by", "human", "?"], "text_t_seg": ["What", "is", "the", "first", "word", "ever", "spoken", "?"], "sample_type": "disturb"} -{"id": 1691, "sentence1": "Should I attend JEE Main exam offline or online ?", "sentence2": "Which mode is best for JEE MAIN 2017 online exam or offline ?", "text_q_seg": ["Should", "I", "attend", "JEE", "Main", "exam", "offline", "or", "online", "?"], "text_t_seg": ["Which", "mode", "is", "best", "for", "JEE", "MAIN", "2017", "online", "exam", "or", "offline", "?"], "sample_type": "disturb"} -{"id": 1692, "sentence1": "Is literally infinite number of unique human DNAs possible ?", "sentence2": "What is the maximum number of genetically unique human individuals ?", "text_q_seg": ["Is", "literally", "infinite", "number", "of", 
"unique", "human", "DNAs", "possible", "?"], "text_t_seg": ["What", "is", "the", "maximum", "number", "of", "genetically", "unique", "human", "individuals", "?"], "sample_type": "disturb"} -{"id": 1693, "sentence1": "What is motive of Mulayam Singh Yadav behind expelling Akhilesh Yadav from Samajwadi party ?", "sentence2": "What 's the reason for Mulayam Singh Yadav expelling Akhilesh Yadav from the Samajwadi Party for 6 years ?", "text_q_seg": ["What", "is", "motive", "of", "Mulayam", "Singh", "Yadav", "behind", "expelling", "Akhilesh", "Yadav", "from", "Samajwadi", "party", "?"], "text_t_seg": ["What", "'s", "the", "reason", "for", "Mulayam", "Singh", "Yadav", "expelling", "Akhilesh", "Yadav", "from", "the", "Samajwadi", "Party", "for", "6", "years", "?"], "sample_type": "disturb"} -{"id": 1694, "sentence1": "Why do we need to talk with eloquence ?", "sentence2": "Why do we need to philosophize with others ?", "text_q_seg": ["Why", "do", "we", "need", "to", "talk", "with", "eloquence", "?"], "text_t_seg": ["Why", "do", "we", "need", "to", "philosophize", "with", "others", "?"], "sample_type": "disturb"} -{"id": 1695, "sentence1": "How to recover e - mails that were deleted from a Gmail account ?", "sentence2": "Is there any way to retrieve my deleted emails from my Gmail account ?", "text_q_seg": ["How", "to", "recover", "e", "-", "mails", "that", "were", "deleted", "from", "a", "Gmail", "account", "?"], "text_t_seg": ["Is", "there", "any", "way", "to", "retrieve", "my", "deleted", "emails", "from", "my", "Gmail", "account", "?"], "sample_type": "disturb"} -{"id": 1696, "sentence1": "How to find my own gmail accounts list ?", "sentence2": "How can you find all of your Gmail accounts ?", "text_q_seg": ["How", "to", "find", "my", "own", "gmail", "accounts", "list", "?"], "text_t_seg": ["How", "can", "you", "find", "all", "of", "your", "Gmail", "accounts", "?"], "sample_type": "disturb"} -{"id": 1697, "sentence1": "Where can I get sparkling and well maintained cleaning service in Sydney ?", "sentence2": "Where are cleaning services provided in Sydney ?", "text_q_seg": ["Where", "can", "I", "get", "sparkling", "and", "well", "maintained", "cleaning", "service", "in", "Sydney", "?"], "text_t_seg": ["Where", "are", "cleaning", "services", "provided", "in", "Sydney", "?"], "sample_type": "disturb"} -{"id": 1698, "sentence1": "Can Fast and Furious 7 take $ 1 billion at the box office worldwide ?", "sentence2": "Will Furious 7 be the first movie in the franchise to gross a billion dollars ?", "text_q_seg": ["Can", "Fast", "and", "Furious", "7", "take", "$", "1", "billion", "at", "the", "box", "office", "worldwide", "?"], "text_t_seg": ["Will", "Furious", "7", "be", "the", "first", "movie", "in", "the", "franchise", "to", "gross", "a", "billion", "dollars", "?"], "sample_type": "disturb"} -{"id": 1699, "sentence1": "Is there a book suitable to learn language c++ ?", "sentence2": "What is a good book for learning the basics of C++ programming ?", "text_q_seg": ["Is", "there", "a", "book", "suitable", "to", "learn", "language", "c++", "?"], "text_t_seg": ["What", "is", "a", "good", "book", "for", "learning", "the", "basics", "of", "C++", "programming", "?"], "sample_type": "disturb"} -{"id": 1700, "sentence1": "What will be Barack Obama 's legacy when he leaves office ?", "sentence2": "Based on what we know now , what will Barack Obama 's historical legacy be ?", "text_q_seg": ["What", "will", "be", "Barack", "Obama", "'s", "legacy", "when", "he", "leaves", "office", "?"], "text_t_seg": ["Based", 
"on", "what", "we", "know", "now", ",", "what", "will", "Barack", "Obama", "'s", "historical", "legacy", "be", "?"], "sample_type": "disturb"} -{"id": 1701, "sentence1": "Why do n't people like Hilary Clinton ?", "sentence2": "What are the reasons that people dislike Hillary Clinton ?", "text_q_seg": ["Why", "do", "n't", "people", "like", "Hilary", "Clinton", "?"], "text_t_seg": ["What", "are", "the", "reasons", "that", "people", "dislike", "Hillary", "Clinton", "?"], "sample_type": "disturb"} -{"id": 1702, "sentence1": "How to see who viewed my videos on Instagram ?", "sentence2": "How can I see who viewed my video on Instagram but did n't like my video ?", "text_q_seg": ["How", "to", "see", "who", "viewed", "my", "videos", "on", "Instagram", "?"], "text_t_seg": ["How", "can", "I", "see", "who", "viewed", "my", "video", "on", "Instagram", "but", "did", "n't", "like", "my", "video", "?"], "sample_type": "disturb"} -{"id": 1703, "sentence1": "why is the sky so blue ?", "sentence2": "Why is the sky is blue ?", "text_q_seg": ["why", "is", "the", "sky", "so", "blue", "?"], "text_t_seg": ["Why", "is", "the", "sky", "is", "blue", "?"], "sample_type": "disturb"} -{"id": 1704, "sentence1": "How can I learn English well in a short time ?", "sentence2": "How can I learn English efficiently ?", "text_q_seg": ["How", "can", "I", "learn", "English", "well", "in", "a", "short", "time", "?"], "text_t_seg": ["How", "can", "I", "learn", "English", "efficiently", "?"], "sample_type": "disturb"} -{"id": 1705, "sentence1": "How can I stop eating junk and processed food addiction and stay healthy ?", "sentence2": "How to quit junk food ?", "text_q_seg": ["How", "can", "I", "stop", "eating", "junk", "and", "processed", "food", "addiction", "and", "stay", "healthy", "?"], "text_t_seg": ["How", "to", "quit", "junk", "food", "?"], "sample_type": "disturb"} -{"id": 1706, "sentence1": "What are the movies one should see ?", "sentence2": "What are the greatest movies I must see ?", "text_q_seg": ["What", "are", "the", "movies", "one", "should", "see", "?"], "text_t_seg": ["What", "are", "the", "greatest", "movies", "I", "must", "see", "?"], "sample_type": "disturb"} -{"id": 1707, "sentence1": "What is an accurate way to calculate your IQ ?", "sentence2": "How to test my IQ accurately ?", "text_q_seg": ["What", "is", "an", "accurate", "way", "to", "calculate", "your", "IQ", "?"], "text_t_seg": ["How", "to", "test", "my", "IQ", "accurately", "?"], "sample_type": "disturb"} -{"id": 1708, "sentence1": "Is our PM Modi doing the correct thing with 500 and 1000 Rs notes ?", "sentence2": "What is your view on the ban on Rs . 500 and Rs . 
1000 currency notes ?", "text_q_seg": ["Is", "our", "PM", "Modi", "doing", "the", "correct", "thing", "with", "500", "and", "1000", "Rs", "notes", "?"], "text_t_seg": ["What", "is", "your", "view", "on", "the", "ban", "on", "Rs", ".", "500", "and", "Rs", ".", "1000", "currency", "notes", "?"], "sample_type": "disturb"} -{"id": 1709, "sentence1": "Why is the firm 's marginal cost curve equal supply curve ?", "sentence2": "How can supply curve reflect marginal cost ?", "text_q_seg": ["Why", "is", "the", "firm", "'s", "marginal", "cost", "curve", "equal", "supply", "curve", "?"], "text_t_seg": ["How", "can", "supply", "curve", "reflect", "marginal", "cost", "?"], "sample_type": "disturb"} diff --git a/examples/model_interpretation/download.sh b/examples/model_interpretation/download.sh deleted file mode 100755 index 7d98bfaceecc..000000000000 --- a/examples/model_interpretation/download.sh +++ /dev/null @@ -1,10 +0,0 @@ -wget https://paddlenlp.bj.bcebos.com/data/model_interpretation.tar -wait -tar -xvf model_interpretation.tar -wait -mv ./model_interpretation/vocab.char ./task/similarity/simnet/ -mv ./model_interpretation/vocab_QQP ./task/similarity/simnet/ -mv ./model_interpretation/simnet_vocab.txt ./task/similarity/simnet/ - -mv ./model_interpretation/vocab.sst2_train ./task/senti/rnn/ -mv ./model_interpretation/vocab.txt ./task/senti/rnn \ No newline at end of file diff --git a/examples/model_interpretation/evaluation/accuracy/cal_acc.py b/examples/model_interpretation/evaluation/accuracy/cal_acc.py deleted file mode 100644 index 93c32b46568d..000000000000 --- a/examples/model_interpretation/evaluation/accuracy/cal_acc.py +++ /dev/null @@ -1,92 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
-""" - This script includes code to calculating accuracy for results form textual similarity task -""" -import argparse -import json - - -def get_args(): - """ - get args - """ - parser = argparse.ArgumentParser("Acc eval") - parser.add_argument("--golden_path", required=True) - parser.add_argument("--pred_path", required=True) - parser.add_argument("--language", required=True, choices=["ch", "en"]) - - args = parser.parse_args() - return args - - -def load_from_file(args): - """ - load golden and pred data form file - :return: golden_raw: {sent_id, rationales_lists}, pred_raw: {sent_id, rationales_list}, - golden_label: {sent_id, label}, pred_label: {sent_id, label} - """ - golden_f = open(args.golden_path, "r") - pred_f = open(args.pred_path, "r") - - golden_labels, pred_labels = {}, {} - - for golden_line in golden_f.readlines(): - golden_dict = json.loads(golden_line) - id = golden_dict["sent_id"] - golden_labels[id] = int(golden_dict["sent_label"]) - - for pred_line in pred_f.readlines(): - pred_dict = json.loads(pred_line) - id = pred_dict["id"] - pred_labels[id] = int(pred_dict["pred_label"]) - - result = {} - result["golden_labels"] = golden_labels - result["pred_labels"] = pred_labels - - return result - - -def cal_acc(golden_label, pred_label): - """ - The function actually calculate the accuracy. - """ - acc = 0.0 - for ids in pred_label: - if ids not in golden_label: - continue - if pred_label[ids] == golden_label[ids]: - acc += 1 - if len(golden_label): - acc /= len(golden_label) - return acc - - -def main(args): - """ - main function - """ - result = load_from_file(args) - golden_label = result["golden_labels"] - pred_label = result["pred_labels"] - - acc = cal_acc(golden_label, pred_label) - return acc, len(pred_label) - - -if __name__ == "__main__": - args = get_args() - acc, num = main(args) - print("total\tnum: %d\tacc: %.1f" % (num, acc * 100)) diff --git a/examples/model_interpretation/evaluation/accuracy/mrc_f1_evaluate.py b/examples/model_interpretation/evaluation/accuracy/mrc_f1_evaluate.py deleted file mode 100644 index 21ae6808c94a..000000000000 --- a/examples/model_interpretation/evaluation/accuracy/mrc_f1_evaluate.py +++ /dev/null @@ -1,265 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -""" - This script is used to evaluate the performance of the mrc model (F1) -""" -from __future__ import print_function - -import argparse -import json -from collections import OrderedDict - -from paddlenlp.metrics.squad import squad_evaluate - - -def _tokenize_chinese_chars(text): - """ - :param text: input text, unicode string - :return: - tokenized text, list - """ - - def _is_chinese_char(cp): - """Checks whether CP is the codepoint of a CJK character.""" - # This defines a "chinese character" as anything in the CJK Unicode block: - # https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block) - # - # Note that the CJK Unicode block is NOT all Japanese and Korean characters, - # despite its name. 
The modern Korean Hangul alphabet is a different block, - # as is Japanese Hiragana and Katakana. Those alphabets are used to write - # space-separated words, so they are not treated specially and handled - # like all of the other languages. - if ( - (cp >= 0x4E00 and cp <= 0x9FFF) - or (cp >= 0x3400 and cp <= 0x4DBF) # - or (cp >= 0x20000 and cp <= 0x2A6DF) # - or (cp >= 0x2A700 and cp <= 0x2B73F) # - or (cp >= 0x2B740 and cp <= 0x2B81F) # - or (cp >= 0x2B820 and cp <= 0x2CEAF) # - or (cp >= 0xF900 and cp <= 0xFAFF) - or (cp >= 0x2F800 and cp <= 0x2FA1F) # - ): # - return True - - return False - - output = [] - buff = "" - for char in text: - cp = ord(char) - if _is_chinese_char(cp) or char == "=": - if buff != "": - output.append(buff) - buff = "" - output.append(char) - else: - buff += char - - if buff != "": - output.append(buff) - - return output - - -def _normalize(in_str): - """ - normalize the input unicode string - """ - in_str = in_str.lower() - sp_char = [ - ":", - "_", - "`", - ",", - "。", - ":", - "?", - "!", - "(", - ")", - "“", - "”", - ";", - "’", - "《", - "》", - "……", - "·", - "、", - ",", - "「", - "」", - "(", - ")", - "-", - "~", - "『", - "』", - "|", - ] - out_segs = [] - for char in in_str: - if char in sp_char: - continue - else: - out_segs.append(char) - return "".join(out_segs) - - -def find_lcs(s1, s2): - """find the longest common substring between s1 and s2""" - m = [[0 for i in range(len(s2) + 1)] for j in range(len(s1) + 1)] - max_len = 0 - p = 0 - for i in range(len(s1)): - for j in range(len(s2)): - if s1[i] == s2[j]: - m[i + 1][j + 1] = m[i][j] + 1 - if m[i + 1][j + 1] > max_len: - max_len = m[i + 1][j + 1] - p = i + 1 - return s1[p - max_len : p], max_len - - -def evaluate_ch(ref_ans, pred_ans): - """ - ref_ans: reference answers, dict - pred_ans: predicted answer, dict - return: - f1_score: averaged F1 score - em_score: averaged EM score - total_count: number of samples in the reference dataset - skip_count: number of samples skipped in the calculation due to unknown errors - """ - f1 = 0 - em = 0 - total_count = 0 - skip_count = 0 - for query_id in ref_ans: - sample = ref_ans[query_id] - total_count += 1 - answers = sample["sent_label"] - try: - prediction = pred_ans[query_id]["pred_label"] - except: - skip_count += 1 - continue - if prediction == "": - _f1 = 1.0 - _em = 1.0 - else: - _f1 = calc_f1_score([answers], prediction) - _em = calc_em_score([answers], prediction) - f1 += _f1 - em += _em - - f1_score = 100.0 * f1 / total_count - em_score = 100.0 * em / total_count - return f1_score, em_score, total_count, skip_count - - -def calc_f1_score(answers, prediction): - f1_scores = [] - for ans in answers: - ans_segs = _tokenize_chinese_chars(_normalize(ans)) - prediction_segs = _tokenize_chinese_chars(_normalize(prediction)) - if args.debug: - print(json.dumps(ans_segs, ensure_ascii=False)) - print(json.dumps(prediction_segs, ensure_ascii=False)) - lcs, lcs_len = find_lcs(ans_segs, prediction_segs) - if lcs_len == 0: - f1_scores.append(0) - continue - prec = 1.0 * lcs_len / len(prediction_segs) - rec = 1.0 * lcs_len / len(ans_segs) - f1 = (2 * prec * rec) / (prec + rec) - f1_scores.append(f1) - return max(f1_scores) - - -def calc_em_score(answers, prediction): - em = 0 - for ans in answers: - ans_ = _normalize(ans) - prediction_ = _normalize(prediction) - if ans_ == prediction_: - em = 1 - break - return em - - -def read_dataset(file_path): - f = open(file_path, "r") - golden = {} - for l in f.readlines(): - ins = json.loads(l) - golden[ins["sent_id"]]
= ins - f.close() - return golden - - -def read_model_prediction(file_path): - f = open(file_path, "r") - predict = {} - for l in f.readlines(): - ins = json.loads(l) - predict[ins["id"]] = ins - f.close() - return predict - - -def read_temp(file_path): - with open(file_path) as f1: - result = json.loads(f1.read()) - return result - - -def get_args(): - parser = argparse.ArgumentParser("mrc baseline performance eval") - parser.add_argument("--golden_path", help="dataset file") - parser.add_argument("--pred_file", help="model prediction file") - parser.add_argument("--language", help="the language of the model") - parser.add_argument("--debug", action="store_true", help="debug mode") - args = parser.parse_args() - return args - - -if __name__ == "__main__": - args = get_args() - - if args.language == "ch": - ref_ans = read_dataset(args.golden_path) - pred_ans = read_model_prediction(args.pred_file) - F1, EM, TOTAL, SKIP = evaluate_ch(ref_ans, pred_ans) - - output_result = OrderedDict() - output_result["F1"] = "%.3f" % F1 - output_result["EM"] = "%.3f" % EM - output_result["TOTAL"] = TOTAL - output_result["SKIP"] = SKIP - print(json.dumps(output_result)) - else: - ref_ans = read_dataset(args.golden_path) - pred_ans = read_temp(args.pred_file) - res = [] - for i in ref_ans: - ins = ref_ans[i] - ins["id"] = str(ins["sent_id"]) - ins["answers"] = [ins["sent_label"]] - if ins["answers"] == [""]: - ins["is_impossible"] = True - else: - ins["is_impossible"] = False - res.append(ins) - squad_evaluate(examples=res, preds=pred_ans) diff --git a/examples/model_interpretation/evaluation/accuracy/run_acc.sh b/examples/model_interpretation/evaluation/accuracy/run_acc.sh deleted file mode 100755 index cfa26fa204f0..000000000000 --- a/examples/model_interpretation/evaluation/accuracy/run_acc.sh +++ /dev/null @@ -1,31 +0,0 @@ -### - # This script evaluates accuracy of the results generated by our models -### - -TASK=senti -if [[ $TASK == "mrc" ]]; then - MODELS=("roberta_base" "roberta_large") - MODES=("attention" "integrated_gradient") -else - MODELS=("lstm" "roberta_base" "roberta_large") - MODES=("attention" "integrated_gradient" "lime") -fi - -for BASE_MODEL in ${MODELS[*]}; -do - for INTER_MODE in ${MODES[*]}; - do - for LANGUAGE in "ch" "en"; - do - GOLDEN_PATH=../golden/${TASK}_${LANGUAGE}.tsv - PRED_PATH=../../rationale_extraction/evaluation_data/${TASK}/${BASE_MODEL}_${INTER_MODE}_${LANGUAGE} - - echo $BASE_MODEL$'_'$INTER_MODE$'_'$LANGUAGE - - python3 ./cal_acc.py \ - --language $LANGUAGE \ - --golden_path $GOLDEN_PATH \ - --pred_path $PRED_PATH - done - done -done \ No newline at end of file diff --git a/examples/model_interpretation/evaluation/accuracy/run_mrc_f1.sh b/examples/model_interpretation/evaluation/accuracy/run_mrc_f1.sh deleted file mode 100755 index 204bc6b4c207..000000000000 --- a/examples/model_interpretation/evaluation/accuracy/run_mrc_f1.sh +++ /dev/null @@ -1,29 +0,0 @@ -### - # This script is used to evaluate the performance of the mrc model (F1) -### -MODELS=("roberta_base" "roberta_large") -MODES=("attention" "integrated_gradient") - -for BASE_MODEL in ${MODELS[*]}; -do - for INTER_MODE in ${MODES[*]}; - do - for LANGUAGE in "en" "ch"; - do - echo ${BASE_MODEL}_${INTER_MODE}_${LANGUAGE} - - GOLDEN_PATH=../golden/mrc_${LANGUAGE}.tsv - if [[ $LANGUAGE == "ch" ]]; then - PRED_FILE=../../rationale_extraction/evaluation_data/mrc/${BASE_MODEL}_${INTER_MODE}_${LANGUAGE} - else - PRED_FILE=../../task/mrc/output/mrc_en.${BASE_MODEL}/predict_ans - fi - - python3 mrc_f1_evaluate.py
\ - --golden_path $GOLDEN_PATH \ - --pred_file $PRED_FILE \ - --language $LANGUAGE - done - done -done - diff --git a/examples/model_interpretation/evaluation/consistency/cal_map.py b/examples/model_interpretation/evaluation/consistency/cal_map.py deleted file mode 100644 index a6ed80d8058a..000000000000 --- a/examples/model_interpretation/evaluation/consistency/cal_map.py +++ /dev/null @@ -1,141 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -""" -This script includes code for calculating MAP score of results from the -sentiment analysis, textual similarity, and mrc tasks -""" -import argparse -import json -import math -import os - - -def get_args(): - parser = argparse.ArgumentParser("map eval") - parser.add_argument("--pred_path", required=True) - parser.add_argument("--golden_path", required=True) - parser.add_argument("--language", type=str, required=True, help="language that the model is built for") - args = parser.parse_args() - return args - - -def evids_load(args, path): - golden_f = open(args.golden_path, "r") - golden = {} - ins_num = 0 - for golden_line in golden_f.readlines(): - line = json.loads(golden_line) - if line["sample_type"] == "disturb": - ins_num += 1 - golden[line["sent_id"]] = line - - evids = {} - with open(path, "r") as f: - for line in f.readlines(): - dic = json.loads(line) - dic["sample_type"] = golden[dic["id"]]["sample_type"] - if "rel_ids" in golden[dic["id"]]: - dic["rel_ids"] = golden[dic["id"]]["rel_ids"] - evids[dic["id"]] = dic - return evids, ins_num - - -def _calc_MAP_by_bin(top_p, length_adv, adv_attriRank_list, ori_attriRank_list): - """ - This is our old way to calculate MAP, - which follows equation two in consistency section of README - """ - hits = 0 - sum_precs = 0.0 - length_t = math.ceil(length_adv * top_p) - adv_t = adv_attriRank_list[:length_t] - for char_idx, char in enumerate(adv_t): - if char in ori_attriRank_list[: char_idx + 1]: - hits += 1 - sum_precs += hits / (char_idx + 1) - if length_t > 0: - sum_precs /= length_t - return sum_precs - - -def _calc_MAP_by_bin_paper(top_p, length_adv, adv_attriRank_list, ori_attriRank_list): - """ - This function calculates MAP using the equation in our paper, - which follows equation one in consistency section of README - """ - total_precs = 0.0 - for i in range(length_adv): - hits = 0.0 - i += 1 - adv_t = adv_attriRank_list[:i] - for char_idx, char in enumerate(adv_t): - if char in ori_attriRank_list[:i]: - hits += 1 - hits = hits / i - total_precs += hits - if length_adv == 0: - return 0 - return total_precs / length_adv - - -def _calc_map(evids, key, ins_num): - t_map = 0.0 - - adv_num = 0 - ori_num = 0 - for ori_idx in evids: - if evids[ori_idx]["sample_type"] == "ori": - ori = evids[ori_idx] - ori_num += 1 - # One original instance can be related to several disturbed instances - for adv_idx in evids[ori_idx]["rel_ids"]: - if adv_idx in evids: - adv_num += 1 - adv = evids[adv_idx] - ori_attriRank_list =
list(ori["rationale_token"][key]) - adv_attriRank_list = list(adv["rationale_token"][key]) - length_adv = len(adv_attriRank_list) - - sum_precs = _calc_MAP_by_bin_paper(1, length_adv, adv_attriRank_list, ori_attriRank_list) - t_map += sum_precs - - return t_map / ins_num, ori_num + adv_num - - -def cal_MAP(args, pred_path, la): - evids, ins_num = evids_load(args, pred_path) - if not evids: - print(pred_path + " file empty!") - return 0 - first_key = list(evids.keys())[0] - t_map = 0 - num = 0 - for i in range(len(evids[first_key]["rationale"])): - t_map_tmp, num_tmp = _calc_map(evids, i, ins_num) - t_map += t_map_tmp - num += num_tmp - t_map /= len(evids[first_key]["rationale"]) - num /= len(evids[first_key]["rationale"]) - print("total\t%d\t%.1f" % (num, 100 * t_map)) - return 0 - - -if __name__ == "__main__": - args = get_args() - la = args.language - pred_path = args.pred_path - if os.path.exists(pred_path): - cal_MAP(args, pred_path, la) - else: - print("Prediction file does not exists!") diff --git a/examples/model_interpretation/evaluation/consistency/run_map.sh b/examples/model_interpretation/evaluation/consistency/run_map.sh deleted file mode 100755 index 8ed9f114c5a2..000000000000 --- a/examples/model_interpretation/evaluation/consistency/run_map.sh +++ /dev/null @@ -1,31 +0,0 @@ -### - # This script evaluates consistency of the results generated by our models -### - -TASK=senti -if [[ $TASK == "mrc" ]]; then - MODELS=("roberta_base" "roberta_large") - MODES=("attention" "integrated_gradient") -else - MODELS=("lstm" "roberta_base" "roberta_large") - MODES=("attention" "integrated_gradient" "lime") -fi - -for BASE_MODEL in ${MODELS[*]}; -do - for INTER_MODE in ${MODES[*]}; - do - for LANGUAGE in "ch" "en"; - do - echo ${BASE_MODEL}_${INTER_MODE}_${LANGUAGE} - GOLDEN_PATH=../golden/${TASK}_${LANGUAGE}.tsv - PRED_PATH=../../rationale_extraction/evaluation_data/${TASK}/${BASE_MODEL}_${INTER_MODE}_${LANGUAGE} - - python3 ./cal_map.py \ - --golden_path $GOLDEN_PATH \ - --pred_path $PRED_PATH \ - --language $LANGUAGE - - done - done -done \ No newline at end of file diff --git a/examples/model_interpretation/evaluation/faithfulness/newp_analysis.py b/examples/model_interpretation/evaluation/faithfulness/newp_analysis.py deleted file mode 100644 index f4ad0e56f236..000000000000 --- a/examples/model_interpretation/evaluation/faithfulness/newp_analysis.py +++ /dev/null @@ -1,78 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
-""" - This script includes code to calculating NewP score for results form - sentiment analysis, textual similarity, and mrc task -""" -import argparse -import json - -import numpy as np - - -def get_args(): - """ - get args - """ - parser = argparse.ArgumentParser("NewP eval") - - parser.add_argument("--pred_path", required=True) - parser.add_argument("--golden_path", required=True) - - args = parser.parse_args() - return args - - -def data_load(args): - """ - load result data from file - """ - pred_path = args.pred_path - golden_path = args.golden_path - - with open(pred_path, "r") as f_text: - pred_list = [] - for line in f_text.readlines(): - line_dict = json.loads(line) - pred_list.append(line_dict) - - with open(golden_path, "r") as f_text: - gold_list = {} - for line in f_text.readlines(): - line_dict = json.loads(line) - gold_list[line_dict["sent_id"]] = line_dict - return pred_list, gold_list - - -def analysis(args, instance, gold_list): - """ - Analysis result according to result data - """ - New_P_list = [] - for ins in instance: - golden_label = ins["pred_label"] - text_correct = 1 if ins["rationale_pred"] == golden_label else 0 - text_exclusive_correct = 1 if ins["no_rationale_pred"] == golden_label else 0 - New_P_correct = 1 if (text_correct == 1 and text_exclusive_correct == 0) else 0 - New_P_list.append(New_P_correct) - - total_New_P = np.sum(New_P_list) / len(gold_list) if len(gold_list) else 0 - - print("total\t%d\t%.1f" % (len(New_P_list), 100 * total_New_P)) - - -if __name__ == "__main__": - args = get_args() - pred_list, gold_list = data_load(args) - analysis(args, pred_list, gold_list) diff --git a/examples/model_interpretation/evaluation/faithfulness/run_newp.sh b/examples/model_interpretation/evaluation/faithfulness/run_newp.sh deleted file mode 100755 index 5110ea61ff71..000000000000 --- a/examples/model_interpretation/evaluation/faithfulness/run_newp.sh +++ /dev/null @@ -1,30 +0,0 @@ -### - # This script evaluates faithfulness of the results generated by our models -### - -TASK=senti -if [[ $TASK == "mrc" ]]; then - MODELS=("roberta_base" "roberta_large") - MODES=("attention" "integrated_gradient") -else - MODELS=("lstm" "roberta_base" "roberta_large") - MODES=("attention" "integrated_gradient" "lime") -fi - -for BASE_MODEL in ${MODELS[*]}; -do - for INTER_MODE in ${MODES[*]}; - do - for LANGUAGE in "ch" "en"; - do - GOLDEN_PATH=../golden/${TASK}_${LANGUAGE}.tsv - PRED_PATH=../../rationale_extraction/evaluation_data/${TASK}/${BASE_MODEL}_${INTER_MODE}_${LANGUAGE} - - echo ${BASE_MODEL}_${INTER_MODE}_${LANGUAGE} - - python3 ./newp_analysis.py \ - --pred_path $PRED_PATH \ - --golden_path $GOLDEN_PATH - done - done -done \ No newline at end of file diff --git a/examples/model_interpretation/evaluation/plausibility/eval_mrc.py b/examples/model_interpretation/evaluation/plausibility/eval_mrc.py deleted file mode 100644 index b3bc04a5ba5b..000000000000 --- a/examples/model_interpretation/evaluation/plausibility/eval_mrc.py +++ /dev/null @@ -1,112 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
-# See the License for the specific language governing permissions and -# limitations under the License. -""" - This script includes code for calculating F1 score of results from the mrc task -""" -import argparse -import json - - -def get_args(): - parser = argparse.ArgumentParser("F1 eval") - - parser.add_argument("--golden_path", required=True) - parser.add_argument("--pred_path", required=True) - parser.add_argument("--language", required=True, choices=["ch", "en"]) - - args = parser.parse_args() - return args - - -def load_from_file(args): - """ - Load golden and pred data from file - :return: golden_raw: {sent_id, rationales_lists}, pred_raw: {sent_id, rationales_list}, - golden_label: {sent_id, label}, pred_label: {sent_id, label} - """ - golden_f = open(args.golden_path, "r") - pred_f = open(args.pred_path, "r") - - golden_raw_rationale, pred_rationale = {}, {} - - for golden_line in golden_f.readlines(): - golden_dict = json.loads(golden_line) - sent_id = golden_dict["sent_id"] - golden_raw_rationale[sent_id] = [int(x) for x in golden_dict["rationales"]] - - for pred_line in pred_f.readlines(): - pred_dict = json.loads(pred_line) - senti_id = pred_dict["id"] - pred_rationale[senti_id] = pred_dict["rationale"][0] - - return golden_raw_rationale, pred_rationale - - -def _f1(_p, _r): - if _p == 0 or _r == 0: - return 0 - return 2 * _p * _r / (_p + _r) - - -def calc_f1(golden_evid, pred_evid): - tp = set(pred_evid) & set(golden_evid) - prec = len(tp) / len(pred_evid) if len(pred_evid) else 0 - rec = len(tp) / len(golden_evid) if len(golden_evid) else 0 - f1 = _f1(prec, rec) - return f1 - - -def calc_model_f1(golden_dict, pred_dict): - """ - :param golden_dict: dict - :param pred_dict: dict - :return: macro-f1, micro-f1 - """ - - scores = {} - - for s_id in pred_dict.keys(): - if s_id not in golden_dict: - continue - golden_evid = golden_dict[s_id] - pred_evid = pred_dict[s_id] - - tp = set(golden_evid) & set(pred_evid) - prec = len(tp) / len(pred_evid) if len(pred_evid) else 0 - rec = len(tp) / len(golden_evid) if len(golden_evid) else 0 - f1 = _f1(prec, rec) - scores[s_id] = { - "tp_count": len(tp), - "pred_count": len(pred_evid), - "golden_count": len(golden_evid), - "prec": prec, - "rec": rec, - "f1": f1, - } - - macro_f1 = sum(score["f1"] for score in scores.values()) / len(golden_dict) if len(golden_dict) else 0 - - return macro_f1, scores - - -def main(args): - golden_raw, pred_raw = load_from_file(args) - macro_f1, scores = calc_model_f1(golden_raw, pred_raw) - return macro_f1, len(golden_raw), scores - - -if __name__ == "__main__": - args = get_args() - macro_f1, num, scores = main(args) - print("total\tnum: %d\tmacro_f1: %.1f" % (num, macro_f1 * 100)) diff --git a/examples/model_interpretation/evaluation/plausibility/eval_senti.py b/examples/model_interpretation/evaluation/plausibility/eval_senti.py deleted file mode 100644 index 449755cf972c..000000000000 --- a/examples/model_interpretation/evaluation/plausibility/eval_senti.py +++ /dev/null @@ -1,178 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and -# limitations under the License. -""" - This script includes code for calculating F1 score of results from the sentiment analysis task -""" -import argparse -import json - - -def get_args(): - parser = argparse.ArgumentParser("F1 eval") - - parser.add_argument("--language", required=True, choices=["en", "ch"]) - parser.add_argument("--golden_path", required=True) - parser.add_argument("--pred_path", required=True) - - args = parser.parse_args() - return args - - -def load_from_file(args): - """ - Load golden and pred data from file - :return: golden_raw: {sent_id, rationales_lists}, pred_raw: {sent_id, rationales_list}, - golden_label: {sent_id, label}, pred_label: {sent_id, label} - """ - golden_f = open(args.golden_path, "r") - pred_f = open(args.pred_path, "r") - - golden_raw_rationale, golden_label, pred_rationale, pred_label = {}, {}, {}, {} - - for golden_line in golden_f.readlines(): - golden_dict = json.loads(golden_line) - sent_id = golden_dict["sent_id"] - golden_raw_rationale[sent_id] = [] - for x in golden_dict["rationales"]: - temp = [int(y) for y in x] - golden_raw_rationale[sent_id].append(temp) - golden_label[sent_id] = int(golden_dict["sent_label"]) - - for pred_line in pred_f.readlines(): - pred_dict = json.loads(pred_line) - senti_id = pred_dict["id"] - pred_rationale[senti_id] = pred_dict["rationale"][0] - pred_label[senti_id] = int(pred_dict["pred_label"]) - - golden_f.close() - pred_f.close() - return golden_raw_rationale, pred_rationale, golden_label, pred_label - - -def _f1(_p, _r): - if _p == 0 or _r == 0: - return 0 - return 2 * _p * _r / (_p + _r) - - -def calc_f1(golden_evid, pred_evid): - tp = set(pred_evid) & set(golden_evid) - prec = len(tp) / len(pred_evid) if len(pred_evid) else 0 - rec = len(tp) / len(golden_evid) if len(golden_evid) else 0 - f1 = _f1(prec, rec) - return f1 - - -def combine(cur_max_f1, union_set, golden_evid, pred_evid): - """ - Args: - cur_max_f1 float: current maximum f1 - union_set set(): set of merged golden evidence - golden_evid list(): golden evidence - pred_evid list(): predicted evidence - """ - if len(union_set & set(golden_evid)) < len(golden_evid) and calc_f1(golden_evid, pred_evid) > 0: - new_union_set = union_set | set(golden_evid) - new_f1 = calc_f1(new_union_set, pred_evid) - if new_f1 > cur_max_f1: # do not update union_set if merging golden_evid does not raise f1 above cur_max_f1 - cur_max_f1 = new_f1 - union_set = new_union_set - - return cur_max_f1, union_set - - -def pick_max_golden_evid(golden_raw, pred_raw): - """ - Pick the golden_evid from golden_evids that has the highest f1 against pred_evid - """ - golden_dict = {} - err_rationale = [] - - for s_id in pred_raw.keys(): - if s_id not in golden_raw: - continue - golden_evids = golden_raw[s_id] - pred_evid = pred_raw[s_id] - max_f1 = 0 - - # find the single golden_evid with the highest f1 - for golden_evid in golden_evids: - f1 = calc_f1(golden_evid, pred_evid) - if f1 > max_f1: - max_f1 = f1 - golden_dict[s_id] = golden_evid - - # find the combination of golden_evids with the highest f1 - for start_id in range(len(golden_evids) - 1): - union_set = set() - cur_max_f1 = 0 - for id in range(start_id, len(golden_evids)): - golden_evid = golden_evids[id] - cur_max_f1, union_set = combine(cur_max_f1, union_set, golden_evid, pred_evid) - - if cur_max_f1 > max_f1: - max_f1 = cur_max_f1 - golden_dict[s_id] = list(union_set) - - if max_f1 == 0: - golden_dict[s_id] = [] - err_rationale.append(s_id) - - return golden_dict - - -def calc_model_f1(golden_dict, pred_dict, golden_len): - """ - :param golden_dict: dict - :param pred_dict: dict - :return: macro-f1, micro-f1 - """ - - scores = {} - - for
s_id in pred_dict.keys(): - if s_id not in golden_dict: - continue - golden_evid = golden_dict[s_id] - pred_evid = pred_dict[s_id] - - tp = set(golden_evid) & set(pred_evid) - prec = len(tp) / len(pred_evid) if len(pred_evid) else 0 - rec = len(tp) / len(golden_evid) if len(golden_evid) else 0 - f1 = _f1(prec, rec) - scores[s_id] = { - "tp_count": len(tp), - "pred_count": len(pred_evid), - "golden_count": len(golden_evid), - "prec": prec, - "rec": rec, - "f1": f1, - } - - macro_f1 = (sum(score["f1"] for score in scores.values()) / golden_len) if golden_len else 0 - - return macro_f1, scores - - -def main(args): - golden_raw, pred_raw, golden_label, pred_label = load_from_file(args) - golden_dict = pick_max_golden_evid(golden_raw, pred_raw) - macro_f1, scores = calc_model_f1(golden_dict, pred_raw, len(golden_raw)) - return macro_f1, len(golden_raw) - - -if __name__ == "__main__": - args = get_args() - macro_f1, num = main(args) - print("num\t%d\tmacro_f1: %.1f" % (num, macro_f1 * 100)) diff --git a/examples/model_interpretation/evaluation/plausibility/eval_similarity.py b/examples/model_interpretation/evaluation/plausibility/eval_similarity.py deleted file mode 100644 index 0307248514bd..000000000000 --- a/examples/model_interpretation/evaluation/plausibility/eval_similarity.py +++ /dev/null @@ -1,133 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License.
-""" - This script includes code to calculating F1 score for results form textual similarity task -""" -import argparse -import json - - -def get_args(): - """ - get args - """ - parser = argparse.ArgumentParser("F1 eval") - parser.add_argument("--golden_path", required=True) - parser.add_argument("--pred_path", required=True) - parser.add_argument("--language", required=True, choices=["ch", "en"]) - - args = parser.parse_args() - return args - - -def load_from_file(args): - """ - Load golden and pred data form file - :return: golden_raw: {sent_id, rationales_lists}, pred_raw: {sent_id, rationales_list}, - golden_label: {sent_id, label}, pred_label: {sent_id, label} - """ - golden_f = open(args.golden_path, "r") - pred_f = open(args.pred_path, "r") - - golden_q_rationales, golden_t_rationales = {}, {} - pred_q_rationales, pred_t_rationales = {}, {} - golden_labels, pred_labels = {}, {} - - for golden_line in golden_f.readlines(): - golden_dict = json.loads(golden_line) - id = golden_dict["sent_id"] - # golden_rationale id - golden_q_rationales[id] = [int(x) for x in golden_dict["rationale_q_idx"]] - golden_t_rationales[id] = [int(x) for x in golden_dict["rationale_t_idx"]] - golden_labels[id] = int(golden_dict["sent_label"]) - - for pred_line in pred_f.readlines(): - pred_dict = json.loads(pred_line) - id = pred_dict["id"] - pred_q_rationales[id] = pred_dict["rationale"][0] - pred_t_rationales[id] = pred_dict["rationale"][1] - pred_labels[id] = int(pred_dict["pred_label"]) - - result = {} - result["golden_q_rationales"] = golden_q_rationales - result["golden_t_rationales"] = golden_t_rationales - result["pred_q_rationales"] = pred_q_rationales - result["pred_t_rationales"] = pred_t_rationales - result["golden_labels"] = golden_labels - result["pred_labels"] = pred_labels - - return result - - -def _f1(_p, _r): - if _p == 0 or _r == 0: - return 0 - return 2 * _p * _r / (_p + _r) - - -def calc_model_f1(golden_a_rationales, golden_b_rationales, pred_a_rationales, pred_b_rationales): - """ - :param golden_dict: dict - :param pred_dict: dict - :return: macro-f1, micro-f1 - """ - - scores = {} - - for id in pred_a_rationales.keys(): - golden_a_ratioanl = golden_a_rationales[id] - pred_a_rationale = pred_a_rationales[id] - tp_a = set(golden_a_ratioanl) & set(pred_a_rationale) - prec_a = len(tp_a) / len(pred_a_rationale) if len(pred_a_rationale) else 0 - rec_a = len(tp_a) / len(golden_a_ratioanl) if len(golden_a_ratioanl) else 0 - f1_a = _f1(prec_a, rec_a) - - golden_b_rationale = golden_b_rationales[id] - pred_b_rationale = pred_b_rationales[id] - tp_b = set(golden_b_rationale) & set(pred_b_rationale) - prec_b = len(tp_b) / len(pred_b_rationale) if len(pred_b_rationale) else 0 - rec_b = len(tp_b) / len(golden_b_rationale) if len(golden_b_rationale) else 0 - f1_b = _f1(prec_b, rec_b) - - scores[id] = { - "tp_count": (len(tp_a) + len(tp_b)) / 2, - "pred_count": (len(pred_a_rationale) + len(pred_b_rationale)) / 2, - "golden_count": (len(golden_a_ratioanl) + len(golden_b_rationale)) / 2, - "prec": (prec_a + prec_b) / 2, - "rec": (rec_a + rec_b) / 2, - "f1": (f1_a + f1_b) / 2, - } - - macro_f1 = ( - sum(score["f1"] for score in scores.values()) / len(golden_a_rationales) if len(golden_a_rationales) else 0 - ) - - return macro_f1, scores - - -def main(args): - result = load_from_file(args) - golden_a_rationales = result["golden_q_rationales"] - golden_b_rationales = result["golden_t_rationales"] - pred_a_rationales = result["pred_q_rationales"] - pred_b_rationales = result["pred_t_rationales"] - - 
macro_f1, scores = calc_model_f1(golden_a_rationales, golden_b_rationales, pred_a_rationales, pred_b_rationales) - return macro_f1, len(scores) - - -if __name__ == "__main__": - args = get_args() - macro_f1, num = main(args) - print("total\tnum: %d\tmacro_f1: %.1f" % (num, macro_f1 * 100)) diff --git a/examples/model_interpretation/evaluation/plausibility/run_f1.sh b/examples/model_interpretation/evaluation/plausibility/run_f1.sh deleted file mode 100755 index 8d5bd2e7a9f2..000000000000 --- a/examples/model_interpretation/evaluation/plausibility/run_f1.sh +++ /dev/null @@ -1,34 +0,0 @@ -### - # This script evaluates plausibility of the results generated by our models -### - -TASK=senti -if [[ $TASK == "mrc" ]]; then - MODELS=("roberta_base" "roberta_large") - MODES=("attention" "integrated_gradient") -else - MODELS=("lstm" "roberta_base" "roberta_large") - MODES=("attention" "integrated_gradient" "lime") -fi - -for BASE_MODEL in ${MODELS[*]}; -do - for INTER_MODE in ${MODES[*]}; - do - for LANGUAGE in "ch" "en"; - do - GOLDEN_PATH=../golden/${TASK}_${LANGUAGE}.tsv - PRED_PATH=../../rationale_extraction/evaluation_data/${TASK}/${BASE_MODEL}_${INTER_MODE}_${LANGUAGE} - - SAVE_PATH=res/ - [ -d $SAVE_PATH ] || mkdir -p $SAVE_PATH - - echo $BASE_MODEL$'_'$INTER_MODE$'_'$LANGUAGE - - python3 ./eval_${TASK}.py \ - --language $LANGUAGE \ - --golden_path $GOLDEN_PATH \ - --pred_path $PRED_PATH - done - done -done \ No newline at end of file diff --git a/examples/model_interpretation/imgs/equation1.png b/examples/model_interpretation/imgs/equation1.png deleted file mode 100644 index e1db9780248d..000000000000 Binary files a/examples/model_interpretation/imgs/equation1.png and /dev/null differ diff --git a/examples/model_interpretation/imgs/equation2.png b/examples/model_interpretation/imgs/equation2.png deleted file mode 100644 index fbb26c60e35a..000000000000 Binary files a/examples/model_interpretation/imgs/equation2.png and /dev/null differ diff --git a/examples/model_interpretation/imgs/equation3.png b/examples/model_interpretation/imgs/equation3.png deleted file mode 100644 index bf4f28f7c48a..000000000000 Binary files a/examples/model_interpretation/imgs/equation3.png and /dev/null differ diff --git a/examples/model_interpretation/imgs/equation4.png b/examples/model_interpretation/imgs/equation4.png deleted file mode 100644 index a4743a67deb4..000000000000 Binary files a/examples/model_interpretation/imgs/equation4.png and /dev/null differ diff --git a/examples/model_interpretation/imgs/equation5.png b/examples/model_interpretation/imgs/equation5.png deleted file mode 100644 index 75bbe3be4ad5..000000000000 Binary files a/examples/model_interpretation/imgs/equation5.png and /dev/null differ diff --git a/examples/model_interpretation/imgs/example1.png b/examples/model_interpretation/imgs/example1.png deleted file mode 100644 index f0b7dda4dfef..000000000000 Binary files a/examples/model_interpretation/imgs/example1.png and /dev/null differ diff --git a/examples/model_interpretation/imgs/structure.png b/examples/model_interpretation/imgs/structure.png deleted file mode 100644 index b7573e09ba02..000000000000 Binary files a/examples/model_interpretation/imgs/structure.png and /dev/null differ diff --git a/examples/model_interpretation/punctuations b/examples/model_interpretation/punctuations deleted file mode 100644 index 11d057b89103..000000000000 --- a/examples/model_interpretation/punctuations +++ /dev/null @@ -1,82 +0,0 @@ -” -。 -, -∈ -] -√ - -! -( -≥ -【 -“ -「 -÷ -《 -】 -! -ˊ -」 -.
-_ -@ -~ -– -〕 -∶ -) -’ -℃ -》 -〈 -→ -、 -+ -| -; -: -∠ -' -‘ -, -? -× -△ -- -• -· -— -° -> -′ -● -; -… -" -Ⅱ -/ -< -+ -= -^ -Ⅰ -? -[ -﹑ -﹐ -* -〔 -~ -: -( -) -〉 -◎ -= -- -\ -% -% -& -≠ -. \ No newline at end of file diff --git a/examples/model_interpretation/rationale_extraction/available_gpu.py b/examples/model_interpretation/rationale_extraction/available_gpu.py deleted file mode 100644 index e05ecd3c666a..000000000000 --- a/examples/model_interpretation/rationale_extraction/available_gpu.py +++ /dev/null @@ -1,46 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and limitations under the License. -"""print available_gpu id, using nvgpu -""" - -import logging -import traceback - -import nvgpu - -logging.basicConfig( - level=logging.DEBUG, - format="%(levelname)s: %(asctime)s %(filename)s" " [%(funcName)s:%(lineno)d][%(process)d] %(message)s", - datefmt="%m-%d %H:%M:%S", - filename=None, - filemode="a", -) - -if __name__ == "__main__": - from argparse import ArgumentParser - - try: - arg_parser = ArgumentParser(description="print available_gpu id, using nvgpu") - arg_parser.add_argument("-b", "--best", default=None, type=int, help="output best N") - args = arg_parser.parse_args() - - if args.best is not None: - gpus = sorted(nvgpu.gpu_info(), key=lambda x: (x["mem_used"], x["index"])) - ids = [x["index"] for x in gpus] - print(",".join(ids[: args.best])) - else: - print(",".join(nvgpu.available_gpus())) - - except Exception: - traceback.print_exc() - exit(-1) diff --git a/examples/model_interpretation/rationale_extraction/generate.sh b/examples/model_interpretation/rationale_extraction/generate.sh deleted file mode 100755 index d72b20bda984..000000000000 --- a/examples/model_interpretation/rationale_extraction/generate.sh +++ /dev/null @@ -1,57 +0,0 @@ -TASK=similarity - -if [[ $TASK == "mrc" ]]; then - MODELS=("roberta_base" "roberta_large") - MODES=("attention" "integrated_gradient") -else - MODELS=("roberta_large" "roberta_base" "lstm") - MODES=("lime" "attention" "integrated_gradient") -fi - -for BASE_MODEL in ${MODELS[*]}; -do - for INTER_MODE in ${MODES[*]}; - do - for LANGUAGE in "ch" "en"; - do - if [[ $LANGUAGE == "ch" ]]; then - if [[ $TASK == "senti" ]]; then - RATIO_DIC="[0.311]" - elif [[ $TASK == "similarity" ]]; then - RATIO_DIC="[0.701,0.709]" - elif [[ $TASK == "mrc" ]]; then - RATIO_DIC="[0.096]" - fi - elif [[ $LANGUAGE == "en" ]]; then - if [[ $TASK == "senti" ]]; then - RATIO_DIC="[0.192]" - elif [[ $TASK == "similarity" ]]; then - RATIO_DIC="[0.511,0.505]" - elif [[ $TASK == "mrc" ]]; then - RATIO_DIC="[0.102]" - fi - fi - echo ${BASE_MODEL}_${INTER_MODE}_${LANGUAGE} - - PRED_PATH=../task/${TASK}/output/${TASK}_${LANGUAGE}.${BASE_MODEL}/interpret.${INTER_MODE} - SAVE_PATH=./rationale/${TASK}/${BASE_MODEL}_${INTER_MODE}_${LANGUAGE} - [ -d $SAVE_PATH ] || mkdir -p $SAVE_PATH - - python3 ./newp_text_generate.py \ - --pred_path $PRED_PATH \ - --save_path $SAVE_PATH \ - --task $TASK \ - --language $LANGUAGE \ - --ratio $RATIO_DIC - wait - - sh ./run_2_pred_${TASK}_per.sh $BASE_MODEL $INTER_MODE $LANGUAGE -
wait - - sh ./generate_evaluation_data.sh $BASE_MODEL $INTER_MODE $LANGUAGE $TASK - wait - - echo ${BASE_MODEL}_${INTER_MODE}_${LANGUAGE}_finished - done - done -done diff --git a/examples/model_interpretation/rationale_extraction/generate_evaluation_data.py b/examples/model_interpretation/rationale_extraction/generate_evaluation_data.py deleted file mode 100644 index 162b7fb00f70..000000000000 --- a/examples/model_interpretation/rationale_extraction/generate_evaluation_data.py +++ /dev/null @@ -1,113 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import argparse -import json - - -def get_args(): - parser = argparse.ArgumentParser("generate data") - - parser.add_argument("--pred_path", required=True) - parser.add_argument("--data_dir", required=True) - parser.add_argument("--data_dir2", required=True) - parser.add_argument("--save_path", required=True) - parser.add_argument("--inter_mode", required=True) - parser.add_argument("--base_model", required=True) - parser.add_argument("--language", required=True) - - args = parser.parse_args() - return args - - -def evids_load(path): - evids = [] - with open(path, "r") as f: - for line in f.readlines(): - dic = json.loads(line) - evids.append(dic) - return evids - - -def dataLoad(args): - base_path = args.data_dir + "/" - text_path = base_path + "rationale_text/dev/dev" - text_exclusive_path = base_path + "rationale_exclusive_text/dev/dev" - - with open(text_path, "r") as f_text: - text_dict_list = {} - for line in f_text.readlines(): - line_dict = json.loads(line) - text_dict_list[line_dict["id"]] = line_dict - - with open(text_exclusive_path, "r") as f_exclusive_text: - text_exclusive_dict_list = {} - for line in f_exclusive_text.readlines(): - line_dict = json.loads(line) - text_exclusive_dict_list[line_dict["id"]] = line_dict - - base_path = args.data_dir2 + "/" - text_path = base_path + "rationale_text/dev/dev" - text_exclusive_path = base_path + "rationale_exclusive_text/dev/dev" - - with open(text_path, "r") as f_text: - text_dict_list2 = {} - for line in f_text.readlines(): - line_dict = json.loads(line) - text_dict_list2[line_dict["id"]] = line_dict - - with open(text_exclusive_path, "r") as f_exclusive_text: - text_exclusive_dict_list2 = {} - for line in f_exclusive_text.readlines(): - line_dict = json.loads(line) - text_exclusive_dict_list2[line_dict["id"]] = line_dict - - return text_dict_list, text_exclusive_dict_list, text_dict_list2, text_exclusive_dict_list2 - - -def r_data_generation( - args, evids, text_dict_list, text_exclusive_dict_list, text_dict_list2, text_exclusive_dict_list2 -): - save_path = args.save_path - f_save = open(save_path, "w") - - res_data = [] - for ins in evids: - temp = {} - temp["id"] = ins["id"] - temp["pred_label"] = ins["pred_label"] - temp["rationale"] = text_dict_list2[ins["id"]]["context_idx"] - temp["no_rationale"] = text_exclusive_dict_list2[ins["id"]]["context_idx"] - if len(temp["rationale"]) > 1 and args.inter_mode != 
"lime" and not (args.base_model.startswith("roberta")): - for i in range(len(temp["rationale"][1])): - temp["rationale"][1][i] -= len(temp["rationale"][0]) + len(temp["no_rationale"][0]) - for i in range(len(temp["no_rationale"][1])): - temp["no_rationale"][1][i] -= len(temp["rationale"][0]) + len(temp["no_rationale"][0]) - temp["rationale_pred"] = text_dict_list[ins["id"]]["pred_label"] - temp["no_rationale_pred"] = text_exclusive_dict_list[ins["id"]]["pred_label"] - temp["rationale_token"] = text_dict_list2[ins["id"]]["context_token"] - - res_data.append(temp) - - f_save.write(json.dumps(temp, ensure_ascii=False) + "\n") - f_save.close() - - -if __name__ == "__main__": - args = get_args() - text_dict_list, text_exclusive_dict_list, text_dict_list2, text_exclusive_dict_list2 = dataLoad(args) - evids = evids_load(args.pred_path) - r_data_generation( - args, evids, text_dict_list, text_exclusive_dict_list, text_dict_list2, text_exclusive_dict_list2 - ) diff --git a/examples/model_interpretation/rationale_extraction/generate_evaluation_data.sh b/examples/model_interpretation/rationale_extraction/generate_evaluation_data.sh deleted file mode 100755 index fa26d3beb8f9..000000000000 --- a/examples/model_interpretation/rationale_extraction/generate_evaluation_data.sh +++ /dev/null @@ -1,23 +0,0 @@ -### - # This script concatenates results from previous running to generate a formated result for evaluation use -### - -BASE_MODEL=$1 -INTER_MODE=$2 -LANGUAGE=$3 -TASK=$4 - -PRED_PATH=../task/${TASK}/output/${TASK}_${LANGUAGE}.${BASE_MODEL}/interpret.${INTER_MODE} -SAVE_PATH=./evaluation_data/${TASK}/${BASE_MODEL}_${INTER_MODE}_${LANGUAGE} - -SAVE_DIR=./evaluation_data/${TASK}/ -[ -d $SAVE_DIR ] || mkdir -p $SAVE_DIR - -python3 generate_evaluation_data.py \ - --data_dir ./prediction/${TASK}/${BASE_MODEL}_${INTER_MODE}_${LANGUAGE} \ - --data_dir2 ./rationale/${TASK}/${BASE_MODEL}_${INTER_MODE}_${LANGUAGE} \ - --pred_path $PRED_PATH \ - --save_path $SAVE_PATH \ - --inter_mode $INTER_MODE \ - --base_model $BASE_MODEL \ - --language $LANGUAGE \ No newline at end of file diff --git a/examples/model_interpretation/rationale_extraction/mrc_pred.py b/examples/model_interpretation/rationale_extraction/mrc_pred.py deleted file mode 100644 index 2868c86b1240..000000000000 --- a/examples/model_interpretation/rationale_extraction/mrc_pred.py +++ /dev/null @@ -1,207 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
- -import argparse -import functools -import json -import os -import sys -import time -from pathlib import Path - -import paddle - -from paddlenlp.data import Dict, Pad -from paddlenlp.transformers.roberta.tokenizer import ( - RobertaBPETokenizer, - RobertaTokenizer, -) - -sys.path.append("../task/mrc") -from saliency_map.squad import RCInterpret, compute_prediction # noqa: E402 - -sys.path.append("..") -from roberta.modeling import RobertaForQuestionAnswering # noqa: E402 - -sys.path.remove("..") -sys.path.remove("../task/mrc") -sys.path.append("../..") -from model_interpretation.utils import ( # noqa: E402 - convert_tokenizer_res_to_old_version, -) - -sys.path.remove("../..") - - -def get_args(): - parser = argparse.ArgumentParser("mrc predict with roberta") - parser.add_argument("--base_model", required=True, choices=["roberta_base", "roberta_large"]) - parser.add_argument("--from_pretrained", type=str, required=True, help="pretrained model directory or tag") - parser.add_argument( - "--max_seq_len", type=int, default=128, help="max sentence length, should not be greater than 512" - ) - parser.add_argument("--batch_size", type=int, default=32, help="batch size") - parser.add_argument("--epoch", type=int, default=3, help="epoch") - parser.add_argument("--data_dir", type=str, required=True, help="data directory includes train / develop data") - parser.add_argument("--warmup_proportion", type=float, default=0.1) - parser.add_argument("--lr", type=float, default=5e-5, help="learning rate") - parser.add_argument("--eval", action="store_true") - parser.add_argument("--init_checkpoint", type=str, default=None, help="checkpoint to warm start from") - parser.add_argument("--wd", type=float, default=0.01, help="weight decay, aka L2 regularizer") - parser.add_argument( - "--use_amp", - action="store_true", - help="only activate AMP (auto mixed precision acceleration) on TensorCore compatible devices", - ) - parser.add_argument("--n-samples", type=int, default=25, help="number of samples used for smooth gradient method") - parser.add_argument("--output_dir", type=Path, required=True, help="interpretable output directory") - parser.add_argument( - "--doc_stride", - type=int, - default=128, - help="When splitting up a long document into chunks, how much stride to take between chunks.", - ) - parser.add_argument("--language", type=str, required=True, help="language that the model is based on") - parser.add_argument("--input_data", type=str, required=True) - args = parser.parse_args() - return args - - -def map_fn_DuCheckList(examples, args, tokenizer): - # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results - # in one example possibly giving several features when a context is long, each of those features having a - # context that overlaps a bit the context of the previous feature. - # NOTE: Almost the same functionality as HuggingFace's prepare_train_features function. The main difference is - # that HuggingFace uses ArrowTable as basic data structure, while we use a list of dictionaries instead.
- contexts = [examples[i]["context"] for i in range(len(examples))] - questions = [examples[i]["question"] for i in range(len(examples))] - - tokenized_examples = tokenizer(questions, contexts, stride=args.doc_stride, max_seq_len=args.max_seq_len) - tokenized_examples = convert_tokenizer_res_to_old_version(tokenized_examples) - - # For validation, there is no need to compute start and end positions - for i, tokenized_example in enumerate(tokenized_examples): - # Grab the sequence corresponding to that example (to know what is the context and what is the question). - sequence_ids = tokenized_example["token_type_ids"] - - # One example can give several spans, this is the index of the example containing this span of text. - sample_index = tokenized_example["overflow_to_sample"] - tokenized_examples[i]["example_id"] = examples[sample_index]["id"] - - # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token - # position is part of the context or not. - if args.language == "ch": - tokenized_examples[i]["offset_mapping"] = [ - (o if sequence_ids[k] == 1 else None) for k, o in enumerate(tokenized_example["offset_mapping"]) - ] - else: - n = tokenized_example["offset_mapping"].index((0, 0), 1) + 2 # context start position - m = len(tokenized_example["offset_mapping"]) - 1 # context end position + 1 - tokenized_examples[i]["offset_mapping"] = [ - (o if n <= k <= m else None) for k, o in enumerate(tokenized_example["offset_mapping"]) - ] - - return tokenized_examples - - -def load_data(path): - data = {} - f = open(path, "r") - for line in f.readlines(): - line_split = json.loads(line) - data[line_split["id"]] = line_split - f.close() - return data - - -def init_roberta_var(args): - if args.language == "ch": - tokenizer = RobertaTokenizer.from_pretrained(args.from_pretrained) - else: - tokenizer = RobertaBPETokenizer.from_pretrained(args.from_pretrained) - - model = RobertaForQuestionAnswering.from_pretrained(args.from_pretrained) - map_fn = functools.partial(map_fn_DuCheckList, args=args, tokenizer=tokenizer) - dev_ds = RCInterpret().read(os.path.join(args.data_dir, "dev")) - # dev_ds = load_dataset('squad', splits='dev_v2', data_files=None) - dev_ds.map(map_fn, batched=True) - dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) - batchify_fn = lambda samples, fn=Dict( - { - "input_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id), - "token_type_ids": Pad(axis=0, pad_val=tokenizer.pad_token_type_id), - } - ): fn(samples) - - dev_dataloader = paddle.io.DataLoader( - dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=batchify_fn, return_list=True - ) - - return model, tokenizer, dev_dataloader, dev_ds - - -@paddle.no_grad() -def evaluate(model, data_loader, args): - model.eval() - - all_start_logits = [] - all_end_logits = [] - tic_eval = time.time() - - for batch in data_loader: - input_ids, token_type_ids = batch - loss, start_logits_tensor, end_logits_tensor, cls_logits = model(input_ids, token_type_ids) - for idx in range(start_logits_tensor.shape[0]): - if len(all_start_logits) % 1000 == 0 and len(all_start_logits): - print("Processing example: %d" % len(all_start_logits)) - print("time per 1000:", time.time() - tic_eval) - tic_eval = time.time() - - all_start_logits.append(start_logits_tensor.numpy()[idx]) - all_end_logits.append(end_logits_tensor.numpy()[idx]) - - all_predictions, all_nbest_json, scores_diff_json, all_feature_index = compute_prediction( - data_loader.dataset.data, - 
data_loader.dataset.new_data, - (all_start_logits, all_end_logits), - True, - 20, - args.max_seq_len, - 0.0, - ) - - # Can also write all_nbest_json and scores_diff_json files if needed - with open(os.path.join(args.output_dir, "dev"), "w") as f: - for id in all_predictions: - temp = {} - temp["id"] = int(id) - temp["pred_label"] = all_predictions[id] - temp["pred_feature"] = all_feature_index[id] - f.write(json.dumps(temp, ensure_ascii=False) + "\n") - - -if __name__ == "__main__": - args = get_args() - if args.base_model.startswith("roberta"): - model, tokenizer, dataloader, dev_ds = init_roberta_var(args) - else: - raise ValueError("unsupported base model name.") - - with paddle.amp.auto_cast(enable=args.use_amp): - - sd = paddle.load(args.init_checkpoint) - model.set_dict(sd) - print("load model from %s" % args.init_checkpoint) - - evaluate(model, dataloader, args) diff --git a/examples/model_interpretation/rationale_extraction/newp_text_generate.py b/examples/model_interpretation/rationale_extraction/newp_text_generate.py deleted file mode 100644 index 28e8e98157d8..000000000000 --- a/examples/model_interpretation/rationale_extraction/newp_text_generate.py +++ /dev/null @@ -1,269 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
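As an aside on the `batchify_fn` defined in `init_roberta_var` above: `Dict` routes each field of the sample dicts through its collate function, and `Pad` right-pads every field to the longest sample in the batch. A minimal, self-contained sketch with toy ids, assuming paddlenlp is installed (`pad_val=0` stands in for `tokenizer.pad_token_id`):

```python
from paddlenlp.data import Dict, Pad

samples = [
    {"input_ids": [101, 5, 6, 102], "token_type_ids": [0, 0, 0, 0]},
    {"input_ids": [101, 7, 102], "token_type_ids": [0, 0, 0]},
]
batchify_fn = Dict(
    {
        "input_ids": Pad(axis=0, pad_val=0),
        "token_type_ids": Pad(axis=0, pad_val=0),
    }
)
# Dict returns one collated array per key, in key order.
input_ids, token_type_ids = batchify_fn(samples)
print(input_ids)
# [[101   5   6 102]
#  [101   7 102   0]]
```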
- -import argparse -import json -import math -import os - - -def get_args(): - parser = argparse.ArgumentParser("generate data") - - parser.add_argument("--pred_path", required=True) - parser.add_argument("--save_path", required=True) - parser.add_argument("--language", required=True) - parser.add_argument("--task", required=True) - parser.add_argument("--ratio", type=str, required=True) - - args = parser.parse_args() - return args - - -def evids_load(path): - evids = [] - with open(path, "r") as f: - for line in f.readlines(): - dic = json.loads(line) - evids.append(dic) - return evids - - -def generate_for_senti(args, evid_dict, ratio): - r = {} - ex_r = {} - - label = evid_dict["pred_label"] - char_attri = list(evid_dict["char_attri"].keys()) - length = len(char_attri) - - rationale_ratio = ratio[0] - toprationale_text, toprationale_exclusive_text = [], [] - - keys = [int(x) for x in char_attri[: math.ceil(length * rationale_ratio)]] - keys.sort() - for key in keys: - toprationale_text.append(evid_dict["char_attri"][str(key)][0].strip()) - - keys = [int(x) for x in char_attri[math.ceil(length * rationale_ratio) :]] - keys.sort() - for key in keys: - toprationale_exclusive_text.append(evid_dict["char_attri"][str(key)][0].strip()) - - if args.language == "en": - toprationale_text = " ".join(toprationale_text) - toprationale_exclusive_text = " ".join(toprationale_exclusive_text) - else: - toprationale_text = "".join(toprationale_text) - toprationale_exclusive_text = "".join(toprationale_exclusive_text) - - if len(toprationale_text) == 0: - toprationale_text = "['UNK']" - if len(toprationale_exclusive_text) == 0: - toprationale_exclusive_text = "['UNK']" - - r["id"] = evid_dict["id"] - r["context"] = toprationale_text - r["context_idx"] = [[int(x) for x in char_attri[: math.ceil(length * rationale_ratio)]]] - r["context_token"] = [[evid_dict["char_attri"][x][0] for x in char_attri[: math.ceil(length * rationale_ratio)]]] - r["label"] = label - ex_r["id"] = evid_dict["id"] - ex_r["context"] = toprationale_exclusive_text - ex_r["context_idx"] = [[int(x) for x in char_attri[math.ceil(length * rationale_ratio) :]]] - ex_r["context_token"] = [ - [evid_dict["char_attri"][x][0] for x in char_attri[math.ceil(length * rationale_ratio) :]] - ] - ex_r["label"] = label - return r, ex_r - - -def generate_for_similarity(args, evid_dict, ratio): - r = {} - ex_r = {} - q_rationale_ratio = ratio[0] - t_rationale_ratio = ratio[1] - - label = evid_dict["pred_label"] - # query - q_char_attri = list(evid_dict["query_char_attri"].keys()) - q_length = len(q_char_attri) - - q_topR_Rtext, q_topR_noRtext = [], [] - keys = [int(x) for x in q_char_attri[: math.ceil(q_length * q_rationale_ratio)]] - keys.sort() - for key in keys: - q_topR_Rtext.append(evid_dict["query_char_attri"][str(key)][0].strip()) - - keys = [int(x) for x in q_char_attri[math.ceil(q_length * q_rationale_ratio) :]] - keys.sort() - for key in keys: - q_topR_noRtext.append(evid_dict["query_char_attri"][str(key)][0].strip()) - - if args.language == "ch": - q_topR_Rtext = "".join(q_topR_Rtext) - q_topR_noRtext = "".join(q_topR_noRtext) - else: - q_topR_Rtext = " ".join(q_topR_Rtext) - q_topR_noRtext = " ".join(q_topR_noRtext) - - if len(q_topR_Rtext) == 0: - q_topR_Rtext = "['UNK']" - if len(q_topR_noRtext) == 0: - q_topR_noRtext = "['UNK']" - - # title - t_char_attri = list(evid_dict["title_char_attri"].keys()) - t_length = len(t_char_attri) - - t_topR_Rtext, t_topR_noRtext = [], [] - keys = [int(x) for x in t_char_attri[: math.ceil(t_length * 
t_rationale_ratio)]] - keys.sort() - for key in keys: - t_topR_Rtext.append(evid_dict["title_char_attri"][str(key)][0]) - - keys = [int(x) for x in t_char_attri[math.ceil(t_length * t_rationale_ratio) :]] - keys.sort() - for key in keys: - t_topR_noRtext.append(evid_dict["title_char_attri"][str(key)][0]) - - if args.language == "ch": - t_topR_Rtext = "".join(t_topR_Rtext) - t_topR_noRtext = "".join(t_topR_noRtext) - else: - t_topR_Rtext = " ".join(t_topR_Rtext) - t_topR_noRtext = " ".join(t_topR_noRtext) - - if len(t_topR_Rtext) == 0: - t_topR_Rtext = "['UNK']" - if len(t_topR_noRtext) == 0: - t_topR_noRtext = "['UNK']" - - r["id"] = evid_dict["id"] - r["context"] = [q_topR_Rtext, t_topR_Rtext] - r["context_idx"] = [ - [int(x) for x in q_char_attri[: math.ceil(q_length * q_rationale_ratio)]], - [int(x) for x in t_char_attri[: math.ceil(t_length * t_rationale_ratio)]], - ] - r["context_token"] = [ - [evid_dict["query_char_attri"][x][0] for x in q_char_attri[: math.ceil(q_length * q_rationale_ratio)]], - [evid_dict["title_char_attri"][x][0] for x in t_char_attri[: math.ceil(t_length * t_rationale_ratio)]], - ] - r["label"] = label - ex_r["id"] = evid_dict["id"] - ex_r["context"] = [q_topR_noRtext, t_topR_noRtext] - ex_r["context_idx"] = [ - [int(x) for x in q_char_attri[math.ceil(q_length * q_rationale_ratio) :]], - [int(x) for x in t_char_attri[math.ceil(t_length * t_rationale_ratio) :]], - ] - ex_r["context_token"] = [ - [evid_dict["query_char_attri"][x][0] for x in q_char_attri[math.ceil(q_length * q_rationale_ratio) :]], - [evid_dict["title_char_attri"][x][0] for x in t_char_attri[math.ceil(t_length * t_rationale_ratio) :]], - ] - ex_r["label"] = label - return r, ex_r - - -def generate_for_MRC(args, evid_dict, ratio): - id = evid_dict["id"] - question = evid_dict["question"] - char_attri = list(evid_dict["char_attri"].keys()) - length = len(char_attri) - - rationale_ratio = ratio[0] - toprationale_text, toprationale_exclusive_text = [], [] - keys = [int(x) for x in char_attri[: math.ceil(length * rationale_ratio)]] - keys.sort() - for key in keys: - toprationale_text.append(evid_dict["char_attri"][str(key)][0].strip()) - - keys = [int(x) for x in char_attri[math.ceil(length * rationale_ratio) :]] - keys.sort() - for key in keys: - toprationale_exclusive_text.append(evid_dict["char_attri"][str(key)][0].strip()) - - if args.language == "en": - toprationale_text = " ".join(toprationale_text) - toprationale_exclusive_text = " ".join(toprationale_exclusive_text) - else: - toprationale_text = "".join(toprationale_text) - toprationale_exclusive_text = "".join(toprationale_exclusive_text) - - if len(toprationale_text) == 0: - toprationale_text = "['UNK']" - if len(toprationale_exclusive_text) == 0: - toprationale_exclusive_text = "['UNK']" - - data_R_dict, Rdata_noR_dict = {}, {} - - data_R_dict["id"] = id - data_R_dict["title"] = "" - data_R_dict["context"] = toprationale_text - data_R_dict["question"] = question - data_R_dict["answers"] = [""] - data_R_dict["answer_starts"] = [-1] - data_R_dict["is_impossible"] = False - data_R_dict["context_idx"] = [[int(x) for x in char_attri[: math.ceil(length * rationale_ratio)]]] - data_R_dict["context_token"] = [ - [evid_dict["char_attri"][x][0] for x in char_attri[: math.ceil(length * rationale_ratio)]] - ] - - Rdata_noR_dict["id"] = id - Rdata_noR_dict["title"] = "" - Rdata_noR_dict["context"] = toprationale_exclusive_text - Rdata_noR_dict["question"] = question - Rdata_noR_dict["answers"] = [""] - Rdata_noR_dict["answer_starts"] = [-1] - 
Rdata_noR_dict["is_impossible"] = False - Rdata_noR_dict["context_idx"] = [[int(x) for x in char_attri[math.ceil(length * rationale_ratio) :]]] - Rdata_noR_dict["context_token"] = [ - [evid_dict["char_attri"][x][0] for x in char_attri[math.ceil(length * rationale_ratio) :]] - ] - - return data_R_dict, Rdata_noR_dict - - -def r_text_generation(evids, args): - print("num: {}".format(len(evids))) - - f_rationale_path = os.path.join(args.save_path, "rationale_text/dev") - f_rationale_exclusive_path = os.path.join(args.save_path, "rationale_exclusive_text/dev") - - if not os.path.exists(f_rationale_path): - os.makedirs(f_rationale_path) - if not os.path.exists(f_rationale_exclusive_path): - os.makedirs(f_rationale_exclusive_path) - - f_rationale = open(os.path.join(f_rationale_path, "dev"), "w") - f_rationale_exclusive = open(os.path.join(f_rationale_exclusive_path, "dev"), "w") - - rationale_ratio = json.loads(args.ratio) - for id, evid_dict in enumerate(evids): - if args.task == "senti": - data_R_dict, Rdata_noR_dict = generate_for_senti(args, evid_dict, rationale_ratio) - elif args.task == "similarity": - data_R_dict, Rdata_noR_dict = generate_for_similarity(args, evid_dict, rationale_ratio) - elif args.task == "mrc": - data_R_dict, Rdata_noR_dict = generate_for_MRC(args, evid_dict, rationale_ratio) - f_rationale.write(json.dumps(data_R_dict, ensure_ascii=False) + "\n") - f_rationale_exclusive.write(json.dumps(Rdata_noR_dict, ensure_ascii=False) + "\n") - - f_rationale.close() - f_rationale_exclusive.close() - - -if __name__ == "__main__": - args = get_args() - - evids = evids_load(args.pred_path) - r_text_generation(evids, args) diff --git a/examples/model_interpretation/rationale_extraction/run_2_pred_mrc_per.sh b/examples/model_interpretation/rationale_extraction/run_2_pred_mrc_per.sh deleted file mode 100755 index c672ca7a1b60..000000000000 --- a/examples/model_interpretation/rationale_extraction/run_2_pred_mrc_per.sh +++ /dev/null @@ -1,48 +0,0 @@ -### - # This script generates mrc predictions for texts contains rationales only and contains non-rationales only -### -export CUDA_VISIBLE_DEVICES=`python ./available_gpu.py --best 1` -export PYTHONPATH=./:$PYTHONPATH - -BASE_MODEL=$1 -INTER_MODE=$2 -LANGUAGE=$3 -TASK=mrc - -for RATIONAL_TYPE in "rationale_text" "rationale_exclusive_text"; -do - if [[ $LANGUAGE == "ch" ]]; then - if [[ $BASE_MODEL == "roberta_base" ]]; then - FROM_PRETRAIN=roberta-wwm-ext - CKPT=../task/${TASK}/models/roberta_base_DuReader-Checklist_20211022_095011/ckpt.bin # 3 epoch - - elif [[ $BASE_MODEL == "roberta_large" ]]; then - FROM_PRETRAIN=roberta-wwm-ext-large - CKPT=../task/${TASK}/models/roberta_large_DuReader-Checklist_20211022_095359/ckpt.bin # 3 epoch - fi - elif [[ $LANGUAGE == "en" ]]; then - if [[ $BASE_MODEL == "roberta_base" ]]; then - FROM_PRETRAIN=roberta-base - CKPT=../task/${TASK}/models/roberta_base_squad2_20211113_104225/ckpt.bin - - elif [[ $BASE_MODEL == "roberta_large" ]]; then - FROM_PRETRAIN=roberta-large - CKPT=../task/${TASK}/models/roberta_large_squad2_20211113_111300/ckpt.bin - fi - fi - - OUTPUT=./prediction/${TASK}/${BASE_MODEL}_${INTER_MODE}_${LANGUAGE}/${RATIONAL_TYPE}/dev - [ -d $OUTPUT ] || mkdir -p $OUTPUT - set -x - python3 ./mrc_pred.py \ - --input_data ../data/${TASK}_${LANGUAGE} \ - --base_model $BASE_MODEL \ - --data_dir ./rationale/${TASK}/${BASE_MODEL}_${INTER_MODE}_${LANGUAGE}/${RATIONAL_TYPE}/dev \ - --output_dir $OUTPUT \ - --from_pretrained $FROM_PRETRAIN \ - --batch_size 1 \ - --init_checkpoint $CKPT \ - --n-samples 300 
\ - --doc_stride 128 \ - --language $LANGUAGE -done diff --git a/examples/model_interpretation/rationale_extraction/run_2_pred_senti_per.sh b/examples/model_interpretation/rationale_extraction/run_2_pred_senti_per.sh deleted file mode 100755 index 06dfea7790d8..000000000000 --- a/examples/model_interpretation/rationale_extraction/run_2_pred_senti_per.sh +++ /dev/null @@ -1,62 +0,0 @@ -### - # This script generates sentiment predictions for texts contains rationales only and contains non-rationales only -### - -export CUDA_VISIBLE_DEVICES=`python ./available_gpu.py --best 1` -export PYTHONPATH=./:$PYTHONPATH - -BASE_MODEL=$1 -INTER_MODE=$2 -LANGUAGE=$3 -TASK=senti - -FROM_PRETRAIN='test' -VOCAB_PATH='test' -for RATIONAL_TYPE in "rationale_text" "rationale_exclusive_text"; -do - if [[ $LANGUAGE == "en" ]]; then - - if [[ $BASE_MODEL == "roberta_base" ]]; then - FROM_PRETRAIN=roberta-base - CKPT=../task/${TASK}/pretrained_models/saved_model_en/roberta_base_20220318_185322/model_10000/model_state.pdparams - #CKPT=../../../${TASK}/pretrained_models/saved_model_en/roberta_base_20211206_164443/model_10000/model_state.pdparams - elif [[ $BASE_MODEL == "roberta_large" ]]; then - FROM_PRETRAIN=roberta-large - CKPT=../task/${TASK}/pretrained_models/saved_model_en/roberta_large_20220318_183813/model_4000/model_state.pdparams - #CKPT=../../../${TASK}/pretrained_models/saved_model_en/roberta_large_20211207_174631/model_4000/model_state.pdparams - elif [[ $BASE_MODEL == "lstm" ]]; then - VOCAB_PATH=../task/${TASK}/rnn/vocab.sst2_train - CKPT=../task/${TASK}/rnn/checkpoints_en/final.pdparams - fi - - elif [[ $LANGUAGE == "ch" ]]; then - - if [[ $BASE_MODEL == "roberta_base" ]]; then - FROM_PRETRAIN='roberta-wwm-ext' - CKPT=../task/${TASK}/pretrained_models/saved_model_ch/roberta_base_20220318_155933/model_900/model_state.pdparams - #CKPT=../../../${TASK}/pretrained_models/saved_model_ch/roberta_base_20211206_180737/model_900/model_state.pdparams - elif [[ $BASE_MODEL == "roberta_large" ]]; then - FROM_PRETRAIN='roberta-wwm-ext-large' - CKPT=../task/${TASK}/pretrained_models/saved_model_ch/roberta_large_20220318_170123/model_900/model_state.pdparams - #CKPT=../../../${TASK}/pretrained_models/saved_model_ch/roberta_large_20211207_143351/model_900/model_state.pdparams - elif [[ $BASE_MODEL == "lstm" ]]; then - VOCAB_PATH=../task/${TASK}/rnn/vocab.txt - CKPT=../task/${TASK}/rnn/checkpoints_ch/final.pdparams - fi - fi - - OUTPUT=./prediction/${TASK}/${BASE_MODEL}_${INTER_MODE}_${LANGUAGE}/${RATIONAL_TYPE}/dev - [ -d $OUTPUT ] || mkdir -p $OUTPUT - set -x - python3 ./sentiment_pred.py \ - --base_model $BASE_MODEL \ - --data_dir ./rationale/${TASK}/${BASE_MODEL}_${INTER_MODE}_${LANGUAGE}/${RATIONAL_TYPE}/dev \ - --output_dir $OUTPUT \ - --vocab_path $VOCAB_PATH \ - --from_pretrained $FROM_PRETRAIN \ - --batch_size 1 \ - --init_checkpoint $CKPT \ - --inter_mode $INTER_MODE \ - --n-samples 200 \ - --language $LANGUAGE -done \ No newline at end of file diff --git a/examples/model_interpretation/rationale_extraction/run_2_pred_similarity_per.sh b/examples/model_interpretation/rationale_extraction/run_2_pred_similarity_per.sh deleted file mode 100755 index 9f0fecd865b7..000000000000 --- a/examples/model_interpretation/rationale_extraction/run_2_pred_similarity_per.sh +++ /dev/null @@ -1,54 +0,0 @@ -### - # This script generates textual similarity predictions for texts contains rationales only and contains non-rationales only -### -export CUDA_VISIBLE_DEVICES=`python ./available_gpu.py --best 1` -export 
PYTHONPATH=./:$PYTHONPATH - -BASE_MODEL=$1 -INTER_MODE=$2 -LANGUAGE=$3 -TASK=similarity - -for RATIONAL_TYPE in "rationale_text" "rationale_exclusive_text"; -do - if [[ $LANGUAGE == "en" ]]; then - - if [[ $BASE_MODEL == "roberta_base" ]]; then - FROM_PRETRAIN=roberta-base - CKPT=../task/${TASK}/pretrained_models/saved_model_${LANGUAGE}/roberta_base_20211109_205245/model_54000/model_state.pdparams - elif [[ $BASE_MODEL == "roberta_large" ]]; then - FROM_PRETRAIN=roberta-large - CKPT=../task/${TASK}/pretrained_models/saved_model_${LANGUAGE}/roberta_large_20211109_205649/model_46000/model_state.pdparams - elif [[ $BASE_MODEL == "lstm" ]]; then - FROM_PRETRAIN=../task/${TASK}/skep_ernie_1.0_large_ch - CKPT=../task/${TASK}/simnet/checkpoints_${LANGUAGE}/final.pdparams - fi - - elif [[ $LANGUAGE == "ch" ]]; then - - if [[ $BASE_MODEL == "roberta_base" ]]; then - FROM_PRETRAIN='roberta-wwm-ext' - CKPT=../task/${TASK}/pretrained_models/saved_model_${LANGUAGE}/roberta_base_20211018_104038/model_11400/model_state.pdparams - elif [[ $BASE_MODEL == "roberta_large" ]]; then - FROM_PRETRAIN='roberta-wwm-ext-large' - CKPT=../task/${TASK}/pretrained_models/saved_model_${LANGUAGE}/roberta_large_20211018_152833/model_22000/model_state.pdparams - elif [[ $BASE_MODEL == "lstm" ]]; then - FROM_PRETRAIN='skep_ernie_1.0_large_ch' - CKPT=../task/${TASK}/simnet/checkpoints_${LANGUAGE}/final.pdparams - fi - fi - - OUTPUT=./prediction/${TASK}/${BASE_MODEL}_${INTER_MODE}_${LANGUAGE}/${RATIONAL_TYPE}/dev - [ -d $OUTPUT ] || mkdir -p $OUTPUT - set -x - python3 similarity_pred.py \ - --base_model $BASE_MODEL \ - --data_dir ./rationale/${TASK}/${BASE_MODEL}_${INTER_MODE}_${LANGUAGE}/${RATIONAL_TYPE}/dev \ - --output_dir $OUTPUT \ - --from_pretrained $FROM_PRETRAIN \ - --batch_size 1 \ - --max_seq_len 256 \ - --init_checkpoint $CKPT \ - --inter_mode $INTER_MODE \ - --language $LANGUAGE -done \ No newline at end of file diff --git a/examples/model_interpretation/rationale_extraction/sentiment_pred.py b/examples/model_interpretation/rationale_extraction/sentiment_pred.py deleted file mode 100644 index 4ab1397ed304..000000000000 --- a/examples/model_interpretation/rationale_extraction/sentiment_pred.py +++ /dev/null @@ -1,255 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
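The `generate_for_senti`, `generate_for_similarity`, and `generate_for_MRC` functions in `newp_text_generate.py` above all apply the same split: the keys of `char_attri` appear to be token indices already ordered from most to least important, so the first `math.ceil(length * ratio)` keys form the rationale and the remaining keys its complement, each re-sorted into text order. A minimal sketch of that rule (toy attribution dict; the key ordering is an assumption inferred from the slicing):

```python
import math


# Illustrative only: mirror the rationale / non-rationale split used above.
def split_rationale(char_attri, ratio):
    indices = [int(x) for x in char_attri.keys()]  # assumed importance order
    cut = math.ceil(len(indices) * ratio)
    rationale = sorted(indices[:cut])  # restore original text order
    non_rationale = sorted(indices[cut:])
    return rationale, non_rationale


# Tokens at positions 7, 2, 5, ranked by attribution; ratio 0.4 keeps
# ceil(3 * 0.4) = 2 of them as the rationale.
print(split_rationale({"7": ["great"], "2": ["very"], "5": ["film"]}, 0.4))
# ([2, 7], [5])
```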
-
-import argparse
-import json
-import os
-import sys
-from functools import partial
-from pathlib import Path
-
-import paddle
-from tqdm import tqdm
-
-from paddlenlp.data import Dict, Pad, Stack, Tuple, Vocab
-from paddlenlp.datasets import DatasetBuilder
-from paddlenlp.transformers.roberta.tokenizer import (
-    RobertaBPETokenizer,
-    RobertaTokenizer,
-)
-
-sys.path.append("../task/senti")
-from rnn.model import BiLSTMAttentionModel, SelfInteractiveAttention  # noqa: E402
-from rnn.utils import CharTokenizer, convert_example  # noqa: E402
-
-sys.path.append("..")
-from roberta.modeling import RobertaForSequenceClassification  # noqa: E402
-
-sys.path.remove("..")
-sys.path.remove("../task/senti")
-sys.path.append("../..")
-from model_interpretation.utils import (  # noqa: E402
-    convert_tokenizer_res_to_old_version,
-)
-
-sys.path.remove("../..")
-
-
-def get_args():
-    parser = argparse.ArgumentParser("sentiment analysis prediction")
-
-    parser.add_argument("--base_model", required=True, choices=["roberta_base", "roberta_large", "lstm"])
-    parser.add_argument("--from_pretrained", type=str, required=True, help="pretrained model directory or tag")
-    parser.add_argument(
-        "--max_seq_len", type=int, default=128, help="max sentence length, should not be greater than 512"
-    )
-    parser.add_argument("--batch_size", type=int, default=1, help="batch size")
-    parser.add_argument("--data_dir", type=str, required=True, help="data directory that includes train / develop data")
-    parser.add_argument("--eval", action="store_true")
-
-    parser.add_argument("--init_checkpoint", type=str, default=None, help="checkpoint to warm start from")
-    parser.add_argument("--wd", type=float, default=0.01, help="weight decay, aka L2 regularizer")
-    parser.add_argument(
-        "--use_amp",
-        action="store_true",
-        help="only activate AMP (auto mixed precision acceleration) on TensorCore compatible devices",
-    )
-    parser.add_argument(
-        "--inter_mode",
-        type=str,
-        default="attention",
-        choices=["attention", "simple_gradient", "smooth_gradient", "integrated_gradient", "lime"],
-        help="specify the interpretation mode",
-    )
-    parser.add_argument("--n-samples", type=int, default=25, help="number of samples used for smooth gradient method")
-    parser.add_argument("--output_dir", type=Path, required=True, help="interpretable output directory")
-    parser.add_argument("--start_id", type=int, default=0)
-    parser.add_argument("--vocab_path", type=str)
-    parser.add_argument("--language", type=str, required=True, help="Language that the model is built for")
-    args = parser.parse_args()
-    return args
-
-
-class SentiData(DatasetBuilder):
-    def _read(self, filename, language):
-        with open(filename, "r", encoding="utf8") as f:
-            for line in f.readlines():
-                line_split = json.loads(line)
-                yield {"id": line_split["id"], "context": line_split["context"]}
-
-
-def create_dataloader(dataset, trans_fn=None, mode="train", batch_size=1, batchify_fn=None):
-    """
-    Creates a dataloader.
-
-    Args:
-        dataset(obj:`paddle.io.Dataset`): Dataset instance.
-        trans_fn(obj:`callable`, optional, defaults to `None`): function to convert a data sample to input ids, etc.
-        mode(obj:`str`, optional, defaults to obj:`train`): If mode is 'train', it will shuffle the dataset randomly.
-        batch_size(obj:`int`, optional, defaults to 1): The sample number of a mini-batch.
- batchify_fn(obj:`callable`, optional, defaults to `None`): function to generate mini-batch data by merging - the sample list, None for only stack each fields of sample in axis - 0(same as :attr::`np.stack(..., axis=0)`). - - Returns: - dataloader(obj:`paddle.io.DataLoader`): The dataloader which generates batches. - """ - if trans_fn: - dataset = dataset.map(trans_fn) - - shuffle = True if mode == "train" else False - if mode == "train": - sampler = paddle.io.DistributedBatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle) - else: - sampler = paddle.io.BatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle) - dataloader = paddle.io.DataLoader(dataset, batch_sampler=sampler, collate_fn=batchify_fn) - return dataloader - - -def map_fn_senti(examples, tokenizer, language): - print("load data %d" % len(examples)) - - contexts = [example["context"] for example in examples] - tokenized_examples = tokenizer(contexts, max_seq_len=args.max_seq_len) - tokenized_examples = convert_tokenizer_res_to_old_version(tokenized_examples) - - return tokenized_examples - - -def truncate_offset(seg, start_offset, end_offset): - seg_len = len(seg) - for n in range(len(start_offset) - 1, -1, -1): - if start_offset[n] < seg_len: - end_offset[n] = seg_len - break - start_offset.pop(n) - end_offset.pop(n) - - -def init_lstm_var(args): - vocab = Vocab.load_vocabulary(args.vocab_path, unk_token="[UNK]", pad_token="[PAD]") - tokenizer = CharTokenizer(vocab, args.language, "../punctuations") - padding_idx = vocab.token_to_idx.get("[PAD]", 0) - - trans_fn = partial(convert_example, tokenizer=tokenizer, is_test=True, language=args.language) - - # init attention layer - lstm_hidden_size = 196 - attention = SelfInteractiveAttention(hidden_size=2 * lstm_hidden_size) - model = BiLSTMAttentionModel( - attention_layer=attention, - vocab_size=len(tokenizer.vocab), - lstm_hidden_size=lstm_hidden_size, - num_classes=2, - padding_idx=padding_idx, - ) - - # Reads data and generates mini-batches. 
- dev_ds = SentiData().read(os.path.join(args.data_dir, "dev"), args.language) - batchify_fn = lambda samples, fn=Tuple( - Pad(axis=0, pad_val=padding_idx), # input_ids - Stack(dtype="int64"), # seq len - ): [data for data in fn(samples)] - - dev_loader = create_dataloader( - dev_ds, trans_fn=trans_fn, batch_size=args.batch_size, mode="validation", batchify_fn=batchify_fn - ) - - return model, tokenizer, dev_loader - - -def init_roberta_var(args): - tokenizer = None - if args.language == "ch": - tokenizer = RobertaTokenizer.from_pretrained(args.from_pretrained) - else: - tokenizer = RobertaBPETokenizer.from_pretrained(args.from_pretrained) - model = RobertaForSequenceClassification.from_pretrained( - args.from_pretrained, - hidden_dropout_prob=0, - attention_probs_dropout_prob=0, - dropout=0, - num_labels=2, - name="", - return_inter_score=True, - ) - - map_fn = partial(map_fn_senti, tokenizer=tokenizer, language=args.language) - - dev_ds = SentiData().read(os.path.join(args.data_dir, "dev"), args.language) - dev_ds.map(map_fn, batched=True) - dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) - batchify_fn = lambda samples, fn=Dict( - { - "input_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id), - "token_type_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id), - } - ): fn(samples) - - dataloader = paddle.io.DataLoader( - dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=batchify_fn, return_list=True - ) - - return model, tokenizer, dataloader - - -if __name__ == "__main__": - args = get_args() - if args.base_model.startswith("roberta"): - model, tokenizer, dataloader = init_roberta_var(args) - - elif args.base_model == "lstm": - model, tokenizer, dataloader = init_lstm_var(args) - else: - raise ValueError("unsupported base model name.") - - with paddle.amp.auto_cast(enable=args.use_amp), open(str(args.output_dir) + "/dev", "w") as out_handle: - # Load model - sd = paddle.load(args.init_checkpoint) - model.set_dict(sd) - model.train() # 为了取梯度,加载模型时dropout设为0 - print("load model from %s" % args.init_checkpoint) - - get_sub_word_ids = lambda word: map(str, tokenizer.convert_tokens_to_ids(tokenizer.tokenize(word))) - - for step, d in tqdm(enumerate(dataloader)): - if step + 1 < args.start_id: - continue - - result = {} - if args.base_model.startswith("roberta"): - input_ids, token_type_ids = d - fwd_args = [input_ids, token_type_ids] - fwd_kwargs = {} - - tokens = tokenizer.convert_ids_to_tokens(input_ids[0, 1:-1].tolist()) # list - - elif args.base_model == "lstm": - input_ids, seq_lens = d - fwd_args = [input_ids, seq_lens] - fwd_kwargs = {} - tokens = [tokenizer.vocab.idx_to_token[input_id] for input_id in input_ids.tolist()[0]] - - result["id"] = dataloader.dataset.data[step]["id"] - - probs, atts, embedded = model.forward_interpet(*fwd_args, **fwd_kwargs) - pred_label = paddle.argmax(probs, axis=-1).tolist()[0] - - result["pred_label"] = pred_label - result["probs"] = [float(format(prob, ".5f")) for prob in probs.numpy()[0].tolist()] - if args.language == "en": - result["context"] = tokenizer.convert_tokens_to_string(tokens) - else: - result["context"] = "".join(tokens) - out_handle.write(json.dumps(result, ensure_ascii=False) + "\n") diff --git a/examples/model_interpretation/rationale_extraction/similarity_pred.py b/examples/model_interpretation/rationale_extraction/similarity_pred.py deleted file mode 100644 index c6771189b1ee..000000000000 --- a/examples/model_interpretation/rationale_extraction/similarity_pred.py +++ /dev/null @@ 
-1,229 +0,0 @@
-# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import argparse
-import json
-import os
-import sys
-from functools import partial
-from pathlib import Path
-
-import paddle
-from tqdm import tqdm
-
-from paddlenlp.data import Dict, Pad, Stack, Tuple, Vocab
-from paddlenlp.datasets import DatasetBuilder
-from paddlenlp.transformers.roberta.tokenizer import (
-    RobertaBPETokenizer,
-    RobertaTokenizer,
-)
-
-sys.path.append("..")
-from roberta.modeling import RobertaForSequenceClassification  # noqa: E402
-
-sys.path.remove("..")
-sys.path.append("../task/similarity")
-from simnet.model import SimNet  # noqa: E402
-from simnet.utils import CharTokenizer, preprocess_data  # noqa: E402
-
-sys.path.remove("../task/similarity")
-sys.path.append("../..")
-from model_interpretation.utils import (  # noqa: E402
-    convert_tokenizer_res_to_old_version,
-)
-
-sys.path.remove("../..")
-
-
-def get_args():
-    parser = argparse.ArgumentParser("textual similarity prediction")
-
-    parser.add_argument("--base_model", required=True, choices=["roberta_base", "roberta_large", "lstm"])
-    parser.add_argument("--from_pretrained", type=str, required=True, help="pretrained model directory or tag")
-    parser.add_argument(
-        "--max_seq_len", type=int, default=128, help="max sentence length, should not be greater than 512"
-    )
-    parser.add_argument("--batch_size", type=int, default=1, help="batch size")
-    parser.add_argument("--data_dir", type=str, required=True, help="data directory that includes train / develop data")
-
-    parser.add_argument("--init_checkpoint", type=str, default=None, help="checkpoint to warm start from")
-    parser.add_argument("--wd", type=float, default=0.01, help="weight decay, aka L2 regularizer")
-    parser.add_argument(
-        "--use_amp",
-        action="store_true",
-        help="only activate AMP (auto mixed precision acceleration) on TensorCore compatible devices",
-    )
-    parser.add_argument(
-        "--inter_mode",
-        type=str,
-        default="attention",
-        choices=["attention", "simple_gradient", "smooth_gradient", "integrated_gradient", "lime"],
-        help="specify the interpretation mode",
-    )
-    parser.add_argument("--output_dir", type=Path, required=True, help="interpretable output directory")
-    parser.add_argument("--language", type=str, required=True)
-    args = parser.parse_args()
-    return args
-
-
-class SimilarityData(DatasetBuilder):
-    def _read(self, filename):
-        with open(filename, "r", encoding="utf8") as f:
-            for line in f.readlines():
-                line_split = json.loads(line)
-                if args.language == "ch":
-                    yield {
-                        "id": line_split["id"],
-                        "query": line_split["context"][0],
-                        "title": line_split["context"][1],
-                    }
-                else:
-                    yield {
-                        "id": line_split["id"],
-                        "sentence1": line_split["context"][0],
-                        "sentence2": line_split["context"][1],
-                    }
-
-
-def map_fn_senti(examples, tokenizer):
-    print("load data %d" % len(examples))
-    if args.language == "ch":
-        query = "query"
-        title = "title"
-    else:
-        query = "sentence1"
-        title = "sentence2"
-    queries = [example[query] for example in examples]
titles = [example[title] for example in examples] - tokenized_examples = tokenizer(queries, titles, max_seq_len=args.max_seq_len) - tokenized_examples = convert_tokenizer_res_to_old_version(tokenized_examples) - - return tokenized_examples - - -def init_roberta_var(args): - if args.language == "ch": - tokenizer = RobertaTokenizer.from_pretrained(args.from_pretrained) - else: - tokenizer = RobertaBPETokenizer.from_pretrained(args.from_pretrained) - model = RobertaForSequenceClassification.from_pretrained( - args.from_pretrained, - hidden_dropout_prob=0, - attention_probs_dropout_prob=0, - dropout=0, - num_labels=2, - name="", - return_inter_score=True, - ) - - map_fn = partial(map_fn_senti, tokenizer=tokenizer) - - dev_ds = SimilarityData().read(os.path.join(args.data_dir, "dev")) - dev_ds.map(map_fn, batched=True) - dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) - batchify_fn = lambda samples, fn=Dict( - { - "input_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id), - "token_type_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id), - } - ): fn(samples) - - dataloader = paddle.io.DataLoader( - dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=batchify_fn, return_list=True - ) - - return model, tokenizer, dataloader, dev_ds - - -def init_lstm_var(args): - if args.language == "ch": - vocab = Vocab.load_vocabulary("../task/similarity/simnet/vocab.char", unk_token="[UNK]", pad_token="[PAD]") - else: - vocab = Vocab.load_vocabulary("../task/similarity/simnet/vocab_QQP", unk_token="[UNK]", pad_token="[PAD]") - - tokenizer = CharTokenizer(vocab, args.language, "../punctuations") - model = SimNet(network="lstm", vocab_size=len(vocab), num_classes=2) - - dev_ds = SimilarityData().read(os.path.join(args.data_dir, "dev")) - dev_examples = preprocess_data(dev_ds.data, tokenizer, language=args.language) - batches = [dev_examples[idx : idx + args.batch_size] for idx in range(0, len(dev_examples), args.batch_size)] - - batchify_fn = lambda samples, fn=Tuple( - Pad(axis=0, pad_val=vocab.token_to_idx.get("[PAD]", 0)), # query_ids - Pad(axis=0, pad_val=vocab.token_to_idx.get("[PAD]", 0)), # title_ids - Stack(dtype="int64"), # query_seq_lens - Stack(dtype="int64"), # title_seq_lens - ): [data for data in fn(samples)] - - return model, tokenizer, batches, batchify_fn, vocab, dev_ds - - -if __name__ == "__main__": - args = get_args() - if args.base_model.startswith("roberta"): - model, tokenizer, dataloader, dev_ds = init_roberta_var(args) - - elif args.base_model == "lstm": - model, tokenizer, dataloader, batchify_fn, vocab, dev_ds = init_lstm_var(args) - else: - raise ValueError("unsupported base model name.") - - with paddle.amp.auto_cast(enable=args.use_amp), open(str(args.output_dir) + "/dev", "w") as out_handle: - # Load model - sd = paddle.load(args.init_checkpoint) - model.set_dict(sd) - model.train() # 为了取梯度,加载模型时dropout设为0 - print("load model from %s" % args.init_checkpoint) - - for step, d in tqdm(enumerate(dataloader)): - - result = {} - if args.base_model.startswith("roberta"): - input_ids, token_type_ids = d - fwd_args = [input_ids, token_type_ids] - fwd_kwargs = {} - - SEP_idx = input_ids.tolist()[0].index(tokenizer.sep_token_id) - q_tokens = tokenizer.convert_ids_to_tokens(input_ids[0, 1:SEP_idx].tolist()) # list - if args.language == "ch": - t_tokens = tokenizer.convert_ids_to_tokens(input_ids[0, SEP_idx + 1 : -1].tolist()) # list - else: - t_tokens = tokenizer.convert_ids_to_tokens(input_ids[0, SEP_idx + 2 : -1].tolist()) # list - - elif 
args.base_model == "lstm": - query_ids, title_ids, query_seq_lens, title_seq_lens = batchify_fn(d) - query_ids = paddle.to_tensor(query_ids) - title_ids = paddle.to_tensor(title_ids) - query_seq_lens = paddle.to_tensor(query_seq_lens) - title_seq_lens = paddle.to_tensor(title_seq_lens) - - fwd_args = [query_ids, title_ids, query_seq_lens, title_seq_lens] - fwd_kwargs = {} - q_tokens = [vocab._idx_to_token[idx] for idx in query_ids.tolist()[0]] - t_tokens = [vocab._idx_to_token[idx] for idx in title_ids.tolist()[0]] - - result["id"] = dev_ds.data[step]["id"] - - probs, atts, embedded = model.forward_interpret(*fwd_args, **fwd_kwargs) - pred_label = paddle.argmax(probs, axis=-1).tolist()[0] - - result["pred_label"] = pred_label - result["probs"] = [float(format(prob, ".5f")) for prob in probs.numpy()[0].tolist()] - if args.language == "ch": - result["query"] = "".join(q_tokens) - result["title"] = "".join(t_tokens) - else: - result["query"] = tokenizer.convert_tokens_to_string(q_tokens) - result["title"] = tokenizer.convert_tokens_to_string(t_tokens) - - out_handle.write(json.dumps(result, ensure_ascii=False) + "\n") diff --git a/examples/model_interpretation/requirements.txt b/examples/model_interpretation/requirements.txt deleted file mode 100644 index 6a6e0abed457..000000000000 --- a/examples/model_interpretation/requirements.txt +++ /dev/null @@ -1,5 +0,0 @@ -nvgpu>=0.9.0 -regex>=2021.11.10 -spacy>=2.3.7 -tqdm>=4.62.3 -visualdl>=2.2.2 diff --git a/examples/model_interpretation/task/README.md b/examples/model_interpretation/task/README.md deleted file mode 100644 index 03f1edca0dc8..000000000000 --- a/examples/model_interpretation/task/README.md +++ /dev/null @@ -1,19 +0,0 @@ -### 基线模型预测 -#### 情感分析: - 预测:model_interpretation/rationale_extraction/sentiment_pred.py - 参数设置参考:model_interpretation/rationale_extraction/run_2_pred_senti_per.sh (参数涉及模型、文件等路径,以及语言的,请根据实际情况进行修改) -#### 文本相似度: - 预测:model_interpretation/rationale_extraction/similarity_pred.py - 参数设置参考:model_interpretation/rationale_extraction/run_2_pred_similarity_per.sh(参数涉及模型、文件等路径,以及语言的,请根据实际情况进行修改) -#### 阅读理解: - 预测:model_interpretation/rationale_extraction/mrc_pred.py - 参数设置参考:model_interpretation/rationale_extraction/run_2_pred_mrc_per.sh(参数涉及模型、文件等路径,以及语言的,请根据实际情况进行修改) -### 三个任务的基线模型训练 -#### 情感分析 - RoBERTa:model_interpretation/task/senti/pretrained_models/run_train.sh - LSTM:model_interpretation/task/senti/rnn/lstm_train.sh -#### 文本相似度 - RoBERTa:model_interpretation/task/similarity/pretrained_models/run_train_pointwise.sh - LSTM:model_interpretation/task/similarity/simnet/lstm_train.sh -#### 阅读理解 - RoBERTa:model_interpretation/task/mrc/run_train_rc.sh diff --git a/examples/model_interpretation/task/mrc/roberta/modeling.py b/examples/model_interpretation/task/mrc/roberta/modeling.py deleted file mode 100644 index 4b376e43dabc..000000000000 --- a/examples/model_interpretation/task/mrc/roberta/modeling.py +++ /dev/null @@ -1,719 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
-# See the License for the specific language governing permissions and -# limitations under the License. - -import sys - -import paddle -import paddle.nn as nn - -from paddlenlp.transformers.model_utils import PretrainedModel, register_base_model - -sys.path.append("../..") -from task.transformer import TransformerEncoder, TransformerEncoderLayer # noqa: E402 - -sys.path.remove("../..") - -__all__ = [ - "RobertaModel", - "RobertaPretrainedModel", - "RobertaForSequenceClassification", - "RobertaForTokenClassification", - "RobertaForQuestionAnswering", -] - - -class RobertaEmbeddings(nn.Layer): - r""" - Include embeddings from word, position and token_type embeddings. - """ - - def __init__( - self, - vocab_size, - hidden_size=768, - hidden_dropout_prob=0.1, - max_position_embeddings=512, - type_vocab_size=16, - pad_token_id=0, - ): - super(RobertaEmbeddings, self).__init__() - self.word_embeddings = nn.Embedding(vocab_size, hidden_size, padding_idx=pad_token_id) - self.position_embeddings = nn.Embedding(max_position_embeddings, hidden_size) - self.token_type_embeddings = nn.Embedding(type_vocab_size, hidden_size) - self.layer_norm = nn.LayerNorm(hidden_size) - self.dropout = nn.Dropout(hidden_dropout_prob) - - def forward(self, input_ids, token_type_ids=None, position_ids=None): - if position_ids is None: - # maybe need use shape op to unify static graph and dynamic graph - ones = paddle.ones_like(input_ids, dtype="int64") - seq_length = paddle.cumsum(ones, axis=-1) - position_ids = seq_length - ones - position_ids.stop_gradient = True - if token_type_ids is None: - token_type_ids = paddle.zeros_like(input_ids, dtype="int64") - - input_embedings = self.word_embeddings(input_ids) - position_embeddings = self.position_embeddings(position_ids) - token_type_embeddings = self.token_type_embeddings(token_type_ids) - - embeddings = input_embedings + position_embeddings + token_type_embeddings - embeddings = self.layer_norm(embeddings) - embeddings = self.dropout(embeddings) - return embeddings - - -class RobertaPooler(nn.Layer): - def __init__(self, hidden_size): - super(RobertaPooler, self).__init__() - self.dense = nn.Linear(hidden_size, hidden_size) - self.activation = nn.Tanh() - - def forward(self, hidden_states): - # We "pool" the model by simply taking the hidden state corresponding - # to the first token. - first_token_tensor = hidden_states[:, 0] - pooled_output = self.dense(first_token_tensor) - pooled_output = self.activation(pooled_output) - return pooled_output - - -class RobertaPretrainedModel(PretrainedModel): - r""" - An abstract class for pretrained RoBerta models. It provides RoBerta related - `model_config_file`, `pretrained_resource_files_map`, `resource_files_names`, - `pretrained_init_configuration`, `base_model_prefix` for downloading and - loading pretrained models. - Refer to :class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more details. 
- - """ - - model_config_file = "model_config.json" - pretrained_init_configuration = { - "roberta-wwm-ext": { - "attention_probs_dropout_prob": 0.1, - "hidden_act": "gelu", - "hidden_dropout_prob": 0.1, - "hidden_size": 768, - "initializer_range": 0.02, - "intermediate_size": 3072, - "max_position_embeddings": 512, - "num_attention_heads": 12, - "num_hidden_layers": 12, - "type_vocab_size": 2, - "vocab_size": 21128, - "pad_token_id": 0, - }, - "roberta-wwm-ext-large": { - "attention_probs_dropout_prob": 0.1, - "hidden_act": "gelu", - "hidden_dropout_prob": 0.1, - "hidden_size": 1024, - "initializer_range": 0.02, - "intermediate_size": 4096, - "max_position_embeddings": 512, - "num_attention_heads": 16, - "num_hidden_layers": 24, - "type_vocab_size": 2, - "vocab_size": 21128, - "pad_token_id": 0, - }, - "rbt3": { - "attention_probs_dropout_prob": 0.1, - "hidden_act": "gelu", - "hidden_dropout_prob": 0.1, - "hidden_size": 768, - "initializer_range": 0.02, - "intermediate_size": 3072, - "max_position_embeddings": 512, - "num_attention_heads": 12, - "num_hidden_layers": 3, - "type_vocab_size": 2, - "vocab_size": 21128, - "pad_token_id": 0, - }, - "rbtl3": { - "attention_probs_dropout_prob": 0.1, - "hidden_act": "gelu", - "hidden_dropout_prob": 0.1, - "hidden_size": 1024, - "initializer_range": 0.02, - "intermediate_size": 4096, - "max_position_embeddings": 512, - "num_attention_heads": 16, - "num_hidden_layers": 3, - "type_vocab_size": 2, - "vocab_size": 21128, - "pad_token_id": 0, - }, - } - resource_files_names = {"model_state": "model_state.pdparams"} - pretrained_resource_files_map = { - "model_state": { - "roberta-wwm-ext": "https://paddlenlp.bj.bcebos.com/models/transformers/roberta_base/roberta_chn_base.pdparams", - "roberta-wwm-ext-large": "https://paddlenlp.bj.bcebos.com/models/transformers/roberta_large/roberta_chn_large.pdparams", - "rbt3": "https://paddlenlp.bj.bcebos.com/models/transformers/rbt3/rbt3_chn_large.pdparams", - "rbtl3": "https://paddlenlp.bj.bcebos.com/models/transformers/rbtl3/rbtl3_chn_large.pdparams", - } - } - base_model_prefix = "roberta" - - def _init_weights(self, layer): - """Initialization hook""" - if isinstance(layer, (nn.Linear, nn.Embedding)): - # only support dygraph, use truncated_normal and make it inplace - # and configurable later - layer.weight.set_value( - paddle.tensor.normal( - mean=0.0, - std=self.initializer_range - if hasattr(self, "initializer_range") - else self.roberta.config["initializer_range"], - shape=layer.weight.shape, - ) - ) - elif isinstance(layer, nn.LayerNorm): - layer._epsilon = 1e-12 - - -@register_base_model -class RobertaModel(RobertaPretrainedModel): - r""" - The bare Roberta Model outputting raw hidden-states. - - This model inherits from :class:`~paddlenlp.transformers.model_utils.PretrainedModel`. - Refer to the superclass documentation for the generic methods. - - This model is also a Paddle `paddle.nn.Layer `__ subclass. Use it as a regular Paddle Layer - and refer to the Paddle documentation for all matter related to general usage and behavior. - - Args: - vocab_size (int): - Vocabulary size of `inputs_ids` in `RobertaModel`. Also is the vocab size of token embedding matrix. - Defines the number of different tokens that can be represented by the `inputs_ids` passed when calling `RobertaModel`. - hidden_size (int, optional): - Dimensionality of the embedding layer, encoder layers and pooler layer. Defaults to `768`. - num_hidden_layers (int, optional): - Number of hidden layers in the Transformer encoder. 
Defaults to `12`. - num_attention_heads (int, optional): - Number of attention heads for each attention layer in the Transformer encoder. - Defaults to `12`. - intermediate_size (int, optional): - Dimensionality of the feed-forward (ff) layer in the encoder. Input tensors - to ff layers are firstly projected from `hidden_size` to `intermediate_size`, - and then projected back to `hidden_size`. Typically `intermediate_size` is larger than `hidden_size`. - Defaults to `3072`. - hidden_act (str, optional): - The non-linear activation function in the feed-forward layer. - ``"gelu"``, ``"relu"`` and any other paddle supported activation functions - are supported. Defaults to ``"gelu"``. - hidden_dropout_prob (float, optional): - The dropout probability for all fully connected layers in the embeddings and encoder. - Defaults to `0.1`. - attention_probs_dropout_prob (float, optional): - The dropout probability used in MultiHeadAttention in all encoder layers to drop some attention target. - Defaults to `0.1`. - max_position_embeddings (int, optional): - The maximum value of the dimensionality of position encoding, which dictates the maximum supported length of an input - sequence. Defaults to `512`. - type_vocab_size (int, optional): - The vocabulary size of the `token_type_ids` passed when calling `~transformers.RobertaModel`. - Defaults to `2`. - initializer_range (float, optional): - The standard deviation of the normal initializer. Defaults to 0.02. - - .. note:: - A normal_initializer initializes weight matrices as normal distributions. - See :meth:`RobertaPretrainedModel._init_weights()` for how weights are initialized in `RobertaModel`. - - pad_token_id(int, optional): - The index of padding token in the token vocabulary. - Defaults to `0`. - """ - - def __init__( - self, - vocab_size, - hidden_size=768, - num_hidden_layers=12, - num_attention_heads=12, - intermediate_size=3072, - hidden_act="gelu", - hidden_dropout_prob=0.1, - attention_probs_dropout_prob=0.1, - max_position_embeddings=512, - type_vocab_size=16, - initializer_range=0.01, - layer_norm_eps=1e-12, - pad_token_id=0, - ): - super(RobertaModel, self).__init__() - self.pad_token_id = pad_token_id - self.initializer_range = initializer_range - self.embeddings = RobertaEmbeddings( - vocab_size, hidden_size, hidden_dropout_prob, max_position_embeddings, type_vocab_size, pad_token_id - ) - encoder_layer = TransformerEncoderLayer( - hidden_size, - num_attention_heads, - intermediate_size, - dropout=hidden_dropout_prob, - activation=hidden_act, - attn_dropout=attention_probs_dropout_prob, - act_dropout=0, - ) - self.encoder = TransformerEncoder(encoder_layer, num_hidden_layers) - self.pooler = RobertaPooler(hidden_size) - - def forward( - self, - input_ids, - token_type_ids=None, - position_ids=None, - attention_mask=None, - noise=None, - i=None, - n_samples=None, - ): - r""" - Args: - input_ids (Tensor): - Indices of input sequence tokens in the vocabulary. They are - numerical representations of tokens that build the input sequence. - It's data type should be `int64` and has a shape of [batch_size, sequence_length]. - token_type_ids (Tensor, optional): - Segment token indices to indicate first and second portions of the inputs. - Indices can be either 0 or 1: - - - 0 corresponds to a **sentence A** token, - - 1 corresponds to a **sentence B** token. - - It's data type should be `int64` and has a shape of [batch_size, sequence_length]. - Defaults to None, which means no segment embeddings is added to token embeddings. 
- position_ids (Tensor, optional): - Indices of positions of each input sequence tokens in the position embeddings. - Selected in the range ``[0, max_position_embeddings - 1]``. - It's data type should be `int64` and has a shape of [batch_size, sequence_length]. - Defaults to `None`. - attention_mask (Tensor, optional): - Mask used in multi-head attention to avoid performing attention to some unwanted positions, - usually the paddings or the subsequent positions. - Its data type can be int, float and bool. - When the data type is bool, the `masked` tokens have `False` values and the others have `True` values. - When the data type is int, the `masked` tokens have `0` values and the others have `1` values. - When the data type is float, the `masked` tokens have `-INF` values and the others have `0` values. - It is a tensor with shape broadcasted to `[batch_size, num_attention_heads, sequence_length, sequence_length]`. - For example, its shape can be [batch_size, sequence_length], [batch_size, sequence_length, sequence_length], - [batch_size, num_attention_heads, sequence_length, sequence_length]. - Defaults to `None`, which means nothing needed to be prevented attention to. - - Returns: - tuple: Returns tuple (`sequence_output`, `pooled_output`). - - With the fields: - - - sequence_output (Tensor): - Sequence of hidden-states at the last layer of the model. - It's data type should be float32 and its shape is [batch_size, sequence_length, hidden_size]. - - - pooled_output (Tensor): - The output of first token (`[CLS]`) in sequence. - We "pool" the model by simply taking the hidden state corresponding to the first token. - Its data type should be float32 and its shape is [batch_size, hidden_size]. - - Example: - .. code-block:: - - import paddle - from paddlenlp.transformers import RobertaModel, RobertaTokenizer - - tokenizer = RobertaTokenizer.from_pretrained('roberta-wwm-ext') - model = RobertaModel.from_pretrained('roberta-wwm-ext') - - inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!") - inputs = {k:paddle.to_tensor([v]) for (k, v) in inputs.items()} - sequence_output, pooled_output = model(**inputs) - - """ - if attention_mask is None: - attention_mask = paddle.unsqueeze( - (input_ids == self.pad_token_id).astype(self.pooler.dense.weight.dtype) * -1e9, axis=[1, 2] - ) - # CLS: 101; SEP: 102; PAD: 0 - baseline_ids = paddle.to_tensor( - [101] + [0] * (input_ids.shape[1] - 2) + [102], - dtype=input_ids.dtype, - place=input_ids.place, - stop_gradient=input_ids.stop_gradient, - ) - - embedding_output = self.embeddings( - input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids - ) - baseline_embedding_output = self.embeddings( - input_ids=baseline_ids, position_ids=position_ids, token_type_ids=token_type_ids - ) - - if noise is not None: - if noise.upper() == "GAUSSIAN": - pass - # stdev_spread = 0.15 - # stdev = stdev_spread * (orig_embedded.max() - orig_embedded.min()).numpy() - # noise = paddle.to_tensor(np.random.normal(0, stdev, orig_embedded.shape).astype(np.float32), - # stop_gradient=False) - # orig_embedded = orig_embedded + noise - if noise.upper() == "INTEGRATED": - embedding_output = baseline_embedding_output + i / (n_samples - 1) * ( - embedding_output - baseline_embedding_output - ) - else: - raise ValueError("unsupported noise method: %s" % (noise)) - - # encoder_outputs = self.encoder(embedding_output, attention_mask) - encoder_outputs, att_weights_list = self.encoder(embedding_output, attention_mask) # interpret - sequence_output = 
encoder_outputs - pooled_output = self.pooler(sequence_output) - return sequence_output, pooled_output, att_weights_list, embedding_output - - -class RobertaForQuestionAnswering(RobertaPretrainedModel): - r""" - Roberta Model with a linear layer on top of the hidden-states output to - compute `span_start_logits` and `span_end_logits`, designed for question-answering tasks like SQuAD. - - Args: - roberta (:class:`RobertaModel`): - An instance of RobertaModel. - dropout (float, optional): - The dropout probability for output of Roberta. - If None, use the same value as `hidden_dropout_prob` of `RobertaModel` - instance `roberta`. Defaults to `None`. - """ - - def __init__(self, roberta, dropout=None): - super(RobertaForQuestionAnswering, self).__init__() - self.roberta = roberta # allow roberta to be config - self.classifier = nn.Linear(self.roberta.config["hidden_size"], 2) - self.classifier_cls = nn.Linear(self.roberta.config["hidden_size"], 2) - self.criterion = CrossEntropyLossForChecklist() - - # def forward(self, input_ids, token_type_ids=None): - def forward(self, *args, **kwargs): - r""" - Args: - input_ids (Tensor): - See :class:`RobertaModel`. - token_type_ids (Tensor, optional): - See :class:`RobertaModel`. - position_ids (Tensor, optional): - See :class:`RobertaModel`. - attention_mask (Tensor, optional): - See :class:`RobertaModel`. - - Returns: - tuple: Returns tuple (`start_logits`, `end_logits`). - - With the fields: - - - `start_logits` (Tensor): - A tensor of the input token classification logits, indicates the start position of the labelled span. - Its data type should be float32 and its shape is [batch_size, sequence_length]. - - - `end_logits` (Tensor): - A tensor of the input token classification logits, indicates the end position of the labelled span. - Its data type should be float32 and its shape is [batch_size, sequence_length]. - - Example: - .. code-block:: - - import paddle - from paddlenlp.transformers import RobertaForSequenceClassification, RobertaTokenizer - - tokenizer = RobertaTokenizer.from_pretrained('roberta-wwm-ext') - model = RobertaForSequenceClassification.from_pretrained('roberta-wwm-ext') - - inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!") - inputs = {k:paddle.to_tensor([v]) for (k, v) in inputs.items()} - logits = model(**inputs) - - """ - start_pos = kwargs.pop("start_pos", None) - end_pos = kwargs.pop("end_pos", None) - cls_label = kwargs.pop("labels", None) - - # sequence_output, pooled_output, _, _ = self.roberta( - # input_ids, - # token_type_ids=token_type_ids, - # position_ids=None, - # attention_mask=None) - # print(kwargs) - sequence_output, pooled_output, _, _ = self.roberta(*args, **kwargs) - - logits = self.classifier(sequence_output) # (bsz, seq, 2) - logits = paddle.transpose(logits, perm=[2, 0, 1]) # (2, bsz, seq) - start_logits, end_logits = paddle.unstack(x=logits, axis=0) - cls_logits = self.classifier_cls(pooled_output) - - if start_pos is not None and end_pos is not None: - if len(start_pos.shape) != 1: - start_pos = start_pos.squeeze() - if len(end_pos.shape) != 1: - end_pos = end_pos.squeeze() - loss = self.criterion((start_logits, end_logits, cls_logits), (start_pos, end_pos, cls_label)) - else: - loss = None - - # return start_logit, end_logits - return loss, start_logits, end_logits, cls_logits - - def forward_interpret(self, *args, **kwargs): - r""" - Args: - input_ids (Tensor): - See :class:`RobertaModel`. - token_type_ids (Tensor, optional): - See :class:`RobertaModel`. 
- position_ids (Tensor, optional): - See :class:`RobertaModel`. - attention_mask (Tensor, optional): - See :class:`RobertaModel`. - - Returns: - tuple: Returns tuple (`start_logits`, `end_logits`). - - With the fields: - - - `start_logits` (Tensor): - A tensor of the input token classification logits, indicates the start position of the labelled span. - Its data type should be float32 and its shape is [batch_size, sequence_length]. - - - `end_logits` (Tensor): - A tensor of the input token classification logits, indicates the end position of the labelled span. - Its data type should be float32 and its shape is [batch_size, sequence_length]. - - Example: - .. code-block:: - - import paddle - from paddlenlp.transformers import RobertaForSequenceClassification, RobertaTokenizer - - tokenizer = RobertaTokenizer.from_pretrained('roberta-wwm-ext') - model = RobertaForSequenceClassification.from_pretrained('roberta-wwm-ext') - - inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!") - inputs = {k:paddle.to_tensor([v]) for (k, v) in inputs.items()} - logits = model(**inputs) - - """ - start_pos = kwargs.pop("start_pos", None) - end_pos = kwargs.pop("end_pos", None) - cls_label = kwargs.pop("labels", None) - - # sequence_output, pooled_output, _, _ = self.roberta( - # input_ids, - # token_type_ids=token_type_ids, - # position_ids=None, - # attention_mask=None) - # print(kwargs) - sequence_output, pooled_output, att_weights_list, embedding_output = self.roberta(*args, **kwargs) - - logits = self.classifier(sequence_output) # (bsz, seq, 2) - logits = paddle.transpose(logits, perm=[2, 0, 1]) # (2, bsz, seq) - start_logits, end_logits = paddle.unstack(x=logits, axis=0) - cls_logits = self.classifier_cls(pooled_output) - - if start_pos is not None and end_pos is not None: - if len(start_pos.shape) != 1: - start_pos = start_pos.squeeze() - if len(end_pos.shape) != 1: - end_pos = end_pos.squeeze() - loss = self.criterion((start_logits, end_logits, cls_logits), (start_pos, end_pos, cls_label)) - else: - loss = None - - # return start_logit, end_logits - return loss, start_logits, end_logits, cls_logits, att_weights_list, embedding_output - - -class CrossEntropyLossForChecklist(nn.Layer): - def __init__(self): - super(CrossEntropyLossForChecklist, self).__init__() - - def forward(self, y, label): - start_logits, end_logits, cls_logits = y # [(bsz, seq), (bsz, seq), (bsz, 2)] - start_position, end_position, answerable_label = label # [(bsz), (bsz), (bsz)] - - start_position = paddle.unsqueeze(start_position, axis=-1) - end_position = paddle.unsqueeze(end_position, axis=-1) - answerable_label = paddle.unsqueeze(answerable_label, axis=-1) - - start_loss = nn.functional.cross_entropy(input=start_logits, label=start_position, soft_label=False) - end_loss = nn.functional.cross_entropy(input=end_logits, label=end_position, soft_label=False) - cls_loss = nn.functional.cross_entropy(input=cls_logits, label=answerable_label, soft_label=False) - - mrc_loss = (start_loss + end_loss) / 2 - loss = (mrc_loss + cls_loss) / 2 - return loss - - -class RobertaForSequenceClassification(RobertaPretrainedModel): - r""" - Roberta Model with a linear layer on top of the output layer, - designed for sequence classification/regression tasks like GLUE tasks. - - Args: - roberta (:class:`RobertaModel`): - An instance of `RobertaModel`. - num_classes (int, optional): - The number of classes. Defaults to `2`. - dropout (float, optional): - The dropout probability for output of Roberta. 
- If None, use the same value as `hidden_dropout_prob` - of `RobertaModel` instance `roberta`. Defaults to `None`. - """ - - def __init__(self, roberta, num_classes=2, dropout=None): - super(RobertaForSequenceClassification, self).__init__() - self.num_classes = num_classes - self.roberta = roberta # allow roberta to be config - self.dropout = nn.Dropout(dropout if dropout is not None else self.roberta.config["hidden_dropout_prob"]) - self.classifier = nn.Linear(self.roberta.config["hidden_size"], num_classes) - self.softmax = nn.Softmax() - - def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): - r""" - Args: - input_ids (Tensor): - See :class:`RobertaModel`. - token_type_ids (Tensor, optional): - See :class:`RobertaModel`. - position_ids (Tensor, optional): - See :class:`RobertaModel`. - attention_mask (Tensor, optional): - See :class:`RobertaModel`. - - Returns: - Tensor: Returns tensor `logits`, a tensor of the input text classification logits. - Its data type should be float32 and it has a shape of [batch_size, num_classes]. - - Example: - .. code-block:: - - import paddle - from paddlenlp.transformers import RobertaForSequenceClassification, RobertaTokenizer - - tokenizer = RobertaTokenizer.from_pretrained('roberta-wwm-ext') - model = RobertaForSequenceClassification.from_pretrained('roberta-wwm-ext') - - inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!") - inputs = {k:paddle.to_tensor([v]) for (k, v) in inputs.items()} - logits = model(**inputs) - - """ - _, pooled_output, _, _ = self.roberta( - input_ids, token_type_ids=token_type_ids, position_ids=position_ids, attention_mask=attention_mask - ) - - pooled_output = self.dropout(pooled_output) - logits = self.classifier(pooled_output) - return logits - - def forward_interpet( - self, - input_ids, - token_type_ids=None, - position_ids=None, - attention_mask=None, - noise=None, - i=None, - n_samples=None, - ): - _, pooled_output, att_weights_list, embedding_output = self.roberta( - input_ids, - token_type_ids=token_type_ids, - position_ids=position_ids, - attention_mask=attention_mask, - noise=noise, - i=i, - n_samples=n_samples, - ) - - pooled_output = self.dropout(pooled_output) - logits = self.classifier(pooled_output) - probs = self.softmax(logits) - - return probs, att_weights_list, embedding_output - - -class RobertaForTokenClassification(RobertaPretrainedModel): - r""" - Roberta Model with a linear layer on top of the hidden-states output layer, - designed for token classification tasks like NER tasks. - - Args: - roberta (:class:`RobertaModel`): - An instance of `RobertaModel`. - num_classes (int, optional): - The number of classes. Defaults to `2`. - dropout (float, optional): - The dropout probability for output of Roberta. - If None, use the same value as `hidden_dropout_prob` - of `RobertaModel` instance `roberta`. Defaults to `None`. - """ - - def __init__(self, roberta, num_classes=2, dropout=None): - super(RobertaForTokenClassification, self).__init__() - self.num_classes = num_classes - self.roberta = roberta # allow roberta to be config - self.dropout = nn.Dropout(dropout if dropout is not None else self.roberta.config["hidden_dropout_prob"]) - self.classifier = nn.Linear(self.roberta.config["hidden_size"], num_classes) - - def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): - r""" - Args: - input_ids (Tensor): - See :class:`RobertaModel`. - token_type_ids (Tensor, optional): - See :class:`RobertaModel`. 
- position_ids (Tensor, optional):
- See :class:`RobertaModel`.
- attention_mask (Tensor, optional):
- See :class:`RobertaModel`.
-
- Returns:
- Tensor: Returns tensor `logits`, a tensor of the input token classification logits.
- Shape as `[batch_size, sequence_length, num_classes]` and dtype as `float32`.
-
- Example:
- .. code-block::
-
- import paddle
- from paddlenlp.transformers import RobertaForTokenClassification, RobertaTokenizer
-
- tokenizer = RobertaTokenizer.from_pretrained('roberta-wwm-ext')
- model = RobertaForTokenClassification.from_pretrained('roberta-wwm-ext')
-
- inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!")
- inputs = {k:paddle.to_tensor([v]) for (k, v) in inputs.items()}
- logits = model(**inputs)
-
- """
- # RobertaModel returns (sequence_output, pooled_output, att_weights_list, embedding_output);
- # only the sequence output is needed here.
- sequence_output, _, _, _ = self.roberta(
- input_ids, token_type_ids=token_type_ids, position_ids=position_ids, attention_mask=attention_mask
- )
-
- sequence_output = self.dropout(sequence_output)
- logits = self.classifier(sequence_output)
- return logits
diff --git a/examples/model_interpretation/task/mrc/run_1_predict_rc.sh b/examples/model_interpretation/task/mrc/run_1_predict_rc.sh
deleted file mode 100755
index 1039a2e08546..000000000000
--- a/examples/model_interpretation/task/mrc/run_1_predict_rc.sh
+++ /dev/null
@@ -1,51 +0,0 @@
-###
- # This file contains the script to run prediction of a specific baseline model and language on given input data
- # The result of this script will be used to evaluate the performance of the baseline model
-###
-
-export CUDA_VISIBLE_DEVICES=7
-export PYTHONPATH=./:$PYTHONPATH
-
-LANGUAGE=ch # LANGUAGE choose in [en, ch]
-BASE_MODEL=roberta_base # BASE_MODEL choose in [roberta_base, roberta_large]
-
-if [[ $LANGUAGE == "ch" ]]; then
- if [[ $BASE_MODEL == "roberta_base" ]]; then
- FROM_PRETRAIN=roberta-wwm-ext
- CKPT=models/roberta_base_DuReader-Checklist_20211022_095011/ckpt.bin # 3 epoch
- #CKPT=models/roberta_base_ch_20211220_202953/ckpt.bin #new fine_tune
- elif [[ $BASE_MODEL == "roberta_large" ]]; then
- FROM_PRETRAIN=roberta-wwm-ext-large
- # CKPT=models/ernie_large_DuReader-Checklist_20211007_163424/ckpt.bin # 3 epoch F1: 63.465 EM: 52.832
- # CKPT=models/ernie_large_DuReader-Checklist_20211009_115837/ckpt.bin # 4 epoch F1: 63.323 EM: 52.920
- # CKPT=models/ernie_large_DuReader-Checklist_20211009_142730/ckpt.bin # 3 epoch F1: 66.613 EM: 57.168
- CKPT=models/roberta_large_DuReader-Checklist_20211022_095359/ckpt.bin
- #CKPT=models/roberta_large_ch_20211220_203809/ckpt.bin #new fine_tune
- fi
-elif [[ $LANGUAGE == "en" ]]; then
- if [[ $BASE_MODEL == "roberta_base" ]]; then
- FROM_PRETRAIN=roberta-base
- CKPT=models/roberta_base_squad2_20211113_104225/ckpt.bin
- #CKPT=models/roberta_base_en_20211221_201720/ckpt.bin #new fine_tune
- elif [[ $BASE_MODEL == "roberta_large" ]]; then
- FROM_PRETRAIN=roberta-large
- CKPT=models/roberta_large_squad2_20211113_111300/ckpt.bin
- #CKPT=models/roberta_large_en_20211223_114421/ckpt.bin #new fine_tune
- fi
-fi
-
-OUTPUT=./output/mrc_${LANGUAGE}.${BASE_MODEL}
-[ -d $OUTPUT ] || mkdir -p $OUTPUT
-set -x
-python3 ./saliency_map/rc_prediction.py \
- --base_model $BASE_MODEL \
- --data_dir ../../data/mrc_${LANGUAGE} \
- --from_pretrained $FROM_PRETRAIN \
- --init_checkpoint $CKPT \
- --output_dir $OUTPUT \
- --n-samples 300 \
- --doc_stride 128 \
- --language $LANGUAGE \
- --max_seq_len 384 \
- --batch_size 32 \
- --epoch 2
\ No newline at end of file
diff --git a/examples/model_interpretation/task/mrc/run_1_predict_rc_all.sh
b/examples/model_interpretation/task/mrc/run_1_predict_rc_all.sh deleted file mode 100755 index b504072d49bd..000000000000 --- a/examples/model_interpretation/task/mrc/run_1_predict_rc_all.sh +++ /dev/null @@ -1,57 +0,0 @@ -### - # This file contains script to run predictions of all baseline models and languages on given input data - # The result of this script will be used to evaluate the performance of the baseline model -### - -export CUDA_VISIBLE_DEVICES=4 -export PYTHONPATH=./:$PYTHONPATH - -for BASE_MODEL in "roberta_base" "roberta_large"; -do - for LANGUAGE in "ch" "en"; - do - if [[ $LANGUAGE == "ch" ]]; then - if [[ $BASE_MODEL == "roberta_base" ]]; then - FROM_PRETRAIN=roberta-wwm-ext - CKPT=models/roberta_base_DuReader-Checklist_20211022_095011/ckpt.bin # 3epoch - #CKPT=models/roberta_base_ch_20211220_202953/ckpt.bin #new fine_tune - elif [[ $BASE_MODEL == "roberta_large" ]]; then - FROM_PRETRAIN=roberta-wwm-ext-large - # CKPT=models/ernie_large_DuReader-Checklist_20211007_163424/ckpt.bin # 3 epoch F1: 63.465 EM: 52.832 - # CKPT=models/ernie_large_DuReader-Checklist_20211009_115837/ckpt.bin # 4 epoch F1: 63.323 EM: 52.920 - # CKPT=models/ernie_large_DuReader-Checklist_20211009_142730/ckpt.bin # 3 epoch F1: 66.613 EM: 57.168 - CKPT=models/roberta_large_DuReader-Checklist_20211022_095359/ckpt.bin - #CKPT=models/roberta_large_ch_20211220_203809/ckpt.bin #new fine_tune - fi - elif [[ $LANGUAGE == "en" ]]; then - if [[ $BASE_MODEL == "roberta_base" ]]; then - FROM_PRETRAIN=roberta-base - CKPT=models/roberta_base_squad2_20211113_104225/ckpt.bin - #CKPT=models/roberta_base_en_20211221_201720/ckpt.bin #new fine_tune - elif [[ $BASE_MODEL == "roberta_large" ]]; then - FROM_PRETRAIN=roberta-large - CKPT=models/roberta_large_squad2_20211113_111300/ckpt.bin - #CKPT=models/roberta_large_en_20211223_114421/ckpt.bin #new fine_tune - fi - fi - - OUTPUT=./output/mrc_${LANGUAGE}.${BASE_MODEL} - [ -d $OUTPUT ] || mkdir -p $OUTPUT - set -x - - if [[ ! 
-f ${OUTPUT}/predict_feature_index ]]; then - python3 ./saliency_map/rc_prediction.py \ - --base_model $BASE_MODEL \ - --data_dir ../../data/mrc_${LANGUAGE} \ - --from_pretrained $FROM_PRETRAIN \ - --init_checkpoint $CKPT \ - --output_dir $OUTPUT \ - --n-samples 300 \ - --doc_stride 128 \ - --language $LANGUAGE \ - --max_seq_len 384 \ - --batch_size 32 \ - --epoch 2 - fi - done -done \ No newline at end of file diff --git a/examples/model_interpretation/task/mrc/run_2_inter_rc.sh b/examples/model_interpretation/task/mrc/run_2_inter_rc.sh deleted file mode 100755 index 5f038bfcaa98..000000000000 --- a/examples/model_interpretation/task/mrc/run_2_inter_rc.sh +++ /dev/null @@ -1,53 +0,0 @@ -### - # This file contains script to generate saliency map of a specific baseline model and language on given input data - # The result of this script will be used to evaluate the interpretive performance of the baseline model -### - -export CUDA_VISIBLE_DEVICES=4 -export PYTHONPATH=./:$PYTHONPATH - -TASK=mrc -LANGUAGE=en # LANGUAGE choose in [ch, en] -BASE_MODEL=roberta_base # BASE_MODEL choose in [roberta_base, roberta_large] -INTER_MODE=integrated_gradient # INTER_MODE choice in [attention, integrated_gradient] -START=0 - -if [[ $LANGUAGE == "ch" ]]; then - if [[ $BASE_MODEL == "roberta_base" ]]; then - FROM_PRETRAIN=roberta-wwm-ext - CKPT=models/roberta_base_DuReader-Checklist_20211022_095011/ckpt.bin # 3 epoch - - elif [[ $BASE_MODEL == "roberta_large" ]]; then - FROM_PRETRAIN=roberta-wwm-ext-large - CKPT=models/roberta_large_DuReader-Checklist_20211022_095359/ckpt.bin # 3 epoch - fi -elif [[ $LANGUAGE == "en" ]]; then - if [[ $BASE_MODEL == "roberta_base" ]]; then - FROM_PRETRAIN=roberta-base - CKPT=models/roberta_base_squad2_20211113_104225/ckpt.bin - - elif [[ $BASE_MODEL == "roberta_large" ]]; then - FROM_PRETRAIN=roberta-large - CKPT=models/roberta_large_squad2_20211113_111300/ckpt.bin - fi -fi - - -OUTPUT=./output/mrc_${LANGUAGE}.${BASE_MODEL} -[ -d $OUTPUT ] || mkdir -p $OUTPUT -set -x -python3 ./saliency_map/rc_interpretable.py \ - --ans_path ./output/${TASK}_${LANGUAGE}.${BASE_MODEL}/predict_ans\ - --ans_idx_path ./output/${TASK}_${LANGUAGE}.${BASE_MODEL}/predict_feature_index\ - --base_model $BASE_MODEL \ - --data_dir ../../data/mrc_${LANGUAGE} \ - --from_pretrained $FROM_PRETRAIN \ - --batch_size 1 \ - --init_checkpoint $CKPT \ - --inter_mode $INTER_MODE\ - --output_dir $OUTPUT \ - --n-samples 300 \ - --doc_stride 128 \ - --start_step $START \ - --language $LANGUAGE \ - --num_classes 2 \ No newline at end of file diff --git a/examples/model_interpretation/task/mrc/run_2_inter_rc_all.sh b/examples/model_interpretation/task/mrc/run_2_inter_rc_all.sh deleted file mode 100755 index 5908512f7ba9..000000000000 --- a/examples/model_interpretation/task/mrc/run_2_inter_rc_all.sh +++ /dev/null @@ -1,61 +0,0 @@ -### - # This file contains script to generate saliency map of all baseline models and languages on given input data - # The result of this script will be used to evaluate the interpretive performance of the baseline model -### - -export CUDA_VISIBLE_DEVICES=6 -export PYTHONPATH=./:$PYTHONPATH - -START=0 -TASK=mrc -for BASE_MODEL in "roberta_base" "roberta_large"; -do - for INTER_MODE in "attention" "integrated_gradient"; - do - for LANGUAGE in "ch" "en"; - do - if [[ $LANGUAGE == "ch" ]]; then - if [[ $BASE_MODEL == "roberta_base" ]]; then - FROM_PRETRAIN=roberta-wwm-ext - CKPT=models/roberta_base_DuReader-Checklist_20211022_095011/ckpt.bin # 3 epoch - - elif [[ $BASE_MODEL == "roberta_large" 
]]; then
- FROM_PRETRAIN=roberta-wwm-ext-large
- CKPT=models/roberta_large_DuReader-Checklist_20211022_095359/ckpt.bin # 3 epoch
- fi
- elif [[ $LANGUAGE == "en" ]]; then
- if [[ $BASE_MODEL == "roberta_base" ]]; then
- FROM_PRETRAIN=roberta-base
- CKPT=models/roberta_base_squad2_20211113_104225/ckpt.bin
-
- elif [[ $BASE_MODEL == "roberta_large" ]]; then
- FROM_PRETRAIN=roberta-large
- CKPT=models/roberta_large_squad2_20211113_111300/ckpt.bin
- fi
- fi
-
-
- OUTPUT=./output/mrc_${LANGUAGE}.${BASE_MODEL}
- [ -d $OUTPUT ] || mkdir -p $OUTPUT
- set -x
-
- if [[ ! -f ${OUTPUT}/interpret.${INTER_MODE} ]]; then
- python3 ./saliency_map/rc_interpretable.py \
- --ans_path ./output/${TASK}_${LANGUAGE}.${BASE_MODEL}/predict_ans\
- --ans_idx_path ./output/${TASK}_${LANGUAGE}.${BASE_MODEL}/predict_feature_index\
- --base_model $BASE_MODEL \
- --data_dir ../../data/mrc_${LANGUAGE} \
- --from_pretrained $FROM_PRETRAIN \
- --batch_size 1 \
- --init_checkpoint $CKPT \
- --inter_mode $INTER_MODE\
- --output_dir $OUTPUT \
- --n-samples 300 \
- --doc_stride 128 \
- --start_step $START \
- --language $LANGUAGE\
- --num_classes 2
- fi
- done
- done
-done
\ No newline at end of file
diff --git a/examples/model_interpretation/task/mrc/run_train_rc.sh b/examples/model_interpretation/task/mrc/run_train_rc.sh
deleted file mode 100755
index ff7d95db9342..000000000000
--- a/examples/model_interpretation/task/mrc/run_train_rc.sh
+++ /dev/null
@@ -1,51 +0,0 @@
-###
- # This script is used to run fine-tuning of MRC RoBERTa models.
-###
-
-export CUDA_VISIBLE_DEVICES=7
-export PYTHONPATH=.:$PYTHONPATH
-
-LANGUAGE=ch # LANGUAGE choose in [ch, en]
-BASE_MODEL=roberta_base # choices [roberta_base, roberta_large]
-
-[ -d "logs" ] || mkdir -p "logs"
-set -x
-
-if [[ $LANGUAGE == "ch" ]]; then
- if [[ $BASE_MODEL == "roberta_base" ]]; then
- FROM_PRETRAIN=roberta-wwm-ext
- elif [[ $BASE_MODEL == "roberta_large" ]]; then
- FROM_PRETRAIN=roberta-wwm-ext-large
- fi
- EPOCH=3
- BSZ=2
- LR=3e-5
- MAX_SEQLEN=512
- DATA=DuReader-Checklist
-elif [[ $LANGUAGE == 'en' ]]; then
- if [[ $BASE_MODEL == "roberta_base" ]]; then
- FROM_PRETRAIN=roberta-base
- elif [[ $BASE_MODEL == "roberta_large" ]]; then
- FROM_PRETRAIN=roberta-large
- fi
- EPOCH=2
- BSZ=16
- LR=5e-6
- MAX_SEQLEN=384
- DATA=squad2
-fi
-
-timestamp=`date +"%Y%m%d_%H%M%S"`
-python3 saliency_map/rc_finetune.py \
- --train_data_dir ./data/$DATA/train/train.json \
- --dev_data_dir ./data/$DATA/dev/dev.json \
- --max_steps -1 \
- --from_pretrained $FROM_PRETRAIN \
- --epoch $EPOCH \
- --bsz $BSZ \
- --lr $LR \
- --max_seq_len $MAX_SEQLEN \
- --save_dir models/${BASE_MODEL}_${LANGUAGE}_${timestamp} \
- --language $LANGUAGE \
- --init_checkpoint models/${BASE_MODEL}_${LANGUAGE}_${timestamp}/ckpt.bin >> logs/log_${BASE_MODEL}_$timestamp 2>&1
-
\ No newline at end of file
diff --git a/examples/model_interpretation/task/mrc/saliency_map/rc_finetune.py b/examples/model_interpretation/task/mrc/saliency_map/rc_finetune.py
deleted file mode 100644
index f676df12d781..000000000000
--- a/examples/model_interpretation/task/mrc/saliency_map/rc_finetune.py
+++ /dev/null
@@ -1,280 +0,0 @@
-# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import argparse
-import logging
-import os
-import re
-import sys
-import time
-from pathlib import Path
-
-import paddle
-from paddle.io import DataLoader
-from roberta.modeling import RobertaForQuestionAnswering
-from saliency_map.utils import create_if_not_exists, get_warmup_and_linear_decay
-from squad import DuReaderChecklist
-from visualdl import LogWriter
-
-from paddlenlp.data import Dict, Pad, Stack
-from paddlenlp.transformers.roberta.tokenizer import (
- RobertaBPETokenizer,
- RobertaTokenizer,
-)
-
-sys.path.append("../../..")
-from model_interpretation.utils import ( # noqa: E402
- convert_tokenizer_res_to_old_version,
-)
-
-sys.path.remove("../../..")
-
-log = logging.getLogger(__name__)
-log.setLevel(logging.DEBUG)
-logging.getLogger().setLevel(logging.DEBUG)
-
-
-def get_args():
- parser = argparse.ArgumentParser("mrc task with roberta")
- parser.add_argument("--from_pretrained", type=str, required=True, help="pretrained model directory or tag")
- parser.add_argument(
- "--max_seq_len", type=int, default=128, help="max sentence length, should not be greater than 512"
- )
- parser.add_argument(
- "--doc_stride",
- type=int,
- default=128,
- help="When splitting up a long document into chunks, how much stride to take between chunks.",
- )
- parser.add_argument("--bsz", type=int, default=32, help="batch size")
- parser.add_argument("--epoch", type=int, default=3, help="epoch")
- parser.add_argument("--train_data_dir", type=str, required=True, help="train data file")
- parser.add_argument("--dev_data_dir", type=str, required=True, help="develop data file")
- parser.add_argument(
- "--max_steps", type=int, required=True, help="max_train_steps, set this to EPOCH * NUM_SAMPLES / BATCH_SIZE"
- )
- parser.add_argument("--warmup_proportion", type=float, default=0.1)
- parser.add_argument("--lr", type=float, default=5e-5, help="learning rate")
- parser.add_argument("--save_dir", type=Path, required=True, help="model output directory")
- parser.add_argument("--init_checkpoint", type=str, default=None, help="checkpoint to warm start from")
- parser.add_argument("--wd", type=float, default=0.01, help="weight decay, aka L2 regularizer")
- parser.add_argument(
- "--use_amp",
- action="store_true",
- help="only activate AMP (auto mixed precision acceleration) on TensorCore compatible devices",
- )
- parser.add_argument("--language", type=str, required=True, help="language that the model is based on")
- args = parser.parse_args()
- return args
-
-
-def map_fn_DuCheckList_finetune(examples):
- # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results
- # in one example possibly giving several features when a context is long, each of those features having a
- # context that overlaps a bit the context of the previous feature.
- questions = [examples[i]["question"] for i in range(len(examples))] - contexts = [examples[i]["context"] + examples[i]["title"] for i in range(len(examples))] - - tokenized_examples = tokenizer(questions, contexts, stride=args.doc_stride, max_seq_len=args.max_seq_len) - tokenized_examples = convert_tokenizer_res_to_old_version(tokenized_examples) - - for i, tokenized_example in enumerate(tokenized_examples): - - # We will label impossible answers with the index of the CLS token. - input_ids = tokenized_example["input_ids"] # list(seq) - cls_index = input_ids.index(tokenizer.cls_token_id) - - # Grab the sequence corresponding to that example (to know what is the context and what is the question). - sequence_ids = tokenized_example["token_type_ids"] # list(seq) - - # The offset mappings will give us a map from token to character position in the original context. This will - # help us compute the start_positions and end_positions. - offsets = tokenized_example["offset_mapping"] # list(seq) - - # One example can give several spans, this is the index of the example containing this span of text. - sample_index = tokenized_example["overflow_to_sample"] # int - if args.language == "ch": - answers = examples[sample_index]["answers"] # list - answer_starts = examples[sample_index]["answer_starts"] # list - else: - example = examples[sample_index] - example["question_len"] = len(example["question"].split()) - example["context_len"] = len(example["context"].split()) - - answers = example["answers"] # list - answer_starts = example["answer_starts"] # list - - # If no answers are given, set the cls_index as answer. - if len(answer_starts) == 0: - tokenized_examples[i]["start_positions"] = cls_index - tokenized_examples[i]["end_positions"] = cls_index - tokenized_examples[i]["answerable_label"] = 0 - else: - # Start/end character index of the answer in the text. - start_char = answer_starts[0] - end_char = start_char + len(answers[0]) - if args.language == "en": - # Start token index of the current span in the text. - token_start_index = 0 - while not (offsets[token_start_index] == (0, 0) and offsets[token_start_index + 1] == (0, 0)): - token_start_index += 1 - token_start_index += 2 - - # End token index of the current span in the text. - token_end_index = len(input_ids) - 2 - else: - # Start token index of the current span in the text. - token_start_index = 0 - while sequence_ids[token_start_index] != 1: - token_start_index += 1 - - # End token index of the current span in the text. - token_end_index = len(input_ids) - 2 - while sequence_ids[token_end_index] != 1: - token_end_index -= 1 - - # Detect if the answer is out of the span (in which case this feature is labeled with the CLS index). - if not (offsets[token_start_index][0] <= start_char and offsets[token_end_index][1] >= end_char): - tokenized_examples[i]["start_positions"] = cls_index - tokenized_examples[i]["end_positions"] = cls_index - tokenized_examples[i]["answerable_label"] = 0 - else: - # Otherwise move the token_start_index and token_end_index to the two ends of the answer. - # Note: we could go after the last offset if the answer is the last word (edge case). 
- while token_start_index < len(offsets) and offsets[token_start_index][0] <= start_char:
- token_start_index += 1
- tokenized_examples[i]["start_positions"] = token_start_index - 1
- while offsets[token_end_index][1] >= end_char:
- token_end_index -= 1
- tokenized_examples[i]["end_positions"] = token_end_index + 1
- tokenized_examples[i]["answerable_label"] = 1
-
- return tokenized_examples
-
-
-if __name__ == "__main__":
- args = get_args()
-
- if args.language == "ch":
- tokenizer = RobertaTokenizer.from_pretrained(args.from_pretrained)
- else:
- tokenizer = RobertaBPETokenizer.from_pretrained(args.from_pretrained)
- model = RobertaForQuestionAnswering.from_pretrained(args.from_pretrained, num_classes=2)
-
- train_ds = DuReaderChecklist().read(args.train_data_dir)
- dev_ds = DuReaderChecklist().read(args.dev_data_dir)
-
- train_ds.map(map_fn_DuCheckList_finetune, batched=True)
- dev_ds.map(map_fn_DuCheckList_finetune, batched=True)
-
- log.debug("train set: %d" % len(train_ds))
- log.debug("dev set: %d" % len(dev_ds))
-
- train_batch_sampler = paddle.io.DistributedBatchSampler(train_ds, batch_size=args.bsz, shuffle=True)
- dev_batch_sampler = paddle.io.DistributedBatchSampler(dev_ds, batch_size=args.bsz, shuffle=False)
-
- batchify_fn = lambda samples, fn=Dict(
- {
- "input_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id),
- "token_type_ids": Pad(axis=0, pad_val=tokenizer.pad_token_type_id),
- "start_positions": Stack(dtype="int64"),
- "end_positions": Stack(dtype="int64"),
- "answerable_label": Stack(dtype="int64"),
- }
- ): fn(samples)
-
- train_data_loader = DataLoader(
- dataset=train_ds, batch_sampler=train_batch_sampler, collate_fn=batchify_fn, return_list=True
- )
- dev_data_loader = DataLoader(
- dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=batchify_fn, return_list=True
- )
-
- max_steps = args.max_steps if args.max_steps > 0 else len(train_data_loader) * args.epoch
- lr_scheduler = paddle.optimizer.lr.LambdaDecay(
- args.lr, get_warmup_and_linear_decay(max_steps, int(args.warmup_proportion * max_steps))
- )
-
- param_name_to_exclude_from_weight_decay = re.compile(r".*layer_norm_scale|.*layer_norm_bias|.*b_0")
-
- opt = paddle.optimizer.AdamW(
- lr_scheduler,
- parameters=model.parameters(),
- weight_decay=args.wd,
- apply_decay_param_fun=lambda n: not param_name_to_exclude_from_weight_decay.match(n),
- grad_clip=paddle.nn.ClipGradByGlobalNorm(1.0) if args.language == "ch" else None,
- )
-
- scaler = paddle.amp.GradScaler(enable=args.use_amp)
-
- with LogWriter(logdir=str(create_if_not_exists(args.save_dir / "vdl"))) as log_writer:
- with paddle.amp.auto_cast(enable=args.use_amp):
- max_acc = 0.0
- log.debug("start training...")
- for epoch in range(args.epoch):
- s_time = time.time()
- for step, d in enumerate(train_data_loader, start=1):
- # input_ids: paddle.Tensor(bsz, seq)
- # token_type_ids: paddle.Tensor(bsz, seq)
- # start_positions: paddle.Tensor(bsz)
- # end_positions: paddle.Tensor(bsz)
- # answerable_label: paddle.Tensor(bsz)
- input_ids, token_type_ids, start_positions, end_positions, answerable_label = d
- loss, _, _, _ = model(
- input_ids=input_ids,
- token_type_ids=token_type_ids,
- start_pos=start_positions,
- end_pos=end_positions,
- labels=answerable_label,
- )
- loss = scaler.scale(loss)
- loss.backward()
- scaler.minimize(opt, loss)
- opt.clear_grad()
- lr_scheduler.step()
-
- if step % 100 == 0:
- _lr = lr_scheduler.get_lr()
- time_cost = time.time() - s_time
- s_time = time.time()
- if args.use_amp:
- _l = (loss / scaler._scale).numpy()
- msg = "[epoch-%d step-%d] train loss %.5f lr %.3e scaling %.3e" % (
= "[epoch-%d step-%d] train loss %.5f lr %.3e scaling %.3e" % ( - epoch, - step, - _l, - _lr, - scaler._scale.numpy(), - ) - else: - _l = loss.numpy() - msg = "[epoch-%d step-%d] train loss %.5f lr %.3e time_cost: %.1fs" % ( - epoch, - step, - _l, - _lr, - time_cost, - ) - log.debug(msg) - log_writer.add_scalar("loss", _l, step=step) - log_writer.add_scalar("lr", _lr, step=step) - - if step % 1000 == 0: - if args.save_dir is not None: - paddle.save(model.state_dict(), os.path.join(args.save_dir, "ckpt.bin")) - log.debug("save model!") - - if args.save_dir is not None: - paddle.save(model.state_dict(), os.path.join(args.save_dir, "ckpt.bin")) - log.debug("save model!") diff --git a/examples/model_interpretation/task/mrc/saliency_map/rc_interpretable.py b/examples/model_interpretation/task/mrc/saliency_map/rc_interpretable.py deleted file mode 100644 index 7df2bc45d51f..000000000000 --- a/examples/model_interpretation/task/mrc/saliency_map/rc_interpretable.py +++ /dev/null @@ -1,497 +0,0 @@ -# !/usr/bin/env python3 -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import argparse -import collections -import json -import logging -import os -import sys -from functools import partial -from pathlib import Path - -import paddle -from roberta.modeling import RobertaForQuestionAnswering -from squad import RCInterpret -from tqdm import tqdm - -from paddlenlp.data import Dict, Pad, Stack -from paddlenlp.transformers.roberta.tokenizer import ( - RobertaBPETokenizer, - RobertaTokenizer, -) - -sys.path.append("../../..") -from model_interpretation.utils import ( # noqa: E402 - convert_tokenizer_res_to_old_version, - match, -) - -sys.path.remove("../../..") - -log = logging.getLogger(__name__) -log.setLevel(logging.DEBUG) -logging.getLogger().setLevel(logging.DEBUG) - - -def get_args(): - parser = argparse.ArgumentParser("mrc task with roberta") - parser.add_argument("--base_model", required=True, choices=["roberta_base", "roberta_large"]) - parser.add_argument("--from_pretrained", type=str, required=True, help="pretrained model directory or tag") - parser.add_argument( - "--max_seq_len", type=int, default=512, help="max sentence length, should not greater than 512" - ) - parser.add_argument("--batch_size", type=int, default=32, help="batchsize") - parser.add_argument("--data_dir", type=str, required=True, help="data directory includes train / develop data") - parser.add_argument("--init_checkpoint", type=str, default=None, help="checkpoint to warm start from") - parser.add_argument( - "--use_amp", - action="store_true", - help="only activate AMP(auto mixed precision accelatoin) on TensorCore compatible devices", - ) - parser.add_argument( - "--inter_mode", - type=str, - default="attention", - choices=["attention", "simple_gradient", "smooth_gradient", "integrated_gradient", "lime"], - help="appoint the mode of interpretable.", - ) - parser.add_argument("--n-samples", type=int, default=25, help="number of samples used for smooth gradient 
method") - parser.add_argument("--output_dir", type=Path, required=True, help="interpretable output directory") - parser.add_argument( - "--doc_stride", - type=int, - default=128, - help="When splitting up a long document into chunks, how much stride to take between chunks.", - ) - parser.add_argument("--start_step", type=int, default=0, help="start from which instance") - parser.add_argument("--language", type=str, required=True, help="language that the model based on") - parser.add_argument( - "--ans_path", - type=str, - required=True, - help="the path of the file which stores the predicted answer from last step", - ) - parser.add_argument( - "--ans_idx_path", - type=str, - required=True, - help="the path of the file which stores the predicted answer index from last step", - ) - parser.add_argument("--num_classes", type=int, required=True, help="number of class") - args = parser.parse_args() - return args - - -def truncate_offset(seg, start_offset, end_offset): - seg_len = len(seg) - for n in range(len(start_offset) - 1, -1, -1): - if start_offset[n] < seg_len: - end_offset[n] = seg_len - break - start_offset.pop(n) - end_offset.pop(n) - - -def map_fn_DuCheckList(examples, args, tokenizer): - # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results - # in one example possible giving several features when a context is long, each of those features having a - # context that overlaps a bit the context of the previous feature. - if args.language == "en": - questions = [ - examples[i]["question"].encode("ascii", errors="replace").decode("UTF-8") for i in range(len(examples)) - ] - contexts = [ - examples[i]["context"].encode("ascii", errors="replace").decode("UTF-8") for i in range(len(examples)) - ] - else: - questions = [examples[i]["question"] for i in range(len(examples))] - contexts = [examples[i]["context"] for i in range(len(examples))] - tokenized_examples = tokenizer(questions, contexts, stride=args.doc_stride, max_seq_len=args.max_seq_len) - tokenized_examples = convert_tokenizer_res_to_old_version(tokenized_examples) - - log.debug("\nexample: %d" % len(examples)) - log.debug("feature: %d\n" % len(tokenized_examples)) - - # For validation, there is no need to compute start and end positions - for i, tokenized_example in enumerate(tokenized_examples): - # Grab the sequence corresponding to that example (to know what is the context and what is the question). - # One example can give several spans, this is the index of the example containing this span of text. 
- sample_index = tokenized_example["overflow_to_sample"] - tokenized_examples[i]["example_id"] = examples[sample_index]["id"] - tokenized_examples[i]["question"] = examples[sample_index]["question"] - tokenized_examples[i]["context"] = examples[sample_index]["context"] - tokenized_examples[i]["sent_token"] = examples[sample_index]["sent_token"] - - return tokenized_examples - - -def init_roberta_var(args): - if args.language == "ch": - tokenizer = RobertaTokenizer.from_pretrained(args.from_pretrained) - else: - tokenizer = RobertaBPETokenizer.from_pretrained(args.from_pretrained) - - model = RobertaForQuestionAnswering.from_pretrained(args.from_pretrained, num_classes=args.num_classes) - map_fn = partial(map_fn_DuCheckList, args=args, tokenizer=tokenizer) - dev_ds = RCInterpret().read(args.data_dir) - - dev_ds.map(map_fn, batched=True) - dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) - batchify_fn = lambda samples, fn=Dict( - { - "input_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id), - "token_type_ids": Pad(axis=0, pad_val=tokenizer.pad_token_type_id), - "offset_mapping": Pad(axis=0, pad_val=tokenizer.pad_token_id), - "overflow_to_sample": Stack(dtype="int32"), - } - ): fn(samples) - - dev_dataloader = paddle.io.DataLoader( - dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=batchify_fn, return_list=True - ) - - return model, tokenizer, dev_dataloader, dev_ds - - -def ch_per_example( - args, - scores_in_one_example, - prev_context_tokens, - dev_ds, - prev_example_idx, - ans_dic, - ans_idx_dic, - offset, - out_handle, -): - total_score = scores_in_one_example[-1] - assert len(prev_context_tokens) == len(total_score) - token_score_dict = [] - for idx in range(len(total_score)): - token_score_dict.append([idx, offset[idx], total_score[idx]]) - - prev_example = dev_ds.data[prev_example_idx] - char_attribution_dict = match( - prev_example["context"] + prev_example["title"], prev_example["sent_token"], token_score_dict - ) - result["id"] = prev_example["id"] - result["question"] = prev_example["question"] - result["title"] = prev_example["title"] - result["context"] = prev_example["context"] + prev_example["title"] - result["pred_label"] = ans_dic[str(result["id"])] - result["pred_feature"] = ans_idx_dic[str(result["id"])] - - result["char_attri"] = collections.OrderedDict() - for token_info in sorted(char_attribution_dict, key=lambda x: x[2], reverse=True): - result["char_attri"][str(token_info[0])] = [str(token_info[1]), float(token_info[2])] - - out_handle.write(json.dumps(result, ensure_ascii=False) + "\n") - - -def en_per_example(inter_score, result, ans_dic, ans_idx_dic, offset, out_handle): - sorted_token = [] - for i in range(len(inter_score)): - sorted_token.append([i, offset[i], inter_score[i]]) - char_attribution_dict = match(result["context"], result["sent_token"], sorted_token) - - result["pred_label"] = ans_dic[str(result["id"])] - result["pred_feature"] = ans_idx_dic[str(result["id"])] - result["char_attri"] = collections.OrderedDict() - for token_info in sorted(char_attribution_dict, key=lambda x: x[2], reverse=True): - result["char_attri"][str(token_info[0])] = [str(token_info[1]), float(token_info[2])] - result.pop("sent_token") - - out_handle.write(json.dumps(result, ensure_ascii=False) + "\n") - - -def load_pred_data(ans_path, ans_idx_path): - f = open(ans_path, "r") - ans_dic = json.loads(f.read()) - f.close() - f = open(ans_idx_path, "r") - ans_idx_dic = json.loads(f.read()) - f.close() - return ans_dic, 
ans_idx_dic - - -def extract_attention_scores( - args, - model, - result, - fwd_args, - fwd_kwargs, - prev_example_idx, - example_idx, - prev_context_tokens, - scores_in_one_example, - dev_ds, - ans_dic, - ans_idx_dic, - context_tokens, - offset, - prev_offset, - out_handle, -): - with paddle.no_grad(): - # start_logits: (bsz, seq); end_logits: (bsz, seq); cls_logits: (bsz, 2) - # attention: list((bsz, head, seq, seq) * 12); embedded: (bsz, seq, emb) - _, start_logits, end_logits, cls_logits, attentions, embedded = model.forward_interpret( - *fwd_args, **fwd_kwargs - ) - - # Attention score equals to the mean of attention of each token in the question - attentions = attentions[-1][:, :, 1:SEP_idx, :].mean(2).mean(1) # attentions: (bsz, seq_len) - context_score = attentions[0, SEP_idx + add_idx : -1] # context_score: Tensor(context) - context_norm_score = context_score / context_score.sum(-1) - - if args.language == "ch": - if prev_example_idx is None or prev_example_idx == example_idx: - scores_in_one_example.append(context_norm_score.numpy().tolist()) - else: - ch_per_example( - args, - scores_in_one_example, - prev_context_tokens, - dev_ds, - prev_example_idx, - ans_dic, - ans_idx_dic, - prev_offset, - out_handle, - ) - scores_in_one_example = [context_norm_score.numpy().tolist()] - prev_example_idx = example_idx - prev_context_tokens = context_tokens - prev_offset = offset - else: - en_per_example(context_norm_score, result, ans_dic, ans_idx_dic, offset, out_handle) - return prev_example_idx, prev_context_tokens, scores_in_one_example, prev_offset - - -def extract_integrated_gradient_scores( - args, - dev_ds, - model, - result, - fwd_args, - fwd_kwargs, - SEP_idx, - add_idx, - prev_example_idx, - example_idx, - scores_in_one_example, - prev_context_tokens, - ans_dic, - ans_idx_dic, - context_tokens, - offset, - prev_offset, - out_handle, -): - embedded_grads_list = [] # [Tensor(1, seq_len, embed_size)] - with open(os.path.join(args.output_dir, "predict_feature_index"), "r") as f_feature_index: - feature_index_dict = json.load(f_feature_index) - example = dev_ds.data[example_idx] - example_id = example["id"] - start_index, end_index = feature_index_dict[str(example_id)] - - for i in range(args.n_samples): - # embedded_start_grad - # start_logits: (bsz, seq); embedded: (bsz, seq, emb) - _, start_logits, _, _, _, embedded = model.forward_interpret( - *fwd_args, **fwd_kwargs, noise="integrated", i=i, n_samples=args.n_samples - ) - - start_logit = start_logits[:, start_index].sum() - start_logit.backward(retain_graph=False) - embedded_start_grad = embedded.grad - model.clear_gradients() - # embedded_end_grad - # end_logits: (bsz, seq); embedded: (bsz, seq, emb) - _, _, end_logits, _, _, embedded = model.forward_interpret( - *fwd_args, **fwd_kwargs, noise="integrated", i=i, n_samples=args.n_samples - ) - end_logit = end_logits[:, end_index].sum() - end_logit.backward(retain_graph=False) - embedded_end_grad = embedded.grad - model.clear_gradients() - - embedded_grad = (embedded_start_grad + embedded_end_grad) / 2 - embedded_grads_list.append(embedded_grad) - - if i == 0: - baseline_embedded = embedded # Tensor(1, seq_len, embed_size) - elif i == args.n_samples - 1: - pred_embedded = embedded # Tensor(1, seq_len, embed_size) - - embedded_grads_tensor = paddle.to_tensor( - embedded_grads_list, dtype="float32", place=paddle.CUDAPlace(0), stop_gradient=True - ) - - trapezoidal_grads = ( - embedded_grads_tensor[1:] + embedded_grads_tensor[:-1] - ) / 2 # Tensor(n_samples-1, 1, seq_len, embed_size) - 
integral_grads = trapezoidal_grads.sum(0) / trapezoidal_grads.shape[0] # Tensor(1, seq_len, embed_size)
-
- inter_score = (pred_embedded - baseline_embedded) * integral_grads # Tensor(1, seq_len, embed_size)
- inter_score = inter_score.sum(-1) # Tensor(1, seq_len)
- inter_score.stop_gradient = True
-
- context_score = inter_score[0, SEP_idx + add_idx : -1]
- context_norm_score = context_score / context_score.sum(-1)
- if args.language == "ch":
- if prev_example_idx is None or prev_example_idx == example_idx:
- scores_in_one_example.append(context_norm_score.numpy().tolist())
- else:
- ch_per_example(
- args,
- scores_in_one_example,
- prev_context_tokens,
- dev_ds,
- prev_example_idx,
- ans_dic,
- ans_idx_dic,
- prev_offset,
- out_handle,
- )
- scores_in_one_example = [context_norm_score.numpy().tolist()]
- prev_example_idx = example_idx
- prev_context_tokens = context_tokens
- prev_offset = offset
- else:
- en_per_example(context_norm_score, result, ans_dic, ans_idx_dic, offset, out_handle)
- return prev_example_idx, prev_context_tokens, scores_in_one_example, prev_offset
-
-
-if __name__ == "__main__":
- args = get_args()
- if args.language == "ch":
- add_idx = 1
- else:
- add_idx = 2
-
- ans_dic, ans_idx_dic = load_pred_data(args.ans_path, args.ans_idx_path)
- if args.base_model.startswith("roberta"):
- model, tokenizer, dataloader, dev_ds = init_roberta_var(args)
- else:
- raise ValueError("unsupported base model name.")
-
- with paddle.amp.auto_cast(enable=args.use_amp), open(
- os.path.join(args.output_dir, "interpret" + f".{args.inter_mode}"), "w"
- ) as out_handle:
-
- sd = paddle.load(args.init_checkpoint)
- model.set_dict(sd)
- log.debug("load model from %s" % args.init_checkpoint)
-
- err_total = []
- lime_score_total = []
- lime_relative_err_total = []
- lime_err_total = []
-
- # Second forward: evidence extraction
- scores_in_one_example = []
- prev_example_idx = None
- prev_context_tokens = None
- prev_offset = None
-
- get_subword_ids = lambda word: map(str, tokenizer.convert_tokens_to_ids(tokenizer.tokenize(word)))
- for step, d in tqdm(enumerate(dataloader)):
- if step < args.start_step:
- continue
-
- model.train()
-
- result = {}
- input_ids, segment_ids, offset_map, example_idx = d
- fwd_args = [input_ids, segment_ids]
- fwd_kwargs = {}
-
- SEP_idx = input_ids.numpy()[0].tolist().index(tokenizer.sep_token_id)
- context_ids = input_ids[0, SEP_idx + add_idx : -1]
- offset = offset_map[0, SEP_idx + add_idx : -1]
- context_tokens = tokenizer.convert_ids_to_tokens(context_ids.numpy().tolist())
-
- if args.language == "en":
- example = dev_ds.data[step]
- result["id"] = example["id"]
- result["question"] = example["question"]
- result["title"] = example["title"]
- result["context"] = example["context"] + example["title"]
- result["sent_token"] = example["sent_token"]
-
- if args.inter_mode == "attention":
- prev_example_idx, prev_context_tokens, scores_in_one_example, prev_offset = extract_attention_scores(
- args,
- model,
- result,
- fwd_args,
- fwd_kwargs,
- prev_example_idx,
- example_idx,
- prev_context_tokens,
- scores_in_one_example,
- dev_ds,
- ans_dic,
- ans_idx_dic,
- context_tokens,
- offset,
- prev_offset,
- out_handle,
- )
-
- elif args.inter_mode == "integrated_gradient":
- (
- prev_example_idx,
- prev_context_tokens,
- scores_in_one_example,
- prev_offset,
- ) = extract_integrated_gradient_scores(
- args,
- dev_ds,
- model,
- result,
- fwd_args,
- fwd_kwargs,
- SEP_idx,
- add_idx,
- prev_example_idx,
- example_idx,
- scores_in_one_example,
- prev_context_tokens,
- ans_dic,
- ans_idx_dic,
- context_tokens,
- offset,
- prev_offset,
- out_handle,
- )
- else:
- raise KeyError(f"Unknown interpretation mode: {args.inter_mode}")
-
- # Deal with last example
- if args.language == "ch":
-
- feature = dev_ds.new_data[-1]
- input_ids = feature["input_ids"]
- SEP_idx = input_ids.index(tokenizer.sep_token_id)
- context_ids = input_ids[SEP_idx + 1 : -1]
- offset = feature["offset_mapping"][SEP_idx + 1 : -1]
- context_tokens = tokenizer.convert_ids_to_tokens(context_ids)
-
- ch_per_example(
- args, scores_in_one_example, context_tokens, dev_ds, -1, ans_dic, ans_idx_dic, offset, out_handle
- )
diff --git a/examples/model_interpretation/task/mrc/saliency_map/rc_prediction.py b/examples/model_interpretation/task/mrc/saliency_map/rc_prediction.py
deleted file mode 100644
index c1557de19d10..000000000000
--- a/examples/model_interpretation/task/mrc/saliency_map/rc_prediction.py
+++ /dev/null
@@ -1,195 +0,0 @@
-# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import argparse
-import json
-import logging
-import os
-import sys
-import time
-from functools import partial
-from pathlib import Path
-
-import paddle
-from roberta.modeling import RobertaForQuestionAnswering
-from squad import RCInterpret, compute_prediction
-
-from paddlenlp.data import Dict, Pad
-from paddlenlp.transformers.roberta.tokenizer import (
- RobertaBPETokenizer,
- RobertaTokenizer,
-)
-
-sys.path.append("../../..")
-from model_interpretation.utils import ( # noqa: E402
- convert_tokenizer_res_to_old_version,
-)
-
-sys.path.remove("../../..")
-
-log = logging.getLogger(__name__)
-log.setLevel(logging.DEBUG)
-logging.getLogger().setLevel(logging.DEBUG)
-
-
-def get_args():
- parser = argparse.ArgumentParser("mrc task with roberta")
- parser.add_argument("--base_model", required=True, choices=["roberta_base", "roberta_large"])
- parser.add_argument("--from_pretrained", type=str, required=True, help="pretrained model directory or tag")
- parser.add_argument(
- "--max_seq_len", type=int, default=128, help="max sentence length, should not be greater than 512"
- )
- parser.add_argument("--batch_size", type=int, default=32, help="batch size")
- parser.add_argument("--epoch", type=int, default=3, help="epoch")
- parser.add_argument("--data_dir", type=str, required=True, help="data directory includes train / develop data")
- parser.add_argument("--init_checkpoint", type=str, default=None, help="checkpoint to warm start from")
- parser.add_argument(
- "--use_amp",
- action="store_true",
- help="only activate AMP (auto mixed precision acceleration) on TensorCore compatible devices",
- )
- parser.add_argument("--n-samples", type=int, default=25, help="number of samples used for smooth gradient method")
- parser.add_argument("--output_dir", type=Path, required=True, help="interpretation output directory")
- parser.add_argument(
- "--doc_stride",
- type=int,
- default=128,
- help="When splitting up a long document into chunks, how much stride to take between chunks.",
- )
- parser.add_argument("--language", type=str, required=True, help="language that the model is based on")
- args = parser.parse_args()
- return args
-
-
-def map_fn_DuCheckList(examples, args, tokenizer):
- # Tokenize our examples with truncation and maybe padding, but keep the overflows using a stride. This results
- # in one example possibly giving several features when a context is long, each of those features having a
- # context that overlaps a bit the context of the previous feature.
- # NOTE: Almost the same functionality as HuggingFace's prepare_train_features function. The main difference is
- # that HuggingFace uses ArrowTable as the basic data structure, while we use a list of dictionaries instead.
- if args.language == "en":
- contexts = [
- examples[i]["context"].encode("ascii", errors="replace").decode("UTF-8") for i in range(len(examples))
- ]
- questions = [
- examples[i]["question"].encode("ascii", errors="replace").decode("UTF-8") for i in range(len(examples))
- ]
- else:
- contexts = [examples[i]["context"] for i in range(len(examples))]
- questions = [examples[i]["question"] for i in range(len(examples))]
-
- tokenized_examples = tokenizer(questions, contexts, stride=args.doc_stride, max_seq_len=args.max_seq_len)
- tokenized_examples = convert_tokenizer_res_to_old_version(tokenized_examples)
-
- # For validation, there is no need to compute start and end positions
- for i, tokenized_example in enumerate(tokenized_examples):
- # Grab the sequence corresponding to that example (to know what is the context and what is the question).
- sequence_ids = tokenized_example["token_type_ids"]
-
- # One example can give several spans, this is the index of the example containing this span of text.
- sample_index = tokenized_example["overflow_to_sample"]
- tokenized_examples[i]["example_id"] = examples[sample_index]["id"]
-
- # Set to None the offset_mapping that are not part of the context so it's easy to determine if a token
- # position is part of the context or not.
- if args.language == "ch": - tokenized_examples[i]["offset_mapping"] = [ - (o if sequence_ids[k] == 1 else None) for k, o in enumerate(tokenized_example["offset_mapping"]) - ] - else: - n = tokenized_example["offset_mapping"].index((0, 0), 1) + 2 # context start position - m = len(tokenized_example["offset_mapping"]) - 1 # context end position + 1 - tokenized_examples[i]["offset_mapping"] = [ - (o if n <= k <= m else None) for k, o in enumerate(tokenized_example["offset_mapping"]) - ] - return tokenized_examples - - -def init_roberta_var(args): - if args.language == "ch": - tokenizer = RobertaTokenizer.from_pretrained(args.from_pretrained) - else: - tokenizer = RobertaBPETokenizer.from_pretrained(args.from_pretrained) - - model = RobertaForQuestionAnswering.from_pretrained(args.from_pretrained) - map_fn = partial(map_fn_DuCheckList, args=args, tokenizer=tokenizer) - dev_ds = RCInterpret().read(args.data_dir) - dev_ds.map(map_fn, batched=True) - dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) - batchify_fn = lambda samples, fn=Dict( - { - "input_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id), - "token_type_ids": Pad(axis=0, pad_val=tokenizer.pad_token_type_id), - } - ): fn(samples) - - dev_dataloader = paddle.io.DataLoader( - dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=batchify_fn, return_list=True - ) - - return model, tokenizer, dev_dataloader, dev_ds - - -@paddle.no_grad() -def evaluate(model, data_loader, args): - model.eval() - - all_start_logits = [] - all_end_logits = [] - tic_eval = time.time() - - for batch in data_loader: - input_ids, token_type_ids = batch - loss, start_logits_tensor, end_logits_tensor, cls_logits = model(input_ids, token_type_ids) - for idx in range(start_logits_tensor.shape[0]): - if len(all_start_logits) % 1000 == 0 and len(all_start_logits): - log.debug("Processing example: %d" % len(all_start_logits)) - log.debug("time per 1000:%.1f" % (time.time() - tic_eval)) - tic_eval = time.time() - - all_start_logits.append(start_logits_tensor.numpy()[idx]) - all_end_logits.append(end_logits_tensor.numpy()[idx]) - - all_predictions, all_nbest_json, scores_diff_json, all_feature_index = compute_prediction( - data_loader.dataset.data, - data_loader.dataset.new_data, - (all_start_logits, all_end_logits), - True, - 20, - args.max_seq_len, - 0.0, - ) - - # Can also write all_nbest_json and scores_diff_json files if needed - with open(os.path.join(args.output_dir, "predict_ans"), "w") as f_ans_pred: - f_ans_pred.write(json.dumps(all_predictions, ensure_ascii=False, indent=4) + "\n") - with open(os.path.join(args.output_dir, "predict_feature_index"), "w") as f_feature_index: - f_feature_index.write(json.dumps(all_feature_index, ensure_ascii=False, indent=4) + "\n") - - # squad_evaluate(examples=data_loader.dataset.data, preds=all_predictions, na_probs=scores_diff_json) - # model.train() - - -if __name__ == "__main__": - args = get_args() - if args.base_model.startswith("roberta"): - model, tokenizer, dataloader, dev_ds = init_roberta_var(args) - else: - raise ValueError("unsupported base model name.") - - with paddle.amp.auto_cast(enable=args.use_amp): - sd = paddle.load(args.init_checkpoint) - model.set_dict(sd) - log.debug("load model from %s" % args.init_checkpoint) - evaluate(model, dataloader, args) diff --git a/examples/model_interpretation/task/mrc/saliency_map/squad.py b/examples/model_interpretation/task/mrc/saliency_map/squad.py deleted file mode 100644 index 3ae811de5e5b..000000000000 --- 
a/examples/model_interpretation/task/mrc/saliency_map/squad.py
+++ /dev/null
@@ -1,476 +0,0 @@
-# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-# !/usr/bin/env python3
-import collections
-import json
-
-import numpy as np
-
-from paddlenlp.datasets import DatasetBuilder
-
-
-class Similarity(DatasetBuilder):
- # similarity test 21.10.3
- def _read(self, filename):
- with open(filename, "r", encoding="utf8") as f:
- for line in f.readlines():
- line_split = line.strip().split("\t")
- assert len(line_split) == 3
- yield {"text_a": line_split[0], "text_b": line_split[1], "label": line_split[2]}
-
-
-class RCInterpret(DatasetBuilder):
- # interpret 21.9.24
- def _read(self, filename):
- with open(filename, "r", encoding="utf8") as f:
- for line in f.readlines():
- example_dic = json.loads(line)
- id = example_dic["id"]
- title = example_dic["title"]
- context = example_dic["context"]
- question = example_dic["question"]
- if "sent_token" in example_dic:
- sent_token = example_dic["sent_token"]
- yield {
- "id": id,
- "title": title,
- "context": context,
- "question": question,
- "sent_token": sent_token,
- }
- else:
- yield {"id": id, "title": title, "context": context, "question": question}
-
-
-class DuReaderChecklist(DatasetBuilder):
- def _read(self, filename):
- with open(filename, "r", encoding="utf8") as f:
- input_data = json.load(f)["data"]
-
- for entry in input_data:
- for paragraph in entry["paragraphs"]:
- context = paragraph["context"].strip()
- title = paragraph.get("title", "").strip()
- for qa in paragraph["qas"]:
- qas_id = qa["id"]
- question = qa["question"].strip()
- answer_starts = []
- answers = []
- is_impossible = False
-
- if "is_impossible" in qa.keys():
- is_impossible = qa["is_impossible"]
-
- answer_starts = [answer["answer_start"] for answer in qa.get("answers", [])]
- answers = [answer["text"].strip() for answer in qa.get("answers", [])]
-
- yield {
- "id": qas_id,
- "title": title,
- "context": context,
- "question": question,
- "answers": answers,
- "answer_starts": answer_starts,
- "is_impossible": is_impossible,
- }
-
-
-def compute_prediction_checklist(
- examples,
- features,
- predictions,
- version_2_with_negative: bool = False,
- n_best_size: int = 20,
- max_answer_length: int = 30,
- cls_threshold: float = 0.5,
-):
- """
- Post-processes the predictions of a question-answering model to convert them to answers that are substrings of the
- original contexts. This is the base postprocessing function for models that only return start and end logits.
-
- Args:
- examples: The non-preprocessed dataset (see the main script for more information).
- features: The processed dataset (see the main script for more information).
- predictions (:obj:`Tuple[np.ndarray, np.ndarray, np.ndarray]`):
- The predictions of the model: three arrays containing the start logits, the end logits and the
- answerable classification logits respectively. Its
- first dimension must match the number of elements of :obj:`features`.
- version_2_with_negative (:obj:`bool`, `optional`, defaults to :obj:`False`):
- Whether or not the underlying dataset contains examples with no answers.
- n_best_size (:obj:`int`, `optional`, defaults to 20):
- The total number of n-best predictions to generate when looking for an answer.
- max_answer_length (:obj:`int`, `optional`, defaults to 30):
- The maximum length of an answer that can be generated. This is needed because the start and end predictions
- are not conditioned on one another.
- cls_threshold (:obj:`float`, `optional`, defaults to 0.5):
- The threshold used to select the null answer: if the predicted probability that the example is
- answerable is less than this threshold, the null answer is selected for this example.
-
- Only useful when :obj:`version_2_with_negative` is :obj:`True`.
- """
-
- assert (
- len(predictions) == 3
- ), "`predictions` should be a tuple with three elements (start_logits, end_logits, cls_logits)."
- all_start_logits, all_end_logits, all_cls_logits = predictions
-
- assert len(predictions[0]) == len(features), "Number of predictions should be equal to number of features." # number of examples
-
- # Build a map example to its corresponding features.
- features_per_example = collections.defaultdict(list)
- for i, feature in enumerate(features):
- features_per_example[feature["example_id"]].append(
- i
- ) # feature: dict(keys: 'input_ids', 'token_type_ids', 'offset_mapping', 'overflow_to_sample', 'example_id')
-
- # The dictionaries we have to fill.
- all_predictions = collections.OrderedDict()
- all_feature_index = collections.OrderedDict()
- all_nbest_json = collections.OrderedDict()
- all_cls_predictions = []
-
- # Let's loop over all the examples!
- for example_index, example in enumerate(examples):
- # Those are the indices of the features associated to the current example.
- feature_indices = features_per_example[example["id"]]
-
- min_null_prediction = None
- prelim_predictions = []
- score_answerable = -1
- # Looping through all the features associated to the current example.
- for feature_index in feature_indices:
- # We grab the predictions of the model for this feature.
- start_logits = all_start_logits[feature_index]
- end_logits = all_end_logits[feature_index]
- cls_logits = all_cls_logits[feature_index]
- # This is what will allow us to map some of the positions in our logits to spans of text in the original context.
- offset_mapping = features[feature_index][
- "offset_mapping"
- ] # list[tuple(2)], same length as input_ids, start_logits and end_logits
-
- # Optional `token_is_max_context`, if provided we will remove answers that do not have the maximum context
- # available in the current feature.
- token_is_max_context = features[feature_index].get("token_is_max_context", None)
-
- exp_answerable_scores = np.exp(cls_logits - np.max(cls_logits))
- feature_answerable_score = exp_answerable_scores / exp_answerable_scores.sum()
- if feature_answerable_score[-1] > score_answerable:
- score_answerable = feature_answerable_score[-1]
- answerable_probs = feature_answerable_score
-
- # Update minimum null prediction.
-            feature_null_score = start_logits[0] + end_logits[0]
-            if min_null_prediction is None or min_null_prediction["score"] > feature_null_score:
-                min_null_prediction = {
-                    "feature_index": (0, 0),
-                    "offsets": (0, 0),
-                    "score": feature_null_score,
-                    "start_logit": start_logits[0],
-                    "end_logit": end_logits[0],
-                }
-
-            # Go through all possibilities for the `n_best_size` greater start and end logits.
-            start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist()  # list(n_best_size), in descending order
-            end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist()  # list(n_best_size), in descending order
-            for start_index in start_indexes:
-                for end_index in end_indexes:
-                    # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond
-                    # to part of the input_ids that are not in the context.
-                    if (
-                        start_index >= len(offset_mapping)
-                        or end_index >= len(offset_mapping)
-                        or offset_mapping[start_index] is None
-                        or offset_mapping[end_index] is None  # positions of CLS, the question, and the first SEP
-                        or offset_mapping[start_index] == (0, 0)
-                        or offset_mapping[end_index] == (0, 0)  # position of the second SEP
-                    ):
-                        continue
-                    # Don't consider answers with a length that is either < 0 or > max_answer_length.
-                    if end_index < start_index or end_index - start_index + 1 > max_answer_length:
-                        continue
-                    # Don't consider answers that don't have the maximum context available (if such information is
-                    # provided).
-                    if token_is_max_context is not None and not token_is_max_context.get(str(start_index), False):
-                        continue
-                    prelim_predictions.append(
-                        {
-                            "feature_index": (start_index, end_index),
-                            "offsets": (offset_mapping[start_index][0], offset_mapping[end_index][1]),
-                            "score": start_logits[start_index] + end_logits[end_index],
-                            "start_logit": start_logits[start_index],
-                            "end_logit": end_logits[end_index],
-                        }
-                    )
-        if version_2_with_negative:
-            # Add the minimum null prediction
-            prelim_predictions.append(min_null_prediction)
-            pred_cls_label = np.argmax(np.array(answerable_probs))
-            all_cls_predictions.append([example["id"], pred_cls_label, answerable_probs[0], answerable_probs[1]])
-
-        # Only keep the best `n_best_size` predictions.
-        predictions = sorted(prelim_predictions, key=lambda x: x["score"], reverse=True)[:n_best_size]
-
-        # Add back the minimum null prediction if it was removed because of its low score.
-        if version_2_with_negative and not any(p["offsets"] == (0, 0) for p in predictions):
-            predictions.append(min_null_prediction)
-
-        # Use the offsets to gather the answer text in the original context.
-        context = example["context"]
-        for pred in predictions:
-            # offsets = pred.pop("offsets")
-            offsets = pred["offsets"]
-            pred["text"] = context[offsets[0] : offsets[1]] if context[offsets[0] : offsets[1]] != "" else "no answer"
-
-        # In the very rare edge case we have not a single non-null prediction, we create a fake prediction to avoid
-        # failure.
-        if len(predictions) == 0 or (len(predictions) == 1 and predictions[0]["text"] == "no answer"):
-            predictions.insert(
-                0,
-                {
-                    "feature_index": (0, 0),
-                    "offsets": (0, 0),
-                    "text": "no answer",
-                    "start_logit": 0.0,
-                    "end_logit": 0.0,
-                    "score": 0.0,
-                },
-            )
-
-        # Compute the softmax of all scores (we do it with numpy to stay independent from torch/tf in this file, using
-        # the LogSumExp trick).
-        scores = np.array([pred.pop("score") for pred in predictions])
-        exp_scores = np.exp(scores - np.max(scores))
-        probs = exp_scores / exp_scores.sum()
-
-        # Include the probabilities in our predictions.
-        for prob, pred in zip(probs, predictions):
-            pred["probability"] = prob
-
-        # Pick the best prediction. If the null answer is not possible, this is easy.
-        if not version_2_with_negative:
-            all_predictions[example["id"]] = predictions[0]["text"]
-            all_feature_index[example["id"]] = predictions[0]["feature_index"]
-        else:
-            # Otherwise we first need to find the best non-empty prediction.
-            i = 0
-            while predictions[i]["text"] == "no answer":
-                i += 1
-            best_non_null_pred = predictions[i]
-
-            if answerable_probs[1] < cls_threshold:
-                all_predictions[example["id"]] = "no answer"
-            else:
-                all_predictions[example["id"]] = best_non_null_pred["text"]
-                all_feature_index[example["id"]] = predictions[i]["feature_index"]
-
-        # Make `predictions` JSON-serializable by casting np.float back to float.
-        all_nbest_json[example["id"]] = [
-            {k: (float(v) if isinstance(v, (np.float16, np.float32, np.float64)) else v) for k, v in pred.items()}
-            for pred in predictions
-        ]
-
-    return all_predictions, all_nbest_json, all_cls_predictions, all_feature_index
-
-
-def compute_prediction(
-    examples,
-    features,
-    predictions,
-    version_2_with_negative=False,
-    n_best_size=20,
-    max_answer_length=30,
-    null_score_diff_threshold=0.0,
-):
-    """
-    Post-processes the predictions of a question-answering model to convert
-    them to answers that are substrings of the original contexts. This is
-    the base postprocessing function for models that only return start and
-    end logits.
-
-    Args:
-        examples (list): List of raw squad-style data (see `run_squad.py
-            `__ for more
-            information).
-        features (list): List of processed squad-style features (see
-            `run_squad.py `__
-            for more information).
-        predictions (tuple): The predictions of the model. Should be a tuple
-            of two lists containing the start logits and the end logits.
-        version_2_with_negative (bool, optional): Whether the dataset contains
-            examples with no answers. Defaults to False.
-        n_best_size (int, optional): The total number of candidate predictions
-            to generate. Defaults to 20.
-        max_answer_length (int, optional): The maximum length of predicted answer.
-            Defaults to 30.
-        null_score_diff_threshold (float, optional): The threshold used to select
-            the null answer. Only useful when `version_2_with_negative` is True.
-            Defaults to 0.0.
-
-    Returns:
-        A tuple of four dictionaries containing the final selected answer, all n-best
-        answers along with their probabilities and scores, the score_diff of each
-        example, and the feature index of each selected answer.
-    """
-    assert len(predictions) == 2, "`predictions` should be a tuple with two elements (start_logits, end_logits)."
-    all_start_logits, all_end_logits = predictions
-
-    assert len(predictions[0]) == len(features), "Number of predictions should be equal to number of features."
-
-    # Build a map example to its corresponding features.
-    features_per_example = collections.defaultdict(list)
-    for i, feature in enumerate(features):
-        features_per_example[feature["example_id"]].append(i)
-
-    # The dictionaries we have to fill.
-    all_predictions = collections.OrderedDict()
-    all_nbest_json = collections.OrderedDict()
-    all_feature_index = collections.OrderedDict()
-    scores_diff_json = collections.OrderedDict()
-
-    # Let's loop over all the examples!
-    for example_index, example in enumerate(examples):
-        # Those are the indices of the features associated to the current example.
- feature_indices = features_per_example[example["id"]] - - min_null_prediction = None - prelim_predictions = [] - - # Looping through all the features associated to the current example. - for feature_index in feature_indices: - # We grab the predictions of the model for this feature. - start_logits = all_start_logits[feature_index] - end_logits = all_end_logits[feature_index] - # This is what will allow us to map some the positions in our logits to span of texts in the original - # context. - offset_mapping = features[feature_index]["offset_mapping"] - # Optional `token_is_max_context`, if provided we will remove answers that do not have the maximum context - # available in the current feature. - token_is_max_context = features[feature_index].get("token_is_max_context", None) - - # Update minimum null prediction. - feature_null_score = start_logits[0] + end_logits[0] - if min_null_prediction is None or min_null_prediction["score"] > feature_null_score: - min_null_prediction = { - "feature_index": (0, 0), - "offsets": (0, 0), - "score": feature_null_score, - "start_logit": start_logits[0], - "end_logit": end_logits[0], - } - - # Go through all possibilities for the `n_best_size` greater start and end logits. - start_indexes = np.argsort(start_logits)[-1 : -n_best_size - 1 : -1].tolist() - end_indexes = np.argsort(end_logits)[-1 : -n_best_size - 1 : -1].tolist() - for start_index in start_indexes: - for end_index in end_indexes: - # Don't consider out-of-scope answers, either because the indices are out of bounds or correspond - # to part of the input_ids that are not in the context. - if ( - start_index >= len(offset_mapping) - or end_index >= len(offset_mapping) - or offset_mapping[start_index] is None - or offset_mapping[end_index] is None - or offset_mapping[start_index] == (0, 0) - or offset_mapping[end_index] == (0, 0) - ): - continue - # Don't consider answers with a length that is either < 0 or > max_answer_length. - if end_index < start_index or end_index - start_index + 1 > max_answer_length: - continue - # Don't consider answer that don't have the maximum context available (if such information is - # provided). - if token_is_max_context is not None and not token_is_max_context.get(str(start_index), False): - continue - prelim_predictions.append( - { - "feature_index": (start_index, end_index), - "offsets": (offset_mapping[start_index][0], offset_mapping[end_index][1]), - "score": start_logits[start_index] + end_logits[end_index], - "start_logit": start_logits[start_index], - "end_logit": end_logits[end_index], - } - ) - if version_2_with_negative: - # Add the minimum null prediction - prelim_predictions.append(min_null_prediction) - null_score = min_null_prediction["score"] - - # Only keep the best `n_best_size` predictions. - predictions = sorted(prelim_predictions, key=lambda x: x["score"], reverse=True)[:n_best_size] - - # Add back the minimum null prediction if it was removed because of its low score. - if version_2_with_negative and not any(p["offsets"] == (0, 0) for p in predictions): - predictions.append(min_null_prediction) - - # Use the offsets to gather the answer text in the original context. - context = example["context"] - for pred in predictions: - offsets = pred.pop("offsets") - pred["text"] = context[offsets[0] : offsets[1]] - - # In the very rare edge case we have not a single non-null prediction, we create a fake prediction to avoid - # failure. 
- if len(predictions) == 0 or (len(predictions) == 1 and predictions[0]["text"] == ""): - predictions.insert( - 0, {"feature_index": (0, 0), "text": "empty", "start_logit": 0.0, "end_logit": 0.0, "score": 0.0} - ) - - # Compute the softmax of all scores (we do it with numpy to stay independent from torch/tf in this file, using - # the LogSumExp trick). - scores = np.array([pred.pop("score") for pred in predictions]) - exp_scores = np.exp(scores - np.max(scores)) - probs = exp_scores / exp_scores.sum() - - # Include the probabilities in our predictions. - for prob, pred in zip(probs, predictions): - pred["probability"] = prob - - # Pick the best prediction. If the null answer is not possible, this is easy. - if not version_2_with_negative: - all_predictions[example["id"]] = predictions[0]["text"] - all_feature_index[example["id"]] = predictions[0]["feature_index"] - else: - # Otherwise we first need to find the best non-empty prediction. - i = 0 - while predictions[i]["text"] == "": - i += 1 - best_non_null_pred = predictions[i] - - # Then we compare to the null prediction using the threshold. - score_diff = null_score - best_non_null_pred["start_logit"] - best_non_null_pred["end_logit"] - scores_diff_json[example["id"]] = float(score_diff) # To be JSON-serializable. - if score_diff > null_score_diff_threshold: - all_predictions[example["id"]] = "" - else: - all_predictions[example["id"]] = best_non_null_pred["text"] - all_feature_index[example["id"]] = predictions[i]["feature_index"] - - # Make `predictions` JSON-serializable by casting np.float back to float. - all_nbest_json[example["id"]] = [ - {k: (float(v) if isinstance(v, (np.float16, np.float32, np.float64)) else v) for k, v in pred.items()} - for pred in predictions - ] - - return all_predictions, all_nbest_json, scores_diff_json, all_feature_index diff --git a/examples/model_interpretation/task/mrc/saliency_map/utils.py b/examples/model_interpretation/task/mrc/saliency_map/utils.py deleted file mode 100644 index 88c1619769ee..000000000000 --- a/examples/model_interpretation/task/mrc/saliency_map/utils.py +++ /dev/null @@ -1,37 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
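Both `compute_prediction*` functions removed above convert the candidate span scores into probabilities with a max-subtracted softmax (the LogSumExp trick), so that large logits cannot overflow `np.exp`. A minimal standalone sketch of just that step, with a hypothetical `scores` array standing in for the per-span `start_logit + end_logit` sums:

```python
import numpy as np

# Hypothetical candidate scores: start_logit + end_logit for each n-best span.
scores = np.array([8.1, 6.7, 5.9, -1.2])

# Subtracting the max before exponentiating keeps every exponential in a
# safe numeric range without changing the resulting softmax distribution.
exp_scores = np.exp(scores - np.max(scores))
probs = exp_scores / exp_scores.sum()

assert np.isclose(probs.sum(), 1.0)
print(probs)  # the highest-scoring span receives the highest probability
```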
- -from __future__ import absolute_import, division, print_function, unicode_literals - -import paddle - - -class UnpackDataLoader(paddle.io.DataLoader): - def __init__(self, *args, **kwargs): - super(UnpackDataLoader, self).__init__(*args, batch_size=1, **kwargs) - - def __iter__(self): - return ([yy[0] for yy in y] for y in super(UnpackDataLoader, self).__iter__()) - - -def create_if_not_exists(dir): - try: - dir.mkdir(parents=True) - except: - pass - return dir - - -def get_warmup_and_linear_decay(max_steps, warmup_steps): - return lambda step: min(step / warmup_steps, 1.0 - (step - warmup_steps) / (max_steps - warmup_steps)) diff --git a/examples/model_interpretation/task/senti/LIME/exceptions.py b/examples/model_interpretation/task/senti/LIME/exceptions.py deleted file mode 100644 index c5fa1a29924a..000000000000 --- a/examples/model_interpretation/task/senti/LIME/exceptions.py +++ /dev/null @@ -1,2 +0,0 @@ -class LimeError(Exception): - """Raise for errors""" diff --git a/examples/model_interpretation/task/senti/LIME/explanation.py b/examples/model_interpretation/task/senti/LIME/explanation.py deleted file mode 100644 index 6e212b1613ca..000000000000 --- a/examples/model_interpretation/task/senti/LIME/explanation.py +++ /dev/null @@ -1,344 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -""" -Explanation class, with visualization functions. -""" -from io import open -import os -import os.path -import json -import string -import numpy as np - -# from .exceptions import LimeError -from LIME.exceptions import LimeError - -from sklearn.utils import check_random_state - - -def id_generator(size=15, random_state=None): - """Helper function to generate random div ids. This is useful for embedding - HTML into ipython notebooks.""" - chars = list(string.ascii_uppercase + string.digits) - return "".join(random_state.choice(chars, size, replace=True)) - - -class DomainMapper(object): - """Class for mapping features to the specific domain. - - The idea is that there would be a subclass for each domain (text, tables, - images, etc), so that we can have a general Explanation class, and separate - out the specifics of visualizing features in here. - """ - - def __init__(self): - pass - - def map_exp_ids(self, exp, **kwargs): - """Maps the feature ids to concrete names. - - Default behaviour is the identity function. Subclasses can implement - this as they see fit. - - Args: - exp: list of tuples [(id, weight), (id,weight)] - kwargs: optional keyword arguments - - Returns: - exp: list of tuples [(name, weight), (name, weight)...] - """ - return exp - - def visualize_instance_html(self, exp, label, div_name, exp_object_name, **kwargs): - """Produces html for visualizing the instance. - - Default behaviour does nothing. Subclasses can implement this as they - see fit. 
- - Args: - exp: list of tuples [(id, weight), (id,weight)] - label: label id (integer) - div_name: name of div object to be used for rendering(in js) - exp_object_name: name of js explanation object - kwargs: optional keyword arguments - - Returns: - js code for visualizing the instance - """ - return "" - - -class Explanation(object): - """Object returned by explainers.""" - - def __init__(self, domain_mapper, mode="classification", class_names=None, random_state=None): - """ - - Initializer. - - Args: - domain_mapper: must inherit from DomainMapper class - type: "classification" or "regression" - class_names: list of class names (only used for classification) - random_state: an integer or numpy.RandomState that will be used to - generate random numbers. If None, the random state will be - initialized using the internal numpy seed. - """ - self.random_state = random_state - self.mode = mode - self.domain_mapper = domain_mapper - self.local_exp = {} - self.intercept = {} - self.score = {} - self.local_pred = {} - if mode == "classification": - self.class_names = class_names - self.top_labels = None - self.predict_proba = None - elif mode == "regression": - self.class_names = ["negative", "positive"] - self.predicted_value = None - self.min_value = 0.0 - self.max_value = 1.0 - self.dummy_label = 1 - else: - raise LimeError( - 'Invalid explanation mode "{}". ' 'Should be either "classification" ' 'or "regression".'.format(mode) - ) - - def available_labels(self): - """ - Returns the list of classification labels for which we have any explanations. - """ - try: - assert self.mode == "classification" - except AssertionError: - raise NotImplementedError("Not supported for regression explanations.") - else: - ans = self.top_labels if self.top_labels else self.local_exp.keys() - return list(ans) - - def as_list(self, label=1, **kwargs): - """Returns the explanation as a list. - - Args: - label: desired label. If you ask for a label for which an - explanation wasn't computed, will throw an exception. - Will be ignored for regression explanations. - kwargs: keyword arguments, passed to domain_mapper - - Returns: - list of tuples (representation, weight), where representation is - given by domain_mapper. Weight is a float. - """ - label_to_use = label if self.mode == "classification" else self.dummy_label - ans = self.domain_mapper.map_exp_ids(self.local_exp[label_to_use], **kwargs) - ans = [(x[0], float(x[1])) for x in ans] - return ans - - def as_map(self): - """Returns the map of explanations. - - Returns: - Map from label to list of tuples (feature_id, weight). - """ - return self.local_exp - - def as_pyplot_figure(self, label=1, **kwargs): - """Returns the explanation as a pyplot figure. - - Will throw an error if you don't have matplotlib installed - Args: - label: desired label. If you ask for a label for which an - explanation wasn't computed, will throw an exception. - Will be ignored for regression explanations. - kwargs: keyword arguments, passed to domain_mapper - - Returns: - pyplot figure (barchart). 
- """ - import matplotlib.pyplot as plt - - exp = self.as_list(label=label, **kwargs) - fig = plt.figure() - vals = [x[1] for x in exp] - names = [x[0] for x in exp] - vals.reverse() - names.reverse() - colors = ["green" if x > 0 else "red" for x in vals] - pos = np.arange(len(exp)) + 0.5 - plt.barh(pos, vals, align="center", color=colors) - plt.yticks(pos, names) - if self.mode == "classification": - title = "Local explanation for class %s" % self.class_names[label] - else: - title = "Local explanation" - plt.title(title) - return fig - - def show_in_notebook(self, labels=None, predict_proba=True, show_predicted_value=True, **kwargs): - """Shows html explanation in ipython notebook. - - See as_html() for parameters. - This will throw an error if you don't have IPython installed""" - - from IPython.core.display import display, HTML - - display( - HTML( - self.as_html( - labels=labels, predict_proba=predict_proba, show_predicted_value=show_predicted_value, **kwargs - ) - ) - ) - - def save_to_file(self, file_path, labels=None, predict_proba=True, show_predicted_value=True, **kwargs): - """Saves html explanation to file. . - - Params: - file_path: file to save explanations to - - See as_html() for additional parameters. - - """ - file_ = open(file_path, "w", encoding="utf8") - file_.write( - self.as_html( - labels=labels, predict_proba=predict_proba, show_predicted_value=show_predicted_value, **kwargs - ) - ) - file_.close() - - def as_html(self, labels=None, predict_proba=True, show_predicted_value=True, **kwargs): - """Returns the explanation as an html page. - - Args: - labels: desired labels to show explanations for (as barcharts). - If you ask for a label for which an explanation wasn't - computed, will throw an exception. If None, will show - explanations for all available labels. (only used for classification) - predict_proba: if true, add barchart with prediction probabilities - for the top classes. (only used for classification) - show_predicted_value: if true, add barchart with expected value - (only used for regression) - kwargs: keyword arguments, passed to domain_mapper - - Returns: - code for an html page, including javascript includes. - """ - - def jsonize(x): - return json.dumps(x, ensure_ascii=False) - - if labels is None and self.mode == "classification": - labels = self.available_labels() - - this_dir, _ = os.path.split(__file__) - bundle = open(os.path.join(this_dir, "bundle.js"), encoding="utf8").read() - - out = ( - """ - - """ - % bundle - ) - random_id = id_generator(size=15, random_state=check_random_state(self.random_state)) - out += ( - """ -
- """ - % random_id - ) - - predict_proba_js = "" - if self.mode == "classification" and predict_proba: - predict_proba_js = """ - var pp_div = top_div.append('div') - .classed('lime predict_proba', true); - var pp_svg = pp_div.append('svg').style('width', '100%%'); - var pp = new lime.PredictProba(pp_svg, %s, %s); - """ % ( - jsonize([str(x) for x in self.class_names]), - jsonize(list(self.predict_proba.astype(float))), - ) - - predict_value_js = "" - if self.mode == "regression" and show_predicted_value: - # reference self.predicted_value - # (svg, predicted_value, min_value, max_value) - predict_value_js = """ - var pp_div = top_div.append('div') - .classed('lime predicted_value', true); - var pp_svg = pp_div.append('svg').style('width', '100%%'); - var pp = new lime.PredictedValue(pp_svg, %s, %s, %s); - """ % ( - jsonize(float(self.predicted_value)), - jsonize(float(self.min_value)), - jsonize(float(self.max_value)), - ) - - exp_js = """var exp_div; - var exp = new lime.Explanation(%s); - """ % ( - jsonize([str(x) for x in self.class_names]) - ) - - if self.mode == "classification": - for label in labels: - exp = jsonize(self.as_list(label)) - exp_js += """ - exp_div = top_div.append('div').classed('lime explanation', true); - exp.show(%s, %d, exp_div); - """ % ( - exp, - label, - ) - else: - exp = jsonize(self.as_list()) - exp_js += """ - exp_div = top_div.append('div').classed('lime explanation', true); - exp.show(%s, %s, exp_div); - """ % ( - exp, - self.dummy_label, - ) - - raw_js = """var raw_div = top_div.append('div');""" - - if self.mode == "classification": - html_data = self.local_exp[labels[0]] - else: - html_data = self.local_exp[self.dummy_label] - - raw_js += self.domain_mapper.visualize_instance_html( - html_data, labels[0] if self.mode == "classification" else self.dummy_label, "raw_div", "exp", **kwargs - ) - out += """ - - """ % ( - random_id, - predict_proba_js, - predict_value_js, - exp_js, - raw_js, - ) - out += "" - - return out diff --git a/examples/model_interpretation/task/senti/LIME/lime_base.py b/examples/model_interpretation/task/senti/LIME/lime_base.py deleted file mode 100644 index 2c9104f69b54..000000000000 --- a/examples/model_interpretation/task/senti/LIME/lime_base.py +++ /dev/null @@ -1,226 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -""" -Contains abstract functionality for learning locally linear sparse model. -""" -import numpy as np -import scipy as sp -from sklearn.linear_model import Ridge, lars_path -from sklearn.utils import check_random_state - - -class LimeBase(object): - """Class for learning a locally linear sparse model from perturbed data""" - - def __init__(self, kernel_fn, verbose=False, random_state=None): - """Init function - - Args: - kernel_fn: function that transforms an array of distances into an - array of proximity values (floats). - verbose: if true, print local prediction values from linear model. 
- random_state: an integer or numpy.RandomState that will be used to - generate random numbers. If None, the random state will be - initialized using the internal numpy seed. - """ - self.kernel_fn = kernel_fn - self.verbose = verbose - self.random_state = check_random_state(random_state) - - @staticmethod - def generate_lars_path(weighted_data, weighted_labels): - """Generates the lars path for weighted data. - - Args: - weighted_data: data that has been weighted by kernel - weighted_label: labels, weighted by kernel - - Returns: - (alphas, coefs), both are arrays corresponding to the - regularization parameter and coefficients, respectively - """ - x_vector = weighted_data - alphas, _, coefs = lars_path(x_vector, weighted_labels, method="lasso", verbose=False) - return alphas, coefs - - def forward_selection(self, data, labels, weights, num_features): - """Iteratively adds features to the model""" - clf = Ridge(alpha=0, fit_intercept=True, random_state=self.random_state) - used_features = [] - for _ in range(min(num_features, data.shape[1])): - max_ = -100000000 - best = 0 - for feature in range(data.shape[1]): - if feature in used_features: - continue - clf.fit(data[:, used_features + [feature]], labels, sample_weight=weights) - score = clf.score(data[:, used_features + [feature]], labels, sample_weight=weights) - if score > max_: - best = feature - max_ = score - used_features.append(best) - return np.array(used_features) - - def feature_selection(self, data, labels, weights, num_features, method): - """Selects features for the model. see explain_instance_with_data to - understand the parameters.""" - if method == "none": - return np.array(range(data.shape[1])) - - elif method == "forward_selection": - return self.forward_selection(data, labels, weights, num_features) - - elif method == "highest_weights": - clf = Ridge(alpha=0.01, fit_intercept=True, random_state=self.random_state) - clf.fit(data, labels, sample_weight=weights) - - coef = clf.coef_ - if sp.sparse.issparse(data): - coef = sp.sparse.csr_matrix(clf.coef_) - weighted_data = coef.multiply(data[0]) - # Note: most efficient to slice the data before reversing - sdata = len(weighted_data.data) - argsort_data = np.abs(weighted_data.data).argsort() - # Edge case where data is more sparse than requested number of feature importances - # In that case, we just pad with zero-valued features - if sdata < num_features: - nnz_indexes = argsort_data[::-1] - indices = weighted_data.indices[nnz_indexes] - num_to_pad = num_features - sdata - indices = np.concatenate((indices, np.zeros(num_to_pad, dtype=indices.dtype))) - indices_set = set(indices) - pad_counter = 0 - for i in range(data.shape[1]): - if i not in indices_set: - indices[pad_counter + sdata] = i - pad_counter += 1 - if pad_counter >= num_to_pad: - break - else: - nnz_indexes = argsort_data[sdata - num_features : sdata][::-1] - indices = weighted_data.indices[nnz_indexes] - return indices - else: - weighted_data = coef * data[0] - feature_weights = sorted( - zip(range(data.shape[1]), weighted_data), # zip(特征的编号, Ridge的w值) - key=lambda x: np.abs(x[1]), - reverse=True, - ) - return np.array([x[0] for x in feature_weights[:num_features]]) # 返回Ridge的前num_features大的w的值对应的特征编号 - - elif method == "lasso_path": - weighted_data = (data - np.average(data, axis=0, weights=weights)) * np.sqrt(weights[:, np.newaxis]) - weighted_labels = (labels - np.average(labels, weights=weights)) * np.sqrt(weights) - nonzero = range(weighted_data.shape[1]) - _, coefs = 
self.generate_lars_path(weighted_data, weighted_labels) - for i in range(len(coefs.T) - 1, 0, -1): - nonzero = coefs.T[i].nonzero()[0] - if len(nonzero) <= num_features: - break - used_features = nonzero - return used_features - - elif method == "auto": - if num_features <= 6: - n_method = "forward_selection" - else: - n_method = "highest_weights" - return self.feature_selection(data, labels, weights, num_features, n_method) - - def explain_instance_with_data( - self, - neighborhood_data, - neighborhood_labels, - distances, - label, - num_features, - feature_selection="auto", - model_regressor=None, - ): - """Takes perturbed data, labels and distances, returns explanation. - - Args: - neighborhood_data: perturbed data, 2d array. first element is - assumed to be the original data point. - neighborhood_labels: corresponding perturbed labels. should have as - many columns as the number of possible labels. - distances: distances to original data point. - label: label for which we want an explanation - num_features: maximum number of features in explanation - feature_selection: how to select num_features. options are: - 'forward_selection': iteratively add features to the model. - This is costly when num_features is high - 'highest_weights': selects the features that have the highest - product of absolute weight * original data point when - learning with all the features - 'lasso_path': chooses features based on the lasso - regularization path - 'none': uses all features, ignores num_features - 'auto': uses forward_selection if num_features <= 6, and - 'highest_weights' otherwise. - model_regressor: sklearn regressor to use in explanation. - Defaults to Ridge regression if None. Must have - model_regressor.coef_ and 'sample_weight' as a parameter - to model_regressor.fit() - - Returns: - (intercept, exp, score, local_pred): - intercept is a float. - exp is a sorted list of tuples, where each tuple (x,y) corresponds to the feature id (x) - and the local weight (y). The list is sorted by decreasing absolute value of y. 
-            score is the R^2 value of the returned explanation
-            local_pred is the prediction of the explanation model on the original instance
-        """
-
-        weights = self.kernel_fn(distances)  # weights of the perturbed samples
-        labels_column = neighborhood_labels[:, label]  # softmax probabilities for class `label`
-
-        used_features = self.feature_selection(
-            neighborhood_data, labels_column, weights, num_features, feature_selection
-        )
-        if model_regressor is None:
-            model_regressor = Ridge(
-                alpha=1, fit_intercept=True, random_state=self.random_state  # L2 regularization strength  # whether to fit the intercept b
-            )  # pseudo-random seed
-        easy_model = model_regressor
-        easy_model.fit(neighborhood_data[:, used_features], labels_column, sample_weight=weights)
-        prediction_score = easy_model.score(neighborhood_data[:, used_features], labels_column, sample_weight=weights)
-
-        local_pred = easy_model.predict(neighborhood_data[0, used_features].reshape(1, -1))
-
-        ridge_pred = easy_model.predict(neighborhood_data[:, used_features])
-        err_np = np.abs(labels_column - ridge_pred)
-        # relative_err_np = err_np / labels_column
-        relative_err_np = err_np / ridge_pred
-        err = np.average(err_np, weights=weights)
-        relative_err = np.average(relative_err_np, weights=weights)
-
-        if self.verbose:
-            print("Intercept", easy_model.intercept_)
-            print(
-                "Prediction_local",
-                local_pred,
-            )
-            print("Right:", neighborhood_labels[0, label])
-        return (
-            easy_model.intercept_,  #
-            sorted(
-                zip(used_features, easy_model.coef_), key=lambda x: np.abs(x[1]), reverse=True
-            ),  # (feature_id, weight) pairs sorted by weight magnitude
-            prediction_score,  # how closely easy_model's predictions match the labels; larger is better (smaller gap), 1 at most
-            local_pred,  # easy_model's predicted probability for the original sample
-            relative_err,
-            err,
-        )
diff --git a/examples/model_interpretation/task/senti/LIME/lime_text.py b/examples/model_interpretation/task/senti/LIME/lime_text.py
deleted file mode 100644
index 7ef6d3bc40de..000000000000
--- a/examples/model_interpretation/task/senti/LIME/lime_text.py
+++ /dev/null
@@ -1,664 +0,0 @@
-# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#!/usr/bin/env python3
-"""
-Functions for explaining text classifiers.
-"""
-import itertools
-import json
-import re
-import time
-import math
-import paddle
-from functools import partial
-
-import numpy as np
-import scipy as sp
-import sklearn
-from sklearn.utils import check_random_state
-
-import LIME.explanation as explanation
-import LIME.lime_base as lime_base
-
-
-class TextDomainMapper(explanation.DomainMapper):
-    """Maps feature ids to words or word-positions"""
-
-    def __init__(self, indexed_string, language):
-        """Initializer.
-
-        Args:
-            indexed_string: lime_text.IndexedString, original string
-        """
-        self.indexed_string = indexed_string
-        self.language = language
-
-    def map_exp_ids(self, exp, positions=False):
-        """Maps ids to words or word-position strings.
- - Args: - exp: list of tuples [(id, weight), (id,weight)] - positions: if True, also return word positions - - Returns: - list of tuples (word, weight), or (word_positions, weight) if - examples: ('bad', 1) or ('bad_3-6-12', 1) - """ - if positions: - exp = [ - ( - "%s_%s" - % (self.indexed_string.word(x[0]), "-".join(map(str, self.indexed_string.string_position(x[0])))), - x[1], - ) - for x in exp - ] - else: - exp = [(self.indexed_string.word(x[0]), x[1]) for x in exp] - return exp - - def visualize_instance_html(self, exp, label, div_name, exp_object_name, text=True, opacity=True): - """Adds text with highlighted words to visualization. - - Args: - exp: list of tuples [(id, weight), (id,weight)] - label: label id (integer) - div_name: name of div object to be used for rendering(in js) - exp_object_name: name of js explanation object - text: if False, return empty - opacity: if True, fade colors according to weight - """ - if not text: - return "" - text = self.indexed_string.raw_string().encode("utf-8", "xmlcharrefreplace").decode("utf-8") - text = re.sub(r"[<>&]", "|", text) - exp = [(self.indexed_string.word(x[0]), self.indexed_string.string_position(x[0]), x[1]) for x in exp] - all_occurrences = list(itertools.chain.from_iterable([itertools.product([x[0]], x[1], [x[2]]) for x in exp])) - all_occurrences = [(x[0], int(x[1]), x[2]) for x in all_occurrences] - ret = """ - %s.show_raw_text(%s, %d, %s, %s, %s); - """ % ( - exp_object_name, - json.dumps(all_occurrences), - label, - json.dumps(text), - div_name, - json.dumps(opacity), - ) - return ret - - -class IndexedString(object): - """String with various indexes.""" - - def __init__(self, raw_string, split_expression=r"\W+", bow=True, mask_string=None, language="en"): - """Initializer. - - Args: - raw_string: string with raw text in it - split_expression: Regex string or callable. If regex string, will be used with re.split. - If callable, the function should return a list of tokens. - bow: if True, a word is the same everywhere in the text - i.e. we - will index multiple occurrences of the same word. If False, - order matters, so that the same word will have different ids - according to position. - mask_string: If not None, replace words with this if bow=False - if None, default value is UNKWORDZ - """ - self.raw = raw_string - self.mask_string = "UNKWORDZ" if mask_string is None else mask_string - self.language = language - - if callable(split_expression): - tokens = split_expression(self.raw) - self.as_list = self._segment_with_tokens(self.raw, tokens) - tokens = set(tokens) - - def non_word(string): - return string not in tokens - - else: - # with the split_expression as a non-capturing group (?:), we don't need to filter out - # the separator character from the split results. 
- # splitter = re.compile(r'(%s)|$' % split_expression) - # self.as_list = [s for s in splitter.split(self.raw) if s] - if self.language == "ch": - splitter = re.compile(r"([\u4e00-\u9fa5])") - self.as_list = [w for w in splitter.split(self.raw) if len(w.strip()) > 0] - else: - splitter = re.compile(split_expression) - self.as_list = [w for w in self.raw.strip().split() if len(w.strip()) > 0] - valid_word = splitter.match - - self.as_np = np.array(self.as_list) - self.string_start = np.hstack(([0], np.cumsum([len(x) for x in self.as_np[:-1]]))) - vocab = {} - self.inverse_vocab = [] - self.positions = [] - self.bow = bow - non_vocab = set() - for i, word in enumerate(self.as_np): - if word in non_vocab: - continue - if (valid_word(word) and self.language == "en") or (not valid_word(word) and self.language == "ch"): - non_vocab.add(word) - continue - if bow: - if word not in vocab: - vocab[word] = len(vocab) - self.inverse_vocab.append(word) - self.positions.append([]) - idx_word = vocab[word] - self.positions[idx_word].append(i) - else: - self.inverse_vocab.append(word) - self.positions.append(i) - if not bow: - self.positions = np.array(self.positions) - - def raw_string(self): - """Returns the original raw string""" - return self.raw - - def num_words(self): - """Returns the number of tokens in the vocabulary for this document.""" - return len(self.inverse_vocab) - - def word(self, id_): - """Returns the word that corresponds to id_ (int)""" - return self.inverse_vocab[id_] - - def string_position(self, id_): - """Returns a np array with indices to id_ (int) occurrences""" - if self.bow: - return self.string_start[self.positions[id_]] - else: - return self.string_start[[self.positions[id_]]] - - def inverse_removing(self, words_to_remove): - """Returns a string after removing the appropriate words. - - If self.bow is false, replaces word with UNKWORDZ instead of removing it. - - Args: - words_to_remove: list of ids (ints) to remove - - Returns: - original raw string with appropriate words removed. - """ - mask = np.ones(self.as_np.shape[0], dtype="bool") - mask[self.__get_idxs(words_to_remove)] = False - if self.language == "ch": - if not self.bow: - return "".join([self.as_list[i] if mask[i] else self.mask_string for i in range(mask.shape[0])]) - return "".join([self.as_list[v] for v in mask.nonzero()[0]]) - else: - if not self.bow: - return " ".join([self.as_list[i] if mask[i] else self.mask_string for i in range(mask.shape[0])]) - return " ".join([self.as_list[v] for v in mask.nonzero()[0]]) - - @staticmethod - def _segment_with_tokens(text, tokens): - """Segment a string around the tokens created by a passed-in tokenizer""" - list_form = [] - text_ptr = 0 - for token in tokens: - inter_token_string = [] - while not text[text_ptr:].startswith(token): - inter_token_string.append(text[text_ptr]) - text_ptr += 1 - if text_ptr >= len(text): - raise ValueError("Tokenization produced tokens that do not belong in string!") - text_ptr += len(token) - if inter_token_string: - list_form.append("".join(inter_token_string)) - list_form.append(token) - if text_ptr < len(text): - list_form.append(text[text_ptr:]) - return list_form - - def __get_idxs(self, words): - """Returns indexes to appropriate words.""" - if self.bow: - return list(itertools.chain.from_iterable([self.positions[z] for z in words])) - else: - return self.positions[words] - - -class IndexedCharacters(object): - """String with various indexes.""" - - def __init__(self, raw_string, bow=True, mask_string=None): - """Initializer. 
- - Args: - raw_string: string with raw text in it - bow: if True, a char is the same everywhere in the text - i.e. we - will index multiple occurrences of the same character. If False, - order matters, so that the same word will have different ids - according to position. - mask_string: If not None, replace characters with this if bow=False - if None, default value is chr(0) - """ - self.raw = raw_string - self.as_list = list(self.raw) - self.as_np = np.array(self.as_list) - self.mask_string = chr(0) if mask_string is None else mask_string - self.string_start = np.arange(len(self.raw)) - vocab = {} - self.inverse_vocab = [] - self.positions = [] - self.bow = bow - non_vocab = set() - for i, char in enumerate(self.as_np): - if char in non_vocab: - continue - if bow: - if char not in vocab: - vocab[char] = len(vocab) - self.inverse_vocab.append(char) - self.positions.append([]) - idx_char = vocab[char] - self.positions[idx_char].append(i) - else: - self.inverse_vocab.append(char) - self.positions.append(i) - if not bow: - self.positions = np.array(self.positions) - - def raw_string(self): - """Returns the original raw string""" - return self.raw - - def num_words(self): - """Returns the number of tokens in the vocabulary for this document.""" - return len(self.inverse_vocab) - - def word(self, id_): - """Returns the word that corresponds to id_ (int)""" - return self.inverse_vocab[id_] - - def string_position(self, id_): - """Returns a np array with indices to id_ (int) occurrences""" - if self.bow: - return self.string_start[self.positions[id_]] - else: - return self.string_start[[self.positions[id_]]] - - def inverse_removing(self, words_to_remove): - """Returns a string after removing the appropriate words. - - If self.bow is false, replaces word with UNKWORDZ instead of removing - it. - - Args: - words_to_remove: list of ids (ints) to remove - - Returns: - original raw string with appropriate words removed. - """ - mask = np.ones(self.as_np.shape[0], dtype="bool") - mask[self.__get_idxs(words_to_remove)] = False - if not self.bow: - return "".join([self.as_list[i] if mask[i] else self.mask_string for i in range(mask.shape[0])]) - return "".join([self.as_list[v] for v in mask.nonzero()[0]]) - - def __get_idxs(self, words): - """Returns indexes to appropriate words.""" - if self.bow: - return list(itertools.chain.from_iterable([self.positions[z] for z in words])) - else: - return self.positions[words] - - -class LimeTextExplainer(object): - """Explains text classifiers. - Currently, we are using an exponential kernel on cosine distance, and - restricting explanations to words that are present in documents.""" - - def __init__( - self, - kernel_width=25, - kernel=None, - verbose=False, - class_names=None, - feature_selection="auto", - split_expression=r"\W+", - bow=True, - mask_string=None, - random_state=None, - char_level=False, - language="en", - ): - """Init function. - - Args: - kernel_width: kernel width for the exponential kernel. - kernel: similarity kernel that takes euclidean distances and kernel - width as input and outputs weights in (0,1). If None, defaults to - an exponential kernel. - verbose: if true, print local prediction values from linear model - class_names: list of class names, ordered according to whatever the - classifier is using. If not present, class names will be '0', - '1', ... - feature_selection: feature selection method. can be - 'forward_selection', 'lasso_path', 'none' or 'auto'. 
- See function 'explain_instance_with_data' in lime_base.py for - details on what each of the options does. - split_expression: Regex string or callable. If regex string, will be used with re.split. - If callable, the function should return a list of tokens. - bow: if True (bag of words), will perturb input data by removing - all occurrences of individual words or characters. - Explanations will be in terms of these words. Otherwise, will - explain in terms of word-positions, so that a word may be - important the first time it appears and unimportant the second. - Only set to false if the classifier uses word order in some way - (bigrams, etc), or if you set char_level=True. - mask_string: String used to mask tokens or characters if bow=False - if None, will be 'UNKWORDZ' if char_level=False, chr(0) - otherwise. - random_state: an integer or numpy.RandomState that will be used to - generate random numbers. If None, the random state will be - initialized using the internal numpy seed. - char_level: an boolean identifying that we treat each character - as an independent occurence in the string - """ - - if kernel is None: - - def kernel(d, kernel_width): - return np.sqrt(np.exp(-(d**2) / kernel_width**2)) - - kernel_fn = partial(kernel, kernel_width=kernel_width) - - self.random_state = check_random_state(random_state) - self.base = lime_base.LimeBase(kernel_fn, verbose, random_state=self.random_state) - self.class_names = class_names - self.vocabulary = None - self.feature_selection = feature_selection - self.bow = bow - self.mask_string = mask_string - self.split_expression = split_expression - self.char_level = char_level - self.language = language - - def explain_instance( - self, - text_instance: str, - tokenizer, - pred_label: int, - classifier_fn, - labels=(0, 1), - top_labels=None, - num_features=10, - num_samples=5000, - distance_metric="cosine", - model_regressor=None, - if_lstm=False, - ): - """Generates explanations for a prediction. - - First, we generate neighborhood data by randomly hiding features from - the instance (see __data_labels_distance_mapping). We then learn - locally weighted linear models on this neighborhood data to explain - each of the classes in an interpretable way (see lime_base.py). - - Args: - text_instance: raw text string to be explained. - classifier_fn: classifier prediction probability function, which - takes a list of d strings and outputs a (d, k) numpy array with - prediction probabilities, where k is the number of classes. - For ScikitClassifiers , this is classifier.predict_proba. - labels: iterable with labels to be explained. - top_labels: if not None, ignore labels and produce explanations for - the K labels with highest prediction probabilities, where K is - this parameter. - num_features: maximum number of features present in explanation - num_samples: size of the neighborhood to learn the linear model - distance_metric: the distance metric to use for sample weighting, - defaults to cosine similarity - model_regressor: sklearn regressor to use in explanation. Defaults - to Ridge regression in LimeBase. Must have model_regressor.coef_ - and 'sample_weight' as a parameter to model_regressor.fit() - Returns: - An Explanation object (see explanation.py) with the corresponding - explanations. 
- """ - indexed_string = ( - IndexedCharacters(text_instance, bow=self.bow, mask_string=self.mask_string) - if self.char_level - else IndexedString( - text_instance, - bow=self.bow, - split_expression=self.split_expression, - mask_string=self.mask_string, - language=self.language, - ) - ) - domain_mapper = TextDomainMapper(indexed_string, self.language) - - # 产生扰动数据集 第一条是原始数据 - # data: 解释器训练特征 list (num_samples, doc_size) - # yss: 解释器训练标签 list (num_samples, class_num(2)) - # distances: 扰动样本到原始样本的距离 np.array(float) (num_samples, ) - data, yss, distances = self.__data_labels_distances( - indexed_string, tokenizer, classifier_fn, num_samples, distance_metric=distance_metric, if_lstm=if_lstm - ) - - if self.class_names is None: - self.class_names = [str(x) for x in range(yss[0].shape[0])] - ret_exp = explanation.Explanation( - domain_mapper=domain_mapper, class_names=self.class_names, random_state=self.random_state - ) - ret_exp.predict_proba = yss[0] - if top_labels: - labels = np.argsort(yss[0])[-top_labels:] - ret_exp.top_labels = list(labels) - ret_exp.top_labels.reverse() - - num_features = indexed_string.num_words() # 特征数量跟word_num相同 - - ( - ret_exp.intercept[pred_label], - ret_exp.local_exp[pred_label], - ret_exp.score[pred_label], - ret_exp.local_pred[pred_label], - relative_err, - err, - ) = self.base.explain_instance_with_data( - data, - yss, - distances, - pred_label, - num_features, - model_regressor=model_regressor, - feature_selection=self.feature_selection, - ) - - return ret_exp, indexed_string, relative_err, err - - def __data_labels_distances( - self, indexed_string, tokenizer, classifier_fn, num_samples, distance_metric="cosine", if_lstm=False - ): - """Generates a neighborhood around a prediction. - - Generates neighborhood data by randomly removing words from - the instance, and predicting with the classifier. Uses cosine distance - to compute distances between original and perturbed instances. - Args: - indexed_string: document (IndexedString) to be explained, - classifier_fn: classifier prediction probability function, which - takes a string and outputs prediction probabilities. For - ScikitClassifier, this is classifier.predict_proba. - num_samples: size of the neighborhood to learn the linear model - distance_metric: the distance metric to use for sample weighting, - defaults to cosine similarity. - - Returns: - A tuple (data, labels, distances), where: - data: dense num_samples * K binary matrix, where K is the - number of tokens in indexed_string. The first row is the - original instance, and thus a row of ones. - labels: num_samples * L matrix, where L is the number of target - labels - distances: cosine distance between the original instance and - each perturbed instance (computed in the binary 'data' - matrix), times 100. 
- """ - - def distance_fn(x): - return sklearn.metrics.pairwise.pairwise_distances(x, x[0], metric=distance_metric).ravel() * 100 - - doc_size = indexed_string.num_words() - - if doc_size > 1: - sample = self.random_state.randint( - 1, doc_size, num_samples - 1 - ) # sample: [int(1 ~ doc_size-1) * num_samples-1] - else: - sample = [0 for i in range(num_samples - 1)] - data = np.ones((num_samples, doc_size)) - data[0] = np.ones(doc_size) - features_range = range(doc_size) - perturb_text = [indexed_string.raw_string()] # [文本 * num_samples] - - for i, size in enumerate(sample, start=1): - # inactive: 从range(0, doc_size)中随机取出的size个数组成的list, 要去掉的字的id - inactive = self.random_state.choice( - features_range, size, replace=False # [0, doc_size) # int: 该扰动样本中remove token的数量 - ) - - text = indexed_string.inverse_removing(inactive) # 原文本去掉了inactive中的字后的文本 - - data[i, inactive] = 0 - perturb_text.append(text) - - prev_time = time.time() - # inverse_data: 扰动数据集 [扰动样本 str] * num_samples - labels = [] - token_ids_list, s_ids_list, seq_len_list = [], [], [] - token_ids_max_len = 0 - - valid_idxs = [] - - for idx, text in enumerate(perturb_text): - if self.language == "en": - if if_lstm: - pad_id = [tokenizer.vocab.token_to_idx.get("[PAD]", 0)] - - token_ids = tokenizer.encode(text) - token_ids_max_len = max(token_ids_max_len, len(token_ids)) - seq_len = len(token_ids) - if seq_len == 0: - continue - else: - valid_idxs.append(idx) - seq_len_list.append(seq_len) - pad_id = [tokenizer.vocab.token_to_idx.get("[PAD]", 0)] - - else: - pad_id = tokenizer.convert_tokens_to_ids(["[PAD]"]) - - tokens = tokenizer.tokenize(text) - token_ids = tokenizer.convert_tokens_to_ids(tokens) - token_ids = ( - tokenizer.convert_tokens_to_ids(["[CLS]"]) - + token_ids - + tokenizer.convert_tokens_to_ids(["[SEP]"]) - ) - token_ids_max_len = max(token_ids_max_len, len(token_ids)) - - token_ids_list.append(token_ids) - else: - if len(text) == 0: # TODO - text = perturb_text[0] - tokens = tokenizer.tokenize(text) - token_ids = tokenizer.convert_tokens_to_ids(tokens) - - if if_lstm: - seq_len = len(token_ids) - if seq_len == 0: - continue - else: - valid_idxs.append(idx) - seq_len_list.append(seq_len) - else: - token_ids = ( - tokenizer.convert_tokens_to_ids(["[CLS]"]) - + token_ids - + tokenizer.convert_tokens_to_ids(["[SEP]"]) - ) - - # padding - token_ids = token_ids + tokenizer.convert_tokens_to_ids(["[PAD]"]) * ( - len(perturb_text[0]) + 2 - len(token_ids) - ) - token_ids_list.append(token_ids) - s_ids = [0 for _ in range(len(token_ids))] - s_ids_list.append(s_ids) - - if self.language == "en": - for token_ids in token_ids_list: - while len(token_ids) < token_ids_max_len: - token_ids += pad_id - - s_ids = [0 for _ in range(len(token_ids))] - s_ids_list.append(s_ids) - - token_ids_np = np.array(token_ids_list) - s_ids_np = np.array(s_ids_list) - seq_len_np = np.array(seq_len_list) - - prev_time = time.time() - - batch = 0 - if self.language == "ch": - length = len(perturb_text[0]) - - if if_lstm: - batch = 128 - else: - batch = 64 if length < 130 else 50 - else: - batch = 32 - - epoch_num = math.ceil(len(token_ids_np) / batch) - for idx in range(epoch_num): - token_ids_tensor = paddle.Tensor( - value=token_ids_np[idx * batch : (idx + 1) * batch], place=paddle.CUDAPlace(0), stop_gradient=True - ) - if if_lstm: - seq_len_tensor = paddle.Tensor( - value=seq_len_np[idx * batch : (idx + 1) * batch], - place=token_ids_tensor.place, - stop_gradient=token_ids_tensor.stop_gradient, - ) - label = classifier_fn(token_ids_tensor, 
seq_len_tensor)[0] # label: Tensor[num_samples, 2] - else: - s_ids_tensor = paddle.Tensor( - value=s_ids_np[idx * batch : (idx + 1) * batch], - place=token_ids_tensor.place, - stop_gradient=token_ids_tensor.stop_gradient, - ) - label = classifier_fn(token_ids_tensor, s_ids_tensor)[0] # label: Tensor[num_samples, 2] - - labels.extend(label.numpy().tolist()) - - labels = np.array(labels) # labels: nsp.array(num_samples, 2) - - print("mode forward time: %.5f" % (time.time() - prev_time)) - - distances = distance_fn(sp.sparse.csr_matrix(data)) - - return data, labels, distances diff --git a/examples/model_interpretation/task/senti/pretrained_models/run_train.sh b/examples/model_interpretation/task/senti/pretrained_models/run_train.sh deleted file mode 100755 index b13d03c6486e..000000000000 --- a/examples/model_interpretation/task/senti/pretrained_models/run_train.sh +++ /dev/null @@ -1,30 +0,0 @@ -### - # This script is used to finetune pretrained models -### - -export CUDA_VISIBLE_DEVICES=5 - -LANGUAGE=en -BASE_MODEL=roberta_base # [roberta_base, roberta_large] -timestamp=`date +"%Y%m%d_%H%M%S"` - -if [[ $LANGUAGE == "ch" ]]; then - LEARNING_RATE=2e-5 - MAX_SEQ_LENGTH=128 -elif [[ $LANGUAGE == "en" ]]; then - LEARNING_RATE=5e-6 - MAX_SEQ_LENGTH=512 -fi - -[ -d "logs" ] || mkdir -p "logs" -set -x - -python3 ./train.py \ - --learning_rate ${LEARNING_RATE} \ - --max_seq_length ${MAX_SEQ_LENGTH} \ - --batch_size 32 \ - --epochs 5 \ - --base_model $BASE_MODEL \ - --save_dir saved_model_${LANGUAGE}/${BASE_MODEL}_${timestamp} \ - --language $LANGUAGE >> logs/log_${BASE_MODEL}_${timestamp} - diff --git a/examples/model_interpretation/task/senti/pretrained_models/train.py b/examples/model_interpretation/task/senti/pretrained_models/train.py deleted file mode 100644 index 61dcb01ada08..000000000000 --- a/examples/model_interpretation/task/senti/pretrained_models/train.py +++ /dev/null @@ -1,230 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
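The `__data_labels_distances` hunk above weights each perturbed sample by its cosine distance to the original instance (scaled by 100), and `LimeBase` later converts those distances into regression weights via the exponential kernel defined in `LimeTextExplainer.__init__`. A rough self-contained sketch of those two steps, assuming a hypothetical 4-token document and the default kernel width of 25:

```python
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances

# Hypothetical binary perturbation matrix: row 0 is the original instance
# (every token kept); later rows have randomly chosen tokens switched off.
data = np.array(
    [[1, 1, 1, 1],
     [1, 0, 1, 1],
     [0, 0, 1, 0]],
    dtype=float,
)

# Cosine distance of every row to the original instance, times 100,
# mirroring distance_fn inside __data_labels_distances.
distances = pairwise_distances(data, data[0].reshape(1, -1), metric="cosine").ravel() * 100

# Exponential kernel (width 25, the LimeTextExplainer default) turns the
# distances into sample weights in (0, 1]; closer samples weigh more.
kernel_width = 25
weights = np.sqrt(np.exp(-(distances**2) / kernel_width**2))
print(weights)  # row 0 maps to weight 1.0; distant rows decay toward 0
```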
-""" - This file is used to fine-tune pretrained models -""" -import argparse -import os -import random -import sys -import time -from functools import partial - -import numpy as np -import paddle -import paddle.nn.functional as F - -from paddlenlp.data import Pad, Stack, Tuple -from paddlenlp.datasets import load_dataset -from paddlenlp.transformers import LinearDecayWithWarmup -from paddlenlp.transformers.roberta.tokenizer import ( - RobertaBPETokenizer, - RobertaTokenizer, -) - -sys.path.append("..") -sys.path.append("../../..") -from roberta.modeling import RobertaForSequenceClassification # noqa: E402 - -sys.path.remove("../../..") -sys.path.remove("..") -from utils import convert_example # noqa: E402 - -parser = argparse.ArgumentParser() -parser.add_argument("--base_model", type=str, choices=["roberta_base", "roberta_large"]) -parser.add_argument( - "--save_dir", - default="./checkpoint", - type=str, - help="The output directory where the model checkpoints will be written.", -) -parser.add_argument( - "--max_seq_length", - default=128, - type=int, - help="The maximum total input sequence length after tokenization. " - "Sequences longer than this will be truncated, sequences shorter will be padded.", -) -parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") -parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.") -parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") -parser.add_argument("--epochs", default=3, type=int, help="Total number of training epochs to perform.") -parser.add_argument( - "--warmup_proportion", default=0.0, type=float, help="Linear warmup proportion over the training process." -) -parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") -parser.add_argument("--seed", type=int, default=1000, help="random seed for initialization") -parser.add_argument( - "--device", - choices=["cpu", "gpu", "xpu"], - default="gpu", - help="Select which device to train model, defaults to gpu.", -) -parser.add_argument( - "--language", choices=["ch", "en"], required=True, default=None, help="Language that the model is built for" -) -args = parser.parse_args() - - -def set_seed(seed): - """sets random seed""" - random.seed(seed) - np.random.seed(seed) - paddle.seed(seed) - - -@paddle.no_grad() -def evaluate(model, criterion, metric, data_loader): - """ - Given a dataset, it evals model and computes the metric. - - Args: - model(obj:`paddle.nn.Layer`): A model to classify texts. - data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches. - criterion(obj:`paddle.nn.Layer`): It can compute the loss. - metric(obj:`paddle.metric.Metric`): The evaluation metric. 
- """ - model.eval() - metric.reset() - losses = [] - for batch in data_loader: - input_ids, token_type_ids, labels = batch - logits = model(input_ids, token_type_ids) - loss = criterion(logits, labels) - losses.append(loss.numpy()) - correct = metric.compute(logits, labels) - metric.update(correct) - accu = metric.accumulate() - print("eval loss: %.5f, accu: %.5f" % (np.mean(losses), accu)) - model.train() - metric.reset() - - -def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): - """ - This function created the dataloader which feeds data into model - """ - if trans_fn: - dataset = dataset.map(trans_fn) - - shuffle = True if mode == "train" else False - if mode == "train": - batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) - else: - batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) - - return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) - - -def do_train(): - """ - This function is the main part of the fine-tunning process - """ - paddle.set_device(args.device) - rank = paddle.distributed.get_rank() - if paddle.distributed.get_world_size() > 1: - paddle.distributed.init_parallel_env() - - set_seed(args.seed) - if args.language == "ch": - train_ds, dev_ds = load_dataset("chnsenticorp", splits=["train", "dev"]) - if args.base_model == "roberta_base": - tokenizer = RobertaTokenizer.from_pretrained("roberta-wwm-ext") - model = RobertaForSequenceClassification.from_pretrained("roberta-wwm-ext", num_classes=2) - elif args.base_model == "roberta_large": - tokenizer = RobertaTokenizer.from_pretrained("roberta-wwm-ext-large") - model = RobertaForSequenceClassification.from_pretrained("roberta-wwm-ext-large", num_classes=2) - else: - train_ds, dev_ds = load_dataset("glue", "sst-2", splits=["train", "dev"]) - # for English version, we load models from local machine - if args.base_model == "roberta_base": - tokenizer = RobertaBPETokenizer.from_pretrained("roberta-base") - model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_classes=2) - elif args.base_model == "roberta_large": - tokenizer = RobertaBPETokenizer.from_pretrained("roberta-large") - model = RobertaForSequenceClassification.from_pretrained("roberta-large", num_classes=2) - - trans_func = partial( - convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length, language=args.language - ) - batchify_fn = lambda samples, fn=Tuple( - Pad(axis=0, pad_val=tokenizer.pad_token_id), # input - Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # segment - Stack(dtype="int64"), # label - ): [data for data in fn(samples)] - train_data_loader = create_dataloader( - train_ds, mode="train", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func - ) - dev_data_loader = create_dataloader( - dev_ds, mode="dev", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func - ) - - if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): - state_dict = paddle.load(args.init_from_ckpt) - model.set_dict(state_dict) - model = paddle.DataParallel(model) - - num_training_steps = len(train_data_loader) * args.epochs - - lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) - - # Generate parameter names needed to perform weight decay. - # All bias and LayerNorm parameters are excluded. 
- decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] - optimizer = paddle.optimizer.AdamW( - learning_rate=lr_scheduler, - parameters=model.parameters(), - weight_decay=args.weight_decay, - apply_decay_param_fun=lambda x: x in decay_params, - ) - - criterion = paddle.nn.loss.CrossEntropyLoss() - metric = paddle.metric.Accuracy() - - global_step = 0 - tic_train = time.time() - log_per_step = 100 if args.language == "en" else 10 - for epoch in range(1, args.epochs + 1): - for step, batch in enumerate(train_data_loader, start=1): - input_ids, token_type_ids, labels = batch - logits = model(input_ids=input_ids, token_type_ids=token_type_ids) - loss = criterion(logits, labels) - probs = F.softmax(logits, axis=1) - correct = metric.compute(probs, labels) - metric.update(correct) - acc = metric.accumulate() - - global_step += 1 - if global_step % log_per_step == 0 and rank == 0: - print( - "global step %d, epoch: %d, batch: %d, loss: %.5f, accu: %.5f, speed: %.2f step/s" - % (global_step, epoch, step, loss, acc, log_per_step / (time.time() - tic_train)), - flush=True, - ) - tic_train = time.time() - loss.backward() - optimizer.step() - lr_scheduler.step() - optimizer.clear_grad() - if global_step % (log_per_step * 10) == 0 and rank == 0: - save_dir = os.path.join(args.save_dir, "model_%d" % global_step) - if not os.path.exists(save_dir): - os.makedirs(save_dir) - evaluate(model, criterion, metric, dev_data_loader) - model._layers.save_pretrained(save_dir) - tokenizer.save_pretrained(save_dir) - - -if __name__ == "__main__": - do_train() diff --git a/examples/model_interpretation/task/senti/pretrained_models/utils.py b/examples/model_interpretation/task/senti/pretrained_models/utils.py deleted file mode 100644 index d8c0bad17bd6..000000000000 --- a/examples/model_interpretation/task/senti/pretrained_models/utils.py +++ /dev/null @@ -1,59 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -""" - This file contains some public function -""" -import numpy as np - - -def convert_example(example, tokenizer, max_seq_length=512, is_test=False, language="ch"): - """ - Builds model inputs from a sequence or a pair of sequence for sequence classification tasks - by concatenating and adding special tokens. And creates a mask from the two sequences passed - to be used in a sequence-pair classification task. - - A BERT sequence has the following format: - - - single sequence: ``[CLS] X [SEP]`` - - It returns the first portion of the mask (0's). - - - Args: - example(obj:`list[str]`): List of input data, containing text and label if it have label. - tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` - which contains most of the methods. Users should refer to the superclass for more information regarding methods. - max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. 
- Sequences longer than this will be truncated, sequences shorter will be padded. - is_test(obj:`False`, defaults to `False`): Whether the example contains label or not. - - Returns: - input_ids(obj:`list[int]`): The list of token ids. - token_type_ids(obj: `list[int]`): List of sequence pair mask. - label(obj:`numpy.array`, data type of int64, optional): The input label if not is_test. - """ - if language == "ch": - text = "text" - label = "label" - else: - text = "sentence" - label = "labels" - encoded_inputs = tokenizer(text=example[text], max_seq_len=max_seq_length) - input_ids = encoded_inputs["input_ids"] - token_type_ids = encoded_inputs["token_type_ids"] - - if is_test: - return input_ids, token_type_ids - label = np.array([example[label]], dtype="int64") - return input_ids, token_type_ids, label diff --git a/examples/model_interpretation/task/senti/rnn/lstm_train.sh b/examples/model_interpretation/task/senti/rnn/lstm_train.sh deleted file mode 100755 index fd3b4a4cc2b8..000000000000 --- a/examples/model_interpretation/task/senti/rnn/lstm_train.sh +++ /dev/null @@ -1,20 +0,0 @@ -### - # This script is used to train lstm models -### - -unset CUDA_VISIBLE_DEVICES -LANGUAGE=en - -if [[ $LANGUAGE == 'ch' ]]; then - VOCAB_PATH='./vocab.txt' -else - VOCAB_PATH='vocab.sst2_train' -fi -python -m paddle.distributed.launch --gpus "5" train.py \ - --device=gpu \ - --lr=4e-4 \ - --batch_size=64 \ - --epochs=12 \ - --vocab_path=$VOCAB_PATH \ - --language=$LANGUAGE \ - --save_dir="./checkpoints_"${LANGUAGE} diff --git a/examples/model_interpretation/task/senti/rnn/model.py b/examples/model_interpretation/task/senti/rnn/model.py deleted file mode 100644 index 247a5f65bc5e..000000000000 --- a/examples/model_interpretation/task/senti/rnn/model.py +++ /dev/null @@ -1,265 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
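As a quick orientation, the sketch below shows how the `convert_example` helper above is typically called. It assumes the deleted `utils.py` is importable from the working directory and uses `roberta-wwm-ext` purely as an example tokenizer; the sample dict is illustrative.

```python
from paddlenlp.transformers import RobertaTokenizer

from utils import convert_example  # assumes the utils.py shown above is on the path

tokenizer = RobertaTokenizer.from_pretrained("roberta-wwm-ext")
example = {"text": "这家餐厅的菜品非常好", "label": 1}  # illustrative ChnSentiCorp-style sample

# Returns token ids, segment ids, and the label as an int64 numpy array.
input_ids, token_type_ids, label = convert_example(
    example, tokenizer, max_seq_length=128, language="ch"
)
print(len(input_ids), token_type_ids[:5], label)
```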
- -import numpy as np -import paddle -import paddle.nn as nn -import paddle.nn.functional as F - -INF = 1.0 * 1e12 - - -class LSTMModel(nn.Layer): - def __init__( - self, - vocab_size, - num_classes, - emb_dim=128, - padding_idx=0, - lstm_hidden_size=198, - direction="forward", - lstm_layers=1, - dropout_rate=0.0, - pooling_type=None, - fc_hidden_size=96, - ): - super().__init__() - - self.direction = direction - - self.embedder = nn.Embedding(num_embeddings=vocab_size, embedding_dim=emb_dim, padding_idx=padding_idx) - - # self.lstm_encoder = nlp.seq2vec.LSTMEncoder(emb_dim, - # lstm_hidden_size, - # num_layers=lstm_layers, - # direction=direction, - # dropout=dropout_rate, - # pooling_type=pooling_type) - - self.lstm_layer = nn.LSTM( - input_size=emb_dim, - hidden_size=lstm_hidden_size, - num_layers=lstm_layers, - direction=direction, - dropout=dropout_rate, - ) - - self.fc = nn.Linear(lstm_hidden_size * (2 if direction == "bidirect" else 1), fc_hidden_size) - self.output_layer = nn.Linear(fc_hidden_size, num_classes) - self.softmax = nn.Softmax(axis=1) - - def forward(self, text, seq_len): - # Shape: (batch_size, num_tokens, embedding_dim) - embedded_text = self.embedder(text) - # Shape: (batch_size, num_tokens, num_directions*lstm_hidden_size) - # num_directions = 2 if direction is 'bidirect' - # if not, num_directions = 1 - - # text_repr = self.lstm_encoder(embedded_text, sequence_length=seq_len) - - encoded_text, (last_hidden, last_cell) = self.lstm_layer(embedded_text, sequence_length=seq_len) - if self.direction == "bidirect": - text_repr = paddle.concat((last_hidden[-2, :, :], last_hidden[-1, :, :]), axis=1) - else: - text_repr = last_hidden[-1, :, :] - - fc_out = paddle.tanh(self.fc(text_repr)) # Shape: (batch_size, fc_hidden_size) - logits = self.output_layer(fc_out) # Shape: (batch_size, num_classes) - return logits - - def forward_interpet(self, text, seq_len): - embedded_text = self.embedder(text) # Shape: (batch_size, num_tokens, embedding_dim) - - # text_repr = self.lstm_encoder(embedded_text, sequence_length=seq_len) # Shape: (batch_size, num_tokens, num_directions * hidden) - - # encoded_text: tensor[batch, seq_len, num_directions * hidden] - # last_hidden: tensor[2, batch, hiddens] - encoded_text, (last_hidden, last_cell) = self.lstm_layer(embedded_text, sequence_length=seq_len) - if self.direction == "bidirect": - text_repr = paddle.concat( - (last_hidden[-2, :, :], last_hidden[-1, :, :]), axis=1 - ) # text_repr: tensor[batch, 2 * hidden] 双向 - else: - text_repr = last_hidden[-1, :, :] # text_repr: tensor[1, hidden_size] 单向 - - fc_out = paddle.tanh(self.fc(text_repr)) # Shape: (batch_size, fc_hidden_size) - logits = self.output_layer(fc_out) # Shape: (batch_size, num_classes) - probs = self.softmax(logits) - - return probs, text_repr, embedded_text - - -class BiLSTMAttentionModel(nn.Layer): - def __init__( - self, - attention_layer, - vocab_size, - num_classes, - emb_dim=128, - lstm_hidden_size=196, - fc_hidden_size=96, - lstm_layers=1, - dropout_rate=0.0, - padding_idx=0, - ): - super().__init__() - self.padding_idx = padding_idx - - self.embedder = nn.Embedding(num_embeddings=vocab_size, embedding_dim=emb_dim, padding_idx=padding_idx) - self.bilstm = nn.LSTM( - input_size=emb_dim, - hidden_size=lstm_hidden_size, - num_layers=lstm_layers, - dropout=dropout_rate, - direction="bidirect", - ) - self.attention = attention_layer - if isinstance(attention_layer, SelfAttention): - self.fc = nn.Linear(lstm_hidden_size, fc_hidden_size) - elif isinstance(attention_layer, 
SelfInteractiveAttention): - self.fc = nn.Linear(lstm_hidden_size * 2, fc_hidden_size) - else: - raise RuntimeError("Unknown attention type %s." % attention_layer.__class__.__name__) - self.output_layer = nn.Linear(fc_hidden_size, num_classes) - self.softmax = nn.Softmax(axis=1) - - def forward(self, text, seq_len): - mask = text != self.padding_idx - embedded_text = self.embedder(text) - # Encode text, shape: (batch, max_seq_len, num_directions * hidden_size) - encoded_text, (last_hidden, last_cell) = self.bilstm(embedded_text, sequence_length=seq_len) - # Shape: (batch_size, lstm_hidden_size) - hidden, att_weights = self.attention(encoded_text, mask) # Shape: (batch_size, fc_hidden_size) - fc_out = paddle.tanh(self.fc(hidden)) # Shape: (batch_size, num_classes) - logits = self.output_layer(fc_out) - return logits - - def forward_interpet(self, text, seq_len, noise=None, i=None, n_samples=None): - mask = text != self.padding_idx - - baseline_text = paddle.to_tensor( - [[0] * text.shape[1]], dtype=text.dtype, place=text.place, stop_gradient=text.stop_gradient - ) - - embedded_text = self.embedder(text) - baseline_embedded = self.embedder(baseline_text) - - if noise is not None: - if noise.upper() == "GAUSSIAN": - stdev_spread = 0.15 - stdev = stdev_spread * (embedded_text.max() - embedded_text.min()).numpy() - noise = paddle.to_tensor( - np.random.normal(0, stdev, embedded_text.shape).astype(np.float32), stop_gradient=False - ) - embedded_text = embedded_text + noise - - elif noise.upper() == "INTEGRATED": - embedded_text = baseline_embedded + (i / (n_samples - 1)) * (embedded_text - baseline_embedded) - - else: - raise ValueError("unsupported noise method: %s" % (noise)) - - # Encode text, shape: (batch, max_seq_len, num_directions * hidden_size) - encoded_text, (last_hidden, last_cell) = self.bilstm(embedded_text, sequence_length=seq_len) - # Shape: (batch_size, lstm_hidden_size) - hidden, att_weights = self.attention(encoded_text, mask) # Shape: (batch_size, fc_hidden_size) - fc_out = paddle.tanh(self.fc(hidden)) # Shape: (batch_size, num_classes) - logits = self.output_layer(fc_out) - probs = self.softmax(logits) - return probs, att_weights.squeeze(axis=-1), embedded_text - - -class SelfAttention(nn.Layer): - """ - A close implementation of attention network of ACL 2016 paper, - Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification (Zhou et al., 2016). - ref: https://www.aclweb.org/anthology/P16-2034/ - Args: - hidden_size (int): The number of expected features in the input x. - """ - - def __init__(self, hidden_size): - super().__init__() - self.hidden_size = hidden_size - self.att_weight = self.create_parameter(shape=[1, hidden_size, 1], dtype="float32") - - def forward(self, input, mask=None): - """ - Args: - input (paddle.Tensor) of shape (batch, seq_len, input_size): Tensor containing the features of the input sequence. - mask (paddle.Tensor) of shape (batch, seq_len) : - Tensor is a bool tensor, whose each element identifies whether the input word id is pad token or not. - Defaults to `None`. 
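-
-        Returns:
-            reps (paddle.Tensor) of shape (batch, hidden_size): The attention-pooled
-                sequence representation, passed through a tanh activation.
-            att_weight (paddle.Tensor) of shape (batch, seq_len, 1): The normalized
-                attention weights over the sequence.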
- """ - forward_input, backward_input = paddle.chunk(input, chunks=2, axis=2) - # elementwise-sum forward_x and backward_x - # Shape: (batch_size, max_seq_len, hidden_size) - h = paddle.add_n([forward_input, backward_input]) - # Shape: (batch_size, hidden_size, 1) - att_weight = self.att_weight.tile(repeat_times=(h.shape[0], 1, 1)) - # Shape: (batch_size, max_seq_len, 1) - att_score = paddle.bmm(paddle.tanh(h), att_weight) - if mask is not None: - # mask, remove the effect of 'PAD' - mask = paddle.cast(mask, dtype="float32") - mask = mask.unsqueeze(axis=-1) - inf_tensor = paddle.full(shape=mask.shape, dtype="float32", fill_value=-INF) - att_score = paddle.multiply(att_score, mask) + paddle.multiply(inf_tensor, (1 - mask)) - # Shape: (batch_size, max_seq_len, 1) - att_weight = F.softmax(att_score, axis=1) - # Shape: (batch_size, lstm_hidden_size) - reps = paddle.bmm(h.transpose(perm=(0, 2, 1)), att_weight).squeeze(axis=-1) - reps = paddle.tanh(reps) - return reps, att_weight - - -class SelfInteractiveAttention(nn.Layer): - """ - A close implementation of attention network of NAACL 2016 paper, Hierarchical Attention Networks for Document Classification (Yang et al., 2016). - ref: https://www.cs.cmu.edu/~./hovy/papers/16HLT-hierarchical-attention-networks.pdf - Args: - hidden_size (int): The number of expected features in the input x. - """ - - def __init__(self, hidden_size): - super().__init__() - self.input_weight = self.create_parameter(shape=[1, hidden_size, hidden_size], dtype="float32") - self.bias = self.create_parameter(shape=[1, 1, hidden_size], dtype="float32") - self.att_context_vector = self.create_parameter(shape=[1, hidden_size, 1], dtype="float32") - - def forward(self, input, mask=None): - """ - Args: - input (paddle.Tensor) of shape (batch, seq_len, hidden_size): Tensor containing the features of the input sequence. - mask (paddle.Tensor) of shape (batch, seq_len) : - Tensor is a bool tensor, whose each element identifies whether the input word id is pad token or not. 
-                Defaults to `None`.
-        """
-        weight = self.input_weight.tile(repeat_times=(input.shape[0], 1, 1))  # tensor[batch, hidden_size, hidden_size]
-        bias = self.bias.tile(repeat_times=(input.shape[0], 1, 1))  # tensor[batch, 1, hidden_size]
-        word_squish = paddle.bmm(input, weight) + bias  # Shape: (batch_size, seq_len, hidden_size)
-        att_context_vector = self.att_context_vector.tile(
-            repeat_times=(input.shape[0], 1, 1)
-        )  # Shape: (batch_size, hidden_size, 1)
-        att_score = paddle.bmm(word_squish, att_context_vector)  # tensor[batch_size, seq_len, 1]
-        if mask is not None:
-            # mask, remove the effect of 'PAD'
-            mask = paddle.cast(mask, dtype="float32")
-            mask = mask.unsqueeze(axis=-1)
-            inf_tensor = paddle.full(shape=mask.shape, dtype="float32", fill_value=-INF)
-            att_score = paddle.multiply(att_score, mask) + paddle.multiply(inf_tensor, (1 - mask))
-        att_weight = F.softmax(att_score, axis=1)  # tensor[batch_size, seq_len, 1]
-
-        reps = paddle.bmm(input.transpose(perm=(0, 2, 1)), att_weight).squeeze(-1)  # Shape: (batch_size, hidden_size)
-        return reps, att_weight
diff --git a/examples/model_interpretation/task/senti/rnn/tokenizer_config.json b/examples/model_interpretation/task/senti/rnn/tokenizer_config.json
deleted file mode 100644
index 1b15a3460241..000000000000
--- a/examples/model_interpretation/task/senti/rnn/tokenizer_config.json
+++ /dev/null
@@ -1 +0,0 @@
-{"model":"LSTM"}
\ No newline at end of file
diff --git a/examples/model_interpretation/task/senti/rnn/train.py b/examples/model_interpretation/task/senti/rnn/train.py
deleted file mode 100644
index 570334a5d94e..000000000000
--- a/examples/model_interpretation/task/senti/rnn/train.py
+++ /dev/null
@@ -1,142 +0,0 @@
-# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
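Before the training script continues, here is a minimal sketch of how the two attention variants defined in `model.py` above pair with `BiLSTMAttentionModel`. The vocabulary size and the random batch are placeholders; only the size relationships are taken from the code above.

```python
import paddle
from model import BiLSTMAttentionModel, SelfAttention, SelfInteractiveAttention

lstm_hidden_size = 196
# SelfInteractiveAttention consumes the concatenated bidirectional output
# (2 * hidden); SelfAttention instead sums the two directions, so it would
# take hidden_size=lstm_hidden_size:
# attention = SelfAttention(hidden_size=lstm_hidden_size)
attention = SelfInteractiveAttention(hidden_size=2 * lstm_hidden_size)

model = BiLSTMAttentionModel(
    attention_layer=attention,
    vocab_size=30000,  # placeholder vocabulary size
    lstm_hidden_size=lstm_hidden_size,
    num_classes=2,
)

text = paddle.randint(low=1, high=30000, shape=[4, 32])  # dummy batch of token ids
seq_len = paddle.full(shape=[4], fill_value=32, dtype="int64")
logits = model(text, seq_len)
print(logits.shape)  # [4, 2]
```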
-import argparse
-import os
-import random
-from functools import partial
-
-import numpy as np
-import paddle
-from model import BiLSTMAttentionModel, SelfInteractiveAttention
-from utils import CharTokenizer, convert_example
-
-from paddlenlp.data import Pad, Stack, Tuple, Vocab
-from paddlenlp.datasets import load_dataset
-
-parser = argparse.ArgumentParser(__doc__)
-parser.add_argument("--epochs", type=int, default=10, help="Number of epochs for training.")
-parser.add_argument(
-    "--device",
-    choices=["cpu", "gpu", "xpu"],
-    default="gpu",
-    help="Select which device to train model, defaults to gpu.",
-)
-parser.add_argument("--lr", type=float, default=5e-5, help="Learning rate used to train.")
-parser.add_argument("--save_dir", type=str, default="checkpoints/", help="Directory to save model checkpoint")
-parser.add_argument("--batch_size", type=int, default=64, help="Number of examples in a batch for training.")
-parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.")
-parser.add_argument("--vocab_path", type=str, default=None)
-parser.add_argument("--language", choices=["ch", "en"], default=None, help="Language that the model is built for")
-args = parser.parse_args()
-
-
-def set_seed(seed=1000):
-    """sets random seed"""
-    random.seed(seed)
-    np.random.seed(seed)
-    paddle.seed(seed)
-
-
-def create_dataloader(dataset, trans_fn=None, mode="train", batch_size=1, batchify_fn=None):
-    """
-    Creates the dataloader.
-
-    Args:
-        dataset(obj:`paddle.io.Dataset`): Dataset instance.
-        trans_fn(obj:`callable`, optional, defaults to `None`): function to convert a data sample to input ids, etc.
-        mode(obj:`str`, optional, defaults to obj:`train`): If mode is 'train', it will shuffle the dataset randomly.
-        batch_size(obj:`int`, optional, defaults to 1): The sample number of a mini-batch.
-        batchify_fn(obj:`callable`, optional, defaults to `None`): function to generate mini-batch data by merging
-            the sample list, None for only stack each fields of sample in axis
-            0(same as :attr::`np.stack(..., axis=0)`).
-
-    Returns:
-        dataloader(obj:`paddle.io.DataLoader`): The dataloader which generates batches.
-    """
-    if trans_fn:
-        dataset = dataset.map(trans_fn)
-
-    shuffle = True if mode == "train" else False
-    if mode == "train":
-        sampler = paddle.io.DistributedBatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle)
-    else:
-        sampler = paddle.io.BatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle)
-    dataloader = paddle.io.DataLoader(dataset, batch_sampler=sampler, collate_fn=batchify_fn)
-    return dataloader
-
-
-if __name__ == "__main__":
-    paddle.set_device(args.device)
-    set_seed()
-
-    if args.language == "ch":
-        train_ds, dev_ds = load_dataset("chnsenticorp", splits=["train", "dev"])
-    else:
-        train_ds, dev_ds = load_dataset("glue", "sst-2", splits=["train", "dev"])
-
-    # Loads vocab.
-    if not os.path.exists(args.vocab_path):
-        raise RuntimeError("The vocab_path can not be found in the path %s" % args.vocab_path)
-    vocab = Vocab.load_vocabulary(args.vocab_path, unk_token="[UNK]", pad_token="[PAD]")
-
-    tokenizer = CharTokenizer(vocab, args.language, "../../../punctuations")
-
-    # Constructs the network.
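-    # `pad_token_id` is handed to the embedding layer as padding_idx, while
-    # `pad_value` is what batchify_fn uses below to pad input_ids to a uniform
-    # length within each mini-batch.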
-    vocab_size = len(vocab)
-    num_classes = len(train_ds.label_list)
-    pad_token_id = 0
-    pad_value = vocab.token_to_idx.get("[PAD]", 0)
-
-    lstm_hidden_size = 196
-    attention = SelfInteractiveAttention(hidden_size=2 * lstm_hidden_size)
-    model = BiLSTMAttentionModel(
-        attention_layer=attention,
-        vocab_size=vocab_size,
-        lstm_hidden_size=lstm_hidden_size,
-        num_classes=num_classes,
-        padding_idx=pad_token_id,
-    )
-
-    model = paddle.Model(model)
-
-    # Reads data and generates mini-batches.
-    trans_fn = partial(convert_example, tokenizer=tokenizer, is_test=False, language=args.language)
-
-    batchify_fn = lambda samples, fn=Tuple(
-        Pad(axis=0, pad_val=pad_value),  # input_ids
-        Stack(dtype="int64"),  # seq_len
-        Stack(dtype="int64"),  # label
-    ): [data for data in fn(samples)]
-
-    train_loader = create_dataloader(
-        train_ds, trans_fn=trans_fn, batch_size=args.batch_size, mode="train", batchify_fn=batchify_fn
-    )
-    dev_loader = create_dataloader(
-        dev_ds, trans_fn=trans_fn, batch_size=args.batch_size, mode="validation", batchify_fn=batchify_fn
-    )
-
-    optimizer = paddle.optimizer.Adam(parameters=model.parameters(), learning_rate=args.lr)
-
-    # Defines loss and metric.
-    criterion = paddle.nn.CrossEntropyLoss()
-    metric = paddle.metric.Accuracy()
-
-    model.prepare(optimizer, criterion, metric)
-
-    # Loads pre-trained parameters.
-    if args.init_from_ckpt:
-        model.load(args.init_from_ckpt)
-        print("Loaded checkpoint from %s" % args.init_from_ckpt)
-
-    # Starts training and evaluating.
-    callback = paddle.callbacks.ProgBarLogger(log_freq=10, verbose=3)
-    model.fit(train_loader, dev_loader, epochs=args.epochs, save_dir=args.save_dir, callbacks=callback)
diff --git a/examples/model_interpretation/task/senti/rnn/utils.py b/examples/model_interpretation/task/senti/rnn/utils.py
deleted file mode 100644
index 4d574423d48a..000000000000
--- a/examples/model_interpretation/task/senti/rnn/utils.py
+++ /dev/null
@@ -1,166 +0,0 @@
-# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-import numpy as np
-
-
-def convert_example(example, tokenizer, is_test=False, language="en"):
-    """
-    Builds model inputs from a sequence for sequence classification tasks.
-    It uses `jieba.cut` to tokenize text.
-
-    Args:
-        example(obj:`list[str]`): List of input data, containing text and label if it has a label.
-        tokenizer(obj: paddlenlp.data.JiebaTokenizer): It uses jieba to cut the Chinese string.
-        is_test(obj:`False`, defaults to `False`): Whether the example contains a label or not.
-
-    Returns:
-        input_ids(obj:`list[int]`): The list of token ids.
-        valid_length(obj:`int`): The input sequence valid length.
-        label(obj:`numpy.array`, data type of int64, optional): The input label if not is_test.
- """ - if is_test: - input_ids = tokenizer.encode(example["context"]) - valid_length = np.array(len(input_ids), dtype="int64") - input_ids = np.array(input_ids, dtype="int64") - return input_ids, valid_length - else: - if language == "en": - input_ids = tokenizer.encode(example["sentence"]) - label = np.array(example["labels"], dtype="int64") - else: - input_ids = tokenizer.encode(example["text"]) - label = np.array(example["label"], dtype="int64") - valid_length = np.array(len(input_ids), dtype="int64") - input_ids = np.array(input_ids, dtype="int64") - return input_ids, valid_length, label - - -def preprocess_prediction_data(data, tokenizer): - """ - It process the prediction data as the format used as training. - - Args: - data (obj:`List[str]`): The prediction data whose each element is a tokenized text. - tokenizer(obj: paddlenlp.data.JiebaTokenizer): It use jieba to cut the chinese string. - - Returns: - examples (obj:`List(Example)`): The processed data whose each element is a Example (numedtuple) object. - A Example object contains `text`(word_ids) and `seq_len`(sequence length). - - """ - examples = [] - for text in data: - # ids = tokenizer.encode(text) # JiebaTokenizer - ids = tokenizer.encode(text)[0].tolist()[1:-1] # ErnieTokenizer list[ids] - examples.append([ids, len(ids)]) - - return examples - - -def get_idx_from_word(word, word_to_idx, unk_word): - if word in word_to_idx: - return word_to_idx[word] - return word_to_idx[unk_word] - - -class CharTokenizer: - def __init__(self, vocab, language, vocab_path): - self.tokenizer = list - self.vocab = vocab - self.language = language - self.vocab_path = vocab_path - self.unk_token = [] - - def encode(self, sentence): - if self.language == "ch": - words = tokenizer_punc(sentence, self.vocab_path) - else: - words = sentence.strip().split() - return [get_idx_from_word(word, self.vocab.token_to_idx, self.vocab.unk_token) for word in words] - - def tokenize(self, sentence, wo_unk=True): - if self.language == "ch": - return tokenizer_punc(sentence, self.vocab_path) - else: - return sentence.strip().split() - - def convert_tokens_to_string(self, tokens): - return " ".join(tokens) - - def convert_tokens_to_ids(self, tokens): - return [get_idx_from_word(word, self.vocab.token_to_idx, self.vocab.unk_token) for word in tokens] - - -def tokenizer_lac(string, lac): - temp = "" - res = [] - for c in string: - if "\u4e00" <= c <= "\u9fff": - if temp != "": - res.extend(lac.run(temp)) - temp = "" - res.append(c) - else: - temp += c - if temp != "": - res.extend(lac.run(temp)) - return res - - -def tokenizer_punc(string, vocab_path): - res = [] - sub_string_list = string.strip().split("[MASK]") - for idx, sub_string in enumerate(sub_string_list): - temp = "" - for c in sub_string: - if "\u4e00" <= c <= "\u9fff": - if temp != "": - temp_seg = punc_split(temp, vocab_path) - res.extend(temp_seg) - temp = "" - res.append(c) - else: - temp += c - if temp != "": - temp_seg = punc_split(temp, vocab_path) - res.extend(temp_seg) - if idx < len(sub_string_list) - 1: - res.append("[MASK]") - return res - - -def punc_split(string, vocab_path): - punc_set = set() - with open(vocab_path, "r") as f: - for token in f: - punc_set.add(token.strip()) - punc_set.add(" ") - for ascii_num in range(65296, 65306): - punc_set.add(chr(ascii_num)) - for ascii_num in range(48, 58): - punc_set.add(chr(ascii_num)) - - res = [] - temp = "" - for c in string: - if c in punc_set: - if temp != "": - res.append(temp) - temp = "" - res.append(c) - else: - temp += c - if temp != "": 
- res.append(temp) - return res diff --git a/examples/model_interpretation/task/senti/roberta/modeling.py b/examples/model_interpretation/task/senti/roberta/modeling.py deleted file mode 100644 index 02f2bec87d85..000000000000 --- a/examples/model_interpretation/task/senti/roberta/modeling.py +++ /dev/null @@ -1,608 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import sys - -import paddle -import paddle.nn as nn - -from paddlenlp.transformers.model_utils import PretrainedModel, register_base_model - -sys.path.append("../..") -from task.transformer import TransformerEncoder, TransformerEncoderLayer # noqa: E402 - -sys.path.remove("../..") - -__all__ = [ - "RobertaModel", - "RobertaPretrainedModel", - "RobertaForSequenceClassification", - "RobertaForTokenClassification", - "RobertaForQuestionAnswering", -] - - -class RobertaEmbeddings(nn.Layer): - r""" - Include embeddings from word, position and token_type embeddings. - """ - - def __init__( - self, - vocab_size, - hidden_size=768, - hidden_dropout_prob=0.1, - max_position_embeddings=512, - type_vocab_size=16, - pad_token_id=0, - ): - super(RobertaEmbeddings, self).__init__() - self.word_embeddings = nn.Embedding(vocab_size, hidden_size, padding_idx=pad_token_id) - self.position_embeddings = nn.Embedding(max_position_embeddings, hidden_size) - self.token_type_embeddings = nn.Embedding(type_vocab_size, hidden_size) - self.layer_norm = nn.LayerNorm(hidden_size) - self.dropout = nn.Dropout(hidden_dropout_prob) - - def forward(self, input_ids, token_type_ids=None, position_ids=None): - if position_ids is None: - # maybe need use shape op to unify static graph and dynamic graph - ones = paddle.ones_like(input_ids, dtype="int64") - seq_length = paddle.cumsum(ones, axis=-1) - position_ids = seq_length - ones - position_ids.stop_gradient = True - if token_type_ids is None: - token_type_ids = paddle.zeros_like(input_ids, dtype="int64") - - input_embedings = self.word_embeddings(input_ids) - position_embeddings = self.position_embeddings(position_ids) - token_type_embeddings = self.token_type_embeddings(token_type_ids) - - embeddings = input_embedings + position_embeddings + token_type_embeddings - embeddings = self.layer_norm(embeddings) - embeddings = self.dropout(embeddings) - return embeddings - - -class RobertaPooler(nn.Layer): - def __init__(self, hidden_size): - super(RobertaPooler, self).__init__() - self.dense = nn.Linear(hidden_size, hidden_size) - self.activation = nn.Tanh() - - def forward(self, hidden_states): - # We "pool" the model by simply taking the hidden state corresponding - # to the first token. - first_token_tensor = hidden_states[:, 0] - pooled_output = self.dense(first_token_tensor) - pooled_output = self.activation(pooled_output) - return pooled_output - - -class RobertaPretrainedModel(PretrainedModel): - r""" - An abstract class for pretrained RoBerta models. 
It provides RoBerta related - `model_config_file`, `pretrained_resource_files_map`, `resource_files_names`, - `pretrained_init_configuration`, `base_model_prefix` for downloading and - loading pretrained models. - Refer to :class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more details. - - """ - - model_config_file = "model_config.json" - pretrained_init_configuration = { - "roberta-wwm-ext": { - "attention_probs_dropout_prob": 0.1, - "hidden_act": "gelu", - "hidden_dropout_prob": 0.1, - "hidden_size": 768, - "initializer_range": 0.02, - "intermediate_size": 3072, - "max_position_embeddings": 512, - "num_attention_heads": 12, - "num_hidden_layers": 12, - "type_vocab_size": 2, - "vocab_size": 21128, - "pad_token_id": 0, - }, - "roberta-wwm-ext-large": { - "attention_probs_dropout_prob": 0.1, - "hidden_act": "gelu", - "hidden_dropout_prob": 0.1, - "hidden_size": 1024, - "initializer_range": 0.02, - "intermediate_size": 4096, - "max_position_embeddings": 512, - "num_attention_heads": 16, - "num_hidden_layers": 24, - "type_vocab_size": 2, - "vocab_size": 21128, - "pad_token_id": 0, - }, - "rbt3": { - "attention_probs_dropout_prob": 0.1, - "hidden_act": "gelu", - "hidden_dropout_prob": 0.1, - "hidden_size": 768, - "initializer_range": 0.02, - "intermediate_size": 3072, - "max_position_embeddings": 512, - "num_attention_heads": 12, - "num_hidden_layers": 3, - "type_vocab_size": 2, - "vocab_size": 21128, - "pad_token_id": 0, - }, - "rbtl3": { - "attention_probs_dropout_prob": 0.1, - "hidden_act": "gelu", - "hidden_dropout_prob": 0.1, - "hidden_size": 1024, - "initializer_range": 0.02, - "intermediate_size": 4096, - "max_position_embeddings": 512, - "num_attention_heads": 16, - "num_hidden_layers": 3, - "type_vocab_size": 2, - "vocab_size": 21128, - "pad_token_id": 0, - }, - } - resource_files_names = {"model_state": "model_state.pdparams"} - pretrained_resource_files_map = { - "model_state": { - "roberta-wwm-ext": "https://paddlenlp.bj.bcebos.com/models/transformers/roberta_base/roberta_chn_base.pdparams", - "roberta-wwm-ext-large": "https://paddlenlp.bj.bcebos.com/models/transformers/roberta_large/roberta_chn_large.pdparams", - "rbt3": "https://paddlenlp.bj.bcebos.com/models/transformers/rbt3/rbt3_chn_large.pdparams", - "rbtl3": "https://paddlenlp.bj.bcebos.com/models/transformers/rbtl3/rbtl3_chn_large.pdparams", - } - } - base_model_prefix = "roberta" - - def _init_weights(self, layer): - """Initialization hook""" - if isinstance(layer, (nn.Linear, nn.Embedding)): - # only support dygraph, use truncated_normal and make it inplace - # and configurable later - layer.weight.set_value( - paddle.tensor.normal( - mean=0.0, - std=self.initializer_range - if hasattr(self, "initializer_range") - else self.roberta.config["initializer_range"], - shape=layer.weight.shape, - ) - ) - elif isinstance(layer, nn.LayerNorm): - layer._epsilon = 1e-12 - - -@register_base_model -class RobertaModel(RobertaPretrainedModel): - r""" - The bare Roberta Model outputting raw hidden-states. - - This model inherits from :class:`~paddlenlp.transformers.model_utils.PretrainedModel`. - Refer to the superclass documentation for the generic methods. - - This model is also a Paddle `paddle.nn.Layer `__ subclass. Use it as a regular Paddle Layer - and refer to the Paddle documentation for all matter related to general usage and behavior. - - Args: - vocab_size (int): - Vocabulary size of `inputs_ids` in `RobertaModel`. Also is the vocab size of token embedding matrix. 
- Defines the number of different tokens that can be represented by the `inputs_ids` passed when calling `RobertaModel`. - hidden_size (int, optional): - Dimensionality of the embedding layer, encoder layers and pooler layer. Defaults to `768`. - num_hidden_layers (int, optional): - Number of hidden layers in the Transformer encoder. Defaults to `12`. - num_attention_heads (int, optional): - Number of attention heads for each attention layer in the Transformer encoder. - Defaults to `12`. - intermediate_size (int, optional): - Dimensionality of the feed-forward (ff) layer in the encoder. Input tensors - to ff layers are firstly projected from `hidden_size` to `intermediate_size`, - and then projected back to `hidden_size`. Typically `intermediate_size` is larger than `hidden_size`. - Defaults to `3072`. - hidden_act (str, optional): - The non-linear activation function in the feed-forward layer. - ``"gelu"``, ``"relu"`` and any other paddle supported activation functions - are supported. Defaults to ``"gelu"``. - hidden_dropout_prob (float, optional): - The dropout probability for all fully connected layers in the embeddings and encoder. - Defaults to `0.1`. - attention_probs_dropout_prob (float, optional): - The dropout probability used in MultiHeadAttention in all encoder layers to drop some attention target. - Defaults to `0.1`. - max_position_embeddings (int, optional): - The maximum value of the dimensionality of position encoding, which dictates the maximum supported length of an input - sequence. Defaults to `512`. - type_vocab_size (int, optional): - The vocabulary size of the `token_type_ids` passed when calling `~transformers.RobertaModel`. - Defaults to `2`. - initializer_range (float, optional): - The standard deviation of the normal initializer. Defaults to 0.02. - - .. note:: - A normal_initializer initializes weight matrices as normal distributions. - See :meth:`RobertaPretrainedModel._init_weights()` for how weights are initialized in `RobertaModel`. - - pad_token_id(int, optional): - The index of padding token in the token vocabulary. - Defaults to `0`. - """ - - def __init__( - self, - vocab_size, - hidden_size=768, - num_hidden_layers=12, - num_attention_heads=12, - intermediate_size=3072, - hidden_act="gelu", - hidden_dropout_prob=0.1, - attention_probs_dropout_prob=0.1, - max_position_embeddings=512, - type_vocab_size=16, - initializer_range=0.02, - layer_norm_eps=1e-12, - pad_token_id=0, - ): - super(RobertaModel, self).__init__() - self.pad_token_id = pad_token_id - self.initializer_range = initializer_range - self.embeddings = RobertaEmbeddings( - vocab_size, hidden_size, hidden_dropout_prob, max_position_embeddings, type_vocab_size, pad_token_id - ) - encoder_layer = TransformerEncoderLayer( - hidden_size, - num_attention_heads, - intermediate_size, - dropout=hidden_dropout_prob, - activation=hidden_act, - attn_dropout=attention_probs_dropout_prob, - act_dropout=0, - ) - self.encoder = TransformerEncoder(encoder_layer, num_hidden_layers) - self.pooler = RobertaPooler(hidden_size) - - def forward( - self, - input_ids, - token_type_ids=None, - position_ids=None, - attention_mask=None, - noise=None, - i=None, - n_samples=None, - ): - r""" - Args: - input_ids (Tensor): - Indices of input sequence tokens in the vocabulary. They are - numerical representations of tokens that build the input sequence. - It's data type should be `int64` and has a shape of [batch_size, sequence_length]. 
- token_type_ids (Tensor, optional): - Segment token indices to indicate first and second portions of the inputs. - Indices can be either 0 or 1: - - - 0 corresponds to a **sentence A** token, - - 1 corresponds to a **sentence B** token. - - It's data type should be `int64` and has a shape of [batch_size, sequence_length]. - Defaults to None, which means no segment embeddings is added to token embeddings. - position_ids (Tensor, optional): - Indices of positions of each input sequence tokens in the position embeddings. - Selected in the range ``[0, max_position_embeddings - 1]``. - It's data type should be `int64` and has a shape of [batch_size, sequence_length]. - Defaults to `None`. - attention_mask (Tensor, optional): - Mask used in multi-head attention to avoid performing attention to some unwanted positions, - usually the paddings or the subsequent positions. - Its data type can be int, float and bool. - When the data type is bool, the `masked` tokens have `False` values and the others have `True` values. - When the data type is int, the `masked` tokens have `0` values and the others have `1` values. - When the data type is float, the `masked` tokens have `-INF` values and the others have `0` values. - It is a tensor with shape broadcasted to `[batch_size, num_attention_heads, sequence_length, sequence_length]`. - For example, its shape can be [batch_size, sequence_length], [batch_size, sequence_length, sequence_length], - [batch_size, num_attention_heads, sequence_length, sequence_length]. - Defaults to `None`, which means nothing needed to be prevented attention to. - - Returns: - tuple: Returns tuple (`sequence_output`, `pooled_output`). - - With the fields: - - - sequence_output (Tensor): - Sequence of hidden-states at the last layer of the model. - It's data type should be float32 and its shape is [batch_size, sequence_length, hidden_size]. - - - pooled_output (Tensor): - The output of first token (`[CLS]`) in sequence. - We "pool" the model by simply taking the hidden state corresponding to the first token. - Its data type should be float32 and its shape is [batch_size, hidden_size]. - - Example: - .. 
code-block:: - - import paddle - from paddlenlp.transformers import RobertaModel, RobertaTokenizer - - tokenizer = RobertaTokenizer.from_pretrained('roberta-wwm-ext') - model = RobertaModel.from_pretrained('roberta-wwm-ext') - - inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!") - inputs = {k:paddle.to_tensor([v]) for (k, v) in inputs.items()} - sequence_output, pooled_output = model(**inputs) - - """ - if attention_mask is None: - attention_mask = paddle.unsqueeze( - (input_ids == self.pad_token_id).astype(self.pooler.dense.weight.dtype) * -1e9, axis=[1, 2] - ) - # CLS: 101; SEP: 102; PAD: 0 - baseline_ids = paddle.to_tensor( - [101] + [0] * (input_ids.shape[1] - 2) + [102], - dtype=input_ids.dtype, - place=input_ids.place, - stop_gradient=input_ids.stop_gradient, - ) - - embedding_output = self.embeddings( - input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids - ) - baseline_embedding_output = self.embeddings( - input_ids=baseline_ids, position_ids=position_ids, token_type_ids=token_type_ids - ) - - if noise is not None: - if noise.upper() == "GAUSSIAN": - pass - # stdev_spread = 0.15 - # stdev = stdev_spread * (orig_embedded.max() - orig_embedded.min()).numpy() - # noise = paddle.to_tensor(np.random.normal(0, stdev, orig_embedded.shape).astype(np.float32), - # stop_gradient=False) - # orig_embedded = orig_embedded + noise - if noise.upper() == "INTEGRATED": - embedding_output = baseline_embedding_output + i / (n_samples - 1) * ( - embedding_output - baseline_embedding_output - ) - else: - raise ValueError("unsupported noise method: %s" % (noise)) - - # encoder_outputs = self.encoder(embedding_output, attention_mask) - encoder_outputs, att_weights_list = self.encoder(embedding_output, attention_mask) # interpret - sequence_output = encoder_outputs - pooled_output = self.pooler(sequence_output) - return sequence_output, pooled_output, att_weights_list, embedding_output - - -class RobertaForQuestionAnswering(RobertaPretrainedModel): - r""" - Roberta Model with a linear layer on top of the hidden-states output to - compute `span_start_logits` and `span_end_logits`, designed for question-answering tasks like SQuAD. - - Args: - roberta (:class:`RobertaModel`): - An instance of RobertaModel. - dropout (float, optional): - The dropout probability for output of Roberta. - If None, use the same value as `hidden_dropout_prob` of `RobertaModel` - instance `roberta`. Defaults to `None`. - """ - - def __init__(self, roberta, dropout=None): - super(RobertaForQuestionAnswering, self).__init__() - self.roberta = roberta # allow roberta to be config - self.classifier = nn.Linear(self.roberta.config["hidden_size"], 2) - - def forward(self, input_ids, token_type_ids=None): - r""" - Args: - input_ids (Tensor): - See :class:`RobertaModel`. - token_type_ids (Tensor, optional): - See :class:`RobertaModel`. - position_ids (Tensor, optional): - See :class:`RobertaModel`. - attention_mask (Tensor, optional): - See :class:`RobertaModel`. - - Returns: - tuple: Returns tuple (`start_logits`, `end_logits`). - - With the fields: - - - `start_logits` (Tensor): - A tensor of the input token classification logits, indicates the start position of the labelled span. - Its data type should be float32 and its shape is [batch_size, sequence_length]. - - - `end_logits` (Tensor): - A tensor of the input token classification logits, indicates the end position of the labelled span. - Its data type should be float32 and its shape is [batch_size, sequence_length]. - - Example: - .. 
code-block:: - - import paddle - from paddlenlp.transformers import RobertaForSequenceClassification, RobertaTokenizer - - tokenizer = RobertaTokenizer.from_pretrained('roberta-wwm-ext') - model = RobertaForSequenceClassification.from_pretrained('roberta-wwm-ext') - - inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!") - inputs = {k:paddle.to_tensor([v]) for (k, v) in inputs.items()} - logits = model(**inputs) - - """ - sequence_output, _ = self.roberta( - input_ids, token_type_ids=token_type_ids, position_ids=None, attention_mask=None - ) - - logits = self.classifier(sequence_output) - logits = paddle.transpose(logits, perm=[2, 0, 1]) - start_logits, end_logits = paddle.unstack(x=logits, axis=0) - - return start_logits, end_logits - - -class RobertaForSequenceClassification(RobertaPretrainedModel): - r""" - Roberta Model with a linear layer on top of the output layer, - designed for sequence classification/regression tasks like GLUE tasks. - - Args: - roberta (:class:`RobertaModel`): - An instance of `RobertaModel`. - num_classes (int, optional): - The number of classes. Defaults to `2`. - dropout (float, optional): - The dropout probability for output of Roberta. - If None, use the same value as `hidden_dropout_prob` - of `RobertaModel` instance `roberta`. Defaults to `None`. - """ - - def __init__(self, roberta, num_classes=2, dropout=None): - super(RobertaForSequenceClassification, self).__init__() - self.num_classes = num_classes - self.roberta = roberta # allow roberta to be config - self.dropout = nn.Dropout(dropout if dropout is not None else self.roberta.config["hidden_dropout_prob"]) - self.classifier = nn.Linear(self.roberta.config["hidden_size"], num_classes) - self.softmax = nn.Softmax() - - def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): - r""" - Args: - input_ids (Tensor): - See :class:`RobertaModel`. - token_type_ids (Tensor, optional): - See :class:`RobertaModel`. - position_ids (Tensor, optional): - See :class:`RobertaModel`. - attention_mask (Tensor, optional): - See :class:`RobertaModel`. - - Returns: - Tensor: Returns tensor `logits`, a tensor of the input text classification logits. - Its data type should be float32 and it has a shape of [batch_size, num_classes]. - - Example: - .. 
code-block:: - - import paddle - from paddlenlp.transformers import RobertaForSequenceClassification, RobertaTokenizer - - tokenizer = RobertaTokenizer.from_pretrained('roberta-wwm-ext') - model = RobertaForSequenceClassification.from_pretrained('roberta-wwm-ext') - - inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!") - inputs = {k:paddle.to_tensor([v]) for (k, v) in inputs.items()} - logits = model(**inputs) - - """ - _, pooled_output, _, _ = self.roberta( - input_ids, token_type_ids=token_type_ids, position_ids=position_ids, attention_mask=attention_mask - ) - - pooled_output = self.dropout(pooled_output) - logits = self.classifier(pooled_output) - return logits - - def forward_interpet( - self, - input_ids, - token_type_ids=None, - position_ids=None, - attention_mask=None, - noise=None, - i=None, - n_samples=None, - ): - _, pooled_output, att_weights_list, embedding_output = self.roberta( - input_ids, - token_type_ids=token_type_ids, - position_ids=position_ids, - attention_mask=attention_mask, - noise=noise, - i=i, - n_samples=n_samples, - ) - - pooled_output = self.dropout(pooled_output) - logits = self.classifier(pooled_output) - probs = self.softmax(logits) - - return probs, att_weights_list, embedding_output - - -class RobertaForTokenClassification(RobertaPretrainedModel): - r""" - Roberta Model with a linear layer on top of the hidden-states output layer, - designed for token classification tasks like NER tasks. - - Args: - roberta (:class:`RobertaModel`): - An instance of `RobertaModel`. - num_classes (int, optional): - The number of classes. Defaults to `2`. - dropout (float, optional): - The dropout probability for output of Roberta. - If None, use the same value as `hidden_dropout_prob` - of `RobertaModel` instance `roberta`. Defaults to `None`. - """ - - def __init__(self, roberta, num_classes=2, dropout=None): - super(RobertaForTokenClassification, self).__init__() - self.num_classes = num_classes - self.roberta = roberta # allow roberta to be config - self.dropout = nn.Dropout(dropout if dropout is not None else self.roberta.config["hidden_dropout_prob"]) - self.classifier = nn.Linear(self.roberta.config["hidden_size"], num_classes) - - def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): - r""" - Args: - input_ids (Tensor): - See :class:`RobertaModel`. - token_type_ids (Tensor, optional): - See :class:`RobertaModel`. - position_ids (Tensor, optional): - See :class:`RobertaModel`. - attention_mask (Tensor, optional): - See :class:`RobertaModel`. - - Returns: - Tensor: Returns tensor `logits`, a tensor of the input token classification logits. - Shape as `[batch_size, sequence_length, num_classes]` and dtype as `float32`. - - Example: - .. 
code-block:: - - import paddle - from paddlenlp.transformers import RobertaForTokenClassification, RobertaTokenizer - - tokenizer = RobertaTokenizer.from_pretrained('roberta-wwm-ext') - model = RobertaForTokenClassification.from_pretrained('roberta-wwm-ext') - - inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!") - inputs = {k:paddle.to_tensor([v]) for (k, v) in inputs.items()} - logits = model(**inputs) - - """ - sequence_output, _ = self.roberta( - input_ids, token_type_ids=token_type_ids, position_ids=position_ids, attention_mask=attention_mask - ) - - sequence_output = self.dropout(sequence_output) - logits = self.classifier(sequence_output) - return logits diff --git a/examples/model_interpretation/task/senti/run_inter.sh b/examples/model_interpretation/task/senti/run_inter.sh deleted file mode 100755 index c7b71e78d212..000000000000 --- a/examples/model_interpretation/task/senti/run_inter.sh +++ /dev/null @@ -1,65 +0,0 @@ -### - # This file contains script to generate saliency map of a specific baseline model and language on given input data - # The result of this script will be used to evaluate the interpretive performance of the baseline model -### - -export CUDA_VISIBLE_DEVICES=4 -export PYTHONPATH=./:$PYTHONPATH - -LANGUAGE=en # LANGUAGE choose in [ch, en] -BASE_MODEL=roberta_base # BASE_MODEL choose in [roberta_base, roberta_large, lstm] -INTER_MODE=attention # INTER_MODE choice in [attention, integrated_gradient, lime] -TASK=senti_${LANGUAGE} -DATA=../../data/${TASK} -START_ID=0 -FROM_PRETRAIN='test' -VOCAB_PATH='test' - -if [[ $LANGUAGE == "en" ]]; then - - if [[ $BASE_MODEL == "roberta_base" ]]; then - FROM_PRETRAIN='roberta-base' - CKPT=pretrained_models/saved_model_en/roberta_base_20211105_135732/model_10000/model_state.pdparams - #CKPT=pretrained_models/saved_model_en/roberta_base_20211206_164443/model_10000/model_state.pdparams - elif [[ $BASE_MODEL == "roberta_large" ]]; then - FROM_PRETRAIN='roberta-large' - CKPT=pretrained_models/saved_model_en/roberta_large_20211105_160323/model_4000/model_state.pdparams - #CKPT=pretrained_models/saved_model_en/roberta_large_20211207_174631/model_4000/model_state.pdparams - elif [[ $BASE_MODEL == "lstm" ]]; then - VOCAB_PATH='rnn/vocab.sst2_train' - CKPT=rnn/checkpoints_en/final.pdparams - fi - -elif [[ $LANGUAGE == "ch" ]]; then - - if [[ $BASE_MODEL == "roberta_base" ]]; then - FROM_PRETRAIN='roberta-wwm-ext' - CKPT=pretrained_models/saved_model_ch/roberta_base/model_900/model_state.pdparams - #CKPT=pretrained_models/saved_model_ch/roberta_base_20211229_101252/model_900/model_state.pdparams - elif [[ $BASE_MODEL == "roberta_large" ]]; then - FROM_PRETRAIN='roberta-wwm-ext-large' - CKPT=pretrained_models/saved_model_ch/roberta_large_20211014_192021/model_900/model_state.pdparams - #CKPT=pretrained_models/saved_model_ch/roberta_large_20211229_105019/model_900/model_state.pdparams - elif [[ $BASE_MODEL == "lstm" ]]; then - VOCAB_PATH='rnn/vocab.txt' - CKPT=rnn/checkpoints_ch/final.pdparams - fi -fi - -OUTPUT=./output/${TASK}.${BASE_MODEL} -[ -d $OUTPUT ] || mkdir -p $OUTPUT -set -x - -python3 ./saliency_map/sentiment_interpretable.py \ - --language $LANGUAGE \ - --base_model $BASE_MODEL \ - --data_dir $DATA \ - --vocab_path $VOCAB_PATH \ - --from_pretrained $FROM_PRETRAIN \ - --batch_size 1 \ - --init_checkpoint $CKPT \ - --inter_mode $INTER_MODE\ - --output_dir $OUTPUT \ - --n-samples 200 \ - --start_id $START_ID \ - --eval $@ diff --git a/examples/model_interpretation/task/senti/run_inter_all.sh 
b/examples/model_interpretation/task/senti/run_inter_all.sh deleted file mode 100755 index 8b0a1d98bf01..000000000000 --- a/examples/model_interpretation/task/senti/run_inter_all.sh +++ /dev/null @@ -1,75 +0,0 @@ -### - # This file contains script to generate saliency map of all baseline models and languages on given input data - # The result of this script will be used to evaluate the interpretive performance of the baseline model -### - -export CUDA_VISIBLE_DEVICES=1 -export PYTHONPATH=./:$PYTHONPATH -START_ID=0 -FROM_PRETRAIN='test' -VOCAB_PATH='test' - -for BASE_MODEL in "lstm" "roberta_base" "roberta_large"; -do - for INTER_MODE in "attention" "integrated_gradient" "lime"; - do - for LANGUAGE in "ch" "en"; - do - TASK=senti_${LANGUAGE} - DATA=../../data/${TASK} - - if [[ $LANGUAGE == "en" ]]; then - - if [[ $BASE_MODEL == "roberta_base" ]]; then - FROM_PRETRAIN='roberta-base' - CKPT=pretrained_models/saved_model_en/roberta_base_20211105_135732/model_10000/model_state.pdparams - #CKPT=pretrained_models/saved_model_en/roberta_base_20211206_164443/model_10000/model_state.pdparams - elif [[ $BASE_MODEL == "roberta_large" ]]; then - FROM_PRETRAIN='roberta-large' - CKPT=pretrained_models/saved_model_en/roberta_large_20211105_160323/model_4000/model_state.pdparams - #CKPT=pretrained_models/saved_model_en/roberta_large_20211207_174631/model_4000/model_state.pdparams - elif [[ $BASE_MODEL == "lstm" ]]; then - VOCAB_PATH='rnn/vocab.sst2_train' - CKPT=rnn/checkpoints_en/final.pdparams - #CKPT=rnn/checkpoints_en/final.pdparams - fi - - elif [[ $LANGUAGE == "ch" ]]; then - - if [[ $BASE_MODEL == "roberta_base" ]]; then - FROM_PRETRAIN='roberta-wwm-ext' - CKPT=pretrained_models/saved_model_ch/roberta_base/model_900/model_state.pdparams - #CKPT=pretrained_models/saved_model_ch/roberta_base_20211229_101252/model_900/model_state.pdparams - elif [[ $BASE_MODEL == "roberta_large" ]]; then - FROM_PRETRAIN='roberta-wwm-ext-large' - CKPT=pretrained_models/saved_model_ch/roberta_large_20211014_192021/model_900/model_state.pdparams - #CKPT=pretrained_models/saved_model_ch/roberta_large_20211229_105019/model_900/model_state.pdparams - elif [[ $BASE_MODEL == "lstm" ]]; then - VOCAB_PATH='rnn/vocab.txt' - CKPT=rnn/checkpoints_ch/final.pdparams - #CKPT=rnn/checkpoints_ch/final.pdparams - fi - fi - - OUTPUT=./output/${TASK}.${BASE_MODEL} - [ -d $OUTPUT ] || mkdir -p $OUTPUT - set -x - - if [[ ! -f ${OUTPUT}/interpret.${INTER_MODE} ]]; then - python3 ./saliency_map/sentiment_interpretable.py \ - --language $LANGUAGE \ - --base_model $BASE_MODEL \ - --data_dir $DATA \ - --vocab_path $VOCAB_PATH \ - --from_pretrained $FROM_PRETRAIN \ - --batch_size 1 \ - --init_checkpoint $CKPT \ - --inter_mode $INTER_MODE\ - --output_dir $OUTPUT \ - --n-samples 200 \ - --start_id $START_ID \ - --eval $@ - fi - done - done -done \ No newline at end of file diff --git a/examples/model_interpretation/task/senti/saliency_map/sentiment_interpretable.py b/examples/model_interpretation/task/senti/saliency_map/sentiment_interpretable.py deleted file mode 100644 index 61afefc70ec5..000000000000 --- a/examples/model_interpretation/task/senti/saliency_map/sentiment_interpretable.py +++ /dev/null @@ -1,502 +0,0 @@ -# !/usr/bin/env python3 -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import argparse
-import collections
-import json
-import logging
-import os
-import sys
-from functools import partial
-from pathlib import Path
-
-import numpy as np
-import paddle
-from LIME.lime_text import LimeTextExplainer
-from rnn.model import BiLSTMAttentionModel, SelfInteractiveAttention
-from rnn.utils import CharTokenizer, convert_example
-from roberta.modeling import RobertaForSequenceClassification
-from tqdm import tqdm
-
-from paddlenlp.data import Dict, Pad, Stack, Tuple, Vocab
-from paddlenlp.datasets import DatasetBuilder
-from paddlenlp.transformers.roberta.tokenizer import (
- RobertaBPETokenizer,
- RobertaTokenizer,
-)
-
-sys.path.append("../../..")
-from model_interpretation.utils import ( # noqa: E402
- convert_tokenizer_res_to_old_version,
- match,
-)
-
-sys.path.remove("../../..")
-
-log = logging.getLogger(__name__)
-log.setLevel(logging.DEBUG)
-logging.getLogger().setLevel(logging.DEBUG)
-
-
-def get_args():
- parser = argparse.ArgumentParser("interpret sentiment analysis task")
- parser.add_argument("--base_model", required=True, choices=["roberta_base", "roberta_large", "lstm"])
- parser.add_argument("--from_pretrained", type=str, required=True, help="pretrained model directory or tag")
- parser.add_argument(
- "--max_seq_len", type=int, default=128, help="max sentence length, should not be greater than 512"
- )
- parser.add_argument("--batch_size", type=int, default=1, help="batch size")
- parser.add_argument("--data_dir", type=str, required=True, help="data directory that includes train / dev data")
- parser.add_argument("--eval", action="store_true")
- parser.add_argument("--init_checkpoint", type=str, default=None, help="checkpoint to warm start from")
- parser.add_argument("--wd", type=float, default=0.01, help="weight decay, aka L2 regularizer")
- parser.add_argument(
- "--use_amp",
- action="store_true",
- help="only activate AMP (auto mixed precision acceleration) on TensorCore compatible devices",
- )
- parser.add_argument(
- "--inter_mode",
- type=str,
- default="attention",
- choices=["attention", "simple_gradient", "smooth_gradient", "integrated_gradient", "lime"],
- help="specify the interpretation mode.",
- )
- parser.add_argument("--n-samples", type=int, default=25, help="number of samples used for smooth gradient method")
- parser.add_argument("--output_dir", type=Path, required=True, help="output directory for interpretation results")
- parser.add_argument("--start_id", type=int, default=0)
- parser.add_argument("--vocab_path", type=str, required=True)
- parser.add_argument("--language", type=str, required=True, help="language that the model is built for")
- args = parser.parse_args()
- return args
-
-
-class Senti_data(DatasetBuilder):
- def _read(self, filename):
- with open(filename, "r", encoding="utf8") as f:
- for line in f.readlines():
- line_split = json.loads(line)
- yield {
- "id": line_split["id"],
- "context": line_split["context"],
- "sent_token": line_split["sent_token"],
- }
-
-
-def create_dataloader(dataset, trans_fn=None, mode="train", batch_size=1, batchify_fn=None):
- """
- Creates dataloader.
- - Args: - dataset(obj:`paddle.io.Dataset`): Dataset instance. - trans_fn(obj:`callable`, optional, defaults to `None`): function to convert a data sample to input ids, etc. - mode(obj:`str`, optional, defaults to obj:`train`): If mode is 'train', it will shuffle the dataset randomly. - batch_size(obj:`int`, optional, defaults to 1): The sample number of a mini-batch. - batchify_fn(obj:`callable`, optional, defaults to `None`): function to generate mini-batch data by merging - the sample list, None for only stack each fields of sample in axis - 0(same as :attr::`np.stack(..., axis=0)`). - - Returns: - dataloader(obj:`paddle.io.DataLoader`): The dataloader which generates batches. - """ - if trans_fn: - dataset = dataset.map(trans_fn) - - shuffle = True if mode == "train" else False - if mode == "train": - sampler = paddle.io.DistributedBatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle) - else: - sampler = paddle.io.BatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle) - dataloader = paddle.io.DataLoader(dataset, batch_sampler=sampler, collate_fn=batchify_fn) - return dataloader - - -def map_fn_senti(examples, tokenizer, args): - log.debug("load data %d" % len(examples)) - if args.language == "en": - contexts = [example["context"].encode("ascii", errors="replace").decode("UTF-8") for example in examples] - else: - contexts = [example["context"] for example in examples] - tokenized_examples = tokenizer(contexts, max_seq_len=args.max_seq_len) - tokenized_examples = convert_tokenizer_res_to_old_version(tokenized_examples) - for i in range(len(tokenized_examples)): - tokenized_examples[i]["offset_mapping"] = ( - [(0, 0)] + tokenizer.get_offset_mapping(contexts[i])[: args.max_seq_len - 2] + [(0, 0)] - ) - return tokenized_examples - - -def init_lstm_var(args): - vocab = Vocab.load_vocabulary(args.vocab_path, unk_token="[UNK]", pad_token="[PAD]") - tokenizer = CharTokenizer(vocab, args.language, "../../punctuations") - padding_idx = vocab.token_to_idx.get("[PAD]", 0) - - trans_fn = partial(convert_example, tokenizer=tokenizer, is_test=True, language=args.language) - - # Init attention layer - lstm_hidden_size = 196 - attention = SelfInteractiveAttention(hidden_size=2 * lstm_hidden_size) - model = BiLSTMAttentionModel( - attention_layer=attention, - vocab_size=len(tokenizer.vocab), - lstm_hidden_size=lstm_hidden_size, - num_classes=2, - padding_idx=padding_idx, - ) - - # Reads data and generates mini-batches. 
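- # A hedged illustration (hypothetical samples, not from this script) of what
- # the Tuple/Pad/Stack batchify_fn just below produces: Pad right-pads every
- # input_ids list to the longest sample in the batch, Stack stacks the lengths:
- #   from paddlenlp.data import Pad, Stack, Tuple
- #   fn = Tuple(Pad(axis=0, pad_val=0), Stack(dtype="int64"))
- #   ids, lens = fn([([1, 2, 3], 3), ([4, 5], 2)])
- #   ids -> [[1, 2, 3], [4, 5, 0]]; lens -> [3, 2]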
- dev_ds = Senti_data().read(args.data_dir) - batchify_fn = lambda samples, fn=Tuple( - Pad(axis=0, pad_val=padding_idx), # input_ids - Stack(dtype="int64"), # seq len - ): [data for data in fn(samples)] - - dev_loader = create_dataloader( - dev_ds, trans_fn=trans_fn, batch_size=args.batch_size, mode="validation", batchify_fn=batchify_fn - ) - - return model, tokenizer, dev_loader - - -def init_roberta_var(args): - tokenizer = None - if args.language == "ch": - tokenizer = RobertaTokenizer.from_pretrained(args.from_pretrained) - else: - tokenizer = RobertaBPETokenizer.from_pretrained(args.from_pretrained) - model = RobertaForSequenceClassification.from_pretrained( - args.from_pretrained, - hidden_dropout_prob=0, - attention_probs_dropout_prob=0, - dropout=0, - num_labels=2, - name="", - return_inter_score=True, - ) - - map_fn = partial(map_fn_senti, tokenizer=tokenizer, args=args) - - dev_ds = Senti_data().read(args.data_dir) - dev_ds.map(map_fn, batched=True) - dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) - batchify_fn = lambda samples, fn=Dict( - { - "input_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id), - "token_type_ids": Pad(axis=0, pad_val=tokenizer.pad_token_type_id), - "offset_mapping": Pad(axis=0, pad_val=tokenizer.pad_token_id), - } - ): fn(samples) - - dataloader = paddle.io.DataLoader( - dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=batchify_fn, return_list=True - ) - - return model, tokenizer, dataloader - - -def extract_attention_scores(args, atts, input_ids, tokens, sub_word_id_dict, result, offset, out_handle): - if args.base_model.startswith("roberta"): - inter_score = atts[-1][:, :, 0, :].mean(1) # (bsz, seq) - inter_score = inter_score[0][1:-1] # remove CLS and SEP - input_ids = input_ids[0][1:-1] - - elif args.base_model == "lstm": - inter_score = atts[0] - input_ids = input_ids[0] - - length = (inter_score > 0).cast("int32").sum(-1).tolist()[0] - assert len(tokens) == length, f"%s: {len(tokens)} != {length}" % (step + 1) - - char_attribution_dict = {} - # Collect scores in different situation - if args.base_model.startswith("roberta"): - assert len(inter_score) == len(offset), str(len(inter_score)) + "not equal to" + str(len(offset)) - sorted_token = [] - for i in range(len(inter_score)): - sorted_token.append([i, offset[i], inter_score[i]]) - - char_attribution_dict = match(result["context"], result["sent_token"], sorted_token) - - result["char_attri"] = collections.OrderedDict() - for token_info in sorted(char_attribution_dict, key=lambda x: x[2], reverse=True): - result["char_attri"][str(token_info[0])] = [str(token_info[1]), float(token_info[2])] - result.pop("sent_token") - else: - if args.language == "ch": - idx = 0 - for token, score in zip(tokens, inter_score.numpy().tolist()): - char_attribution_dict[idx] = (token, score) - idx += 1 - else: - idx = 0 - for word, sub_word_score in zip(tokens, inter_score.tolist()): - char_attribution_dict[idx] = (word, sub_word_score) - idx += 1 - - result["char_attri"] = collections.OrderedDict() - for token_id, token_info in sorted(char_attribution_dict.items(), key=lambda x: x[1][1], reverse=True): - result["char_attri"][token_id] = token_info - - out_handle.write(json.dumps(result, ensure_ascii=False) + "\n") - - -def extract_integrated_gradient_scores( - args, - atts, - input_ids, - tokens, - sub_word_id_dict, - fwd_args, - fwd_kwargs, - model, - result, - pred_label, - err_total, - offset, - out_handle, -): - embedded_grads_list = [] - for i in 
range(args.n_samples): - probs, _, embedded = model.forward_interpet( - *fwd_args, **fwd_kwargs, noise="integrated", i=i, n_samples=args.n_samples - ) - predicted_class_prob = probs[0][pred_label] - predicted_class_prob.backward(retain_graph=False) - embedded_grad = embedded.grad - model.clear_gradients() - embedded_grads_list.append(embedded_grad) - - if i == 0: - baseline_pred_confidence = probs.tolist()[0][pred_label] # scalar - baseline_embedded = embedded # Tensor(1, seq_len, embed_size) - elif i == args.n_samples - 1: - pred_confidence = probs.tolist()[0][pred_label] # scalar - pred_embedded = embedded # Tensor(1, seq_len, embed_size) - - embedded_grads_tensor = paddle.to_tensor( - embedded_grads_list, dtype="float32", place=paddle.CUDAPlace(0), stop_gradient=True - ) - - trapezoidal_grads = (embedded_grads_tensor[1:] + embedded_grads_tensor[:-1]) / 2 - integral_grads = trapezoidal_grads.sum(0) / trapezoidal_grads.shape[0] # Tensor(1, seq_len, embed_size) - - inter_score = (pred_embedded - baseline_embedded) * integral_grads # Tensor(1, seq_len, embed_size) - inter_score = inter_score.sum(-1) # Tensor(1, seq_len) - - # eval err - delta_pred_confidence = pred_confidence - baseline_pred_confidence - sum_gradient = inter_score.sum().tolist()[0] - err = (delta_pred_confidence - sum_gradient + 1e-12) / (delta_pred_confidence + 1e-12) - err_total.append(np.abs(err)) - - print_str = "%s\t%d\t%.3f\t%.3f\t%.3f\t%.3f" - print_vals = (result["id"], args.n_samples, delta_pred_confidence, sum_gradient, err, np.average(err_total)) - log.debug(print_str % print_vals) - - inter_score.stop_gradient = True - - char_attribution_dict = {} - if args.base_model.startswith("roberta"): - inter_score = inter_score[0][1:-1] - sorted_token = [] - for i in range(len(inter_score)): - sorted_token.append([i, offset[i], inter_score[i]]) - char_attribution_dict = match(result["context"], result["sent_token"], sorted_token) - - result["char_attri"] = collections.OrderedDict() - for token_info in sorted(char_attribution_dict, key=lambda x: x[2], reverse=True): - result["char_attri"][str(token_info[0])] = [str(token_info[1]), float(token_info[2])] - result.pop("sent_token") - - elif args.base_model == "lstm": - inter_score = inter_score[0] - idx = 0 - for word, sub_word_score in zip(tokens, inter_score.tolist()): - char_attribution_dict[idx] = (word, sub_word_score) - idx += 1 - - result["char_attri"] = collections.OrderedDict() - for token_id, token_info in sorted(char_attribution_dict.items(), key=lambda x: x[1][1], reverse=True): - result["char_attri"][token_id] = token_info - - out_handle.write(json.dumps(result, ensure_ascii=False) + "\n") - return err_total - - -def extract_LIME_scores( - args, - tokenizer, - tokens, - pred_label, - model, - probs, - result, - lime_err_total, - lime_score_total, - lime_relative_err_total, - out_handle, -): - explainer = LimeTextExplainer(class_names=["neg", "pos"], verbose=False, language=args.language) - - if_lstm = args.base_model == "lstm" - explain_res = None - - text_instance = result["context"] - - explain_res = explainer.explain_instance( - text_instance=text_instance, - tokenizer=tokenizer, - pred_label=pred_label, - classifier_fn=model.forward_interpet, - num_samples=5000, - if_lstm=if_lstm, - ) - - exp, indexed_string, relative_err, err = explain_res - - score = exp.score[pred_label] - local_exps = exp.local_exp - ridge_pred = exp.local_pred[pred_label] - model_pred = probs.numpy().tolist()[0][pred_label] - - lime_score_total.append(score) - 
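- # score is the R^2 of the local Ridge surrogate; err and relative_err are the
- # kernel-weighted absolute and relative gaps between the surrogate's predictions
- # and the model's probabilities over the perturbed neighborhood (see LIME/lime_base.py).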
lime_relative_err_total.append(relative_err) - lime_err_total.append(err) - log.debug("score: %.2f" % score) - log.debug("relative_err: %.2f" % relative_err) - log.debug("err: %.2f" % err) - log.debug("ridge_pred: %.2f\tpred: %.2f\tdelta: %.2f" % (ridge_pred, model_pred, ridge_pred - model_pred)) - - for kind, local_exp in local_exps.items(): # only have one iteration here - char_attribution_dict = [] - - for idx in range(len(result["sent_token"])): - t = result["sent_token"][idx] # .replace('Ġ', '') - got_score = False - for word_id, attribution in local_exp: - if indexed_string.inverse_vocab[word_id] == t: - char_attribution_dict.append((idx, t, attribution)) - got_score = True - break - if not got_score: - char_attribution_dict.append((idx, t, 0)) - char_attribution_dict = sorted(char_attribution_dict, key=lambda x: x[2], reverse=True) - - result["char_attri"] = collections.OrderedDict() - for s in char_attribution_dict: - result["char_attri"][s[0]] = (s[1], s[2]) - - out_handle.write(json.dumps(result, ensure_ascii=False) + "\n") - return lime_err_total, lime_score_total, lime_relative_err_total - - -if __name__ == "__main__": - args = get_args() - if args.base_model.startswith("roberta"): - model, tokenizer, dataloader = init_roberta_var(args) - elif args.base_model == "lstm": - model, tokenizer, dataloader = init_lstm_var(args) - else: - raise ValueError("unsupported base model name.") - - assert args.eval, "INTERPRETER must be run in eval mode" - with paddle.amp.auto_cast(enable=args.use_amp), open( - os.path.join(args.output_dir, "interpret" + f".{args.inter_mode}"), "w" - ) as out_handle: - - # Load model - sd = paddle.load(args.init_checkpoint) - model.set_dict(sd) - model.train() # set dropout to 0 in order to get the gradient - log.debug("load model from %s" % args.init_checkpoint) - - get_sub_word_ids = lambda word: map(str, tokenizer.convert_tokens_to_ids(tokenizer.tokenize(word))) - for step, d in tqdm(enumerate(dataloader)): - if step + 1 < args.start_id: # start from the step's instance - continue - # Initialize input_ids, fwd_args, tokens - result = {} - offset = None - if args.base_model.startswith("roberta"): - input_ids, token_type_ids, offset_map = d - fwd_args = [input_ids, token_type_ids] - fwd_kwargs = {} - tokens = tokenizer.convert_ids_to_tokens(input_ids[0, 1:-1].tolist()) # list - offset = offset_map[0, 1:-1] - - elif args.base_model == "lstm": - input_ids, seq_lens = d - fwd_args = [input_ids, seq_lens] - fwd_kwargs = {} - tokens = [tokenizer.vocab.idx_to_token[input_id] for input_id in input_ids.tolist()[0]] - - result["id"] = dataloader.dataset.data[step]["id"] - - probs, atts, embedded = model.forward_interpet(*fwd_args, **fwd_kwargs) - pred_label = paddle.argmax(probs, axis=-1).tolist()[0] - - result["pred_label"] = pred_label - result["probs"] = [float(format(prob, ".5f")) for prob in probs.numpy()[0].tolist()] - sub_word_id_dict = [] - err_total = [] - lime_err_total, lime_score_total, lime_relative_err_total = [], [], [] - - result["context"] = dataloader.dataset.data[step]["context"] - result["sent_token"] = dataloader.dataset.data[step]["sent_token"] - - # Attention - if args.inter_mode == "attention": - # extract attention scores and write resutls to file - extract_attention_scores(args, atts, input_ids, tokens, sub_word_id_dict, result, offset, out_handle) - - # Integrated_gradient - elif args.inter_mode == "integrated_gradient": - err_total = extract_integrated_gradient_scores( - args, - atts, - input_ids, - tokens, - sub_word_id_dict, - fwd_args, 
- fwd_kwargs, - model, - result, - pred_label, - err_total, - offset, - out_handle, - ) - - # LIME - elif args.inter_mode == "lime": - lime_err_total, lime_score_total, lime_relative_err_total = extract_LIME_scores( - args, - tokenizer, - tokens, - pred_label, - model, - probs, - result, - lime_err_total, - lime_score_total, - lime_relative_err_total, - out_handle, - ) - - else: - raise KeyError(f"Unkonwn interpretable mode: {args.inter_mode}") - - if args.inter_mode == "lime": - log.debug(np.average(np.array(lime_relative_err_total))) diff --git a/examples/model_interpretation/task/senti/saliency_map/utils.py b/examples/model_interpretation/task/senti/saliency_map/utils.py deleted file mode 100644 index da76e25bfa59..000000000000 --- a/examples/model_interpretation/task/senti/saliency_map/utils.py +++ /dev/null @@ -1,38 +0,0 @@ -# !/usr/bin/env python3 -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -from __future__ import absolute_import, division, print_function, unicode_literals - -import paddle - - -class UnpackDataLoader(paddle.io.DataLoader): - def __init__(self, *args, **kwargs): - super(UnpackDataLoader, self).__init__(*args, batch_size=1, **kwargs) - - def __iter__(self): - return ([yy[0] for yy in y] for y in super(UnpackDataLoader, self).__iter__()) - - -def create_if_not_exists(dir): - try: - dir.mkdir(parents=True) - except: - pass - return dir - - -def get_warmup_and_linear_decay(max_steps, warmup_steps): - return lambda step: min(step / warmup_steps, 1.0 - (step - warmup_steps) / (max_steps - warmup_steps)) diff --git a/examples/model_interpretation/task/similarity/LIME/exceptions.py b/examples/model_interpretation/task/similarity/LIME/exceptions.py deleted file mode 100644 index c5fa1a29924a..000000000000 --- a/examples/model_interpretation/task/similarity/LIME/exceptions.py +++ /dev/null @@ -1,2 +0,0 @@ -class LimeError(Exception): - """Raise for errors""" diff --git a/examples/model_interpretation/task/similarity/LIME/explanation.py b/examples/model_interpretation/task/similarity/LIME/explanation.py deleted file mode 100644 index 46b3f0463fa6..000000000000 --- a/examples/model_interpretation/task/similarity/LIME/explanation.py +++ /dev/null @@ -1,343 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -""" -Explanation class, with visualization functions. 
-""" -from io import open -import os -import os.path -import json -import string -import numpy as np - -from sklearn.utils import check_random_state - -from LIME.exceptions import LimeError - - -def id_generator(size=15, random_state=None): - """Helper function to generate random div ids. This is useful for embedding - HTML into ipython notebooks.""" - chars = list(string.ascii_uppercase + string.digits) - return "".join(random_state.choice(chars, size, replace=True)) - - -class DomainMapper(object): - """Class for mapping features to the specific domain. - - The idea is that there would be a subclass for each domain (text, tables, - images, etc), so that we can have a general Explanation class, and separate - out the specifics of visualizing features in here. - """ - - def __init__(self): - pass - - def map_exp_ids(self, exp, **kwargs): - """Maps the feature ids to concrete names. - - Default behaviour is the identity function. Subclasses can implement - this as they see fit. - - Args: - exp: list of tuples [(id, weight), (id,weight)] - kwargs: optional keyword arguments - - Returns: - exp: list of tuples [(name, weight), (name, weight)...] - """ - return exp - - def visualize_instance_html(self, exp, label, div_name, exp_object_name, **kwargs): - """Produces html for visualizing the instance. - - Default behaviour does nothing. Subclasses can implement this as they - see fit. - - Args: - exp: list of tuples [(id, weight), (id,weight)] - label: label id (integer) - div_name: name of div object to be used for rendering(in js) - exp_object_name: name of js explanation object - kwargs: optional keyword arguments - - Returns: - js code for visualizing the instance - """ - return "" - - -class Explanation(object): - """Object returned by explainers.""" - - def __init__(self, domain_mapper, mode="classification", class_names=None, random_state=None): - """ - - Initializer. - - Args: - domain_mapper: must inherit from DomainMapper class - type: "classification" or "regression" - class_names: list of class names (only used for classification) - random_state: an integer or numpy.RandomState that will be used to - generate random numbers. If None, the random state will be - initialized using the internal numpy seed. - """ - self.random_state = random_state - self.mode = mode - self.domain_mapper = domain_mapper - self.local_exp = {} - self.intercept = {} - self.score = {} - self.local_pred = {} - if mode == "classification": - self.class_names = class_names - self.top_labels = None - self.predict_proba = None - elif mode == "regression": - self.class_names = ["negative", "positive"] - self.predicted_value = None - self.min_value = 0.0 - self.max_value = 1.0 - self.dummy_label = 1 - else: - raise LimeError( - 'Invalid explanation mode "{}". ' 'Should be either "classification" ' 'or "regression".'.format(mode) - ) - - def available_labels(self): - """ - Returns the list of classification labels for which we have any explanations. - """ - try: - assert self.mode == "classification" - except AssertionError: - raise NotImplementedError("Not supported for regression explanations.") - else: - ans = self.top_labels if self.top_labels else self.local_exp.keys() - return list(ans) - - def as_list(self, label=1, **kwargs): - """Returns the explanation as a list. - - Args: - label: desired label. If you ask for a label for which an - explanation wasn't computed, will throw an exception. - Will be ignored for regression explanations. 
- kwargs: keyword arguments, passed to domain_mapper - - Returns: - list of tuples (representation, weight), where representation is - given by domain_mapper. Weight is a float. - """ - label_to_use = label if self.mode == "classification" else self.dummy_label - ans = self.domain_mapper.map_exp_ids(self.local_exp[label_to_use], **kwargs) - ans = [(x[0], float(x[1])) for x in ans] - return ans - - def as_map(self): - """Returns the map of explanations. - - Returns: - Map from label to list of tuples (feature_id, weight). - """ - return self.local_exp - - def as_pyplot_figure(self, label=1, **kwargs): - """Returns the explanation as a pyplot figure. - - Will throw an error if you don't have matplotlib installed - Args: - label: desired label. If you ask for a label for which an - explanation wasn't computed, will throw an exception. - Will be ignored for regression explanations. - kwargs: keyword arguments, passed to domain_mapper - - Returns: - pyplot figure (barchart). - """ - import matplotlib.pyplot as plt - - exp = self.as_list(label=label, **kwargs) - fig = plt.figure() - vals = [x[1] for x in exp] - names = [x[0] for x in exp] - vals.reverse() - names.reverse() - colors = ["green" if x > 0 else "red" for x in vals] - pos = np.arange(len(exp)) + 0.5 - plt.barh(pos, vals, align="center", color=colors) - plt.yticks(pos, names) - if self.mode == "classification": - title = "Local explanation for class %s" % self.class_names[label] - else: - title = "Local explanation" - plt.title(title) - return fig - - def show_in_notebook(self, labels=None, predict_proba=True, show_predicted_value=True, **kwargs): - """Shows html explanation in ipython notebook. - - See as_html() for parameters. - This will throw an error if you don't have IPython installed""" - - from IPython.core.display import display, HTML - - display( - HTML( - self.as_html( - labels=labels, predict_proba=predict_proba, show_predicted_value=show_predicted_value, **kwargs - ) - ) - ) - - def save_to_file(self, file_path, labels=None, predict_proba=True, show_predicted_value=True, **kwargs): - """Saves html explanation to file. . - - Params: - file_path: file to save explanations to - - See as_html() for additional parameters. - - """ - file_ = open(file_path, "w", encoding="utf8") - file_.write( - self.as_html( - labels=labels, predict_proba=predict_proba, show_predicted_value=show_predicted_value, **kwargs - ) - ) - file_.close() - - def as_html(self, labels=None, predict_proba=True, show_predicted_value=True, **kwargs): - """Returns the explanation as an html page. - - Args: - labels: desired labels to show explanations for (as barcharts). - If you ask for a label for which an explanation wasn't - computed, will throw an exception. If None, will show - explanations for all available labels. (only used for classification) - predict_proba: if true, add barchart with prediction probabilities - for the top classes. (only used for classification) - show_predicted_value: if true, add barchart with expected value - (only used for regression) - kwargs: keyword arguments, passed to domain_mapper - - Returns: - code for an html page, including javascript includes. 
- """ - - def jsonize(x): - return json.dumps(x, ensure_ascii=False) - - if labels is None and self.mode == "classification": - labels = self.available_labels() - - this_dir, _ = os.path.split(__file__) - bundle = open(os.path.join(this_dir, "bundle.js"), encoding="utf8").read() - - out = ( - """ - - """ - % bundle - ) - random_id = id_generator(size=15, random_state=check_random_state(self.random_state)) - out += ( - """ -
- """ - % random_id - ) - - predict_proba_js = "" - if self.mode == "classification" and predict_proba: - predict_proba_js = """ - var pp_div = top_div.append('div') - .classed('lime predict_proba', true); - var pp_svg = pp_div.append('svg').style('width', '100%%'); - var pp = new lime.PredictProba(pp_svg, %s, %s); - """ % ( - jsonize([str(x) for x in self.class_names]), - jsonize(list(self.predict_proba.astype(float))), - ) - - predict_value_js = "" - if self.mode == "regression" and show_predicted_value: - # reference self.predicted_value - # (svg, predicted_value, min_value, max_value) - predict_value_js = """ - var pp_div = top_div.append('div') - .classed('lime predicted_value', true); - var pp_svg = pp_div.append('svg').style('width', '100%%'); - var pp = new lime.PredictedValue(pp_svg, %s, %s, %s); - """ % ( - jsonize(float(self.predicted_value)), - jsonize(float(self.min_value)), - jsonize(float(self.max_value)), - ) - - exp_js = """var exp_div; - var exp = new lime.Explanation(%s); - """ % ( - jsonize([str(x) for x in self.class_names]) - ) - - if self.mode == "classification": - for label in labels: - exp = jsonize(self.as_list(label)) - exp_js += """ - exp_div = top_div.append('div').classed('lime explanation', true); - exp.show(%s, %d, exp_div); - """ % ( - exp, - label, - ) - else: - exp = jsonize(self.as_list()) - exp_js += """ - exp_div = top_div.append('div').classed('lime explanation', true); - exp.show(%s, %s, exp_div); - """ % ( - exp, - self.dummy_label, - ) - - raw_js = """var raw_div = top_div.append('div');""" - - if self.mode == "classification": - html_data = self.local_exp[labels[0]] - else: - html_data = self.local_exp[self.dummy_label] - - raw_js += self.domain_mapper.visualize_instance_html( - html_data, labels[0] if self.mode == "classification" else self.dummy_label, "raw_div", "exp", **kwargs - ) - out += """ - - """ % ( - random_id, - predict_proba_js, - predict_value_js, - exp_js, - raw_js, - ) - out += "" - - return out diff --git a/examples/model_interpretation/task/similarity/LIME/lime_base.py b/examples/model_interpretation/task/similarity/LIME/lime_base.py deleted file mode 100644 index ca9ce2838919..000000000000 --- a/examples/model_interpretation/task/similarity/LIME/lime_base.py +++ /dev/null @@ -1,225 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -""" -Contains abstract functionality for learning locally linear sparse model. -""" -import numpy as np -import scipy as sp -from sklearn.linear_model import Ridge, lars_path -from sklearn.utils import check_random_state - - -class LimeBase(object): - """Class for learning a locally linear sparse model from perturbed data""" - - def __init__(self, kernel_fn, verbose=False, random_state=None): - """Init function - - Args: - kernel_fn: function that transforms an array of distances into an - array of proximity values (floats). - verbose: if true, print local prediction values from linear model. 
- random_state: an integer or numpy.RandomState that will be used to
- generate random numbers. If None, the random state will be
- initialized using the internal numpy seed.
- """
- self.kernel_fn = kernel_fn
- self.verbose = verbose
- self.random_state = check_random_state(random_state)
-
- @staticmethod
- def generate_lars_path(weighted_data, weighted_labels):
- """Generates the lars path for weighted data.
-
- Args:
- weighted_data: data that has been weighted by kernel
- weighted_labels: labels, weighted by kernel
-
- Returns:
- (alphas, coefs), both are arrays corresponding to the
- regularization parameter and coefficients, respectively
- """
- x_vector = weighted_data
- alphas, _, coefs = lars_path(x_vector, weighted_labels, method="lasso", verbose=False)
- return alphas, coefs
-
- def forward_selection(self, data, labels, weights, num_features):
- """Iteratively adds features to the model"""
- clf = Ridge(alpha=0, fit_intercept=True, random_state=self.random_state)
- used_features = []
- for _ in range(min(num_features, data.shape[1])):
- max_ = -100000000
- best = 0
- for feature in range(data.shape[1]):
- if feature in used_features:
- continue
- clf.fit(data[:, used_features + [feature]], labels, sample_weight=weights)
- score = clf.score(data[:, used_features + [feature]], labels, sample_weight=weights)
- if score > max_:
- best = feature
- max_ = score
- used_features.append(best)
- return np.array(used_features)
-
- def feature_selection(self, data, labels, weights, num_features, method):
- """Selects features for the model. See explain_instance_with_data to
- understand the parameters."""
- if method == "none":
- return np.array(range(data.shape[1]))
-
- elif method == "forward_selection":
- return self.forward_selection(data, labels, weights, num_features)
-
- elif method == "highest_weights":
- clf = Ridge(alpha=0.01, fit_intercept=True, random_state=self.random_state)
- clf.fit(data, labels, sample_weight=weights)
-
- coef = clf.coef_
- if sp.sparse.issparse(data):
- coef = sp.sparse.csr_matrix(clf.coef_)
- weighted_data = coef.multiply(data[0])
- # Note: most efficient to slice the data before reversing
- sdata = len(weighted_data.data)
- argsort_data = np.abs(weighted_data.data).argsort()
- # Edge case where data is more sparse than requested number of feature importances
- # In that case, we just pad with zero-valued features
- if sdata < num_features:
- nnz_indexes = argsort_data[::-1]
- indices = weighted_data.indices[nnz_indexes]
- num_to_pad = num_features - sdata
- indices = np.concatenate((indices, np.zeros(num_to_pad, dtype=indices.dtype)))
- indices_set = set(indices)
- pad_counter = 0
- for i in range(data.shape[1]):
- if i not in indices_set:
- indices[pad_counter + sdata] = i
- pad_counter += 1
- if pad_counter >= num_to_pad:
- break
- else:
- nnz_indexes = argsort_data[sdata - num_features : sdata][::-1]
- indices = weighted_data.indices[nnz_indexes]
- return indices
- else:
- weighted_data = coef * data[0]
- feature_weights = sorted(
- zip(range(data.shape[1]), weighted_data), # zip(feature index, Ridge weight)
- key=lambda x: np.abs(x[1]),
- reverse=True,
- )
- return np.array([x[0] for x in feature_weights[:num_features]]) # indices of the num_features features with the largest absolute Ridge weights
-
- elif method == "lasso_path":
- weighted_data = (data - np.average(data, axis=0, weights=weights)) * np.sqrt(weights[:, np.newaxis])
- weighted_labels = (labels - np.average(labels, weights=weights)) * np.sqrt(weights)
- nonzero = range(weighted_data.shape[1])
- _, coefs =
self.generate_lars_path(weighted_data, weighted_labels) - for i in range(len(coefs.T) - 1, 0, -1): - nonzero = coefs.T[i].nonzero()[0] - if len(nonzero) <= num_features: - break - used_features = nonzero - return used_features - - elif method == "auto": - if num_features <= 6: - n_method = "forward_selection" - else: - n_method = "highest_weights" - return self.feature_selection(data, labels, weights, num_features, n_method) - - def explain_instance_with_data( - self, - neighborhood_data, - neighborhood_labels, - distances, - label, - num_features, - feature_selection="auto", - model_regressor=None, - ): - """Takes perturbed data, labels and distances, returns explanation. - - Args: - neighborhood_data: perturbed data, 2d array. first element is - assumed to be the original data point. - neighborhood_labels: corresponding perturbed labels. should have as - many columns as the number of possible labels. - distances: distances to original data point. - label: label for which we want an explanation - num_features: maximum number of features in explanation - feature_selection: how to select num_features. options are: - 'forward_selection': iteratively add features to the model. - This is costly when num_features is high - 'highest_weights': selects the features that have the highest - product of absolute weight * original data point when - learning with all the features - 'lasso_path': chooses features based on the lasso - regularization path - 'none': uses all features, ignores num_features - 'auto': uses forward_selection if num_features <= 6, and - 'highest_weights' otherwise. - model_regressor: sklearn regressor to use in explanation. - Defaults to Ridge regression if None. Must have - model_regressor.coef_ and 'sample_weight' as a parameter - to model_regressor.fit() - - Returns: - (intercept, exp, score, local_pred): - intercept is a float. - exp is a sorted list of tuples, where each tuple (x,y) corresponds to the feature id (x) - and the local weight (y). The list is sorted by decreasing absolute value of y. 
- score is the R^2 value of the returned explanation
- local_pred is the prediction of the explanation model on the original instance
- """
-
- weights = self.kernel_fn(distances) # weights of the perturbed samples
- labels_column = neighborhood_labels[:, label] # softmax probability of the target label
-
- used_features = self.feature_selection(
- neighborhood_data, labels_column, weights, num_features, feature_selection
- )
- if model_regressor is None:
- model_regressor = Ridge(
- alpha=1, fit_intercept=True, random_state=self.random_state # alpha: L2 regularization coefficient; fit_intercept: whether to fit the bias term b
- ) # random_state: pseudo-random seed
- easy_model = model_regressor
- easy_model.fit(neighborhood_data[:, used_features], labels_column, sample_weight=weights)
- prediction_score = easy_model.score(neighborhood_data[:, used_features], labels_column, sample_weight=weights)
-
- local_pred = easy_model.predict(neighborhood_data[0, used_features].reshape(1, -1))
-
- ridge_pred = easy_model.predict(neighborhood_data[:, used_features])
- err_np = np.abs(labels_column - ridge_pred)
- relative_err_np = err_np / ridge_pred
- err = np.average(err_np, weights=weights)
- relative_err = np.average(relative_err_np, weights=weights)
-
- if self.verbose:
- print("Intercept", easy_model.intercept_)
- print(
- "Prediction_local",
- local_pred,
- )
- print("Right:", neighborhood_labels[0, label])
- return (
- easy_model.intercept_, # intercept of the local linear model
- sorted(
- zip(used_features, easy_model.coef_), key=lambda x: np.abs(x[1]), reverse=True
- ), # feature ids sorted by decreasing absolute weight
- prediction_score, # R^2 of easy_model's fit to the labels; the closer to 1 the better
- local_pred, # easy_model's prediction on the original instance
- relative_err,
- err,
- )
diff --git a/examples/model_interpretation/task/similarity/LIME/lime_text.py b/examples/model_interpretation/task/similarity/LIME/lime_text.py
deleted file mode 100644
index b702a68d8de1..000000000000
--- a/examples/model_interpretation/task/similarity/LIME/lime_text.py
+++ /dev/null
@@ -1,660 +0,0 @@
-# !/usr/bin/env python3
-# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""
-Functions for explaining text classifiers.
-"""
-from functools import partial
-import itertools
-import json
-import re
-import time
-import math
-import paddle
-
-import numpy as np
-import scipy as sp
-import sklearn
-from sklearn.utils import check_random_state
-
-import LIME.explanation as explanation
-import LIME.lime_base as lime_base
-
-
-class TextDomainMapper(explanation.DomainMapper):
- """Maps feature ids to words or word-positions"""
-
- def __init__(self, indexed_string):
- """Initializer.
-
- Args:
- indexed_string: lime_text.IndexedString, original string
- """
- self.indexed_string = indexed_string
-
- def map_exp_ids(self, exp, positions=False):
- """Maps ids to words or word-position strings.
- - Args: - exp: list of tuples [(id, weight), (id,weight)] - positions: if True, also return word positions - - Returns: - list of tuples (word, weight), or (word_positions, weight) if - examples: ('bad', 1) or ('bad_3-6-12', 1) - """ - if positions: - exp = [ - ( - "%s_%s" - % (self.indexed_string.word(x[0]), "-".join(map(str, self.indexed_string.string_position(x[0])))), - x[1], - ) - for x in exp - ] - else: - exp = [(self.indexed_string.word(x[0]), x[1]) for x in exp] - return exp - - def visualize_instance_html(self, exp, label, div_name, exp_object_name, text=True, opacity=True): - """Adds text with highlighted words to visualization. - - Args: - exp: list of tuples [(id, weight), (id,weight)] - label: label id (integer) - div_name: name of div object to be used for rendering(in js) - exp_object_name: name of js explanation object - text: if False, return empty - opacity: if True, fade colors according to weight - """ - if not text: - return "" - text = self.indexed_string.raw_string().encode("utf-8", "xmlcharrefreplace").decode("utf-8") - text = re.sub(r"[<>&]", "|", text) - exp = [(self.indexed_string.word(x[0]), self.indexed_string.string_position(x[0]), x[1]) for x in exp] - all_occurrences = list(itertools.chain.from_iterable([itertools.product([x[0]], x[1], [x[2]]) for x in exp])) - all_occurrences = [(x[0], int(x[1]), x[2]) for x in all_occurrences] - ret = """ - %s.show_raw_text(%s, %d, %s, %s, %s); - """ % ( - exp_object_name, - json.dumps(all_occurrences), - label, - json.dumps(text), - div_name, - json.dumps(opacity), - ) - return ret - - -class IndexedString(object): - """String with various indexes.""" - - def __init__(self, raw_string, split_expression=r"\W+", bow=True, mask_string=None, language="ch"): - """Initializer. - - Args: - raw_string: string with raw text in it - split_expression: Regex string or callable. If regex string, will be used with re.split. - If callable, the function should return a list of tokens. - bow: if True, a word is the same everywhere in the text - i.e. we - will index multiple occurrences of the same word. If False, - order matters, so that the same word will have different ids - according to position. - mask_string: If not None, replace words with this if bow=False - if None, default value is UNKWORDZ - """ - self.raw = raw_string - self.mask_string = "UNKWORDZ" if mask_string is None else mask_string - self.language = language - - if callable(split_expression): - tokens = split_expression(self.raw) - self.as_list = self._segment_with_tokens(self.raw, tokens) - tokens = set(tokens) - - def non_word(string): - return string not in tokens - - else: - # with the split_expression as a non-capturing group (?:), we don't need to filter out - # the separator character from the split results. 
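- # A brief hedged illustration with a hypothetical input: with the capture
- # group, re.compile(r"([\u4e00-\u9fa5])").split("你好ab") gives
- # ['', '你', '', '好', 'ab'], so after dropping empty strings every CJK
- # character becomes its own token, while the default r"\W+" splitter keeps
- # whole words like "ab" together.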
- if self.language == "ch": - splitter = re.compile(r"([\u4e00-\u9fa5])") - else: - splitter = re.compile(split_expression) - self.as_list = [w for w in splitter.split(self.raw) if len(w.strip()) > 0] - valid_word = splitter.match - - self.as_np = np.array(self.as_list) - self.string_start = np.hstack(([0], np.cumsum([len(x) for x in self.as_np[:-1]]))) - vocab = {} - self.inverse_vocab = [] - self.positions = [] - self.bow = bow - non_vocab = set() - for i, word in enumerate(self.as_np): - if word in non_vocab: - continue - if (self.language == "ch" and not valid_word(word)) or (self.language == "en" and valid_word(word)): - non_vocab.add(word) - continue - if bow: - if word not in vocab: - vocab[word] = len(vocab) - self.inverse_vocab.append(word) - self.positions.append([]) - idx_word = vocab[word] - self.positions[idx_word].append(i) - else: - self.inverse_vocab.append(word) - self.positions.append(i) - if not bow: - self.positions = np.array(self.positions) - - def raw_string(self): - """Returns the original raw string""" - return self.raw - - def num_words(self): - """Returns the number of tokens in the vocabulary for this document.""" - return len(self.inverse_vocab) - - def word(self, id_): - """Returns the word that corresponds to id_ (int)""" - return self.inverse_vocab[id_] - - def string_position(self, id_): - """Returns a np array with indices to id_ (int) occurrences""" - if self.bow: - return self.string_start[self.positions[id_]] - else: - return self.string_start[[self.positions[id_]]] - - def inverse_removing(self, words_to_remove): - """Returns a string after removing the appropriate words. - - If self.bow is false, replaces word with UNKWORDZ instead of removing it. - - Args: - words_to_remove: list of ids (ints) to remove - - Returns: - original raw string with appropriate words removed. - """ - mask = np.ones(self.as_np.shape[0], dtype="bool") - mask[self.__get_idxs(words_to_remove)] = False - if not self.bow: - return "".join([self.as_list[i] if mask[i] else self.mask_string for i in range(mask.shape[0])]) - return "".join([self.as_list[v] for v in mask.nonzero()[0]]) - - @staticmethod - def _segment_with_tokens(text, tokens): - """Segment a string around the tokens created by a passed-in tokenizer""" - list_form = [] - text_ptr = 0 - for token in tokens: - inter_token_string = [] - while not text[text_ptr:].startswith(token): - inter_token_string.append(text[text_ptr]) - text_ptr += 1 - if text_ptr >= len(text): - raise ValueError("Tokenization produced tokens that do not belong in string!") - text_ptr += len(token) - if inter_token_string: - list_form.append("".join(inter_token_string)) - list_form.append(token) - if text_ptr < len(text): - list_form.append(text[text_ptr:]) - return list_form - - def __get_idxs(self, words): - """Returns indexes to appropriate words.""" - if self.bow: - return list(itertools.chain.from_iterable([self.positions[z] for z in words])) - else: - return self.positions[words] - - -class IndexedCharacters(object): - """String with various indexes.""" - - def __init__(self, raw_string, bow=True, mask_string=None): - """Initializer. - - Args: - raw_string: string with raw text in it - bow: if True, a char is the same everywhere in the text - i.e. we - will index multiple occurrences of the same character. If False, - order matters, so that the same word will have different ids - according to position. 
- mask_string: If not None, replace characters with this if bow=False - if None, default value is chr(0) - """ - self.raw = raw_string - self.as_list = list(self.raw) - self.as_np = np.array(self.as_list) - self.mask_string = chr(0) if mask_string is None else mask_string - self.string_start = np.arange(len(self.raw)) - vocab = {} - self.inverse_vocab = [] - self.positions = [] - self.bow = bow - non_vocab = set() - for i, char in enumerate(self.as_np): - if char in non_vocab: - continue - if bow: - if char not in vocab: - vocab[char] = len(vocab) - self.inverse_vocab.append(char) - self.positions.append([]) - idx_char = vocab[char] - self.positions[idx_char].append(i) - else: - self.inverse_vocab.append(char) - self.positions.append(i) - if not bow: - self.positions = np.array(self.positions) - - def raw_string(self): - """Returns the original raw string""" - return self.raw - - def num_words(self): - """Returns the number of tokens in the vocabulary for this document.""" - return len(self.inverse_vocab) - - def word(self, id_): - """Returns the word that corresponds to id_ (int)""" - return self.inverse_vocab[id_] - - def string_position(self, id_): - """Returns a np array with indices to id_ (int) occurrences""" - if self.bow: - return self.string_start[self.positions[id_]] - else: - return self.string_start[[self.positions[id_]]] - - def inverse_removing(self, words_to_remove): - """Returns a string after removing the appropriate words. - - If self.bow is false, replaces word with UNKWORDZ instead of removing - it. - - Args: - words_to_remove: list of ids (ints) to remove - - Returns: - original raw string with appropriate words removed. - """ - mask = np.ones(self.as_np.shape[0], dtype="bool") - mask[self.__get_idxs(words_to_remove)] = False - if not self.bow: - return "".join([self.as_list[i] if mask[i] else self.mask_string for i in range(mask.shape[0])]) - return "".join([self.as_list[v] for v in mask.nonzero()[0]]) - - def __get_idxs(self, words): - """Returns indexes to appropriate words.""" - if self.bow: - return list(itertools.chain.from_iterable([self.positions[z] for z in words])) - else: - return self.positions[words] - - -class LimeTextExplainer(object): - """Explains text classifiers. - Currently, we are using an exponential kernel on cosine distance, and - restricting explanations to words that are present in documents.""" - - def __init__( - self, - kernel_width=25, - kernel=None, - verbose=False, - class_names=None, - feature_selection="auto", - split_expression=r"\W+", - bow=True, - mask_string=None, - random_state=None, - char_level=False, - language="ch", - ): - """Init function. - - Args: - kernel_width: kernel width for the exponential kernel. - kernel: similarity kernel that takes euclidean distances and kernel - width as input and outputs weights in (0,1). If None, defaults to - an exponential kernel. - verbose: if true, print local prediction values from linear model - class_names: list of class names, ordered according to whatever the - classifier is using. If not present, class names will be '0', - '1', ... - feature_selection: feature selection method. can be - 'forward_selection', 'lasso_path', 'none' or 'auto'. - See function 'explain_instance_with_data' in lime_base.py for - details on what each of the options does. - split_expression: Regex string or callable. If regex string, will be used with re.split. - If callable, the function should return a list of tokens. 
- bow: if True (bag of words), will perturb input data by removing - all occurrences of individual words or characters. - Explanations will be in terms of these words. Otherwise, will - explain in terms of word-positions, so that a word may be - important the first time it appears and unimportant the second. - Only set to false if the classifier uses word order in some way - (bigrams, etc), or if you set char_level=True. - mask_string: String used to mask tokens or characters if bow=False - if None, will be 'UNKWORDZ' if char_level=False, chr(0) - otherwise. - random_state: an integer or numpy.RandomState that will be used to - generate random numbers. If None, the random state will be - initialized using the internal numpy seed. - char_level: an boolean identifying that we treat each character - as an independent occurence in the string - """ - - if kernel is None: - - def kernel(d, kernel_width): - return np.sqrt(np.exp(-(d**2) / kernel_width**2)) - - kernel_fn = partial(kernel, kernel_width=kernel_width) - - self.random_state = check_random_state(random_state) - self.base = lime_base.LimeBase(kernel_fn, verbose, random_state=self.random_state) - self.class_names = class_names - self.vocabulary = None - self.feature_selection = feature_selection - self.bow = bow - self.mask_string = mask_string - self.split_expression = split_expression - self.char_level = char_level - self.language = language - - def explain_instance( - self, - text_instance_q: str, - text_instance_t: str, - analysis_query, - tokenizer, - pred_label: int, - classifier_fn, - labels=(0, 1), - top_labels=None, - num_features=10, - num_samples=5000, - distance_metric="cosine", - model_regressor=None, - if_lstm=False, - ): - """Generates explanations for a prediction. - - First, we generate neighborhood data by randomly hiding features from - the instance (see __data_labels_distance_mapping). We then learn - locally weighted linear models on this neighborhood data to explain - each of the classes in an interpretable way (see lime_base.py). - - Args: - text_instance: raw text string to be explained. - classifier_fn: classifier prediction probability function, which - takes a list of d strings and outputs a (d, k) numpy array with - prediction probabilities, where k is the number of classes. - For ScikitClassifiers , this is classifier.predict_proba. - labels: iterable with labels to be explained. - top_labels: if not None, ignore labels and produce explanations for - the K labels with highest prediction probabilities, where K is - this parameter. - num_features: maximum number of features present in explanation - num_samples: size of the neighborhood to learn the linear model - distance_metric: the distance metric to use for sample weighting, - defaults to cosine similarity - model_regressor: sklearn regressor to use in explanation. Defaults - to Ridge regression in LimeBase. Must have model_regressor.coef_ - and 'sample_weight' as a parameter to model_regressor.fit() - Returns: - An Explanation object (see explanation.py) with the corresponding - explanations. 
- """ - # prev_time = time.time() - - text_instance = text_instance_q if analysis_query else text_instance_t - text_support = text_instance_t if analysis_query else text_instance_q - - indexed_string = ( - IndexedCharacters(text_instance, bow=self.bow, mask_string=self.mask_string) - if self.char_level - else IndexedString( - text_instance, - bow=self.bow, - split_expression=self.split_expression, - mask_string=self.mask_string, - language=self.language, - ) - ) - domain_mapper = TextDomainMapper(indexed_string) - - # 产生扰动数据集 第一条是原始数据 - # data: 解释器训练特征 list (num_samples, doc_size) - # yss: 解释器训练标签 list (num_samples, class_num(2)) - # distances: 扰动样本到原始样本的距离 np.array(float) (num_samples, ) - data, yss, distances = self.__data_labels_distances( - indexed_string, - text_support, - analysis_query, - tokenizer, - classifier_fn, - num_samples, - distance_metric=distance_metric, - if_lstm=if_lstm, - ) - - if self.class_names is None: - self.class_names = [str(x) for x in range(yss[0].shape[0])] - ret_exp = explanation.Explanation( - domain_mapper=domain_mapper, class_names=self.class_names, random_state=self.random_state - ) - ret_exp.predict_proba = yss[0] - if top_labels: - labels = np.argsort(yss[0])[-top_labels:] - ret_exp.top_labels = list(labels) - ret_exp.top_labels.reverse() - - num_features = indexed_string.num_words() # 特征数量跟word_num相同 - - ( - ret_exp.intercept[pred_label], - ret_exp.local_exp[pred_label], - ret_exp.score[pred_label], - ret_exp.local_pred[pred_label], - relative_err, - err, - ) = self.base.explain_instance_with_data( - data, - yss, - distances, - pred_label, - num_features, - model_regressor=model_regressor, - feature_selection=self.feature_selection, - ) - - return ret_exp, indexed_string, relative_err, err - - def __data_labels_distances( - self, - indexed_string, - text_support, - analysis_query, - tokenizer, - classifier_fn, - num_samples, - distance_metric="cosine", - if_lstm=False, - ): - """Generates a neighborhood around a prediction. - - Generates neighborhood data by randomly removing words from - the instance, and predicting with the classifier. Uses cosine distance - to compute distances between original and perturbed instances. - Args: - indexed_string: document (IndexedString) to be explained, - classifier_fn: classifier prediction probability function, which - takes a string and outputs prediction probabilities. For - ScikitClassifier, this is classifier.predict_proba. - num_samples: size of the neighborhood to learn the linear model - distance_metric: the distance metric to use for sample weighting, - defaults to cosine similarity. - - Returns: - A tuple (data, labels, distances), where: - data: dense num_samples * K binary matrix, where K is the - number of tokens in indexed_string. The first row is the - original instance, and thus a row of ones. - labels: num_samples * L matrix, where L is the number of target - labels - distances: cosine distance between the original instance and - each perturbed instance (computed in the binary 'data' - matrix), times 100. 
- """ - - def distance_fn(x): - return sklearn.metrics.pairwise.pairwise_distances(x, x[0], metric=distance_metric).ravel() * 100 - - doc_size = indexed_string.num_words() - - sample = self.random_state.randint( - 1, doc_size, num_samples - 1 - ) # sample: [int(1 ~ doc_size-1) * num_samples-1] - data = np.ones((num_samples, doc_size)) - data[0] = np.ones(doc_size) - features_range = range(doc_size) - perturb_text = [indexed_string.raw_string()] # [文本 * num_samples] - - for i, size in enumerate(sample, start=1): - # inactive: 从range(0, doc_size)中随机取出的size个数组成的list, 要去掉的字的id - inactive = self.random_state.choice( - features_range, size, replace=False # [0, doc_size) # int: 该扰动样本中remove token的数量 - ) - - text = indexed_string.inverse_removing(inactive) # 原文本去掉了inactive中的字后的文本 - - data[i, inactive] = 0 - perturb_text.append(text) - - # print('doc size: %d' % doc_size) - - prev_time = time.time() - # inverse_data: 扰动数据集 [扰动样本 str] * num_samples - labels = [] - query_list, title_list, query_len_list, title_len_list = [], [], [], [] # for lstm - token_ids_list, s_ids_list = [], [] # for roberta - max_len = 0 - - support_token_ids = tokenizer.encode(text_support) # for lstm - support_len = len(support_token_ids) # for lstm - for idx, text in enumerate(perturb_text): - if if_lstm: - text_token_ids = tokenizer.encode(text) - text_len = len(text_token_ids) - if idx == 0: - max_len = len(text_token_ids) - while len(text_token_ids) < max_len: - text_token_ids.append(0) - - query_token_ids = text_token_ids if analysis_query else support_token_ids - title_token_ids = support_token_ids if analysis_query else text_token_ids - query_len = text_len if analysis_query else support_len - title_len = support_len if analysis_query else text_len - - query_list.append(query_token_ids) - title_list.append(title_token_ids) - query_len_list.append(query_len) - title_len_list.append(title_len) - - else: - text_tokens = tokenizer.tokenize(text) - text_token_ids = tokenizer.convert_tokens_to_ids(text_tokens) - support_tokens = tokenizer.tokenize(text_support) - support_ids = tokenizer.convert_tokens_to_ids(support_tokens) - if analysis_query: - token_ids = ( - [tokenizer.cls_token_id] - + text_token_ids - + [tokenizer.sep_token_id] - + support_ids - + [tokenizer.sep_token_id] - ) - else: - token_ids = ( - [tokenizer.cls_token_id] - + support_ids - + [tokenizer.sep_token_id] - + text_token_ids - + [tokenizer.sep_token_id] - ) - if len(token_ids) > max_len: - max_len = len(token_ids) - token_ids_list.append(token_ids) - - token_ids_np = [] - if not if_lstm: - for token_ids in token_ids_list: - # token_ids = token_ids[:max_len] - token_ids = token_ids + [tokenizer.pad_token_id] * (max_len - len(token_ids)) - token_ids_np.append(token_ids) - s_ids = [0 for _ in range(len(token_ids))] - s_ids_list.append(s_ids) - - token_ids_np = np.array(token_ids_np) - s_ids_np = np.array(s_ids_list) - - length = len(perturb_text[0]) - if if_lstm: - batch = 128 - else: - batch = 64 if length < 130 else 50 - - prev_time = time.time() - epoch_num = math.ceil(len(perturb_text) / batch) - for idx in range(epoch_num): - if if_lstm: - query_list_tensor = paddle.to_tensor(query_list[idx * batch : (idx + 1) * batch]) - title_list_tensor = paddle.to_tensor(title_list[idx * batch : (idx + 1) * batch]) - query_len_list_tensor = paddle.to_tensor(query_len_list[idx * batch : (idx + 1) * batch]) - title_len_list_tensor = paddle.to_tensor(title_len_list[idx * batch : (idx + 1) * batch]) - label = classifier_fn( - query_list_tensor, title_list_tensor, 
query_len_list_tensor, title_len_list_tensor - )[ - 0 - ] # label: Tensor[num_samples, 2] - else: - token_ids_tensor = paddle.Tensor( - value=token_ids_np[idx * batch : (idx + 1) * batch], place=paddle.CUDAPlace(0), stop_gradient=True - ) - s_ids_tensor = paddle.Tensor( - value=s_ids_np[idx * batch : (idx + 1) * batch], - place=token_ids_tensor.place, - stop_gradient=token_ids_tensor.stop_gradient, - ) - label = classifier_fn(token_ids_tensor, s_ids_tensor)[0] # label: Tensor[num_samples, 2] - - labels.extend(label.numpy().tolist()) - - labels = np.array(labels) # labels: nsp.array(num_samples, 2) - print("mode forward time: %.5f" % (time.time() - prev_time)) - distances = distance_fn(sp.sparse.csr_matrix(data)) - - return data, labels, distances diff --git a/examples/model_interpretation/task/similarity/pretrained_models/data.py b/examples/model_interpretation/task/similarity/pretrained_models/data.py deleted file mode 100644 index 37f2c12781f4..000000000000 --- a/examples/model_interpretation/task/similarity/pretrained_models/data.py +++ /dev/null @@ -1,138 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import paddle -import numpy as np - -from paddlenlp.datasets import MapDataset - - -def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): - if trans_fn: - dataset = dataset.map(trans_fn) - - shuffle = True if mode == "train" else False - if mode == "train": - batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) - else: - batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) - - return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) - - -def read_text_pair(data_path): - """Reads data.""" - with open(data_path, "r", encoding="utf-8") as f: - for line in f: - data = line.rstrip().split("\t") - if len(data) != 2: - continue - yield {"query": data[0], "title": data[1]} - - -def convert_pointwise_example(example, tokenizer, max_seq_length=512, is_test=False, language="en"): - if language == "ch": - q_name = "query" - t_name = "title" - l_name = "label" - else: - q_name = "sentence1" - t_name = "sentence2" - l_name = "labels" - - query, title = example[q_name], example[t_name] - - encoded_inputs = tokenizer(text=query, text_pair=title, max_seq_len=max_seq_length) - - input_ids = encoded_inputs["input_ids"] - token_type_ids = encoded_inputs["token_type_ids"] - - if not is_test: - label = np.array([example[l_name]], dtype="int64") - return input_ids, token_type_ids, label - else: - return input_ids, token_type_ids - - -def convert_pairwise_example(example, tokenizer, max_seq_length=512, phase="train"): - - if phase == "train": - query, pos_title, neg_title = example["query"], example["title"], example["neg_title"] - - pos_inputs = tokenizer(text=query, text_pair=pos_title, max_seq_len=max_seq_length) - neg_inputs = tokenizer(text=query, 
text_pair=neg_title, max_seq_len=max_seq_length)
-
-        pos_input_ids = pos_inputs["input_ids"]
-        pos_token_type_ids = pos_inputs["token_type_ids"]
-        neg_input_ids = neg_inputs["input_ids"]
-        neg_token_type_ids = neg_inputs["token_type_ids"]
-
-        return (pos_input_ids, pos_token_type_ids, neg_input_ids, neg_token_type_ids)
-
-    else:
-        query, title = example["query"], example["title"]
-
-        inputs = tokenizer(text=query, text_pair=title, max_seq_len=max_seq_length)
-
-        input_ids = inputs["input_ids"]
-        token_type_ids = inputs["token_type_ids"]
-        if phase == "eval":
-            return input_ids, token_type_ids, example["label"]
-        elif phase == "predict":
-            return input_ids, token_type_ids
-        else:
-            raise ValueError("not supported phase:{}".format(phase))
-
-
-def gen_pair(dataset, pool_size=100):
-    """
-    Generate triplets randomly based on the dataset
-
-    Args:
-        dataset: A `MapDataset` or `IterDataset` or a tuple of those.
-            Each example is composed of 2 texts: example["query"], example["title"]
-        pool_size: the number of examples pooled together when sampling
-            negative examples randomly
-
-    Returns:
-        dataset: A `MapDataset` or `IterDataset` or a tuple of those.
-            Each example is composed of 3 texts: example["query"],
-            example["title"], example["neg_title"]
-    """
-
-    if len(dataset) < pool_size:
-        pool_size = len(dataset)
-
-    new_examples = []
-    pool = []
-    tmp_examples = []
-
-    for example in dataset:
-        label = example["label"]
-
-        # Filter out negative examples
-        if label == 0:
-            continue
-
-        tmp_examples.append(example)
-        pool.append(example["title"])
-
-        if len(pool) >= pool_size:
-            np.random.shuffle(pool)
-            for idx, example in enumerate(tmp_examples):
-                example["neg_title"] = pool[idx]
-                new_examples.append(example)
-            tmp_examples = []
-            pool = []
-        else:
-            continue
-    return MapDataset(new_examples)
diff --git a/examples/model_interpretation/task/similarity/pretrained_models/model.py b/examples/model_interpretation/task/similarity/pretrained_models/model.py
deleted file mode 100644
index cf886ba69a85..000000000000
--- a/examples/model_interpretation/task/similarity/pretrained_models/model.py
+++ /dev/null
@@ -1,89 +0,0 @@
-# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
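
As a quick aside on `gen_pair` above: it buffers positive pairs until `pool_size` titles are collected, shuffles the pool, and hands each buffered example one random title as its `neg_title`. A minimal standalone sketch of that pool-based negative sampling, using toy data and hypothetical field values with no PaddleNLP dependency:

```python
import numpy as np

# Toy positive pairs; field names mirror what gen_pair expects.
examples = [{"query": f"q{i}", "title": f"t{i}", "label": 1} for i in range(4)]

pool_size = 4
pool, buffered, triplets = [], [], []
for ex in examples:
    if ex["label"] == 0:  # gen_pair drops non-matching pairs up front
        continue
    buffered.append(ex)
    pool.append(ex["title"])
    if len(pool) >= pool_size:  # pool full: shuffle, then assign negatives
        np.random.shuffle(pool)
        for ex2, neg_title in zip(buffered, pool):
            ex2["neg_title"] = neg_title
            triplets.append(ex2)
        buffered, pool = [], []

print(triplets[0])  # e.g. {'query': 'q0', 'title': 't0', 'label': 1, 'neg_title': 't2'}
```

Note that the shuffle can occasionally hand an example its own title back as the "negative"; the original `gen_pair` has the same property.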
-
-import paddle
-import paddle.nn as nn
-import paddle.nn.functional as F
-
-
-class PointwiseMatching(nn.Layer):
    def __init__(self, pretrained_model, dropout=None):
-        super().__init__()
-        self.ptm = pretrained_model
-        self.dropout = nn.Dropout(dropout if dropout is not None else 0.1)
-
-        # num_labels = 2 (similar or dissimilar)
-        self.classifier = nn.Linear(self.ptm.config["hidden_size"], 2)
-
-    def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None):
-
-        _, cls_embedding = self.ptm(input_ids, token_type_ids, position_ids, attention_mask)
-
-        cls_embedding = self.dropout(cls_embedding)
-        logits = self.classifier(cls_embedding)
-        probs = F.softmax(logits)
-
-        return probs
-
-
-class PairwiseMatching(nn.Layer):
-    def __init__(self, pretrained_model, dropout=None, margin=0.1):
-        super().__init__()
-        self.ptm = pretrained_model
-        self.dropout = nn.Dropout(dropout if dropout is not None else 0.1)
-        self.margin = margin
-
-        # hidden_size -> 1, compute the similarity score
-        self.similarity = nn.Linear(self.ptm.config["hidden_size"], 1)
-
-    def predict(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None):
-
-        _, cls_embedding = self.ptm(input_ids, token_type_ids, position_ids, attention_mask)
-
-        cls_embedding = self.dropout(cls_embedding)
-        sim_score = self.similarity(cls_embedding)
-        sim_score = F.sigmoid(sim_score)
-
-        return sim_score
-
-    def forward(
-        self,
-        pos_input_ids,
-        neg_input_ids,
-        pos_token_type_ids=None,
-        neg_token_type_ids=None,
-        pos_position_ids=None,
-        neg_position_ids=None,
-        pos_attention_mask=None,
-        neg_attention_mask=None,
-    ):
-
-        _, pos_cls_embedding = self.ptm(pos_input_ids, pos_token_type_ids, pos_position_ids, pos_attention_mask)
-
-        _, neg_cls_embedding = self.ptm(neg_input_ids, neg_token_type_ids, neg_position_ids, neg_attention_mask)
-
-        pos_embedding = self.dropout(pos_cls_embedding)
-        neg_embedding = self.dropout(neg_cls_embedding)
-
-        pos_sim = self.similarity(pos_embedding)
-        neg_sim = self.similarity(neg_embedding)
-
-        pos_sim = F.sigmoid(pos_sim)
-        neg_sim = F.sigmoid(neg_sim)
-
-        labels = paddle.full(shape=[pos_cls_embedding.shape[0]], fill_value=1.0, dtype="float32")
-
-        loss = F.margin_ranking_loss(pos_sim, neg_sim, labels, margin=self.margin)
-
-        return loss
diff --git a/examples/model_interpretation/task/similarity/pretrained_models/predict_pointwise.py b/examples/model_interpretation/task/similarity/pretrained_models/predict_pointwise.py
deleted file mode 100644
index 39c56528c9ec..000000000000
--- a/examples/model_interpretation/task/similarity/pretrained_models/predict_pointwise.py
+++ /dev/null
@@ -1,115 +0,0 @@
-# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
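
For intuition on the `PairwiseMatching` head defined above: it trains with a margin ranking loss over the sigmoid similarity scores of a positive and a negative pair. A minimal sketch of that objective with plain Paddle ops, using hypothetical scores in place of real model outputs:

```python
import paddle
import paddle.nn.functional as F

# Hypothetical sigmoid similarity scores for 3 (query, pos_title) and
# 3 (query, neg_title) pairs, standing in for PairwiseMatching outputs.
pos_sim = paddle.to_tensor([0.8, 0.6, 0.9])
neg_sim = paddle.to_tensor([0.3, 0.7, 0.2])

# label = 1 means "pos_sim should rank above neg_sim"; the loss is
# mean(max(0, -(pos_sim - neg_sim) + margin)), so a pair only contributes
# when pos_sim fails to beat neg_sim by at least `margin`.
labels = paddle.full(shape=[3], fill_value=1.0, dtype="float32")
loss = F.margin_ranking_loss(pos_sim, neg_sim, labels, margin=0.1)
print(float(loss))  # only the middle pair (0.6 vs 0.7) violates the margin
```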
-""" - This script is used for predicting results -""" -import argparse -import os -from functools import partial - -import numpy as np -import paddle -from data import convert_pointwise_example as convert_example -from data import create_dataloader, read_text_pair -from model import PointwiseMatching - -from paddlenlp.data import Pad, Tuple -from paddlenlp.datasets import load_dataset -from paddlenlp.transformers import AutoModel, AutoTokenizer - -parser = argparse.ArgumentParser() -parser.add_argument("--input_file", type=str, required=True, help="The full path of input file") -parser.add_argument("--params_path", type=str, required=True, help="The path to model parameters to be loaded.") -parser.add_argument( - "--max_seq_length", - default=64, - type=int, - help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.", -) -parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") -parser.add_argument( - "--device", choices=["cpu", "gpu"], default="gpu", help="Select which device to train model, defaults to gpu." -) -parser.add_argument("--language", choices=["ch", "en"], required=True, help="Language that the model is built for") -args = parser.parse_args() - - -def predict(model, data_loader): - """ - Predicts the data labels. - - Args: - model (obj:`SemanticIndexBase`): A model to extract text embedding or calculate similarity of text pair. - data_loader (obj:`List(Example)`): The processed data ids of text pair: - [query_input_ids, query_token_type_ids, title_input_ids, title_token_type_ids] - Returns: - results(obj:`List`): cosine similarity of text pairs. - """ - batch_probs = [] - - model.eval() - - with paddle.no_grad(): - for batch_data in data_loader: - input_ids, token_type_ids = batch_data - - input_ids = paddle.to_tensor(input_ids) - token_type_ids = paddle.to_tensor(token_type_ids) - - batch_prob = model(input_ids=input_ids, token_type_ids=token_type_ids).numpy() - - batch_probs.append(batch_prob) - - batch_probs = np.concatenate(batch_probs, axis=0) - - return batch_probs - - -if __name__ == "__main__": - paddle.set_device(args.device) - pretrained_model = AutoModel.from_pretrained("ernie-3.0-medium-zh") - tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh") - - trans_func = partial( - convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length, is_test=True, language=args.language - ) - - batchify_fn = lambda samples, fn=Tuple( - Pad(axis=0, pad_val=tokenizer.pad_token_id), # input_ids - Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # segment_ids - ): [data for data in fn(samples)] - - valid_ds = load_dataset(read_text_pair, data_path=args.input_file, lazy=False) - - valid_data_loader = create_dataloader( - valid_ds, mode="predict", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func - ) - - model = PointwiseMatching(pretrained_model) - - if args.params_path and os.path.isfile(args.params_path): - state_dict = paddle.load(args.params_path) - model.set_dict(state_dict) - print("Loaded parameters from %s" % args.params_path) - else: - raise ValueError("Please set --params_path with correct pretrained model file") - - y_probs = predict(model, valid_data_loader) - y_preds = np.argmax(y_probs, axis=1) - - valid_ds = load_dataset(read_text_pair, data_path=args.input_file, lazy=False) - for idx, y_pred in enumerate(y_preds): - text_pair = valid_ds[idx] - text_pair["pred_label"] = y_pred - 
print(text_pair) diff --git a/examples/model_interpretation/task/similarity/pretrained_models/run_train_pointwise.sh b/examples/model_interpretation/task/similarity/pretrained_models/run_train_pointwise.sh deleted file mode 100755 index 13771c1837ed..000000000000 --- a/examples/model_interpretation/task/similarity/pretrained_models/run_train_pointwise.sh +++ /dev/null @@ -1,32 +0,0 @@ -### - # This script is used to finetune pretrained models -### - -export CUDA_VISIBLE_DEVICES=7 - -LANGUAGE="ch" # ['ch', 'en'] -BASE_MODEL=roberta_large # [roberta_base, roberta_large] -timestamp=`date +"%Y%m%d_%H%M%S"` - -if [[ $LANGUAGE == "ch" ]]; then - LEARNING_RATE=3e-5 - MAX_SEQ_LENGTH=256 -elif [[ $LANGUAGE == "en" ]]; then - LEARNING_RATE=5e-6 - MAX_SEQ_LENGTH=128 -fi - -[ -d "logs" ] || mkdir -p "logs" -set -x - -python3 ./train_pointwise.py \ - --learning_rate $LEARNING_RATE \ - --max_seq_length $MAX_SEQ_LENGTH \ - --batch_size 32 \ - --epochs 5 \ - --save_step 1000 \ - --warmup_proportion 0.1 \ - --base_model $BASE_MODEL \ - --language $LANGUAGE \ - --save_dir saved_model_${LANGUAGE}/${BASE_MODEL}_${timestamp} >> logs/log_${BASE_MODEL}_${timestamp} - diff --git a/examples/model_interpretation/task/similarity/pretrained_models/train_pointwise.py b/examples/model_interpretation/task/similarity/pretrained_models/train_pointwise.py deleted file mode 100644 index 7e7bbc190efc..000000000000 --- a/examples/model_interpretation/task/similarity/pretrained_models/train_pointwise.py +++ /dev/null @@ -1,215 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import argparse -import os -import random -import sys -import time -from functools import partial - -import numpy as np -import paddle -from data import convert_pointwise_example as convert_example -from data import create_dataloader - -from paddlenlp.data import Pad, Stack, Tuple -from paddlenlp.datasets import load_dataset -from paddlenlp.transformers import LinearDecayWithWarmup -from paddlenlp.transformers.roberta.tokenizer import ( - RobertaBPETokenizer, - RobertaTokenizer, -) - -sys.path.append("..") -sys.path.append("../../..") -from roberta.modeling import RobertaForSequenceClassification # noqa: E402 - -sys.path.remove("../../..") -sys.path.remove("..") - -parser = argparse.ArgumentParser() -parser.add_argument("--base_model", type=str, choices=["roberta_base", "roberta_large"]) -parser.add_argument( - "--save_dir", - default="./checkpoint", - type=str, - help="The output directory where the model checkpoints will be written.", -) -parser.add_argument( - "--max_seq_length", - default=128, - type=int, - help="The maximum total input sequence length after tokenization. 
" - "Sequences longer than this will be truncated, sequences shorter will be padded.", -) -parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.") -parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.") -parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") -parser.add_argument("--epochs", default=3, type=int, help="Total number of training epochs to perform.") -parser.add_argument("--eval_step", default=1000, type=int, help="Step interval for evaluation.") -parser.add_argument("--save_step", default=1000, type=int, help="Step interval for saving checkpoint.") -parser.add_argument( - "--warmup_proportion", default=0.0, type=float, help="Linear warmup proportion over the training process." -) -parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") -parser.add_argument("--seed", type=int, default=1000, help="Random seed for initialization.") -parser.add_argument( - "--device", choices=["cpu", "gpu"], default="gpu", help="Select which device to train model, defaults to gpu." -) -parser.add_argument("--language", choices=["ch", "en"], required=True, help="Language that the model is built for") -args = parser.parse_args() - - -def set_seed(seed): - """sets random seed""" - random.seed(seed) - np.random.seed(seed) - paddle.seed(seed) - - -@paddle.no_grad() -def evaluate(model, criterion, metric, data_loader, phase="dev"): - """ - Given a dataset, it evals model and computes the metric. - - Args: - model(obj:`paddle.nn.Layer`): A model to classify texts. - data_loader(obj:`paddle.io.DataLoader`): The dataset loader which generates batches. - criterion(obj:`paddle.nn.Layer`): It can compute the loss. - metric(obj:`paddle.metric.Metric`): The evaluation metric. 
- """ - model.eval() - metric.reset() - losses = [] - for batch in data_loader: - input_ids, token_type_ids, labels = batch - probs = model(input_ids=input_ids, token_type_ids=token_type_ids) - loss = criterion(probs, labels) - losses.append(loss.numpy()) - correct = metric.compute(probs, labels) - metric.update(correct) - accu = metric.accumulate() - print("eval {} loss: {:.5}, accu: {:.5}".format(phase, np.mean(losses), accu)) - model.train() - metric.reset() - - -def do_train(): - paddle.set_device(args.device) - rank = paddle.distributed.get_rank() - if paddle.distributed.get_world_size() > 1: - paddle.distributed.init_parallel_env() - - set_seed(args.seed) - if args.language == "ch": - train_ds, dev_ds = load_dataset("lcqmc", splits=["train", "dev"]) - - if args.base_model == "roberta_base": - tokenizer = RobertaTokenizer.from_pretrained("roberta-wwm-ext") - pretrained_model = RobertaForSequenceClassification.from_pretrained("roberta-wwm-ext", num_classes=2) - elif args.base_model == "roberta_large": - tokenizer = RobertaTokenizer.from_pretrained("roberta-wwm-ext-large") - pretrained_model = RobertaForSequenceClassification.from_pretrained("roberta-wwm-ext-large", num_classes=2) - else: - train_ds, dev_ds = load_dataset("glue", "qqp", splits=["train", "dev"]) - - if args.base_model == "roberta_base": - tokenizer = RobertaBPETokenizer.from_pretrained("roberta-base") - pretrained_model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_classes=2) - elif args.base_model == "roberta_large": - tokenizer = RobertaBPETokenizer.from_pretrained("roberta-large") - pretrained_model = RobertaForSequenceClassification.from_pretrained("roberta-large", num_classes=2) - - trans_func = partial( - convert_example, tokenizer=tokenizer, max_seq_length=args.max_seq_length, language=args.language - ) - - batchify_fn = lambda samples, fn=Tuple( - Pad(axis=0, pad_val=tokenizer.pad_token_id), # text_pair_input - Pad(axis=0, pad_val=tokenizer.pad_token_type_id), # text_pair_segment - Stack(dtype="int64"), # label - ): [data for data in fn(samples)] - - train_data_loader = create_dataloader( - train_ds, mode="train", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func - ) - - dev_data_loader = create_dataloader( - dev_ds, mode="dev", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func - ) - - model = pretrained_model - - if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): - state_dict = paddle.load(args.init_from_ckpt) - model.set_dict(state_dict) - - model = paddle.DataParallel(model) - - num_training_steps = len(train_data_loader) * args.epochs - - lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) - - # Generate parameter names needed to perform weight decay. - # All bias and LayerNorm parameters are excluded. 
- decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] - optimizer = paddle.optimizer.AdamW( - learning_rate=lr_scheduler, - parameters=model.parameters(), - weight_decay=args.weight_decay, - apply_decay_param_fun=lambda x: x in decay_params, - ) - - criterion = paddle.nn.loss.CrossEntropyLoss() - metric = paddle.metric.Accuracy() - - global_step = 0 - tic_train = time.time() - for epoch in range(1, args.epochs + 1): - for step, batch in enumerate(train_data_loader, start=1): - input_ids, token_type_ids, labels = batch - probs = model(input_ids=input_ids, token_type_ids=token_type_ids) - loss = criterion(probs, labels) - correct = metric.compute(probs, labels) - metric.update(correct) - acc = metric.accumulate() - - global_step += 1 - if global_step % 100 == 0 and rank == 0: - print( - "global step %d, epoch: %d, batch: %d, loss: %.5f, accu: %.5f, speed: %.2f step/s" - % (global_step, epoch, step, loss, acc, 100 / (time.time() - tic_train)), - flush=True, - ) - tic_train = time.time() - loss.backward() - optimizer.step() - lr_scheduler.step() - optimizer.clear_grad() - - if global_step % args.eval_step == 0 and rank == 0: - evaluate(model, criterion, metric, dev_data_loader) - - if global_step % args.save_step == 0 and rank == 0: - save_dir = os.path.join(args.save_dir, "model_%d" % global_step) - if not os.path.exists(save_dir): - os.makedirs(save_dir) - save_param_path = os.path.join(save_dir, "model_state.pdparams") - paddle.save(model.state_dict(), save_param_path) - tokenizer.save_pretrained(save_dir) - - -if __name__ == "__main__": - do_train() diff --git a/examples/model_interpretation/task/similarity/roberta/modeling.py b/examples/model_interpretation/task/similarity/roberta/modeling.py deleted file mode 100644 index c5824a443f0a..000000000000 --- a/examples/model_interpretation/task/similarity/roberta/modeling.py +++ /dev/null @@ -1,618 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -""" - This script defines the model structure of roberta -""" -import sys - -import paddle -import paddle.nn as nn - -from paddlenlp.transformers.model_utils import PretrainedModel, register_base_model - -sys.path.append("../..") -from task.transformer import TransformerEncoder, TransformerEncoderLayer # noqa: E402 - -sys.path.remove("../..") - -__all__ = [ - "RobertaModel", - "RobertaPretrainedModel", - "RobertaForSequenceClassification", - "RobertaForTokenClassification", - "RobertaForQuestionAnswering", -] - - -class RobertaEmbeddings(nn.Layer): - r""" - Include embeddings from word, position and token_type embeddings. 
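-
-    When `position_ids` is not provided, positions 0..seq_len-1 are generated
-    via a cumulative sum over ones; when `token_type_ids` is not provided, a
-    zero tensor of the same shape as `input_ids` is used.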
- """ - - def __init__( - self, - vocab_size, - hidden_size=768, - hidden_dropout_prob=0.1, - max_position_embeddings=512, - type_vocab_size=16, - pad_token_id=0, - ): - super(RobertaEmbeddings, self).__init__() - self.word_embeddings = nn.Embedding(vocab_size, hidden_size, padding_idx=pad_token_id) - self.position_embeddings = nn.Embedding(max_position_embeddings, hidden_size) - self.token_type_embeddings = nn.Embedding(type_vocab_size, hidden_size) - self.layer_norm = nn.LayerNorm(hidden_size) - self.dropout = nn.Dropout(hidden_dropout_prob) - - def forward(self, input_ids, token_type_ids=None, position_ids=None): - """ - forward function - """ - if position_ids is None: - # maybe need use shape op to unify static graph and dynamic graph - ones = paddle.ones_like(input_ids, dtype="int64") - seq_length = paddle.cumsum(ones, axis=-1) - position_ids = seq_length - ones - position_ids.stop_gradient = True - if token_type_ids is None: - token_type_ids = paddle.zeros_like(input_ids, dtype="int64") - - input_embedings = self.word_embeddings(input_ids) - position_embeddings = self.position_embeddings(position_ids) - token_type_embeddings = self.token_type_embeddings(token_type_ids) - - embeddings = input_embedings + position_embeddings + token_type_embeddings - embeddings = self.layer_norm(embeddings) - embeddings = self.dropout(embeddings) - return embeddings - - -class RobertaPooler(nn.Layer): - """ - An abstract class for RobertaPooler - """ - - def __init__(self, hidden_size): - super(RobertaPooler, self).__init__() - self.dense = nn.Linear(hidden_size, hidden_size) - self.activation = nn.Tanh() - - def forward(self, hidden_states): - """ - We "pool" the model by simply taking the hidden state corresponding - to the first token. - """ - first_token_tensor = hidden_states[:, 0] - pooled_output = self.dense(first_token_tensor) - pooled_output = self.activation(pooled_output) - return pooled_output - - -class RobertaPretrainedModel(PretrainedModel): - r""" - An abstract class for pretrained RoBerta models. It provides RoBerta related - `model_config_file`, `pretrained_resource_files_map`, `resource_files_names`, - `pretrained_init_configuration`, `base_model_prefix` for downloading and - loading pretrained models. - Refer to :class:`~paddlenlp.transformers.model_utils.PretrainedModel` for more details. 
- - """ - - model_config_file = "model_config.json" - pretrained_init_configuration = { - "roberta-wwm-ext": { - "attention_probs_dropout_prob": 0.1, - "hidden_act": "gelu", - "hidden_dropout_prob": 0.1, - "hidden_size": 768, - "initializer_range": 0.02, - "intermediate_size": 3072, - "max_position_embeddings": 512, - "num_attention_heads": 12, - "num_hidden_layers": 12, - "type_vocab_size": 2, - "vocab_size": 21128, - "pad_token_id": 0, - }, - "roberta-wwm-ext-large": { - "attention_probs_dropout_prob": 0.1, - "hidden_act": "gelu", - "hidden_dropout_prob": 0.1, - "hidden_size": 1024, - "initializer_range": 0.02, - "intermediate_size": 4096, - "max_position_embeddings": 512, - "num_attention_heads": 16, - "num_hidden_layers": 24, - "type_vocab_size": 2, - "vocab_size": 21128, - "pad_token_id": 0, - }, - "rbt3": { - "attention_probs_dropout_prob": 0.1, - "hidden_act": "gelu", - "hidden_dropout_prob": 0.1, - "hidden_size": 768, - "initializer_range": 0.02, - "intermediate_size": 3072, - "max_position_embeddings": 512, - "num_attention_heads": 12, - "num_hidden_layers": 3, - "type_vocab_size": 2, - "vocab_size": 21128, - "pad_token_id": 0, - }, - "rbtl3": { - "attention_probs_dropout_prob": 0.1, - "hidden_act": "gelu", - "hidden_dropout_prob": 0.1, - "hidden_size": 1024, - "initializer_range": 0.02, - "intermediate_size": 4096, - "max_position_embeddings": 512, - "num_attention_heads": 16, - "num_hidden_layers": 3, - "type_vocab_size": 2, - "vocab_size": 21128, - "pad_token_id": 0, - }, - } - resource_files_names = {"model_state": "model_state.pdparams"} - pretrained_resource_files_map = { - "model_state": { - "roberta-wwm-ext": "https://paddlenlp.bj.bcebos.com/models/transformers/roberta_base/roberta_chn_base.pdparams", - "roberta-wwm-ext-large": "https://paddlenlp.bj.bcebos.com/models/transformers/roberta_large/roberta_chn_large.pdparams", - "rbt3": "https://paddlenlp.bj.bcebos.com/models/transformers/rbt3/rbt3_chn_large.pdparams", - "rbtl3": "https://paddlenlp.bj.bcebos.com/models/transformers/rbtl3/rbtl3_chn_large.pdparams", - } - } - base_model_prefix = "roberta" - - def _init_weights(self, layer): - """Initialization hook""" - if isinstance(layer, (nn.Linear, nn.Embedding)): - # only support dygraph, use truncated_normal and make it inplace - # and configurable later - layer.weight.set_value( - paddle.tensor.normal( - mean=0.0, - std=self.initializer_range - if hasattr(self, "initializer_range") - else self.roberta.config["initializer_range"], - shape=layer.weight.shape, - ) - ) - elif isinstance(layer, nn.LayerNorm): - layer._epsilon = 1e-12 - - -@register_base_model -class RobertaModel(RobertaPretrainedModel): - r""" - The bare Roberta Model outputting raw hidden-states. - - This model inherits from :class:`~paddlenlp.transformers.model_utils.PretrainedModel`. - Refer to the superclass documentation for the generic methods. - - This model is also a Paddle `paddle.nn.Layer `__ subclass. Use it as a regular Paddle Layer - and refer to the Paddle documentation for all matter related to general usage and behavior. - - Args: - vocab_size (int): - Vocabulary size of `inputs_ids` in `RobertaModel`. Also is the vocab size of token embedding matrix. - Defines the number of different tokens that can be represented by the `inputs_ids` passed when calling `RobertaModel`. - hidden_size (int, optional): - Dimensionality of the embedding layer, encoder layers and pooler layer. Defaults to `768`. - num_hidden_layers (int, optional): - Number of hidden layers in the Transformer encoder. 
Defaults to `12`. - num_attention_heads (int, optional): - Number of attention heads for each attention layer in the Transformer encoder. - Defaults to `12`. - intermediate_size (int, optional): - Dimensionality of the feed-forward (ff) layer in the encoder. Input tensors - to ff layers are firstly projected from `hidden_size` to `intermediate_size`, - and then projected back to `hidden_size`. Typically `intermediate_size` is larger than `hidden_size`. - Defaults to `3072`. - hidden_act (str, optional): - The non-linear activation function in the feed-forward layer. - ``"gelu"``, ``"relu"`` and any other paddle supported activation functions - are supported. Defaults to ``"gelu"``. - hidden_dropout_prob (float, optional): - The dropout probability for all fully connected layers in the embeddings and encoder. - Defaults to `0.1`. - attention_probs_dropout_prob (float, optional): - The dropout probability used in MultiHeadAttention in all encoder layers to drop some attention target. - Defaults to `0.1`. - max_position_embeddings (int, optional): - The maximum value of the dimensionality of position encoding, which dictates the maximum supported length of an input - sequence. Defaults to `512`. - type_vocab_size (int, optional): - The vocabulary size of the `token_type_ids` passed when calling `~transformers.RobertaModel`. - Defaults to `2`. - initializer_range (float, optional): - The standard deviation of the normal initializer. Defaults to 0.02. - - .. note:: - A normal_initializer initializes weight matrices as normal distributions. - See :meth:`RobertaPretrainedModel._init_weights()` for how weights are initialized in `RobertaModel`. - - pad_token_id(int, optional): - The index of padding token in the token vocabulary. - Defaults to `0`. - """ - - def __init__( - self, - vocab_size, - hidden_size=768, - num_hidden_layers=12, - num_attention_heads=12, - intermediate_size=3072, - hidden_act="gelu", - hidden_dropout_prob=0.1, - attention_probs_dropout_prob=0.1, - max_position_embeddings=512, - type_vocab_size=16, - initializer_range=0.02, - layer_norm_eps=1e-12, - pad_token_id=0, - ): - super(RobertaModel, self).__init__() - self.pad_token_id = pad_token_id - self.initializer_range = initializer_range - self.embeddings = RobertaEmbeddings( - vocab_size, hidden_size, hidden_dropout_prob, max_position_embeddings, type_vocab_size, pad_token_id - ) - encoder_layer = TransformerEncoderLayer( - hidden_size, - num_attention_heads, - intermediate_size, - dropout=hidden_dropout_prob, - activation=hidden_act, - attn_dropout=attention_probs_dropout_prob, - act_dropout=0, - ) - self.encoder = TransformerEncoder(encoder_layer, num_hidden_layers) - self.pooler = RobertaPooler(hidden_size) - - def forward( - self, - input_ids, - token_type_ids=None, - position_ids=None, - attention_mask=None, - noise=None, - i=None, - n_samples=None, - ): - r""" - Args: - input_ids (Tensor): - Indices of input sequence tokens in the vocabulary. They are - numerical representations of tokens that build the input sequence. - It's data type should be `int64` and has a shape of [batch_size, sequence_length]. - token_type_ids (Tensor, optional): - Segment token indices to indicate first and second portions of the inputs. - Indices can be either 0 or 1: - - - 0 corresponds to a **sentence A** token, - - 1 corresponds to a **sentence B** token. - - It's data type should be `int64` and has a shape of [batch_size, sequence_length]. - Defaults to None, which means no segment embeddings is added to token embeddings. 
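-            noise (str, optional):
-                Perturbation mode used when interpreting the model; as written,
-                only ``"INTEGRATED"`` takes effect, interpolating the input
-                embeddings between a baseline ([CLS] + padding + [SEP])
-                embedding and the true one. Defaults to `None`.
-            i (int, optional):
-                Index of the current interpolation step (0-based).
-            n_samples (int, optional):
-                Total number of interpolation steps used for integrated gradients.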
- position_ids (Tensor, optional): - Indices of positions of each input sequence tokens in the position embeddings. - Selected in the range ``[0, max_position_embeddings - 1]``. - It's data type should be `int64` and has a shape of [batch_size, sequence_length]. - Defaults to `None`. - attention_mask (Tensor, optional): - Mask used in multi-head attention to avoid performing attention to some unwanted positions, - usually the paddings or the subsequent positions. - Its data type can be int, float and bool. - When the data type is bool, the `masked` tokens have `False` values and the others have `True` values. - When the data type is int, the `masked` tokens have `0` values and the others have `1` values. - When the data type is float, the `masked` tokens have `-INF` values and the others have `0` values. - It is a tensor with shape broadcasted to `[batch_size, num_attention_heads, sequence_length, sequence_length]`. - For example, its shape can be [batch_size, sequence_length], [batch_size, sequence_length, sequence_length], - [batch_size, num_attention_heads, sequence_length, sequence_length]. - Defaults to `None`, which means nothing needed to be prevented attention to. - - Returns: - tuple: Returns tuple (`sequence_output`, `pooled_output`). - - With the fields: - - - sequence_output (Tensor): - Sequence of hidden-states at the last layer of the model. - It's data type should be float32 and its shape is [batch_size, sequence_length, hidden_size]. - - - pooled_output (Tensor): - The output of first token (`[CLS]`) in sequence. - We "pool" the model by simply taking the hidden state corresponding to the first token. - Its data type should be float32 and its shape is [batch_size, hidden_size]. - - Example: - .. code-block:: - - import paddle - from paddlenlp.transformers import RobertaModel, RobertaTokenizer - - tokenizer = RobertaTokenizer.from_pretrained('roberta-wwm-ext') - model = RobertaModel.from_pretrained('roberta-wwm-ext') - - inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!") - inputs = {k:paddle.to_tensor([v]) for (k, v) in inputs.items()} - sequence_output, pooled_output = model(**inputs) - - """ - if attention_mask is None: - attention_mask = paddle.unsqueeze( - (input_ids == self.pad_token_id).astype(self.pooler.dense.weight.dtype) * -1e9, axis=[1, 2] - ) - # CLS: 101; SEP: 102; PAD: 0 - baseline_ids = paddle.to_tensor( - [101] + [0] * (input_ids.shape[1] - 2) + [102], - dtype=input_ids.dtype, - place=input_ids.place, - stop_gradient=input_ids.stop_gradient, - ) - - embedding_output = self.embeddings( - input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids - ) - baseline_embedding_output = self.embeddings( - input_ids=baseline_ids, position_ids=position_ids, token_type_ids=token_type_ids - ) - - if noise is not None: - if noise.upper() == "GAUSSIAN": - pass - if noise.upper() == "INTEGRATED": - embedding_output = baseline_embedding_output + i / (n_samples - 1) * ( - embedding_output - baseline_embedding_output - ) - else: - raise ValueError("unsupported noise method: %s" % (noise)) - - encoder_outputs, att_weights_list = self.encoder(embedding_output, attention_mask) # interpret - sequence_output = encoder_outputs - pooled_output = self.pooler(sequence_output) - result = [sequence_output, pooled_output, att_weights_list] - result.append(embedding_output) - return result - - -class RobertaForQuestionAnswering(RobertaPretrainedModel): - r""" - Roberta Model with a linear layer on top of the hidden-states output to - compute 
`span_start_logits` and `span_end_logits`, designed for question-answering tasks like SQuAD. - - Args: - roberta (:class:`RobertaModel`): - An instance of RobertaModel. - dropout (float, optional): - The dropout probability for output of Roberta. - If None, use the same value as `hidden_dropout_prob` of `RobertaModel` - instance `roberta`. Defaults to `None`. - """ - - def __init__(self, roberta, dropout=None): - super(RobertaForQuestionAnswering, self).__init__() - self.roberta = roberta # allow roberta to be config - self.classifier = nn.Linear(self.roberta.config["hidden_size"], 2) - - def forward(self, input_ids, token_type_ids=None): - r""" - Args: - input_ids (Tensor): - See :class:`RobertaModel`. - token_type_ids (Tensor, optional): - See :class:`RobertaModel`. - position_ids (Tensor, optional): - See :class:`RobertaModel`. - attention_mask (Tensor, optional): - See :class:`RobertaModel`. - - Returns: - tuple: Returns tuple (`start_logits`, `end_logits`). - - With the fields: - - - `start_logits` (Tensor): - A tensor of the input token classification logits, indicates the start position of the labelled span. - Its data type should be float32 and its shape is [batch_size, sequence_length]. - - - `end_logits` (Tensor): - A tensor of the input token classification logits, indicates the end position of the labelled span. - Its data type should be float32 and its shape is [batch_size, sequence_length]. - - Example: - .. code-block:: - - import paddle - from paddlenlp.transformers import RobertaForSequenceClassification, RobertaTokenizer - - tokenizer = RobertaTokenizer.from_pretrained('roberta-wwm-ext') - model = RobertaForSequenceClassification.from_pretrained('roberta-wwm-ext') - - inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!") - inputs = {k:paddle.to_tensor([v]) for (k, v) in inputs.items()} - logits = model(**inputs) - - """ - sequence_output, _ = self.roberta( - input_ids, token_type_ids=token_type_ids, position_ids=None, attention_mask=None - ) - - logits = self.classifier(sequence_output) - logits = paddle.transpose(logits, perm=[2, 0, 1]) - start_logits, end_logits = paddle.unstack(x=logits, axis=0) - - return start_logits, end_logits - - -class RobertaForSequenceClassification(RobertaPretrainedModel): - r""" - Roberta Model with a linear layer on top of the output layer, - designed for sequence classification/regression tasks like GLUE tasks. - - Args: - roberta (:class:`RobertaModel`): - An instance of `RobertaModel`. - num_classes (int, optional): - The number of classes. Defaults to `2`. - dropout (float, optional): - The dropout probability for output of Roberta. - If None, use the same value as `hidden_dropout_prob` - of `RobertaModel` instance `roberta`. Defaults to `None`. - """ - - def __init__(self, roberta, num_classes=2, dropout=None): - super(RobertaForSequenceClassification, self).__init__() - self.num_classes = num_classes - self.roberta = roberta # allow roberta to be config - self.dropout = nn.Dropout(dropout if dropout is not None else self.roberta.config["hidden_dropout_prob"]) - self.classifier = nn.Linear(self.roberta.config["hidden_size"], num_classes) - self.softmax = nn.Softmax() - - def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): - r""" - Args: - input_ids (Tensor): - See :class:`RobertaModel`. - token_type_ids (Tensor, optional): - See :class:`RobertaModel`. - position_ids (Tensor, optional): - See :class:`RobertaModel`. - attention_mask (Tensor, optional): - See :class:`RobertaModel`. 
- - Returns: - Tensor: Returns tensor `logits`, a tensor of the input text classification logits. - Its data type should be float32 and it has a shape of [batch_size, num_classes]. - - Example: - .. code-block:: - - import paddle - from paddlenlp.transformers import RobertaForSequenceClassification, RobertaTokenizer - - tokenizer = RobertaTokenizer.from_pretrained('roberta-wwm-ext') - model = RobertaForSequenceClassification.from_pretrained('roberta-wwm-ext') - - inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!") - inputs = {k:paddle.to_tensor([v]) for (k, v) in inputs.items()} - logits = model(**inputs) - - """ - _, pooled_output, _, _ = self.roberta( - input_ids, token_type_ids=token_type_ids, position_ids=position_ids, attention_mask=attention_mask - ) - - pooled_output = self.dropout(pooled_output) - logits = self.classifier(pooled_output) - return logits - - def forward_interpret( - self, - input_ids, - token_type_ids=None, - position_ids=None, - attention_mask=None, - noise=None, - i=None, - n_samples=None, - ): - """ - The forward function used when we are interpreting the model - """ - _, pooled_output, att_weights_list, embedding_output = self.roberta( - input_ids, - token_type_ids=token_type_ids, - position_ids=position_ids, - attention_mask=attention_mask, - noise=noise, - i=i, - n_samples=n_samples, - ) - - pooled_output = self.dropout(pooled_output) - logits = self.classifier(pooled_output) - probs = self.softmax(logits) - - return probs, att_weights_list, embedding_output - - -class RobertaForTokenClassification(RobertaPretrainedModel): - r""" - Roberta Model with a linear layer on top of the hidden-states output layer, - designed for token classification tasks like NER tasks. - - Args: - roberta (:class:`RobertaModel`): - An instance of `RobertaModel`. - num_classes (int, optional): - The number of classes. Defaults to `2`. - dropout (float, optional): - The dropout probability for output of Roberta. - If None, use the same value as `hidden_dropout_prob` - of `RobertaModel` instance `roberta`. Defaults to `None`. - """ - - def __init__(self, roberta, num_classes=2, dropout=None): - super(RobertaForTokenClassification, self).__init__() - self.num_classes = num_classes - self.roberta = roberta # allow roberta to be config - self.dropout = nn.Dropout(dropout if dropout is not None else self.roberta.config["hidden_dropout_prob"]) - self.classifier = nn.Linear(self.roberta.config["hidden_size"], num_classes) - - def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None): - r""" - Args: - input_ids (Tensor): - See :class:`RobertaModel`. - token_type_ids (Tensor, optional): - See :class:`RobertaModel`. - position_ids (Tensor, optional): - See :class:`RobertaModel`. - attention_mask (Tensor, optional): - See :class:`RobertaModel`. - - Returns: - Tensor: Returns tensor `logits`, a tensor of the input token classification logits. - Shape as `[batch_size, sequence_length, num_classes]` and dtype as `float32`. - - Example: - .. 
code-block:: - - import paddle - from paddlenlp.transformers import RobertaForTokenClassification, RobertaTokenizer - - tokenizer = RobertaTokenizer.from_pretrained('roberta-wwm-ext') - model = RobertaForTokenClassification.from_pretrained('roberta-wwm-ext') - - inputs = tokenizer("Welcome to use PaddlePaddle and PaddleNLP!") - inputs = {k:paddle.to_tensor([v]) for (k, v) in inputs.items()} - logits = model(**inputs) - - """ - sequence_output, _ = self.roberta( - input_ids, token_type_ids=token_type_ids, position_ids=position_ids, attention_mask=attention_mask - ) - - sequence_output = self.dropout(sequence_output) - logits = self.classifier(sequence_output) - return logits diff --git a/examples/model_interpretation/task/similarity/run_inter.sh b/examples/model_interpretation/task/similarity/run_inter.sh deleted file mode 100755 index e9de8e11df87..000000000000 --- a/examples/model_interpretation/task/similarity/run_inter.sh +++ /dev/null @@ -1,61 +0,0 @@ -### - # This file contains script to generate saliency map of a specific baseline model and language on given input data - # The result of this script will be used to evaluate the interpretive performance of the baseline model -### -export CUDA_VISIBLE_DEVICES=7 -export PYTHONPATH=./:$PYTHONPATH - -LANGUAGE=ch # LANGUAGE choose in [ch, en] -BASE_MODEL=roberta_base # BASE_MODEL choose in [roberta_base, roberta_large, lstm] -INTER_MODE=lime # INTER_MODE choice in [attention, integrated_gradient, lime] -TASK=similarity_${LANGUAGE} -DATA=../../data/${TASK} -START_ID=0 - -if [[ $LANGUAGE == "ch" ]]; then - - if [[ $BASE_MODEL == "roberta_base" ]]; then - FROM_PRETRAIN='roberta-wwm-ext' - CKPT=pretrained_models/saved_model_ch/roberta_base_20211018_104038/model_11400/model_state.pdparams - #CKPT=pretrained_models/saved_model_ch/roberta_base_20211208_121026/model_12000/model_state.pdparams - elif [[ $BASE_MODEL == "roberta_large" ]]; then - FROM_PRETRAIN='roberta-wwm-ext-large' - CKPT=pretrained_models/saved_model_ch/roberta_large_20211018_152833/model_22000/model_state.pdparams - #CKPT=pretrained_models/saved_model_ch/roberta_large_20211208_131546/model_22000/model_state.pdparams - elif [[ $BASE_MODEL == "lstm" ]]; then - FROM_PRETRAIN='skep_ernie_1.0_large_ch' - CKPT=simnet/checkpoints_ch/final.pdparams - fi - -elif [[ $LANGUAGE == "en" ]]; then - if [[ $BASE_MODEL == "roberta_base" ]]; then - FROM_PRETRAIN=roberta-base - CKPT=pretrained_models/saved_model_en/roberta_base_20211109_205245/model_54000/model_state.pdparams - #CKPT=pretrained_models/saved_model_en/roberta_base_20211208_121339/model_54000/model_state.pdparams - elif [[ $BASE_MODEL == "roberta_large" ]]; then - FROM_PRETRAIN=roberta-large - CKPT=pretrained_models/saved_model_en/roberta_large_20211109_205649/model_46000/model_state.pdparams - #CKPT=pretrained_models/saved_model_en/roberta_large_20211208_131440/model_42000/model_state.pdparams - elif [[ $BASE_MODEL == "lstm" ]]; then - FROM_PRETRAIN='data/skep_ernie_1.0_large_ch' - CKPT=simnet/checkpoints_en/final.pdparams - fi -fi - -OUTPUT=./output/$TASK.$BASE_MODEL -[ -d $OUTPUT ] || mkdir -p $OUTPUT -set -x - -python3 ./saliency_map/similarity_interpretable.py \ - --base_model $BASE_MODEL \ - --data_dir $DATA \ - --from_pretrained $FROM_PRETRAIN \ - --batch_size 1 \ - --max_seq_len 256 \ - --init_checkpoint $CKPT \ - --inter_mode $INTER_MODE \ - --start_id $START_ID \ - --output_dir $OUTPUT \ - --n-samples 500 \ - --language $LANGUAGE \ - --eval $@ diff --git a/examples/model_interpretation/task/similarity/run_inter_all.sh 
b/examples/model_interpretation/task/similarity/run_inter_all.sh deleted file mode 100755 index edabd07d6f41..000000000000 --- a/examples/model_interpretation/task/similarity/run_inter_all.sh +++ /dev/null @@ -1,69 +0,0 @@ -### - # This file contains script to generate saliency map of all baseline models and languages on given input data - # The result of this script will be used to evaluate the interpretive performance of the baseline model -### -export CUDA_VISIBLE_DEVICES=4 -export PYTHONPATH=./:$PYTHONPATH - -START_ID=0 - -for BASE_MODEL in "lstm" "roberta_base" "roberta_large"; -do - for INTER_MODE in "attention" "integrated_gradient" "lime"; - do - for LANGUAGE in "ch" "en"; - do - TASK=similarity_${LANGUAGE} - DATA=../../data/${TASK} - - if [[ $LANGUAGE == "ch" ]]; then - - if [[ $BASE_MODEL == "roberta_base" ]]; then - FROM_PRETRAIN='roberta-wwm-ext' - CKPT=pretrained_models/saved_model_ch/roberta_base_20211018_104038/model_11400/model_state.pdparams - #CKPT=pretrained_models/saved_model_ch/roberta_base_20211208_121026/model_12000/model_state.pdparams - elif [[ $BASE_MODEL == "roberta_large" ]]; then - FROM_PRETRAIN='roberta-wwm-ext-large' - CKPT=pretrained_models/saved_model_ch/roberta_large_20211018_152833/model_22000/model_state.pdparams - #CKPT=pretrained_models/saved_model_ch/roberta_large_20211208_131546/model_22000/model_state.pdparams - elif [[ $BASE_MODEL == "lstm" ]]; then - FROM_PRETRAIN='data/skep_ernie_1.0_large_ch' - CKPT=simnet/checkpoints_ch/final.pdparams - fi - - elif [[ $LANGUAGE == "en" ]]; then - if [[ $BASE_MODEL == "roberta_base" ]]; then - FROM_PRETRAIN=roberta-base - CKPT=pretrained_models/saved_model_en/roberta_base_20211109_205245/model_54000/model_state.pdparams - #CKPT=pretrained_models/saved_model_en/roberta_base_20211208_121339/model_54000/model_state.pdparams - elif [[ $BASE_MODEL == "roberta_large" ]]; then - FROM_PRETRAIN=roberta-large - CKPT=pretrained_models/saved_model_en/roberta_large_20211109_205649/model_46000/model_state.pdparams - #CKPT=pretrained_models/saved_model_en/roberta_large_20211208_131440/model_42000/model_state.pdparams - elif [[ $BASE_MODEL == "lstm" ]]; then - FROM_PRETRAIN='data/skep_ernie_1.0_large_ch' - CKPT=simnet/checkpoints_en/final.pdparams - fi - fi - - OUTPUT=./output/$TASK.$BASE_MODEL - [ -d $OUTPUT ] || mkdir -p $OUTPUT - set -x - if [[ ! -f ${OUTPUT}/interpret.${INTER_MODE} ]]; then - python3 ./saliency_map/similarity_interpretable.py \ - --base_model $BASE_MODEL \ - --data_dir $DATA \ - --from_pretrained $FROM_PRETRAIN \ - --batch_size 1 \ - --max_seq_len 256 \ - --init_checkpoint $CKPT \ - --inter_mode $INTER_MODE \ - --start_id $START_ID \ - --output_dir $OUTPUT \ - --n-samples 500 \ - --language $LANGUAGE \ - --eval $@ - fi - done - done -done diff --git a/examples/model_interpretation/task/similarity/saliency_map/similarity_interpretable.py b/examples/model_interpretation/task/similarity/saliency_map/similarity_interpretable.py deleted file mode 100644 index 730640962190..000000000000 --- a/examples/model_interpretation/task/similarity/saliency_map/similarity_interpretable.py +++ /dev/null @@ -1,646 +0,0 @@ -# !/usr/bin/env python3 -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -import argparse -import collections -import json -import logging -import os -import re -import sys -from functools import partial -from pathlib import Path - -import numpy as np -import paddle -from LIME.lime_text import LimeTextExplainer -from roberta.modeling import RobertaForSequenceClassification -from simnet.model import SimNet -from simnet.utils import CharTokenizer, preprocess_data -from tqdm import tqdm - -from paddlenlp.data import Dict, Pad, Stack, Tuple, Vocab -from paddlenlp.datasets import DatasetBuilder -from paddlenlp.transformers.roberta.tokenizer import ( - RobertaBPETokenizer, - RobertaTokenizer, -) - -sys.path.append("../../..") -from model_interpretation.utils import ( # noqa: E402 - convert_tokenizer_res_to_old_version, - match, -) - -sys.path.remove("../../..") - -log = logging.getLogger(__name__) -log.setLevel(logging.DEBUG) -logging.getLogger().setLevel(logging.DEBUG) - - -def get_args(): - parser = argparse.ArgumentParser("interpret textual similarity task") - parser.add_argument("--base_model", required=True, choices=["roberta_base", "roberta_large", "lstm"]) - parser.add_argument("--from_pretrained", type=str, required=True, help="pretrained model directory or tag") - parser.add_argument( - "--max_seq_len", type=int, default=128, help="max sentence length, should not greater than 512" - ) - parser.add_argument("--batch_size", type=int, default=1, help="batchsize") - parser.add_argument("--data_dir", type=str, required=True, help="data directory includes train / develop data") - parser.add_argument("--eval", action="store_true") - parser.add_argument("--init_checkpoint", type=str, default=None, help="checkpoint to warm start from") - parser.add_argument("--wd", type=float, default=0.01, help="weight decay, aka L2 regularizer") - parser.add_argument( - "--use_amp", - action="store_true", - help="only activate AMP(auto mixed precision accelatoin) on TensorCore compatible devices", - ) - parser.add_argument( - "--inter_mode", - type=str, - default="attention", - choices=["attention", "simple_gradient", "smooth_gradient", "integrated_gradient", "lime"], - help="appoint the mode of interpretable.", - ) - parser.add_argument("--n-samples", type=int, default=25, help="number of samples used for smooth gradient method") - parser.add_argument("--output_dir", type=Path, required=True, help="interpretable output directory") - parser.add_argument("--start_id", type=int, default=0) - parser.add_argument("--language", type=str, required=True, help="Language that the model is based on") - args = parser.parse_args() - return args - - -class Similarity_data(DatasetBuilder): - def _read(self, filename): - with open(filename, "r", encoding="utf8") as f: - for line in f.readlines(): - line_split = json.loads(line) - if args.language == "ch": - yield { - "id": line_split["id"], - "query": line_split["query"], - "title": line_split["title"], - "text_q_seg": line_split["text_q_seg"], - "text_t_seg": line_split["text_t_seg"], - } - else: - yield { - "id": line_split["id"], - "sentence1": line_split["sentence1"], - "sentence2": line_split["sentence2"], - "text_q_seg": 
line_split["text_q_seg"], - "text_t_seg": line_split["text_t_seg"], - } - - -def map_fn_senti(examples, tokenizer, language): - print("load data %d" % len(examples)) - if language == "ch": - q_name = "query" - t_name = "title" - queries = [example[q_name] for example in examples] - titles = [example[t_name] for example in examples] - else: - q_name = "sentence1" - t_name = "sentence2" - queries = [example[q_name].encode("ascii", errors="replace").decode("UTF-8") for example in examples] - titles = [example[t_name].encode("ascii", errors="replace").decode("UTF-8") for example in examples] - tokenized_examples = tokenizer(queries, titles, max_seq_len=args.max_seq_len) - - tokenized_examples = convert_tokenizer_res_to_old_version(tokenized_examples) - - for i in range(len(tokenized_examples)): - tokenized_examples[i]["query_offset_mapping"] = ( - [(0, 0)] + tokenizer.get_offset_mapping(queries[i])[: args.max_seq_len - 2] + [(0, 0)] - ) - tokenized_examples[i]["title_offset_mapping"] = ( - [(0, 0)] + tokenizer.get_offset_mapping(titles[i])[: args.max_seq_len - 2] + [(0, 0)] - ) - - return tokenized_examples - - -def init_roberta_var(args): - if args.language == "ch": - tokenizer = RobertaTokenizer.from_pretrained(args.from_pretrained) - else: - tokenizer = RobertaBPETokenizer.from_pretrained(args.from_pretrained) - - model = RobertaForSequenceClassification.from_pretrained( - args.from_pretrained, - hidden_dropout_prob=0, - attention_probs_dropout_prob=0, - dropout=0, - num_labels=2, - name="", - return_inter_score=True, - ) - - map_fn = partial(map_fn_senti, tokenizer=tokenizer, language=args.language) - - dev_ds = Similarity_data().read(args.data_dir) - dev_ds.map(map_fn, batched=True) - dev_batch_sampler = paddle.io.BatchSampler(dev_ds, batch_size=args.batch_size, shuffle=False) - batchify_fn = lambda samples, fn=Dict( - { - "input_ids": Pad(axis=0, pad_val=tokenizer.pad_token_id), - "token_type_ids": Pad(axis=0, pad_val=tokenizer.pad_token_type_id), - "query_offset_mapping": Pad(axis=0, pad_val=tokenizer.pad_token_id), - "title_offset_mapping": Pad(axis=0, pad_val=tokenizer.pad_token_id), - } - ): fn(samples) - - dataloader = paddle.io.DataLoader( - dataset=dev_ds, batch_sampler=dev_batch_sampler, collate_fn=batchify_fn, return_list=True - ) - - return model, tokenizer, dataloader, dev_ds - - -def init_lstm_var(args): - if args.language == "ch": - vocab = Vocab.load_vocabulary("simnet/vocab.char", unk_token="[UNK]", pad_token="[PAD]") - else: - vocab = Vocab.load_vocabulary("simnet/vocab_QQP", unk_token="[UNK]", pad_token="[PAD]") - - tokenizer = CharTokenizer(vocab, args.language, "../../punctuations") - model = SimNet(network="lstm", vocab_size=len(vocab), num_classes=2) - - dev_ds = Similarity_data().read(args.data_dir) - dev_examples = preprocess_data(dev_ds.data, tokenizer, args.language) - batches = [dev_examples[idx : idx + args.batch_size] for idx in range(0, len(dev_examples), args.batch_size)] - - batchify_fn = lambda samples, fn=Tuple( - Pad(axis=0, pad_val=vocab.token_to_idx.get("[PAD]", 0)), # query_ids - Pad(axis=0, pad_val=vocab.token_to_idx.get("[PAD]", 0)), # title_ids - Stack(dtype="int64"), # query_seq_lens - Stack(dtype="int64"), # title_seq_lens - ): [data for data in fn(samples)] - - return model, tokenizer, batches, batchify_fn, vocab, dev_ds - - -def get_seq_token_num(language): - if language == "ch": - add_idx = 1 - else: - add_idx = 2 - return add_idx - - -def get_qt_tokens(base_model, d, add_idx=None, tokenizer=None, batchify_fn=None, vocab=None): - SEP_idx = 0 - 
if base_model == "roberta": - input_ids, token_type_ids, query_offset_map, title_offset_map = d - fwd_args = [input_ids, token_type_ids] - fwd_kwargs = {} - - SEP_idx = input_ids.tolist()[0].index(tokenizer.sep_token_id) - q_tokens = tokenizer.convert_ids_to_tokens(input_ids[0, 1:SEP_idx].tolist()) # list - t_tokens = tokenizer.convert_ids_to_tokens(input_ids[0, SEP_idx + add_idx : -1].tolist()) # list - q_offset = query_offset_map[0, 1:-1].tolist() - t_offset = title_offset_map[0, 1:-1].tolist() - return q_tokens, t_tokens, SEP_idx, fwd_args, fwd_kwargs, q_offset, t_offset - - if base_model == "lstm": - query_ids, title_ids, query_seq_lens, title_seq_lens = batchify_fn(d) - query_ids = paddle.to_tensor(query_ids) - title_ids = paddle.to_tensor(title_ids) - query_seq_lens = paddle.to_tensor(query_seq_lens) - title_seq_lens = paddle.to_tensor(title_seq_lens) - - fwd_args = [query_ids, title_ids, query_seq_lens, title_seq_lens] - fwd_kwargs = {} - q_tokens = [vocab._idx_to_token[idx] for idx in query_ids.tolist()[0]] - t_tokens = [vocab._idx_to_token[idx] for idx in title_ids.tolist()[0]] - return q_tokens, t_tokens, SEP_idx, fwd_args, fwd_kwargs - - -def extract_attention_scores(args, result, atts, q_tokens, t_tokens, out_handle, SEP_idx, q_offset, t_offset, add_idx): - if args.base_model.startswith("roberta"): - inter_score = atts[-1][:, :, 0, :].mean(1) # (bsz, seq) - q_inter_score = inter_score[0][1:SEP_idx] # remove CLS and SEP - t_inter_score = inter_score[0][SEP_idx + add_idx : -1] # remove CLS and SEP - elif args.base_model == "lstm": - q_inter_score = atts[0][0] - t_inter_score = atts[1][0] - - q_length = (q_inter_score > 0).cast("int32").sum(-1)[0] - t_length = (t_inter_score > 0).cast("int32").sum(-1)[0] - assert len(q_tokens) == q_length, f"{len(q_tokens)} != {q_length}" - assert len(t_tokens) == t_length, f"{len(t_tokens)} != {t_length}" - - q_char_attribution_dict, t_char_attribution_dict = {}, {} - if args.base_model.startswith("roberta"): - # Query - sorted_token = [] - for i in range(len(q_inter_score)): - sorted_token.append([i, q_offset[i], q_inter_score[i]]) - q_char_attribution_dict = match(result["query"], result["text_q_seg"], sorted_token) - result["query_char_attri"] = collections.OrderedDict() - for token_info in sorted(q_char_attribution_dict, key=lambda x: x[2], reverse=True): - result["query_char_attri"][str(token_info[0])] = [str(token_info[1]), float(token_info[2])] - result.pop("text_q_seg") - - # Title - sorted_token = [] - for i in range(len(t_inter_score)): - sorted_token.append([i, t_offset[i], t_inter_score[i]]) - t_char_attribution_dict = match(result["title"], result["text_t_seg"], sorted_token) - result["title_char_attri"] = collections.OrderedDict() - for token_info in sorted(t_char_attribution_dict, key=lambda x: x[2], reverse=True): - result["title_char_attri"][str(token_info[0])] = [str(token_info[1]), float(token_info[2])] - result.pop("text_t_seg") - - else: - idx = 0 - for token, score in zip(q_tokens, q_inter_score.tolist()): - q_char_attribution_dict[idx] = (token, score) - idx += 1 - for token, score in zip(t_tokens, t_inter_score.tolist()): - t_char_attribution_dict[idx] = (token, score) - idx += 1 - - result["query_char_attri"], result["title_char_attri"] = collections.OrderedDict(), collections.OrderedDict() - for token, attri in sorted(q_char_attribution_dict.items(), key=lambda x: x[1][1], reverse=True): - result["query_char_attri"][token] = attri - for token, attri in sorted(t_char_attribution_dict.items(), key=lambda x: x[1][1], 
reverse=True): - result["title_char_attri"][token] = attri - - out_handle.write(json.dumps(result, ensure_ascii=False) + "\n") - - -def IG_roberta_inter_score( - args, - embedded_grads_list, - pred_embedded, - baseline_embedded, - pred_confidence, - baseline_pred_confidence, - SEP_idx, - add_idx, - err_total, -): - embedded_grads_tensor = paddle.to_tensor( - embedded_grads_list, dtype="float32", place=paddle.CUDAPlace(0), stop_gradient=True - ) - - # Tensor(n_samples-1, 1, seq_len, embed_size) - trapezoidal_grads = (embedded_grads_tensor[1:] + embedded_grads_tensor[:-1]) / 2 - integral_grads = trapezoidal_grads.sum(0) / trapezoidal_grads.shape[0] # Tensor(1, seq_len, embed_size) - inter_score = (pred_embedded - baseline_embedded) * integral_grads # Tensor(1, seq_len, embed_size) - inter_score = inter_score.sum(-1) # Tensor(1, seq_len) - - # eval err - delta_pred_confidence = pred_confidence - baseline_pred_confidence - sum_gradient = inter_score.sum().tolist()[0] - err = (delta_pred_confidence - sum_gradient + 1e-12) / (delta_pred_confidence + 1e-12) - err_total.append(np.abs(err)) - - print_str = "%s\t%d\t%.3f\t%.3f\t%.3f\t%.3f" - print_vals = (result["id"], args.n_samples, delta_pred_confidence, sum_gradient, err, np.average(err_total)) - print(print_str % print_vals) - - inter_score.stop_gradient = True - q_inter_score = inter_score[0][1:SEP_idx] # remove CLS and SEP - t_inter_score = inter_score[0][SEP_idx + add_idx : -1] # remove CLS and SEP - - return q_inter_score, t_inter_score - - -def IG_lstm_inter_score(q_embedded_grads_list, pred_embedded, baseline_embedded, idx): - # query - q_embedded_grads_tensor = paddle.to_tensor( - q_embedded_grads_list, dtype="float32", place=paddle.CUDAPlace(0), stop_gradient=True - ) - q_trapezoidal_grads = ( - q_embedded_grads_tensor[1:] + q_embedded_grads_tensor[:-1] - ) / 2 # Tensor(n_samples-1, 1, seq_len, embed_size) - q_integral_grads = q_trapezoidal_grads.sum(0) / q_trapezoidal_grads.shape[0] # Tensor(1, seq_len, embed_size) - q_inter_score = (pred_embedded[idx] - baseline_embedded[idx]) * q_integral_grads # Tensor(1, seq_len, embed_size) - q_inter_score = q_inter_score.sum(-1) # Tensor(1, seq_len) - q_inter_score.stop_gradient = True - q_inter_score = q_inter_score[0] - - return q_inter_score - - -def extract_integrated_gradient_scores( - args, - result, - fwd_args, - fwd_kwargs, - model, - q_tokens, - t_tokens, - out_handle, - SEP_idx, - add_idx, - q_offset, - t_offset, - err_total, -): - embedded_grads_list = [] - q_embedded_grads_list, t_embedded_grads_list = [], [] - for i in range(args.n_samples): - probs, _, embedded = model.forward_interpret( - *fwd_args, **fwd_kwargs, noise="integrated", i=i, n_samples=args.n_samples - ) - predicted_class_prob = probs[0][pred_label] - predicted_class_prob.backward(retain_graph=False) - - if args.base_model.startswith("roberta"): - embedded_grad = embedded.grad - embedded_grads_list.append(embedded_grad) - elif args.base_model == "lstm": - q_embedded, t_embedded = embedded - q_embedded_grad = q_embedded.grad - t_embedded_grad = t_embedded.grad - q_embedded_grads_list.append(q_embedded_grad) - t_embedded_grads_list.append(t_embedded_grad) - model.clear_gradients() - if i == 0: - baseline_pred_confidence = probs.tolist()[0][pred_label] # scalar - baseline_embedded = embedded # Tensor(1, seq_len, embed_size) - elif i == args.n_samples - 1: - pred_confidence = probs.tolist()[0][pred_label] # scalar - pred_embedded = embedded # Tensor(1, seq_len, embed_size) - - if args.base_model.startswith("roberta"): - 
q_inter_score, t_inter_score = IG_roberta_inter_score( - args, - embedded_grads_list, - pred_embedded, - baseline_embedded, - pred_confidence, - baseline_pred_confidence, - SEP_idx, - add_idx, - err_total, - ) - elif args.base_model == "lstm": - q_inter_score = IG_lstm_inter_score(q_embedded_grads_list, pred_embedded, baseline_embedded, 0) - t_inter_score = IG_lstm_inter_score(t_embedded_grads_list, pred_embedded, baseline_embedded, 1) - - q_char_attribution_dict, t_char_attribution_dict = {}, {} - if args.base_model.startswith("roberta"): - # Query - sorted_token = [] - for i in range(len(q_inter_score)): - sorted_token.append([i, q_offset[i], q_inter_score[i]]) - q_char_attribution_dict = match(result["query"], result["text_q_seg"], sorted_token) - result["query_char_attri"] = collections.OrderedDict() - for token_info in sorted(q_char_attribution_dict, key=lambda x: x[2], reverse=True): - result["query_char_attri"][str(token_info[0])] = [str(token_info[1]), float(token_info[2])] - result.pop("text_q_seg") - - # Title - sorted_token = [] - for i in range(len(t_inter_score)): - sorted_token.append([i, t_offset[i], t_inter_score[i]]) - t_char_attribution_dict = match(result["title"], result["text_t_seg"], sorted_token) - result["title_char_attri"] = collections.OrderedDict() - for token_info in sorted(t_char_attribution_dict, key=lambda x: x[2], reverse=True): - result["title_char_attri"][str(token_info[0])] = [str(token_info[1]), float(token_info[2])] - result.pop("text_t_seg") - else: - idx = 0 - for token, score in zip(q_tokens, q_inter_score.tolist()): - q_char_attribution_dict[idx] = (token, score) - idx += 1 - for token, score in zip(t_tokens, t_inter_score.tolist()): - t_char_attribution_dict[idx] = (token, score) - idx += 1 - - result["query_char_attri"], result["title_char_attri"] = collections.OrderedDict(), collections.OrderedDict() - for token, attri in sorted(q_char_attribution_dict.items(), key=lambda x: x[1][1], reverse=True): - result["query_char_attri"][token] = attri - for token, attri in sorted(t_char_attribution_dict.items(), key=lambda x: x[1][1], reverse=True): - result["title_char_attri"][token] = attri - - out_handle.write(json.dumps(result, ensure_ascii=False) + "\n") - - -def extract_LIME_scores( - args, q_tokens, t_tokens, result, tokenizer, pred_label, fwd_args, fwd_kwargs, model, probs, out_handle -): - explainer = LimeTextExplainer(class_names=["neg", "pos"], verbose=False, language=args.language) - if_lstm = args.base_model == "lstm" - - explain_res_q = explainer.explain_instance( - text_instance_q=result["query"], - text_instance_t=result["title"], - analysis_query=True, - tokenizer=tokenizer, - pred_label=pred_label, - classifier_fn=model.forward_interpret, - num_samples=5000, - if_lstm=if_lstm, - ) - exp_q, indexed_string_q, relative_err, err = explain_res_q - local_exps_q = exp_q.local_exp - - explain_res_t = explainer.explain_instance( - text_instance_q=result["query"], - text_instance_t=result["title"], - analysis_query=False, - tokenizer=tokenizer, - pred_label=pred_label, - classifier_fn=model.forward_interpret, - num_samples=5000, - if_lstm=if_lstm, - ) - exp_t, indexed_string_t, _, _ = explain_res_t - local_exps_t = exp_t.local_exp - - # query - char_attribution_dict = [] - for kind, local_exp in local_exps_q.items(): - for idx in range(len(result["text_q_seg"])): - t = result["text_q_seg"][idx] # .replace('Ġ', '') - got_score = False - for word_id, attribution in local_exp: - if indexed_string_q.inverse_vocab[word_id] == t: - 
-                    char_attribution_dict.append((idx, t, attribution))
-                    got_score = True
-                    break
-            if not got_score:
-                char_attribution_dict.append((idx, t, 0))
-    char_attribution_dict = sorted(char_attribution_dict, key=lambda x: x[2], reverse=True)
-    result["query_char_attri"] = collections.OrderedDict()
-    for s in char_attribution_dict:
-        result["query_char_attri"][s[0]] = (s[1], s[2])
-
-    # title
-    char_attribution_dict = []
-    for kind, local_exp in local_exps_t.items():
-        for idx in range(len(result["text_t_seg"])):
-            t = result["text_t_seg"][idx]  # .replace('Ġ', '')
-            got_score = False
-            for word_id, attribution in local_exp:
-                if indexed_string_t.inverse_vocab[word_id] == t:
-                    char_attribution_dict.append((idx, t, attribution))
-                    got_score = True
-                    break
-            if not got_score:
-                char_attribution_dict.append((idx, t, 0))
-    char_attribution_dict = sorted(char_attribution_dict, key=lambda x: x[2], reverse=True)
-    result["title_char_attri"] = collections.OrderedDict()
-    for s in char_attribution_dict:
-        result["title_char_attri"][s[0]] = (s[1], s[2])
-
-    out_handle.write(json.dumps(result, ensure_ascii=False) + "\n")
-    return exp_q, exp_t, relative_err, err
-
-
-def LIME_error_evaluation(
-    exp_q, pred_label, probs, lime_score_total, lime_relative_err_total, lime_err_total, relative_err, err
-):
-    # err evaluation
-    score = exp_q.score[pred_label]
-    ridge_pred = exp_q.local_pred[pred_label]
-    model_pred = probs.numpy().tolist()[0][pred_label]
-
-    lime_score_total.append(score)
-    lime_relative_err_total.append(relative_err)
-    lime_err_total.append(err)
-    print("score: %.2f" % score)
-    print("relative_err: %.2f" % relative_err)
-    print("err: %.2f" % err)
-    print("ridge_pred: %.2f\tpred: %.2f\tdelta: %.2f" % (ridge_pred, model_pred, ridge_pred - model_pred))
-    return lime_score_total, lime_relative_err_total, lime_err_total
-
-
-g_splitter = re.compile(r"([\u4e00-\u9fa5])")
-
-if __name__ == "__main__":
-    args = get_args()
-    if args.base_model.startswith("roberta"):
-        model, tokenizer, dataloader, dev_ds = init_roberta_var(args)
-    elif args.base_model == "lstm":
-        model, tokenizer, dataloader, batchify_fn, vocab, dev_ds = init_lstm_var(args)
-    else:
-        raise ValueError("unsupported base model name.")
-
-    assert args.eval, "INTERPRETER must be run in eval mode"
-    with paddle.amp.auto_cast(enable=args.use_amp), open(
-        os.path.join(args.output_dir, "interpret" + f".{args.inter_mode}"), "w"
-    ) as out_handle:
-        # Load model
-        sd = paddle.load(args.init_checkpoint)
-        model.set_dict(sd)
-        model.train()  # Dropout was already set to 0 at init; train mode is kept so gradients can be collected
-        print("load model from %s" % args.init_checkpoint)
-
-        # For IG
-        err_total = []
-        # For LIME
-        lime_score_total = []
-        lime_relative_err_total = []
-        lime_err_total = []
-        # For Roberta
-        sub_word_id_dict_query = []
-        sub_word_id_dict_title = []
-        # For LSTM
-        q_offset, t_offset = None, None
-
-        get_sub_word_ids = lambda word: map(str, tokenizer.convert_tokens_to_ids(tokenizer.tokenize(word)))
-        for step, d in tqdm(enumerate(dataloader)):
-            if step + 1 < args.start_id:
-                continue
-
-            result = {}
-            # English and Chinese models have different numbers of [SEP] tokens between query and title
-            add_idx = get_seq_token_num(args.language)
-
-            if args.base_model.startswith("roberta"):
-                q_tokens, t_tokens, SEP_idx, fwd_args, fwd_kwargs, q_offset, t_offset = get_qt_tokens(
-                    base_model="roberta", d=d, add_idx=add_idx, tokenizer=tokenizer
-                )
-            elif args.base_model == "lstm":
-                q_tokens, t_tokens, SEP_idx, fwd_args, fwd_kwargs = get_qt_tokens(
-                    base_model="lstm", d=d, batchify_fn=batchify_fn, vocab=vocab
-                )
-
-            result["id"] = dev_ds.data[step]["id"]
-            result["text_q_seg"] = dev_ds.data[step]["text_q_seg"]
-            result["text_t_seg"] = dev_ds.data[step]["text_t_seg"]
-
-            probs, atts, embedded = model.forward_interpret(*fwd_args, **fwd_kwargs)
-            pred_label = paddle.argmax(probs, axis=-1).tolist()[0]
-
-            result["pred_label"] = pred_label
-            result["probs"] = [float(format(prob, ".5f")) for prob in probs.numpy()[0].tolist()]
-
-            if args.language == "ch":
-                result["query"] = dev_ds.data[step]["query"]
-                result["title"] = dev_ds.data[step]["title"]
-            else:
-                result["query"] = dev_ds.data[step]["sentence1"]
-                result["title"] = dev_ds.data[step]["sentence2"]
-
-            # Attention
-            if args.inter_mode == "attention":
-                extract_attention_scores(
-                    args, result, atts, q_tokens, t_tokens, out_handle, SEP_idx, q_offset, t_offset, add_idx
-                )
-
-            elif args.inter_mode == "integrated_gradient":
-                extract_integrated_gradient_scores(
-                    args,
-                    result,
-                    fwd_args,
-                    fwd_kwargs,
-                    model,
-                    q_tokens,
-                    t_tokens,
-                    out_handle,
-                    SEP_idx,
-                    add_idx,
-                    q_offset,
-                    t_offset,
-                    err_total,
-                )
-
-            elif args.inter_mode == "lime":
-                exp_q, exp_t, relative_err, err = extract_LIME_scores(
-                    args,
-                    q_tokens,
-                    t_tokens,
-                    result,
-                    tokenizer,
-                    pred_label,
-                    fwd_args,
-                    fwd_kwargs,
-                    model,
-                    probs,
-                    out_handle,
-                )
-                lime_score_total, lime_relative_err_total, lime_err_total = LIME_error_evaluation(
-                    exp_q,
-                    pred_label,
-                    probs,
-                    lime_score_total,
-                    lime_relative_err_total,
-                    lime_err_total,
-                    relative_err,
-                    err,
-                )
-
-            else:
-                raise KeyError(f"Unknown interpretable mode: {args.inter_mode}")
-
-    if args.inter_mode == "lime":
-        print(np.average(np.array(lime_relative_err_total)))
diff --git a/examples/model_interpretation/task/similarity/saliency_map/utils.py b/examples/model_interpretation/task/similarity/saliency_map/utils.py
deleted file mode 100644
index 9e6dd7e1a61b..000000000000
--- a/examples/model_interpretation/task/similarity/saliency_map/utils.py
+++ /dev/null
@@ -1,38 +0,0 @@
-# !/usr/bin/env python3
-# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
- -from __future__ import absolute_import, division, print_function, unicode_literals - -import paddle - - -class UnpackDataLoader(paddle.io.DataLoader): - def __init__(self, *args, **kwargs): - super(UnpackDataLoader, self).__init__(*args, batch_size=1, **kwargs) - - def __iter__(self): - return ([yy[0] for yy in y] for y in super(UnpackDataLoader, self).__iter__()) - - -def create_if_not_exists(dir): - try: - dir.mkdir(parents=True) - except FileExistsError: - pass - return dir - - -def get_warmup_and_linear_decay(max_steps, warmup_steps): - return lambda step: min(step / warmup_steps, 1.0 - (step - warmup_steps) / (max_steps - warmup_steps)) diff --git a/examples/model_interpretation/task/similarity/simnet/gen_vocab.py b/examples/model_interpretation/task/similarity/simnet/gen_vocab.py deleted file mode 100644 index 435990282531..000000000000 --- a/examples/model_interpretation/task/similarity/simnet/gen_vocab.py +++ /dev/null @@ -1,60 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -# !/usr/bin/env python -# coding=utf-8 - -import sys -from collections import defaultdict - -import spacy - -from paddlenlp.datasets import load_dataset - -if sys.argv[1] == "ch": - train_ds, dev_ds, test_ds = load_dataset("lcqmc", splits=["train", "dev", "test"]) - - vocab = defaultdict(int) - for example in train_ds.data: - query = example["query"] - title = example["title"] - for c in query: - vocab[c] += 1 - for c in title: - vocab[c] += 1 - with open("vocab.char", "w") as f: - for k, v in vocab.items(): - if v > 3: - f.write(k + "\n") - -else: - tokenizer = spacy.load("en_core_web_sm") - vocab = defaultdict(int) - - with open("../data/QQP/train/train.tsv", "r") as f_dataset: - for idx, line in enumerate(f_dataset.readlines()): - if idx == 0: - continue - line_split = line.strip().split("\t") - query = [token.text for token in tokenizer(line_split[0])] - title = [token.text for token in tokenizer(line_split[1])] - - for word in query: - vocab[word] += 1 - for word in title: - vocab[word] += 1 - - with open("vocab_QQP", "w") as f: - for k, v in vocab.items(): - if v > 3: - f.write(k + "\n") diff --git a/examples/model_interpretation/task/similarity/simnet/interpreter_attention.py b/examples/model_interpretation/task/similarity/simnet/interpreter_attention.py deleted file mode 100644 index e2ed642e836b..000000000000 --- a/examples/model_interpretation/task/similarity/simnet/interpreter_attention.py +++ /dev/null @@ -1,121 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License" -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import argparse
-import sys
-
-import paddle
-
-from paddlenlp.data import Pad, Stack, Tuple, Vocab
-from paddlenlp.datasets import load_dataset
-
-sys.path.append("../../..")
-from model import SimNet  # noqa: E402
-from utils import CharTokenizer, preprocess_data  # noqa: E402
-
-parser = argparse.ArgumentParser(__doc__)
-parser.add_argument(
-    "--device", choices=["cpu", "gpu"], default="gpu", help="Select which device to train model, defaults to gpu."
-)
-parser.add_argument("--batch_size", type=int, default=1, help="Total number of examples in a batch for training.")
-parser.add_argument("--vocab_path", type=str, default="./vocab.char", help="The path to vocabulary.")
-parser.add_argument(
-    "--network", type=str, default="lstm", help="Which network would you like to choose: bow, cnn, lstm or gru?"
-)
-parser.add_argument(
-    "--params_path", type=str, default="./checkpoints/final.pdparams", help="The path of model parameter to be loaded."
-)
-parser.add_argument("--language", type=str, required=True, help="Language that this model is based on")
-args = parser.parse_args()
-
-
-def interpret(model, data, label_map, batch_size=1, pad_token_id=0, vocab=None):
-    """
-    Predicts the data labels.
-
-    Args:
-        model (obj:`paddle.nn.Layer`): A model to classify texts.
-        data (obj:`List(Example)`): The processed data, each element of which is an Example (namedtuple) object.
-            An Example object contains `text`(word_ids) and `seq_len`(sequence length).
-        label_map(obj:`dict`): The label id (key) to label str (value) map.
-        batch_size(obj:`int`, defaults to 1): The batch size.
-        pad_token_id(obj:`int`, optional, defaults to 0): The pad token index.
-
-    Returns:
-        results(obj:`list`): All the predicted labels.
-    """
-
-    # Separates data into some batches.
-    batches = [data[idx : idx + batch_size] for idx in range(0, len(data), batch_size)]
-
-    batchify_fn = lambda samples, fn=Tuple(
-        Pad(axis=0, pad_val=pad_token_id),  # query_ids
-        Pad(axis=0, pad_val=pad_token_id),  # title_ids
-        Stack(dtype="int64"),  # query_seq_lens
-        Stack(dtype="int64"),  # title_seq_lens
-    ): [data for data in fn(samples)]
-
-    model.eval()
-    results = []
-    for batch in batches:
-        query_ids, title_ids, query_seq_lens, title_seq_lens = batchify_fn(batch)
-        query_ids = paddle.to_tensor(query_ids)
-        title_ids = paddle.to_tensor(title_ids)
-        query_seq_lens = paddle.to_tensor(query_seq_lens)
-        title_seq_lens = paddle.to_tensor(title_seq_lens)
-
-        logits, attention, _ = model.forward_interpret(query_ids, title_ids, query_seq_lens, title_seq_lens)
-        query_att = attention[0]
-        title_att = attention[1]
-
-        model.clear_gradients()
-        for query_id, title_id in zip(query_ids.numpy().tolist(), title_ids.numpy().tolist()):
-            query = [vocab._idx_to_token[idx] for idx in query_id]
-            title = [vocab._idx_to_token[idx] for idx in title_id]
-            results.append([query_att, query, title_att, title])
-
-        print("query_att: %s" % query_att.shape)
-        print("title_att: %s" % title_att.shape)
-
-    return results
-
-
-if __name__ == "__main__":
-    paddle.set_device(args.device + ":2")
-    # Loads vocab.
-    vocab = Vocab.load_vocabulary(args.vocab_path, unk_token="[UNK]", pad_token="[PAD]")
-    tokenizer = CharTokenizer(vocab, args.language)
-    label_map = {0: "dissimilar", 1: "similar"}
-
-    # Constructs the network.
-    model = SimNet(network=args.network, vocab_size=len(vocab), num_classes=len(label_map))
-
-    # Loads model parameters.
-    state_dict = paddle.load(args.params_path)
-    model.set_dict(state_dict)
-    print("Loaded parameters from %s" % args.params_path)
-
-    # First preprocess the prediction data, then predict.
-    dev_ds, test_ds = load_dataset("lcqmc", splits=["dev", "test"])
-
-    dev_examples = preprocess_data(dev_ds.data, tokenizer, args.language)
-    test_examples = preprocess_data(test_ds.data, tokenizer, args.language)
-    results = interpret(
-        model,
-        dev_examples,
-        label_map=label_map,
-        batch_size=args.batch_size,
-        pad_token_id=vocab.token_to_idx.get("[PAD]", 0),
-        vocab=vocab,
-    )
diff --git a/examples/model_interpretation/task/similarity/simnet/interpreter_grad.py b/examples/model_interpretation/task/similarity/simnet/interpreter_grad.py
deleted file mode 100644
index 8da2733bee65..000000000000
--- a/examples/model_interpretation/task/similarity/simnet/interpreter_grad.py
+++ /dev/null
@@ -1,131 +0,0 @@
-# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import argparse
-import sys
-
-import paddle
-
-from paddlenlp.data import Pad, Stack, Tuple, Vocab
-from paddlenlp.datasets import load_dataset
-
-sys.path.append("../../..")
-from model import SimNet  # noqa: E402
-from utils import CharTokenizer, preprocess_data  # noqa: E402
-
-parser = argparse.ArgumentParser(__doc__)
-parser.add_argument(
-    "--device", choices=["cpu", "gpu"], default="gpu", help="Select which device to train model, defaults to gpu."
-)
-parser.add_argument("--batch_size", type=int, default=1, help="Total number of examples in a batch for training.")
-parser.add_argument("--vocab_path", type=str, default="./vocab.char", help="The path to vocabulary.")
-parser.add_argument(
-    "--network", type=str, default="lstm", help="Which network would you like to choose: bow, cnn, lstm or gru?"
-)
-parser.add_argument(
-    "--params_path", type=str, default="./checkpoints/final.pdparams", help="The path of model parameter to be loaded."
-)
-parser.add_argument("--language", type=str, required=True, help="Language that this model is based on")
-args = parser.parse_args()
-
-
-def interpret(model, data, label_map, batch_size=1, pad_token_id=0, vocab=None):
-    """
-    Predicts the data labels.
-
-    Args:
-        model (obj:`paddle.nn.Layer`): A model to classify texts.
-        data (obj:`List(Example)`): The processed data, each element of which is an Example (namedtuple) object.
-            An Example object contains `text`(word_ids) and `seq_len`(sequence length).
-        label_map(obj:`dict`): The label id (key) to label str (value) map.
-        batch_size(obj:`int`, defaults to 1): The batch size.
-        pad_token_id(obj:`int`, optional, defaults to 0): The pad token index.
-
-    Returns:
-        results(obj:`list`): All the predicted labels.
-    """
-
-    # Separates data into some batches.
-    batches = [data[idx : idx + batch_size] for idx in range(0, len(data), batch_size)]
-
-    batchify_fn = lambda samples, fn=Tuple(
-        Pad(axis=0, pad_val=pad_token_id),  # query_ids
-        Pad(axis=0, pad_val=pad_token_id),  # title_ids
-        Stack(dtype="int64"),  # query_seq_lens
-        Stack(dtype="int64"),  # title_seq_lens
-    ): [data for data in fn(samples)]
-
-    model.train()
-    results = []
-    for batch in batches:
-        query_ids, title_ids, query_seq_lens, title_seq_lens = batchify_fn(batch)
-        query_ids = paddle.to_tensor(query_ids)
-        title_ids = paddle.to_tensor(title_ids)
-        query_seq_lens = paddle.to_tensor(query_seq_lens)
-        title_seq_lens = paddle.to_tensor(title_seq_lens)
-        probs, addiational_info = model.forward_interpreter(query_ids, title_ids, query_seq_lens, title_seq_lens)
-        query_emb = addiational_info["embedded"][0]
-        title_emb = addiational_info["embedded"][1]
-
-        predicted_class_probs = paddle.max(probs, axis=-1)
-        predicted_class_probs = predicted_class_probs.sum()
-        paddle.autograd.backward([predicted_class_probs])
-        q_gradients = ((query_emb * query_emb.grad).sum(-1).detach()).abs()  # gradients: (1, seq_len)
-        q_grad_output = q_gradients / q_gradients.sum(-1, keepdim=True)
-        t_gradients = ((title_emb * title_emb.grad).sum(-1).detach()).abs()  # gradients: (1, seq_len)
-        t_grad_output = t_gradients / t_gradients.sum(-1, keepdim=True)
-
-        model.clear_gradients()
-        for query_id, title_id in zip(query_ids.numpy().tolist(), title_ids.numpy().tolist()):
-            query = [vocab._idx_to_token[idx] for idx in query_id]
-            title = [vocab._idx_to_token[idx] for idx in title_id]
-            results.append([q_grad_output, query, t_grad_output, title])
-            print([q_grad_output, query, t_grad_output, title])
-
-    return results
-
-
-if __name__ == "__main__":
-    paddle.set_device(args.device + ":1")
-    # Loads vocab.
-    vocab = Vocab.load_vocabulary(args.vocab_path, unk_token="[UNK]", pad_token="[PAD]")
-    tokenizer = CharTokenizer(vocab, args.language)
-    label_map = {0: "dissimilar", 1: "similar"}
-
-    # Constructs the network.
-    model = SimNet(network=args.network, vocab_size=len(vocab), num_classes=len(label_map))
-
-    # Loads model parameters.
-    state_dict = paddle.load(args.params_path)
-    model.set_dict(state_dict)
-    print("Loaded parameters from %s" % args.params_path)
-
-    # First preprocess the prediction data, then predict.
- - dev_ds, test_ds = load_dataset("lcqmc", splits=["dev", "test"]) - - dev_examples = preprocess_data(dev_ds.data, tokenizer, args.language) - test_examples = preprocess_data(test_ds.data, tokenizer, args.language) - results = interpret( - model, - dev_examples, - label_map=label_map, - batch_size=args.batch_size, - pad_token_id=vocab.token_to_idx.get("[PAD]", 0), - vocab=vocab, - ) - - # for idx, text in enumerate(data): - # print('Data: {} \t Label: {}'.format(text, results[idx])) diff --git a/examples/model_interpretation/task/similarity/simnet/lstm_train.sh b/examples/model_interpretation/task/similarity/simnet/lstm_train.sh deleted file mode 100755 index 5c1b671f0930..000000000000 --- a/examples/model_interpretation/task/similarity/simnet/lstm_train.sh +++ /dev/null @@ -1,21 +0,0 @@ -### - # This script is used to train lstm models -### - -unset CUDA_VISIBLE_DEVICES -LANGUAGE=en - -if [[ $LANGUAGE == "ch" ]]; then - VOCAB_PATH=vocab.char -elif [[ $LANGUAGE == "en" ]]; then - VOCAB_PATH=vocab_QQP -fi - -python -m paddle.distributed.launch --gpus "5" train.py \ - --device=gpu \ - --lr=4e-4 \ - --batch_size=64 \ - --epochs=12 \ - --vocab_path=$VOCAB_PATH \ - --language=$LANGUAGE \ - --save_dir="./checkpoints_"${LANGUAGE} diff --git a/examples/model_interpretation/task/similarity/simnet/model.py b/examples/model_interpretation/task/similarity/simnet/model.py deleted file mode 100644 index e3c86ad21c4e..000000000000 --- a/examples/model_interpretation/task/similarity/simnet/model.py +++ /dev/null @@ -1,270 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import paddle -import paddle.nn as nn -import paddle.nn.functional as F - -import paddlenlp as nlp - - -class SimNet(nn.Layer): - def __init__(self, network, vocab_size, num_classes, emb_dim=128, pad_token_id=0): - super().__init__() - - network = network.lower() - if network == "bow": - self.model = BoWModel(vocab_size, num_classes, emb_dim, padding_idx=pad_token_id) - elif network == "cnn": - self.model = CNNModel(vocab_size, num_classes, emb_dim, padding_idx=pad_token_id) - elif network == "gru": - self.model = GRUModel(vocab_size, num_classes, emb_dim, direction="forward", padding_idx=pad_token_id) - elif network == "lstm": - self.model = LSTMModel(vocab_size, num_classes, emb_dim, direction="forward", padding_idx=pad_token_id) - else: - raise ValueError("Unknown network: %s, it must be one of bow, cnn, lstm or gru." 
% network) - - def forward(self, query, title, query_seq_len=None, title_seq_len=None): - logits = self.model(query, title, query_seq_len, title_seq_len) - return logits - - def forward_interpret( - self, query, title, query_seq_len=None, title_seq_len=None, noise=None, i=None, n_samples=None - ): - - logits, addiational_info = self.model.forward_interpreter( - query, title, query_seq_len, title_seq_len, noise=noise, i=i, n_samples=n_samples - ) - - return logits, addiational_info["attention"], addiational_info["embedded"] - - -class BoWModel(nn.Layer): - """ - This class implements the Bag of Words Classification Network model to classify texts. - At a high level, the model starts by embedding the tokens and running them through - a word embedding. Then, we encode these representations with a `BoWEncoder`. - Lastly, we take the output of the encoder to create a final representation, - which is passed through some feed-forward layers to output a logits (`output_layer`). - Args: - vocab_size (obj:`int`): The vocabulary size. - emb_dim (obj:`int`, optional, defaults to 128): The embedding dimension. - padding_idx (obj:`int`, optional, defaults to 0) : The pad token index. - hidden_size (obj:`int`, optional, defaults to 128): The first full-connected layer hidden size. - fc_hidden_size (obj:`int`, optional, defaults to 96): The second full-connected layer hidden size. - num_classes (obj:`int`): All the labels that the data has. - """ - - def __init__(self, vocab_size, num_classes, emb_dim=128, padding_idx=0, fc_hidden_size=128): - super().__init__() - self.embedder = nn.Embedding(vocab_size, emb_dim, padding_idx=padding_idx) - self.bow_encoder = nlp.seq2vec.BoWEncoder(emb_dim) - self.fc = nn.Linear(self.bow_encoder.get_output_dim() * 2, fc_hidden_size) - self.output_layer = nn.Linear(fc_hidden_size, num_classes) - - def forward(self, query, title, query_seq_len=None, title_seq_len=None): - # Shape: (batch_size, num_tokens, embedding_dim) - embedded_query = self.embedder(query) - embedded_title = self.embedder(title) - # Shape: (batch_size, embedding_dim) - summed_query = self.bow_encoder(embedded_query) - summed_title = self.bow_encoder(embedded_title) - encoded_query = paddle.tanh(summed_query) - encoded_title = paddle.tanh(summed_title) - # Shape: (batch_size, embedding_dim*2) - contacted = paddle.concat([encoded_query, encoded_title], axis=-1) - # Shape: (batch_size, fc_hidden_size) - fc_out = paddle.tanh(self.fc(contacted)) - # Shape: (batch_size, num_classes) - logits = self.output_layer(fc_out) - # probs = F.softmax(logits, axis=-1) - return logits - - -class LSTMModel(nn.Layer): - def __init__( - self, - vocab_size, - num_classes, - emb_dim=128, - padding_idx=0, - lstm_hidden_size=128, - direction="forward", - lstm_layers=1, - dropout_rate=0.0, - pooling_type=None, - fc_hidden_size=128, - ): - super().__init__() - self.embedder = nn.Embedding(num_embeddings=vocab_size, embedding_dim=emb_dim, padding_idx=padding_idx) - self.lstm_encoder = nlp.seq2vec.LSTMEncoder( - emb_dim, lstm_hidden_size, num_layers=lstm_layers, direction=direction, dropout=dropout_rate - ) - self.fc = nn.Linear(self.lstm_encoder.get_output_dim() * 2, fc_hidden_size) - self.output_layer = nn.Linear(fc_hidden_size, num_classes) - self.pad_token_id = padding_idx - - def forward(self, query, title, query_seq_len, title_seq_len): - assert query_seq_len is not None and title_seq_len is not None - # Shape: (batch_size, num_tokens, embedding_dim) - embedded_query = self.embedder(query) - embedded_title = 
self.embedder(title) - # Shape: (batch_size, lstm_hidden_size) - query_repr = self.lstm_encoder(embedded_query, sequence_length=query_seq_len) - title_repr = self.lstm_encoder(embedded_title, sequence_length=title_seq_len) - # Shape: (batch_size, 2*lstm_hidden_size) - contacted = paddle.concat([query_repr, title_repr], axis=-1) - # Shape: (batch_size, fc_hidden_size) - fc_out = paddle.tanh(self.fc(contacted)) - # Shape: (batch_size, num_classes) - logits = self.output_layer(fc_out) - # probs = F.softmax(logits, axis=-1) - - return logits - - def forward_interpreter(self, query, title, query_seq_len, title_seq_len, noise=None, i=None, n_samples=None): - assert query_seq_len is not None and title_seq_len is not None - # Shape: (batch_size, num_tokens, embedding_dim) - - query_baseline = paddle.to_tensor([self.pad_token_id] * query.shape[1]).unsqueeze(0) - title_baseline = paddle.to_tensor([self.pad_token_id] * title.shape[1]).unsqueeze(0) - - embedded_query = self.embedder(query) - embedded_title = self.embedder(title) - embedded_query_baseline = self.embedder(query_baseline) - embedded_title_baseline = self.embedder(title_baseline) - - if noise is not None and noise.upper() == "INTEGRATED": - embedded_query = embedded_query_baseline + i / (n_samples - 1) * (embedded_query - embedded_query_baseline) - embedded_title = embedded_title_baseline + i / (n_samples - 1) * (embedded_title - embedded_title_baseline) - - # Shape: (batch_size, lstm_hidden_size) - query_repr = self.lstm_encoder(embedded_query, sequence_length=query_seq_len) - title_repr = self.lstm_encoder(embedded_title, sequence_length=title_seq_len) - # Shape: (batch_size, 2*lstm_hidden_size) - contacted = paddle.concat([query_repr, title_repr], axis=-1) - # Shape: (batch_size, fc_hidden_size) - fc_out = paddle.tanh(self.fc(contacted)) - # Shape: (batch_size, num_classes) - logits = self.output_layer(fc_out) - probs = F.softmax(logits, axis=-1) - - q_att = paddle.matmul(fc_out, embedded_query, transpose_y=True).squeeze(axis=[1]) # (bsz, query_len) - q_att = F.softmax(q_att, axis=-1) - t_att = paddle.matmul(fc_out, embedded_title, transpose_y=True).squeeze(axis=[1]) # (bsz, title_len) - t_att = F.softmax(t_att, axis=-1) - - addiational_info = { - "embedded": [embedded_query, embedded_title], - "attention": [q_att, t_att], - } - # return logits, addiational_info - return probs, addiational_info - - -class GRUModel(nn.Layer): - def __init__( - self, - vocab_size, - num_classes, - emb_dim=128, - padding_idx=0, - gru_hidden_size=128, - direction="forward", - gru_layers=1, - dropout_rate=0.0, - pooling_type=None, - fc_hidden_size=96, - ): - super().__init__() - self.embedder = nn.Embedding(num_embeddings=vocab_size, embedding_dim=emb_dim, padding_idx=padding_idx) - self.gru_encoder = nlp.seq2vec.GRUEncoder( - emb_dim, gru_hidden_size, num_layers=gru_layers, direction=direction, dropout=dropout_rate - ) - self.fc = nn.Linear(self.gru_encoder.get_output_dim() * 2, fc_hidden_size) - self.output_layer = nn.Linear(fc_hidden_size, num_classes) - - def forward(self, query, title, query_seq_len, title_seq_len): - # Shape: (batch_size, num_tokens, embedding_dim) - embedded_query = self.embedder(query) - embedded_title = self.embedder(title) - # Shape: (batch_size, gru_hidden_size) - query_repr = self.gru_encoder(embedded_query, sequence_length=query_seq_len) - title_repr = self.gru_encoder(embedded_title, sequence_length=title_seq_len) - # Shape: (batch_size, 2*gru_hidden_size) - contacted = paddle.concat([query_repr, title_repr], axis=-1) - # 
Shape: (batch_size, fc_hidden_size) - fc_out = paddle.tanh(self.fc(contacted)) - # Shape: (batch_size, num_classes) - logits = self.output_layer(fc_out) - # probs = F.softmax(logits, axis=-1) - - return logits - - -class CNNModel(nn.Layer): - """ - This class implements the - - - Convolution Neural Network model. - At a high level, the model starts by embedding the tokens and running them through - a word embedding. Then, we encode these representations with a `CNNEncoder`. - The CNN has one convolution layer for each ngram filter size. Each convolution operation gives - out a vector of size num_filter. The number of times a convolution layer will be used - is `num_tokens - ngram_size + 1`. The corresponding maxpooling layer aggregates all these - outputs from the convolution layer and outputs the max. - Lastly, we take the output of the encoder to create a final representation, - which is passed through some feed-forward layers to output a logits (`output_layer`). - Args: - vocab_size (obj:`int`): The vocabulary size. - emb_dim (obj:`int`, optional, defaults to 128): The embedding dimension. - padding_idx (obj:`int`, optional, defaults to 0) : The pad token index. - num_classes (obj:`int`): All the labels that the data has. - """ - - def __init__( - self, - vocab_size, - num_classes, - emb_dim=128, - padding_idx=0, - num_filter=256, - ngram_filter_sizes=(3,), - fc_hidden_size=128, - ): - super().__init__() - self.padding_idx = padding_idx - self.embedder = nn.Embedding(vocab_size, emb_dim, padding_idx=padding_idx) - self.encoder = nlp.seq2vec.CNNEncoder( - emb_dim=emb_dim, num_filter=num_filter, ngram_filter_sizes=ngram_filter_sizes - ) - self.fc = nn.Linear(self.encoder.get_output_dim() * 2, fc_hidden_size) - self.output_layer = nn.Linear(fc_hidden_size, num_classes) - - def forward(self, query, title, query_seq_len=None, title_seq_len=None): - # Shape: (batch_size, num_tokens, embedding_dim) - embedded_query = self.embedder(query) - embedded_title = self.embedder(title) - # Shape: (batch_size, num_filter) - query_repr = self.encoder(embedded_query) - title_repr = self.encoder(embedded_title) - # Shape: (batch_size, 2*num_filter) - contacted = paddle.concat([query_repr, title_repr], axis=-1) - # Shape: (batch_size, fc_hidden_size) - fc_out = paddle.tanh(self.fc(contacted)) - # Shape: (batch_size, num_classes) - logits = self.output_layer(fc_out) - # probs = F.softmax(logits, axis=-1) - return logits diff --git a/examples/model_interpretation/task/similarity/simnet/predict.py b/examples/model_interpretation/task/similarity/simnet/predict.py deleted file mode 100644 index dec464bf4130..000000000000 --- a/examples/model_interpretation/task/similarity/simnet/predict.py +++ /dev/null @@ -1,109 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License" -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
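# ---------------------------------------------------------------------------
# Editor's note (illustrative sketch, not part of the original diff): every
# encoder variant in model.py above follows the same Siamese pattern --
# encode query and title with shared weights, concatenate, then classify.
# A minimal stand-alone version with a mean-pooling "encoder"; all sizes
# here are arbitrary placeholders, not values from the original code.
import paddle
import paddle.nn as nn

class TinySiamese(nn.Layer):
    def __init__(self, vocab_size=100, emb_dim=16, num_classes=2):
        super().__init__()
        self.embedder = nn.Embedding(vocab_size, emb_dim)
        self.fc = nn.Linear(emb_dim * 2, emb_dim)
        self.output_layer = nn.Linear(emb_dim, num_classes)

    def forward(self, query, title):
        q = self.embedder(query).mean(axis=1)   # (batch, emb_dim): mean-pool tokens
        t = self.embedder(title).mean(axis=1)
        pair = paddle.concat([q, t], axis=-1)   # (batch, 2 * emb_dim)
        return self.output_layer(paddle.tanh(self.fc(pair)))

logits = TinySiamese()(paddle.randint(0, 100, [2, 5]), paddle.randint(0, 100, [2, 7]))
print(logits.shape)  # [2, 2]
# ---------------------------------------------------------------------------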
-
-import argparse
-
-import paddle
-import paddle.nn.functional as F
-from model import SimNet
-from utils import preprocess_prediction_data
-
-from paddlenlp.data import JiebaTokenizer, Pad, Stack, Tuple, Vocab
-
-# yapf: disable
-parser = argparse.ArgumentParser(__doc__)
-parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.")
-parser.add_argument("--batch_size", type=int, default=64, help="Total number of examples in a batch for training.")
-parser.add_argument("--vocab_path", type=str, default="./simnet_vocab.txt", help="The path to vocabulary.")
-parser.add_argument('--network', type=str, default="lstm", help="Which network would you like to choose: bow, cnn, lstm or gru?")
-parser.add_argument("--params_path", type=str, default='./checkpoints/final.pdparams', help="The path of model parameter to be loaded.")
-args = parser.parse_args()
-# yapf: enable
-
-
-def predict(model, data, label_map, batch_size=1, pad_token_id=0):
-    """
-    Predicts the data labels.
-
-    Args:
-        model (obj:`paddle.nn.Layer`): A model to classify texts.
-        data (obj:`List(Example)`): The processed data, each element of which is an Example (namedtuple) object.
-            An Example object contains `text`(word_ids) and `seq_len`(sequence length).
-        label_map(obj:`dict`): The label id (key) to label str (value) map.
-        batch_size(obj:`int`, defaults to 1): The batch size.
-        pad_token_id(obj:`int`, optional, defaults to 0): The pad token index.
-
-    Returns:
-        results(obj:`list`): All the predicted labels.
-    """
-
-    # Separates data into some batches.
-    batches = [data[idx : idx + batch_size] for idx in range(0, len(data), batch_size)]
-
-    batchify_fn = lambda samples, fn=Tuple(
-        Pad(axis=0, pad_val=pad_token_id),  # query_ids
-        Pad(axis=0, pad_val=pad_token_id),  # title_ids
-        Stack(dtype="int64"),  # query_seq_lens
-        Stack(dtype="int64"),  # title_seq_lens
-    ): [data for data in fn(samples)]
-
-    results = []
-    model.eval()
-    for batch in batches:
-        query_ids, title_ids, query_seq_lens, title_seq_lens = batchify_fn(batch)
-        query_ids = paddle.to_tensor(query_ids)
-        title_ids = paddle.to_tensor(title_ids)
-        query_seq_lens = paddle.to_tensor(query_seq_lens)
-        title_seq_lens = paddle.to_tensor(title_seq_lens)
-        logits = model(query_ids, title_ids, query_seq_lens, title_seq_lens)
-        probs = F.softmax(logits, axis=1)
-        idx = paddle.argmax(probs, axis=1).numpy()
-        idx = idx.tolist()
-        labels = [label_map[i] for i in idx]
-        results.extend(labels)
-    return results
-
-
-if __name__ == "__main__":
-    paddle.set_device(args.device)
-    # Loads vocab.
-    vocab = Vocab.load_vocabulary(args.vocab_path, unk_token="[UNK]", pad_token="[PAD]")
-    tokenizer = JiebaTokenizer(vocab)
-    label_map = {0: "dissimilar", 1: "similar"}
-
-    # Constructs the network.
-    model = SimNet(network=args.network, vocab_size=len(vocab), num_classes=len(label_map))
-
-    # Loads model parameters.
-    state_dict = paddle.load(args.params_path)
-    model.set_dict(state_dict)
-    print("Loaded parameters from %s" % args.params_path)
-
-    # First preprocess the prediction data, then predict.
- data = [ - ["世界上什么东西最小", "世界上什么东西最小?"], - ["光眼睛大就好看吗", "眼睛好看吗?"], - ["小蝌蚪找妈妈怎么样", "小蝌蚪找妈妈是谁画的"], - ] - examples = preprocess_prediction_data(data, tokenizer) - results = predict( - model, - examples, - label_map=label_map, - batch_size=args.batch_size, - pad_token_id=vocab.token_to_idx.get("[PAD]", 0), - ) - - for idx, text in enumerate(data): - print("Data: {} \t Label: {}".format(text, results[idx])) diff --git a/examples/model_interpretation/task/similarity/simnet/train.py b/examples/model_interpretation/task/similarity/simnet/train.py deleted file mode 100644 index ec36090726cc..000000000000 --- a/examples/model_interpretation/task/similarity/simnet/train.py +++ /dev/null @@ -1,135 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License" -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -import argparse -import os -import sys -from functools import partial - -import paddle - -from paddlenlp.data import Pad, Stack, Tuple, Vocab -from paddlenlp.datasets import load_dataset - -sys.path.append("../../../") -from model import SimNet # noqa: E402 -from utils import CharTokenizer, convert_example # noqa: E402 - -parser = argparse.ArgumentParser(__doc__) -parser.add_argument("--epochs", type=int, default=10, help="Number of epoches for training.") -parser.add_argument( - "--device", choices=["cpu", "gpu"], default="gpu", help="Select which device to train model, defaults to gpu." -) -parser.add_argument("--lr", type=float, default=5e-4, help="Learning rate used to train.") -parser.add_argument("--save_dir", type=str, default="checkpoints/", help="Directory to save model checkpoint") -parser.add_argument("--batch_size", type=int, default=64, help="Total examples' number of a batch for training.") -parser.add_argument( - "--vocab_path", - type=str, - default="./vocab.char", - help="The directory to dataset. Chinese version uses vocab.char while English version uses vocab_QQP", -) -parser.add_argument( - "--network", type=str, default="lstm", help="Which network you would like to choose bow, cnn, lstm or gru ?" -) -parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") -parser.add_argument("--language", type=str, required=True, help="Language that this model based on") -args = parser.parse_args() - - -def create_dataloader(dataset, trans_fn=None, mode="train", batch_size=1, batchify_fn=None): - """ - Creats dataloader. - - Args: - dataset(obj:`paddle.io.Dataset`): Dataset instance. - trans_fn(obj:`callable`, optional, defaults to `None`): function to convert a data sample to input ids, etc. - mode(obj:`str`, optional, defaults to obj:`train`): If mode is 'train', it will shuffle the dataset randomly. - batch_size(obj:`int`, optional, defaults to 1): The sample number of a mini-batch. - batchify_fn(obj:`callable`, optional, defaults to `None`): function to generate mini-batch data by merging - the sample list, None for only stack each fields of sample in axis - 0(same as :attr::`np.stack(..., axis=0)`). 
- - Returns: - dataloader(obj:`paddle.io.DataLoader`): The dataloader which generates batches. - """ - if trans_fn: - dataset = dataset.map(trans_fn) - - shuffle = True if mode == "train" else False - if mode == "train": - sampler = paddle.io.DistributedBatchSampler(dataset=dataset, batch_size=batch_size, shuffle=True) - else: - sampler = paddle.io.BatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle) - dataloader = paddle.io.DataLoader(dataset, batch_sampler=sampler, return_list=True, collate_fn=batchify_fn) - return dataloader - - -if __name__ == "__main__": - paddle.set_device(args.device) - - # Loads vocab. - if not os.path.exists(args.vocab_path): - raise RuntimeError("The vocab_path can not be found in the path %s" % args.vocab_path) - vocab = Vocab.load_vocabulary(args.vocab_path, unk_token="[UNK]", pad_token="[PAD]") - - # Loads dataset. - if args.language == "ch": - train_ds, dev_ds, test_ds = load_dataset("lcqmc", splits=["train", "dev", "test"]) - else: - train_ds, dev_ds, test_ds = load_dataset("glue", "qqp", splits=["train", "dev", "test"]) - - # Constructs the newtork. - model = SimNet(network=args.network, vocab_size=len(vocab), num_classes=len(train_ds.label_list)) - model = paddle.Model(model) - - # Reads data and generates mini-batches. - batchify_fn = lambda samples, fn=Tuple( - Pad(axis=0, pad_val=vocab.token_to_idx.get("[PAD]", 0)), # query_ids - Pad(axis=0, pad_val=vocab.token_to_idx.get("[PAD]", 0)), # title_ids - Stack(dtype="int64"), # query_seq_lens - Stack(dtype="int64"), # title_seq_lens - Stack(dtype="int64"), # label - ): [data for data in fn(samples)] - tokenizer = CharTokenizer(vocab, args.language, "../../../punctuations") - trans_fn = partial(convert_example, tokenizer=tokenizer, is_test=False, language=args.language) - train_loader = create_dataloader( - train_ds, trans_fn=trans_fn, batch_size=args.batch_size, mode="train", batchify_fn=batchify_fn - ) - dev_loader = create_dataloader( - dev_ds, trans_fn=trans_fn, batch_size=args.batch_size, mode="validation", batchify_fn=batchify_fn - ) - test_loader = create_dataloader( - test_ds, trans_fn=trans_fn, batch_size=args.batch_size, mode="test", batchify_fn=batchify_fn - ) - - optimizer = paddle.optimizer.Adam(parameters=model.parameters(), learning_rate=args.lr) - - # Defines loss and metric. - criterion = paddle.nn.CrossEntropyLoss() - metric = paddle.metric.Accuracy() - - model.prepare(optimizer, criterion, metric) - - # Loads pre-trained parameters. - if args.init_from_ckpt: - model.load(args.init_from_ckpt) - print("Loaded checkpoint from %s" % args.init_from_ckpt) - - # Starts training and evaluating. - model.fit( - train_loader, - dev_loader, - epochs=args.epochs, - save_dir=args.save_dir, - ) diff --git a/examples/model_interpretation/task/similarity/simnet/utils.py b/examples/model_interpretation/task/similarity/simnet/utils.py deleted file mode 100644 index b2161cd48ce2..000000000000 --- a/examples/model_interpretation/task/similarity/simnet/utils.py +++ /dev/null @@ -1,211 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License" -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
-# See the License for the specific language governing permissions and
-# limitations under the License.
-import numpy as np
-
-
-def convert_example(example, tokenizer, is_test=False, language="en"):
-    """
-    Builds model inputs from a sequence for sequence classification tasks.
-    It uses `jieba.cut` to tokenize the text.
-
-    Args:
-        example(obj:`list[str]`): List of input data, containing text and label if it has a label.
-        tokenizer(obj: paddlenlp.data.JiebaTokenizer): It uses jieba to cut the Chinese string.
-        is_test(obj:`bool`, defaults to `False`): Whether the example contains a label or not.
-
-    Returns:
-        query_ids(obj:`list[int]`): The list of query ids.
-        title_ids(obj:`list[int]`): The list of title ids.
-        query_seq_len(obj:`int`): The input sequence query length.
-        title_seq_len(obj:`int`): The input sequence title length.
-        label(obj:`numpy.array`, data type of int64, optional): The input label if not is_test.
-    """
-    if language == "ch":
-        q_name = "query"
-        t_name = "title"
-        label = "label"
-    else:
-        q_name = "sentence1"
-        t_name = "sentence2"
-        label = "labels"
-
-    query, title = example[q_name], example[t_name]
-    query_ids = np.array(tokenizer.encode(query), dtype="int64")
-    query_seq_len = np.array(len(query_ids), dtype="int64")
-    title_ids = np.array(tokenizer.encode(title), dtype="int64")
-    title_seq_len = np.array(len(title_ids), dtype="int64")
-    result = [query_ids, title_ids, query_seq_len, title_seq_len]
-    if not is_test:
-        label = np.array(example[label], dtype="int64")
-        result.append(label)
-    return result
-
-
-def preprocess_prediction_data(data, tokenizer):
-    """
-    It processes the prediction data into the same format as used in training.
-
-    Args:
-        data (obj:`List[List[str, str]]`):
-            The prediction data, each element of which is a text pair.
-            Each text will be tokenized by the jieba.lcut() function.
-        tokenizer(obj: paddlenlp.data.JiebaTokenizer): It uses jieba to cut the Chinese string.
-
-    Returns:
-        examples (obj:`list`): The processed data, each element of which
-            is a `list` object, which contains
-
-            - query_ids(obj:`list[int]`): The list of query ids.
-            - title_ids(obj:`list[int]`): The list of title ids.
-            - query_seq_len(obj:`int`): The input sequence query length.
-            - title_seq_len(obj:`int`): The input sequence title length.
-
-    """
-    examples = []
-    for query, title in data:
-        query_ids = tokenizer.encode(query)
-        title_ids = tokenizer.encode(title)
-        examples.append([query_ids, title_ids, len(query_ids), len(title_ids)])
-    return examples
-
-
-def preprocess_data(data, tokenizer, language):
-    """
-    It processes the prediction data into the same format as used in training.
-
-    Args:
-        data (obj:`List[List[str, str]]`):
-            The prediction data, each element of which is a text pair.
-            Each text will be tokenized by the jieba.lcut() function.
-        tokenizer(obj: paddlenlp.data.JiebaTokenizer): It uses jieba to cut the Chinese string.
-
-    Returns:
-        examples (obj:`list`): The processed data, each element of which
-            is a `list` object, which contains
-
-            - query_ids(obj:`list[int]`): The list of query ids.
-            - title_ids(obj:`list[int]`): The list of title ids.
-            - query_seq_len(obj:`int`): The input sequence query length.
-            - title_seq_len(obj:`int`): The input sequence title length.
- - """ - if language == "ch": - q_name = "query" - t_name = "title" - else: - q_name = "sentence1" - t_name = "sentence2" - examples = [] - for example in data: - query_ids = tokenizer.encode(example[q_name]) - title_ids = tokenizer.encode(example[t_name]) - examples.append([query_ids, title_ids, len(query_ids), len(title_ids)]) - return examples - - -def get_idx_from_word(word, word_to_idx, unk_word): - if word in word_to_idx: - return word_to_idx[word] - return word_to_idx[unk_word] - - -class CharTokenizer: - def __init__(self, vocab, language, vocab_path): - self.vocab = vocab - self.language = language - self.vocab_path = vocab_path - self.unk_token = [] - - def encode(self, sentence): - if self.language == "ch": - words = tokenizer_punc(sentence, self.vocab_path) - else: - words = sentence.strip().split() - return [get_idx_from_word(word, self.vocab.token_to_idx, self.vocab.unk_token) for word in words] - - def tokenize(self, sentence, wo_unk=True): - if self.language == "ch": - return tokenizer_punc(sentence, self.vocab_path) - else: - return sentence.strip().split() - - def convert_tokens_to_string(self, tokens): - return " ".join(tokens) - - def convert_tokens_to_ids(self, tokens): - return [get_idx_from_word(word, self.vocab.token_to_idx, self.vocab.unk_token) for word in tokens] - - -def tokenizer_lac(string, lac): - temp = "" - res = [] - for c in string: - if "\u4e00" <= c <= "\u9fff": - if temp != "": - res.extend(lac.run(temp)) - temp = "" - res.append(c) - else: - temp += c - if temp != "": - res.extend(lac.run(temp)) - return res - - -def tokenizer_punc(string, vocab_path): - res = [] - sub_string_list = string.strip().split("[MASK]") - for idx, sub_string in enumerate(sub_string_list): - temp = "" - for c in sub_string: - if "\u4e00" <= c <= "\u9fff": - if temp != "": - temp_seg = punc_split(temp, vocab_path) - res.extend(temp_seg) - temp = "" - res.append(c) - else: - temp += c - if temp != "": - temp_seg = punc_split(temp, vocab_path) - res.extend(temp_seg) - if idx < len(sub_string_list) - 1: - res.append("[MASK]") - return res - - -def punc_split(string, vocab_path): - punc_set = set() - with open(vocab_path, "r") as f: - for token in f: - punc_set.add(token.strip()) - punc_set.add(" ") - for ascii_num in range(65296, 65306): - punc_set.add(chr(ascii_num)) - for ascii_num in range(48, 58): - punc_set.add(chr(ascii_num)) - - res = [] - temp = "" - for c in string: - if c in punc_set: - if temp != "": - res.append(temp) - temp = "" - res.append(c) - else: - temp += c - if temp != "": - res.append(temp) - return res diff --git a/examples/model_interpretation/task/transformer.py b/examples/model_interpretation/task/transformer.py deleted file mode 100644 index 2504503739b0..000000000000 --- a/examples/model_interpretation/task/transformer.py +++ /dev/null @@ -1,1329 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
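# ---------------------------------------------------------------------------
# Editor's note (illustrative sketch, not part of the original diff): the
# Chinese branch of CharTokenizer above emits CJK characters one by one and
# keeps non-CJK runs together until punc_split breaks them on a punctuation
# vocabulary. A condensed, dependency-free version of that splitting rule;
# the punctuation set here is a tiny placeholder, not the real vocab file.
def split_cjk(text, punc=frozenset(", ?")):
    tokens, buf = [], ""
    for ch in text:
        if "\u4e00" <= ch <= "\u9fff" or ch in punc:  # CJK char or punctuation
            if buf:
                tokens.append(buf)                    # flush pending non-CJK run
                buf = ""
            tokens.append(ch)
        else:
            buf += ch                                 # accumulate ascii/word chars
    if buf:
        tokens.append(buf)
    return tokens

print(split_cjk("NLP很有趣, right?"))  # ['NLP', '很', '有', '趣', ',', ' ', 'right', '?']
# ---------------------------------------------------------------------------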
-
-# TODO: define the classes of Transformer neural network
-
-import collections
-import copy
-
-import numpy as np
-import paddle
-from paddle import ParamAttr, tensor
-from paddle.common_ops_import import convert_dtype
-from paddle.nn import Layer, LayerList
-from paddle.nn import functional as F
-from paddle.nn.layer.common import Dropout, Linear
-from paddle.nn.layer.norm import LayerNorm
-
-__all__ = []
-
-
-def _convert_param_attr_to_list(param_attr, n):
-    """
-    If `param_attr` is a list or tuple, convert every element in it to a
-    ParamAttr instance. Otherwise, repeat `param_attr` `n` times to
-    construct a list, and rename each one by appending an increasing index
-    suffix to avoid duplicate names when `param_attr` contains a name.
-
-    Parameters:
-        param_attr (list|tuple|ParamAttr): A list, tuple or anything that can be
-            converted to a ParamAttr instance by `ParamAttr._to_attr`.
-        n (int): The times to repeat to construct a list when `param_attr`
-            is not a list or tuple.
-
-    Returns:
-        list: A list composed of each cell's `param_attr`.
-    """
-    if isinstance(param_attr, (list, tuple)):
-        assert len(param_attr) == n, "length of param_attr should be %d when it is a list/tuple" % n
-        param_attrs = []
-        for attr in param_attr:
-            if isinstance(attr, bool):
-                if attr:
-                    param_attrs.append(ParamAttr._to_attr(None))
-                else:
-                    param_attrs.append(False)
-            else:
-                param_attrs.append(ParamAttr._to_attr(attr))
-    elif isinstance(param_attr, bool):
-        param_attrs = []
-        if param_attr:
-            param_attrs = [ParamAttr._to_attr(None) for i in range(n)]
-        else:
-            param_attrs = [False] * n
-    else:
-        param_attrs = []
-        attr = ParamAttr._to_attr(param_attr)
-        for i in range(n):
-            attr_i = copy.deepcopy(attr)
-            if attr.name:
-                attr_i.name = attr_i.name + "_" + str(i)
-            param_attrs.append(attr_i)
-    return param_attrs
-
-
-def _convert_attention_mask(attn_mask, dtype):
-    """
-    Convert the attention mask to the target dtype we expect.
-
-    Parameters:
-        attn_mask (Tensor, optional): A tensor used in multi-head attention
-            to prevent attention to some unwanted positions, usually the
-            paddings or the subsequent positions. It is a tensor with shape
-            broadcasted to `[batch_size, n_head, sequence_length, sequence_length]`.
-            When the data type is bool, the unwanted positions have `False`
-            values and the others have `True` values. When the data type is
-            int, the unwanted positions have 0 values and the others have 1
-            values. When the data type is float, the unwanted positions have
-            `-INF` values and the others have 0 values. It can be None when
-            no positions need to be masked out. Default None.
-        dtype (VarType): The target type of `attn_mask` we expect.
-
-    Returns:
-        Tensor: A Tensor with the same shape as the input `attn_mask`, with data type `dtype`.
-    """
-    if attn_mask is not None and attn_mask.dtype != dtype:
-        attn_mask_dtype = convert_dtype(attn_mask.dtype)
-        if attn_mask_dtype == "bool" or "int" in attn_mask_dtype:
-            attn_mask = (paddle.cast(attn_mask, dtype) - 1.0) * 1e9
-        else:
-            attn_mask = paddle.cast(attn_mask, dtype)
-    return attn_mask
-
-
-class MultiHeadAttention(Layer):
-    """
-    Attention maps queries and a set of key-value pairs to outputs, and
-    Multi-Head Attention performs multiple attention computations in parallel
-    to jointly attend to information from different representation subspaces.
-
-    Please refer to `Attention Is All You Need <https://arxiv.org/abs/1706.03762>`_
-    for more details.
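The head splitting this class performs is nothing more than a linear projection followed by a reshape and a transpose. A hedged, stand-alone sketch with made-up sizes (in a Paddle reshape, `0` keeps the corresponding input dimension):

```python
import paddle

batch, seq_len, embed_dim, num_heads = 2, 4, 128, 8
head_dim = embed_dim // num_heads  # 16

q = paddle.rand((batch, seq_len, embed_dim))        # [2, 4, 128]
q = paddle.reshape(q, [0, 0, num_heads, head_dim])  # [2, 4, 8, 16]
q = paddle.transpose(q, perm=[0, 2, 1, 3])          # [2, 8, 4, 16]
# i.e. [batch_size, num_heads, seq_len, head_dim], ready for per-head attention
```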
-
-    Parameters:
-        embed_dim (int): The expected feature size in the input and output.
-        num_heads (int): The number of heads in multi-head attention.
-        dropout (float, optional): The dropout probability used on attention
-            weights to drop some attention targets. 0 for no dropout. Default 0
-        kdim (int, optional): The feature size in key. If None, assumed equal to
-            `embed_dim`. Default None.
-        vdim (int, optional): The feature size in value. If None, assumed equal to
-            `embed_dim`. Default None.
-        need_weights (bool, optional): Indicate whether to return the attention
-            weights. Default False.
-        weight_attr(ParamAttr, optional): To specify the weight parameter property.
-            Default: None, which means the default weight parameter property is used.
-            See usage for details in :code:`ParamAttr` .
-        bias_attr (ParamAttr|bool, optional): To specify the bias parameter property.
-            Default: None, which means the default bias parameter property is used.
-            If it is set to False, this layer will not have trainable bias parameter.
-            See usage for details in :code:`ParamAttr` .
-
-    Examples:
-
-        .. code-block:: python
-
-            import paddle
-
-            # encoder input: [batch_size, sequence_length, d_model]
-            query = paddle.rand((2, 4, 128))
-            # self attention mask: [batch_size, num_heads, query_len, query_len]
-            attn_mask = paddle.rand((2, 2, 4, 4))
-            multi_head_attn = paddle.nn.MultiHeadAttention(128, 2)
-            output = multi_head_attn(query, None, None, attn_mask=attn_mask)  # [2, 4, 128]
-    """
-
-    Cache = collections.namedtuple("Cache", ["k", "v"])
-    StaticCache = collections.namedtuple("StaticCache", ["k", "v"])
-
-    def __init__(
-        self,
-        embed_dim,
-        num_heads,
-        dropout=0.0,
-        kdim=None,
-        vdim=None,
-        need_weights=False,
-        weight_attr=None,
-        bias_attr=None,
-    ):
-        super(MultiHeadAttention, self).__init__()
-        self.embed_dim = embed_dim
-        self.kdim = kdim if kdim is not None else embed_dim
-        self.vdim = vdim if vdim is not None else embed_dim
-        self.num_heads = num_heads
-        self.dropout = dropout
-        self.need_weights = need_weights
-
-        self.head_dim = embed_dim // num_heads
-        assert self.head_dim * num_heads == self.embed_dim, "embed_dim must be divisible by num_heads"
-
-        self.q_proj = Linear(embed_dim, embed_dim, weight_attr, bias_attr=bias_attr)
-        self.k_proj = Linear(self.kdim, embed_dim, weight_attr, bias_attr=bias_attr)
-        self.v_proj = Linear(self.vdim, embed_dim, weight_attr, bias_attr=bias_attr)
-        self.out_proj = Linear(embed_dim, embed_dim, weight_attr, bias_attr=bias_attr)
-
-    def _prepare_qkv(self, query, key, value, cache=None):
-        r"""
-        Prepares linearly projected queries, keys and values for use by the
-        subsequent multiple parallel attention heads. If `cache` is not None,
-        the cached results are used to reduce redundant calculations.
-
-        Parameters:
-            query (Tensor): The queries for multi-head attention. It is a
-                tensor with shape `[batch_size, query_length, embed_dim]`. The
-                data type should be float32 or float64.
-            key (Tensor): The keys for multi-head attention. It is
-                a tensor with shape `[batch_size, key_length, kdim]`. The
-                data type should be float32 or float64. If None, use `query` as
-                `key`.
-            value (Tensor): The values for multi-head attention. It
-                is a tensor with shape `[batch_size, value_length, vdim]`.
-                The data type should be float32 or float64. If None, use `query` as
-                `value`.
-            cache (MultiHeadAttention.Cache|MultiHeadAttention.StaticCache, optional):
-                It is a namedtuple with `k` and `v` as fields, and stores tensors
-                shaped `[batch_size, num_heads, length, embed_dim]` which are results
-                of linear projection, reshape and transpose calculations in
-                MultiHeadAttention. If it is an instance of `Cache`, `k` and `v`
-                fields reserve intermediate results of previous positions, which
-                are mostly used for decoder self attention. If it is an instance of
-                `StaticCache`, `key` and `value` args would be ignored, and `k` and
-                `v` fields would be used as the calculated results on `key` and
-                `value`, which are mostly used for decoder-encoder cross attention.
-                It is only used for inference and should be None for training.
-                Default None.
-
-        Returns:
-            tuple: A tuple including linearly projected keys and values. These two \
-                tensors have shapes `[batch_size, n_head, sequence_length, d_key]` \
-                and `[batch_size, n_head, sequence_length, d_value]` separately, \
-                and their data types are the same as the inputs.
-        """
-        q = self.q_proj(query)
-        q = tensor.reshape(x=q, shape=[0, 0, self.num_heads, self.head_dim])
-        q = tensor.transpose(x=q, perm=[0, 2, 1, 3])
-
-        if isinstance(cache, self.StaticCache):
-            # for encoder-decoder attention in inference, with cached keys/values
-            k, v = cache.k, cache.v
-        else:
-            k, v = self.compute_kv(key, value)
-
-        if isinstance(cache, self.Cache):
-            # for decoder self-attention in inference
-            k = tensor.concat([cache.k, k], axis=2)
-            v = tensor.concat([cache.v, v], axis=2)
-            cache = self.Cache(k, v)
-
-        return (q, k, v) if cache is None else (q, k, v, cache)
-
-    def compute_kv(self, key, value):
-        r"""
-        Applies linear projection on input keys and values, then splits heads
-        (reshape and transpose) to get keys and values from different representation
-        subspaces. The results are used as key-value pairs for the subsequent multiple
-        parallel attention heads.
-
-        It is part of the calculations in multi-head attention, and is provided as
-        a method to pre-compute and prefetch these results, so that we can use them
-        to construct the cache for inference.
-
-        Parameters:
-            key (Tensor): The keys for multi-head attention. It is a tensor
-                with shape `[batch_size, sequence_length, kdim]`. The data type
-                should be float32 or float64.
-            value (Tensor): The values for multi-head attention. It is a tensor
-                with shape `[batch_size, sequence_length, vdim]`. The data type
-                should be float32 or float64.
-
-        Returns:
-            tuple: A tuple including transformed keys and values. Their shapes \
-                both are `[batch_size, num_heads, sequence_length, embed_dim // num_heads]`, \
-                and their data types are the same as the inputs.
-        """
-        k = self.k_proj(key)
-        v = self.v_proj(value)
-        k = tensor.reshape(x=k, shape=[0, 0, self.num_heads, self.head_dim])
-        k = tensor.transpose(x=k, perm=[0, 2, 1, 3])
-        v = tensor.reshape(x=v, shape=[0, 0, self.num_heads, self.head_dim])
-        v = tensor.transpose(x=v, perm=[0, 2, 1, 3])
-        return k, v
-
-    def gen_cache(self, key, value=None, type=Cache):
-        """
-        Generates the cache for `forward` usage in inference according to the arguments.
-        The generated cache is an instance of `MultiHeadAttention.Cache` or an
-        instance of `MultiHeadAttention.StaticCache`.
-
-        `Cache` or `StaticCache` is a namedtuple with `k` and `v` as fields,
-        and it stores tensors shaped `[batch_size, num_heads, length, embed_dim]`
-        which are results of linear projection, reshape and transpose calculations
-        in MultiHeadAttention.
-
-        If the generated cache is an instance of `Cache`, `k` and `v` fields
-        reserve intermediate result tensors of previous positions, and the tensors
-        are incremental among decoding steps, which are mostly used for decoder
-        self attention.
-
-        If the generated cache is an instance of `StaticCache`, `k` and `v` fields
-        would be used as the calculated result tensors on keys and values in `forward`,
-        and the tensors keep unchanged among decoding steps, which are mostly used
-        for decoder-encoder cross attention.
-
-        The cache is generated as follows:
-
-        1. If `type` is `StaticCache`, apply `compute_kv(key, value)` and use the
-        results to create an instance of `StaticCache`.
-
-        2. If `type` is `Cache` and `value` is None, generate empty tensors shaped
-        `[batch_size, num_heads, 0, embed_dim // num_heads]` and use the results
-        to create an instance of `Cache`, where `batch_size` is from the first
-        dimension of `key`.
-
-        3. If `type` is `Cache` and `value` is not None, use `key`, `value` to create
-        an instance of `Cache`.
-
-        Parameters:
-            key (Tensor): The keys for multi-head attention. It is
-                a tensor with shape `[batch_size, key_length, kdim]`. The
-                data type should be float32 or float64. If `value` is None,
-                it is only for batch size and data type reference.
-            value (Tensor, optional): The values for multi-head attention. It
-                is a tensor with shape `[batch_size, value_length, vdim]`.
-                The data type should be float32 or float64. If None, `key` is only
-                for batch size reference. Default None.
-            type (type): It should be `MultiHeadAttention.StaticCache` or
-                `MultiHeadAttention.Cache` to indicate the cache type to generate.
-
-        Returns:
-            namedtuple: an instance of `Cache` or `StaticCache` accordingly.
-        """
-        if type == MultiHeadAttention.StaticCache:  # static_kv
-            k, v = self.compute_kv(key, value)
-            return self.StaticCache(k, v)
-        elif value is None:  # incremental_state
-            # Note: k and v must have identical shapes so that new keys/values can
-            # be concatenated with them along axis 2 in `_prepare_qkv`.
-            k = paddle.full(shape=[key.shape[0], self.num_heads, 0, self.head_dim], dtype=key.dtype, fill_value=0)
-            v = paddle.full(shape=[key.shape[0], self.num_heads, 0, self.head_dim], dtype=key.dtype, fill_value=0)
-            return self.Cache(k, v)
-        else:
-            # incremental_state with initial value, mainly for usage like UniLM
-            return self.Cache(key, value)
-
-    def forward(self, query, key=None, value=None, attn_mask=None, cache=None):
-        r"""
-        Applies multi-head attention to map queries and a set of key-value pairs
-        to outputs.
-
-        Parameters:
-            query (Tensor): The queries for multi-head attention. It is a
-                tensor with shape `[batch_size, query_length, embed_dim]`. The
-                data type should be float32 or float64.
-            key (Tensor, optional): The keys for multi-head attention. It is
-                a tensor with shape `[batch_size, key_length, kdim]`. The
-                data type should be float32 or float64. If None, use `query` as
-                `key`. Default None.
-            value (Tensor, optional): The values for multi-head attention. It
-                is a tensor with shape `[batch_size, value_length, vdim]`.
-                The data type should be float32 or float64. If None, use `query` as
-                `value`. Default None.
-            attn_mask (Tensor, optional): A tensor used in multi-head attention
-                to prevent attention to some unwanted positions, usually the
-                paddings or the subsequent positions. It is a tensor with shape
-                broadcasted to `[batch_size, n_head, sequence_length, sequence_length]`.
-                When the data type is bool, the unwanted positions have `False`
-                values and the others have `True` values. When the data type is
-                int, the unwanted positions have 0 values and the others have 1
-                values. When the data type is float, the unwanted positions have
-                `-INF` values and the others have 0 values. It can be None when
-                no positions need to be masked out. Default None.
-            cache (MultiHeadAttention.Cache|MultiHeadAttention.StaticCache, optional):
-                It is a namedtuple with `k` and `v` as fields, and stores tensors
-                shaped `[batch_size, num_heads, length, embed_dim]` which are results
-                of linear projection, reshape and transpose calculations in
-                MultiHeadAttention. If it is an instance of `Cache`, `k` and `v`
-                fields reserve intermediate results of previous positions, which
-                are mostly used for decoder self attention. If it is an instance of
-                `StaticCache`, `key` and `value` args would be ignored, and `k` and
-                `v` fields would be used as the calculated results on `key` and
-                `value`, which are mostly used for decoder-encoder cross attention.
-                It is only used for inference and should be None for training.
-                Default None.
-
-        Returns:
-            Tensor|tuple: It is a tensor that has the same shape and data type \
-                as `query`, representing the attention output. Or a tuple if \
-                `need_weights` is True or `cache` is not None. If `need_weights` \
-                is True, except for the attention output, the tuple also includes \
-                the attention weights tensor shaped `[batch_size, num_heads, query_length, key_length]`. \
-                If `cache` is not None, the tuple then includes the new cache \
-                having the same type as `cache`, and if it is `StaticCache`, it \
-                is the same as the input `cache`; if it is `Cache`, the new cache \
-                reserves tensors concatenating the raw tensors with intermediate \
-                results of the current query.
-        """
-        key = query if key is None else key
-        value = query if value is None else value
-        # compute q, k, v
-        if cache is None:
-            q, k, v = self._prepare_qkv(query, key, value, cache)
-        else:
-            q, k, v, cache = self._prepare_qkv(query, key, value, cache)
-
-        # scaled dot-product attention
-        product = paddle.matmul(x=q * (self.head_dim**-0.5), y=k, transpose_y=True)
-        if attn_mask is not None:
-            # Support bool or int mask
-            attn_mask = _convert_attention_mask(attn_mask, product.dtype)
-            product = product + attn_mask
-        weights = F.softmax(product)
-        if self.dropout:
-            weights = F.dropout(weights, self.dropout, training=self.training, mode="upscale_in_train")
-
-        out = tensor.matmul(weights, v)
-
-        # combine heads
-        out = tensor.transpose(out, perm=[0, 2, 1, 3])
-        out = tensor.reshape(x=out, shape=[0, 0, out.shape[2] * out.shape[3]])
-
-        # project to output
-        out = self.out_proj(out)
-
-        outs = [out]
-        if self.need_weights:
-            outs.append(weights)
-        if cache is not None:
-            outs.append(cache)
-        return out if len(outs) == 1 else tuple(outs)
-
-
-class TransformerEncoderLayer(Layer):
-    """
-    TransformerEncoderLayer is composed of two sub-layers, which are self (multi-head)
-    attention and a feedforward network. Before and after each sub-layer, pre-processing
-    and post-processing are applied to the input and output accordingly. If
-    `normalize_before` is True, pre-processing is layer normalization and post-processing
-    includes dropout and residual connection. Otherwise, there is no pre-processing, and
-    post-processing includes dropout, residual connection and layer normalization.
-
-    Parameters:
-        d_model (int): The expected feature size in the input and output.
-        nhead (int): The number of heads in multi-head attention(MHA).
-        dim_feedforward (int): The hidden layer size in the feedforward network(FFN).
-        dropout (float, optional): The dropout probability used in pre-processing
-            and post-processing of the MHA and FFN sub-layers. Default 0.1
-        activation (str, optional): The activation function in the feedforward
-            network. Default relu.
-        attn_dropout (float, optional): The dropout probability used
-            in MHA to drop some attention targets. If None, use the value of
-            `dropout`. Default None
-        act_dropout (float, optional): The dropout probability used after the FFN
-            activation. If None, use the value of `dropout`. Default None
-        normalize_before (bool, optional): Indicate whether to put layer normalization
-            into the pre-processing of the MHA and FFN sub-layers. If True, pre-processing
-            is layer normalization and post-processing includes dropout and residual
-            connection. Otherwise, there is no pre-processing, and post-processing
-            includes dropout, residual connection and layer normalization. Default False
-        weight_attr(ParamAttr|list|tuple, optional): To specify the weight parameter property.
-            If it is a list/tuple, `weight_attr[0]` would be used as `weight_attr` for
-            MHA, and `weight_attr[1]` would be used as `weight_attr` for linear in FFN.
-            Otherwise, MHA and FFN both use it as `weight_attr` to create parameters.
-            Default: None, which means the default weight parameter property is used.
-            See usage for details in :code:`ParamAttr` .
-        bias_attr (ParamAttr|list|tuple|bool, optional): To specify the bias parameter property.
-            If it is a list/tuple, `bias_attr[0]` would be used as `bias_attr` for
-            MHA, and `bias_attr[1]` would be used as `bias_attr` for linear in FFN.
-            Otherwise, MHA and FFN both use it as `bias_attr` to create parameters.
-            The `False` value means the corresponding layer would not have trainable
-            bias parameter. See usage for details in :code:`ParamAttr` . Default: None,
-            which means the default bias parameter property is used.
-
-
-    Examples:
-
-        .. 
code-block:: python - - import paddle - from paddle.nn import TransformerEncoderLayer - - # encoder input: [batch_size, src_len, d_model] - enc_input = paddle.rand((2, 4, 128)) - # self attention mask: [batch_size, n_head, src_len, src_len] - attn_mask = paddle.rand((2, 2, 4, 4)) - encoder_layer = TransformerEncoderLayer(128, 2, 512) - enc_output = encoder_layer(enc_input, attn_mask) # [2, 4, 128] - """ - - def __init__( - self, - d_model, - nhead, - dim_feedforward, - dropout=0.1, - activation="relu", - attn_dropout=None, - act_dropout=None, - normalize_before=False, - weight_attr=None, - bias_attr=None, - ): - self._config = locals() - self._config.pop("self") - self._config.pop("__class__", None) # py3 - - super(TransformerEncoderLayer, self).__init__() - attn_dropout = dropout if attn_dropout is None else attn_dropout - act_dropout = dropout if act_dropout is None else act_dropout - self.normalize_before = normalize_before - - weight_attrs = _convert_param_attr_to_list(weight_attr, 2) - bias_attrs = _convert_param_attr_to_list(bias_attr, 2) - - self.self_attn = MultiHeadAttention( - d_model, - nhead, - dropout=attn_dropout, - need_weights=True, # interpret - weight_attr=weight_attrs[0], - bias_attr=bias_attrs[0], - ) - self.linear1 = Linear(d_model, dim_feedforward, weight_attrs[1], bias_attr=bias_attrs[1]) - self.dropout = Dropout(act_dropout, mode="upscale_in_train") - self.linear2 = Linear(dim_feedforward, d_model, weight_attrs[1], bias_attr=bias_attrs[1]) - self.norm1 = LayerNorm(d_model) - self.norm2 = LayerNorm(d_model) - self.dropout1 = Dropout(dropout, mode="upscale_in_train") - self.dropout2 = Dropout(dropout, mode="upscale_in_train") - self.activation = getattr(F, activation) - - def forward(self, src, src_mask=None, cache=None): - r""" - Applies a Transformer encoder layer on the input. - - Parameters: - src (Tensor): The input of Transformer encoder layer. It is - a tensor with shape `[batch_size, sequence_length, d_model]`. - The data type should be float32 or float64. - src_mask (Tensor, optional): A tensor used in multi-head attention - to prevents attention to some unwanted positions, usually the - paddings or the subsequent positions. It is a tensor with shape - broadcasted to `[batch_size, n_head, sequence_length, sequence_length]`. - When the data type is bool, the unwanted positions have `False` - values and the others have `True` values. When the data type is - int, the unwanted positions have 0 values and the others have 1 - values. When the data type is float, the unwanted positions have - `-INF` values and the others have 0 values. It can be None when - nothing wanted or needed to be prevented attention to. Default None. - cache (Tensor, optional): It is an instance of `MultiHeadAttention.Cache`. - See `TransformerEncoderLayer.gen_cache` for more details. It is - only used for inference and should be None for training. Default - None. - - Returns: - Tensor|tuple: It is a tensor that has the same shape and data type \ - as `enc_input`, representing the output of Transformer encoder \ - layer. Or a tuple if `cache` is not None, except for encoder \ - layer output, the tuple includes the new cache which is same \ - as input `cache` argument but `incremental_cache` has an \ - incremental length. See `MultiHeadAttention.gen_cache` and \ - `MultiHeadAttention.forward` for more details. 
- """ - src_mask = _convert_attention_mask(src_mask, src.dtype) - - residual = src - if self.normalize_before: - src = self.norm1(src) - # Add cache for encoder for the usage like UniLM - if cache is None: - # src = self.self_attn(src, src, src, src_mask) - src, att_weights = self.self_attn(src, src, src, src_mask) # interpret - else: - # src, incremental_cache = self.self_attn(src, src, src, src_mask, cache) - src, att_weights, incremental_cache = self.self_attn(src, src, src, src_mask, cache) # interpret - - src = residual + self.dropout1(src) - if not self.normalize_before: - src = self.norm1(src) - - residual = src - if self.normalize_before: - src = self.norm2(src) - src = self.linear2(self.dropout(self.activation(self.linear1(src)))) - src = residual + self.dropout2(src) - if not self.normalize_before: - src = self.norm2(src) - # return src if cache is None else (src, incremental_cache) - return (src, att_weights) if cache is None else (src, att_weights, incremental_cache) # interpret - - def gen_cache(self, src): - r""" - Generates cache for `forward` usage. The generated cache is an - instance of `MultiHeadAttention.Cache`. - - Parameters: - src (Tensor): The input of Transformer encoder. It is a tensor - with shape `[batch_size, source_length, d_model]`. The data - type should be float32 or float64. - - Returns: - incremental_cache: It is an instance of `MultiHeadAttention.Cache` \ - produced by `self_attn.gen_cache`, it reserves two tensors - shaped `[batch_size, nhead, 0, d_model // nhead]`. See \ - `MultiHeadAttention.gen_cache` and `MultiHeadAttention.forward` \ - for more details. - """ - incremental_cache = self.self_attn.gen_cache(src, type=self.self_attn.Cache) - return incremental_cache - - -class TransformerEncoder(Layer): - """ - TransformerEncoder is a stack of N encoder layers. - - Parameters: - encoder_layer (Layer): an instance of the `TransformerEncoderLayer`. It - would be used as the first layer, and the other layers would be created - according to the configurations of it. - num_layers (int): The number of encoder layers to be stacked. - norm (LayerNorm, optional): the layer normalization component. If provided, - apply layer normalization on the output of last encoder layer. - - Examples: - - .. code-block:: python - - import paddle - from paddle.nn import TransformerEncoderLayer, TransformerEncoder - - # encoder input: [batch_size, src_len, d_model] - enc_input = paddle.rand((2, 4, 128)) - # self attention mask: [batch_size, n_head, src_len, src_len] - attn_mask = paddle.rand((2, 2, 4, 4)) - encoder_layer = TransformerEncoderLayer(128, 2, 512) - encoder = TransformerEncoder(encoder_layer, 2) - enc_output = encoder(enc_input, attn_mask) # [2, 4, 128] - """ - - def __init__(self, encoder_layer, num_layers, norm=None): - super(TransformerEncoder, self).__init__() - self.layers = LayerList( - [(encoder_layer if i == 0 else type(encoder_layer)(**encoder_layer._config)) for i in range(num_layers)] - ) - self.num_layers = num_layers - self.norm = norm - - def forward(self, src, src_mask=None, cache=None): - r""" - Applies a stack of N Transformer encoder layers on inputs. If `norm` is - provided, also applies layer normalization on the output of last encoder - layer. - - Parameters: - src (Tensor): The input of Transformer encoder. It is a tensor - with shape `[batch_size, sequence_length, d_model]`. The data - type should be float32 or float64. 
- src_mask (Tensor, optional): A tensor used in multi-head attention - to prevents attention to some unwanted positions, usually the - paddings or the subsequent positions. It is a tensor with shape - broadcasted to `[batch_size, n_head, sequence_length, sequence_length]`. - When the data type is bool, the unwanted positions have `False` - values and the others have `True` values. When the data type is - int, the unwanted positions have 0 values and the others have 1 - values. When the data type is float, the unwanted positions have - `-INF` values and the others have 0 values. It can be None when - nothing wanted or needed to be prevented attention to. Default None. - cache (list, optional): It is a list, and each element in the list - is `incremental_cache` produced by `TransformerEncoderLayer.gen_cache`. - See `TransformerEncoder.gen_cache` for more details. It is only - used for inference and should be None for training. Default None. - - Returns: - Tensor|tuple: It is a tensor that has the same shape and data type \ - as `src`, representing the output of Transformer encoder. \ - Or a tuple if `cache` is not None, except for encoder output, \ - the tuple includes the new cache which is same as input `cache` \ - argument but `incremental_cache` in it has an incremental length. \ - See `MultiHeadAttention.gen_cache` and `MultiHeadAttention.forward` \ - for more details. - """ - src_mask = _convert_attention_mask(src_mask, src.dtype) - - output = src - att_weights_list = [] # interpret - new_caches = [] - for i, mod in enumerate(self.layers): - if cache is None: - # output = mod(output, src_mask=src_mask) - output, att_weights = mod(output, src_mask=src_mask) # interpret - att_weights_list.append(att_weights) - else: - # output, new_cache = mod(output, src_mask=src_mask, cache=cache[i]) - output, att_weights, new_cache = mod(output, src_mask=src_mask, cache=cache[i]) # interpret - att_weights_list.append(att_weights) - new_caches.append(new_cache) - - if self.norm is not None: - output = self.norm(output) - - # return output if cache is None else (output, new_caches) - return (output, att_weights_list) if cache is None else (output, att_weights_list, new_caches) # interpret - - def gen_cache(self, src): - r""" - Generates cache for `forward` usage. The generated cache is a list, and - each element in it is `incremental_cache` produced by - `TransformerEncoderLayer.gen_cache`. See `TransformerEncoderLayer.gen_cache` - for more details. - - Parameters: - src (Tensor): The input of Transformer encoder. It is a tensor - with shape `[batch_size, source_length, d_model]`. The data type - should be float32 or float64. - - Returns: - list: It is a list, and each element in the list is `incremental_cache` - produced by `TransformerEncoderLayer.gen_cache`. See - `TransformerEncoderLayer.gen_cache` for more details. - """ - cache = [layer.gen_cache(src) for layer in self.layers] - return cache - - -class TransformerDecoderLayer(Layer): - """ - TransformerDecoderLayer is composed of three sub-layers which are decoder - self (multi-head) attention, decoder-encoder cross attention and feedforward - network. Before and after each sub-layer, pre-process and post-precess would - be applied on the input and output accordingly. If `normalize_before` is True, - pre-process is layer normalization and post-precess includes dropout, residual - connection. Otherwise, no pre-process and post-precess includes dropout, residual - connection, layer normalization. 
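The `normalize_before` behaviour described above is the usual pre-norm versus post-norm distinction. A minimal illustrative sketch of the two sub-layer orderings (`sublayer`, `norm` and `dropout_p` are placeholder names, not identifiers from this file):

```python
import paddle.nn.functional as F

def post_norm(x, sublayer, norm, dropout_p=0.1):
    # normalize_before=False: dropout + residual first, LayerNorm last
    return norm(x + F.dropout(sublayer(x), p=dropout_p))

def pre_norm(x, sublayer, norm, dropout_p=0.1):
    # normalize_before=True: LayerNorm first, residual added afterwards
    return x + F.dropout(sublayer(norm(x)), p=dropout_p)
```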
-
-    Parameters:
-        d_model (int): The expected feature size in the input and output.
-        nhead (int): The number of heads in multi-head attention(MHA).
-        dim_feedforward (int): The hidden layer size in the feedforward network(FFN).
-        dropout (float, optional): The dropout probability used in pre-processing
-            and post-processing of the MHA and FFN sub-layers. Default 0.1
-        activation (str, optional): The activation function in the feedforward
-            network. Default relu.
-        attn_dropout (float, optional): The dropout probability used
-            in MHA to drop some attention targets. If None, use the value of
-            `dropout`. Default None
-        act_dropout (float, optional): The dropout probability used after the FFN
-            activation. If None, use the value of `dropout`. Default None
-        normalize_before (bool, optional): Indicate whether to put layer normalization
-            into the pre-processing of the MHA and FFN sub-layers. If True, pre-processing
-            is layer normalization and post-processing includes dropout and residual
-            connection. Otherwise, there is no pre-processing, and post-processing
-            includes dropout, residual connection and layer normalization. Default False
-        weight_attr(ParamAttr|list|tuple, optional): To specify the weight parameter property.
-            If it is a list/tuple, `weight_attr[0]` would be used as `weight_attr` for
-            self attention, `weight_attr[1]` would be used as `weight_attr` for
-            cross attention, and `weight_attr[2]` would be used as `weight_attr`
-            for linear in FFN. Otherwise, the three sub-layers all use it as
-            `weight_attr` to create parameters. Default: None, which means the
-            default weight parameter property is used. See usage for details
-            in :ref:`api_paddle_ParamAttr` .
-        bias_attr (ParamAttr|list|tuple|bool, optional): To specify the bias parameter property.
-            If it is a list/tuple, `bias_attr[0]` would be used as `bias_attr` for
-            self attention, `bias_attr[1]` would be used as `bias_attr` for
-            cross attention, and `bias_attr[2]` would be used as `bias_attr`
-            for linear in FFN. Otherwise, the three sub-layers all use it as
-            `bias_attr` to create parameters. The `False` value means the
-            corresponding layer would not have trainable bias parameter. See
-            usage for details in :code:`ParamAttr` . Default: None, which means
-            the default bias parameter property is used.
-
-    Examples:
-
-        .. 
code-block:: python - - import paddle - from paddle.nn import TransformerDecoderLayer - - # decoder input: [batch_size, tgt_len, d_model] - dec_input = paddle.rand((2, 4, 128)) - # encoder output: [batch_size, src_len, d_model] - enc_output = paddle.rand((2, 6, 128)) - # self attention mask: [batch_size, n_head, tgt_len, tgt_len] - self_attn_mask = paddle.rand((2, 2, 4, 4)) - # cross attention mask: [batch_size, n_head, tgt_len, src_len] - cross_attn_mask = paddle.rand((2, 2, 4, 6)) - decoder_layer = TransformerDecoderLayer(128, 2, 512) - output = decoder_layer(dec_input, - enc_output, - self_attn_mask, - cross_attn_mask) # [2, 4, 128] - """ - - def __init__( - self, - d_model, - nhead, - dim_feedforward, - dropout=0.1, - activation="relu", - attn_dropout=None, - act_dropout=None, - normalize_before=False, - weight_attr=None, - bias_attr=None, - ): - self._config = locals() - self._config.pop("self") - self._config.pop("__class__", None) # py3 - - super(TransformerDecoderLayer, self).__init__() - attn_dropout = dropout if attn_dropout is None else attn_dropout - act_dropout = dropout if act_dropout is None else act_dropout - self.normalize_before = normalize_before - - weight_attrs = _convert_param_attr_to_list(weight_attr, 3) - bias_attrs = _convert_param_attr_to_list(bias_attr, 3) - - self.self_attn = MultiHeadAttention( - d_model, nhead, dropout=attn_dropout, weight_attr=weight_attrs[0], bias_attr=bias_attrs[0] - ) - self.cross_attn = MultiHeadAttention( - d_model, nhead, dropout=attn_dropout, weight_attr=weight_attrs[1], bias_attr=bias_attrs[1] - ) - self.linear1 = Linear(d_model, dim_feedforward, weight_attrs[2], bias_attr=bias_attrs[2]) - self.dropout = Dropout(act_dropout, mode="upscale_in_train") - self.linear2 = Linear(dim_feedforward, d_model, weight_attrs[2], bias_attr=bias_attrs[2]) - self.norm1 = LayerNorm(d_model) - self.norm2 = LayerNorm(d_model) - self.norm3 = LayerNorm(d_model) - self.dropout1 = Dropout(dropout, mode="upscale_in_train") - self.dropout2 = Dropout(dropout, mode="upscale_in_train") - self.dropout3 = Dropout(dropout, mode="upscale_in_train") - self.activation = getattr(F, activation) - - def forward(self, tgt, memory, tgt_mask=None, memory_mask=None, cache=None): - r""" - Applies a Transformer decoder layer on the input. - - Parameters: - tgt (Tensor): The input of Transformer decoder layer. It is a tensor - with shape `[batch_size, target_length, d_model]`. The data type - should be float32 or float64. - memory (Tensor): The output of Transformer encoder. It is a tensor - with shape `[batch_size, source_length, d_model]`. The data type - should be float32 or float64. - tgt_mask (Tensor, optional): A tensor used in self attention - to prevents attention to some unwanted positions, usually the - the subsequent positions. It is a tensor with shape broadcasted - to `[batch_size, n_head, target_length, target_length]`. - When the data type is bool, the unwanted positions have `False` - values and the others have `True` values. When the data type is - int, the unwanted positions have 0 values and the others have 1 - values. When the data type is float, the unwanted positions have - `-INF` values and the others have 0 values. It can be None when - nothing wanted or needed to be prevented attention to. Default None. - memory_mask (Tensor, optional): A tensor used in decoder-encoder - cross attention to prevents attention to some unwanted positions, - usually the paddings. It is a tensor with shape broadcasted to - `[batch_size, n_head, target_length, source_length]`. 
When the - data type is bool, the unwanted positions have `False` values - and the others have `True` values. When the data type is int, - the unwanted positions have 0 values and the others have 1 - values. When the data type is float, the unwanted positions have - `-INF` values and the others have 0 values. It can be None when - nothing wanted or needed to be prevented attention to. Default None. - cache (tuple, optional): It is a tuple( :code:`(incremental_cache, static_cache)` ), - `incremental_cache` is an instance of `MultiHeadAttention.Cache`, - `static_cache` is an instance of `MultiHeadAttention.StaticCache. - See `TransformerDecoderLayer.gen_cache` for more details. It is - only used for inference and should be None for training. Default - None. - - Returns: - Tensor|tuple: It is a tensor that has the same shape and data type \ - as `tgt`, representing the output of Transformer decoder layer. \ - Or a tuple if `cache` is not None, except for decoder layer output, \ - the tuple includes the new cache which is same as input `cache` \ - argument but `incremental_cache` in it has an incremental length. \ - See `MultiHeadAttention.gen_cache` and `MultiHeadAttention.forward` \ - for more details. - """ - tgt_mask = _convert_attention_mask(tgt_mask, tgt.dtype) - memory_mask = _convert_attention_mask(memory_mask, memory.dtype) - - residual = tgt - if self.normalize_before: - tgt = self.norm1(tgt) - if cache is None: - tgt = self.self_attn(tgt, tgt, tgt, tgt_mask, None) - else: - tgt, incremental_cache = self.self_attn(tgt, tgt, tgt, tgt_mask, cache[0]) - tgt = residual + self.dropout1(tgt) - if not self.normalize_before: - tgt = self.norm1(tgt) - - residual = tgt - if self.normalize_before: - tgt = self.norm2(tgt) - if cache is None: - tgt = self.cross_attn(tgt, memory, memory, memory_mask, None) - else: - tgt, static_cache = self.cross_attn(tgt, memory, memory, memory_mask, cache[1]) - tgt = residual + self.dropout2(tgt) - if not self.normalize_before: - tgt = self.norm2(tgt) - - residual = tgt - if self.normalize_before: - tgt = self.norm3(tgt) - tgt = self.linear2(self.dropout(self.activation(self.linear1(tgt)))) - tgt = residual + self.dropout3(tgt) - if not self.normalize_before: - tgt = self.norm3(tgt) - return tgt if cache is None else (tgt, (incremental_cache, static_cache)) - - def gen_cache(self, memory): - r""" - Generates cache for `forward` usage. The generated cache is a tuple - composed of an instance of `MultiHeadAttention.Cache` and an instance - of `MultiHeadAttention.StaticCache`. - - Parameters: - memory (Tensor): The output of Transformer encoder. It is a tensor - with shape `[batch_size, source_length, d_model]`. The data type - should be float32 or float64. - - Returns: - tuple: It is a tuple( :code:`(incremental_cache, static_cache)` ). \ - `incremental_cache` is an instance of `MultiHeadAttention.Cache` \ - produced by `self_attn.gen_cache(memory, MultiHeadAttention.Cache)`, \ - it reserves two tensors shaped `[batch_size, nhead, 0, d_model // nhead]`. \ - `static_cache` is an instance of `MultiHeadAttention.StaticCache` \ - produced by `cross_attn.gen_cache(memory, MultiHeadAttention.StaticCache)`, \ - it reserves two tensors shaped `[batch_size, nhead, source_length, d_model // nhead]`. - See `MultiHeadAttention.gen_cache` and `MultiHeadAttention.forward` \ - for more details. 
- """ - incremental_cache = self.self_attn.gen_cache(memory, type=self.self_attn.Cache) - static_cache = self.cross_attn.gen_cache(memory, memory, type=self.cross_attn.StaticCache) - return incremental_cache, static_cache - - -class TransformerDecoder(Layer): - """ - TransformerDecoder is a stack of N decoder layers. - - Parameters: - decoder_layer (Layer): an instance of the `TransformerDecoderLayer`. It - would be used as the first layer, and the other layers would be created - according to the configurations of it. - num_layers (int): The number of decoder layers to be stacked. - norm (LayerNorm, optional): the layer normalization component. If provided, - apply layer normalization on the output of last encoder layer. - - Examples: - - .. code-block:: python - - import paddle - from paddle.nn import TransformerDecoderLayer, TransformerDecoder - - # decoder input: [batch_size, tgt_len, d_model] - dec_input = paddle.rand((2, 4, 128)) - # encoder output: [batch_size, src_len, d_model] - enc_output = paddle.rand((2, 6, 128)) - # self attention mask: [batch_size, n_head, tgt_len, tgt_len] - self_attn_mask = paddle.rand((2, 2, 4, 4)) - # cross attention mask: [batch_size, n_head, tgt_len, src_len] - cross_attn_mask = paddle.rand((2, 2, 4, 6)) - decoder_layer = TransformerDecoderLayer(128, 2, 512) - decoder = TransformerDecoder(decoder_layer, 2) - output = decoder(dec_input, - enc_output, - self_attn_mask, - cross_attn_mask) # [2, 4, 128] - """ - - def __init__(self, decoder_layer, num_layers, norm=None): - super(TransformerDecoder, self).__init__() - self.layers = LayerList( - [(decoder_layer if i == 0 else type(decoder_layer)(**decoder_layer._config)) for i in range(num_layers)] - ) - self.num_layers = num_layers - self.norm = norm - - def forward(self, tgt, memory, tgt_mask=None, memory_mask=None, cache=None): - r""" - Applies a stack of N Transformer decoder layers on inputs. If `norm` is - provided, also applies layer normalization on the output of last decoder - layer. - - Parameters: - tgt (Tensor): The input of Transformer decoder. It is a tensor - with shape `[batch_size, target_length, d_model]`. The data type - should be float32 or float64. - memory (Tensor): The output of Transformer encoder. It is a tensor - with shape `[batch_size, source_length, d_model]`. The data type - should be float32 or float64. - tgt_mask (Tensor, optional): A tensor used in self attention - to prevents attention to some unwanted positions, usually the - the subsequent positions. It is a tensor with shape broadcasted - to `[batch_size, n_head, target_length, target_length]`. When - the data type is bool, the unwanted positions have `False` - values and the others have `True` values. When the data type is - int, the unwanted positions have 0 values and the others have 1 - values. When the data type is float, the unwanted positions have - `-INF` values and the others have 0 values. It can be None when - nothing wanted or needed to be prevented attention to. Default None. - memory_mask (Tensor, optional): A tensor used in decoder-encoder - cross attention to prevents attention to some unwanted positions, - usually the paddings. It is a tensor with shape broadcasted to - `[batch_size, n_head, target_length, source_length]`. When the - data type is bool, the unwanted positions have `False` values - and the others have `True` values. When the data type is int, - the unwanted positions have 0 values and the others have 1 - values. 
When the data type is float, the unwanted positions have - `-INF` values and the others have 0 values. It can be None when - nothing wanted or needed to be prevented attention to. Default None. - cache (list, optional): It is a list, and each element in the list - is a tuple( :code:`(incremental_cache, static_cache)` ). See - `TransformerDecoder.gen_cache` for more details. It is only - used for inference and should be None for training. Default None. - - Returns: - Tensor|tuple: It is a tensor that has the same shape and data type \ - as `tgt`, representing the output of Transformer decoder. \ - Or a tuple if `cache` is not None, except for decoder output, \ - the tuple includes the new cache which is same as input `cache` \ - argument but `incremental_cache` in it has an incremental length. \ - See `MultiHeadAttention.gen_cache` and `MultiHeadAttention.forward` \ - for more details. - """ - tgt_mask = _convert_attention_mask(tgt_mask, tgt.dtype) - memory_mask = _convert_attention_mask(memory_mask, memory.dtype) - - output = tgt - new_caches = [] - for i, mod in enumerate(self.layers): - if cache is None: - output = mod(output, memory, tgt_mask=tgt_mask, memory_mask=memory_mask, cache=None) - else: - output, new_cache = mod(output, memory, tgt_mask=tgt_mask, memory_mask=memory_mask, cache=cache[i]) - new_caches.append(new_cache) - - if self.norm is not None: - output = self.norm(output) - - return output if cache is None else (output, new_caches) - - def gen_cache(self, memory, do_zip=False): - r""" - Generates cache for `forward` usage. The generated cache is a list, and - each element in it is a tuple( :code:`(incremental_cache, static_cache)` ) - produced by `TransformerDecoderLayer.gen_cache`. See `TransformerDecoderLayer.gen_cache` - for more details. If `do_zip` is True, apply `zip` on these tuples to get - a list with two elements. - - - Parameters: - memory (Tensor): The output of Transformer encoder. It is a tensor - with shape `[batch_size, source_length, d_model]`. The data type - should be float32 or float64. - do_zip (bool, optional): Indicate whether to apply `zip` on the tuples. - If True, return a list with two elements. Default False - - Returns: - list: It is a list, and each element in the list is a tuple produced \ - by `TransformerDecoderLayer.gen_cache(memory)`. See `TransformerDecoderLayer.gen_cache` \ - for more details. If `do_zip` is True, apply `zip` on these tuples \ - and return a list with two elements. - """ - cache = [layer.gen_cache(memory) for layer in self.layers] - if do_zip: - cache = list(zip(*cache)) - return cache - - -class Transformer(Layer): - """ - A Transformer model composed of an instance of `TransformerEncoder` and an - instance of `TransformerDecoder`. While the embedding layer and output layer - are not included. - - Please refer to `Attention is all you need `_ , - and see `TransformerEncoder` and `TransformerDecoder` for more details. - - Users can configurate the model architecture with corresponding parameters. - Note the usage of `normalize_before` representing where to apply layer - normalization (in pre-process or post-precess of multi-head attention or FFN), - and some transformer like models are different on this, such as - `BERT `_ and `GPT2 `_ . - The default architecture here places layer normalization in post-process and - applies another layer normalization on the output of last encoder/decoder layer. - - Parameters: - d_model (int, optional): The expected feature size in the encoder/decoder input - and output. 
Default 512
-        nhead (int, optional): The number of heads in multi-head attention(MHA). Default 8
-        num_encoder_layers (int, optional): The number of layers in the encoder. Default 6
-        num_decoder_layers (int, optional): The number of layers in the decoder. Default 6
-        dim_feedforward (int, optional): The hidden layer size in the feedforward network(FFN). Default 2048
-        dropout (float, optional): The dropout probability used in pre-processing
-            and post-processing of the MHA and FFN sub-layers. Default 0.1
-        activation (str, optional): The activation function in the feedforward
-            network. Default relu.
-        attn_dropout (float, optional): The dropout probability used
-            in MHA to drop some attention targets. If None, use the value of
-            `dropout`. Default None
-        act_dropout (float, optional): The dropout probability used after the FFN
-            activation. If None, use the value of `dropout`. Default None
-        normalize_before (bool, optional): Indicate whether to put layer normalization
-            into the pre-processing of the MHA and FFN sub-layers. If True, pre-processing
-            is layer normalization and post-processing includes dropout and residual
-            connection. Otherwise, there is no pre-processing, and post-processing
-            includes dropout, residual connection and layer normalization. Default False
-        weight_attr(ParamAttr|list|tuple, optional): To specify the weight parameter property.
-            If it is a list/tuple, the length of `weight_attr` could be 1, 2 or 3. If it is 3,
-            `weight_attr[0]` would be used as `weight_attr` for self attention, `weight_attr[1]`
-            would be used as `weight_attr` for cross attention of `TransformerDecoder`,
-            and `weight_attr[2]` would be used as `weight_attr` for linear in FFN.
-            If it is 2, `weight_attr[0]` would be used as `weight_attr` both for self attention
-            and cross attention, and `weight_attr[1]` would be used as `weight_attr` for
-            linear in FFN. If it is 1, `weight_attr[0]` would be used as `weight_attr`
-            for self attention, cross attention and linear in FFN. Otherwise,
-            the three sub-layers all use it as `weight_attr` to create parameters.
-            Default: None, which means the default weight parameter property is used.
-            See usage for details
-            in :code:`ParamAttr` .
-        bias_attr (ParamAttr|list|tuple|bool, optional): To specify the bias parameter property.
-            If it is a list/tuple, the length of `bias_attr` could be 1, 2 or 3. If it is 3,
-            `bias_attr[0]` would be used as `bias_attr` for self attention, `bias_attr[1]`
-            would be used as `bias_attr` for cross attention of `TransformerDecoder`,
-            and `bias_attr[2]` would be used as `bias_attr` for linear in FFN.
-            If it is 2, `bias_attr[0]` would be used as `bias_attr` both for self attention
-            and cross attention, and `bias_attr[1]` would be used as `bias_attr` for
-            linear in FFN. If it is 1, `bias_attr[0]` would be used as `bias_attr`
-            for self attention, cross attention and linear in FFN. Otherwise,
-            the three sub-layers all use it as `bias_attr` to create parameters.
-            The `False` value means the corresponding layer would not have trainable
-            bias parameter. See usage for details in :code:`ParamAttr` .
-            Default: None, which means the default bias parameter property is used.
-        custom_encoder (Layer, optional): If a custom encoder is provided, use it as the encoder.
-            Default None
-        custom_decoder (Layer, optional): If a custom decoder is provided, use it as the decoder.
-            Default None
-
-    Examples:
-
-        .. 
code-block:: python - - import paddle - from paddle.nn import Transformer - - # src: [batch_size, tgt_len, d_model] - enc_input = paddle.rand((2, 4, 128)) - # tgt: [batch_size, src_len, d_model] - dec_input = paddle.rand((2, 6, 128)) - # src_mask: [batch_size, n_head, src_len, src_len] - enc_self_attn_mask = paddle.rand((2, 2, 4, 4)) - # tgt_mask: [batch_size, n_head, tgt_len, tgt_len] - dec_self_attn_mask = paddle.rand((2, 2, 6, 6)) - # memory_mask: [batch_size, n_head, tgt_len, src_len] - cross_attn_mask = paddle.rand((2, 2, 6, 4)) - transformer = Transformer(128, 2, 4, 4, 512) - output = transformer(enc_input, - dec_input, - enc_self_attn_mask, - dec_self_attn_mask, - cross_attn_mask) # [2, 6, 128] - """ - - def __init__( - self, - d_model=512, - nhead=8, - num_encoder_layers=6, - num_decoder_layers=6, - dim_feedforward=2048, - dropout=0.1, - activation="relu", - attn_dropout=None, - act_dropout=None, - normalize_before=False, - weight_attr=None, - bias_attr=None, - custom_encoder=None, - custom_decoder=None, - ): - super(Transformer, self).__init__() - - if isinstance(bias_attr, (list, tuple)): - if len(bias_attr) == 1: - encoder_bias_attr = [bias_attr[0]] * 2 - decoder_bias_attr = [bias_attr[0]] * 3 - elif len(bias_attr) == 2: - encoder_bias_attr = bias_attr - decoder_bias_attr = [bias_attr[0], bias_attr[0], bias_attr[-1]] - elif len(bias_attr) == 3: - encoder_bias_attr = [bias_attr[0], bias_attr[-1]] - decoder_bias_attr = bias_attr - else: - assert False, "length of bias_attr should be 1 or 2 or 3 when it is a list/tuple" - else: - encoder_bias_attr = bias_attr - decoder_bias_attr = bias_attr - - if isinstance(weight_attr, (list, tuple)): - if len(weight_attr) == 1: - encoder_weight_attr = [weight_attr[0]] * 2 - decoder_weight_attr = [weight_attr[0]] * 3 - elif len(weight_attr) == 2: - encoder_weight_attr = weight_attr - decoder_weight_attr = [weight_attr[0], weight_attr[0], weight_attr[-1]] - elif len(weight_attr) == 3: - encoder_weight_attr = [weight_attr[0], weight_attr[-1]] - decoder_weight_attr = weight_attr - else: - assert False, "length of weight_attr should be 1 or 2 or 3 when it is a list/tuple" - else: - encoder_weight_attr = weight_attr - decoder_weight_attr = weight_attr - - if custom_encoder is not None: - self.encoder = custom_encoder - else: - encoder_layer = TransformerEncoderLayer( - d_model, - nhead, - dim_feedforward, - dropout, - activation, - attn_dropout, - act_dropout, - normalize_before, - encoder_weight_attr, - encoder_bias_attr, - ) - encoder_norm = LayerNorm(d_model) - self.encoder = TransformerEncoder(encoder_layer, num_encoder_layers, encoder_norm) - - if custom_decoder is not None: - self.decoder = custom_decoder - else: - decoder_layer = TransformerDecoderLayer( - d_model, - nhead, - dim_feedforward, - dropout, - activation, - attn_dropout, - act_dropout, - normalize_before, - decoder_weight_attr, - decoder_bias_attr, - ) - decoder_norm = LayerNorm(d_model) - self.decoder = TransformerDecoder(decoder_layer, num_decoder_layers, decoder_norm) - - self.d_model = d_model - self.nhead = nhead - - def forward(self, src, tgt, src_mask=None, tgt_mask=None, memory_mask=None): - r""" - Applies a Transformer model on the inputs. - - Parameters: - src (Tensor): The input of Transformer encoder. It is a tensor - with shape `[batch_size, source_length, d_model]`. The data type - should be float32 or float64. - tgt (Tensor): The input of Transformer decoder. It is a tensor - with shape `[batch_size, target_length, d_model]`. 
The data type - should be float32 or float64. - memory (Tensor): The output of Transformer encoder. It is a tensor - with shape `[batch_size, source_length, d_model]`. The data type - should be float32 or float64. - src_mask (Tensor, optional): A tensor used in multi-head attention - to prevents attention to some unwanted positions, usually the - paddings or the subsequent positions. It is a tensor with shape - broadcasted to `[batch_size, n_head, sequence_length, sequence_length]`. - When the data type is bool, the unwanted positions have `False` - values and the others have `True` values. When the data type is - int, the unwanted positions have 0 values and the others have 1 - values. When the data type is float, the unwanted positions have - `-INF` values and the others have 0 values. It can be None when - nothing wanted or needed to be prevented attention to. Default None. - tgt_mask (Tensor, optional): A tensor used in self attention - to prevents attention to some unwanted positions, usually the - the subsequent positions. It is a tensor with shape broadcasted - to `[batch_size, n_head, target_length, target_length]`. When - the data type is bool, the unwanted positions have `False` - values and the others have `True` values. When the data type is - int, the unwanted positions have 0 values and the others have 1 - values. When the data type is float, the unwanted positions have - `-INF` values and the others have 0 values. It can be None when - nothing wanted or needed to be prevented attention to. Default None. - memory_mask (Tensor, optional): A tensor used in decoder-encoder - cross attention to prevents attention to some unwanted positions, - usually the paddings. It is a tensor with shape broadcasted to - `[batch_size, n_head, target_length, source_length]`. When the - data type is bool, the unwanted positions have `False` values - and the others have `True` values. When the data type is int, - the unwanted positions have 0 values and the others have 1 - values. When the data type is float, the unwanted positions have - `-INF` values and the others have 0 values. It can be None when - nothing wanted or needed to be prevented attention to. Default None. - - Returns: - Tensor: It is a tensor that has the same shape and data type \ - as `tgt`, representing the output of Transformer decoder. - """ - src_mask = _convert_attention_mask(src_mask, src.dtype) - memory = self.encoder(src, src_mask=src_mask) - - tgt_mask = _convert_attention_mask(tgt_mask, tgt.dtype) - memory_mask = _convert_attention_mask(memory_mask, memory.dtype) - output = self.decoder(tgt, memory, tgt_mask=tgt_mask, memory_mask=memory_mask) - return output - - def generate_square_subsequent_mask(self, length): - """ - Generate a square mask for the sequence. The mask ensures that the - predictions for position i can depend only on the known outputs at - positions less than i. - - Parameters: - length (int|Tensor): The length of sequence. - - Returns: - Tensor: Generated square mask according to the given length. - - Examples: - .. code-block:: python - - import paddle - from paddle.nn.layer.transformer import Transformer - length = 5 - d_model, n_head, dim_feedforward = 8, 4, 64 - transformer_paddle = Transformer( - d_model, n_head, dim_feedforward=dim_feedforward) - mask = transformer_paddle.generate_square_subsequent_mask(length) - print(mask) - - # [[ 0. -inf -inf -inf -inf] - # [ 0. 0. -inf -inf -inf] - # [ 0. 0. 0. -inf -inf] - # [ 0. 0. 0. 0. -inf] - # [ 0. 0. 0. 0. 
0.]]
-
-        """
-        return paddle.tensor.triu((paddle.ones((length, length), dtype=paddle.get_default_dtype()) * -np.inf), 1)
diff --git a/examples/model_interpretation/utils.py b/examples/model_interpretation/utils.py
deleted file mode 100644
index 469dc6f797f1..000000000000
--- a/examples/model_interpretation/utils.py
+++ /dev/null
@@ -1,88 +0,0 @@
-# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""This file contains some public helper functions."""
-
-
-def convert_tokenizer_res_to_old_version(tokenized_res):
-    if isinstance(tokenized_res, list):
-        return tokenized_res
-    if isinstance(tokenized_res, dict):
-        if len(tokenized_res["input_ids"]) == 0 or not isinstance(tokenized_res["input_ids"][0], list):
-            return tokenized_res
-        else:
-            res = []
-            for idx in range(len(tokenized_res["input_ids"])):
-                temp_dict = {}
-                key_list = list(tokenized_res.keys())
-                for key in key_list:
-                    temp_dict[key] = tokenized_res[key][idx]
-                res.append(temp_dict)
-            return res
-    else:
-        raise ValueError("unsupported result type")
-
-
-def cal_score(match_list, sorted_token):
-    # Flatten the token-id sets of all segments.
-    over_all = []
-    for i in match_list:
-        over_all.extend(i[0])
-
-    # Distribute each token's score evenly over the segments it was split into.
-    score_dic = {}
-    for i in sorted_token:
-        split_time = over_all.count(i[0])
-        if split_time:
-            score_dic[i[0]] = i[2] / split_time
-        else:
-            score_dic[i[0]] = 0.0
-
-    # Sum the per-token scores for each segment.
-    score = []
-    for i in range(len(match_list)):
-        cur_score = 0.0
-        for j in match_list[i][0]:
-            if j == -1:
-                continue
-            cur_score += score_dic[j]
-        score.append([str(match_list[i][1]), match_list[i][2], cur_score])
-    return score
-
-
-def match(context, context_seg, sorted_token):
-    result = []
-    pointer1 = 0  # points at the context
-    pointer2 = 0  # points at the sorted_token array
-    for i in range(len(context_seg)):
-        seg_start_idx = context.find(context_seg[i], pointer1)
-        if seg_start_idx < 0:
-            print("Error: token not in context")
-        seg_end_idx = seg_start_idx + len(context_seg[i])
-
-        cur_set = []
-        while pointer2 < len(sorted_token):
-            while pointer2 < len(sorted_token) and sorted_token[pointer2][1][1] <= seg_start_idx:
-                pointer2 += 1
-            if pointer2 >= len(sorted_token):
-                break
-            if sorted_token[pointer2][1][0] >= seg_end_idx:
-                break
-            cur_set.append(sorted_token[pointer2][0])
-            pointer2 += 1
-        result.append([cur_set, i, context_seg[i]])
-        pointer2 -= 1
-        pointer1 = seg_end_idx
-    score = cal_score(result, sorted_token)
-    return score
diff --git a/examples/multimodal/layoutlm/README.md b/examples/multimodal/layoutlm/README.md
deleted file mode 100644
index f1f46392d4cd..000000000000
--- a/examples/multimodal/layoutlm/README.md
+++ /dev/null
@@ -1,44 +0,0 @@
-# LayoutLM
-
-## Model Introduction
-This project is an open-source implementation of [LayoutLM: Pre-training of Text and Layout for Document Image Understanding](https://arxiv.org/pdf/1912.13318v5.pdf) on Paddle 2.2,
-and includes fine-tuning code for the [FUNSD dataset](https://guillaumejaume.github.io/FUNSD/).
-
-## Quick Start
-### Environment Setup
-Dependencies:
-- cv2
-- sentencepiece
-- yacs
-
-Installation commands:
-```shell
-pip install opencv-python
-pip install sentencepiece
diff --git a/examples/multimodal/layoutlm/README.md b/examples/multimodal/layoutlm/README.md
deleted file mode 100644
index f1f46392d4cd..000000000000
--- a/examples/multimodal/layoutlm/README.md
+++ /dev/null
@@ -1,44 +0,0 @@
-# LayoutLM
-
-## Introduction
-This project is an open-source implementation of [LayoutLM: Pre-training of Text and Layout for Document Image Understanding](https://arxiv.org/pdf/1912.13318v5.pdf) on Paddle 2.2,
-including fine-tuning code on the [FUNSD dataset](https://guillaumejaume.github.io/FUNSD/).
-
-## Quick Start
-### Environment Setup
-Dependencies:
-- cv2
-- sentencepiece
-- yacs
-
-Install them with:
-```shell
-pip install opencv-python
-pip install sentencepiece
-pip install yacs
-```
-
-### Data Preparation
-A preprocessed copy of the FUNSD dataset can be downloaded from: https://bj.bcebos.com/v1/paddlenlp/datasets/FUNSD.zip .
-
-Download and extract the archive, then place the dataset in the current directory.
-
-### Fine-tuning
-1. Launch fine-tuning for the ``Sequence Labeling`` task as follows:
-   ```shell
-   bash train_funsd.sh
-
-   # Results:
-   # best metrics: {'precision': 0.7642124883504194, 'recall': 0.8204102051025512, 'f1': 0.7913148371531967}
-   ```
-
-### Data Processing
-FUNSD is a commonly used form-understanding dataset. The original data can be downloaded from https://guillaumejaume.github.io/FUNSD/dataset.zip.
-It contains two subfolders, training_data and testing_data, holding 149 training samples and 50 test samples. Preprocess the data as follows
-(a sketch of the resulting file format is shown after the command):
-```shell
-   bash preprocess.sh
-```
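-
-The command flattens each split into three aligned files (`train.txt`,
-`train_box.txt`, `train_image.txt`), one token per line. A rough sketch of the
-format with hypothetical values (see `preprocess.py` for the exact fields):
-
-```text
-# train.txt:       token <TAB> BIOES tag
-Date:	B-QUESTION
-# train_box.txt:   token <TAB> bbox scaled to a 0-1000 grid
-Date:	62 280 96 292
-# train_image.txt: token <TAB> original bbox <TAB> page size <TAB> image file
-Date:	150 680 233 710	762 1000	example_form.png
-```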
-
-## Reference
-- [LayoutLM: Pre-training of Text and Layout for Document Image Understanding](https://arxiv.org/pdf/1912.13318v5.pdf)
-- [microsoft/unilm/layoutlm](https://github.com/microsoft/unilm/tree/master/layoutlm)
diff --git a/examples/multimodal/layoutlm/funsd.py b/examples/multimodal/layoutlm/funsd.py
deleted file mode 100644
index 4421cd3710b4..000000000000
--- a/examples/multimodal/layoutlm/funsd.py
+++ /dev/null
@@ -1,317 +0,0 @@
-# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import logging
-import os
-
-import paddle
-from paddle.io import Dataset
-
-logger = logging.getLogger(__name__)
-
-
-class FunsdDataset(Dataset):
-    def __init__(self, args, tokenizer, labels, pad_token_label_id, mode):
-        logger.info("Creating features from dataset file at %s", args.data_dir)
-        examples = read_examples_from_file(args.data_dir, mode)
-        features = convert_examples_to_features(
-            examples,
-            labels,
-            args.max_seq_length,
-            tokenizer,
-            cls_token_at_end=False,
-            cls_token=tokenizer.cls_token,
-            cls_token_segment_id=0,
-            sep_token=tokenizer.sep_token,
-            sep_token_extra=False,
-            pad_on_left=False,
-            pad_token=tokenizer.convert_tokens_to_ids([tokenizer.pad_token])[0],
-            pad_token_segment_id=0,
-            pad_token_label_id=pad_token_label_id,
-        )
-
-        self.features = features
-        # Convert to Tensors and build dataset
-        self.all_input_ids = paddle.to_tensor([f.input_ids for f in features], dtype="int64")
-        self.all_input_mask = paddle.to_tensor([f.input_mask for f in features], dtype="int64")
-        self.all_segment_ids = paddle.to_tensor([f.segment_ids for f in features], dtype="int64")
-        self.all_label_ids = paddle.to_tensor([f.label_ids for f in features], dtype="int64")
-        self.all_bboxes = paddle.to_tensor([f.boxes for f in features], dtype="int64")
-
-    def __len__(self):
-        return len(self.features)
-
-    def __getitem__(self, index):
-        return (
-            self.all_input_ids[index],
-            self.all_input_mask[index],
-            self.all_segment_ids[index],
-            self.all_label_ids[index],
-            self.all_bboxes[index],
-        )
-
-
-class InputExample(object):
-    """A single training/test example for token classification."""
-
-    def __init__(self, guid, words, labels, boxes, actual_bboxes, file_name, page_size):
-        """Constructs an InputExample.
-        Args:
-            guid: Unique id for the example.
-            words: list. The words of the sequence.
-            labels: (Optional) list. The labels for each word of the sequence. This should be
-                specified for train and dev examples, but not for test examples.
-        """
-        self.guid = guid
-        self.words = words
-        self.labels = labels
-        self.boxes = boxes
-        self.actual_bboxes = actual_bboxes
-        self.file_name = file_name
-        self.page_size = page_size
-
-
-class InputFeatures(object):
-    """A single set of features of data."""
-
-    def __init__(
-        self,
-        input_ids,
-        input_mask,
-        segment_ids,
-        label_ids,
-        boxes,
-        actual_bboxes,
-        file_name,
-        page_size,
-    ):
-        # Every bbox coordinate must lie on the 0-1000 normalized grid.
-        assert all(
-            0 <= coord <= 1000 for box in boxes for coord in box
-        ), "Error with input bbox ({}): the coordinate value is not between 0 and 1000".format(boxes)
-        self.input_ids = input_ids
-        self.input_mask = input_mask
-        self.segment_ids = segment_ids
-        self.label_ids = label_ids
-        self.boxes = boxes
-        self.actual_bboxes = actual_bboxes
-        self.file_name = file_name
-        self.page_size = page_size
-
-
-def read_examples_from_file(data_dir, mode):
-    file_path = os.path.join(data_dir, "{}.txt".format(mode))
-    box_file_path = os.path.join(data_dir, "{}_box.txt".format(mode))
-    image_file_path = os.path.join(data_dir, "{}_image.txt".format(mode))
-    guid_index = 1
-    examples = []
-    with open(file_path, encoding="utf-8") as f, open(box_file_path, encoding="utf-8") as fb, open(
-        image_file_path, encoding="utf-8"
-    ) as fi:
-        words = []
-        boxes = []
-        actual_bboxes = []
-        file_name = None
-        page_size = None
-        labels = []
-        for line, bline, iline in zip(f, fb, fi):
-            if line.startswith("-DOCSTART-") or line == "" or line == "\n":
-                if words:
-                    examples.append(
-                        InputExample(
-                            guid="{}-{}".format(mode, guid_index),
-                            words=words,
-                            labels=labels,
-                            boxes=boxes,
-                            actual_bboxes=actual_bboxes,
-                            file_name=file_name,
-                            page_size=page_size,
-                        )
-                    )
-                    guid_index += 1
-                    words = []
-                    boxes = []
-                    actual_bboxes = []
-                    file_name = None
-                    page_size = None
-                    labels = []
-            else:
-                splits = line.split("\t")
-                bsplits = bline.split("\t")
-                isplits = iline.split("\t")
-                assert len(splits) == 2
-                assert len(bsplits) == 2
-                assert len(isplits) == 4
-                assert splits[0] == bsplits[0]
-                words.append(splits[0])
-                if len(splits) > 1:
-                    labels.append(splits[-1].replace("\n", ""))
-                    box = bsplits[-1].replace("\n", "")
-                    box = [int(b) for b in box.split()]
-                    boxes.append(box)
-                    actual_bbox = [int(b) for b in isplits[1].split()]
-                    actual_bboxes.append(actual_bbox)
-                    page_size = [int(i) for i in isplits[2].split()]
-                    file_name = isplits[3].strip()
-                else:
-                    # Examples could have no label for mode = "test"
-                    labels.append("O")
-        if words:
-            examples.append(
-                InputExample(
-                    guid=f"{mode}-{guid_index}",
-                    words=words,
-                    labels=labels,
-                    boxes=boxes,
-                    actual_bboxes=actual_bboxes,
-                    file_name=file_name,
-                    page_size=page_size,
-                )
-            )
-    return examples
-
-
-def convert_examples_to_features(
-    examples,
-    label_list,
-    max_seq_length,
-    tokenizer,
-    cls_token_at_end=False,
-    cls_token="[CLS]",
-    cls_token_segment_id=1,
-    sep_token="[SEP]",
-    sep_token_extra=False,
-    pad_on_left=False,
-    pad_token=0,
-    cls_token_box=[0, 0, 0, 0],
-    sep_token_box=[1000, 1000, 1000, 1000],
-    pad_token_box=[0, 0, 0, 0],
-    pad_token_segment_id=0,
-    pad_token_label_id=-1,
-    sequence_a_segment_id=0,
-    mask_padding_with_zero=True,
-):
-
-    label_map = {label: i for i, label in enumerate(label_list)}
-
-    features = []
-    for (ex_index, example) in enumerate(examples):
-        file_name = example.file_name
-        page_size = example.page_size
-        width, height = page_size
-        if ex_index % 10000 == 0:
-            logger.info("Writing example %d of %d", ex_index, len(examples))
-
-        tokens = []
-        token_boxes = []
-        actual_bboxes = []
-        label_ids = []
-        for word, label, box, actual_bbox in zip(example.words, example.labels, example.boxes, example.actual_bboxes):
-            word_tokens = tokenizer.tokenize(word)
-            tokens.extend(word_tokens)
-            token_boxes.extend([box] * len(word_tokens))
-            actual_bboxes.extend([actual_bbox] * len(word_tokens))
-            # Use the real label id for the first token of the word, and padding ids for the remaining tokens
-            label_ids.extend([label_map[label]] + [pad_token_label_id] * (len(word_tokens) - 1))
-
-        # Account for [CLS] and [SEP] with "- 2" and with "- 3" for RoBERTa.
-        special_tokens_count = 3 if sep_token_extra else 2
-        if len(tokens) > max_seq_length - special_tokens_count:
-            tokens = tokens[: (max_seq_length - special_tokens_count)]
-            token_boxes = token_boxes[: (max_seq_length - special_tokens_count)]
-            actual_bboxes = actual_bboxes[: (max_seq_length - special_tokens_count)]
-            label_ids = label_ids[: (max_seq_length - special_tokens_count)]
-
-        # The convention in BERT is:
-        # (a) For sequence pairs:
-        #  tokens:   [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
-        #  type_ids:   0   0  0    0    0     0       0   0   1  1  1  1   1   1
-        # (b) For single sequences:
-        #  tokens:   [CLS] the dog is hairy . [SEP]
-        #  type_ids:   0   0   0   0  0     0   0
-        #
-        # Where "type_ids" are used to indicate whether this is the first
-        # sequence or the second sequence. The embedding vectors for `type=0` and
-        # `type=1` were learned during pre-training and are added to the wordpiece
-        # embedding vector (and position vector). This is not *strictly* necessary
-        # since the [SEP] token unambiguously separates the sequences, but it makes
-        # it easier for the model to learn the concept of sequences.
-        #
-        # For classification tasks, the first vector (corresponding to [CLS]) is
-        # used as the "sentence vector". Note that this only makes sense because
-        # the entire model is fine-tuned.
-        tokens += [sep_token]
-        token_boxes += [sep_token_box]
-        actual_bboxes += [[0, 0, width, height]]
-        label_ids += [pad_token_label_id]
-        if sep_token_extra:
-            # roberta uses an extra separator b/w pairs of sentences
-            tokens += [sep_token]
-            token_boxes += [sep_token_box]
-            actual_bboxes += [[0, 0, width, height]]
-            label_ids += [pad_token_label_id]
-        segment_ids = [sequence_a_segment_id] * len(tokens)
-
-        if cls_token_at_end:
-            tokens += [cls_token]
-            token_boxes += [cls_token_box]
-            actual_bboxes += [[0, 0, width, height]]
-            label_ids += [pad_token_label_id]
-            segment_ids += [cls_token_segment_id]
-        else:
-            tokens = [cls_token] + tokens
-            token_boxes = [cls_token_box] + token_boxes
-            actual_bboxes = [[0, 0, width, height]] + actual_bboxes
-            label_ids = [pad_token_label_id] + label_ids
-            segment_ids = [cls_token_segment_id] + segment_ids
-
-        input_ids = tokenizer.convert_tokens_to_ids(tokens)
-
-        # The mask has 1 for real tokens and 0 for padding tokens. Only real
-        # tokens are attended to.
-        input_mask = [1 if mask_padding_with_zero else 0] * len(input_ids)
-
-        # Zero-pad up to the sequence length.
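-        # For example (illustrative): with max_seq_length=8 and five real tokens,
-        # padding_length is 3, so three pad ids are appended to input_ids, three
-        # 0s to input_mask (with the default mask_padding_with_zero=True), and
-        # three pad_token_label_id entries to label_ids, keeping all of the
-        # parallel lists exactly max_seq_length long.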
- padding_length = max_seq_length - len(input_ids) - if pad_on_left: - input_ids = ([pad_token] * padding_length) + input_ids - input_mask = ([0 if mask_padding_with_zero else 1] * padding_length) + input_mask - segment_ids = ([pad_token_segment_id] * padding_length) + segment_ids - label_ids = ([pad_token_label_id] * padding_length) + label_ids - token_boxes = ([pad_token_box] * padding_length) + token_boxes - else: - input_ids += [pad_token] * padding_length - input_mask += [0 if mask_padding_with_zero else 1] * padding_length - segment_ids += [pad_token_segment_id] * padding_length - label_ids += [pad_token_label_id] * padding_length - token_boxes += [pad_token_box] * padding_length - - assert len(input_ids) == max_seq_length - assert len(input_mask) == max_seq_length - assert len(segment_ids) == max_seq_length - assert len(label_ids) == max_seq_length - assert len(token_boxes) == max_seq_length - - features.append( - InputFeatures( - input_ids=input_ids, - input_mask=input_mask, - segment_ids=segment_ids, - label_ids=label_ids, - boxes=token_boxes, - actual_bboxes=actual_bboxes, - file_name=file_name, - page_size=page_size, - ) - ) - return features diff --git a/examples/multimodal/layoutlm/preprocess.py b/examples/multimodal/layoutlm/preprocess.py deleted file mode 100644 index 28b07e5ca5d8..000000000000 --- a/examples/multimodal/layoutlm/preprocess.py +++ /dev/null @@ -1,166 +0,0 @@ -import argparse -import json -import os - -from PIL import Image -from paddlenlp.transformers import AutoTokenizer - - -def bbox_string(box, width, length): - return ( - str(int(1000 * (box[0] / width))) - + " " - + str(int(1000 * (box[1] / length))) - + " " - + str(int(1000 * (box[2] / width))) - + " " - + str(int(1000 * (box[3] / length))) - ) - - -def actual_bbox_string(box, width, length): - return ( - str(box[0]) + " " + str(box[1]) + " " + str(box[2]) + " " + str(box[3]) + "\t" + str(width) + " " + str(length) - ) - - -def convert(args): - with open(os.path.join(args.output_dir, args.data_split + ".txt.tmp"), "w", encoding="utf8",) as fw, open( - os.path.join(args.output_dir, args.data_split + "_box.txt.tmp"), - "w", - encoding="utf8", - ) as fbw, open( - os.path.join(args.output_dir, args.data_split + "_image.txt.tmp"), - "w", - encoding="utf8", - ) as fiw: - for file in os.listdir(args.data_dir): - file_path = os.path.join(args.data_dir, file) - with open(file_path, "r", encoding="utf8") as f: - data = json.load(f) - image_path = file_path.replace("annotations", "images") - image_path = image_path.replace("json", "png") - file_name = os.path.basename(image_path) - image = Image.open(image_path) - width, length = image.size - for item in data["form"]: - words, label = item["words"], item["label"] - words = [w for w in words if w["text"].strip() != ""] - if len(words) == 0: - continue - if label == "other": - for w in words: - fw.write(w["text"] + "\tO\n") - fbw.write(w["text"] + "\t" + bbox_string(w["box"], width, length) + "\n") - fiw.write( - w["text"] + "\t" + actual_bbox_string(w["box"], width, length) + "\t" + file_name + "\n" - ) - else: - if len(words) == 1: - fw.write(words[0]["text"] + "\tS-" + label.upper() + "\n") - fbw.write(words[0]["text"] + "\t" + bbox_string(words[0]["box"], width, length) + "\n") - fiw.write( - words[0]["text"] - + "\t" - + actual_bbox_string(words[0]["box"], width, length) - + "\t" - + file_name - + "\n" - ) - else: - fw.write(words[0]["text"] + "\tB-" + label.upper() + "\n") - fbw.write(words[0]["text"] + "\t" + bbox_string(words[0]["box"], width, length) + 
"\n") - fiw.write( - words[0]["text"] - + "\t" - + actual_bbox_string(words[0]["box"], width, length) - + "\t" - + file_name - + "\n" - ) - for w in words[1:-1]: - fw.write(w["text"] + "\tI-" + label.upper() + "\n") - fbw.write(w["text"] + "\t" + bbox_string(w["box"], width, length) + "\n") - fiw.write( - w["text"] - + "\t" - + actual_bbox_string(w["box"], width, length) - + "\t" - + file_name - + "\n" - ) - fw.write(words[-1]["text"] + "\tE-" + label.upper() + "\n") - fbw.write(words[-1]["text"] + "\t" + bbox_string(words[-1]["box"], width, length) + "\n") - fiw.write( - words[-1]["text"] - + "\t" - + actual_bbox_string(words[-1]["box"], width, length) - + "\t" - + file_name - + "\n" - ) - fw.write("\n") - fbw.write("\n") - fiw.write("\n") - - -def seg_file(file_path, tokenizer, max_len): - subword_len_counter = 0 - output_path = file_path[:-4] - with open(file_path, "r", encoding="utf8") as f_p, open(output_path, "w", encoding="utf8") as fw_p: - for line in f_p: - line = line.rstrip() - - if not line: - fw_p.write(line + "\n") - subword_len_counter = 0 - continue - token = line.split("\t")[0] - - current_subwords_len = len(tokenizer.tokenize(token)) - - # Token contains strange control characters like \x96 or \x95 - # Just filter out the complete line - if current_subwords_len == 0: - continue - - if (subword_len_counter + current_subwords_len) > max_len: - fw_p.write("\n" + line + "\n") - subword_len_counter = current_subwords_len - continue - - subword_len_counter += current_subwords_len - - fw_p.write(line + "\n") - - -def seg(args): - tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path, do_lower_case=True) - seg_file( - os.path.join(args.output_dir, args.data_split + ".txt.tmp"), - tokenizer, - args.max_len, - ) - seg_file( - os.path.join(args.output_dir, args.data_split + "_box.txt.tmp"), - tokenizer, - args.max_len, - ) - seg_file( - os.path.join(args.output_dir, args.data_split + "_image.txt.tmp"), - tokenizer, - args.max_len, - ) - - -if __name__ == "__main__": - parser = argparse.ArgumentParser() - parser.add_argument("--data_dir", type=str, default="data/training_data/annotations") - parser.add_argument("--data_split", type=str, default="train") - parser.add_argument("--output_dir", type=str, default="data") - parser.add_argument("--model_name_or_path", type=str, default="bert-base-uncased") - parser.add_argument("--max_len", type=int, default=510) - args = parser.parse_args() - - convert(args) - seg(args) diff --git a/examples/multimodal/layoutlm/preprocess.sh b/examples/multimodal/layoutlm/preprocess.sh deleted file mode 100644 index 2ff8dc4e317a..000000000000 --- a/examples/multimodal/layoutlm/preprocess.sh +++ /dev/null @@ -1,13 +0,0 @@ -python preprocess.py --data_dir data/training_data/annotations \ - --data_split train \ - --output_dir data \ - --model_name_or_path bert-base-uncased \ - --max_len 510 - -python preprocess.py --data_dir data/testing_data/annotations \ - --data_split test \ - --output_dir data \ - --model_name_or_path bert-base-uncased \ - --max_len 510 - -cat data/train.txt | cut -d$'\t' -f 2 | grep -v "^$"| sort | uniq > data/labels.txt \ No newline at end of file diff --git a/examples/multimodal/layoutlm/train_funsd.py b/examples/multimodal/layoutlm/train_funsd.py deleted file mode 100644 index 8021e0f752f1..000000000000 --- a/examples/multimodal/layoutlm/train_funsd.py +++ /dev/null @@ -1,282 +0,0 @@ -# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. 
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import logging
-import os
-import random
-
-import numpy as np
-import paddle
-from funsd import FunsdDataset
-from seqeval.metrics import (
-    classification_report,
-    f1_score,
-    precision_score,
-    recall_score,
-)
-from tqdm import tqdm, trange
-
-# relative reference
-from utils import parse_args
-
-from paddlenlp.transformers import (
-    LayoutLMForTokenClassification,
-    LayoutLMModel,
-    LayoutLMTokenizer,
-)
-
-logger = logging.getLogger(__name__)
-
-
-def get_labels(path):
-    with open(path, "r") as f:
-        labels = f.read().splitlines()
-    if "O" not in labels:
-        labels = ["O"] + labels
-    return labels
-
-
-def set_seed(args):
-    random.seed(args.seed)
-    np.random.seed(args.seed)
-    paddle.seed(args.seed)
-
-
-def train(args):
-    logging.basicConfig(
-        filename=os.path.join(args.output_dir, "train.log") if paddle.distributed.get_rank() == 0 else None,
-        format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
-        datefmt="%m/%d/%Y %H:%M:%S",
-        level=logging.INFO if paddle.distributed.get_rank() == 0 else logging.WARN,
-    )
-
-    all_labels = get_labels(args.labels)
-
-    pad_token_label_id = paddle.nn.CrossEntropyLoss().ignore_index
-
-    tokenizer = LayoutLMTokenizer.from_pretrained(args.model_name_or_path)
-
-    # For training, the base model is loaded first and wrapped with the
-    # token-classification head; otherwise the task model can be loaded
-    # directly for the downstream task.
-    if not args.do_train:
-        model = LayoutLMForTokenClassification.from_pretrained(args.model_name_or_path)
-    else:
-        model = LayoutLMModel.from_pretrained(args.model_name_or_path)
-        model = LayoutLMForTokenClassification(model, num_classes=len(all_labels), dropout=None)
-
-    train_dataset = FunsdDataset(args, tokenizer, all_labels, pad_token_label_id, mode="train")
-    train_sampler = paddle.io.DistributedBatchSampler(
-        train_dataset, batch_size=args.per_gpu_train_batch_size, shuffle=True
-    )
-
-    args.train_batch_size = args.per_gpu_train_batch_size * max(1, paddle.distributed.get_world_size())
-    train_dataloader = paddle.io.DataLoader(
-        train_dataset,
-        batch_sampler=train_sampler,
-        collate_fn=None,
-    )
-
-    t_total = len(train_dataloader) // args.gradient_accumulation_steps * args.num_train_epochs
-
-    # build linear decay with warmup lr sch
-    lr_scheduler = paddle.optimizer.lr.PolynomialDecay(
-        learning_rate=args.learning_rate, decay_steps=t_total, end_lr=0.0, power=1.0
-    )
-    if args.warmup_steps > 0:
-        lr_scheduler = paddle.optimizer.lr.LinearWarmup(
-            lr_scheduler,
-            args.warmup_steps,
-            start_lr=0,
-            end_lr=args.learning_rate,
-        )
-
-    optimizer = paddle.optimizer.AdamW(
-        learning_rate=lr_scheduler,
-        parameters=model.parameters(),
-        epsilon=args.adam_epsilon,
-        weight_decay=args.weight_decay,
-    )
-
-    loss_fct = paddle.nn.loss.CrossEntropyLoss(ignore_index=pad_token_label_id)
-
-    # Train
-    logger.info("***** Running training *****")
-    logger.info("  Num examples = %d", len(train_dataset))
-    logger.info("  Num Epochs = %d", args.num_train_epochs)
-    logger.info("  Instantaneous batch size per GPU = %d",
args.per_gpu_train_batch_size) - logger.info( - " Total train batch size (w. parallel, distributed & accumulation) = %d", - args.train_batch_size * paddle.distributed.get_world_size(), - ) - logger.info(" Gradient Accumulation steps = %d", args.gradient_accumulation_steps) - logger.info(" Total optimization steps = %d", t_total) - - global_step = 0 - tr_loss = 0.0 - model.clear_gradients() - train_iterator = trange(int(args.num_train_epochs), desc="Epoch", disable=args.local_rank not in [-1, 0]) - set_seed(args) - for _ in train_iterator: - epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=args.local_rank not in [-1, 0]) - for step, batch in enumerate(epoch_iterator): - model.train() - inputs = { - "input_ids": batch[0], - "attention_mask": batch[1], - "token_type_ids": batch[2], - "bbox": batch[4], - } - labels = batch[3] - logits = model(**inputs) - loss = loss_fct( - logits.reshape([-1, len(all_labels)]), - labels.reshape( - [ - -1, - ] - ), - ) - - loss = loss.mean() - logger.info("train loss: {}".format(loss.numpy())) - loss.backward() - - tr_loss += loss.item() - if (step + 1) % args.gradient_accumulation_steps == 0: - optimizer.step() - lr_scheduler.step() # Update learning rate schedule - model.clear_gradients() - global_step += 1 - - if ( - paddle.distributed.get_rank() == 0 - and args.logging_steps > 0 - and global_step % args.logging_steps == 0 - ): - # Log metrics - if ( - paddle.distributed.get_rank() == 0 and args.evaluate_during_training - ): # Only evaluate when single GPU otherwise metrics may not average well - results, _ = evaluate( - args, - model, - tokenizer, - all_labels, - loss_fct, - pad_token_label_id, - mode="test", - ) - logger.info("results: {}".format(results)) - - if args.local_rank in [-1, 0] and args.save_steps > 0 and global_step % args.save_steps == 0: - # Save model checkpoint - output_dir = os.path.join(args.output_dir, "checkpoint-{}".format(global_step)) - os.makedirs(output_dir, exist_ok=True) - if paddle.distributed.get_rank() == 0: - model.save_pretrained(output_dir) - tokenizer.save_pretrained(output_dir) - paddle.save(args, os.path.join(output_dir, "training_args.bin")) - logger.info("Saving model checkpoint to %s", output_dir) - - if args.max_steps > 0 and global_step > args.max_steps: - epoch_iterator.close() - break - if args.max_steps > 0 and global_step > args.max_steps: - train_iterator.close() - break - - return global_step, tr_loss / global_step - - -def evaluate(args, model, tokenizer, all_labels, loss_fct, pad_token_label_id, mode, prefix=""): - eval_dataset = FunsdDataset(args, tokenizer, all_labels, pad_token_label_id, mode=mode) - args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, paddle.distributed.get_world_size()) - eval_dataloader = paddle.io.DataLoader( - eval_dataset, - batch_size=args.eval_batch_size, - collate_fn=None, - ) - - # Eval - logger.info("***** Running evaluation %s *****", prefix) - logger.info(" Num examples = %d", len(eval_dataset)) - logger.info(" Batch size = %d", args.eval_batch_size) - eval_loss = 0.0 - nb_eval_steps = 0 - preds = None - out_label_ids = None - model.eval() - for batch in tqdm(eval_dataloader, desc="Evaluating"): - with paddle.no_grad(): - inputs = { - "input_ids": batch[0], - "attention_mask": batch[1], - "token_type_ids": batch[2], - "bbox": batch[4], - } - labels = batch[3] - logits = model(**inputs) - tmp_eval_loss = loss_fct( - logits.reshape([-1, len(all_labels)]), - labels.reshape( - [ - -1, - ] - ), - ) - tmp_eval_loss = tmp_eval_loss.mean() - eval_loss += 
tmp_eval_loss.item() - - nb_eval_steps += 1 - if preds is None: - preds = logits.numpy() - out_label_ids = labels.numpy() - else: - preds = np.append(preds, logits.numpy(), axis=0) - out_label_ids = np.append(out_label_ids, labels.numpy(), axis=0) - - eval_loss = eval_loss / nb_eval_steps - preds = np.argmax(preds, axis=2) - - label_map = {i: label for i, label in enumerate(all_labels)} - out_label_list = [[] for _ in range(out_label_ids.shape[0])] - preds_list = [[] for _ in range(out_label_ids.shape[0])] - - for i in range(out_label_ids.shape[0]): - for j in range(out_label_ids.shape[1]): - if out_label_ids[i, j] != pad_token_label_id: - out_label_list[i].append(label_map[out_label_ids[i][j]]) - preds_list[i].append(label_map[preds[i][j]]) - - results = { - "loss": eval_loss, - "precision": precision_score(out_label_list, preds_list), - "recall": recall_score(out_label_list, preds_list), - "f1": f1_score(out_label_list, preds_list), - } - - report = classification_report(out_label_list, preds_list) - logger.info("\n" + report) - - logger.info("***** Eval results %s *****", prefix) - for key in sorted(results.keys()): - logger.info(" %s = %s", key, str(results[key])) - - return results, preds - - -if __name__ == "__main__": - args = parse_args() - os.makedirs(args.output_dir, exist_ok=True) - train(args) diff --git a/examples/multimodal/layoutlm/train_funsd.sh b/examples/multimodal/layoutlm/train_funsd.sh deleted file mode 100644 index cfd65d6c3ba1..000000000000 --- a/examples/multimodal/layoutlm/train_funsd.sh +++ /dev/null @@ -1,17 +0,0 @@ -export CUDA_VISIBLE_DEVICES=7 - -python3.7 train_funsd.py \ - --data_dir "./data/" \ - --model_name_or_path "layoutlm-base-uncased" \ - --do_lower_case \ - --max_seq_length 512 \ - --do_train \ - --do_eval \ - --num_train_epochs 100 \ - --logging_steps 10 \ - --save_steps 500 \ - --output_dir "output/" \ - --labels "./data/labels.txt" \ - --per_gpu_train_batch_size 16 \ - --per_gpu_eval_batch_size 16 \ - --evaluate_during_training diff --git a/examples/multimodal/layoutlm/utils.py b/examples/multimodal/layoutlm/utils.py deleted file mode 100644 index 6e3c4bce7404..000000000000 --- a/examples/multimodal/layoutlm/utils.py +++ /dev/null @@ -1,188 +0,0 @@ -# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -from __future__ import absolute_import, division, print_function - -import argparse - - -def parse_args(): - parser = argparse.ArgumentParser() - - # Required parameters - parser.add_argument( - "--data_dir", - default=None, - type=str, - required=True, - help="The input data dir. 
Should contain the training files for the CoNLL-2003 NER task.", - ) - parser.add_argument( - "--model_name_or_path", - default=None, - type=str, - required=True, - ) - parser.add_argument( - "--weights_path", - default=None, - type=str, - required=False, - ) - - parser.add_argument( - "--output_dir", - default=None, - type=str, - required=True, - help="The output directory where the model predictions and checkpoints will be written.", - ) - - # Other parameters - parser.add_argument( - "--labels", - default="", - type=str, - help="Path to a file containing all labels. If not specified, CoNLL-2003 labels are used.", - ) - parser.add_argument( - "--config_name", - default="", - type=str, - help="Pretrained config name or path if not the same as model_name", - ) - parser.add_argument( - "--tokenizer_name", - default="", - type=str, - help="Pretrained tokenizer name or path if not the same as model_name", - ) - parser.add_argument( - "--cache_dir", - default="", - type=str, - help="Where do you want to store the pre-trained models downloaded from s3", - ) - parser.add_argument( - "--max_seq_length", - default=512, - type=int, - help="The maximum total input sequence length after tokenization. Sequences longer " - "than this will be truncated, sequences shorter will be padded.", - ) - parser.add_argument("--do_train", action="store_true", help="Whether to run training.") - parser.add_argument("--do_eval", action="store_true", help="Whether to run eval on the dev set.") - parser.add_argument( - "--do_predict", - action="store_true", - help="Whether to run predictions on the test set.", - ) - parser.add_argument( - "--evaluate_during_training", - action="store_true", - help="Whether to run evaluation during training at each logging step.", - ) - parser.add_argument( - "--do_lower_case", - action="store_true", - help="Set this flag if you are using an uncased model.", - ) - - parser.add_argument( - "--per_gpu_train_batch_size", - default=8, - type=int, - help="Batch size per GPU/CPU for training.", - ) - parser.add_argument( - "--per_gpu_eval_batch_size", - default=8, - type=int, - help="Batch size per GPU/CPU for evaluation.", - ) - parser.add_argument( - "--gradient_accumulation_steps", - type=int, - default=1, - help="Number of updates steps to accumulate before performing a backward/update pass.", - ) - parser.add_argument( - "--learning_rate", - default=5e-5, - type=float, - help="The initial learning rate for Adam.", - ) - parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") - parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.") - parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.") - parser.add_argument( - "--num_train_epochs", - default=3, - type=int, - help="Total number of training epochs to perform.", - ) - parser.add_argument( - "--max_steps", - default=-1, - type=int, - help="If > 0: set total number of training steps to perform. 
Override num_train_epochs.",
-    )
-    parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup over warmup_steps.")
-
-    parser.add_argument("--logging_steps", type=int, default=10, help="Log every X updates steps.")
-    parser.add_argument(
-        "--save_steps",
-        type=int,
-        default=50,
-        help="Save checkpoint every X updates steps.",
-    )
-    parser.add_argument(
-        "--eval_all_checkpoints",
-        action="store_true",
-        help="Evaluate all checkpoints starting with the same prefix as model_name and ending with a step number",
-    )
-    parser.add_argument("--no_cuda", action="store_true", help="Avoid using CUDA when available")
-    parser.add_argument(
-        "--overwrite_output_dir",
-        action="store_true",
-        help="Overwrite the content of the output directory",
-    )
-    parser.add_argument(
-        "--overwrite_cache",
-        action="store_true",
-        help="Overwrite the cached training and evaluation sets",
-    )
-    parser.add_argument("--seed", type=int, default=42, help="random seed for initialization")
-    parser.add_argument(
-        "--fp16",
-        action="store_true",
-        help="Whether to use 16-bit (mixed) precision instead of 32-bit",
-    )
-    parser.add_argument(
-        "--fp16_opt_level",
-        type=str,
-        default="O1",
-        help="For fp16: AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']."
-        "See details at https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/api/paddle/amp/auto_cast_cn.html",
-    )
-    parser.add_argument(
-        "--local_rank",
-        type=int,
-        default=-1,
-        help="For distributed training: local_rank",
-    )
-    parser.add_argument("--server_ip", type=str, default="", help="For distant debugging.")
-    parser.add_argument("--server_port", type=str, default="", help="For distant debugging.")
-    args = parser.parse_args()
-    return args
diff --git a/examples/multimodal/layoutxlm/README.md b/examples/multimodal/layoutxlm/README.md
deleted file mode 100644
index 03c0a93c6a20..000000000000
--- a/examples/multimodal/layoutxlm/README.md
+++ /dev/null
@@ -1,45 +0,0 @@
-# LayoutXLM
-
-## Introduction
-This project is an open-source implementation of [LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding](https://arxiv.org/pdf/2104.08836.pdf) on Paddle 2.2,
-including fine-tuning code on the [XFUND dataset](https://github.com/doc-analysis/XFUND).
-
-## Quick Start
-### Environment Setup
-Dependencies:
-- cv2
-- sentencepiece
-- yacs
-
-Install them with:
-```shell
-pip install opencv-python
-pip install sentencepiece
-pip install yacs
-```
-
-### Data Preparation
-A preprocessed copy of the Chinese portion of XFUND can be downloaded from: https://bj.bcebos.com/v1/paddlenlp/datasets/XFUND.zip .
-
-Download and extract the archive, then place the dataset in the current directory.
-
-### Fine-tuning
-1. Launch fine-tuning for the ``Semantic Entity Recognition`` task as follows:
-   ```shell
-   bash run_xfun_ser.sh
-
-   # Results:
-   # best metrics: {'precision': 0.8514686248331108, 'recall': 0.9354602126879354, 'f1': 0.8914904770225406}
-   ```
-
-2. Launch fine-tuning for the ``Relation Extraction`` task as follows:
-   ```shell
-   bash run_xfun_re.sh
-
-   # Results:
-   # best metrics: {'precision': 0.6788935658448587, 'recall': 0.7743484224965707, 'f1': 0.7234860621595642}
-   ```
-
-## Reference
-- [LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding](https://arxiv.org/pdf/2104.08836.pdf)
-- [microsoft/unilm/layoutxlm](https://github.com/microsoft/unilm/tree/master/layoutxlm)
diff --git a/examples/multimodal/layoutxlm/compare.py b/examples/multimodal/layoutxlm/compare.py
deleted file mode 100644
index 120f651c177a..000000000000
--- a/examples/multimodal/layoutxlm/compare.py
+++ /dev/null
@@ -1,105 +0,0 @@
-# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
-# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import sys - -import numpy as np -import paddle -import torch - -sys.path.insert(0, "../../../") - - -def get_input_demo(platform="paddle", device="cpu"): - info = paddle.load("fake_input_paddle_xlm.data") - # imgs = np.random.rand(info["input_ids"].shape[0], 3, 224, 224).astype(np.float32) - # info["image"] = paddle.to_tensor(imgs) - if platform == "torch": - info = {key: torch.tensor(info[key].numpy()) for key in info} - if device == "gpu": - info = {key: info[key].cuda() for key in info} - return info - - -def test_layoutlm_paddle(): - from paddlenlp.transformers import LayoutXLMModel - - model = LayoutXLMModel.from_pretrained("layoutxlm-base-uncased") - model.eval() - - paddle.save(model.state_dict(), "v2.pdparams") - - batch_input = get_input_demo(platform="paddle", device="gpu") - with paddle.no_grad(): - outputs = model( - input_ids=batch_input["input_ids"], - bbox=batch_input["bbox"], - image=batch_input["image"], - attention_mask=batch_input["attention_mask"], - ) - sequence_output = outputs[0] - pooled_output = outputs[1] - return sequence_output, pooled_output - - -def test_layoutlm_torch(): - # import pytorch models - from layoutlmft.models.layoutxlm import LayoutXLMModel - - model = LayoutXLMModel.from_pretrained("microsoft/layoutxlm-base") - model.eval() - model = model.cuda() - - batch_input = get_input_demo(platform="torch", device="gpu") - - outputs = model( - input_ids=batch_input["input_ids"], - bbox=batch_input["bbox"], - image=batch_input["image"], - attention_mask=batch_input["attention_mask"], - ) - sequence_output = outputs[0] - pooled_output = outputs[1] - return sequence_output, pooled_output - - -def get_statistic_info(x, y): - mean_abs_diff = np.mean(np.abs(x - y)) - max_abs_diff = np.max(np.abs(x - y)) - return mean_abs_diff, max_abs_diff - - -if __name__ == "__main__": - - print("\n====test_layoutxlm_torch=====") - torch_hidden_out, torch_pool_out = test_layoutlm_torch() - torch_hidden_out = torch_hidden_out.cpu().detach().numpy() - torch_pool_out = torch_pool_out.cpu().detach().numpy() - print(torch_hidden_out.shape, torch_pool_out.shape) - - print("\n====test_layoutxlm_paddle=====") - paddle_hidden_out, paddle_pool_out = test_layoutlm_paddle() - paddle_hidden_out = paddle_hidden_out.numpy() - paddle_pool_out = paddle_pool_out.numpy() - print(paddle_hidden_out.shape, paddle_pool_out.shape) - - mean_abs_diff, max_abs_diff = get_statistic_info(torch_hidden_out, paddle_hidden_out) - print("======hidden_out diff info====") - print("\t mean_abs_diff: {}".format(mean_abs_diff)) - print("\t max_abs_diff: {}".format(max_abs_diff)) - - mean_abs_diff, max_abs_diff = get_statistic_info(torch_pool_out, paddle_pool_out) - print("======pool_out diff info====") - print("\t mean_abs_diff: {}".format(mean_abs_diff)) - print("\t max_abs_diff: {}".format(max_abs_diff)) diff --git a/examples/multimodal/layoutxlm/run_xfun_re.py b/examples/multimodal/layoutxlm/run_xfun_re.py deleted file mode 100644 index 13e31b27b99c..000000000000 
--- a/examples/multimodal/layoutxlm/run_xfun_re.py +++ /dev/null @@ -1,406 +0,0 @@ -import sys -import os -import random -import numbers -import logging - -import argparse -import paddle -import numpy as np -from paddlenlp.transformers import LayoutXLMModel, LayoutXLMTokenizer, LayoutXLMForRelationExtraction -from xfun import XFUN - -# Todo: delete the following line after the release of v2.2 -sys.path.insert(0, "../../../") -logger = logging.getLogger(__name__) - - -class DataCollator: - def __call__(self, batch): - data_dict = {} - to_tensor_keys = [] - for sample in batch: - for k, v in sample.items(): - if k not in data_dict: - data_dict[k] = [] - if isinstance(v, (np.ndarray, paddle.Tensor, numbers.Number)): - if k not in to_tensor_keys: - to_tensor_keys.append(k) - data_dict[k].append(v) - for k in to_tensor_keys: - data_dict[k] = paddle.to_tensor(data_dict[k]) - return data_dict - - -def parse_args(): - parser = argparse.ArgumentParser() - # Required parameters - # yapf: disable - parser.add_argument("--model_name_or_path", default=None, type=str, required=True,) - parser.add_argument("--train_data_dir", default=None, type=str, required=False,) - parser.add_argument("--train_label_path", default=None, type=str, required=False,) - parser.add_argument("--eval_data_dir", default=None, type=str, required=False,) - parser.add_argument("--eval_label_path", default=None, type=str, required=False,) - parser.add_argument("--use_vdl", default=False, type=bool, required=False,) - parser.add_argument("--output_dir", default=None, type=str, required=True,) - parser.add_argument("--max_seq_length", default=512, type=int,) - parser.add_argument("--evaluate_during_training", action="store_true",) - parser.add_argument("--per_gpu_train_batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.",) - parser.add_argument("--per_gpu_eval_batch_size", default=8, type=int, help="Batch size per GPU/CPU for eval.",) - parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.",) - parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.",) - parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.",) - parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.",) - parser.add_argument("--num_train_epochs", default=3, type=int, help="Total number of training epochs to perform.",) - parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup over warmup_steps.",) - parser.add_argument("--eval_steps", type=int, default=10, help="eval every X updates steps.",) - parser.add_argument("--save_steps", type=int, default=50, help="Save checkpoint every X updates steps.",) - parser.add_argument("--seed", type=int, default=42, help="random seed for initialization",) - # yapf: enable - args = parser.parse_args() - return args - - -def set_seed(args): - random.seed(args.seed) - np.random.seed(args.seed) - paddle.seed(args.seed) - - -def get_label_maps(): - labels = ["O", "B-QUESTION", "B-ANSWER", "B-HEADER", "I-ANSWER", "I-QUESTION", "I-HEADER"] - label2id_map = {label: idx for idx, label in enumerate(labels)} - id2label_map = {idx: label for idx, label in enumerate(labels)} - return label2id_map, id2label_map - - -def cal_metric(re_preds, re_labels, entities): - gt_relations = [] - for b in range(len(re_labels)): - rel_sent = [] - for head, tail in zip(re_labels[b]["head"], re_labels[b]["tail"]): - rel = {} - rel["head_id"] 
= head - rel["head"] = (entities[b]["start"][rel["head_id"]], entities[b]["end"][rel["head_id"]]) - rel["head_type"] = entities[b]["label"][rel["head_id"]] - - rel["tail_id"] = tail - rel["tail"] = (entities[b]["start"][rel["tail_id"]], entities[b]["end"][rel["tail_id"]]) - rel["tail_type"] = entities[b]["label"][rel["tail_id"]] - - rel["type"] = 1 - rel_sent.append(rel) - gt_relations.append(rel_sent) - re_metrics = re_score(re_preds, gt_relations, mode="boundaries") - return re_metrics - - -def re_score(pred_relations, gt_relations, mode="strict"): - """Evaluate RE predictions - - Args: - pred_relations (list) : list of list of predicted relations (several relations in each sentence) - gt_relations (list) : list of list of ground truth relations - - rel = { "head": (start_idx (inclusive), end_idx (exclusive)), - "tail": (start_idx (inclusive), end_idx (exclusive)), - "head_type": ent_type, - "tail_type": ent_type, - "type": rel_type} - - vocab (Vocab) : dataset vocabulary - mode (str) : in 'strict' or 'boundaries'""" - - assert mode in ["strict", "boundaries"] - - relation_types = [v for v in [0, 1] if not v == 0] - scores = {rel: {"tp": 0, "fp": 0, "fn": 0} for rel in relation_types + ["ALL"]} - - # Count GT relations and Predicted relations - n_sents = len(gt_relations) - n_rels = sum([len([rel for rel in sent]) for sent in gt_relations]) - n_found = sum([len([rel for rel in sent]) for sent in pred_relations]) - - # Count TP, FP and FN per type - for pred_sent, gt_sent in zip(pred_relations, gt_relations): - for rel_type in relation_types: - # strict mode takes argument types into account - if mode == "strict": - pred_rels = { - (rel["head"], rel["head_type"], rel["tail"], rel["tail_type"]) - for rel in pred_sent - if rel["type"] == rel_type - } - gt_rels = { - (rel["head"], rel["head_type"], rel["tail"], rel["tail_type"]) - for rel in gt_sent - if rel["type"] == rel_type - } - - # boundaries mode only takes argument spans into account - elif mode == "boundaries": - pred_rels = {(rel["head"], rel["tail"]) for rel in pred_sent if rel["type"] == rel_type} - gt_rels = {(rel["head"], rel["tail"]) for rel in gt_sent if rel["type"] == rel_type} - - scores[rel_type]["tp"] += len(pred_rels & gt_rels) - scores[rel_type]["fp"] += len(pred_rels - gt_rels) - scores[rel_type]["fn"] += len(gt_rels - pred_rels) - - # Compute per entity Precision / Recall / F1 - for rel_type in scores.keys(): - if scores[rel_type]["tp"]: - scores[rel_type]["p"] = scores[rel_type]["tp"] / (scores[rel_type]["fp"] + scores[rel_type]["tp"]) - scores[rel_type]["r"] = scores[rel_type]["tp"] / (scores[rel_type]["fn"] + scores[rel_type]["tp"]) - else: - scores[rel_type]["p"], scores[rel_type]["r"] = 0, 0 - - if not scores[rel_type]["p"] + scores[rel_type]["r"] == 0: - scores[rel_type]["f1"] = ( - 2 * scores[rel_type]["p"] * scores[rel_type]["r"] / (scores[rel_type]["p"] + scores[rel_type]["r"]) - ) - else: - scores[rel_type]["f1"] = 0 - - # Compute micro F1 Scores - tp = sum([scores[rel_type]["tp"] for rel_type in relation_types]) - fp = sum([scores[rel_type]["fp"] for rel_type in relation_types]) - fn = sum([scores[rel_type]["fn"] for rel_type in relation_types]) - - if tp: - precision = tp / (tp + fp) - recall = tp / (tp + fn) - f1 = 2 * precision * recall / (precision + recall) - - else: - precision, recall, f1 = 0, 0, 0 - - scores["ALL"]["p"] = precision - scores["ALL"]["r"] = recall - scores["ALL"]["f1"] = f1 - scores["ALL"]["tp"] = tp - scores["ALL"]["fp"] = fp - scores["ALL"]["fn"] = fn - - # Compute Macro F1 Scores - 
scores["ALL"]["Macro_f1"] = np.mean([scores[ent_type]["f1"] for ent_type in relation_types]) - scores["ALL"]["Macro_p"] = np.mean([scores[ent_type]["p"] for ent_type in relation_types]) - scores["ALL"]["Macro_r"] = np.mean([scores[ent_type]["r"] for ent_type in relation_types]) - - logger.info(f"RE Evaluation in *** {mode.upper()} *** mode") - - logger.info( - "processed {} sentences with {} relations; found: {} relations; correct: {}.".format( - n_sents, n_rels, n_found, tp - ) - ) - logger.info( - "\tALL\t TP: {};\tFP: {};\tFN: {}".format(scores["ALL"]["tp"], scores["ALL"]["fp"], scores["ALL"]["fn"]) - ) - logger.info("\t\t(m avg): precision: {:.2f};\trecall: {:.2f};\tf1: {:.2f} (micro)".format(precision, recall, f1)) - logger.info( - "\t\t(M avg): precision: {:.2f};\trecall: {:.2f};\tf1: {:.2f} (Macro)\n".format( - scores["ALL"]["Macro_p"], scores["ALL"]["Macro_r"], scores["ALL"]["Macro_f1"] - ) - ) - - for rel_type in relation_types: - logger.info( - "\t{}: \tTP: {};\tFP: {};\tFN: {};\tprecision: {:.2f};\trecall: {:.2f};\tf1: {:.2f};\t{}".format( - rel_type, - scores[rel_type]["tp"], - scores[rel_type]["fp"], - scores[rel_type]["fn"], - scores[rel_type]["p"], - scores[rel_type]["r"], - scores[rel_type]["f1"], - scores[rel_type]["tp"] + scores[rel_type]["fp"], - ) - ) - - return scores - - -def evaluate(model, eval_dataloader, logger, prefix=""): - # Eval! - logger.info(f"***** Running evaluation {prefix} *****") - logger.info(f" Num examples = {len(eval_dataloader.dataset)}") - - re_preds = [] - re_labels = [] - entities = [] - eval_loss = 0.0 - model.eval() - for idx, batch in enumerate(eval_dataloader): - with paddle.no_grad(): - outputs = model(**batch) - loss = outputs["loss"].mean().item() - if paddle.distributed.get_rank() == 0: - logger.info(f"[Eval] process: {idx}/{len(eval_dataloader)}, loss: {loss:.5f}") - - eval_loss += loss - re_preds.extend(outputs["pred_relations"]) - re_labels.extend(batch["relations"]) - entities.extend(outputs["entities"]) - re_metrics = cal_metric(re_preds, re_labels, entities) - re_metrics = { - "precision": re_metrics["ALL"]["p"], - "recall": re_metrics["ALL"]["r"], - "f1": re_metrics["ALL"]["f1"], - } - model.train() - return re_metrics - - -def train(args): - os.makedirs(args.output_dir, exist_ok=True) - set_seed(args) - - label2id_map, id2label_map = get_label_maps() - pad_token_label_id = paddle.nn.CrossEntropyLoss().ignore_index - - # dist mode - if paddle.distributed.get_world_size() > 1: - paddle.distributed.init_parallel_env() - - tokenizer = LayoutXLMTokenizer.from_pretrained(args.model_name_or_path) - base_model = LayoutXLMModel.from_pretrained(args.model_name_or_path) - model = LayoutXLMForRelationExtraction(base_model, dropout=None) - - # dist mode - if paddle.distributed.get_world_size() > 1: - model = paddle.DataParallel(model) - - train_dataset = XFUN( - tokenizer, - data_dir=args.train_data_dir, - label_path=args.train_label_path, - label2id_map=label2id_map, - img_size=(224, 224), - max_seq_len=args.max_seq_length, - pad_token_label_id=pad_token_label_id, - contains_re=True, - add_special_ids=False, - return_attention_mask=True, - load_mode="all", - ) - - eval_dataset = XFUN( - tokenizer, - data_dir=args.eval_data_dir, - label_path=args.eval_label_path, - label2id_map=label2id_map, - img_size=(224, 224), - max_seq_len=args.max_seq_length, - pad_token_label_id=pad_token_label_id, - contains_re=True, - add_special_ids=False, - return_attention_mask=True, - load_mode="all", - ) - - train_sampler = paddle.io.DistributedBatchSampler( - 
train_dataset, batch_size=args.per_gpu_train_batch_size, shuffle=True
-    )
-    args.train_batch_size = args.per_gpu_train_batch_size * max(1, paddle.distributed.get_world_size())
-    train_dataloader = paddle.io.DataLoader(
-        train_dataset, batch_sampler=train_sampler, num_workers=8, use_shared_memory=True, collate_fn=DataCollator()
-    )
-
-    eval_dataloader = paddle.io.DataLoader(
-        eval_dataset, batch_size=args.per_gpu_eval_batch_size, num_workers=8, shuffle=False, collate_fn=DataCollator()
-    )
-
-    t_total = len(train_dataloader) * args.num_train_epochs
-
-    # build linear decay with warmup lr sch
-    lr_scheduler = paddle.optimizer.lr.PolynomialDecay(
-        learning_rate=args.learning_rate, decay_steps=t_total, end_lr=0.0, power=1.0
-    )
-    if args.warmup_steps > 0:
-        lr_scheduler = paddle.optimizer.lr.LinearWarmup(
-            lr_scheduler,
-            args.warmup_steps,
-            start_lr=0,
-            end_lr=args.learning_rate,
-        )
-    grad_clip = paddle.nn.ClipGradByNorm(clip_norm=10)
-    optimizer = paddle.optimizer.Adam(
-        learning_rate=args.learning_rate,
-        parameters=model.parameters(),
-        epsilon=args.adam_epsilon,
-        grad_clip=grad_clip,
-        weight_decay=args.weight_decay,
-    )
-
-    # Train!
-    logger.info("***** Running training *****")
-    logger.info(f"  Num examples = {len(train_dataset)}")
-    logger.info(f"  Num Epochs = {args.num_train_epochs}")
-    logger.info(f"  Instantaneous batch size per GPU = {args.per_gpu_train_batch_size}")
-    logger.info(
-        f"  Total train batch size (w. parallel, distributed & accumulation) = {args.train_batch_size * paddle.distributed.get_world_size()}"
-    )
-    logger.info(f"  Total optimization steps = {t_total}")
-
-    global_step = 0
-    train_dataloader_len = len(train_dataloader)
-    best_metric = {"f1": 0}
-    model.train()
-
-    for epoch in range(int(args.num_train_epochs)):
-        for step, batch in enumerate(train_dataloader):
-            outputs = model(**batch)
-            # model outputs are always tuple in ppnlp (see doc)
-            loss = outputs["loss"]
-            loss = loss.mean()
-
-            logger.info(
-                f"epoch: [{epoch}/{args.num_train_epochs}], iter: [{step}/{train_dataloader_len}], global_step:{global_step}, train loss: {np.mean(loss.numpy())}, lr: {optimizer.get_lr()}"
-            )
-
-            loss.backward()
-            optimizer.step()
-            optimizer.clear_grad()
-            # lr_scheduler.step()  # Update learning rate schedule
-
-            global_step += 1
-
-            if paddle.distributed.get_rank() == 0 and args.eval_steps > 0 and global_step % args.eval_steps == 0:
-                # Log metrics
-                if paddle.distributed.get_rank() == 0 and args.evaluate_during_training:
-                    results = evaluate(model, eval_dataloader, logger)
-                    if results["f1"] > best_metric["f1"]:
-                        best_metric = results
-                        output_dir = os.path.join(args.output_dir, "checkpoint-best")
-                        os.makedirs(output_dir, exist_ok=True)
-                        model.save_pretrained(output_dir)
-                        tokenizer.save_pretrained(output_dir)
-                        paddle.save(args, os.path.join(output_dir, "training_args.bin"))
-                        logger.info(f"Saving model checkpoint to {output_dir}")
-                    logger.info(f"eval results: {results}")
-                    logger.info(f"best_metric: {best_metric}")
-
-            if paddle.distributed.get_rank() == 0 and args.save_steps > 0 and global_step % args.save_steps == 0:
-                # Save model checkpoint
-                output_dir = os.path.join(args.output_dir, "checkpoint-latest")
-                os.makedirs(output_dir, exist_ok=True)
-                if paddle.distributed.get_rank() == 0:
-                    model.save_pretrained(output_dir)
-                    tokenizer.save_pretrained(output_dir)
-                    paddle.save(args, os.path.join(output_dir, "training_args.bin"))
-                    logger.info(f"Saving model checkpoint to {output_dir}")
-                    logger.info(f"best_metric: {best_metric}")
-
-
-def print_arguments(args):
- """print arguments""" - print("----------- Configuration Arguments -----------") - for arg, value in sorted(vars(args).items()): - print("%s: %s" % (arg, value)) - print("------------------------------------------------") - - -if __name__ == "__main__": - args = parse_args() - print_arguments(args) - train(args) diff --git a/examples/multimodal/layoutxlm/run_xfun_re.sh b/examples/multimodal/layoutxlm/run_xfun_re.sh deleted file mode 100644 index 4aeea52f5dc9..000000000000 --- a/examples/multimodal/layoutxlm/run_xfun_re.sh +++ /dev/null @@ -1,19 +0,0 @@ -export CUDA_VISIBLE_DEVICES=0 - -python ./run_xfun_re.py \ - --model_name_or_path "layoutxlm-base-uncased" \ - --max_seq_length 512 \ - --train_data_dir "XFUND/zh_train/image" \ - --train_label_path "XFUND/zh_train/xfun_normalize_train.json" \ - --eval_data_dir "XFUND/zh_val/image" \ - --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \ - --num_train_epochs 200 \ - --eval_steps 50 \ - --save_steps 500 \ - --output_dir "./output/re/" \ - --learning_rate 5e-5 \ - --warmup_steps 50 \ - --per_gpu_train_batch_size 8 \ - --per_gpu_eval_batch_size 8 \ - --evaluate_during_training \ - --seed 2048 diff --git a/examples/multimodal/layoutxlm/run_xfun_ser.py b/examples/multimodal/layoutxlm/run_xfun_ser.py deleted file mode 100644 index 36b0b988822d..000000000000 --- a/examples/multimodal/layoutxlm/run_xfun_ser.py +++ /dev/null @@ -1,353 +0,0 @@ -# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
- -import argparse -import copy -import logging -import os -import random -import sys - -import numpy as np -import paddle -from seqeval.metrics import ( - classification_report, - f1_score, - precision_score, - recall_score, -) -from xfun import XFUN - -from paddlenlp.transformers import ( - LayoutXLMForTokenClassification, - LayoutXLMModel, - LayoutXLMTokenizer, -) - -# Todo: delete the following line after the release of v2.2 -sys.path.insert(0, "../../../") -logger = logging.getLogger(__name__) - - -def parse_args(): - parser = argparse.ArgumentParser() - # Required parameters - # yapf: disable - parser.add_argument("--model_name_or_path", default=None, type=str, required=True,) - parser.add_argument("--train_data_dir", default=None, type=str, required=False,) - parser.add_argument("--train_label_path", default=None, type=str, required=False,) - parser.add_argument("--eval_data_dir", default=None, type=str, required=False,) - parser.add_argument("--eval_label_path", default=None, type=str, required=False,) - parser.add_argument("--use_vdl", default=False, type=bool, required=False,) - parser.add_argument("--output_dir", default=None, type=str, required=True,) - parser.add_argument("--max_seq_length", default=512, type=int,) - parser.add_argument("--evaluate_during_training", action="store_true",) - parser.add_argument("--per_gpu_train_batch_size", default=8, type=int, help="Batch size per GPU/CPU for training.",) - parser.add_argument("--per_gpu_eval_batch_size", default=8, type=int, help="Batch size per GPU/CPU for eval.",) - parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.",) - parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.",) - parser.add_argument("--adam_epsilon", default=1e-8, type=float, help="Epsilon for Adam optimizer.",) - parser.add_argument("--max_grad_norm", default=1.0, type=float, help="Max gradient norm.",) - parser.add_argument("--num_train_epochs", default=3, type=int, help="Total number of training epochs to perform.",) - parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup over warmup_steps.",) - parser.add_argument("--eval_steps", type=int, default=10, help="eval every X updates steps.",) - parser.add_argument("--save_steps", type=int, default=50, help="Save checkpoint every X updates steps.",) - parser.add_argument("--seed", type=int, default=42, help="random seed for initialization",) - # yapf: enable - args = parser.parse_args() - return args - - -def set_seed(args): - random.seed(args.seed) - np.random.seed(args.seed) - paddle.seed(args.seed) - - -def get_label_maps(): - labels = ["O", "B-QUESTION", "B-ANSWER", "B-HEADER", "I-ANSWER", "I-QUESTION", "I-HEADER"] - label2id_map = {label: idx for idx, label in enumerate(labels)} - id2label_map = {idx: label for idx, label in enumerate(labels)} - return label2id_map, id2label_map - - -def train(args): - os.makedirs(args.output_dir, exist_ok=True) - logging.basicConfig( - filename=os.path.join(args.output_dir, "train.log") if paddle.distributed.get_rank() == 0 else None, - format="%(asctime)s - %(levelname)s - %(name)s - %(message)s", - datefmt="%m/%d/%Y %H:%M:%S", - level=logging.INFO if paddle.distributed.get_rank() == 0 else logging.WARN, - ) - - ch = logging.StreamHandler() - ch.setLevel(logging.DEBUG) - logger.addHandler(ch) - - label2id_map, id2label_map = get_label_maps() - pad_token_label_id = paddle.nn.CrossEntropyLoss().ignore_index - - # dist mode - if 
paddle.distributed.get_world_size() > 1: - paddle.distributed.init_parallel_env() - - tokenizer = LayoutXLMTokenizer.from_pretrained(args.model_name_or_path) - base_model = LayoutXLMModel.from_pretrained(args.model_name_or_path) - model = LayoutXLMForTokenClassification(base_model, num_classes=len(label2id_map), dropout=None) - - # dist mode - if paddle.distributed.get_world_size() > 1: - model = paddle.DataParallel(model) - - train_dataset = XFUN( - tokenizer, - data_dir=args.train_data_dir, - label_path=args.train_label_path, - label2id_map=label2id_map, - img_size=(224, 224), - pad_token_label_id=pad_token_label_id, - contains_re=False, - add_special_ids=False, - return_attention_mask=True, - load_mode="all", - ) - - train_sampler = paddle.io.DistributedBatchSampler( - train_dataset, batch_size=args.per_gpu_train_batch_size, shuffle=True - ) - - args.train_batch_size = args.per_gpu_train_batch_size * max(1, paddle.distributed.get_world_size()) - - train_dataloader = paddle.io.DataLoader( - train_dataset, - batch_sampler=train_sampler, - num_workers=0, - use_shared_memory=True, - collate_fn=None, - ) - - t_total = len(train_dataloader) * args.num_train_epochs - - # build linear decay with warmup lr sch - lr_scheduler = paddle.optimizer.lr.PolynomialDecay( - learning_rate=args.learning_rate, decay_steps=t_total, end_lr=0.0, power=1.0 - ) - if args.warmup_steps > 0: - lr_scheduler = paddle.optimizer.lr.LinearWarmup( - lr_scheduler, - args.warmup_steps, - start_lr=0, - end_lr=args.learning_rate, - ) - - optimizer = paddle.optimizer.AdamW( - learning_rate=lr_scheduler, - parameters=model.parameters(), - epsilon=args.adam_epsilon, - weight_decay=args.weight_decay, - ) - - # Train! - logger.info("***** Running training *****") - logger.info(" Num examples = %d", len(train_dataset)) - logger.info(" Num Epochs = %d", args.num_train_epochs) - logger.info(" Instantaneous batch size per GPU = %d", args.per_gpu_train_batch_size) - logger.info( - " Total train batch size (w. 
parallel, distributed) = %d", - args.train_batch_size * paddle.distributed.get_world_size(), - ) - logger.info(" Total optimization steps = %d", t_total) - - global_step = 0 - tr_loss = 0.0 - set_seed(args) - best_metrics = None - - for epoch_id in range(args.num_train_epochs): - for step, batch in enumerate(train_dataloader): - model.train() - outputs = model(**batch) - # model outputs are always tuple in ppnlp (see doc) - loss = outputs[0] - loss = loss.mean() - logger.info( - "[epoch {}/{}][iter: {}/{}] lr: {:.5f}, train loss: {:.5f}, ".format( - epoch_id, - args.num_train_epochs, - step, - len(train_dataloader), - lr_scheduler.get_lr(), - float(loss), - ) - ) - - loss.backward() - tr_loss += loss.item() - optimizer.step() - lr_scheduler.step() # Update learning rate schedule - optimizer.clear_grad() - global_step += 1 - - if paddle.distributed.get_rank() == 0 and args.eval_steps > 0 and global_step % args.eval_steps == 0: - # Log metrics - # Only evaluate when single GPU otherwise metrics may not average well - if paddle.distributed.get_rank() == 0 and args.evaluate_during_training: - results, _ = evaluate( - args, - model, - tokenizer, - label2id_map, - id2label_map, - pad_token_label_id, - ) - - if best_metrics is None or results["f1"] >= best_metrics["f1"]: - best_metrics = copy.deepcopy(results) - output_dir = os.path.join(args.output_dir, "best_model") - os.makedirs(output_dir, exist_ok=True) - if paddle.distributed.get_rank() == 0: - model.save_pretrained(output_dir) - tokenizer.save_pretrained(output_dir) - paddle.save(args, os.path.join(output_dir, "training_args.bin")) - logger.info("Saving model checkpoint to %s", output_dir) - - logger.info( - "[epoch {}/{}][iter: {}/{}] results: {}".format( - epoch_id, args.num_train_epochs, step, len(train_dataloader), results - ) - ) - if best_metrics is not None: - logger.info("best metrics: {}".format(best_metrics)) - - if paddle.distributed.get_rank() == 0 and args.save_steps > 0 and global_step % args.save_steps == 0: - # Save model checkpoint - output_dir = os.path.join(args.output_dir, "checkpoint-{}".format(global_step)) - os.makedirs(output_dir, exist_ok=True) - if paddle.distributed.get_rank() == 0: - model.save_pretrained(output_dir) - tokenizer.save_pretrained(output_dir) - paddle.save(args, os.path.join(output_dir, "training_args.bin")) - logger.info("Saving model checkpoint to %s", output_dir) - - return global_step, tr_loss / global_step - - -def evaluate(args, model, tokenizer, label2id_map, id2label_map, pad_token_label_id, prefix=""): - eval_dataset = XFUN( - tokenizer, - data_dir=args.eval_data_dir, - label_path=args.eval_label_path, - label2id_map=label2id_map, - img_size=(224, 224), - pad_token_label_id=pad_token_label_id, - contains_re=False, - add_special_ids=False, - return_attention_mask=True, - load_mode="all", - ) - - args.eval_batch_size = args.per_gpu_eval_batch_size * max(1, paddle.distributed.get_world_size()) - - eval_dataloader = paddle.io.DataLoader( - eval_dataset, - batch_size=args.eval_batch_size, - num_workers=0, - use_shared_memory=True, - collate_fn=None, - ) - - # Eval! 
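-    # The loop below accumulates token-level logits over the whole eval set, drops pad
-    # positions (pad_token_label_id), and scores the remaining predictions against the
-    # gold labels with seqeval's entity-level precision/recall/F1.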
- logger.info("***** Running evaluation %s *****", prefix) - logger.info(" Num examples = %d", len(eval_dataset)) - logger.info(" Batch size = %d", args.eval_batch_size) - eval_loss = 0.0 - nb_eval_steps = 0 - preds = None - out_label_ids = None - model.eval() - for idx, batch in enumerate(eval_dataloader): - with paddle.no_grad(): - outputs = model(**batch) - tmp_eval_loss, logits = outputs[:2] - - tmp_eval_loss = tmp_eval_loss.mean() - - if paddle.distributed.get_rank() == 0: - logger.info( - "[Eval]process: {}/{}, loss: {:.5f}".format(idx, len(eval_dataloader), float(tmp_eval_loss)) - ) - - eval_loss += tmp_eval_loss.item() - nb_eval_steps += 1 - if preds is None: - preds = logits.numpy() - out_label_ids = batch["labels"].numpy() - else: - preds = np.append(preds, logits.numpy(), axis=0) - out_label_ids = np.append(out_label_ids, batch["labels"].numpy(), axis=0) - - eval_loss = eval_loss / nb_eval_steps - preds = np.argmax(preds, axis=2) - - # label_map = {i: label.upper() for i, label in enumerate(labels)} - - out_label_list = [[] for _ in range(out_label_ids.shape[0])] - preds_list = [[] for _ in range(out_label_ids.shape[0])] - - for i in range(out_label_ids.shape[0]): - for j in range(out_label_ids.shape[1]): - if out_label_ids[i, j] != pad_token_label_id: - out_label_list[i].append(id2label_map[out_label_ids[i][j]]) - preds_list[i].append(id2label_map[preds[i][j]]) - - results = { - "loss": eval_loss, - "precision": precision_score(out_label_list, preds_list), - "recall": recall_score(out_label_list, preds_list), - "f1": f1_score(out_label_list, preds_list), - } - - with open(os.path.join(args.output_dir, "test_gt.txt"), "w") as fout: - for lbl in out_label_list: - for l in lbl: - fout.write(l + "\t") - fout.write("\n") - with open(os.path.join(args.output_dir, "test_pred.txt"), "w") as fout: - for lbl in preds_list: - for l in lbl: - fout.write(l + "\t") - fout.write("\n") - - report = classification_report(out_label_list, preds_list) - logger.info("\n" + report) - - logger.info("***** Eval results %s *****", prefix) - for key in sorted(results.keys()): - logger.info(" %s = %s", key, str(results[key])) - - return results, preds_list - - -def print_arguments(args): - """print arguments""" - print("----------- Configuration Arguments -----------") - for arg, value in sorted(vars(args).items()): - print("%s: %s" % (arg, value)) - print("------------------------------------------------") - - -if __name__ == "__main__": - args = parse_args() - print_arguments(args) - train(args) diff --git a/examples/multimodal/layoutxlm/run_xfun_ser.sh b/examples/multimodal/layoutxlm/run_xfun_ser.sh deleted file mode 100644 index 43454abfc264..000000000000 --- a/examples/multimodal/layoutxlm/run_xfun_ser.sh +++ /dev/null @@ -1,19 +0,0 @@ -export CUDA_VISIBLE_DEVICES=0 - -python ./run_xfun_ser.py \ - --model_name_or_path "layoutxlm-base-uncased" \ - --max_seq_length 512 \ - --train_data_dir "XFUND/zh_train/image" \ - --train_label_path "XFUND/zh_train/xfun_normalize_train.json" \ - --eval_data_dir "XFUND/zh_val/image" \ - --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \ - --num_train_epochs 200 \ - --eval_steps 10 \ - --save_steps 500 \ - --output_dir "./output/ser/" \ - --learning_rate 5e-5 \ - --warmup_steps 50 \ - --per_gpu_train_batch_size 8 \ - --per_gpu_eval_batch_size 8 \ - --evaluate_during_training \ - --seed 2048 diff --git a/examples/multimodal/layoutxlm/xfun.py b/examples/multimodal/layoutxlm/xfun.py deleted file mode 100644 index 3bb5be92e913..000000000000 --- 
a/examples/multimodal/layoutxlm/xfun.py +++ /dev/null @@ -1,410 +0,0 @@ -import json -import os -import cv2 -import numpy as np -import paddle -import copy -from paddle.io import Dataset - -__all__ = ["XFUN"] - - -class XFUN(Dataset): - """ - Example: - print("=====begin to build dataset=====") - from paddlenlp.transformers import LayoutXLMTokenizer - tokenizer = LayoutXLMTokenizer.from_pretrained("/paddle/models/transformers/layoutxlm-base-paddle/") - tok_res = tokenizer.tokenize("Maribyrnong") - # res = tokenizer.convert_ids_to_tokens(val_data["input_ids"][0]) - dataset = XfunDatasetForSer( - tokenizer, - data_dir="./zh.val/", - label_path="zh.val/xfun_normalize_val.json", - img_size=(224,224)) - print(len(dataset)) - - data = dataset[0] - print(data.keys()) - print("input_ids: ", data["input_ids"]) - print("labels: ", data["labels"]) - print("token_type_ids: ", data["token_type_ids"]) - print("words_list: ", data["words_list"]) - print("image shape: ", data["image"].shape) - """ - - def __init__( - self, - tokenizer, - data_dir, - label_path, - contains_re=False, - label2id_map=None, - img_size=(224, 224), - pad_token_label_id=None, - add_special_ids=False, - return_attention_mask=True, - load_mode="all", - max_seq_len=512, - ): - super(XFUN, self).__init__() - self.tokenizer = tokenizer - self.data_dir = data_dir - self.label_path = label_path - self.contains_re = contains_re - self.label2id_map = label2id_map - self.img_size = img_size - self.pad_token_label_id = pad_token_label_id - self.add_special_ids = add_special_ids - self.return_attention_mask = return_attention_mask - self.load_mode = load_mode - self.max_seq_len = max_seq_len - - if self.pad_token_label_id is None: - self.pad_token_label_id = paddle.nn.CrossEntropyLoss().ignore_index - - self.all_lines = self.read_all_lines() - - self.entities_labels = {"HEADER": 0, "QUESTION": 1, "ANSWER": 2} - self.return_keys = { - "bbox": "np", - "input_ids": "np", - "labels": "np", - "attention_mask": "np", - "image": "np", - "token_type_ids": "np", - "entities": "dict", - "relations": "dict", - } - - if load_mode == "all": - self.encoded_inputs_all = self._parse_label_file_all() - - def pad_sentences( - self, - encoded_inputs, - max_seq_len=512, - pad_to_max_seq_len=True, - return_attention_mask=True, - return_token_type_ids=True, - truncation_strategy="longest_first", - return_overflowing_tokens=False, - return_special_tokens_mask=False, - ): - # Padding - needs_to_be_padded = pad_to_max_seq_len and max_seq_len and len(encoded_inputs["input_ids"]) < max_seq_len - - if needs_to_be_padded: - difference = max_seq_len - len(encoded_inputs["input_ids"]) - if self.tokenizer.padding_side == "right": - if return_attention_mask: - encoded_inputs["attention_mask"] = [1] * len(encoded_inputs["input_ids"]) + [0] * difference - if return_token_type_ids: - encoded_inputs["token_type_ids"] = ( - encoded_inputs["token_type_ids"] + [self.tokenizer.pad_token_type_id] * difference - ) - if return_special_tokens_mask: - encoded_inputs["special_tokens_mask"] = encoded_inputs["special_tokens_mask"] + [1] * difference - encoded_inputs["input_ids"] = encoded_inputs["input_ids"] + [self.tokenizer.pad_token_id] * difference - encoded_inputs["labels"] = encoded_inputs["labels"] + [self.pad_token_label_id] * difference - encoded_inputs["bbox"] = encoded_inputs["bbox"] + [[0, 0, 0, 0]] * difference - elif self.tokenizer.padding_side == "left": - if return_attention_mask: - encoded_inputs["attention_mask"] = [0] * difference + [1] * len(encoded_inputs["input_ids"]) 
- if return_token_type_ids: - encoded_inputs["token_type_ids"] = [ - self.tokenizer.pad_token_type_id - ] * difference + encoded_inputs["token_type_ids"] - if return_special_tokens_mask: - encoded_inputs["special_tokens_mask"] = [1] * difference + encoded_inputs["special_tokens_mask"] - encoded_inputs["input_ids"] = [self.tokenizer.pad_token_id] * difference + encoded_inputs["input_ids"] - encoded_inputs["labels"] = [self.pad_token_label_id] * difference + encoded_inputs["labels"] - encoded_inputs["bbox"] = [[0, 0, 0, 0]] * difference + encoded_inputs["bbox"] - else: - if return_attention_mask: - encoded_inputs["attention_mask"] = [1] * len(encoded_inputs["input_ids"]) - - return encoded_inputs - - def truncate_inputs(self, encoded_inputs, max_seq_len=512): - for key in encoded_inputs: - if key == "sample_id": - continue - length = min(len(encoded_inputs[key]), max_seq_len) - encoded_inputs[key] = encoded_inputs[key][:length] - return encoded_inputs - - def read_all_lines( - self, - ): - with open(self.label_path, "r") as fin: - lines = fin.readlines() - return lines - - def _parse_label_file_all(self): - """ - parse all samples - """ - encoded_inputs_all = [] - for line in self.all_lines: - encoded_inputs_all.extend(self._parse_label_file(line)) - return encoded_inputs_all - - def _parse_label_file(self, line): - """ - parse single sample - """ - - image_name, info_str = line.split("\t") - image_path = os.path.join(self.data_dir, image_name) - - def add_imgge_path(x): - x["image_path"] = image_path - return x - - encoded_inputs = self._read_encoded_inputs_sample(info_str) - if self.contains_re: - encoded_inputs = self._chunk_re(encoded_inputs) - else: - encoded_inputs = self._chunk_ser(encoded_inputs) - encoded_inputs = list(map(add_imgge_path, encoded_inputs)) - return encoded_inputs - - def _read_encoded_inputs_sample(self, info_str): - """ - parse label info - """ - # read text info - info_dict = json.loads(info_str) - height = info_dict["height"] - width = info_dict["width"] - - words_list = [] - bbox_list = [] - input_ids_list = [] - token_type_ids_list = [] - gt_label_list = [] - - if self.contains_re: - # for re - entities = [] - relations = [] - id2label = {} - entity_id_to_index_map = {} - empty_entity = set() - for info in info_dict["ocr_info"]: - if self.contains_re: - # for re - if len(info["text"]) == 0: - empty_entity.add(info["id"]) - continue - id2label[info["id"]] = info["label"] - relations.extend([tuple(sorted(l)) for l in info["linking"]]) - - # x1, y1, x2, y2 - bbox = info["bbox"] - label = info["label"] - bbox[0] = int(bbox[0] * 1000.0 / width) - bbox[2] = int(bbox[2] * 1000.0 / width) - bbox[1] = int(bbox[1] * 1000.0 / height) - bbox[3] = int(bbox[3] * 1000.0 / height) - - text = info["text"] - encode_res = self.tokenizer.encode( - text, pad_to_max_seq_len=False, return_token_type_ids=True, return_attention_mask=True - ) - - gt_label = [] - if not self.add_special_ids: - # TODO: use tok.all_special_ids to remove - encode_res["input_ids"] = encode_res["input_ids"][1:-1] - encode_res["token_type_ids"] = encode_res["token_type_ids"][1:-1] - encode_res["attention_mask"] = encode_res["attention_mask"][1:-1] - if label.lower() == "other": - gt_label.extend([0] * len(encode_res["input_ids"])) - else: - gt_label.append(self.label2id_map[("b-" + label).upper()]) - gt_label.extend([self.label2id_map[("i-" + label).upper()]] * (len(encode_res["input_ids"]) - 1)) - if self.contains_re: - if gt_label[0] != self.label2id_map["O"]: - entity_id_to_index_map[info["id"]] = 
len(entities) - entities.append( - { - "start": len(input_ids_list), - "end": len(input_ids_list) + len(encode_res["input_ids"]), - "label": label.upper(), - } - ) - input_ids_list.extend(encode_res["input_ids"]) - token_type_ids_list.extend(encode_res["token_type_ids"]) - bbox_list.extend([bbox] * len(encode_res["input_ids"])) - gt_label_list.extend(gt_label) - words_list.append(text) - - encoded_inputs = { - "input_ids": input_ids_list, - "labels": gt_label_list, - "token_type_ids": token_type_ids_list, - "bbox": bbox_list, - "attention_mask": [1] * len(input_ids_list), - # "words_list": words_list, - } - encoded_inputs = self.pad_sentences( - encoded_inputs, max_seq_len=self.max_seq_len, return_attention_mask=self.return_attention_mask - ) - encoded_inputs = self.truncate_inputs(encoded_inputs) - - if self.contains_re: - relations = self._relations(entities, relations, id2label, empty_entity, entity_id_to_index_map) - encoded_inputs["relations"] = relations - encoded_inputs["entities"] = entities - return encoded_inputs - - def _chunk_ser(self, encoded_inputs): - encoded_inputs_all = [] - seq_len = len(encoded_inputs["input_ids"]) - chunk_size = 512 - for chunk_id, index in enumerate(range(0, seq_len, chunk_size)): - chunk_beg = index - chunk_end = min(index + chunk_size, seq_len) - encoded_inputs_example = {} - for key in encoded_inputs: - encoded_inputs_example[key] = encoded_inputs[key][chunk_beg:chunk_end] - - encoded_inputs_all.append(encoded_inputs_example) - return encoded_inputs_all - - def _chunk_re(self, encoded_inputs): - # prepare data - entities = encoded_inputs.pop("entities") - relations = encoded_inputs.pop("relations") - encoded_inputs_all = [] - chunk_size = 512 - for chunk_id, index in enumerate(range(0, len(encoded_inputs["input_ids"]), chunk_size)): - item = {} - for k in encoded_inputs: - item[k] = encoded_inputs[k][index : index + chunk_size] - - # select entity in current chunk - entities_in_this_span = [] - global_to_local_map = {} # - for entity_id, entity in enumerate(entities): - if index <= entity["start"] < index + chunk_size and index <= entity["end"] < index + chunk_size: - entity["start"] = entity["start"] - index - entity["end"] = entity["end"] - index - global_to_local_map[entity_id] = len(entities_in_this_span) - entities_in_this_span.append(entity) - - # select relations in current chunk - relations_in_this_span = [] - for relation in relations: - if ( - index <= relation["start_index"] < index + chunk_size - and index <= relation["end_index"] < index + chunk_size - ): - relations_in_this_span.append( - { - "head": global_to_local_map[relation["head"]], - "tail": global_to_local_map[relation["tail"]], - "start_index": relation["start_index"] - index, - "end_index": relation["end_index"] - index, - } - ) - item.update( - { - "entities": reformat(entities_in_this_span), - "relations": reformat(relations_in_this_span), - } - ) - item["entities"]["label"] = [self.entities_labels[x] for x in item["entities"]["label"]] - encoded_inputs_all.append(item) - return encoded_inputs_all - - def _relations(self, entities, relations, id2label, empty_entity, entity_id_to_index_map): - """ - build relations - """ - relations = list(set(relations)) - relations = [rel for rel in relations if rel[0] not in empty_entity and rel[1] not in empty_entity] - kv_relations = [] - for rel in relations: - pair = [id2label[rel[0]], id2label[rel[1]]] - if pair == ["question", "answer"]: - kv_relations.append({"head": entity_id_to_index_map[rel[0]], "tail": 
entity_id_to_index_map[rel[1]]}) - elif pair == ["answer", "question"]: - kv_relations.append({"head": entity_id_to_index_map[rel[1]], "tail": entity_id_to_index_map[rel[0]]}) - else: - continue - relations = sorted( - [ - { - "head": rel["head"], - "tail": rel["tail"], - "start_index": get_relation_span(rel, entities)[0], - "end_index": get_relation_span(rel, entities)[1], - } - for rel in kv_relations - ], - key=lambda x: x["head"], - ) - return relations - - def load_img(self, image_path): - # read img - img = cv2.imread(image_path) - img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB) - resize_h, resize_w = self.img_size - im_shape = img.shape[0:2] - im_scale_y = resize_h / im_shape[0] - im_scale_x = resize_w / im_shape[1] - img_new = cv2.resize(img, None, None, fx=im_scale_x, fy=im_scale_y, interpolation=2) - mean = np.array([0.485, 0.456, 0.406])[np.newaxis, np.newaxis, :] - std = np.array([0.229, 0.224, 0.225])[np.newaxis, np.newaxis, :] - img_new = img_new / 255.0 - img_new -= mean - img_new /= std - img = img_new.transpose((2, 0, 1)) - return img - - def __getitem__(self, idx): - if self.load_mode == "all": - data = copy.deepcopy(self.encoded_inputs_all[idx]) - else: - data = self._parse_label_file(self.all_lines[idx])[0] - - image_path = data.pop("image_path") - data["image"] = self.load_img(image_path) - - return_data = {} - for k, v in data.items(): - if k in self.return_keys: - if self.return_keys[k] == "np": - v = np.array(v) - return_data[k] = v - return return_data - - def __len__( - self, - ): - if self.load_mode == "all": - return len(self.encoded_inputs_all) - else: - return len(self.all_lines) - - -def get_relation_span(rel, entities): - bound = [] - for entity_index in [rel["head"], rel["tail"]]: - bound.append(entities[entity_index]["start"]) - bound.append(entities[entity_index]["end"]) - return min(bound), max(bound) - - -def reformat(data): - new_data = {} - for item in data: - for k, v in item.items(): - if k not in new_data: - new_data[k] = [] - new_data[k].append(v) - return new_data diff --git a/examples/multimodal/minigpt4/README.md b/examples/multimodal/minigpt4/README.md deleted file mode 100644 index 48c9f7384076..000000000000 --- a/examples/multimodal/minigpt4/README.md +++ /dev/null @@ -1,47 +0,0 @@ -# MiniGPT4 - -## 1. 模型简介 - -MiniGPT4 是一个具有图像理解能力的开源模型,其基于 Vicuna 大语言模型 以及 BLIP-2 中的VIT和Qformer模块进行训练,使得MiniGPT4 拥有类似于GPT4的非凡能力,例如详细的图像描述生成和从手写草稿创建网站。 此外 MiniGPT4 还具备一些的其他新的功能,包括根据给定图像写故事和诗歌,为图像中显示的问题提供解决方案,教用户如何根据食物照片做饭等。下图展示了MiniGPT4的模型结构, 更多信息请参考[MiniGPT4](https://arxiv.org/abs/2304.10592)。 - -
-
-
-## 2. 获取 MiniGPT4 权重以及相关配置
-这里可以分两步:1. 获取 MiniGPT4 权重;2. 获取相关配置,包括模型参数说明以及 tokenizer 相关文件等。
-### 2.1 获取 MiniGPT4 权重
-目前需要用户手动下载 MiniGPT4 权重并转换为相应的 Paddle 版权重,为方便转换,本项目提供了相应的操作说明和转换脚本,详情请参考[MiniGPT4 权重下载和转换说明](./paddle_minigpt4_instrction.md)。
-
-### 2.2 获取相关配置
-下载相关的配置文件,这里提供了两版配置文件,请根据你的需要,点击下载即可。
-| files Aligned with MiniGPT4-7B | files Aligned with MiniGPT4-13B |
-:-------------------------------------:|:-----------------------------------:
- [Download](https://paddlenlp.bj.bcebos.com/models/community/minigpt4-7b/minigpt4_7b.tar.gz)|[Download](https://paddlenlp.bj.bcebos.com/models/community/minigpt4-13b/minigpt4_13b.tar.gz) |
-
-
-下载之后进行解压,请将其中相关文件放至与 MiniGPT4 权重相同的目录中。
-
-
-## 3. 模型预测
-在下载和转换好上述模型权重之后,可执行以下命令进行模型预测。其中参数 `pretrained_name_or_path` 用于指定 MiniGPT4 的保存目录。
-
-```
-python run_predict.py \
-    --pretrained_name_or_path "your minigpt4 path"
-```
-
-下面这个示例展示了使用 MiniGPT4-7B 时的效果:
-
-输入图片:
-
-
-输入文本:“describe this image”
-
-输出:
-```
-The image shows two mugs with cats on them, one is black and white and the other is blue and white. The mugs are sitting on a table with a book in the background. The mugs have a whimsical, cartoon-like appearance. The cats on the mugs are looking at each other with a playful expression. The overall mood of the image is lighthearted and fun.###
-```
-
-
-## Reference
-- [MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models](https://minigpt-4.github.io/)
diff --git a/examples/multimodal/minigpt4/merge_weight.py b/examples/multimodal/minigpt4/merge_weight.py
deleted file mode 100644
index 8f74d7c6a960..000000000000
--- a/examples/multimodal/minigpt4/merge_weight.py
+++ /dev/null
@@ -1,88 +0,0 @@
-# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import argparse
-import os
-
-os.environ["CUDA_VISIBLE_DEVICES"] = "0"
-os.environ["FLAGS_use_cuda_managed_memory"] = "true"
-
-import paddle
-import torch
-
-from paddlenlp.transformers import LlamaForCausalLM
-
-
-def merge(args):
-    model_dict = {}
-    # load the first item: blip2-flan-t5-xxl
-    state_dict = paddle.load(args.blip2_path)
-    for n, p in state_dict.items():
-        if n.startswith("vision_model") or n.startswith("qformer") or n == "query_tokens":
-            model_dict[n] = p
-    print("[1/3] load ViT, qformer and query_tokens from blip2-flan-t5-xxl done!")
-
-    # load the second item: vicuna
-    llama_model = LlamaForCausalLM.from_pretrained(args.vicuna_path)
-
-    for n, p in llama_model.named_parameters():
-        new_name = "language_model." + n
-        model_dict[new_name] = p
-    print("[2/3] load vicuna (llama type) done!")
-
-    # load the third item: minigpt4
-    minigpt4_state_dict = torch.load(args.minigpt4_path)
-    for n, p in minigpt4_state_dict["model"].items():
-        if n.startswith("llama_model.model"):
-            new_name = n.replace("llama_model.model", "language_model.llama")
-            new_p = paddle.to_tensor(p.cpu().numpy())
-            model_dict[new_name] = new_p
-
-        if n.startswith("llama_proj"):
-            new_name = n.replace("llama_proj", "language_projection")
-            if n.endswith("weight"):
-                new_p = paddle.to_tensor(p.cpu().numpy()).transpose([1, 0])
-            else:
-                new_p = paddle.to_tensor(p.cpu().numpy())
-            model_dict[new_name] = new_p
-
-    print("[3/3] load language_projection, some llama weights from minigpt4 done!")
-
-    save_path = os.path.join(args.save_path, "model_state.pdparams")
-    paddle.save(model_dict, save_path)
-    print("The checkpoint of minigpt4 has been saved to: {}".format(save_path))
-
-
-if __name__ == "__main__":
-    parser = argparse.ArgumentParser()
-
-    parser.add_argument("--blip2_path", default="/blip2/dirname", type=str, help="The dir name of blip2-flan-t5-xxl.")
-    parser.add_argument("--vicuna_path", default="/vicuna/dirname", type=str, help="The dir name of vicuna.")
-    parser.add_argument(
-        "--minigpt4_path", default="/minigpt4/prerained_minigpt4.pth", type=str, help="The checkpoint path of minigpt4."
-    )
-    parser.add_argument("--save_path", default="/save/to/dirname", type=str, help="The saving path of minigpt4.")
-    args = parser.parse_args()
-
-    args.blip2_path = os.path.join(args.blip2_path, "model_state.pdparams")
-    if not os.path.exists(args.blip2_path):
-        raise ValueError("File not found: {}".format(args.blip2_path))
-    if not os.path.isdir(args.vicuna_path):
-        raise ValueError("It is not a directory: {}".format(args.vicuna_path))
-    if not os.path.exists(args.minigpt4_path):
-        raise ValueError("File not found: {}".format(args.minigpt4_path))
-    if not os.path.exists(args.save_path):
-        os.makedirs(args.save_path)
-
-    merge(args)
diff --git a/examples/multimodal/minigpt4/paddle_minigpt4_instrction.md b/examples/multimodal/minigpt4/paddle_minigpt4_instrction.md
deleted file mode 100644
index 7b84aea48bd7..000000000000
--- a/examples/multimodal/minigpt4/paddle_minigpt4_instrction.md
+++ /dev/null
@@ -1,117 +0,0 @@
-# 获取和转换 Paddle 版 MiniGPT4 权重
-
-## 1. 准备 MiniGPT4 中所有模块的权重
-
-你需要下载 3 份权重,以获取最终的 MiniGPT4 权重,分别是:
-- Pretrained MiniGPT-4
-- Vicuna Weight
-- Blip2 Weight
-
-### 1.1 下载 MiniGPT4 的预训练权重
-
-根据你准备的 Vicuna 模型版本,下载预训练的 MiniGPT4 权重。
-
-| Checkpoint Aligned with Vicuna 7B | Checkpoint Aligned with Vicuna 13B |
-:-------------------------------------:|:-----------------------------------:
-[Download](https://drive.google.com/file/d/1RY9jV0dyqLX-o38LrumkKRh6Jtaop58R/view?usp=sharing) | [Download](https://drive.google.com/file/d/1a4zLvaiDBr-36pasffmgpvH5P7CKmpze/view?usp=share_link)
-
-### 1.2 准备 ViT and Qformer 权重
-MiniGPT4 中使用的 ViT 和 Qformer 权重来自 blip2-flan-t5-xxl,该权重已在 PaddleNLP 中完成转换,可以直接从 PaddleNLP 获取,有两种下载方式:
-
-#### 1.2.1 通过 paddlenlp 方式加载
-直接通过 paddlenlp 的模型加载方法进行下载,下载后一般会存入 `PPNLP_HOME` 指定的目录。
-
-```python
-import os
-os.environ["CUDA_VISIBLE_DEVICES"]="0"
-
-import paddle
-from paddlenlp.transformers import Blip2Model, Blip2VisionModel, Blip2VisionConfig, Blip2QFormerConfig, Blip2QFormerModel
-
-Blip2Model.from_pretrained("Salesforce/blip2-flan-t5-xxl")
-```
-
-#### 1.2.2 直接点击下载
-也可以点击下表中的链接直接下载:
-
-| blip2-flan-t5-xxl 权重 | 点击下载 |
-:-------------------------------------:|:-----------------------------------:
-| model_state.pdparams | [Download](https://paddlenlp.bj.bcebos.com/models/community/Salesforce/blip2-flan-t5-xxl/model_state.pdparams) |
-
-### 1.3 准备 Vicuna 权重
-
-这里需要下载两份权重:Vicuna delta 权重和 huggingface 格式的 Llama 权重,然后将两者合并,得到可以使用的 Vicuna 权重。
-
-#### 1.3.1 下载 Vicuna delta 权重
-
-这里展示两种 Vicuna delta 权重,请根据需要选择一种并点击下载。
-
-| vicuna-7b-delta-v0 | vicuna-13b-delta-v0 |
-:-------------------------------------:|:-----------------------------------:
- [Download](https://huggingface.co/lmsys/vicuna-7b-delta-v0/tree/main) | [Download](https://huggingface.co/lmsys/vicuna-13b-delta-v0)
-
-#### 1.3.2 根据以上选择的 vicuna delta 权重,下载相应的 llama 权重
-
-| llama-7b | llama-13b |
-:-------------------------------------:|:-----------------------------------:
- [Download](https://huggingface.co/decapoda-research/llama-7b-hf/tree/main) | [Download](https://huggingface.co/decapoda-research/llama-13b-hf)
-
-
-#### 1.3.3 结合上面的两个权重,得到可以使用的 vicuna 权重
-- 为合并上述两个权重,请先安装以下工具:
-
-```shell
-pip install git+https://github.com/lm-sys/FastChat.git@v0.1.10
-```
-- 运行以下命令,获取最终可用的 vicuna 权重:
-
-```shell
-python -m fastchat.model.apply_delta --base /path/to/llama-13bOR7b-hf/ --target /path/to/save/working/vicuna-13b/weight/ --delta /path/to/vicuna-13bOR7b-delta-v0/
-```
-
-## 2. 将多个 pytorch 子权重文件合并为一个权重文件
-
-Pytorch 版的权重文件可能由多个子权重文件组合而成,为使用 PaddleNLP 进行加载并自动转换为 Paddle 版,需要将其合并为一个文件:
-
-### 2.1 下载 MiniGPT-4 库
-在开始之前,请确保已经下载了 [MiniGPT4](https://github.com/Vision-CAIR/MiniGPT-4.git) 库:
-
-```
-git clone https://github.com/Vision-CAIR/MiniGPT-4.git
-```
-
-### 2.2 获取完整的 vicuna 权重
-进入 MiniGPT-4 文件夹,执行以下代码,获取完整的 vicuna 权重文件:
-```python
-import argparse
-import os
-os.environ["CUDA_VISIBLE_DEVICES"]="0"
-os.environ["FLAGS_use_cuda_managed_memory"]="true"
-
-import torch
-from minigpt4.models.modeling_llama import LlamaForCausalLM
-
-llama_model = LlamaForCausalLM.from_pretrained("/path/to/save/working/vicuna-13b/")
-torch.save(llama_model.state_dict(), "/path/to/save/working/vicuna-13b/pytorch_model.bin")
-```
-
-## 3. 合并以上所有权重,获取最终的 Paddle 版 MiniGPT4 权重
-这里提供了一个合并以上权重的脚本,你可以通过设置相关权重路径,获取最终的 MiniGPT4 权重。
-
-```shell
-python merge_weight.py \
-    --blip2_path "your dir name of blip2" \
-    --vicuna_path "your dir name of vicuna" \
-    --minigpt4_path "your ckpt path of minigpt4" \
-    --save_path "your dir name saving the final minigpt4"
-```
-
-**参数说明**:
-- `blip2_path`: 存放 blip2 权重的目录名
-- `vicuna_path`: 存放 vicuna 权重的目录名
-- `minigpt4_path`: 存放 minigpt4 预训练权重的文件地址,比如 ./prerained_minigpt4_7b.pth
-- `save_path`: 保存 Paddle 版 MiniGPT4 权重的目录名
-
-## 4. More Reference
-
-- [MiniGPT Official Site](https://github.com/Vision-CAIR/MiniGPT-4)
diff --git a/examples/multimodal/minigpt4/run_predict.py b/examples/multimodal/minigpt4/run_predict.py
deleted file mode 100644
index 4b36089f3c91..000000000000
--- a/examples/multimodal/minigpt4/run_predict.py
+++ /dev/null
@@ -1,68 +0,0 @@
-# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import argparse
-import os
-
-os.environ["CUDA_VISIBLE_DEVICES"] = "0"
-os.environ["FLAGS_use_cuda_managed_memory"] = "true"
-import requests
-from PIL import Image
-
-from paddlenlp.transformers import MiniGPT4ForConditionalGeneration, MiniGPT4Processor
-
-
-def predict(args):
-    # load MiniGPT4 model and processor
-    model = MiniGPT4ForConditionalGeneration.from_pretrained(args.pretrained_name_or_path)
-    model.eval()
-    processor = MiniGPT4Processor.from_pretrained(args.pretrained_name_or_path)
-    print("load processor and model done!")
-
-    # prepare model inputs for MiniGPT4
-    url = "https://paddlenlp.bj.bcebos.com/data/images/mugs.png"
-    image = Image.open(requests.get(url, stream=True).raw)
-
-    text = "describe this image"
-    prompt = "Give the following image: ImageContent. You will be able to see the image once I provide it to you. Please answer my questions.###Human: ###Assistant:"
-    inputs = processor([image], text, prompt)
-
-    # generate with MiniGPT4
-    generate_kwargs = {
-        "max_length": 300,
-        "num_beams": 1,
-        "top_p": 1.0,
-        "repetition_penalty": 1.0,
-        "length_penalty": 0,
-        "temperature": 1,
-        "decode_strategy": "greedy_search",
-        "eos_token_id": [[835], [2277, 29937]],
-    }
-    outputs = model.generate(**inputs, **generate_kwargs)
-    msg = processor.batch_decode(outputs[0])
-    print("Inference result: ", msg)
-
-
-if __name__ == "__main__":
-    parser = argparse.ArgumentParser()
-    parser.add_argument(
-        "--pretrained_name_or_path",
-        default="your directory of minigpt4",
-        type=str,
-        help="The dir name of minigpt4 checkpoint.",
-    )
-    args = parser.parse_args()
-
-    predict(args)
diff --git a/examples/sentiment_analysis/textcnn/README.md b/examples/sentiment_analysis/textcnn/README.md
deleted file mode 100644
index d4cd1599322a..000000000000
--- a/examples/sentiment_analysis/textcnn/README.md
+++ /dev/null
@@ -1,192 +0,0 @@
-# 使用TextCNN模型完成中文对话情绪识别任务
-
-情感分析旨在自动识别和提取文本中的倾向、立场、评价、观点等主观信息。对话情绪识别是情感分析的任务之一:针对智能对话中的用户文本,自动判断该文本的情绪类别并给出相应的置信度,情绪类型分为积极(positive)、消极(negative)和中性(neutral)。
-
-本示例展示了如何用TextCNN预训练模型在机器人聊天数据集上进行Finetune,完成中文对话情绪识别任务。
-
-## 快速开始
-
-### 代码结构说明
-
-以下是本项目主要代码结构及说明:
-
-```text
-textcnn/
-├── deploy # 部署
-│   └── python
-│       └── predict.py # python预测部署示例
-├── data.py # 数据处理脚本
-├── export_model.py # 动态图参数导出静态图参数脚本
-├── model.py # 模型组网脚本
-├── predict.py # 模型预测脚本
-├── README.md # 文档说明
-└── train.py # 对话情绪识别任务训练脚本
-```
-
-### 数据准备
-
-这里我们提供一份已标注的机器人聊天数据集,包括训练集(train.tsv),开发集(dev.tsv)和测试集(test.tsv)。
-完整数据集可以通过以下命令下载并解压:
-
-```shell
-wget https://bj.bcebos.com/paddlenlp/datasets/RobotChat.tar.gz
-tar xvf RobotChat.tar.gz
-```
-
-### 词表下载
-
-在模型训练之前,需要先下载词汇表文件 robot_chat_word_dict.txt,用于构造词-id映射关系。
-
-```shell
-wget https://bj.bcebos.com/paddlenlp/robot_chat_word_dict.txt
-```
-
-**NOTE:** 词表的选择和实际应用数据相关,需根据实际数据选择词表。
-
-### 预训练模型下载
-
-这里我们提供了一个百度基于海量数据训练好的TextCNN模型,用户可通过以下方式下载预训练模型。
-
-```shell
-wget https://bj.bcebos.com/paddlenlp/models/textcnn.pdparams
-```
-
-### 模型训练
-
-在下载好词表和预训练模型后就可以在机器人聊天数据集上进行finetune,通过运行以下命令,在训练集(train.tsv)上进行模型训练,并在开发集(dev.tsv)验证,这里通过`--init_from_ckpt=./textcnn.pdparams`指定TextCNN预训练模型。
-
-CPU 启动:
-
-```shell
-python train.py --vocab_path=./robot_chat_word_dict.txt \
-    --init_from_ckpt=./textcnn.pdparams \
-    --device=cpu \
-    --lr=5e-5 \
-    --batch_size=64 \
-    --epochs=10 \
-    --save_dir=./checkpoints \
-    --data_path=./RobotChat
-```
-
-GPU 启动:

-```shell
-unset CUDA_VISIBLE_DEVICES
-python -m paddle.distributed.launch --gpus "0" train.py \
-    --vocab_path=./robot_chat_word_dict.txt \
-    --init_from_ckpt=./textcnn.pdparams \
-    --device=gpu \
-    --lr=5e-5 \
-    --batch_size=64 \
-    --epochs=10 \
-    --save_dir=./checkpoints \
-    --data_path=./RobotChat
-```
-
-XPU启动:
-
-```shell
-python train.py --vocab_path=./robot_chat_word_dict.txt \
-    --init_from_ckpt=./textcnn.pdparams \
-    --device=xpu \
-    --lr=5e-5 \
-    --batch_size=64 \
-    --epochs=10 \
-    --save_dir=./checkpoints \
-    --data_path=./RobotChat
-```
-
-以上参数表示:
-
-* `vocab_path`: 词汇表文件路径。
-* `init_from_ckpt`: 恢复模型训练的断点路径。
-* `device`: 选用什么设备进行训练,可选cpu、gpu或xpu。如使用gpu训练则参数gpus指定GPU卡号。
-* `lr`: 学习率, 默认为5e-5。
-* `batch_size`: 运行一个batch大小,默认为64。
-* `epochs`: 训练轮次,默认为10。
-* `save_dir`: 训练保存模型的文件路径。
-* `data_path`: 数据集文件路径。
-
-
-程序运行时将会自动进行训练,评估,测试。同时训练过程中会自动保存模型在指定的`save_dir`中。
-如:
-```text
-checkpoints/
-├── 0.pdopt
-├── 0.pdparams
-├── 1.pdopt
-├── 1.pdparams
-├── ...
-└── final.pdparams -``` - -**NOTE:** - -* 如需恢复模型训练,则init_from_ckpt只需指定到文件名即可,不需要添加文件尾缀。如`--init_from_ckpt=checkpoints/0`即可,程序会自动加载模型参数`checkpoints/0.pdparams`,也会自动加载优化器状态`checkpoints/0.pdopt`。 -* 使用动态图训练结束之后,还可以将动态图参数导出成静态图参数,具体代码见export_model.py。静态图参数保存在`output_path`指定路径中。 - 运行方式: - -```shell -python export_model.py --vocab_path=./robot_chat_word_dict.txt --params_path=./checkpoints/final.pdparams --output_path=./static_graph_params -``` - -其中`params_path`是指动态图训练保存的参数路径,`output_path`是指静态图参数导出路径。 - -导出模型之后,可以用于部署,deploy/python/predict.py文件提供了python部署预测示例。运行方式: - -```shell -python deploy/python/predict.py --model_file=static_graph_params.pdmodel --params_file=static_graph_params.pdiparams -``` - -### 模型预测 - -启动预测: - -CPU启动: - -```shell -python predict.py --vocab_path=./robot_chat_word_dict.txt \ - --device=cpu \ - --params_path=./checkpoints/final.pdparams -``` - -GPU启动: - -```shell -export CUDA_VISIBLE_DEVICES=0 -python predict.py --vocab_path=./robot_chat_word_dict.txt \ - --device=gpu \ - --params_path=./checkpoints/final.pdparams -``` - -XPU启动: - -```shell -python predict.py --vocab_path=./robot_chat_word_dict.txt \ - --device=xpu \ - --params_path=./checkpoints/final.pdparams -``` - -待预测数据如以下示例: - -```text -你再骂我我真的不跟你聊了 -你看看我附近有什么好吃的 -我喜欢画画也喜欢唱歌 -``` - -经过`preprocess_prediction_data`函数处理后,调用`predict`函数即可输出预测结果。 - -如 - -```text -Data: 你再骂我我真的不跟你聊了 Label: negative -Data: 你看看我附近有什么好吃的 Label: neutral -Data: 我喜欢画画也喜欢唱歌 Label: positive -``` - -## Reference - -TextCNN参考论文: - -- [EMNLP2014-Convolutional Neural Networks for Sentence Classification](https://aclanthology.org/D14-1181.pdf) diff --git a/examples/sentiment_analysis/textcnn/data.py b/examples/sentiment_analysis/textcnn/data.py deleted file mode 100644 index 3426f065afec..000000000000 --- a/examples/sentiment_analysis/textcnn/data.py +++ /dev/null @@ -1,93 +0,0 @@ -# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import numpy as np -import paddle - - -def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): - """ - Create dataloader. - - Args: - dataset(obj:`paddle.io.Dataset`): Dataset instance. - mode(obj:`str`, optional, defaults to obj:`train`): If mode is 'train', it will shuffle the dataset randomly. - batch_size(obj:`int`, optional, defaults to 1): The sample number of a mini-batch. - batchify_fn(obj:`callable`, optional, defaults to `None`): function to generate mini-batch data by merging - the sample list, None for only stack each fields of sample in axis - 0(same as :attr::`np.stack(..., axis=0)`). - trans_fn(obj:`callable`, optional, defaults to `None`): function to convert a data sample to input ids, etc. - - Returns: - dataloader(obj:`paddle.io.DataLoader`): The dataloader which generates batches. 
-    """
-    if trans_fn:
-        dataset = dataset.map(trans_fn)
-
-    shuffle = True if mode == "train" else False
-    if mode == "train":
-        sampler = paddle.io.DistributedBatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle)
-    else:
-        sampler = paddle.io.BatchSampler(dataset=dataset, batch_size=batch_size, shuffle=shuffle)
-    dataloader = paddle.io.DataLoader(dataset, batch_sampler=sampler, collate_fn=batchify_fn)
-    return dataloader
-
-
-def preprocess_prediction_data(data, tokenizer, pad_token_id=0, max_ngram_filter_size=3):
-    """
-    Processes the prediction data into the format used in training.
-
-    Args:
-        data (obj:`list[str]`): The prediction data, each element of which is a raw text string.
-        tokenizer(obj: paddlenlp.data.JiebaTokenizer): It uses jieba to cut the Chinese string into words.
-        pad_token_id(obj:`int`, optional, defaults to 0): The pad token index.
-        max_ngram_filter_size (obj:`int`, optional, defaults to 3): Max n-gram size in the TextCNN model.
-            Users should refer to the ngram_filter_sizes setting in TextCNN: if ngram_filter_sizes=(1, 2, 3),
-            then max_ngram_filter_size=3.
-
-    Returns:
-        examples (obj:`list`): The processed data, each element of which
-            is a `list` object containing
-
-            - word_ids(obj:`list[int]`): The list of word ids.
-    """
-    examples = []
-    for text in data:
-        ids = tokenizer.encode(text)
-        seq_len = len(ids)
-        # The sequence length should be larger than or equal to the maximum ngram_filter_size in the TextCNN model.
-        if seq_len < max_ngram_filter_size:
-            ids.extend([pad_token_id] * (max_ngram_filter_size - seq_len))
-        examples.append(ids)
-    return examples
-
-
-def convert_example(example, tokenizer):
-    """Converts an example's text to word ids and its label to an int64 array."""
-    input_ids = tokenizer.encode(example["text"])
-    input_ids = np.array(input_ids, dtype="int64")
-
-    label = np.array(example["label"], dtype="int64")
-    return input_ids, label
-
-
-def read_custom_data(filename):
-    """Reads data."""
-    with open(filename, "r", encoding="utf-8") as f:
-        # Skip the header line.
-        next(f)
-        for line in f:
-            data = line.strip().split("\t")
-            label, text = data
-            yield {"text": text, "label": label}
diff --git a/examples/sentiment_analysis/textcnn/deploy/python/predict.py b/examples/sentiment_analysis/textcnn/deploy/python/predict.py
deleted file mode 100644
index 35d6e15ecf2c..000000000000
--- a/examples/sentiment_analysis/textcnn/deploy/python/predict.py
+++ /dev/null
@@ -1,141 +0,0 @@
-# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
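-
-# Deployment example: loads the static-graph model exported by export_model.py and runs
-# it with the Paddle Inference predictor (see the deployment notes in the README above).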
- -import argparse - -import numpy as np -import paddle -import paddle.nn.functional as F - -from paddlenlp.data import JiebaTokenizer, Pad, Vocab - -parser = argparse.ArgumentParser() -parser.add_argument( - "--model_file", - type=str, - required=True, - default="./static_graph_params.pdmodel", - help="The path to model info in static graph.", -) -parser.add_argument( - "--params_file", - type=str, - required=True, - default="./static_graph_params.pdiparams", - help="The path to parameters in static graph.", -) -parser.add_argument("--vocab_path", type=str, default="./robot_chat_word_dict.txt", help="The path to vocabulary.") -parser.add_argument( - "--max_seq_length", - default=128, - type=int, - help="The maximum total input sequence length after tokenization. " - "Sequences longer than this will be truncated, sequences shorter will be padded.", -) -parser.add_argument("--batch_size", default=2, type=int, help="Batch size per GPU/CPU for training.") -parser.add_argument( - "--device", - choices=["cpu", "gpu", "xpu"], - default="gpu", - help="Select which device to train model, defaults to gpu.", -) -args = parser.parse_args() - - -def convert_example(data, tokenizer, pad_token_id=0, max_ngram_filter_size=3): - """convert_example""" - input_ids = tokenizer.encode(data) - seq_len = len(input_ids) - # Sequence length should larger or equal than the maximum ngram_filter_size in TextCNN model - if seq_len < max_ngram_filter_size: - input_ids.extend([pad_token_id] * (max_ngram_filter_size - seq_len)) - input_ids = np.array(input_ids, dtype="int64") - return input_ids - - -class Predictor(object): - def __init__(self, model_file, params_file, device, max_seq_length): - self.max_seq_length = max_seq_length - - config = paddle.inference.Config(model_file, params_file) - if device == "gpu": - # set GPU configs accordingly - config.enable_use_gpu(100, 0) - elif device == "cpu": - # set CPU configs accordingly, - # such as enable_mkldnn, set_cpu_math_library_num_threads - config.disable_gpu() - elif device == "xpu": - # set XPU configs accordingly - config.enable_xpu(100) - config.switch_use_feed_fetch_ops(False) - self.predictor = paddle.inference.create_predictor(config) - - self.input_handles = [self.predictor.get_input_handle(name) for name in self.predictor.get_input_names()] - - self.output_handle = self.predictor.get_output_handle(self.predictor.get_output_names()[0]) - - def predict(self, data, tokenizer, label_map, batch_size=1, pad_token_id=0): - """ - Predicts the data labels. - - Args: - data (obj:`list(str)`): Data to be predicted. - tokenizer(obj: paddlenlp.data.JiebaTokenizer): It use jieba to cut the chinese string. - label_map(obj:`dict`): The label id (key) to label str (value) map. - batch_size(obj:`int`, defaults to 1): The number of batch. - pad_token_id(obj:`int`, optional, defaults to 0): The pad token index. - - Returns: - results(obj:`dict`): All the predictions labels. - """ - examples = [] - for text in data: - input_ids = convert_example(text, tokenizer) - examples.append(input_ids) - - # Separates data into some batches. 
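-        # e.g. 5 examples with batch_size=2 yield 3 batches of sizes [2, 2, 1].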
- batches = [examples[idx : idx + batch_size] for idx in range(0, len(examples), batch_size)] - - batchify_fn = lambda samples, fn=Pad(axis=0, pad_val=pad_token_id): fn(samples) # input - - results = [] - for batch in batches: - input_ids = batchify_fn(batch) - self.input_handles[0].copy_from_cpu(input_ids) - self.predictor.run() - logits = paddle.to_tensor(self.output_handle.copy_to_cpu()) - probs = F.softmax(logits, axis=1) - idx = paddle.argmax(probs, axis=1).numpy() - idx = idx.tolist() - labels = [label_map[i] for i in idx] - results.extend(labels) - return results - - -if __name__ == "__main__": - # Define predictor to do prediction. - predictor = Predictor(args.model_file, args.params_file, args.device, args.max_seq_length) - - vocab = Vocab.load_vocabulary(args.vocab_path, unk_token="[UNK]", pad_token="[PAD]") - pad_token_id = vocab.to_indices("[PAD]") - tokenizer = JiebaTokenizer(vocab) - label_map = {0: "negative", 1: "neutral", 2: "positive"} - - # Firstly pre-processing prediction data and then do predict. - data = ["你再骂我我真的不跟你聊了", "你看看我附近有什么好吃的", "我喜欢画画也喜欢唱歌"] - - results = predictor.predict(data, tokenizer, label_map, batch_size=args.batch_size, pad_token_id=pad_token_id) - for idx, text in enumerate(data): - print("Data: {} \t Label: {}".format(text, results[idx])) diff --git a/examples/sentiment_analysis/textcnn/export_model.py b/examples/sentiment_analysis/textcnn/export_model.py deleted file mode 100644 index 0953a4002053..000000000000 --- a/examples/sentiment_analysis/textcnn/export_model.py +++ /dev/null @@ -1,60 +0,0 @@ -# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import argparse -import os - -import paddle -from paddlenlp.data import Vocab -from model import TextCNNModel - -# yapf: disable -parser = argparse.ArgumentParser(__doc__) -parser.add_argument("--vocab_path", type=str, default="./robot_chat_word_dict.txt", help="The path to vocabulary.") -parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu'], default="gpu", help="Select which device to train model, defaults to gpu.") -parser.add_argument("--params_path", type=str, default='./checkpoints/final.pdparams', help="The path of model parameter to be loaded.") -parser.add_argument("--output_path", type=str, default='./static_graph_params', help="The path of model parameter in static graph to be saved.") -args = parser.parse_args() -# yapf: enable - - -def main(): - # Load vocab. - if not os.path.exists(args.vocab_path): - raise RuntimeError("The vocab_path can not be found in the path %s" % args.vocab_path) - - vocab = Vocab.load_vocabulary(args.vocab_path) - label_map = {0: "negative", 1: "neutral", 2: "positive"} - - # Construct the newtork. - vocab_size = len(vocab) - num_classes = len(label_map) - pad_token_id = vocab.to_indices("[PAD]") - - model = TextCNNModel(vocab_size, num_classes, padding_idx=pad_token_id, ngram_filter_sizes=(1, 2, 3)) - - # Load model parameters. 
- state_dict = paddle.load(args.params_path) - model.set_dict(state_dict) - model.eval() - - inputs = [paddle.static.InputSpec(shape=[None, None], dtype="int64")] - - model = paddle.jit.to_static(model, input_spec=inputs) - # Save in static graph model. - paddle.jit.save(model, args.output_path) - - -if __name__ == "__main__": - main() diff --git a/examples/sentiment_analysis/textcnn/model.py b/examples/sentiment_analysis/textcnn/model.py deleted file mode 100644 index 655f1e2b8492..000000000000 --- a/examples/sentiment_analysis/textcnn/model.py +++ /dev/null @@ -1,60 +0,0 @@ -# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import paddle -import paddle.nn as nn - -from paddlenlp.seq2vec import CNNEncoder - - -class TextCNNModel(nn.Layer): - """ - This class implements the Text Convolution Neural Network model. - At a high level, the model starts by embedding the tokens and running them through - a word embedding. Then, we encode these representations with a `CNNEncoder`. - The CNN has one convolution layer for each ngram filter size. Each convolution operation gives - out a vector of size num_filter. The number of times a convolution layer will be used - is `num_tokens - ngram_size + 1`. The corresponding maxpooling layer aggregates all these - outputs from the convolution layer and outputs the max. - Lastly, we take the output of the encoder to create a final representation, - which is passed through some feed-forward layers to output a logits (`output_layer`). - - """ - - def __init__( - self, - vocab_size, - num_classes, - emb_dim=128, - padding_idx=0, - num_filter=128, - ngram_filter_sizes=(1, 2, 3), - fc_hidden_size=96, - ): - super().__init__() - self.embedder = nn.Embedding(vocab_size, emb_dim, padding_idx=padding_idx) - self.encoder = CNNEncoder(emb_dim=emb_dim, num_filter=num_filter, ngram_filter_sizes=ngram_filter_sizes) - self.fc = nn.Linear(self.encoder.get_output_dim(), fc_hidden_size) - self.output_layer = nn.Linear(fc_hidden_size, num_classes) - - def forward(self, text): - # Shape: (batch_size, num_tokens, embedding_dim) - embedded_text = self.embedder(text) - # Shape: (batch_size, len(ngram_filter_sizes) * num_filter) - encoder_out = paddle.tanh(self.encoder(embedded_text)) - # Shape: (batch_size, fc_hidden_size) - fc_out = paddle.tanh(self.fc(encoder_out)) - # Shape: (batch_size, num_classes) - logits = self.output_layer(fc_out) - return logits diff --git a/examples/sentiment_analysis/textcnn/predict.py b/examples/sentiment_analysis/textcnn/predict.py deleted file mode 100644 index ba3dd2958149..000000000000 --- a/examples/sentiment_analysis/textcnn/predict.py +++ /dev/null @@ -1,94 +0,0 @@ -# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License" -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-import argparse
-
-import paddle
-import paddle.nn.functional as F
-from paddlenlp.data import JiebaTokenizer, Pad, Vocab
-
-from model import TextCNNModel
-from data import preprocess_prediction_data
-
-# yapf: disable
-parser = argparse.ArgumentParser(__doc__)
-parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu'], default="gpu", help="Select which device to run the model, defaults to gpu.")
-parser.add_argument("--batch_size", type=int, default=1, help="The number of examples in one batch for prediction.")
-parser.add_argument("--vocab_path", type=str, default="./robot_chat_word_dict.txt", help="The path to vocabulary.")
-parser.add_argument("--params_path", type=str, default='./checkpoints/final.pdparams', help="The path of model parameter to be loaded.")
-args = parser.parse_args()
-# yapf: enable
-
-
-def predict(model, data, label_map, batch_size=1, pad_token_id=0):
-    """
-    Predicts the data labels.
-
-    Args:
-        model (obj:`paddle.nn.Layer`): A model to classify texts.
-        data (obj:`list`): The processed data, each element of which
-            is a `list` object containing
-
-            - word_ids(obj:`list[int]`): The list of word ids.
-        label_map(obj:`dict`): The label id (key) to label str (value) map.
-        batch_size(obj:`int`, defaults to 1): The batch size.
-        pad_token_id(obj:`int`, optional, defaults to 0): The pad token index.
-
-    Returns:
-        results (obj:`list`): The predicted labels.
-    """
-
-    # Separates data into some batches.
-    batches = [data[idx : idx + batch_size] for idx in range(0, len(data), batch_size)]
-    batchify_fn = lambda samples, fn=Pad(axis=0, pad_val=pad_token_id): [data for data in fn(samples)]
-
-    results = []
-    model.eval()
-    for batch in batches:
-        texts = paddle.to_tensor(batchify_fn(batch))
-        logits = model(texts)
-        probs = F.softmax(logits, axis=1)
-        idx = paddle.argmax(probs, axis=1).numpy()
-        idx = idx.tolist()
-        labels = [label_map[i] for i in idx]
-        results.extend(labels)
-    return results
-
-
-if __name__ == "__main__":
-    paddle.set_device(args.device)
-
-    # Load vocab.
-    vocab = Vocab.load_vocabulary(args.vocab_path, unk_token="[UNK]", pad_token="[PAD]")
-    label_map = {0: "negative", 1: "neutral", 2: "positive"}
-
-    # Construct the network.
-    vocab_size = len(vocab)
-    num_classes = len(label_map)
-    pad_token_id = vocab.to_indices("[PAD]")
-
-    model = TextCNNModel(vocab_size, num_classes, padding_idx=pad_token_id, ngram_filter_sizes=(1, 2, 3))
-
-    # Load model parameters.
-    state_dict = paddle.load(args.params_path)
-    model.set_dict(state_dict)
-    print("Loaded parameters from %s" % args.params_path)
-
-    # First pre-process the prediction data, then run prediction.
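-    # preprocess_prediction_data (data.py) jieba-tokenizes each text into word ids and
-    # pads any sequence shorter than the largest n-gram filter size (3) so that the
-    # convolutions in TextCNNModel stay valid.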
- data = ["你再骂我我真的不跟你聊了", "你看看我附近有什么好吃的", "我喜欢画画也喜欢唱歌"] - tokenizer = JiebaTokenizer(vocab) - examples = preprocess_prediction_data(data, tokenizer, pad_token_id) - - results = predict(model, examples, label_map=label_map, batch_size=args.batch_size, pad_token_id=pad_token_id) - for idx, text in enumerate(data): - print("Data: {} \t Label: {}".format(text, results[idx])) diff --git a/examples/sentiment_analysis/textcnn/train.py b/examples/sentiment_analysis/textcnn/train.py deleted file mode 100644 index e80d2180af75..000000000000 --- a/examples/sentiment_analysis/textcnn/train.py +++ /dev/null @@ -1,108 +0,0 @@ -# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License" -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -from functools import partial -import argparse -import os -import random - -import numpy as np -import paddle -from paddlenlp.datasets import load_dataset -from paddlenlp.data import JiebaTokenizer, Pad, Stack, Tuple, Vocab - -from data import create_dataloader, convert_example, read_custom_data -from model import TextCNNModel - -# yapf: disable -parser = argparse.ArgumentParser(__doc__) -parser.add_argument("--epochs", type=int, default=10, help="Number of epoches for training.") -parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu'], default="gpu", help="Select which device to train model, defaults to gpu.") -parser.add_argument("--lr", type=float, default=5e-5, help="Learning rate used to train.") -parser.add_argument("--save_dir", type=str, default='checkpoints/', help="Directory to save model checkpoint") -parser.add_argument("--data_path", type=str, default='./RobotChat', help="The path of datasets to be loaded") -parser.add_argument("--batch_size", type=int, default=64, help="Total examples' number of a batch for training.") -parser.add_argument("--vocab_path", type=str, default="./robot_chat_word_dict.txt", help="The directory to dataset.") -parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") -args = parser.parse_args() -# yapf: enable - - -def set_seed(seed=1000): - """Sets random seed.""" - random.seed(seed) - np.random.seed(seed) - paddle.seed(seed) - - -if __name__ == "__main__": - paddle.set_device(args.device) - set_seed() - - # Load vocab. - if not os.path.exists(args.vocab_path): - raise RuntimeError("The vocab_path can not be found in the path %s" % args.vocab_path) - - vocab = Vocab.load_vocabulary(args.vocab_path, unk_token="[UNK]", pad_token="[PAD]") - - # Load datasets. 
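-    # Each *.tsv file holds a header row followed by `label<TAB>text` lines;
-    # read_custom_data (data.py) skips the header and yields {"text": ..., "label": ...} dicts.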
diff --git a/examples/simultaneous_translation/stacl/utils/__init__.py b/examples/simultaneous_translation/stacl/utils/__init__.py
deleted file mode 100644
index e69de29bb2d1..000000000000
diff --git a/examples/text_graph/erniesage/README.md b/examples/text_graph/erniesage/README.md
deleted file mode 100644
index a25780be0586..000000000000
--- a/examples/text_graph/erniesage/README.md
+++ /dev/null
@@ -1,59 +0,0 @@
-# 基于PaddleNLP的ErnieSage模型介绍
-
-## 背景介绍
-
-在很多工业应用中，往往出现如下图所示的一种特殊的图：Text Graph。顾名思义，图的节点属性由文本构成，而边的构建提供了结构信息。如搜索场景下的Text Graph，节点可由搜索词、网页标题、网页正文来表达，用户反馈和超链信息则可构成边关系。
-Text Graph
-
-**ErnieSage** 由飞桨PGL团队提出，是ERNIE SAmple aggreGatE的简称，该模型可以同时建模文本语义与图结构信息，有效提升 Text Graph 的应用效果。其中 [**ERNIE**](https://github.com/PaddlePaddle/ERNIE) 是百度推出的基于知识增强的持续学习语义理解框架。
-
-**ErnieSage** 是 ERNIE 与 GraphSAGE 碰撞的结果，是 ERNIE SAmple aggreGatE 的简称，它的结构如下图所示，主要思想是通过 ERNIE 作为聚合函数（Aggregators），建模自身节点和邻居节点的语义与结构关系。ErnieSage 对于文本的建模是构建在邻居聚合的阶段，中心节点文本会与所有邻居节点文本进行拼接；然后通过预训练的 ERNIE 模型进行消息汇聚，捕捉中心节点以及邻居节点之间的相互关系；最后使用 ErnieSage 搭配独特的邻居互相看不见的 Attention Mask 和独立的 Position Embedding 体系，就可以轻松构建 TextGraph 中句子之间以及词之间的关系。
-
-ERNIESage
-
-使用ID特征的GraphSAGE只能够建模图的结构信息，而单独的ERNIE只能处理文本信息。通过飞桨PGL搭建的图与文本的桥梁，**ErnieSage**能够很简单地把GraphSAGE以及ERNIE的优点结合在一起。在下面TextGraph的场景中，**ErnieSage**的效果比单独的ERNIE以及GraphSAGE模型都要好。
-
-**ErnieSage**可以很轻松地基于PaddleNLP构建基于ERNIE的图神经网络，目前PaddleNLP提供了V2版本的ErnieSage模型：
-
-- **ErnieSage V2**: ERNIE 作用在text graph的边上；
-
-ERNIESage_v1_4
-
-## 环境依赖
-
-- pgl >= 2.1
-安装命令 `pip install pgl\>=2.1`
-
-## 数据准备
-示例数据```data.txt```中使用了NLPCC2016-DBQA的部分数据，格式为每行"query \t answer"。
-```text
-NLPCC2016-DBQA 是由国际自然语言处理和中文计算会议 NLPCC 于 2016 年举办的评测任务，其目标是从候选中找到合适的文档作为问题的答案。[链接: http://tcci.ccf.org.cn/conference/2016/dldoc/evagline2.pdf]
-```
-
-## 如何运行
-
-我们采用了[PaddlePaddle Fleet](https://github.com/PaddlePaddle/Fleet)作为我们的分布式训练框架，在```config/*.yaml```中，目前支持的[ERNIE](https://github.com/PaddlePaddle/ERNIE)预训练语义模型包括**ernie**以及**ernie_tiny**，通过config/erniesage_link_prediction.yaml中的ernie_name指定。
-
-
-```sh
-# 数据预处理，建图
-python ./preprocessing/dump_graph.py --conf ./config/erniesage_link_prediction.yaml
-# GPU多卡或单卡模式ErnieSage
-python -m paddle.distributed.launch --gpus "0" link_prediction.py --conf ./config/erniesage_link_prediction.yaml
-# 对图节点的embedding进行预测, 单卡或多卡
-python -m paddle.distributed.launch --gpus "0" link_prediction.py --conf ./config/erniesage_link_prediction.yaml --do_predict
-```
-
-## 超参数设置
-
-- epochs: 训练的轮数
-- graph_data: 训练模型时用到的图结构数据，使用“text1 \t text”格式。
-- train_data: 训练时的边，与graph_data格式相同，一般可以直接用graph_data。
-- graph_work_path: 临时存储graph数据中间文件的目录。
-- samples: 采样邻居数
-- model_type: 模型类型，包括ErnieSageV2。
-- ernie_name: 热启模型类型，支持“ernie”和“ernie_tiny”，后者速度更快，指定该参数后会自动从服务器下载预训练模型文件。
-- num_layers: 图神经网络层数。
-- hidden_size: 隐藏层大小。
-- batch_size: 训练时的batch size。
-- infer_batch_size: 预测时batch size。
diff --git a/examples/text_graph/erniesage/config/erniesage_link_prediction.yaml b/examples/text_graph/erniesage/config/erniesage_link_prediction.yaml
deleted file mode 100644
index 970f5de365b7..000000000000
--- a/examples/text_graph/erniesage/config/erniesage_link_prediction.yaml
+++ /dev/null
@@ -1,40 +0,0 @@
-# Global Environment Settings
-
-# trainer config ------
-device: "gpu" # use cpu or gpu devices to train.
-seed: 2020
-
-task: "link_prediction"
-model_name_or_path: "ernie-tiny" # ernie-tiny or ernie-1.0 available
-sample_workers: 1
-optimizer_type: "adam"
-lr: 0.00005
-batch_size: 32
-CPU_NUM: 10
-epoch: 30
-log_per_step: 10
-save_per_step: 200
-output_path: "./output"
-
-# data config ------
-train_data: "./example_data/graph_data.txt"
-graph_data: "./example_data/train_data.txt"
-graph_work_path: "./graph_workdir"
-input_type: "text"
-encoding: "utf8"
-
-# model config ------
-samples: [10]
-model_type: "ErnieSageV2"
-max_seqlen: 40
-num_layers: 1
-hidden_size: 128
-final_fc: true
-final_l2_norm: true
-loss_type: "hinge"
-margin: 0.1
-neg_type: "batch_neg"
-
-# infer config ------
-infer_model: "./output/last"
-infer_batch_size: 128
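Every entry point in the README above is driven by this YAML file through `--conf`. For reference, a minimal sketch of how a script like `link_prediction.py` might load it into an attribute-style object; the `Config` wrapper is hypothetical, and only `yaml.safe_load` is assumed from PyYAML:

```python
import yaml


class Config(object):
    """Hypothetical thin wrapper exposing YAML keys as attributes."""

    def __init__(self, values):
        self.__dict__.update(values)


def load_config(path):
    # safe_load parses the YAML document into a plain Python dict.
    with open(path, "r", encoding="utf8") as f:
        return Config(yaml.safe_load(f))


# config = load_config("./config/erniesage_link_prediction.yaml")
# print(config.model_type, config.samples, config.batch_size)
```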
diff --git a/examples/text_graph/erniesage/data/dataset.py b/examples/text_graph/erniesage/data/dataset.py
deleted file mode 100644
index 2a3733851e63..000000000000
--- a/examples/text_graph/erniesage/data/dataset.py
+++ /dev/null
@@ -1,115 +0,0 @@
-# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import os
-
-import numpy as np
-import paddle
-import pgl
-from paddle.io import Dataset
-from pgl.sampling import graphsage_sample
-
-__all__ = [
-    "TrainData",
-    "PredictData",
-    "batch_fn",
-]
-
-
-class TrainData(Dataset):
-    def __init__(self, graph_work_path):
-        trainer_id = paddle.distributed.get_rank()
-        trainer_count = paddle.distributed.get_world_size()
-        print("trainer_id: %s, trainer_count: %s." % (trainer_id, trainer_count))
-
-        edges = np.load(os.path.join(graph_work_path, "train_data.npy"), allow_pickle=True)
-        # edges is bidirectional.
-        train_src = edges[trainer_id::trainer_count, 0]
-        train_dst = edges[trainer_id::trainer_count, 1]
-        returns = {"train_data": [train_src, train_dst]}
-
-        if os.path.exists(os.path.join(graph_work_path, "neg_samples.npy")):
-            neg_samples = np.load(os.path.join(graph_work_path, "neg_samples.npy"), allow_pickle=True)
-            if neg_samples.size != 0:
-                train_negs = neg_samples[trainer_id::trainer_count]
-                returns["train_data"].append(train_negs)
-        print("Load train_data done.")
-        self.data = returns
-
-    def __getitem__(self, index):
-        return [data[index] for data in self.data["train_data"]]
-
-    def __len__(self):
-        return len(self.data["train_data"][0])
-
-
-class PredictData(Dataset):
-    def __init__(self, num_nodes):
-        trainer_id = paddle.distributed.get_rank()
-        trainer_count = paddle.distributed.get_world_size()
-        self.data = np.arange(trainer_id, num_nodes, trainer_count)
-
-    def __getitem__(self, index):
-        return [self.data[index], self.data[index]]
-
-    def __len__(self):
-        return len(self.data)
-
-
-def batch_fn(batch_ex, samples, base_graph, term_ids):
-    batch_src = []
-    batch_dst = []
-    batch_neg = []
-    for batch in batch_ex:
-        batch_src.append(batch[0])
-        batch_dst.append(batch[1])
-        if len(batch) == 3:  # default neg samples
-            batch_neg.append(batch[2])
-
-    batch_src = np.array(batch_src, dtype="int64")
-    batch_dst = np.array(batch_dst, dtype="int64")
-    if len(batch_neg) > 0:
-        batch_neg = np.unique(np.concatenate(batch_neg))
-    else:
-        batch_neg = batch_dst
-
-    nodes = np.unique(np.concatenate([batch_src, batch_dst, batch_neg], 0))
-    subgraphs = graphsage_sample(base_graph, nodes, samples)
-
-    subgraph, sample_index, node_index = subgraphs[0]
-    from_reindex = {int(x): i for i, x in enumerate(sample_index)}
-
-    term_ids = term_ids[sample_index].astype(np.int64)
-
-    sub_src_idx = pgl.graph_kernel.map_nodes(batch_src, from_reindex)
-    sub_dst_idx = pgl.graph_kernel.map_nodes(batch_dst, from_reindex)
-    sub_neg_idx = pgl.graph_kernel.map_nodes(batch_neg, from_reindex)
-
-    user_index = np.array(sub_src_idx, dtype="int64")
-    pos_item_index = np.array(sub_dst_idx, dtype="int64")
-    neg_item_index = np.array(sub_neg_idx, dtype="int64")
-
-    user_real_index = np.array(batch_src, dtype="int64")
-    pos_item_real_index = np.array(batch_dst, dtype="int64")
-
-    return (
-        np.array([subgraph.num_nodes], dtype="int32"),
-        subgraph.edges.astype("int32"),
-        term_ids,
-        user_index,
-        pos_item_index,
-        neg_item_index,
-        user_real_index,
-        pos_item_real_index,
-    )
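One detail worth calling out in `TrainData` above: it shards the edge array across distributed workers with plain stride slicing (`edges[trainer_id::trainer_count]`). A self-contained toy illustration of that sharding pattern:

```python
import numpy as np

# Toy stand-in for the (num_edges, 2) array loaded from train_data.npy.
edges = np.arange(20).reshape(10, 2)

trainer_count = 4  # paddle.distributed.get_world_size() in the real code
for trainer_id in range(trainer_count):
    # Each worker takes every trainer_count-th edge, starting at its own rank,
    # so the shards are disjoint and together cover all edges.
    shard_src = edges[trainer_id::trainer_count, 0]
    shard_dst = edges[trainer_id::trainer_count, 1]
    print(trainer_id, shard_src.tolist(), shard_dst.tolist())
```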
diff --git a/examples/text_graph/erniesage/data/graph_reader.py b/examples/text_graph/erniesage/data/graph_reader.py
deleted file mode 100755
index ca82d5c78f66..000000000000
--- a/examples/text_graph/erniesage/data/graph_reader.py
+++ /dev/null
@@ -1,59 +0,0 @@
-# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-import pgl
-from paddle.io import DataLoader
-
-__all__ = ["GraphDataLoader"]
-
-
-class GraphDataLoader(object):
-    def __init__(self, dataset, batch_size=1, shuffle=True, num_workers=1, collate_fn=None, **kwargs):
-        self.loader = DataLoader(
-            dataset=dataset,
-            batch_size=batch_size,
-            shuffle=shuffle,
-            num_workers=num_workers,
-            collate_fn=collate_fn,
-            **kwargs,
-        )
-
-    def __iter__(self):
-        func = self.__callback__()
-        for data in self.loader():
-            yield func(data)
-
-    def __call__(self):
-        return self.__iter__()
-
-    def __callback__(self):
-        """Callback function for reconstructing a dict or graph."""
-
-        def construct(tensors):
-            """Turn a tensor list into ([graph_tensor, graph_tensor, ...],
-            other tensors).
-            """
-            graph_num = 1
-            start_len = 0
-            data = []
-            graph_list = []
-            for graph in range(graph_num):
-                graph_list.append(pgl.Graph(num_nodes=tensors[start_len], edges=tensors[start_len + 1]))
-                start_len += 2
-
-            for i in range(start_len, len(tensors)):
-                data.append(tensors[i])
-            return graph_list, data
-
-        return construct
diff --git a/examples/text_graph/erniesage/example_data/graph_data.txt b/examples/text_graph/erniesage/example_data/graph_data.txt
deleted file mode 100644
index e9aead6c89fa..000000000000
--- a/examples/text_graph/erniesage/example_data/graph_data.txt
+++ /dev/null
@@ -1,1000 +0,0 @@
-黑缘粗角肖叶甲触角有多大? 体长卵形，棕红色；鞘翅棕黄或淡棕色，外缘和中缝黑色或黑褐色；触角基部3、4节棕黄，余节棕色。
-黑缘粗角肖叶甲触角有多大? 头部刻点粗大，分布不均匀，头顶刻点十分稀疏；触角基部的内侧有一个三角形光瘤，唇基前缘呈半圆形凹切。
-黑缘粗角肖叶甲触角有多大? 触角近于体长之半，第1节粗大，棒状，第2节短，椭圆形，3、4两节细长，稍短于第5节，第5节基细端粗，末端6节明显粗大。
-黑缘粗角肖叶甲触角有多大? 前胸背板横宽，宽约为长的两倍，侧缘敞出较宽，圆形，敞边与盘区之间有一条细纵沟；盘区刻点相当密，前半部刻点较大于后半部。
-黑缘粗角肖叶甲触角有多大? 小盾片舌形，光亮，末端圆钝。
-黑缘粗角肖叶甲触角有多大? 鞘翅刻点粗大，不规则排列，肩部之后的刻点更为粗大，具皱褶，近中缝的刻点较小，略呈纵行排列。
-黑缘粗角肖叶甲触角有多大? 前胸前侧片前缘直；前胸后侧片具粗大刻点。
-黑缘粗角肖叶甲触角有多大? 足粗壮；胫节具纵脊，外端角向外延伸，呈弯角状；爪具附齿。
-暮光闪闪的姐姐是谁? 暮光闪闪是一匹雌性独角兽，后来在神秘魔法的影响下变成了空角兽(公主)，她是《我的小马驹:友情是魔法》(英文名:My Little Pony:Friendship is Magic)中的主角之一。
-暮光闪闪的姐姐是谁? 她是银甲闪闪(Shining Armor)的妹妹，同时也是韵律公主(Princess Cadance)的小姑子。
-暮光闪闪的姐姐是谁? 在该系列中，她与最好的朋友与助手斯派克(Spike)一起生活在小马镇(Ponyville)的金橡图书馆(Golden Oak Library)，研究友谊的魔法。
-暮光闪闪的姐姐是谁? 在暮光闪闪成为天角兽之前(即S3E13前)，常常给塞拉丝蒂娅公主(Princess Celestia)关于友谊的报告。[1]
-暮光闪闪的姐姐是谁? 《我的小马驹:友谊是魔法》(英文名称:My Little Pony:Friendship is Magic)(简称MLP)
-暮光闪闪的姐姐是谁? 动画讲述了一只名叫做暮光闪闪(Twilight Sparkle)的独角兽(在SE3E13
-暮光闪闪的姐姐是谁? My Little Pony:Friendship is Magic[2]
-暮光闪闪的姐姐是谁? 后成为了天角兽)，执行她的导师塞拉斯蒂娅公主(PrincessCelestia)的任务，在小马镇(Ponyville)学习关于友谊的知识。
-暮光闪闪的姐姐是谁? 她与另外五只小马，苹果杰克(Applejack)、瑞瑞(Rarity)、云宝黛西(Rainbow Dash)、小蝶(Fluttershy)与萍琪派(Pinkie Pie)，成为了最要好的朋友。
-暮光闪闪的姐姐是谁? 每匹小马都分别代表了协律精华的6个元素:诚实，慷慨，忠诚，善良，欢笑，魔法，各自扮演着属于自己的重要角色。
-暮光闪闪的姐姐是谁? 此后，暮光闪闪(Twilight Sparkle)便与她认识的新朋友们开始了有趣的日常生活。
-暮光闪闪的姐姐是谁? 在动画中，随时可见她们在小马镇(Ponyville)的种种冒险、奇遇、日常等等。
-暮光闪闪的姐姐是谁? 同时，也在她们之间的互动和冲突中，寻找着最适合最合理的完美解决方案。
-暮光闪闪的姐姐是谁? “尽管小马国并不太平，六位主角之间也常常有这样那样的问题，但是他们之间的真情对待，使得这个童话世界已经成为不少人心中理想的世外桃源。”
-暮光闪闪的姐姐是谁? 暮光闪闪在剧情刚开始的时候生活在中心城(Canterlot)，后来在夏日
-暮光闪闪的姐姐是谁? 暮光闪闪与斯派克(Spike)
-暮光闪闪的姐姐是谁? 庆典的时候被塞拉丝蒂娅公主派遣到小马镇执行检查夏日庆典的准备工作的任务。
-暮光闪闪的姐姐是谁? 在小马镇交到了朋友(即其余5个主角)，并和她们一起使用协律精华(Elements of harmony)击败了梦魇之月。
-暮光闪闪的姐姐是谁? 并在塞拉丝蒂亚公主的许可下，留在小马镇继续研究友谊的魔法。
-暮光闪闪的姐姐是谁? 暮光闪闪的知识基本来自于书本，并且她相当不相信书本以外的“迷信”，因为这样她在S1E15里吃足了苦头。
-暮光闪闪的姐姐是谁? 
在这之后,她也开始慢慢学会相信一些书本以外的东西。 -暮光闪闪的姐姐是谁? 暮光闪闪热爱学习,并且学习成绩相当好(从她可以立刻算出 -暮光闪闪的姐姐是谁? 的结果可以看 -暮光闪闪的姐姐是谁? 暮光闪闪的原型 -暮光闪闪的姐姐是谁? 出)。 -暮光闪闪的姐姐是谁? 相当敬爱自己的老师塞拉丝蒂亚公主甚至到了精神失常的地步。 -暮光闪闪的姐姐是谁? 在第二季中,曾因为无法交出关于友谊的报告而做出了疯狂的行为,后来被塞拉丝蒂亚公主制止,在这之后,暮光闪闪得到了塞拉丝蒂亚公主“不用定期交友谊报告”的许可。 -暮光闪闪的姐姐是谁? 于是暮光闪闪在后面的剧情中的主角地位越来越得不到明显的体现。 -暮光闪闪的姐姐是谁? 在SE3E13中,因为破解了白胡子星璇留下的神秘魔法而被加冕成为了天角兽(公主),被尊称为“闪闪公主”。 -暮光闪闪的姐姐是谁? 当小星座熊在小马镇引起恐慌的时候,暮光闪闪运用了自身强大的魔法将水库举起后装满牛奶,用牛奶将小星座熊安抚后,连着巨型奶瓶和小星座熊一起送回了小星座熊居住的山洞。 -我想知道红谷十二庭有哪些金融机构? 红谷十二庭是由汪氏集团旗下子公司江西尤金房地产开发有限公司携手城发投资共同开发的精品社区,项目占地面积约380亩,总建筑面积约41万平方米。 -我想知道红谷十二庭有哪些金融机构? 项目以建设人文型、生态型居住环境为规划目标;创造一个布局合理、功能齐全、交通便捷、绿意盎然、生活方便,有文化内涵的居住区。 -我想知道红谷十二庭有哪些金融机构? 金融机构:工商银行、建设银行、农业银行、中国银行红谷滩支行、商业银行红谷滩支行等 -我想知道红谷十二庭有哪些金融机构? 周边公园:沿乌砂河50米宽绿化带、乌砂河水岸公园、秋水广场、赣江市民公园 -我想知道红谷十二庭有哪些金融机构? 周边医院:新建县人民医院、开心人药店、中寰医院 -我想知道红谷十二庭有哪些金融机构? 周边学校:育新小学红谷滩校区、南师附小红谷滩校区、实验小学红谷滩校区中学:南昌二中红谷滩校区、南昌五中、新建二中、竞秀贵族学校 -我想知道红谷十二庭有哪些金融机构? 周边公共交通:112、204、211、219、222、227、238、501等20多辆公交车在本项目社区门前停靠 -我想知道红谷十二庭有哪些金融机构? 红谷十二庭处在南昌一江两城中的西城中心,位属红谷滩CBD文化公园中心——马兰圩中心组团,红谷滩中心区、红角洲、新建县三区交汇处,南临南期友好路、东接红谷滩中心区、西靠乌砂河水岸公园(50米宽,1000米长)。 -我想知道红谷十二庭有哪些金融机构? 交通便捷,景观资源丰富,生活配套设施齐全,出则繁华,入则幽静,是现代人居的理想地段。 -我想知道红谷十二庭有哪些金融机构? 红谷十二庭户型图 -苏琳最开始进入智通实业是担任什么职位? 现任广东智通人才连锁股份有限公司总裁,清华大学高级工商管理硕士。 -苏琳最开始进入智通实业是担任什么职位? 1994年,加入智通实业,从总经理秘书做起。 -苏琳最开始进入智通实业是担任什么职位? 1995年,智通实业决定进入人才服务行业,被启用去负责新公司的筹建及运营工作,在苏琳的努力下,智通人才智力开发有限公司成立。 -苏琳最开始进入智通实业是担任什么职位? 2003年,面对同城对手的激烈竞争,苏琳冷静对待,领导智通先后接管、并购了同城的腾龙、安达盛人才市场,,“品牌运作,连锁经营,差异制胜”成为苏琳屡屡制胜的法宝。 -苏琳最开始进入智通实业是担任什么职位? 2006年,苏琳先是将智通人才升级为“东莞市智通人才连锁有限公司”,一举成为广东省人才市场目前惟一的连锁机构,随后在东莞同时开设长安、松山湖、清溪等镇区分部,至此智通在东莞共有6个分部。 -苏琳最开始进入智通实业是担任什么职位? 一番大刀阔斧完成东莞布局后,苏琳确定下一个更为高远的目标——进军珠三角,向全国发展连锁机构。 -苏琳最开始进入智通实业是担任什么职位? 到2011年末,苏琳领导的智通人才已在珠三角的东莞、佛山、江门、中山等地,长三角的南京、宁波、合肥等地,中西部的南昌、长沙、武汉、重庆、西安等地设立了20多家连锁经营网点。 -苏琳最开始进入智通实业是担任什么职位? 除了财务副总裁之外,苏琳是智通人才核心管理高层当中唯一的女性,不管是要约采访的记者还是刚刚加入智通的员工,见到苏琳的第一面,都会有一种惊艳的感觉,“一位女企业家居然非常美丽和时尚?!” -苏琳最开始进入智通实业是担任什么职位? 智通管理高层的另外6位男性成员,有一次同时接受一家知名媒体采访时,共同表达了对自己老板的“爱慕”之情,苏琳听后莞尔一笑,指着在座的这几位高层说道“其实,我更爱他们!” -苏琳最开始进入智通实业是担任什么职位? 这种具有独特领导魅力的表述让这位记者唏嘘不已,同时由这样的一个细节让他感受到了智通管理团队的协作力量。 -谁知道黄沙中心小学的邮政编码是多少? 学校于1954年始建于棕树湾村,当时借用一间民房做教室,取名为“黄沙小学”,只有教师1人,学生8人。 -谁知道黄沙中心小学的邮政编码是多少? 1958年在大跃进精神的指导下,实行大集体,全乡集中办学,发展到12个班,300多学生,20名教职工。 -谁知道黄沙中心小学的邮政编码是多少? 1959年解散。 -谁知道黄沙中心小学的邮政编码是多少? 1959年下半年,在上级的扶持下,建了6间木房,搬到1960年学校所在地,有6名教师,3个班,60名学生。 -谁知道黄沙中心小学的邮政编码是多少? 1968年,开始招收一个初中班,“黄沙小学”改名为 “附小”。 -谁知道黄沙中心小学的邮政编码是多少? 当时已发展到5个班,8名教师,110多名学生。 -谁知道黄沙中心小学的邮政编码是多少? 增建土木结构教室两间。 -谁知道黄沙中心小学的邮政编码是多少? 1986年,初中、小学分开办学。 -谁知道黄沙中心小学的邮政编码是多少? 增建部分教师宿舍和教室,办学条件稍有改善,学校初具规模。 -谁知道黄沙中心小学的邮政编码是多少? 1996年,我校在市、县领导及希望工程主管部门的关怀下,决定改为“黄沙希望小学”并拨款32万元,新建一栋4层,12间教室的教学楼,教学条件大有改善。 -谁知道黄沙中心小学的邮政编码是多少? 当时发展到10个班,学生300多人,教职工19人,小学高级教师3人,一级教师7人,二级教师9人。 -谁知道黄沙中心小学的邮政编码是多少? 2003年下半年由于农村教育体制改革,撤销教育组,更名为“黄沙中心小学”。 -谁知道黄沙中心小学的邮政编码是多少? 学校现有在校生177人(含学前42人),设有学前至六年级共7个教学班。 -谁知道黄沙中心小学的邮政编码是多少? 有教师19人,其中大专以上学历11人,中师6人;小学高级教师14人,一级教师5人。 -谁知道黄沙中心小学的邮政编码是多少? 学校校园占地面积2050平方米,生均达15.29平方米,校舍建筑面积1645平方米,生均12.27平方米;设有教师办公室、自然实验、电教室(合二为一)、微机室、图书阅览室(合二为一)、体育室、广播室、少先队活动室。 -谁知道黄沙中心小学的邮政编码是多少? 广西壮族自治区桂林市临桂县黄沙瑶族乡黄沙街 邮编:541113[1] -伊藤实华的职业是什么? 伊藤实华(1984年3月25日-)是日本的女性声优。 -伊藤实华的职业是什么? THREE TREE所属,东京都出身,身长149cm,体重39kg,血型AB型。 -伊藤实华的职业是什么? ポルノグラフィティのLION(森男) -伊藤实华的职业是什么? 2000年 -伊藤实华的职业是什么? 犬夜叉(枫(少女时代)) -伊藤实华的职业是什么? 幻影死神(西亚梨沙) -伊藤实华的职业是什么? 2001年 -伊藤实华的职业是什么? NOIR(ロザリー) -伊藤实华的职业是什么? 2002年 -伊藤实华的职业是什么? 水瓶战记(柠檬) -伊藤实华的职业是什么? 返乡战士(エイファ) -伊藤实华的职业是什么? 2003年 -伊藤实华的职业是什么? 奇诺之旅(女子A(悲しい国)) -伊藤实华的职业是什么? 2004年 -伊藤实华的职业是什么? 爱你宝贝(坂下ミキ) -伊藤实华的职业是什么? Get Ride! アムドライバー(イヴァン・ニルギース幼少期) -伊藤实华的职业是什么? スクールランブル(花井春树(幼少时代)) -伊藤实华的职业是什么? 2005年 -伊藤实华的职业是什么? 光速蒙面侠21(虎吉) -伊藤实华的职业是什么? 
搞笑漫画日和(男子トイレの精、パン美先生) -伊藤实华的职业是什么? 银牙伝说WEED(テル) -伊藤实华的职业是什么? 魔女的考验(真部カレン、守山太郎) -伊藤实华的职业是什么? BUZZER BEATER(レニー) -伊藤实华的职业是什么? 虫师(“眼福眼祸”さき、“草を踏む音”沢(幼少时代)) -伊藤实华的职业是什么? 2006年 -伊藤实华的职业是什么? 魔女之刃(娜梅) -伊藤实华的职业是什么? 反斗小王子(远藤レイラ) -伊藤实华的职业是什么? 搞笑漫画日和2(パン美先生、フグ子、ダンサー、ヤマトの妹、女性) -伊藤实华的职业是什么? 人造昆虫カブトボーグ V×V(ベネチアンの弟、东ルリ、园儿A) -伊藤实华的职业是什么? 2007年 -爆胎监测与安全控制系统英文是什么? 爆胎监测与安全控制系统(Blow-out Monitoring and Brake System),是吉利全球首创,并拥有自主知识产权及专利的一项安全技术。 -爆胎监测与安全控制系统英文是什么? 这项技术主要是出于防止高速爆胎所导致的车辆失控而设计。 -爆胎监测与安全控制系统英文是什么? BMBS爆胎监测与安全控制系统技术于2004年1月28日正式获得中国发明专利授权。 -爆胎监测与安全控制系统英文是什么? 2008年第一代BMBS系统正式与世人见面,BMBS汇集国内外汽车力学、控制学、人体生理学、电子信息学等方面的专家和工程技术人员经过一百余辆试验车累计行程超过五百万公里的可靠性验证,以确保产品的可靠性。 -爆胎监测与安全控制系统英文是什么? BMBS技术方案的核心即是采用智能化自动控制系统,弥补驾驶员生理局限,在爆胎后反应时间为0.5秒,替代驾驶员实施行车制动,保障行车安全。 -爆胎监测与安全控制系统英文是什么? BMBS系统由控制系统和显示系统两大部分组成,控制系统由BMBS开关、BMBS主机、BMBS分机、BMBS真空助力器四部分组成;显示系统由GPS显示、仪表指示灯、语言提示、制动双闪灯组成。 -爆胎监测与安全控制系统英文是什么? 当轮胎气压高于或低于限值时,控制器声光提示胎压异常。 -爆胎监测与安全控制系统英文是什么? 轮胎温度过高时,控制器发出信号提示轮胎温度过高。 -爆胎监测与安全控制系统英文是什么? 发射器电量不足时,控制器显示低电压报警。 -爆胎监测与安全控制系统英文是什么? 发射器受到干扰长期不发射信号时,控制器显示无信号报警。 -爆胎监测与安全控制系统英文是什么? 当汽车电门钥匙接通时,BMBS首先进入自检程序,检测系统各部分功能是否正常,如不正常,BMBS报警灯常亮。 -走读干部现象在哪里比较多? 走读干部一般是指县乡两级干部家住县城以上的城市,本人在县城或者乡镇工作,要么晚出早归,要么周一去单位上班、周五回家过周末。 -走读干部现象在哪里比较多? 对于这种现象,社会上的议论多是批评性的,认为这些干部脱离群众、作风漂浮、官僚主义,造成行政成本增加和腐败。 -走读干部现象在哪里比较多? 截至2014年10月,共有6484名“走读干部”在专项整治中被查处。 -走读干部现象在哪里比较多? 这是中央首次大规模集中处理这一长期遭诟病的干部作风问题。 -走读干部现象在哪里比较多? 干部“走读”问题主要在乡镇地区比较突出,城市地区则较少。 -走读干部现象在哪里比较多? 从历史成因和各地反映的情况来看,产生“走读”现象的主要原因大致有四种: -走读干部现象在哪里比较多? 现今绝大多数乡村都有通往乡镇和县城的石子公路甚至柏油公路,这无疑为农村干部的出行创造了便利条件,为“干部像候鸟,频往家里跑”创造了客观条件。 -走读干部现象在哪里比较多? 选调生、公务员队伍大多是学历较高的大学毕业生,曾在高校所在地的城市生活,不少人向往城市生活,他们不安心长期扎根基层,而是将基层当作跳板,因此他们往往成为“走读”的主力军。 -走读干部现象在哪里比较多? 公仆意识、服务意识淡化,是“走读”现象滋生的主观原因。 -走读干部现象在哪里比较多? 有些党员干部感到自己长期在基层工作,该为自己和家庭想想了。 -走读干部现象在哪里比较多? 于是,不深入群众认真调查研究、认真听取群众意见、认真解决群众的实际困难,也就不难理解了。 -走读干部现象在哪里比较多? 县级党政组织对乡镇领导干部管理的弱化和为基层服务不到位,导致“走读”问题得不到应有的制度约束,是“走读”问题滋长的组织原因。[2] -走读干部现象在哪里比较多? 近些年来,我国一些地方的“干部走读”现象较为普遍,社会上对此议走读干部论颇多。 -走读干部现象在哪里比较多? 所谓“干部走读”,一般是指县乡两级干部家住县城以上的城市,本人在县城或者乡镇工作,要么早出晚归,要么周一去单位上班、周五回家过周末。 -走读干部现象在哪里比较多? 对于这种现象,社会上的议论多是批评性的,认为这些干部脱离群众、作风漂浮、官僚主义,造成行政成本增加和腐败。 -走读干部现象在哪里比较多? 干部走读之所以成为“千夫所指”,是因为这种行为增加了行政成本。 -走读干部现象在哪里比较多? 从根子上说,干部走读是城乡发展不平衡的产物,“人往高处走,水往低处流”,有了更加舒适的生活环境,不管是为了自己生活条件改善也好,还是因为子女教育也好,农村人口向城镇转移,这是必然结果。 -走读干部现象在哪里比较多? “干部走读”的另一个重要原因,是干部人事制度改革。 -走读干部现象在哪里比较多? 目前公务员队伍“凡进必考”,考上公务员的大多是学历较高的大学毕业生,这些大学毕业生来自各个全国各地,一部分在本地结婚生子,沉淀下来;一部分把公务员作为跳板,到基层后或考研,或再参加省考、国考,或想办法调回原籍。 -走读干部现象在哪里比较多? 再加上一些下派干部、异地交流任职干部,构成了看似庞大的“走读”队伍。 -走读干部现象在哪里比较多? 那么,“干部走读”有哪些弊端呢? -走读干部现象在哪里比较多? 一是这些干部人在基层,心在城市,缺乏长期作战的思想,工作不安心。 -走读干部现象在哪里比较多? 周一来上班,周五回家转,对基层工作缺乏热情和感情;二是长期在省市直机关工作,对基层工作不熟悉不了解,工作不热心;三是长期走读,基层干群有工作难汇报,有困难难解决,群众不开心;四是干部来回走读,公车私驾,私费公报,把大量的经济负担转嫁给基层;五是对这些走读干部,基层管不了,上级监督难,节假日期间到哪里去、做什么事,基本处于失控和真空状态,各级组织和基层干群不放心。 -走读干部现象在哪里比较多? 特别需要引起警觉的是,由于少数走读干部有临时思想,满足于“当维持会长”,得过且过混日子,热衷于做一些急功近利、砸锅求铁的短期行为和政绩工程,不愿做打基础、管长远的实事好事,甚至怠政、疏政和懒于理政,影响了党和政府各项方针政策措施的落实,导致基层无政府主义、自由主义抬头,削弱了党和政府的领导,等到矛盾激化甚至不可收拾的时候,处理已是来之不及。 -走读干部现象在哪里比较多? 权利要与义务相等,不能只有义务而没有权利,或是只有权利没有义务。 -走读干部现象在哪里比较多? 如何真正彻底解决乡镇干部“走读”的现象呢? -走读干部现象在哪里比较多? 那就必须让乡镇基层干部义务与权利相等。 -走读干部现象在哪里比较多? 如果不能解决基层干部待遇等问题,即使干部住村,工作上也不会有什么进展的。 -走读干部现象在哪里比较多? 所以,在政治上关心,在生活上照顾,在待遇上提高。 -走读干部现象在哪里比较多? 如,提高基层干部的工资待遇,增加通讯、交通补助;帮助解决子女入学及老人赡养问题;提拔干部优先考虑基层干部;干部退休时的待遇至少不低于机关干部等等。 -化州市良光镇东岸小学学风是什么? 学校全体教职工爱岗敬业,团结拼搏,勇于开拓,大胆创新,进行教育教学改革,努力开辟第二课堂的教学路子,并开通了网络校校通的交流合作方式。 -化州市良光镇东岸小学学风是什么? 现学校教师正在为创建安全文明校园而努力。 -化州市良光镇东岸小学学风是什么? 东岸小学位置偏僻,地处贫穷落后,是良光镇最偏远的学校,学校,下辖分教点——东心埇小学,[1]?。 -化州市良光镇东岸小学学风是什么? 学校2011年有教师22人,学生231人。 -化州市良光镇东岸小学学风是什么? 小学高级教师8人,小学一级教师10人,未定级教师4人,大专学历的教师6人,其余的都具有中师学历。 -化州市良光镇东岸小学学风是什么? 全校共设12个班,学校课程按标准开设。 -化州市良光镇东岸小学学风是什么? 
东岸小学原来是一所破旧不堪,教学质量非常差的薄弱学校。 -化州市良光镇东岸小学学风是什么? 近几年来,在各级政府、教育部门及社会各界热心人士鼎力支持下,学校领导大胆改革创新,致力提高教学质量和教师水平,并加大经费投入,大大改善了办学条件,使学校由差变好,实现了大跨越。 -化州市良光镇东岸小学学风是什么? 学校建设性方面。 -化州市良光镇东岸小学学风是什么? 东岸小学属于革命老区学校,始建于1980年,从东心埇村祠堂搬到这个校址,1990年建造一幢建筑面积为800平方米的南面教学楼, 1998年老促会支持从北面建造一幢1800平方米的教学大楼。 -化州市良光镇东岸小学学风是什么? 学校在管理方面表现方面颇具特色,实现了各项制度的日常化和规范化。 -化州市良光镇东岸小学学风是什么? 学校领导有较强的事业心和责任感,讲求民主与合作,勤政廉政,依法治校,树立了服务意识。 -化州市良光镇东岸小学学风是什么? 学校一贯实施“德育为先,以人为本”的教育方针,制定了“团结,律已,拼搏,创新”的校训。 -化州市良光镇东岸小学学风是什么? 教育风为“爱岗敬业,乐于奉献”,学风为“乐学,勤学,巧学,会学”。 -化州市良光镇东岸小学学风是什么? 校内营造了尊师重教的氛围,形成了良好的校风和学风。 -化州市良光镇东岸小学学风是什么? 教师们爱岗敬业,师德高尚,治学严谨,教研教改气氛浓厚,获得喜人的教研成果。 -化州市良光镇东岸小学学风是什么? 近几年来,教师撰写的教育教学论文共10篇获得县市级以上奖励,获了镇级以上奖励的有100人次。 -化州市良光镇东岸小学学风是什么? 学校德育工作成绩显著,多年被评为“安全事故为零”的学校,良光镇先进学校。 -化州市良光镇东岸小学学风是什么? 特别是教学质量大大提高了。 -化州市良光镇东岸小学学风是什么? 这些成绩得到了上级及群众的充分肯定。 -化州市良光镇东岸小学学风是什么? 1.学校环境欠美观有序,学校大门口及校道有待改造。 -化州市良光镇东岸小学学风是什么? 2.学校管理制度有待改进,部分教师业务水平有待提高。 -化州市良光镇东岸小学学风是什么? 3.教师宿舍、教室及学生宿舍欠缺。 -化州市良光镇东岸小学学风是什么? 4.运动场不够规范,各类体育器材及设施需要增加。 -化州市良光镇东岸小学学风是什么? 5.学生活动空间少,见识面窄,视野不够开阔。 -化州市良光镇东岸小学学风是什么? 1.努力营造和谐的教育教学新气氛。 -化州市良光镇东岸小学学风是什么? 建立科学的管理制度,坚持“与时俱进,以人为本”,真正实现领导对教师,教师对学生之间进行“德治与情治”;学校的人文环境做到“文明,和谐,清新”;德育环境做到“自尊,律已,律人”;心理环境做到“安全,谦虚,奋发”;交际环境做到“团结合作,真诚助人”;景物环境做到“宜人,有序。” -化州市良光镇东岸小学学风是什么? 营造学校与育人的新特色。 -我很好奇发射管的输出功率怎么样? 产生或放大高频功率的静电控制电子管,有时也称振荡管。 -我很好奇发射管的输出功率怎么样? 用于音频或开关电路中的发射管称调制管。 -我很好奇发射管的输出功率怎么样? 发射管是无线电广播、通信、电视发射设备和工业高频设备中的主要电子器件。 -我很好奇发射管的输出功率怎么样? 输出功率和工作频率是发射管的基本技术指标。 -我很好奇发射管的输出功率怎么样? 广播、通信和工业设备的发射管,工作频率一般在30兆赫以下,输出功率在1919年为2千瓦以下,1930年达300千瓦,70年代初已超过1000千瓦,效率高达80%以上。 -我很好奇发射管的输出功率怎么样? 发射管工作频率提高时,输出功率和效率都会降低,因此1936年首次实用的脉冲雷达工作频率仅28兆赫,80年代则已达 400兆赫以上。 -我很好奇发射管的输出功率怎么样? 40年代电视发射管的工作频率为数十兆赫,而80年代初,优良的电视发射管可在1000兆赫下工作,输出功率达20千瓦,效率为40%。 -我很好奇发射管的输出功率怎么样? 平面电极结构的小功率发射三极管可在更高的频率下工作。 -我很好奇发射管的输出功率怎么样? 发射管多采用同心圆筒电极结构。 -我很好奇发射管的输出功率怎么样? 阴极在最内层,向外依次为各个栅极和阳极。 -我很好奇发射管的输出功率怎么样? 图中,自左至右为阴极、第一栅、第二栅、栅极阴极组装件和装入阳极后的整个管子。 -我很好奇发射管的输出功率怎么样? 发射管 -我很好奇发射管的输出功率怎么样? 中小功率发射管多采用间热式氧化物阴极。 -我很好奇发射管的输出功率怎么样? 大功率发射管一般采用碳化钍钨丝阴极,有螺旋、直条或网笼等结构形式。 -我很好奇发射管的输出功率怎么样? 图为网笼式阴极。 -我很好奇发射管的输出功率怎么样? 栅极多用钼丝或钨丝绕制,或用钼片经电加工等方法制造。 -我很好奇发射管的输出功率怎么样? 栅极表面经镀金(或铂)或涂敷锆粉等处理,以降低栅极电子发射,使发射管稳定工作。 -我很好奇发射管的输出功率怎么样? 用气相沉积方法制造的石墨栅极,具有良好的性能。 -我很好奇发射管的输出功率怎么样? 发射管阳极直流输入功率转化为高频输出功率的部分约为75%,其余25%成为阳极热损耗,因此对发射管的阳极必须进行冷却。 -我很好奇发射管的输出功率怎么样? 中小功率发射管的阳极采取自然冷却方式,用镍、钼或石墨等材料制造,装在管壳之内,工作温度可达 600℃。 -我很好奇发射管的输出功率怎么样? 大功率发射管的阳极都用铜制成,并作为真空密封管壳的一部分,采用各种强制冷却方式。 -我很好奇发射管的输出功率怎么样? 各种冷却方式下每平方厘米阳极内表面的散热能力为:水冷100瓦;风冷30瓦;蒸发冷却250瓦;超蒸发冷却1000瓦以上,80年代已制成阳极损耗功率为1250千瓦的超蒸发冷却发射管。 -我很好奇发射管的输出功率怎么样? 发射管也常以冷却方式命名,如风冷发射管、水冷发射管和蒸发冷却发射管。 -我很好奇发射管的输出功率怎么样? 发射管管壳用玻璃或陶瓷制造。 -我很好奇发射管的输出功率怎么样? 小功率发射管内使用含钡的吸气剂;大功率发射管则采用锆、钛、钽等吸气材料,管内压强约为10帕量级。 -我很好奇发射管的输出功率怎么样? 发射管寿命取决于阴极发射电子的能力。 -我很好奇发射管的输出功率怎么样? 大功率发射管寿命最高记录可达8万小时。 -我很好奇发射管的输出功率怎么样? 发射四极管的放大作用和输出输入电路间的隔离效果优于三极管,应用最广。 -我很好奇发射管的输出功率怎么样? 工业高频振荡器普遍采用三极管。 -我很好奇发射管的输出功率怎么样? 五极管多用在小功率范围中。 -鲁能领秀城中央公园有23层,2梯3户的是几号楼? 鲁能领秀城中央公园位于鲁能领秀城景观中轴之上,总占地161.55亩,总建筑面积约40万平米,容积率为2.70,由22栋小高层、高层组成;其绿地率高达35.2%,环境优美,产品更加注重品质化、人性化和自然生态化,是鲁能领秀城的生态人居典范。 -鲁能领秀城中央公园有23层,2梯3户的是几号楼? 中央公园[1] 学区准现房,坐享鲁能领秀城成熟配套,成熟生活一步到位。 -鲁能领秀城中央公园有23层,2梯3户的是几号楼? 经典板式小高层,103㎡2+1房仅22席,稀市推出,错过再无;92㎡经典两房、137㎡舒适三房压轴登场! -鲁能领秀城中央公园有23层,2梯3户的是几号楼? 物业公司: -鲁能领秀城中央公园有23层,2梯3户的是几号楼? 济南凯瑞物业公司;深圳长城物业公司;北京盛世物业有限公司 -鲁能领秀城中央公园有23层,2梯3户的是几号楼? 绿化率: -鲁能领秀城中央公园有23层,2梯3户的是几号楼? 42% -鲁能领秀城中央公园有23层,2梯3户的是几号楼? 容积率: -鲁能领秀城中央公园有23层,2梯3户的是几号楼? 2.70 -鲁能领秀城中央公园有23层,2梯3户的是几号楼? 暖气: -鲁能领秀城中央公园有23层,2梯3户的是几号楼? 集中供暖 -鲁能领秀城中央公园有23层,2梯3户的是几号楼? 楼座展示:中央公园由22栋小高层、高层组成,3、16、17号楼分别是11层小高层,18层和28层的高层。 -鲁能领秀城中央公园有23层,2梯3户的是几号楼? 4号楼是23层,2梯3户。 -鲁能领秀城中央公园有23层,2梯3户的是几号楼? 项目位置: -鬼青蛙在哪里有收录详情? 
鬼青蛙这张卡可以从手卡把这张卡以外的1只水属性怪兽丢弃,从手卡特殊召唤。 -鬼青蛙在哪里有收录详情? 这张卡召唤·反转召唤·特殊召唤成功时,可以从自己的卡组·场上选1只水族·水属性·2星以下的怪兽送去墓地。 -鬼青蛙在哪里有收录详情? 此外,1回合1次,可以通过让自己场上1只怪兽回到手卡,这个回合通常召唤外加上只有1次,自己可以把「鬼青蛙」以外的1只名字带有「青蛙」的怪兽召唤。[1] -鬼青蛙在哪里有收录详情? 游戏王卡包收录详情 -鬼青蛙在哪里有收录详情? [09/09/18] -西湖区有多大? 西湖区是江西省南昌市市辖区。 -西湖区有多大? 为南昌市中心城区之一,有着2200多年历史,是一个物华天宝、人杰地灵的古老城区。 -西湖区有多大? 2004年南昌市老城区区划调整后,西湖区东起京九铁路线与青山湖区毗邻,南以洪城路东段、抚河路南段、象湖以及南隔堤为界与青云谱区、南昌县接壤,西凭赣江中心线与红谷滩新区交界,北沿中山路、北京西路与东湖区相连,所辖面积34.5平方公里,常住人口43万,管辖1个镇、10个街道办事处,设12个行政村、100个社区。 -西湖区有多大? (图)西湖区[南昌市] -西湖区有多大? 西湖原为汉代豫章群古太湖的一部分,唐贞元15年(公元799年)洪恩桥的架设将东太湖分隔成东西两部分,洪恩桥以西谓之西湖,西湖区由此而得名。 -西湖区有多大? 西湖区在1926年南昌设市后分别称第四、五部分,六、七部分。 -西湖区有多大? 1949年解放初期分别称第三、四区。 -西湖区有多大? 1955年分别称抚河区、西湖区。 -西湖区有多大? 1980年两区合并称西湖区。[1] -西湖区有多大? 辖:西湖街道、丁公路街道、广外街道、系马桩街道、绳金塔街道、朝阳洲街道、禾草街街道、十字街街道、瓦子角街道、三眼井街道、上海路街道、筷子巷街道、南站街道。[1] -西湖区有多大? 2002年9月,由原筷子巷街道和原禾草街街道合并设立南浦街道,原广外街道与瓦子角街道的一部分合并设立广润门街道。 -西湖区有多大? 2002年12月1日设立桃源街道。 -西湖区有多大? 2004年区划调整前的西湖区区域:东与青山湖区湖坊乡插花接壤;西临赣江与红谷滩新区隔江相望;南以建设路为界,和青云谱区毗邻;北连中山路,北京西路,与东湖区交界。[1] -西湖区有多大? 2002年9月,由原筷子巷街道和原禾草街街道合并设立南浦街道,原广外街道与瓦子角街道的一部分合并设立广润门街道。 -西湖区有多大? 2002年12月1日设立桃源街道。 -西湖区有多大? 2004年区划调整前的西湖区区域:东与青山湖区湖坊乡插花接壤;西临赣江与红谷滩新区隔江相望;南以建设路为界,和青云谱区毗邻;北连中山路,北京西路,与东湖区交界。 -西湖区有多大? 2004年9月7日,国务院批准(国函[2004]70号)调整南昌市市辖区部分行政区划:将西湖区朝阳洲街道的西船居委会划归东湖区管辖。 -西湖区有多大? 将青山湖区的桃花镇和湖坊镇的同盟村划归西湖区管辖。 -西湖区有多大? 将西湖区十字街街道的谷市街、洪城路、南关口、九四、新丰5个居委会,上海路街道的草珊瑚集团、南昌肠衣厂、电子计算机厂、江西涤纶厂、江地基础公司、曙光、商标彩印厂、南昌市染整厂、江南蓄电池厂、四机床厂、二进、国乐新村12个居委会,南站街道的解放西路东居委会划归青云谱区管辖。 -西湖区有多大? 将西湖区上海路街道的轻化所、洪钢、省人民检察院、电信城东分局、安康、省机械施工公司、省水利设计院、省安装公司、南方电动工具厂、江西橡胶厂、上海路北、南昌电池厂、东华计量所、南昌搪瓷厂、上海路新村、华安针织总厂、江西五金厂、三波电机厂、水文地质大队、二六○厂、省卫生学校、新世纪、上海路住宅区北、塔子桥北、南航、上海路住宅区南、沿河、南昌阀门厂28个居委会,丁公路街道的新魏路、半边街、师大南路、顺化门、岔道口东路、师大、广电厅、手表厂、鸿顺9个居委会,南站街道的工人新村北、工人新村南、商苑、洪都中大道、铁路第三、铁路第四、铁路第六7个居委会划归青山湖区管辖。 -西湖区有多大? 调整后,西湖区辖绳金塔、桃源、朝阳洲、广润门、南浦、西湖、系马桩、十字街、丁公路、南站10个街道和桃花镇,区人民政府驻孺子路。 -西湖区有多大? 调整前,西湖区面积31平方千米,人口52万。 -西湖区有多大? (图)西湖区[南昌市] -西湖区有多大? 西湖区位于江西省省会南昌市的中心地带,具有广阔的发展空间和庞大的消费群体,商贸旅游、娱乐服务业等到各个行业都蕴藏着无限商机,投资前景十分广阔。 -西湖区有多大? 不仅水、电价格低廉,劳动力资源丰富,人均工资和房产价格都比沿海城市低,城区拥有良好的人居环境、低廉的投资成本,巨大的发展潜力。 -西湖区有多大? 105、316、320国道和京九铁路贯穿全境,把南北东西交通连成一线;民航可与上海、北京、广州、深圳、厦门、温州等到地通航,并开通了南昌-新加坡第一条国际航线;水运依托赣江可直达长江各港口;邮电通讯便捷,程控电话、数字微波、图文传真进入国际通讯网络;商检、海关、口岸等涉外机构齐全;水、电、气供应充足。 -西湖区有多大? (图)西湖区[南昌市] -西湖区有多大? 西湖区,是江西省省会南昌市的中心城区,面积34.8平方公里,常住人口51.9万人,辖桃花镇、朝农管理处及10个街道,设13个行政村,116个社区居委会,20个家委会。[2] -西湖区有多大? 2005年11月16日,南昌市《关于同意西湖区桃花镇、桃源、十字街街道办事处行政区划进行调整的批复》 -西湖区有多大? 1、同意将桃花镇的三道闸居委会划归桃源街道办事处管辖。 -青藏虎耳草花期什么时候? 青藏虎耳草多年生草本,高4-11.5厘米,丛生。 -青藏虎耳草花期什么时候? 花期7-8月。 -青藏虎耳草花期什么时候? 分布于甘肃(祁连山地)、青海(黄南、海南、海北)和西藏(加查)。 -青藏虎耳草花期什么时候? 生于海拔3 700-4 250米的林下、高山草甸和高山碎石隙。[1] -青藏虎耳草花期什么时候? 多年生草本,高4-11.5厘米,丛生。 -青藏虎耳草花期什么时候? 茎不分枝,具褐色卷曲柔毛。 -青藏虎耳草花期什么时候? 基生叶具柄,叶片卵形、椭圆形至长圆形,长15-25毫米,宽4-8毫米,腹面无毛,背面和边缘具褐色卷曲柔毛,叶柄长1-3厘米,基部扩大,边缘具褐色卷曲柔毛;茎生叶卵形至椭圆形,长1.5-2厘米,向上渐变小。 -青藏虎耳草花期什么时候? 聚伞花序伞房状,具2-6花;花梗长5-19毫米,密被褐色卷曲柔毛;萼片在花期反曲,卵形至狭卵形,长2.5-4.2毫米,宽1.5-2毫米,先端钝,两面无毛,边缘具褐色卷曲柔毛,3-5脉于先端不汇合;花瓣腹面淡黄色且其中下部具红色斑点,背面紫红色,卵形、狭卵形至近长圆形,长2.5-5.2毫米,宽1.5-2.1毫米,先端钝,基部具长0.5-1毫米之爪,3-5(-7)脉,具2痂体;雄蕊长2-3.6毫米,花丝钻形;子房半下位,周围具环状花盘,花柱长1-1.5毫米。 -青藏虎耳草花期什么时候? 生于高山草甸、碎石间。 -青藏虎耳草花期什么时候? 分布青海、西藏、甘肃、四川等地。 -青藏虎耳草花期什么时候? [1] -青藏虎耳草花期什么时候? 顶峰虎耳草Saxifraga cacuminum Harry Sm. -青藏虎耳草花期什么时候? 对叶虎耳Saxifraga contraria Harry Sm. -青藏虎耳草花期什么时候? 狭瓣虎耳草Saxifraga pseudohirculus Engl. -青藏虎耳草花期什么时候? 唐古特虎耳草Saxifraga tangutica Engl. -青藏虎耳草花期什么时候? 宽叶虎耳草(变种)Saxifraga tangutica Engl. var. platyphylla (Harry Sm.) J. T. Pan -青藏虎耳草花期什么时候? 唐古特虎耳草(原变种)Saxifraga tangutica Engl. var. tangutica -青藏虎耳草花期什么时候? 西藏虎耳草Saxifraga tibetica Losinsk.[1] -青藏虎耳草花期什么时候? Saxifraga przewalskii Engl. in Bull. Acad. Sci. St. -Petersb. 29:115. 1883: Engl et Irmsch. in Bot. Jahrb. 48:580. f. 5E-H. 
1912 et in Engl. Pflanzenr. 67(IV. 117): 107. f. 21 E-H. 1916; J. T. Pan in Acta Phytotax. Sin. 16(2): 16. 1978;中国高等植物图鉴补编2: 30. 1983; 西藏植物志 2: 483. 1985. [1] -生产一支欧文冲锋枪需要多少钱? 欧文冲锋枪 Owen Gun 1945年,在新不列颠手持欧文冲锋枪的澳大利亚士兵 类型 冲锋枪 原产国 ?澳大利亚 服役记录 服役期间 1941年-1960年代 用户 参见使用国 参与战役 第二次世界大战 马来亚紧急状态 朝鲜战争 越南战争 1964年罗德西亚布什战争 生产历史 研发者 伊夫林·欧文(Evelyn Owen) 研发日期 1931年-1939年 生产商 约翰·莱萨特工厂 利特高轻武器工厂 单位制造费用 $ 30/枝 生产日期 1941年-1945年 制造数量 45,000-50,000 枝 衍生型 Mk 1/42 Mk 1/43 Mk 2/43 基本规格 总重 空枪: Mk 1/42:4.24 千克(9.35 磅) Mk 1/43:3.99 千克(8.8 磅) Mk 2/43:3.47 千克(7.65 磅) 全长 806 毫米(31.73 英吋) 枪管长度 247 毫米(9.72 英吋) 弹药 制式:9 × 19 毫米 原型:.38/200 原型:.45 ACP 口径 9 × 19 毫米:9 毫米(.357 英吋) .38/200:9.65 毫米(.38 英吋) .45 ACP:11.43 毫米(.45 英吋) 枪管 1 根,膛线7 条,右旋 枪机种类 直接反冲作用 开放式枪机 发射速率 理论射速: Mk 1/42:700 发/分钟 Mk 1/43:680 发/分钟 Mk 2/43:600 发/分钟 实际射速:120 发/分钟 枪口初速 380-420 米/秒(1,246.72-1,377.95 英尺/秒) 有效射程 瞄具装定射程:91.44 米(100 码) 最大有效射程:123 米(134.51 码) 最大射程 200 米(218.72 码) 供弹方式 32/33 发可拆卸式弹匣 瞄准具型式 机械瞄具:向右偏置的觇孔式照门和片状准星 欧文冲锋枪(英语:Owen Gun,正式名称:Owen Machine Carbine,以下简称为“欧文枪”)是一枝由伊夫林·(埃沃)·欧文(英语:Evelyn (Evo) Owen)于1939年研制、澳大利亚的首枝冲锋枪,制式型发射9 × 19 毫米鲁格手枪子弹。 -生产一支欧文冲锋枪需要多少钱? 欧文冲锋枪是澳大利亚唯一设计和主要服役的二战冲锋枪,并从1943年由澳大利亚陆军所使用,直到1960年代中期。 -生产一支欧文冲锋枪需要多少钱? 由新南威尔士州卧龙岗市出身的欧文枪发明者,伊夫林·欧文,在24岁时于1939年7月向悉尼维多利亚军营的澳大利亚陆军军械官员展示了他所设计的.22 LR口径“卡宾机枪”原型枪。 -生产一支欧文冲锋枪需要多少钱? 该枪却被澳大利亚陆军所拒绝,因为澳大利亚陆军在当时没有承认冲锋枪的价值。 -生产一支欧文冲锋枪需要多少钱? 随着战争的爆发,欧文加入了澳大利亚军队,并且成为一名列兵。 -生产一支欧文冲锋枪需要多少钱? 1940年9月,欧文的邻居,文森特·沃德尔(英语:Vincent Wardell),看到欧文家楼梯后面搁著一个麻布袋,里面放著一枝欧文枪的原型枪。 -生产一支欧文冲锋枪需要多少钱? 而文森特·沃德尔是坎布拉港的大型钢制品厂莱萨特公司的经理,他向欧文的父亲表明了他对其儿子的粗心大意感到痛心,但无论如何仍然解释了这款武器的历史。 -生产一支欧文冲锋枪需要多少钱? 沃德尔对欧文枪的简洁的设计留下了深刻的印象。 -生产一支欧文冲锋枪需要多少钱? 沃德尔安排欧文转调到陆军发明部(英语:Army Inventions Board),并重新开始在枪上的工作。 -生产一支欧文冲锋枪需要多少钱? 军队仍然持续地从负面角度查看该武器,但同时政府开始采取越来越有利的观点。 -生产一支欧文冲锋枪需要多少钱? 该欧文枪原型配备了装在顶部的弹鼓,后来让位给装在顶部的弹匣使用。 -生产一支欧文冲锋枪需要多少钱? 口径的选择亦花了一些时间去解决。 -生产一支欧文冲锋枪需要多少钱? 由于陆军有大批量的柯尔特.45 ACP子弹,它们决定欧文枪需要采用这种口径。 -生产一支欧文冲锋枪需要多少钱? 直到在1941年9月19日官方举办试验时,约翰·莱萨特工厂制成了9 毫米、.38/200和.45 ACP三种口径版本。 -生产一支欧文冲锋枪需要多少钱? 而从美、英进口的斯登冲锋枪和汤普森冲锋枪在试验中作为基准使用。 -生产一支欧文冲锋枪需要多少钱? 作为测试的一部分,所有的枪支都浸没在泥浆里,并以沙土覆盖,以模拟他们将会被使用时最恶劣的环境。 -生产一支欧文冲锋枪需要多少钱? 欧文枪是唯一在这测试中这样对待以后仍可正常操作的冲锋枪。 -生产一支欧文冲锋枪需要多少钱? 虽然测试表现出欧文枪具有比汤普森冲锋枪和司登冲锋枪更优秀的可靠性,陆军没有对其口径作出决定。 -生产一支欧文冲锋枪需要多少钱? 结果它在上级政府干预以后,陆军才下令9 毫米的衍生型为正式口径,并在1941年11月20日正式被澳大利亚陆军采用。 -生产一支欧文冲锋枪需要多少钱? 在欧文枪的寿命期间,其可靠性在澳大利亚部队中赢得了“军人的至爱”(英语:Digger's Darling)的绰号,亦有人传言它受到美军高度青睐。 -生产一支欧文冲锋枪需要多少钱? 欧文枪是在1942年开始正式由坎布拉港和纽卡斯尔的约翰·莱萨特工厂投入生产,在生产高峰期每个星期生产800 支。 -生产一支欧文冲锋枪需要多少钱? 1942年3月至1943年2月之间,莱萨特生产了28,000 枝欧文枪。 -生产一支欧文冲锋枪需要多少钱? 然而,最初的一批弹药类型竟然是错误的,以至10,000 枝欧文枪无法提供弹药。 -生产一支欧文冲锋枪需要多少钱? 政府再一次推翻军方的官僚主义作风??,并让弹药通过其最后的生产阶段,以及运送到当时在新几内亚与日军战斗的澳大利亚部队的手中。 -生产一支欧文冲锋枪需要多少钱? 在1941年至1945年间生产了约50,000 枝欧文枪。 -生产一支欧文冲锋枪需要多少钱? 在战争期间,欧文枪的平均生产成本为$ 30。[1] -生产一支欧文冲锋枪需要多少钱? 虽然它是有点笨重,因为其可靠性,欧文枪在士兵当中变得非常流行。 -生产一支欧文冲锋枪需要多少钱? 它是如此成功,它也被新西兰、英国和美国订购。[2] -生产一支欧文冲锋枪需要多少钱? 欧文枪后来也被澳大利亚部队在朝鲜战争和越南战争,[3]特别是步兵组的侦察兵。 -生产一支欧文冲锋枪需要多少钱? 这仍然是一枝制式的澳大利亚陆军武器,直到1960年代中期,它被F1冲锋枪所取代。 -第二届中国光伏摄影大赛因为什么政策而开始的? 光伏发电不仅是全球能源科技和产业发展的重要方向,也是我国具有国际竞争优势的战略性新兴产业,是我国保障能源安全、治理环境污染、应对气候变化的战略性选择。 -第二届中国光伏摄影大赛因为什么政策而开始的? 2013年7月以来,国家出台了《关于促进光伏产业健康发展的若干意见》等一系列政策,大力推进分布式光伏发电的应用,光伏发电有望走进千家万户,融入百姓民生。 -第二届中国光伏摄影大赛因为什么政策而开始的? 大赛主办方以此为契机,开启了“第二届中国光伏摄影大赛”的征程。 -悬赏任务有哪些类型? 悬赏任务,威客网站上一种任务模式,由雇主在威客网站发布任务,提供一定数额的赏金,以吸引威客们参与。 -悬赏任务有哪些类型? 悬赏任务数额一般在几十到几千不等,但也有几万甚至几十万的任务。 -悬赏任务有哪些类型? 主要以提交的作品的质量好坏作为中标标准,当然其中也带有雇主的主观喜好,中标人数较少,多为一个或几个,因此竞争激烈。 -悬赏任务有哪些类型? 大型悬赏任务赏金数额巨大,中标者也较多,但参与人也很多,对于身有一技之长的威客来讲,悬赏任务十分适合。 -悬赏任务有哪些类型? 悬赏任务的类型主要包括:设计类、文案类、取名类、网站类、编程类、推广类等等。 -悬赏任务有哪些类型? 每一类所适合的威客人群不同,报酬的多少也不同,比如设计类的报酬就比较高,一般都几百到几千,而推广类的计件任务报酬比较少,一般也就几块钱,但花费的时间很少,技术要求也很低。 -悬赏任务有哪些类型? 
1.注册—登陆 -悬赏任务有哪些类型? 2.点击“我要发悬赏”—按照发布流程及提示提交任务要求。 -悬赏任务有哪些类型? 悬赏模式选择->网站托管赏金模式。 -悬赏任务有哪些类型? 威客网站客服稍后会跟发布者联系确认任务要求。 -悬赏任务有哪些类型? 3.没有问题之后就可以预付赏金进行任务发布。 -悬赏任务有哪些类型? 4.会员参与并提交稿件。 -悬赏任务有哪些类型? 5.发布者需要跟会员互动(每个提交稿件的会员都可以),解决问题,完善稿件,初步筛选稿件。 -悬赏任务有哪些类型? 6.任务发布期结束,进入选稿期(在筛选的稿件中选择最后满意的) -悬赏任务有哪些类型? 7.发布者不满意现有稿件可选定一个会员修改至满意为止,或者加价延期重新开放任务进行征稿。 -悬赏任务有哪些类型? (重复第六步)没有问题后进入下一步。 -悬赏任务有哪些类型? 8:中标会员交源文件给发布者—发布者确认—任务结束—网站将赏金付给中标会员。 -悬赏任务有哪些类型? 1、任务发布者自由定价,自由确定悬赏时间,自由发布任务要求,自主确定中标会员和中标方案。 -悬赏任务有哪些类型? 2、任务发布者100%预付任务赏金,让竞标者坚信您的诚意和诚信。 -悬赏任务有哪些类型? 3、任务赏金分配原则:任务一经发布,网站收取20%发布费,中标会员获得赏金的80%。 -悬赏任务有哪些类型? 4、每个任务最终都会选定至少一个作品中标,至少一个竞标者获得赏金。 -悬赏任务有哪些类型? 5、任务发布者若未征集到满意作品,可以加价延期征集,也可让会员修改,会员也可以删除任务。 -悬赏任务有哪些类型? 6、任务发布者自己所在组织的任何人均不能以任何形式参加自己所发布的任务,一经发现则视为任务发布者委托威客网按照网站规则选稿。 -悬赏任务有哪些类型? 7、任务悬赏总金额低于100元(含100元)的任务,悬赏时间最多为7天。 -悬赏任务有哪些类型? 所有任务最长时间不超过30天(特殊任务除外),任务总金额不得低于50元。 -悬赏任务有哪些类型? 8、网赚类、注册类任务总金额不能低于300元人民币,计件任务每个稿件的平均单价不能低于1元人民币。 -悬赏任务有哪些类型? 9、延期任务只有3次加价机会,第1次加价不得低于任务金额的10%,第2次加价不得低于任务总金额的20%,第3次不得低于任务总金额的50%。 -悬赏任务有哪些类型? 每次延期不能超过15天,加价金额不低于50元,特殊任务可以适当加长。 -悬赏任务有哪些类型? 如果为计件任务,且不是网赚类任务,将免费延期,直至征集完规定数量的作品为止。 -悬赏任务有哪些类型? 10、如果威客以交接源文件要挟任务发布者,威客网将扣除威客相关信用值,并取消其中标资格,同时任务将免费延长相应的时间继续征集作品 。 -江湖令由哪些平台运营? 《江湖令》是以隋唐时期为背景的RPG角色扮演类网页游戏。 -江湖令由哪些平台运营? 集角色扮演、策略、冒险等多种游戏元素为一体,画面精美犹如客户端游戏,融合历史、江湖、武功、恩仇多种特色元素,是款不可多得的精品游戏大作。 -江湖令由哪些平台运营? 由ya247平台、91wan游戏平台、2918、4399游戏平台、37wan、6711、兄弟玩网页游戏平台,49you、Y8Y9平台、8090游戏等平台运营的,由07177游戏网发布媒体资讯的网页游戏。 -江湖令由哪些平台运营? 网页游戏《江湖令》由51游戏社区运营,是以隋唐时期为背景的RPG角色扮演类网页游戏。 -江湖令由哪些平台运营? 集角色扮演、策略、冒险等多种游戏元素为一体,画面精美犹如客户端游戏,融合历史、江湖、武功、恩仇多种特色元素,是款不可多得的精品游戏大作… -江湖令由哪些平台运营? 背景故事: -江湖令由哪些平台运营? 隋朝末年,隋炀帝暴政,天下民不聊生,义军四起。 -江湖令由哪些平台运营? 在这动荡的时代中,百姓生活苦不堪言,多少人流离失所,家破人亡。 -江湖令由哪些平台运营? 天下三大势力---飞羽营、上清宫、侠隐岛,也值此机会扩张势力,派出弟子出来行走江湖。 -江湖令由哪些平台运营? 你便是这些弟子中的普通一员,在这群雄并起的年代,你将如何选择自己的未来。 -江湖令由哪些平台运营? 所有的故事,便从瓦岗寨/江都大营开始…… -江湖令由哪些平台运营? 势力: -江湖令由哪些平台运营? ①、飞羽营:【外功、根骨】 -江湖令由哪些平台运营? 南北朝时期,由北方政权创立的一个民间军事团体,经过多年的发展,逐渐成为江湖一大势力。 -江湖令由哪些平台运营? ②、上清宫:【外功、身法】 -江湖令由哪些平台运营? 道家圣地,宫中弟子讲求清静无为,以一种隐世的方式修炼,但身在此乱世,亦也不能独善其身。 -江湖令由哪些平台运营? ③、侠隐岛:【根骨、内力】 -江湖令由哪些平台运营? 位于偏远海岛上的一个世家,岛内弟子大多武功高强,但从不进入江湖行走,适逢乱世,现今岛主也决意作一翻作为。 -江湖令由哪些平台运营? 两大阵营: -江湖令由哪些平台运营? 义军:隋唐末期,百姓生活苦不堪言,有多个有志之士组成义军,对抗当朝暴君,希望建立一个适合百姓安居乐业的天地。 -江湖令由哪些平台运营? 隋军:战争一起即天下打乱,隋军首先要镇压四起的义军,同时在内部慢慢改变现有的朝廷,让天下再次恢复到昔日的安定。 -江湖令由哪些平台运营? 一、宠物品质 -江湖令由哪些平台运营? 宠物的品质分为:灵兽,妖兽,仙兽,圣兽,神兽 -江湖令由哪些平台运营? 二、宠物获取途径 -江湖令由哪些平台运营? 完成任务奖励宠物(其他途径待定)。 -江湖令由哪些平台运营? 三、宠物融合 -江湖令由哪些平台运营? 1、在主界面下方的【宠/骑】按钮进入宠物界面,再点击【融合】即可进入融合界面进行融合,在融合界面可选择要融合的宠物进行融合 -江湖令由哪些平台运营? 2、融合后主宠的形态不变; -江湖令由哪些平台运营? 3、融合后宠物的成长,品质,技能,经验,成长经验,等级都继承成长高的宠物; -江湖令由哪些平台运营? 4、融合宠物技能冲突,则保留成长值高的宠物技能,如果不冲突则叠加在空余的技能位置。 -请问土耳其足球超级联赛是什么时候成立的? 土耳其足球超级联赛(土耳其文:Türkiye 1. Süper Futbol Ligi)是土耳其足球协会管理的职业足球联赛,通常简称“土超”,也是土耳其足球联赛中最高级别。 -请问土耳其足球超级联赛是什么时候成立的? 目前,土超联赛队伍共有18支。 -请问土耳其足球超级联赛是什么时候成立的? 土耳其足球超级联赛 -请问土耳其足球超级联赛是什么时候成立的? 运动项目 足球 -请问土耳其足球超级联赛是什么时候成立的? 成立年份 1959年 -请问土耳其足球超级联赛是什么时候成立的? 参赛队数 18队 -请问土耳其足球超级联赛是什么时候成立的? 国家 土耳其 -请问土耳其足球超级联赛是什么时候成立的? 现任冠军 费内巴切足球俱乐部(2010-2011) -请问土耳其足球超级联赛是什么时候成立的? 夺冠最多队伍 费内巴切足球俱乐部(18次) -请问土耳其足球超级联赛是什么时候成立的? 土耳其足球超级联赛(Türkiye 1. Süper Futbol Ligi)是土耳其足球协会管理的职业足球联赛,通常简称「土超」,也是土耳其足球联赛中最高级别。 -请问土耳其足球超级联赛是什么时候成立的? 土超联赛队伍共有18支。 -请问土耳其足球超级联赛是什么时候成立的? 土超联赛成立于1959年,成立之前土耳其国有多个地区性联赛。 -请问土耳其足球超级联赛是什么时候成立的? 土超联赛成立后便把各地方联赛制度统一起来。 -请问土耳其足球超级联赛是什么时候成立的? 一般土超联赛由八月开始至五月结束,12月至1月会有歇冬期。 -请问土耳其足球超级联赛是什么时候成立的? 十八支球队会互相对叠,各有主场和作客两部分,采计分制。 -请问土耳其足球超级联赛是什么时候成立的? 联赛枋最底的三支球队会降到土耳其足球甲级联赛作赛。 -请问土耳其足球超级联赛是什么时候成立的? 由2005-06年球季起,土超联赛的冠、亚军会取得参加欧洲联赛冠军杯的资格。 -请问土耳其足球超级联赛是什么时候成立的? 成立至今土超联赛乃由两支著名球会所垄断──加拉塔萨雷足球俱乐部和费内巴切足球俱乐部,截至2009-2010赛季,双方各赢得冠军均为17次。 -请问土耳其足球超级联赛是什么时候成立的? 
土超联赛共有18支球队,采取双循环得分制,每场比赛胜方得3分,负方0分,平局双方各得1分。 -请问土耳其足球超级联赛是什么时候成立的? 如果两支球队积分相同,对战成绩好的排名靠前,其次按照净胜球来决定;如果有三支以上的球队分数相同,则按照以下标准来确定排名:1、几支队伍间对战的得分,2、几支队伍间对战的净胜球数,3、总净胜球数。 -请问土耳其足球超级联赛是什么时候成立的? 联赛第1名直接参加下个赛季冠军杯小组赛,第2名参加下个赛季冠军杯资格赛第三轮,第3名进入下个赛季欧洲联赛资格赛第三轮,第4名进入下个赛季欧洲联赛资格赛第二轮,最后三名降入下个赛季的土甲联赛。 -请问土耳其足球超级联赛是什么时候成立的? 该赛季的土耳其杯冠军可参加下个赛季欧洲联赛资格赛第四轮,如果冠军已获得冠军杯资格,则亚军可参加下个赛季欧洲联赛资格赛第四轮,否则名额递补给联赛。 -请问土耳其足球超级联赛是什么时候成立的? 2010年/2011年 费内巴切 -请问土耳其足球超级联赛是什么时候成立的? 2009年/2010年 布尔萨体育(又译贝莎) -请问土耳其足球超级联赛是什么时候成立的? 2008年/2009年 贝西克塔斯 -请问土耳其足球超级联赛是什么时候成立的? 2007年/2008年 加拉塔萨雷 -请问土耳其足球超级联赛是什么时候成立的? 2006年/2007年 费内巴切 -请问土耳其足球超级联赛是什么时候成立的? 2005年/2006年 加拉塔沙雷 -请问土耳其足球超级联赛是什么时候成立的? 2004年/2005年 费内巴切(又译费伦巴治) -请问土耳其足球超级联赛是什么时候成立的? 2003年/2004年 费内巴切 -cid 作Customer IDentity解时是什么意思? ? CID 是 Customer IDentity 的简称,简单来说就是手机的平台版本. CID紧跟IMEI存储在手机的OTP(One Time Programmable)芯片中. CID 后面的数字代表的是索尼爱立信手机软件保护版本号,新的CID不断被使用,以用来防止手机被非索尼爱立信官方的维修程序拿来解锁/刷机/篡改 -cid 作Customer IDentity解时是什么意思? ? CID 是 Customer IDentity 的简称,简单来说就是手机的平台版本. CID紧跟IMEI存储在手机的OTP(One Time Programmable)芯片中. CID 后面的数字代表的是索尼爱立信手机软件保护版本号,新的CID不断被使用,以用来防止手机被非索尼爱立信官方的维修程序拿来解锁/刷机/篡改 -cid 作Customer IDentity解时是什么意思? ? (英)刑事调查局,香港警察的重案组 -cid 作Customer IDentity解时是什么意思? ? Criminal Investigation Department -cid 作Customer IDentity解时是什么意思? ? 佩枪: -cid 作Customer IDentity解时是什么意思? ? 香港警察的CID(刑事侦缉队),各区重案组的探员装备短管点38左轮手枪,其特点是便于收藏,而且不容易卡壳,重量轻,其缺点是装弹量少,只有6发,而且换子弹较慢,威力也一般,如果碰上54式手枪或者M9手枪明显处于下风。 -cid 作Customer IDentity解时是什么意思? ? 香港警察的“刑事侦查”(Criminal Investigation Department)部门,早于1983年起已经不叫做C.I.D.的了,1983年香港警察队的重整架构,撤销了C.I.D. ( Criminal Investigation Dept.) “刑事侦缉处”,将“刑事侦查”部门归入去“行动处”内,是“行动处”内的一个分支部门,叫“刑事部”( Crime Wing )。 -cid 作Customer IDentity解时是什么意思? ? 再于90年代的一次警队重整架构,香港警队成立了新的「刑事及保安处」,再将“刑事侦查”部门归入目前的「刑事及保安处」的“处”级单位,是归入这个“处”下的一个部门,亦叫“刑事部” ( Crime Wing ),由一个助理警务处长(刑事)领导。 -cid 作Customer IDentity解时是什么意思? ? 但是时至今天,CID虽已经是一个老旧的名称,香港市民、甚至香港警察都是习惯性的沿用这个历史上的叫法 . -cid 作Customer IDentity解时是什么意思? ? CID格式是美国Adobe公司发表的最新字库格式,它具有易扩充、速度快、兼容性好、简便、灵活等特点,已成为国内开发中文字库的热点,也为用户使用字库提供质量更好,数量更多的字体。 -cid 作Customer IDentity解时是什么意思? ? CID (Character identifier)就是字符识别码,在组成方式上分成CIDFont,CMap表两部分。 -cid 作Customer IDentity解时是什么意思? ? CIDFont文件即总字符集,包括了一种特定语言中所有常用的字符,把这些字符排序,它们在总字符集中排列的次序号就是各个字符的CID标识码(Index);CMap(Character Map)表即字符映像文件,将字符的编码(Code)映像到字符的CID标识码(Index)。 -cid 作Customer IDentity解时是什么意思? ? CID字库完全针对大字符集市场设计,其基本过程为:先根据Code,在CMap表查到Index,然后在CIDFont文件找到相应的字形数据。 -本町位于什么地方? 本条目记述台湾日治时期,各都市之本町。 -本町位于什么地方? 为台湾日治时期台北市之行政区,共分一~四丁目,在表町之西。 -本町位于什么地方? 以现在的位置来看,本町位于现台北市中正区的西北角,约位于忠孝西路一段往西至台北邮局东侧。 -本町位于什么地方? 再向南至开封街一段,沿此路线向西至开封街一段60号,顺60号到汉口街一段向东到现在华南银行总行附近画一条直线到衡阳路。 -本町位于什么地方? 再向东至重庆南路一段,由重庆南路一段回到原点这个范围内。 -本町位于什么地方? 另外,重庆南路一段在当时名为“本町通”。 -本町位于什么地方? 此地方自日治时期起,就是繁华的商业地区,当时也有三和银行、台北专卖分局、日本石油等重要商业机构。 -本町位于什么地方? 其中,专卖分局是战后二二八事件的主要起始点。 -本町位于什么地方? 台湾贮蓄银行(一丁目) -本町位于什么地方? 三和银行(二丁目) -本町位于什么地方? 专卖局台北分局(三丁目) -本町位于什么地方? 日本石油(四丁目) -本町位于什么地方? 为台湾日治时期台南市之行政区。 -本町位于什么地方? 范围包括清代旧街名枋桥头前、枋桥头后、鞋、草花、天公埕、竹仔、下大埕、帽仔、武馆、统领巷、大井头、内宫后、内南町。 -本町位于什么地方? 为清代台南城最繁华的区域。 -本町位于什么地方? 台南公会堂 -本町位于什么地方? 北极殿 -本町位于什么地方? 开基武庙 -本町位于什么地方? 町名改正 -本町位于什么地方? 这是一个与台湾相关的小作品。 -本町位于什么地方? 你可以通过编辑或修订扩充其内容。 -《行走的观点:埃及》的条形码是多少? 出版社: 上海社会科学院出版社; 第1版 (2006年5月1日) -《行走的观点:埃及》的条形码是多少? 丛书名: 时代建筑视觉旅行丛书 -《行走的观点:埃及》的条形码是多少? 条形码: 9787806818640 -《行走的观点:埃及》的条形码是多少? 尺寸: 18 x 13.1 x 0.7 cm -《行走的观点:埃及》的条形码是多少? 重量: 181 g -《行走的观点:埃及》的条形码是多少? 漂浮在沙与海市蜃楼之上的金字塔曾经是否是你的一个梦。 -《行走的观点:埃及》的条形码是多少? 埃及,这片蕴蓄了5000年文明的土地,本书为你撩开它神秘的纱。 -《行走的观点:埃及》的条形码是多少? 诸神、金字塔、神庙、狮身人面像、法老、艳后吸引着我们的注意力;缠绵悱恻的象形文字、医学、雕刻等留给我们的文明,不断引发我们对古代文明的惊喜和赞叹。 -《行走的观点:埃及》的条形码是多少? 尼罗河畔的奇异之旅,数千年的古老文明,尽收在你的眼底…… -《行走的观点:埃及》的条形码是多少? 
本书集历史、文化、地理等知识于一体,并以优美、流畅文笔,简明扼要地阐述了埃及的地理环境、政治经济、历史沿革、文化艺术,以大量富有艺术感染力的彩色照片,生动形象地展示了埃及最具特色的名胜古迹、风土人情和自然风光。 -《行走的观点:埃及》的条形码是多少? 古埃及历史 -老挝人民军的工兵部队有几个营? 老挝人民军前身为老挝爱国战线领导的“寮国战斗部队”(即“巴特寮”),始建于1949年1月20日,1965年10月改名为老挝人民解放军,1982年7月改称现名。 -老挝人民军的工兵部队有几个营? 最高领导机构是中央国防和治安委员会,朱马里·赛雅颂任主席,隆再·皮吉任国防部长。 -老挝人民军的工兵部队有几个营? 实行义务兵役制,服役期最少18个月。[1] -老挝人民军的工兵部队有几个营? ?老挝军队在老挝社会中有较好的地位和保障,工资待遇比地方政府工作人员略高。 -老挝人民军的工兵部队有几个营? 武装部队总兵力约6万人,其中陆军约5万人,主力部队编为5个步兵师;空军2000多人;海军(内河巡逻部队)1000多人;部队机关院校5000人。[1] -老挝人民军的工兵部队有几个营? 老挝人民军军旗 -老挝人民军的工兵部队有几个营? 1991年8月14日通过的《老挝人民民主共和国宪法》第11条规定:国家执行保卫国防和维护社会安宁的政策。 -老挝人民军的工兵部队有几个营? 全体公民和国防力量、治安力量必须发扬忠于祖国、忠于人民的精神,履行保卫革命成果、保卫人民生命财产及和平劳动的任务,积极参加国家建设事业。 -老挝人民军的工兵部队有几个营? 最高领导机构是中央国防和治安委员会。 -老挝人民军的工兵部队有几个营? 主席由老挝人民革命党中央委员会总书记兼任。 -老挝人民军的工兵部队有几个营? 老挝陆军成立最早,兵力最多,约有5万人。 -老挝人民军的工兵部队有几个营? 其中主力部队步兵师5个、7个独立团、30多个营、65个独立连。 -老挝人民军的工兵部队有几个营? 地方部队30余个营及县属部队。 -老挝人民军的工兵部队有几个营? 地面炮兵2个团,10多个营。 -老挝人民军的工兵部队有几个营? 高射炮兵1个团9个营。 -老挝人民军的工兵部队有几个营? 导弹部队2个营。 -老挝人民军的工兵部队有几个营? 装甲兵7个营。 -老挝人民军的工兵部队有几个营? 特工部队6个营。 -老挝人民军的工兵部队有几个营? 通讯部队9个营。 -老挝人民军的工兵部队有几个营? 工兵部队6个营。 -老挝人民军的工兵部队有几个营? 基建工程兵2个团13个营。 -老挝人民军的工兵部队有几个营? 运输部队7个营。 -老挝人民军的工兵部队有几个营? 陆军的装备基本是中国和前苏联援助的装备和部分从抗美战争中缴获的美式装备。 -老挝人民军的工兵部队有几个营? 老挝内河部队总兵力约1700人,装备有内河船艇110多艘,编成4个艇队。 -老挝人民军的工兵部队有几个营? 有芒宽、巴能、纳坎、他曲、南盖、巴色等8个基地。 -老挝人民军的工兵部队有几个营? 空军于1975年8月组建,现有2个团、11个飞行大队,总兵力约2000人。 -老挝人民军的工兵部队有几个营? 装备有各种飞机140架,其中主要由前苏联提供和从万象政权的皇家空军手中接管。 -老挝人民军的工兵部队有几个营? 随着军队建设质量的提高,老挝人民军对外军事合作步伐也日益扩大,近年来先后与俄罗斯、印度、马来西亚、越南、菲律宾等国拓展了军事交流与合作的内容。 -老挝人民军的工兵部队有几个营? 2003年1月,印度决定向老挝援助一批军事装备和物资,并承诺提供技术帮助。 -老挝人民军的工兵部队有几个营? 2003年6月,老挝向俄罗斯订购了一批新式防空武器;2003年4月,老挝与越南签署了越南帮助老挝培训军事指挥干部和特种部队以及完成军队通信系统改造等多项协议。 -《焚心之城》的主角是谁? 《焚心之城》[1] 为网络作家老子扛过枪创作的一部都市类小说,目前正在创世中文网连载中。 -《焚心之城》的主角是谁? 乡下大男孩薛城,是一个不甘于生活现状的混混,他混过、爱过、也深深地被伤害过。 -《焚心之城》的主角是谁? 本料此生当浑浑噩噩,拼搏街头。 -《焚心之城》的主角是谁? 高考的成绩却给了他一点渺茫的希望,二月后,大学如期吹响了他进城的号角。 -《焚心之城》的主角是谁? 繁华的都市,热血的人生,冷眼嘲笑中,他发誓再不做一个平常人! -《焚心之城》的主角是谁? 江北小城,黑河大地,他要行走过的每一个角落都有他的传说。 -《焚心之城》的主角是谁? 扯出一面旗,拉一帮兄弟,做男人,就要多一份担当,活一口傲气。 -《焚心之城》的主角是谁? (日期截止到2014年10月23日凌晨) -请问香港利丰集团是什么时候成立的? 香港利丰集团前身是广州的华资贸易 (1906 - 1949) ,利丰是香港历史最悠久的出口贸易商号之一。 -请问香港利丰集团是什么时候成立的? 于1906年,冯柏燎先生和李道明先生在广州创立了利丰贸易公司;是当时中国第一家华资的对外贸易出口商。 -请问香港利丰集团是什么时候成立的? 利丰于1906年创立,初时只从事瓷器及丝绸生意;一年之后,增添了其它的货品,包括竹器、藤器、玉石、象牙及其它手工艺品,包括烟花爆竹类别。 -请问香港利丰集团是什么时候成立的? 在早期的对外贸易,中国南方内河港因水深不足不能行驶远洋船,反之香港港口水深岸阔,占尽地利。 -请问香港利丰集团是什么时候成立的? 因此,在香港成立分公司的责任,落在冯柏燎先生的三子冯汉柱先生身上。 -请问香港利丰集团是什么时候成立的? 1937年12月28日,利丰(1937)有限公司正式在香港创立。 -请问香港利丰集团是什么时候成立的? 第二次世界大战期间,利丰暂停贸易业务。 -请问香港利丰集团是什么时候成立的? 1943年,随着创办人冯柏燎先生去世后,业务移交给冯氏家族第二代。 -请问香港利丰集团是什么时候成立的? 之后,向来不参与业务管理的合伙人李道明先生宣布退休,将所拥有的利丰股权全部卖给冯氏家族。 -请问香港利丰集团是什么时候成立的? 目前由哈佛冯家两兄弟William Fung , Victor Fung和CEO Bruce Rockowitz 管理。 -请问香港利丰集团是什么时候成立的? 截止到2012年,集团旗下有利亚﹝零售﹞有限公司、利和集团、利邦时装有限公司、利越时装有限公司、利丰贸易有限公司。 -请问香港利丰集团是什么时候成立的? 利亚(零售)连锁,业务包括大家所熟悉的:OK便利店、玩具〝反〞斗城和圣安娜饼屋;范围包括香港、台湾、新加坡、马来西亚、至中国大陆及东南亚其它市场逾600多家店 -请问香港利丰集团是什么时候成立的? 利和集团,IDS以专业物流服务为根基,为客户提供经销,物流,制造服务领域内的一系列服务项目。 -请问香港利丰集团是什么时候成立的? 业务网络覆盖大中华区,东盟,美国及英国,经营着90多个经销中心,在中国设有18个经销公司,10,000家现代经销门店。 -请问香港利丰集团是什么时候成立的? 利邦(上海)时装贸易有限公司为大中华区其中一家大型男士服装零售集团。 -请问香港利丰集团是什么时候成立的? 现在在中国大陆、香港、台湾和澳门收购经营11个包括Cerruti 1881,Gieves & Hawkes,Kent & curwen和D’urban 等中档到高档的男士服装品牌,全国有超过350间门店设于各一线城市之高级商场及百货公司。 -请问香港利丰集团是什么时候成立的? 利越(上海)服装商贸有限公司隶属于Branded Lifestyle,负责中国大陆地区LEO里奥(意大利)、GIBO捷宝(意大利)、UFFIZI古杰师(意大利)、OVVIO奥维路(意大利)、Roots绿适(加拿大,全球服装排名第四)品牌销售业务 -请问香港利丰集团是什么时候成立的? 利丰(贸易)1995年收购了英之杰采购服务,1999年收购太古贸易有限公司(Swire & Maclain) 和金巴莉有限公司(Camberley),2000年和2002年分别收购香港采购出口集团Colby Group及Janco Oversea Limited,大大扩张了在美国及欧洲的顾客群,自2008年经济危机起一直到现在,收购多家欧、美、印、非等地区的时尚品牌,如英国品牌Visage,仅2011年上半年6个月就完成26个品牌的收购。 -请问香港利丰集团是什么时候成立的? 
2004年利丰与Levi Strauss & Co.签订特许经营协议 -请问香港利丰集团是什么时候成立的? 2005年利丰伙拍Daymon Worldwide为全球供应私有品牌和特许品牌 -请问香港利丰集团是什么时候成立的? 2006年收购Rossetti手袋业务及Oxford Womenswear Group 强化美国批发业务 -请问香港利丰集团是什么时候成立的? 2007年收购Tommy Hilfiher全球采购业务,收购CGroup、Peter Black International LTD、Regetta USA LLC和American Marketing Enterprice -请问香港利丰集团是什么时候成立的? 2008年收购Kent&Curwen全球特许经营权,收购Van Zeeland,Inc和Miles Fashion Group -请问香港利丰集团是什么时候成立的? 2009年收购加拿大休闲品牌Roots ,收购Wear Me Appearl,LLC。 -请问香港利丰集团是什么时候成立的? 与Hudson's Bay、Wolverine Worldwide Inc、Talbots、Liz Claiborne达成了采购协议 -请问香港利丰集团是什么时候成立的? 2010年收购Oxford apparel Visage Group LTD -请问香港利丰集团是什么时候成立的? 2011年一月收购土耳其Modium、美国女性时尚Beyond Productions,三月收购贸易公司Celissa 、玩具公司Techno Source USA, Inc.、卡通品牌产品TVMania和法国著名时装一线品牌Cerruti 1881,五月收购Loyaltex Apparel Ltd.、女装Hampshire Designers和英国彩妆Collection 2000,六月收购家私贸易Exim Designs Co., Ltd.,七月收购家庭旅行产业Union Rich USA, LLC和设计公司Lloyd Textile Fashion Company Limited,八月收购童装Fishman & Tobin和Crimzon Rose,九月收购家私贸易True Innovations, LLC、日用品企业Midway Enterprises和Wonderful World。 -请问香港利丰集团是什么时候成立的? 十二月与USPA – U.S. Polo Association签署授权协议。 -请问香港利丰集团是什么时候成立的? 利丰的精神:积极进取,不断认识并争取有利于客户和自身进步的机会;以行动为主导,对客户、供应商及职工的需求作出快速的决定。 -请问香港利丰集团是什么时候成立的? 利丰的最终目标:在产品采购、销售、流转的各环节建立全球性队伍提供多元化服务,利丰成员有效合作,共达目标。 -如何使魔兽变种akt不被查杀? Trojan/PSW.Moshou.akt“魔兽”变种akt是“魔兽”木马家族的最新成员之一,采用Delphi 6.0-7.0编写,并经过加壳处理。 -如何使魔兽变种akt不被查杀? “魔兽”变种akt运行后,自我复制到被感染计算机的指定目录下。 -如何使魔兽变种akt不被查杀? 修改注册表,实现木马开机自动运行。 -如何使魔兽变种akt不被查杀? 自我注入到被感染计算机的“explorer.exe”、“notepad.exe”等用户级权限的进程中加载运行,隐藏自我,防止被查杀。 -如何使魔兽变种akt不被查杀? 在后台秘密监视用户打开的窗口标题,盗取网络游戏《魔兽世界》玩家的游戏帐号、游戏密码、角色等级、装备信息、金钱数量等信息,并在后台将窃取到的玩家信息发送到骇客指定的远程服务器上,致使玩家游戏帐号、装备物品、金钱等丢失,给游戏玩家造成非常大的损失。 -丙种球蛋白能预防什么病情? 丙种球蛋白预防传染性肝炎,预防麻疹等病毒性疾病感染,治疗先天性丙种球蛋白缺乏症 ,与抗生素合并使用,可提高对某些严重细菌性和病毒性疾病感染的疗效。 -丙种球蛋白能预防什么病情? 中文简称:“丙球” -丙种球蛋白能预防什么病情? 英文名称:γ-globulin、gamma globulin -丙种球蛋白能预防什么病情? 【别名】 免疫血清球蛋白,普通免疫球蛋白,人血丙种球蛋白,丙种球蛋白,静脉注射用人免疫球蛋白(pH4) -丙种球蛋白能预防什么病情? 注:由于人血中的免疫球蛋白大多数为丙种球蛋白(γ-球蛋白),有时丙种球蛋白也被混称为“免疫球蛋白”(immunoglobulin) 。 -丙种球蛋白能预防什么病情? 冻干制剂应为白色或灰白色的疏松体,液体制剂和冻干制剂溶解后,溶液应为接近无色或淡黄色的澄明液体,微带乳光。 -丙种球蛋白能预防什么病情? 但不应含有异物或摇不散的沉淀。 -丙种球蛋白能预防什么病情? 注射丙种球蛋白是一种被动免疫疗法。 -丙种球蛋白能预防什么病情? 它是把免疫球蛋白内含有的大量抗体输给受者,使之从低或无免疫状态很快达到暂时免疫保护状态。 -丙种球蛋白能预防什么病情? 由于抗体与抗原相互作用起到直接中和毒素与杀死细菌和病毒。 -丙种球蛋白能预防什么病情? 因此免疫球蛋白制品对预防细菌、病毒性感染有一定的作用[1]。 -丙种球蛋白能预防什么病情? 人免疫球蛋白的生物半衰期为16~24天。 -丙种球蛋白能预防什么病情? 1、丙种球蛋白[2]含有健康人群血清所具有的各种抗体,因而有增强机体抵抗力以预防感染的作用。 -丙种球蛋白能预防什么病情? 2、主要治疗先天性丙种球蛋白缺乏症和免疫缺陷病 -丙种球蛋白能预防什么病情? 3、预防传染性肝炎,如甲型肝炎和乙型肝炎等。 -丙种球蛋白能预防什么病情? 4、用于麻疹、水痘、腮腺炎、带状疱疹等病毒感染和细菌感染的防治 -丙种球蛋白能预防什么病情? 5、也可用于哮喘、过敏性鼻炎、湿疹等内源性过敏性疾病。 -丙种球蛋白能预防什么病情? 6、与抗生素合并使用,可提高对某些严重细菌性和病毒性疾病感染的疗效。 -丙种球蛋白能预防什么病情? 7、川崎病,又称皮肤粘膜淋巴结综合征,常见于儿童,丙种球蛋白是主要的治疗药物。 -丙种球蛋白能预防什么病情? 1、对免疫球蛋白过敏或有其他严重过敏史者。 -丙种球蛋白能预防什么病情? 2、有IgA抗体的选择性IgA缺乏者。 -丙种球蛋白能预防什么病情? 3、发烧患者禁用或慎用。 -浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (1997年9月1日浙江省第八届人民代表大会常务委员会第三十九次会议通过 1997年9月9日浙江省第八届人民代表大会常务委员会公告第六十九号公布自公布之日起施行) -浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 为了保护人的生命和健康,发扬人道主义精神,促进社会发展与和平进步事业,根据《中华人民共和国红十字会法》,结合本省实际,制定本办法。 -浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 本省县级以上按行政区域建立的红十字会,是中国红十字会的地方组织,是从事人道主义工作的社会救助团体,依法取得社会团体法人资格,设置工作机构,配备专职工作人员,依照《中国红十字会章程》独立自主地开展工作。 -浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 全省性行业根据需要可以建立行业红十字会,配备专职或兼职工作人员。 -浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 街道、乡(镇)、机关、团体、学校、企业、事业单位根据需要,可以依照《中国红十字会章程》建立红十字会的基层组织。 -浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 上级红十字会指导下级红十字会的工作。 -浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 县级以上地方红十字会指导所在行政区域行业红十字会和基层红十字会的工作。 -浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 
人民政府对红十字会给予支持和资助,保障红十字会依法履行职责,并对其活动进行监督;红十字会协助人民政府开展与其职责有关的活动。 -浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 全社会都应当关心和支持红十字事业。 -浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 本省公民和单位承认《中国红十字会章程》并缴纳会费的,可以自愿参加红十字会,成为红十字会的个人会员或团体会员。 -浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 个人会员由本人申请,基层红十字会批准,发给会员证;团体会员由单位申请,县级以上红十字会批准,发给团体会员证。 -浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 个人会员和团体会员应当遵守《中华人民共和国红十字会法》和《中国红十字会章程》,热心红十字事业,履行会员的义务,并享有会员的权利。 -浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 县级以上红十字会理事会由会员代表大会民主选举产生。 -浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 理事会民主选举产生会长和副会长;根据会长提名,决定秘书长、副秘书长人选。 -浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 县级以上红十字会可以设名誉会长、名誉副会长和名誉理事,由同级红十字会理事会聘请。 -浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 省、市(地)红十字会根据独立、平等、互相尊重的原则,发展同境外、国外地方红十字会和红新月会的友好往来和合作关系。 -浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 红十字会履行下列职责: -浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (一)宣传、贯彻《中华人民共和国红十字会法》和本办法; -浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (二)开展救灾的准备工作,筹措救灾款物;在自然灾害和突发事件中,对伤病人员和其他受害者进行救助; -浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (三)普及卫生救护和防病知识,进行初级卫生救护培训,对交通、电力、建筑、矿山等容易发生意外伤害的单位进行现场救护培训; -浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (四)组织群众参加现场救护; -浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (五)参与输血献血工作,推动无偿献血; -浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (六)开展红十字青少年活动; -浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (七)根据中国红十字会总会部署,参加国际人道主义救援工作; -浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (八)依照国际红十字和红新月运动的基本原则,完成同级人民政府和上级红十字会委托的有关事宜; -浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (九)《中华人民共和国红十宇会法》和《中国红十字会章程》规定的其他职责。 -浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 第八条 红十字会经费的主要来源: -浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (一)红十字会会员缴纳的会费; -浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (二)接受国内外组织和个人捐赠的款物; -浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (三)红十字会的动产、不动产以及兴办社会福利事业和经济实体的收入; -宝湖庭院绿化率多少? 建发·宝湖庭院位于银川市金凤区核心地带—正源南街与长城中路交汇处向东500米。 -宝湖庭院绿化率多少? 项目已于2012年4月开工建设,总占地约4.2万平方米,总建筑面积约11.2万平方米,容积率2.14,绿化率35%,预计可入住630户。 -宝湖庭院绿化率多少? “建发·宝湖庭院”是银川建发集团股份有限公司继“建发·宝湖湾”之后,在宝湖湖区的又一力作。 -宝湖庭院绿化率多少? 项目周边发展成熟,东有唐徕渠景观水道,西临银川市交通主干道正源街;南侧与宝湖湿地公园遥相呼应。 -宝湖庭院绿化率多少? “宝湖庭院”项目公共交通资源丰富:15路、21路、35路、38路、43路公交车贯穿银川市各地,出行便利。 -宝湖庭院绿化率多少? 距离新百良田购物广场约1公里,工人疗养院600米,宝湖公园1公里,唐徕渠景观水道500米。 -宝湖庭院绿化率多少? 项目位置优越,购物、餐饮、医疗、交通、休闲等生活资源丰富。[1] -宝湖庭院绿化率多少? 建发·宝湖庭院建筑及景观设置传承建发一贯“简约、大气”的风格:搂间距宽广,确保每一座楼宇视野开阔通透。 -宝湖庭院绿化率多少? 楼宇位置错落有置,外立面设计大气沉稳别致。 -宝湖庭院绿化率多少? 项目内部休闲绿地、景观小品点缀其中,道路及停车系统设计合理,停车及通行条件便利。 -宝湖庭院绿化率多少? 社区会所、幼儿园、活动室、医疗服务中心等生活配套一应俱全。 -宝湖庭院绿化率多少? 行政区域:金凤区 -大月兔(中秋艺术作品)的作者还有哪些代表作? 大月兔是荷兰“大黄鸭”之父弗洛伦泰因·霍夫曼打造的大型装置艺术作品,该作品首次亮相于台湾桃园大园乡海军基地,为了迎接中秋节的到来;在展览期间,海军基地也首次对外开放。 -大月兔(中秋艺术作品)的作者还有哪些代表作? 霍夫曼觉得中国神话中捣杵的玉兔很有想象力,于是特别创作了“月兔”,这也是“月兔”新作第一次展出。[1] -大月兔(中秋艺术作品)的作者还有哪些代表作? ?2014年9月15日因工人施工不慎,遭火烧毁。[2] -大月兔(中秋艺术作品)的作者还有哪些代表作? “大月兔”外表采用的杜邦防水纸、会随风飘动,内部以木材加保丽龙框架支撑做成。 -大月兔(中秋艺术作品)的作者还有哪些代表作? 兔毛用防水纸做成,材质完全防水,不怕日晒雨淋。[3 -大月兔(中秋艺术作品)的作者还有哪些代表作? -4] -大月兔(中秋艺术作品)的作者还有哪些代表作? 25米的“月兔”倚靠在机 -大月兔(中秋艺术作品)的作者还有哪些代表作? 堡上望着天空,像在思考又像赏月。 -大月兔(中秋艺术作品)的作者还有哪些代表作? 月兔斜躺在机堡上,意在思考生命、边做白日梦,编织自己的故事。[3] -大月兔(中秋艺术作品)的作者还有哪些代表作? 台湾桃园大园乡海军基地也首度对外开放。 -大月兔(中秋艺术作品)的作者还有哪些代表作? 428公顷的海军基地中,地景艺术节使用约40公顷,展场包括过去军机机堡、跑道等,由于这处基地过去警备森严,不对外开放,这次结合地景艺术展出,也可一窥过去是黑猫中队基地的神秘面纱。 -大月兔(中秋艺术作品)的作者还有哪些代表作? 2014年9月2日,桃园县政府文化局举行“踩线团”,让 -大月兔(中秋艺术作品)的作者还有哪些代表作? 大月兔 -大月兔(中秋艺术作品)的作者还有哪些代表作? 各项地景艺术作品呈现在媒体眼中,虽然“月兔”仍在进行最后的细节赶工,但横躺在机堡上的“月兔”雏形已经完工。[5] -大月兔(中秋艺术作品)的作者还有哪些代表作? “这么大”、“好可爱呦”是不少踩线团成员对“月兔”的直觉;尤其在蓝天的衬托及前方绿草的组合下,呈现犹如真实版的爱丽丝梦游仙境。[6] -大月兔(中秋艺术作品)的作者还有哪些代表作? 
霍夫曼的作品大月兔,“从平凡中,创作出不平凡的视觉”,创造出观赏者打从心中油然而生的幸福感,拉近观赏者的距离。[6] -大月兔(中秋艺术作品)的作者还有哪些代表作? 2014年9月15日早 -大月兔(中秋艺术作品)的作者还有哪些代表作? 上,施工人员要将月兔拆解,搬离海军基地草皮时,疑施工拆除的卡车,在拆除过程,故障起火,起火的卡车不慎延烧到兔子,造成兔子起火燃烧,消防队员即刻抢救,白色的大月兔立即变成焦黑的火烧兔。[7] -大月兔(中秋艺术作品)的作者还有哪些代表作? 桃园县府表示相当遗憾及难过,也不排除向包商求偿,也已将此事告知霍夫曼。[2] -大月兔(中秋艺术作品)的作者还有哪些代表作? ?[8] -大月兔(中秋艺术作品)的作者还有哪些代表作? 弗洛伦泰因·霍夫曼,荷兰艺术家,以在公共空间创作巨大造型 -大月兔(中秋艺术作品)的作者还有哪些代表作? 物的艺术项目见长。 -大月兔(中秋艺术作品)的作者还有哪些代表作? 代表作品包括“胖猴子”(2010年在巴西圣保罗展出)、“大黄兔”(2011年在瑞典厄勒布鲁展出)、粉红猫(2014年5月在上海亮相)、大黄鸭(Rubber Duck)、月兔等。 -英国耆卫保险公司有多少保险客户? 英国耆卫保险公司(Old Mutual plc)成立于1845年,一直在伦敦证券交易所(伦敦证券交易所:OML)作第一上市,也是全球排名第32位(按营业收入排名)的保险公司(人寿/健康)。 -英国耆卫保险公司有多少保险客户? 公司是全球财富500强公司之一,也是被列入英国金融时报100指数的金融服务集团之一。 -英国耆卫保险公司有多少保险客户? Old Mutual 是一家国际金融服务公司,拥有近320万个保险客户,240万个银行储户,270,000个短期保险客户以及700,000个信托客户 -英国耆卫保险公司有多少保险客户? 英国耆卫保险公司(Old Mutual)是一家国际金融服务公司,总部设在伦敦,主要为全球客户提供长期储蓄的解决方案、资产管理、短期保险和金融服务等,目前业务遍及全球34个国家。[1] -英国耆卫保险公司有多少保险客户? 主要包括人寿保险,资产管理,银行等。 -英国耆卫保险公司有多少保险客户? 1845年,Old Mutual在好望角成立。 -英国耆卫保险公司有多少保险客户? 1870年,董事长Charles Bell设计了Old Mutual公司的标记。 -英国耆卫保险公司有多少保险客户? 1910年,南非从英联邦独立出来。 -英国耆卫保险公司有多少保险客户? Old Mutual的董事长John X. Merriman被选为国家总理。 -英国耆卫保险公司有多少保险客户? 1927年,Old Mutual在Harare成立它的第一个事务所。 -英国耆卫保险公司有多少保险客户? 1960年,Old Mutual在南非成立了Mutual Unit信托公司,用来管理公司的信托业务。 -英国耆卫保险公司有多少保险客户? 1970年,Old Mutual的收入超过100百万R。 -英国耆卫保险公司有多少保险客户? 1980年,Old Mutual成为南非第一大人寿保险公司,年收入达10亿R。 -英国耆卫保险公司有多少保险客户? 1991年,Old Mutual在美国财富周刊上评选的全球保险公司中名列第38位。 -英国耆卫保险公司有多少保险客户? 1995年,Old Mutual在美国波士顿建立投资顾问公司,同年、又在香港和Guernsey建立事务所。 -英国耆卫保险公司有多少保险客户? 作为一项加强与其母公司联系的举措,OMNIA公司(百慕大)荣幸的更名为Old Mutual 公司(百慕大) 。 -英国耆卫保险公司有多少保险客户? 这一新的名称和企业识别清晰地展示出公司成为其世界金融机构合作伙伴强有力支持的决心。 -英国耆卫保险公司有多少保险客户? 2003 年4月,该公司被Old Mutual plc公司收购,更名为Sage Life(百慕大)公司并闻名于世,公司为Old Mutual公司提供了一个新的销售渠道,补充了其现有的以美元计价的产品线和分销系统。 -英国耆卫保险公司有多少保险客户? 达到了一个重要里程碑是公司成功的一个例证: 2005年6月3日公司资产超过10亿美元成为公司的一个主要里程碑,也是公司成功的一个例证。 -英国耆卫保险公司有多少保险客户? Old Mutual (百慕大)为客户提供一系列的投资产品。 -英国耆卫保险公司有多少保险客户? 在其开放的结构下,客户除了能够参与由Old Mutual会员管理的方案外,还能够参与由一些世界顶尖投资机构提供的投资选择。 -英国耆卫保险公司有多少保险客户? 首席执行官John Clifford对此发表评论说:“过去的两年对于Old Mutual家族来说是稳固发展的两年,更名是迫在眉睫的事情。 -英国耆卫保险公司有多少保险客户? 通过采用其名字和形象上的相似,Old Mutual (百慕大)进一步强化了与母公司的联系。” -英国耆卫保险公司有多少保险客户? Clifford补充道:“我相信Old Mutual全球品牌认可度和Old Mutual(百慕大)产品专业知识的结合将在未来的日子里进一步推动公司的成功。” -英国耆卫保险公司有多少保险客户? 随着公司更名而来的是公司网站的全新改版,设计投资选择信息、陈述、销售方案、营销材料和公告板块。 -英国耆卫保险公司有多少保险客户? 在美国购买不到OMNIA投资产品,该产品也不向美国公民或居民以及百慕大居民提供。 -英国耆卫保险公司有多少保险客户? 这些产品不对任何要约未得到批准的区域中的任何人,以及进行此要约或询价为非法行为的个人构成要约或询价。 -英国耆卫保险公司有多少保险客户? 关于Old Mutual(百慕大)公司 -英国耆卫保险公司有多少保险客户? Old Mutual(百慕大)公司总部位于百慕大,公司面向非美国居民及公民以及非百慕大居民,通过遍布世界的各个市场的金融机构开发和销售保险和投资方案。 -英国耆卫保险公司有多少保险客户? 这些方案由Old Mutual(百慕大)公司直接做出,向投资者提供各种投资选择和战略,同时提供死亡和其他受益保证。 -谁知道北京的淡定哥做了什么? 尼日利亚足球队守门员恩耶马被封淡定哥,原因是2010年南非世界杯上1:2落后希腊队时,对方前锋已经突破到禁区,其仍头依门柱发呆,其从容淡定令人吃惊。 -谁知道北京的淡定哥做了什么? 淡定哥 -谁知道北京的淡定哥做了什么? 在2010年6月17日的世界杯赛场上,尼日利亚1比2不敌希腊队,但尼日利亚门将恩耶马(英文名:Vincent Enyeama)在赛场上的“淡定”表现令人惊奇。 -谁知道北京的淡定哥做了什么? 随后,网友将赛场照片发布于各大论坛,恩耶马迅速窜红,并被网友称为“淡定哥”。 -谁知道北京的淡定哥做了什么? 淡定哥 -谁知道北京的淡定哥做了什么? 从网友上传得照片中可以看到,“淡定哥”在面临对方前锋突袭至小禁区之时,还靠在球门柱上发呆,其“淡定”程度的确非一般人所能及。 -谁知道北京的淡定哥做了什么? 恩耶马是尼日利亚国家队的主力守门员,目前效力于以色列的特拉维夫哈普尔队。 -谁知道北京的淡定哥做了什么? 1999年,恩耶马在尼日利亚国内的伊波姆星队开始职业生涯,后辗转恩伊姆巴、Iwuanyanwu民族等队,从07年开始,他为特拉维夫效力。 -谁知道北京的淡定哥做了什么? 恩耶马的尼日利亚国脚生涯始于2002年,截至2010年1月底,他为国家队出场已超过50次。 -谁知道北京的淡定哥做了什么? 当地时间2011年1月4日,国际足球历史与统计协会(IFFHS)公布了2010年度世界最佳门将,恩耶马(尼日利亚,特拉维夫夏普尔)10票排第十一 -谁知道北京的淡定哥做了什么? 此词经国家语言资源监测与研究中心等机构专家审定入选2010年年度新词语,并收录到《中国语言生活状况报告》中。 -谁知道北京的淡定哥做了什么? 提示性释义:对遇事从容镇定、处变不惊的男性的戏称。 -谁知道北京的淡定哥做了什么? 例句:上海现“淡定哥”:百米外爆炸他仍专注垂钓(2010年10月20日腾讯网http://news.qq.com/a/20101020/000646.htm) -谁知道北京的淡定哥做了什么? 2011年度新人物 -谁知道北京的淡定哥做了什么? 1、淡定哥(北京) -谁知道北京的淡定哥做了什么? 
7月24日傍晚,北京市出现大范围降雨天气,位于通州北苑路出现积水,公交车也难逃被淹。 -谁知道北京的淡定哥做了什么? 李欣摄图片来源:新华网一辆私家车深陷积水,车主索性盘坐在自己的汽车上抽烟等待救援。 -谁知道北京的淡定哥做了什么? 私家车主索性盘坐在自己的车上抽烟等待救援,被网友称“淡定哥” -谁知道北京的淡定哥做了什么? 2、淡定哥——林峰 -谁知道北京的淡定哥做了什么? 在2011年7月23日的动车追尾事故中,绍兴人杨峰(@杨峰特快)在事故中失去了5位亲人:怀孕7个月的妻子、未出世的孩子、岳母、妻姐和外甥女,他的岳父也在事故中受伤正在治疗。 -谁知道北京的淡定哥做了什么? 他披麻戴孝出现在事故现场,要求将家人的死因弄个明白。 -谁知道北京的淡定哥做了什么? 但在第一轮谈判过后,表示:“请原谅我,如果我再坚持,我将失去我最后的第六个亲人。” -谁知道北京的淡定哥做了什么? 如果他继续“纠缠”铁道部,他治疗中的岳父将会“被死亡”。 -谁知道北京的淡定哥做了什么? 很多博友就此批评杨峰,并讽刺其为“淡定哥”。 -071型船坞登陆舰的北约代号是什么? 071型船坞登陆舰(英语:Type 071 Amphibious Transport Dock,北约代号:Yuzhao-class,中文:玉昭级,或以首舰昆仑山号称之为昆仑山级船坞登陆舰),是中国人民解放军海军隶下的大型多功能两栖船坞登陆舰,可作为登陆艇的母舰,用以运送士兵、步兵战车、主战坦克等展开登陆作战,也可搭载两栖车辆,具备大型直升机起降甲板及操作设施。 -071型船坞登陆舰的北约代号是什么? 071型两栖登陆舰是中国首次建造的万吨级作战舰艇,亦为中国大型多功能两栖舰船的开山之作,也可以说是中国万吨级以上大型作战舰艇的试验之作,该舰的建造使中国海军的两栖舰船实力有了质的提升。 -071型船坞登陆舰的北约代号是什么? 在本世纪以前中国海军原有的两栖舰队以一 -071型船坞登陆舰的北约代号是什么? 早期071模型 -071型船坞登陆舰的北约代号是什么? 千至四千吨级登陆舰为主要骨干,这些舰艇吨位小、筹载量有限,直升机操作能力非常欠缺,舰上自卫武装普遍老旧,对于现代化两栖登陆作战可说有很多不足。 -071型船坞登陆舰的北约代号是什么? 为了应对新时期的国际国内形势,中国在本世纪初期紧急强化两栖作战能力,包括短时间内密集建造072、074系列登陆舰,同时也首度设计一种新型船坞登陆舰,型号为071。[1] -071型船坞登陆舰的北约代号是什么? 在两栖作战行动中,这些舰只不得不采取最危险的 -071型船坞登陆舰的北约代号是什么? 舾装中的昆仑山号 -071型船坞登陆舰的北约代号是什么? 敌前登陆方式实施两栖作战行动,必须与敌人预定阻击力量进行面对面的战斗,在台湾地区或者亚洲其他国家的沿海,几乎没有可用而不设防的海滩登陆地带,并且各国或者地区的陆军在战时,可能会很快控制这些易于登陆的海难和港口,这样就限制住了中国海军两栖登陆部队的实际登陆作战能力。 -071型船坞登陆舰的北约代号是什么? 071型登陆舰正是为了更快和更多样化的登陆作战而开发的新型登陆舰艇。[2] -071型船坞登陆舰的北约代号是什么? 071型两栖船坞登陆舰具有十分良好的整体隐身能力, -071型船坞登陆舰的北约代号是什么? 071型概念图 -071型船坞登陆舰的北约代号是什么? 该舰外部线条简洁干练,而且舰体外形下部外倾、上部带有一定角度的内倾,从而形成雷达隐身性能良好的菱形横剖面。 -071型船坞登陆舰的北约代号是什么? 舰体为高干舷平甲板型,长宽比较小,舰身宽满,采用大飞剪型舰首及楔形舰尾,舰的上层建筑位于舰体中间部位,后部是大型直升机甲板,适航性能非常突出。 -071型船坞登陆舰的北约代号是什么? 顶甲板上各类电子设备和武器系统布局十分简洁干净,各系统的突出物很少。 -071型船坞登陆舰的北约代号是什么? 该舰的两座烟囱实行左右分布式设置在舰体两侧,既考虑了隐身特点,也十分新颖。[3] -071型船坞登陆舰的北约代号是什么? 1号甲板及上层建筑物主要设置有指挥室、控 -071型船坞登陆舰的北约代号是什么? 舰尾俯视 -071型船坞登陆舰的北约代号是什么? 制舱、医疗救护舱及一些居住舱,其中医疗救护舱设置有完备的战场救护设施,可以在舰上为伤病员提供紧急手术和野战救护能力。 -071型船坞登陆舰的北约代号是什么? 2号甲板主要是舰员和部分登陆人员的居住舱、办公室及厨房。 -071型船坞登陆舰的北约代号是什么? 主甲板以下则是登陆舱,分前后两段,前段是装甲车辆储存舱,共两层,可以储存登陆装甲车辆和一些其它物资,在进出口处还设有一小型升降机,用于两层之间的移动装卸用。 -071型船坞登陆舰的北约代号是什么? 前段车辆储存舱外壁左右各设有一折叠式装载舱门,所有装载车辆在码头可通过该门直接装载或者登陆上岸。 -071型船坞登陆舰的北约代号是什么? 后段是一个巨型船坞登陆舱,总长约70米,主要用来停泊大小型气垫登陆艇、机械登陆艇或车辆人员登陆艇。[4] -071型船坞登陆舰的北约代号是什么? 自卫武装方面,舰艏设有一门PJ-26型76mm舰炮( -071型船坞登陆舰的北约代号是什么? 井冈山号舰首主炮 -071型船坞登陆舰的北约代号是什么? 俄罗斯AK-176M的中国仿制版,亦被054A采用) , 四具与052B/C相同的726-4 18联装干扰弹发射器分置于舰首两侧以及上层结构两侧,近迫防御则依赖四座布置于上层结构的AK-630 30mm防空机炮 。 -071型船坞登陆舰的北约代号是什么? 原本071模型的舰桥前方设有一座八联装海红-7短程防空导弹发射器,不过071首舰直到出海试航与2009年4月下旬的海上阅兵式中,都未装上此一武器。 -071型船坞登陆舰的北约代号是什么? 电子装备方面, 舰桥后方主桅杆顶配置一具363S型E/F频2D对空/平面搜索雷达 、一具Racal Decca RM-1290 I频导航雷达,后桅杆顶装备一具拥有球型外罩的364型(SR-64)X频2D对空/对海搜索雷达,此外还有一具LR-66C舰炮射控雷达、一具负责导引AK-630机炮的TR-47C型火炮射控雷达等。[5] -071型船坞登陆舰的北约代号是什么? 071型自卫武装布置 -071型船坞登陆舰的北约代号是什么? 071首舰昆仑山号于2006年6月开 -我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 竹溪县人大常委会办公室:承担人民代表大会会议、常委会会议、主任会议和常委会党组会议(简称“四会”)的筹备和服务工作。 -我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责常委会组成人员视察活动的联系服务工作。 -我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 受主任会议委托,拟定有关议案草案。 -我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 承担常委会人事任免的具体工作,负责机关人事管理和离退休干部的管理与服务。 -我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 承担县人大机关的行政事务和后勤保障工作,负责机关的安全保卫、文电处理、档案、保密、文印工作。 -我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 承担县人大常委会同市人大常委会及乡镇人大的工作联系。 -我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责信息反馈工作。 -我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 了解宪法、法律、法规和本级人大及其常委会的决议、决定实施情况及常委会成员提出建议办理情况,及时向常委会和主任会议报告。 -我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 承担人大宣传工作,负责人大常委会会议宣传的组织和联系。 -我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 组织协调各专门工作委员会开展工作。 -我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 承办上级交办的其他工作。 -我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 办公室下设五个科,即秘书科、调研科、人事任免科、综合科、老干部科。 -我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 教科文卫工作委员会:负责人大教科文卫工作的日常联系、督办、信息收集反馈和业务指导工作。 -我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 
负责教科文卫方面法律法规贯彻和人大工作情况的宣传、调研工作。 -我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 承担人大常委会教科文卫方面会议议题调查的组织联系和调研材料的起草工作。 -我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 承担教科文卫方面规范性备案文件的初审工作,侧重对教科文卫行政执法个案监督业务承办工作。 -我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责常委会组成人员和人大代表对教科文卫工作方面检查、视察的组织联系工作。 -我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 承办上级交办的其他工作。 -我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 代表工作委员会:负责与县人大代表和上级人大代表的联系、情况收集交流工作。 -我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责《代表法》的宣传贯彻和贯彻实施情况的调查研究工作。 -我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责县人大代表法律法规和人民代表大会制度知识学习的组织和指导工作。 -我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责常委会主任、副主任和委员走访联系人大代表的组织、联系工作。 -我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责组织人大系统的干部培训。 -我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责乡镇人大主席团工作的联系和指导。 -我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责人大代表建议、批评和意见办理工作的联系和督办落实。 -我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责人大代表开展活动的组织、联系工作。 -我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 承办上级交办的其他工作。 -我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 财政经济工作委员会:负责人大财政经济工作的日常联系、督办、信息收集反馈和业务指导工作。 -我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责财政经济方面法律法规贯彻和人大工作情况的宣传、调研工作。 -我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 对国民经济计划和财政预算编制情况进行初审。 -我想知道武汉常住人口有多少? 武汉,简称“汉”,湖北省省会。 -我想知道武汉常住人口有多少? 它是武昌、汉口、汉阳三镇统称。 -我想知道武汉常住人口有多少? 世界第三大河长江及其最长支流汉江横贯市区,将武汉一分为三,形成武昌、汉口、汉阳,三镇跨江鼎立的格局。 -我想知道武汉常住人口有多少? 唐朝诗人李白在此写下“黄鹤楼中吹玉笛,江城五月落梅花”,因此武汉自古又称“江城”。 -我想知道武汉常住人口有多少? 武汉是中国15个副省级城市之一,全国七大中心城市之一,全市常住人口858万人。 -我想知道武汉常住人口有多少? 华中地区最大都市,华中金融中心、交通中心、文化中心,长江中下游特大城市。 -我想知道武汉常住人口有多少? 武汉城市圈的中心城市。 -我想知道武汉常住人口有多少? [3]武昌、汉口、汉阳三地被俗称武汉三镇。 -我想知道武汉常住人口有多少? 武汉西与仙桃市、洪湖市相接,东与鄂州市、黄石市接壤,南与咸宁市相连,北与孝感市相接,形似一只自西向东的蝴蝶形状。 -我想知道武汉常住人口有多少? 在中国经济地理圈内,武汉处于优越的中心位置是中国地理上的“心脏”,故被称为“九省通衢”之地。 -我想知道武汉常住人口有多少? 武汉市历史悠久,古有夏汭、鄂渚之名。 -我想知道武汉常住人口有多少? 武汉地区考古发现的历史可以上溯距今6000年的新石器时代,其考古发现有东湖放鹰台遗址的含有稻壳的红烧土、石斧、石锛以及鱼叉。 -我想知道武汉常住人口有多少? 市郊黄陂区境内的盘龙城遗址是距今约3500年前的商朝方国宫城,是迄今中国发现及保存最完整的商代古城之一。 -我想知道武汉常住人口有多少? 现代武汉的城市起源,是东汉末年的位于今汉阳的卻月城、鲁山城,和在今武昌蛇山的夏口城。 -我想知道武汉常住人口有多少? 东汉末年,地方军阀刘表派黄祖为江夏太守,将郡治设在位于今汉阳龟山的卻月城中。 -我想知道武汉常住人口有多少? 卻月城是武汉市区内已知的最早城堡。 -我想知道武汉常住人口有多少? 223年,东吴孙权在武昌蛇山修筑夏口城,同时在城内的黄鹄矶上修筑了一座瞭望塔——黄鹤楼。 -我想知道武汉常住人口有多少? 苏轼在《前赤壁赋》中说的“西望夏口,东望武昌”中的夏口就是指武汉(而当时的武昌则是今天的鄂州)。 -我想知道武汉常住人口有多少? 南朝时,夏口扩建为郢州,成为郢州的治所。 -我想知道武汉常住人口有多少? 隋置江夏县和汉阳县,分别以武昌,汉阳为治所。 -我想知道武汉常住人口有多少? 唐时江夏和汉阳分别升为鄂州和沔州的州治,成为长江沿岸的商业重镇。 -我想知道武汉常住人口有多少? 江城之称亦始于隋唐。 -我想知道武汉常住人口有多少? 两宋时武昌属鄂州,汉阳汉口属汉阳郡。 -我想知道武汉常住人口有多少? 经过发掘,武汉出土了大量唐朝墓葬,在武昌马房山和岳家咀出土了灰陶四神砖以及灰陶十二生肖俑等。 -我想知道武汉常住人口有多少? 宋代武汉的制瓷业发达。 -我想知道武汉常住人口有多少? 在市郊江夏区梁子湖旁发现了宋代瓷窑群100多座,烧制的瓷器品种很多,釉色以青白瓷为主。 -我想知道武汉常住人口有多少? 南宋诗人陆游在经过武昌时,写下“市邑雄富,列肆繁错,城外南市亦数里,虽钱塘、建康不能过,隐然一大都会也”来描写武昌的繁华。 -我想知道武汉常住人口有多少? 南宋抗金将领岳飞驻防鄂州(今武昌)8年,在此兴师北伐。 -我想知道武汉常住人口有多少? 元世祖至元十八年(1281年),武昌成为湖广行省的省治。 -我想知道武汉常住人口有多少? 这是武汉第一次成为一级行政单位(相当于现代的省一级)的治所。 -列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 列夫·达维多维奇,托洛茨基是联共(布)党内和第三国际时期反对派的领导人,托派"第四国际"的创始人和领导人。 -列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 列夫·达维多维奇·托洛茨基 -列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 列夫·达维多维奇·托洛茨基(俄国与国际历史上最重要的无产阶级革命家之一,二十世纪国际共产主义运动中最具争议的、也是备受污蔑的左翼反对派领袖,他以对古典马克思主义“不断革命论”的独创性发展闻名于世,第三共产国际和第四国际的主要缔造者之一(第三国际前三次代表大会的宣言执笔人)。 -列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 在1905年俄国革命中被工人群众推举为彼得堡苏维埃主席(而当时布尔什维克多数干部却还在讨论是否支持苏维埃,这些干部后来被赶回俄国的列宁痛击)。 -列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 1917年革命托洛茨基率领“区联派”与列宁派联合,并再次被工人推举为彼得格勒苏维埃主席。 -列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 对于十月革命这场20世纪最重大的社会革命,托洛茨基赢得了不朽的历史地位。 -列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 后来成了托洛茨基死敌的斯大林,当时作为革命组织领导者之一却写道:“起义的一切实际组织工作是在彼得格勒苏维埃主席托洛茨基同志直接指挥之下完成的。 -列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 我们可以确切地说,卫戍部队之迅速站在苏维埃方面来,革命军事委员会的工作之所以搞得这样好,党认为这首先要归功于托洛茨基同志。” -列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? (值得一提的是,若干年后,当反托成为政治需要时,此类评价都从斯大林文章中删掉了。 -列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? )甚至连后来狂热的斯大林派雅克·沙杜尔,当时却也写道:“托洛茨基在十月起义中居支配地位,是起义的钢铁灵魂。” -列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? (苏汉诺夫《革命札记》第6卷P76。 -列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 
)不仅在起义中,而且在无产阶级政权的捍卫、巩固方面和国际共产主义革命方面,托洛茨基也作出了极其卓越的贡献(外交官-苏联国际革命政策的负责人、苏联红军缔造者以及共产国际缔造者)。 -列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 革命后若干年里,托洛茨基与列宁的画像时常双双并列挂在一起;十月革命之后到列宁病逝之前,布尔什维克历次全国代表大会上,代表大会发言结束均高呼口号:“我们的领袖列宁和托洛茨基万岁!” -列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 在欧美共运中托洛茨基的威望非常高。 -列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 后人常常认为托洛茨基只是一个知识分子文人,实际上他文武双全,而且谙熟军事指挥艺术,并且亲临战场。 -列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 正是他作为十月革命的最高军事领袖(在十月革命期间他与士兵一起在战壕里作战),并且在1918年缔造并指挥苏联红军,是一个杰出的军事家(列宁曾对朋友说,除了托洛茨基,谁还能给我迅速地造成一支上百万人的强大军队? -列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? )。 -列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 在内战期间,他甚至坐装甲列车冒着枪林弹雨亲临战场指挥作战,差点挨炸死;当反革命军队进攻彼得堡时,当时的彼得堡领导人季诺维也夫吓得半死,托洛茨基却从容不迫指挥作战。 -列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 同时托洛茨基又是一个高明的外交家,他曾强硬地要求英国政府释放因反战宣传被囚禁在英国的俄国流亡革命者,否则就不许英国公民离开俄国,连英国政府方面都觉得此举无懈可击;他并且把居高临下的法国到访者当场轰出他的办公室(革命前法国一直是俄国的头号债主与政治操纵者),却彬彬有礼地欢迎前来缓和冲突的法国大使;而在十月革命前夕,他对工人代表议会质询的答复既保守了即将起义的军事秘密,又鼓舞了革命者的战斗意志,同时严格遵循现代民主与公开原则,这些政治答复被波兰人多伊彻誉为“外交辞令的杰作”(伊·多伊彻的托氏传记<先知三部曲·武装的先知>第九章P335,第十一章P390)。 -列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 托洛茨基在国民经济管理与研究工作中颇有创造:是苏俄新经济政策的首先提议者以及社会主义计划经济的首先实践者。 -列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 1928年斯大林迟迟开始的计划经济实验,是对1923年以托洛茨基为首的左翼反对派经济纲领的拙劣剽窃和粗暴翻版。 -列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 因为统治者的政策迟到,使得新经济政策到1928年已产生了一个威胁政权生存的农村资产阶级,而苏俄工人阶级国家不得不强力解决——而且是不得不借助已蜕化为官僚集团的强力来解决冲突——结果导致了1929年到30年代初的大饥荒和对农民的大量冤枉错杀。 -列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 另外,他还对文学理论有很高的造诣,其著作<文学与革命>甚至影响了整整一代的国际左翼知识分子(包括中国的鲁迅、王实味等人)。 -列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 他在哈佛大学图书馆留下了100多卷的<托洛茨基全集>,其生动而真诚的自传和大量私人日记、信件,给人留下了研究人类生活各个方面的宝贵财富,更是追求社会进步与解放的历史道路上的重要知识库之一。 -列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 托洛茨基1879年10月26日生于乌克兰赫尔松县富裕农民家庭,祖籍是犹太人。 -列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 原姓布隆施泰因。 -列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 1896年开始参加工人运动。 -列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 1897年 ,参加建立南俄工人协会 ,反对沙皇专制制度。 -列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 1898年 在尼古拉也夫组织工人团体,被流放至西伯利亚。 -列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 1902年秋以署名托洛茨基之假护照逃到伦敦,参加V.I.列宁、G.V.普列汉诺夫等人主编的<火星报>的工作。 -谁知道洞庭湖大桥有多长? 洞庭湖大桥,位于洞庭湖与长江交汇处,东接岳阳市区洞庭大道和107国道、京珠高速公路,西连省道306线,是国内目前最长的内河公路桥。 -谁知道洞庭湖大桥有多长? 路桥全长10173.82m,其中桥长5747.82m,桥宽20m,西双向四车道,是我国第一座三塔双索面斜拉大桥,亚洲首座不等高三塔双斜索面预应力混凝土漂浮体系斜拉桥。 -谁知道洞庭湖大桥有多长? 洞庭湖大桥是我国最长的内河公路桥,大桥横跨东洞庭湖区,全长10174.2米,主桥梁长5747.8米。 -谁知道洞庭湖大桥有多长? 大桥的通车使湘、鄂间公路干线大为畅通,并为洞庭湖区运输抗洪抢险物资提供了一条快速通道该桥设计先进,新颖,造型美观,各项技求指标先进,且为首次在国内特大型桥梁中采用主塔斜拉桥结构体系。 -谁知道洞庭湖大桥有多长? 洞庭湖大桥是湖区人民的造福桥,装点湘北门户的形象桥,对优化交通网络绪构,发展区域经济,保障防汛救灾,缩短鄂、豫、陕等省、市西部车辆南下的运距,拓展岳阳城区的主骨架,提升岳阳城市品位,增强城市辐射力,有着十分重要的意义。 -谁知道洞庭湖大桥有多长? 自1996年12月开工以来,共有10支施工队伍和两支监理队伍参与了大桥的建设。 -谁知道洞庭湖大桥有多长? 主桥桥面高52米(黄海),设计通航等级Ⅲ级。 -谁知道洞庭湖大桥有多长? 主桥桥型为不等高三塔、双索面空间索、全飘浮体系的预应力钢筋混凝土肋板梁式结构的斜拉桥,跨径为130+310+310+130米。 -谁知道洞庭湖大桥有多长? 索塔为双室宝石型断面,中塔高为125.684米,两边塔高为99.311米。 -谁知道洞庭湖大桥有多长? 三塔基础为3米和3.2米大直径钻孔灌注桩。 -谁知道洞庭湖大桥有多长? 引桥为连续梁桥,跨径20至50米,基础直径为1.8和2.5米钻孔灌注桩。 -谁知道洞庭湖大桥有多长? 该桥设计先进、新颖、造型美观,各项技求指标先进,且为首次在国内特大型桥梁中采用主塔斜拉桥结构体系,岳阳洞庭湖大桥是我国首次采用不等高三塔斜拉桥桥型的特大桥,设计先进,施工难度大位居亚洲之首,是湖南省桥梁界的一大科研项目。 -谁知道洞庭湖大桥有多长? 洞庭湖大桥设计为三塔斜拉桥,空间双斜面索,主梁采用前支点挂篮施工,并按各种工况模拟挂篮受力进行现场试验,获得了大量有关挂篮受力性能和实际刚度的计算参数,作为施工控制参数。 -谁知道洞庭湖大桥有多长? 利用组合式模型单元,推导了斜拉桥分离式双肋平板主梁的单元刚度矩阵,并进行了岳阳洞庭湖大桥的空间受力分析,结果表明此种单元精度满足工程要求,同时在施工工艺方面也积累了成功经验。 -谁知道洞庭湖大桥有多长? 洞庭湖大桥的通车使湘、鄂间公路干线大为畅通,并为洞庭湖区抗洪抢险物资运输提供了一条快速通道。 -谁知道洞庭湖大桥有多长? 湖大桥设计先进,造型美丽,科技含量高。 -谁知道洞庭湖大桥有多长? 洞庭大桥还是一道美丽的风景线,大桥沿岸风景与岳阳楼,君山岛、洞庭湖等风景名胜融为一体,交相辉映,成为世人了解岳阳的又一崭新窗口,也具有特别旅游资源。 -谁知道洞庭湖大桥有多长? 洞庭湖大桥多塔斜拉桥新技术研究荣获国家科学技术进步二等奖、湖南省科学技术进步一等奖,并获第五届詹天佑大奖。 -谁知道洞庭湖大桥有多长? 大桥在中国土木工程学会2004年第16届年会上入选首届《中国十佳桥梁》,名列斜拉桥第二位。 -谁知道洞庭湖大桥有多长? 2001年荣获湖南省建设厅优秀设计一等奖,省优秀勘察一等奖。 -谁知道洞庭湖大桥有多长? 2003年荣获国家优秀工程设计金奖, "十佳学术活动"奖。 -天气预报员的布景师是谁? 芝加哥天气预报员大卫(尼古拉斯·凯奇),被他的粉丝们热爱,也被诅咒--这些人在天气不好的时候会迁怒于他,而大部分时候,大卫都是在预报坏天气。 -天气预报员的布景师是谁? ?不过,这也没什么,当一家国家早间新闻节目叫他去面试的时候,大卫的事业似乎又将再创新高。 -天气预报员的布景师是谁? 
芝加哥天气预报员大卫(尼古拉斯·凯奇),被他的粉丝们热爱,也被诅咒--这些人在天气不好的时候会迁怒于他,而大部分时候,大卫都是在预报坏天气。 -天气预报员的布景师是谁? 不过,这也没什么,当一家国家早间新闻节目叫他去面试的时候,大卫的事业似乎又将再创新高。 -天气预报员的布景师是谁? 在电视节目上,大卫永远微笑,自信而光鲜,就像每一个成功的电视人一样,说起收入,他也绝对不落人后。 -天气预报员的布景师是谁? 不过,大卫的个人生活可就不那么如意了。 -天气预报员的布景师是谁? 与妻子劳伦(霍普·戴维斯)的离婚一直让他痛苦;儿子迈克吸大麻上瘾,正在进行戒毒,可戒毒顾问却对迈克有着异样的感情;女儿雪莉则体重惊人,总是愁眉苦脸、孤独寂寞;大卫的父亲罗伯特(迈克尔·凯恩),一个世界著名的小说家,虽然罗伯特不想再让大卫觉得负担过重,可正是他的名声让大卫的一生都仿佛处在他的阴影之下,更何况,罗伯特就快重病死了。 -天气预报员的布景师是谁? 和妻子的离婚、父亲的疾病、和孩子之间完全不和谐的关系,都让大卫每天头疼,而每次当他越想控制局面,一切就越加复杂。 -天气预报员的布景师是谁? 然而就在最后人们再也不会向他扔快餐,或许是因为他总是背着弓箭在大街上走。 -天气预报员的布景师是谁? 最后,面对那份高额工作的接受意味着又一个新生活的开始。 -天气预报员的布景师是谁? 也许,生活就像天气,想怎么样就怎么样,完全不可预料。 -天气预报员的布景师是谁? 导 演:戈尔·维宾斯基 Gore Verbinski -天气预报员的布景师是谁? 编 剧:Steve Conrad .....(written by) -天气预报员的布景师是谁? 演 员:尼古拉斯·凯奇 Nicolas Cage .....David Spritz -天气预报员的布景师是谁? 尼古拉斯·霍尔特 Nicholas Hoult .....Mike -天气预报员的布景师是谁? 迈克尔·凯恩 Michael Caine .....Robert Spritzel -天气预报员的布景师是谁? 杰蒙妮·德拉佩纳 Gemmenne de la Peña .....Shelly -天气预报员的布景师是谁? 霍普·戴维斯 Hope Davis .....Noreen -天气预报员的布景师是谁? 迈克尔·瑞斯玻利 Michael Rispoli .....Russ -天气预报员的布景师是谁? 原创音乐:James S. Levine .....(co-composer) (as James Levine) -天气预报员的布景师是谁? 汉斯·兹米尔 Hans Zimmer -天气预报员的布景师是谁? 摄 影:Phedon Papamichael -天气预报员的布景师是谁? 剪 辑:Craig Wood -天气预报员的布景师是谁? 选角导演:Denise Chamian -天气预报员的布景师是谁? 艺术指导:Tom Duffield -天气预报员的布景师是谁? 美术设计:Patrick M. Sullivan Jr. .....(as Patrick Sullivan) -天气预报员的布景师是谁? 布景师 :Rosemary Brandenburg -天气预报员的布景师是谁? 服装设计:Penny Rose -天气预报员的布景师是谁? 视觉特效:Charles Gibson -天气预报员的布景师是谁? David Sosalla .....Pacific Title & Art Studio -韩国国家男子足球队教练是谁? 韩国国家足球队,全名大韩民国足球国家代表队(???? ?? ?????),为韩国足球协会所于1928年成立,并于1948年加入国际足球协会。 -韩国国家男子足球队教练是谁? 韩国队自1986年世界杯开始,从未缺席任何一届决赛周。 -韩国国家男子足球队教练是谁? 在2002年世界杯,韩国在主场之利淘汰了葡萄牙、意大利及西班牙三支欧洲强队,最后夺得了殿军,是亚洲球队有史以来最好成绩。 -韩国国家男子足球队教练是谁? 在2010年世界杯,韩国也在首圈分组赛压倒希腊及尼日利亚出线次圈,再次晋身十六强,但以1-2败给乌拉圭出局。 -韩国国家男子足球队教练是谁? 北京时间2014年6月27日3时,巴西世界杯小组赛H组最后一轮赛事韩国对阵比利时,韩国队0-1不敌比利时,3场1平2负积1分垫底出局。 -韩国国家男子足球队教练是谁? 球队教练:洪明甫 -韩国国家男子足球队教练是谁? 韩国国家足球队,全名大韩民国足球国家代表队(韩国国家男子足球队???? ?? ?????),为韩国足球协会所于1928年成立,并于1948年加入国际足联。 -韩国国家男子足球队教练是谁? 韩国队是众多亚洲球队中,在世界杯表现最好,他们自1986年世界杯开始,从未缺席任何一届决赛周。 -韩国国家男子足球队教练是谁? 在2002年世界杯,韩国在主场之利淘汰了葡萄牙、意大利及西班牙三支欧洲强队,最后夺得了殿军,是亚洲球队有史以来最好成绩。 -韩国国家男子足球队教练是谁? 在2010年世界杯,韩国也在首圈分组赛压倒希腊及尼日利亚出线次圈,再次晋身十六强,但以1-2败给乌拉圭出局。 -韩国国家男子足球队教练是谁? 2014年世界杯外围赛,韩国在首轮分组赛以首名出线次轮分组赛,与伊朗、卡塔尔、乌兹别克以及黎巴嫩争逐两个直接出线决赛周资格,最后韩国仅以较佳的得失球差压倒乌兹别克,以小组次名取得2014年世界杯决赛周参赛资格,也是韩国连续八次晋身世界杯决赛周。 -韩国国家男子足球队教练是谁? 虽然韩国队在世界杯成绩为亚洲之冠,但在亚洲杯足球赛的成绩却远不及世界杯。 -韩国国家男子足球队教练是谁? 韩国只在首两届亚洲杯(1956年及1960年)夺冠,之后五十多年未能再度称霸亚洲杯,而自1992年更从未打入过决赛,与另一支东亚强队日本近二十年来四度在亚洲杯夺冠成强烈对比。[1] -韩国国家男子足球队教练是谁? 人物简介 -韩国国家男子足球队教练是谁? 车范根(1953年5月22日-)曾是大韩民国有名的锋线选手,他被欧洲媒体喻为亚洲最佳输出球员之一,他也被认为是世界最佳足球员之一。 -韩国国家男子足球队教练是谁? 他被国际足球史料与数据协会评选为20世纪亚洲最佳球员。 -韩国国家男子足球队教练是谁? 他在85-86赛季是德甲的最有价值球员,直到1999年为止他都是德甲外国球员入球纪录保持者。 -韩国国家男子足球队教练是谁? 德国的球迷一直没办法正确说出他名字的发音,所以球车范根(左)迷都以炸弹车(Cha Boom)称呼他。 -韩国国家男子足球队教练是谁? 这也代表了他强大的禁区得分能力。 -韩国国家男子足球队教练是谁? 职业生涯 -韩国国家男子足球队教练是谁? 车范根生于大韩民国京畿道的华城市,他在1971年于韩国空军俱乐部开始了他的足球员生涯;同年他入选了韩国19岁以下国家足球队(U-19)。 -韩国国家男子足球队教练是谁? 隔年他就加入了韩国国家足球队,他是有史以来加入国家队最年轻的球员。 -韩国国家男子足球队教练是谁? 车范根在27岁时前往德国发展,当时德甲被认为是世界上最好的足球联赛。 -韩国国家男子足球队教练是谁? 他在1978年12月加入了达姆施塔特,不过他在那里只待了不到一年就转到当时的德甲巨人法兰克福。 -韩国国家男子足球队教练是谁? 车范根很快在新俱乐部立足,他帮助球队赢得79-80赛季的欧洲足协杯。 -韩国国家男子足球队教练是谁? 在那个赛季过后,他成为德甲薪水第三高的球员,不过在1981年对上勒沃库森的一场比赛上,他的膝盖严重受伤,几乎毁了他的足球生涯。 -韩国国家男子足球队教练是谁? 在1983年车范根转投勒沃库森;他在这取得很高的成就,他成为85-86赛季德甲的最有价值球员,并且在1988年帮助球队拿下欧洲足协杯,也是他个人第二个欧洲足协杯。 -韩国国家男子足球队教练是谁? 他在决赛对垒西班牙人扮演追平比分的关键角色,而球会才在点球大战上胜出。 -韩国国家男子足球队教练是谁? 车范根在1989年退休,他在308场的德甲比赛中进了98球,一度是德甲外国球员的入球纪录。 -韩国国家男子足球队教练是谁? 执教生涯 -国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 
国立台湾科技大学,简称台湾科大、台科大或台科,是位于台湾台北市大安区的台湾第一所高等技职体系大专院校,现为台湾最知名的科技大学,校本部比邻国立台湾大学。 -国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 该校已于2005年、2008年持续入选教育部的“发展国际一流大学及顶尖研究中心计划”。 -国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? “国立”台湾工业技术学院成立于“民国”六十三年(1974)八月一日,为台湾地区第一所技术职业教育高等学府。 -国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 建校之目的,在因应台湾地区经济与工业迅速发展之需求,以培养高级工程技术及管理人才为目标,同时建立完整之技术职业教育体系。 -国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? “国立”台湾工业技术学院成立于“民国”六十三年(1974)八月一日,为台湾地区第一所技术职业教育高等学府。 -国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 建校之目的,在因应台湾地区经济与工业迅速发展之需求,以培养高级工程技术及管理人才为目标,同时建立完整之技术职业教育体系。 -国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 本校校地约44.5公顷,校本部位于台北市基隆路四段四十三号,。 -国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 民国68年成立硕士班,民国71年成立博士班,现有大学部学生5,664人,研究生4,458人,专任教师451位。 -国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 2001年在台湾地区教育部筹划之研究型大学(“国立”大学研究所基础教育重点改善计画)中,成为全台首批之9所大学之一 。 -国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 自2005年更在“教育部”所推动“五年五百亿 顶尖大学”计划下,遴选为适合发展成“顶尖研究中心”的11所研究型大学之一。 -国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 国立台湾科技大学部设有二年制、四年制及工程在职人员进修班等三种学制;凡二专、三专及五专等专科学校以上之毕业生,皆可以报考本校大学部二年制,而高职、高中毕业生,可以报考本校大学部四年制。 -国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 工业管理、电子工程、机械工程、营建工程及应用外语系等,则设有在职人员进修班学制,其招生对象为在职人员,利用夜间及暑假期间上课。 -国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 凡在本校大学部修毕应修学分且成绩及格者皆授予学士学位。 -国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 国立台湾科技大学目前设有工程、电资、管理、设计、人文社会及精诚荣誉等六个学院,分别有机械、材料科学与工程、营建、化工、电子、电机、资工、工管、企管、资管、建筑、工商业设计、应用外语等13个系及校内招生之财务金融学士学位学程、科技管理学士学位学程;全校、工程、电资、管理、创意设计等五个不分系菁英班及光电研究所、管理研究所、财务金融研究所、科技管理研究所、管理学院MBA、数位学习教育研究所、医学工程研究所、自动化及控制研究所、工程技术研究所、专利研究所等独立研究所,此外尚有人文学科负责人文及社会类等课程之教学,通识学科负责法律、音乐、环保类等课程之教学,以及师资培育中心专以培养学生未来担任中等学校工、商、管理、设计等科之合格教师,合计23个独立系所、师资培育中心、人文学科及通识学科等教学单位。 -国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 国立台湾科技大学至今各系所毕业校友已达约56,456位,毕业生出路包含出国继续深造、在台深造以及投身于产业界。 -国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 由于实作经验丰富,理论基础完备,工作态度认真,毕业校友担任政府要职、大学教授、大学校长及企业主管者众多,深受各界的肯定。 -国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 工商业设计系副教授孙春望与硕一生全明远耗时两个月自制之三分钟动画短片“立体悲剧”。 -国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 本片入选有“动画奥斯卡”之称的“ACM SIGGRAPH”国际动画展,并获得观众票选第一名,这也是台湾首次入选及获奖的短片。 -国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 击败了好莱坞知名导演史蒂芬·史匹柏的“世界大战”、乔治卢卡斯的“星际大战三部曲”、梦工厂出品的动画“马达加斯加”、军机缠斗片“机战未来”及美国太空总署、柏克莱加州大学等好莱坞名片及顶尖学术单位制作的短片。 -国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 2009年荣获有工业设计界奥斯卡奖之称的“德国iF设计大奖”国立台湾科技大学设计学院获得大学排名的全球第二,仅次于韩国三星美术设计学院“SADI”。 -国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 总体排名 依据《泰晤士高等教育》(THES-QS)在2009年的世界大学排名调查,台科大排名全世界第351名,在台湾所有大学中排名第五,仅次于台大,清大,成大及阳明,并且是台湾唯一进入世界四百大名校的科技大学。 -国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 依据在欧洲拥有广大声誉的“Eduniversal商学院排名网”2008年的资料,台湾有七所大学的商管学院被分别列入世界1000大商学院,其中台科大位在“卓越商学院”(EXCELLENT Business Schools,国内主要)之列,“推荐程度”(Recommendation Rate)为全台第四,仅次于台大、政大、中山,与交大并列。 -国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 目前设有工程、电资、管理、设计、人文社会及精诚荣誉学院等六个学院。 -国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 预计于竹北新校区设立产学合作学院及应用理学院。 -国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? ●台湾建筑科技中心 -国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? ●智慧型机械人研究中心科技成果展示(15张) -国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? ●台湾彩卷与博彩研究中心 -国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? ●电力电子技术研发中心 -国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? ●NCP-Taiwan办公室 -国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? ●资通安全研究与教学中心 -在日本,神道最初属于什么信仰? 神道又称天道,语出《易经》“大观在上,顺而巽,中正以观天下。 -在日本,神道最初属于什么信仰? 观,盥而不荐,有孚顒若,下观而化也。 -在日本,神道最初属于什么信仰? 
观天之神道,而四时不忒,圣人以神道设教,而天下服矣”。 -在日本,神道最初属于什么信仰? 自汉以降,神道又指“墓前开道,建石柱以为标”。 -在日本,神道最初属于什么信仰? 在中医中,神道,经穴名。 -在日本,神道最初属于什么信仰? 出《针灸甲乙经》。 -在日本,神道最初属于什么信仰? 别名冲道。 -在日本,神道最初属于什么信仰? 属督脉。 -在日本,神道最初属于什么信仰? 宗教中,神道是日本的本土传统民族宗教,最初以自然崇拜为主,属于泛灵多神信仰(精灵崇拜),视自然界各种动植物为神祇。 -在日本,神道最初属于什么信仰? 神道又称天道,语出《易经》“大观在上,顺而巽,中正以观天下。 -在日本,神道最初属于什么信仰? 观,盥而不荐,有孚顒若,下观而化也。 -在日本,神道最初属于什么信仰? 观天之神道,而四时不忒,圣人以神道设教,而天下服矣”。 -在日本,神道最初属于什么信仰? 自汉以降,神道又指“墓前开道,建石柱以为标”。 -在日本,神道最初属于什么信仰? 在中医中,神道,经穴名。 -在日本,神道最初属于什么信仰? 出《针灸甲乙经》。 -在日本,神道最初属于什么信仰? 别名冲道。 -在日本,神道最初属于什么信仰? 属督脉。 -在日本,神道最初属于什么信仰? 宗教中,神道是日本的本土传统民族宗教,最初以自然崇拜为主,属于泛灵多神信仰(精灵崇拜),视自然界各种动植物为神祇。 -在日本,神道最初属于什么信仰? 谓鬼神赐福降灾神妙莫测之道。 -在日本,神道最初属于什么信仰? 《易·观》:“观天之神道,而四时不忒,圣人以神道设教,而天下服矣。” -在日本,神道最初属于什么信仰? 孔颖达 疏:“微妙无方,理不可知,目不可见,不知所以然而然,谓之神道。” -在日本,神道最初属于什么信仰? 《文选·王延寿<鲁灵光殿赋>》:“敷皇极以创业,协神道而大宁。” -在日本,神道最初属于什么信仰? 张载 注:“协和神明之道,而天下大宁。” -在日本,神道最初属于什么信仰? 南朝 梁 刘勰 《文心雕龙·正纬》:“夫神道阐幽,天命微显。” -在日本,神道最初属于什么信仰? 鲁迅 《中国小说史略》第五篇:“﹝ 干宝 ﹞尝感於其父婢死而再生,及其兄气绝复苏,自言见天神事,乃撰《搜神记》二十卷,以‘发明神道之不诬’。” -在日本,神道最初属于什么信仰? 神道设教 观卦里面蕴含着《易经》固有的诸如神道设教、用舍行藏、以德化民等思想,是孔子把这些思想发掘出来。 -在日本,神道最初属于什么信仰? 「据此是孔子见当时之人,惑于吉凶祸福,而卜筮之史,加以穿凿傅会,故演易系辞,明义理,切人事,借卜筮以教后人,所谓以神道设教,其所发明者,实即羲文之义理,而非别有义理,亦非羲文并无义理,至孔子始言义理也,当即朱子之言而小变之曰,易为卜筮作,实为义理作,伏羲文王之易,有占而无文,与今人用火珠林起课者相似,孔子加卦爻辞如签辞,纯以理言,实即羲文本意,则其说分明无误矣。」 -在日本,神道最初属于什么信仰? 孔子所发掘的《易经》思想与孔子在《论语》书中表现出来的思想完全一致。 -在日本,神道最初属于什么信仰? 《易传》的思想反映了孔子的思想,这个思想是《周易》的,也是孔子的。 -在日本,神道最初属于什么信仰? 在《周易》和孔子看来,神不是有意识的人格化的上帝。 -奥林匹克里昂获得了几连霸? 里昂 Lyon 全名 Olympique lyonnais 绰号 Les Gones、OL 成立 1950年 城市 法国,里昂 主场 热尔兰球场(Stade Gerland) 容纳人数 41,044人 主席 奥拉斯 主教练 雷米·加尔德 联赛 法国足球甲级联赛 2013–14 法甲,第 5 位 网站 官方网站 主场球衣 客场球衣 第三球衣 日尔兰体育场 奥林匹克里昂(Olympique lyonnais,简称:OL及Lyon,中文简称里昂)是一间位于法国东南部罗纳-阿尔卑斯区的里昂市的足球会,成立于1950年8月3日,前身为里昂·奥林匹克(Lyon Olympique)体育俱乐部其中一个分支的足球队,1889年离开体育俱乐部自立门户成立新俱乐部,但官方网站表示俱乐部于1950年正式成立。 -奥林匹克里昂获得了几连霸? 现时在法国足球甲级联赛比赛,俱乐部同时设立男子及女子足球队。 -奥林匹克里昂获得了几连霸? 里昂是首届法国足球甲级联赛成员之一,可惜名列第十五位而降落乙组,1951年以乙级联赛冠军获得创会后首次锦标。 -奥林匹克里昂获得了几连霸? 球队在法国足球史上没有取得辉煌成绩,比较优异的算是六十年代曾杀入欧洲杯赛冠军杯四强,及3度晋身法国杯决赛并2次成功获冠。 -奥林匹克里昂获得了几连霸? 直至九十年代末里昂由辛天尼带领,先连续取得联赛头三名,到2002年终于首次登上法国顶级联赛冠军宝座,同年勒冈(Paul Le Guen)接替执教法国国家足球队的辛天尼,他其后继续带领里昂保持气势,加上队中球员小儒尼尼奧、迪亚拉、克里斯蒂亞諾·馬克斯·戈麥斯、迈克尔·埃辛、西德尼·戈武及门将格雷戈里·库佩表现突出,2003年至2005年横扫3届联赛冠军,创下连续四年夺得联赛锦标,平了1960年代末圣艾蒂安及1990年代初马赛的四连冠纪录。 -奥林匹克里昂获得了几连霸? 2005年前利物浦主教练热拉尔·霍利尔重返法国担任新任主教练,并加入葡萄牙中场蒂亚戈,和前巴伦西亚前锋约翰·卡鲁。 -奥林匹克里昂获得了几连霸? 他亦成功带领里昂赢得一届法甲冠军。 -奥林匹克里昂获得了几连霸? 2007年里昂成为首支上市的法国足球俱乐部,招股价21至24.4欧元,发行370万股,集资8400万欧元[1]。 -奥林匹克里昂获得了几连霸? 2007年4月21日,联赛次名图卢兹二比三不敌雷恩,令处于榜首的里昂领先次席多达17分距离,里昂因此提前六轮联赛庆祝俱乐部连续第六年夺得联赛冠军,亦是欧洲五大联赛(英格兰、德国、西班牙、意大利及法国)历史上首支联赛六连冠队伍[2]。 -奥林匹克里昂获得了几连霸? 在2007-08年赛季,里昂再一次成功卫冕联赛锦标,达成七连霸伟业。 -奥林匹克里昂获得了几连霸? 不过在2008-09赛季,里昂排名法甲第三位,联赛冠军被波尔多所获得。 -奥林匹克里昂获得了几连霸? 于2010年4月,里昂以两回合3比2的比分于欧洲冠军联赛击败波尔多跻身四强,此乃里昂首次晋级此项顶级杯赛的四强阶段。 -奥林匹克里昂获得了几连霸? 粗体字为新加盟球员 -奥林匹克里昂获得了几连霸? 以下球员名单更新于2014年8月27日,球员编号参照 官方网站,夏季转会窗为6月9日至8月31日 -火柴人刺杀行动怎么才能过关? 移动鼠标控制瞄准,点击鼠标左键进行射击。 -火柴人刺杀行动怎么才能过关? 游戏加载完成后点击STARTGAME-然后点击STARTMISSION即可开始游戏。 -火柴人刺杀行动怎么才能过关? 这里不仅仅考验的是你的枪法而且最重要的是你的智慧,喜欢火柴人类型游戏的玩家可以进来小试身手。 -火柴人刺杀行动怎么才能过关? 控制瞄准,刺杀游戏中的目标人物即可过关哦。 -你知道2月14日西方情人节是因何起源的吗? 情人节(英语:Valentine's Day),情人节的起源有多个版本,其中一个说法是在公元三世纪,古罗马暴君为了征召更多士兵,禁止婚礼,一名叫瓦伦丁Valentine的修士不理禁令,秘密替人主持婚礼,结果被收监,最后处死。 -你知道2月14日西方情人节是因何起源的吗? 而他死的那天就是2月14日,为纪念Valentine的勇敢精神,人们将每年的2月14日定为Valentine的纪念日。 -你知道2月14日西方情人节是因何起源的吗? 因此成了后来的“情人节”。 -你知道2月14日西方情人节是因何起源的吗? 另外,据记载,教宗在公元496年废除牧神节,把2月14日定为圣瓦伦丁日,即是St.Valentine's Day,后来成为是西方的节日之一。 -你知道2月14日西方情人节是因何起源的吗? 中文名称:情人节 -你知道2月14日西方情人节是因何起源的吗? 外文名称:Valentine‘s Day -你知道2月14日西方情人节是因何起源的吗? 别名:情人节圣瓦伦丁节 -你知道2月14日西方情人节是因何起源的吗? 公历日期:2月14日 -你知道2月14日西方情人节是因何起源的吗? 起源时间:公元270年2月14日 -你知道2月14日西方情人节是因何起源的吗? 
起源事件:人们为了纪念为情人做主而牺牲的瓦伦丁神父,把他遇害的那一天(2月14日)称为情人节。 -你知道2月14日西方情人节是因何起源的吗? 地区:欧美地区 -你知道2月14日西方情人节是因何起源的吗? 宗教:基督教 -你知道2月14日西方情人节是因何起源的吗? 其他信息:西方的传统节日之一。 -你知道2月14日西方情人节是因何起源的吗? 男女在这一天互送礼物(如贺卡和玫瑰花等)用以表达爱意或友好。 -你知道2月14日西方情人节是因何起源的吗? 据台湾“今日台湾人讨厌情人节新闻网”报道,西洋情人节即将来到,求职网进行“办公室恋情及情人节调查”发现,在目前全台上班族的感情状态中,有情人相伴的比率约5成5,4成5的上班族单身;较出乎意料的结果是,情人节以近3成(28%)的占比,登上最讨厌的节日第一名,端午节以24.3%居第二;农历年则以18.2%居第三;第四名是圣诞节,占12.4%。 -你知道2月14日西方情人节是因何起源的吗? 调查指出,情人节对单身族来说,不仅成为压力,也显得更加孤单,在情人节当天,单身的上班族有将近4成(39.1%)的人在家看电视度过,近两成(18.7%)上网聊天,有1成4(14.8%)的人,不畏满街闪光,勇气十足出门看电影,近1成(9.7%)的上班族选择留在公司加班;另外有 5.4%的人,会在情人节当天积极参加联谊,希望能改变自己的感情状态。 -你知道2月14日西方情人节是因何起源的吗? 情侣们在情人节当天,庆祝方式以吃浪漫大餐最多(37.1%),不过有近3成(27%)的情侣,在情人节当天不会特别庆祝情人节,且这个比率远比第三名的旅游(占比11.5%)高出1倍以上。 -你知道2月14日西方情人节是因何起源的吗? 在情人节当天庆祝的开销上,可以说是小资男女当道,选择1000元(新台币,下同)以内的上班族最多占33.1%,情人节当天的花费上班族的平均花费是2473元,大手笔花费上万元以上庆祝情人节的,占比只有2.5%。 -你知道2月14日西方情人节是因何起源的吗? 情人节的起源众说纷纭,而为纪念罗马教士瓦伦丁是其中一个普遍的说法。 -你知道2月14日西方情人节是因何起源的吗? 据《世界图书百科全书》(World Book Encyclopedia)数据指出:“在公元200年时期,罗马皇帝克劳狄二世禁止年轻男子结婚。 -你知道2月14日西方情人节是因何起源的吗? 他认为未婚男子可以成为更优良的士兵。 -你知道2月14日西方情人节是因何起源的吗? 一位名叫瓦伦丁的教士违反了皇帝的命令,秘密为年轻男子主持婚礼,引起皇帝不满,结果被收监,据说瓦伦丁于公元269年2月14日被处决。 -你知道2月14日西方情人节是因何起源的吗? 另外,据《天主教百科全书》(The Catholic情人节 Encyclopedia)指出,公元496年,教宗圣基拉西乌斯一世在公元第五世纪末叶废除了牧神节,把2月14日定为圣瓦伦丁日。” -你知道2月14日西方情人节是因何起源的吗? 这个节日现今以“圣瓦伦丁节”——亦即情人节的姿态盛行起来。 -你知道2月14日西方情人节是因何起源的吗? 但是在第2次梵蒂冈大公会议后,1969年的典礼改革上,整理了一堆在史实上不确定是否真实存在的人物以后,圣瓦伦丁日就被废除了。 -你知道2月14日西方情人节是因何起源的吗? 现在天主教圣人历已经没有圣瓦伦丁日(St. Valentine's Day)。 -你知道2月14日西方情人节是因何起源的吗? 根据《布卢姆尔的警句与寓言辞典》记载:“圣瓦伦丁是个罗马教士,由于援助受逼害的基督徒而身陷险境,后来他归信基督教,最后被处死,卒于二月十四日”古代庆祝情人节的习俗与瓦伦丁拉上关系,可能是纯属巧合而已。 -你知道2月14日西方情人节是因何起源的吗? 事实上,这个节日很可能与古罗马的牧神节或雀鸟交配的季节有关。 -你知道2月14日西方情人节是因何起源的吗? 情人节的特色是情侣互相馈赠礼物。 -你知道2月14日西方情人节是因何起源的吗? 时至今日,人们则喜欢以情人卡向爱人表达情意。 -防卫大学每年招收多少学生? 防卫大学的前身是保安大学。 -防卫大学每年招收多少学生? 防卫大学是日本自卫队培养陆、海、空三军初级军官的学校,被称为日军"军官的摇篮"。 -防卫大学每年招收多少学生? 防卫大学是日军的重点院校。 -防卫大学每年招收多少学生? 日本历届内阁首相都要到防卫大学视察"训示",并亲自向学生颁发毕业证书。 -防卫大学每年招收多少学生? 日军四分之一的军官、三分之一的将官从这里走出。 -防卫大学每年招收多少学生? 防卫大学毕业生已成为日军军官的中坚力量。 -防卫大学每年招收多少学生? 防卫大学每年从地方招收18岁至21岁的应届高中毕业生和同等学历的青年。 -防卫大学每年招收多少学生? 每年招生名额为530名。 -防卫大学每年招收多少学生? 1950年 8月,日本组建警察预备队,1952年改为保安队。 -防卫大学每年招收多少学生? 为了充实保安队干部队伍,提高干部军政素质,1953年4月成立了保安大学,校址设在三浦半岛的久里滨。 -防卫大学每年招收多少学生? 1954年7月1日保安厅改为防卫厅。 -防卫大学每年招收多少学生? 在保安队基础上,日本建立了陆、海、空三军自卫队,保安大学遂改名为防卫大学,1955年迁至三浦半岛东南方的小原台。 -防卫大学每年招收多少学生? 学校直属防卫厅领导。 -防卫大学每年招收多少学生? 防卫大学的教育方针是:要求学生德智体全面发展,倡导学生崇尚知识和正义,培养学生具有指挥各种部队的能力。 -防卫大学每年招收多少学生? 防卫大学每年招生名额为530名,其中陆军300名,海军100名,空军130名。 -防卫大学每年招收多少学生? 根据自卫队向妇女敞开军官大门的决定,防卫大学1992年首次招收女学员35名。 -防卫大学每年招收多少学生? 考试分两次进行。 -防卫大学每年招收多少学生? 第一次,每年11月份进行学科考试;第二次,12月份进行口试和体检。 -防卫大学每年招收多少学生? 学校按陆、海、空三军分别设大学本科班和理工科研究生班。 -防卫大学每年招收多少学生? 本科班学制4年,又分为理工和人文社会学两大科。 -防卫大学每年招收多少学生? 学员入学后先分科,530人中有460人专攻理科,70人专攻文科。 -防卫大学每年招收多少学生? 第1学年按专科学习一般大学课程和一般军事知识。 -防卫大学每年招收多少学生? 第2学年以后在军事上开始区分军种,学员分别学习陆、海、空军的专门课程。 -防卫大学每年招收多少学生? 文化课和军事课的比例是6:l。 -防卫大学每年招收多少学生? 文化课程有人文、社会、自然、外语、电气工程、机械工程、土木建筑工程、应用化学、应用物理、航空、航海等。 -防卫大学每年招收多少学生? 军事训练课每学年6周,按一年四季有比例地安排教学内容,对学生进行军事技术和体能训练。 -防卫大学每年招收多少学生? 理工科研究生班,每年招生1期,学制2年,每期招收90人,设电子工程、航空工程、兵器制造等7个专业,课程按一般大学硕士课程标准设置。 -防卫大学每年招收多少学生? 防卫大学的课程和训练都十分紧张。 -防卫大学每年招收多少学生? 近年来,为了增强防卫大学的吸引力,克服考生逐年减少的倾向广泛征集优秀人才,学校进行了一些改革,改变入学考试办法,各高中校长以内部呈报的形式向防卫大学推荐品学兼优的学生;减少学生入学考试科目,放宽对报考防卫大学的学生的视力要求;降低学分数(大约降低30学分);改善学生宿舍条件。 -防卫大学每年招收多少学生? 防卫大学的学生生活紧张而愉快。 -《威鲁贝鲁的物语》官网是什么? 10年前大战后,威鲁贝鲁国一致辛勤的保护着得来不易的和平,但是与邻国圣卡特拉斯国的关系却不断的紧张,战争即将爆发。 -《威鲁贝鲁的物语》官网是什么? 为了避免战争,威鲁贝鲁国王海特鲁王决定将自己最大的女儿公主莉塔嫁给圣卡特拉斯国的王子格鲁尼亚。 -《威鲁贝鲁的物语》官网是什么? 但是莉塔却刺伤了政治婚姻的对象格鲁尼亚王子逃了出去,这事激怒了圣卡特拉斯国的国王兰帕诺夫王,并下令14天之内抓到王女并执行公开处刑来谢罪,不然两国就要开战。 -《威鲁贝鲁的物语》官网是什么? 《威鲁贝鲁的物语~Sisters of Wellber~》 -《威鲁贝鲁的物语》官网是什么? (Sisters of Wellber) -《威鲁贝鲁的物语》官网是什么? 日文名 ウエルベールの物语 -《威鲁贝鲁的物语》官网是什么? 
官方网站 http://www.avexmovie.jp/lineup/wellber/ -《威鲁贝鲁的物语》官网是什么? 为了回避发生战争这个最坏的结果,莉塔下定决心去中立国古利达姆。 diff --git a/examples/text_graph/erniesage/example_data/train_data.txt b/examples/text_graph/erniesage/example_data/train_data.txt deleted file mode 100644 index e9aead6c89fa..000000000000 --- a/examples/text_graph/erniesage/example_data/train_data.txt +++ /dev/null @@ -1,1000 +0,0 @@ -黑缘粗角肖叶甲触角有多大? 体长卵形,棕红色;鞘翅棕黄或淡棕色,外缘和中缝黑色或黑褐色;触角基部3、4节棕黄,余节棕色。 -黑缘粗角肖叶甲触角有多大? 头部刻点粗大,分布不均匀,头顶刻点十分稀疏;触角基部的内侧有一个三角形光瘤,唇基前缘呈半圆形凹切。 -黑缘粗角肖叶甲触角有多大? 触角近于体长之半,第1节粗大,棒状,第2节短,椭圆形,3、4两节细长,稍短于第5节,第5节基细端粗,末端6节明显粗大。 -黑缘粗角肖叶甲触角有多大? 前胸背板横宽,宽约为长的两倍,侧缘敞出较宽,圆形,敞边与盘区之间有一条细纵沟;盘区刻点相当密,前半部刻点较大于后半部。 -黑缘粗角肖叶甲触角有多大? 小盾片舌形,光亮,末端圆钝。 -黑缘粗角肖叶甲触角有多大? 鞘翅刻点粗大,不规则排列,肩部之后的刻点更为粗大,具皱褶,近中缝的刻点较小,略呈纵行排列。 -黑缘粗角肖叶甲触角有多大? 前胸前侧片前缘直;前胸后侧片具粗大刻点。 -黑缘粗角肖叶甲触角有多大? 足粗壮;胫节具纵脊,外端角向外延伸,呈弯角状;爪具附齿。 -暮光闪闪的姐姐是谁? 暮光闪闪是一匹雌性独角兽,后来在神秘魔法的影响下变成了空角兽(公主),她是《我的小马驹:友情是魔法》(英文名:My Little Pony:Friendship is Magic)中的主角之一。 -暮光闪闪的姐姐是谁? 她是银甲闪闪(Shining Armor)的妹妹,同时也是韵律公主(Princess Cadance)的小姑子。 -暮光闪闪的姐姐是谁? 在该系列中,她与最好的朋友与助手斯派克(Spike)一起生活在小马镇(Ponyville)的金橡图书馆(Golden Oak Library),研究友谊的魔法。 -暮光闪闪的姐姐是谁? 在暮光闪闪成为天角兽之前(即S3E13前),常常给塞拉丝蒂娅公主(Princess Celestia)关于友谊的报告。[1] -暮光闪闪的姐姐是谁? 《我的小马驹:友谊是魔法》(英文名称:My Little Pony:Friendship is Magic)(简称MLP) -暮光闪闪的姐姐是谁? 动画讲述了一只名叫做暮光闪闪(Twilight Sparkle)的独角兽(在SE3E13 -暮光闪闪的姐姐是谁? My Little Pony:Friendship is Magic[2] -暮光闪闪的姐姐是谁? 后成为了天角兽),执行她的导师塞拉斯蒂娅公主(PrincessCelestia)的任务,在小马镇(Ponyville)学习关于友谊的知识。 -暮光闪闪的姐姐是谁? 她与另外五只小马,苹果杰克(Applejack)、瑞瑞(Rarity)、云宝黛西(Rainbow Dash)、小蝶(Fluttershy)与萍琪派(Pinkie Pie),成为了最要好的朋友。 -暮光闪闪的姐姐是谁? 每匹小马都分别代表了协律精华的6个元素:诚实,慷慨,忠诚,善良,欢笑,魔法,各自扮演着属于自己的重要角色。 -暮光闪闪的姐姐是谁? 此后,暮光闪闪(Twilight Sparkle)便与她认识的新朋友们开始了有趣的日常生活。 -暮光闪闪的姐姐是谁? 在动画中,随时可见她们在小马镇(Ponyville)的种种冒险、奇遇、日常等等。 -暮光闪闪的姐姐是谁? 同时,也在她们之间的互动和冲突中,寻找着最适合最合理的完美解决方案。 -暮光闪闪的姐姐是谁? “尽管小马国并不太平,六位主角之间也常常有这样那样的问题,但是他们之间的真情对待,使得这个童话世界已经成为不少人心中理想的世外桃源。” -暮光闪闪的姐姐是谁? 暮光闪闪在剧情刚开始的时候生活在中心城(Canterlot),后来在夏日 -暮光闪闪的姐姐是谁? 暮光闪闪与斯派克(Spike) -暮光闪闪的姐姐是谁? 庆典的时候被塞拉丝蒂娅公主派遣到小马镇执行检查夏日庆典的准备工作的任务。 -暮光闪闪的姐姐是谁? 在小马镇交到了朋友(即其余5个主角),并和她们一起使用协律精华(Elements of harmony)击败了梦魇之月。 -暮光闪闪的姐姐是谁? 并在塞拉丝蒂亚公主的许可下,留在小马镇继续研究友谊的魔法。 -暮光闪闪的姐姐是谁? 暮光闪闪的知识基本来自于书本,并且她相当不相信书本以外的“迷信”,因为这样她在S1E15里吃足了苦头。 -暮光闪闪的姐姐是谁? 在这之后,她也开始慢慢学会相信一些书本以外的东西。 -暮光闪闪的姐姐是谁? 暮光闪闪热爱学习,并且学习成绩相当好(从她可以立刻算出 -暮光闪闪的姐姐是谁? 的结果可以看 -暮光闪闪的姐姐是谁? 暮光闪闪的原型 -暮光闪闪的姐姐是谁? 出)。 -暮光闪闪的姐姐是谁? 相当敬爱自己的老师塞拉丝蒂亚公主甚至到了精神失常的地步。 -暮光闪闪的姐姐是谁? 在第二季中,曾因为无法交出关于友谊的报告而做出了疯狂的行为,后来被塞拉丝蒂亚公主制止,在这之后,暮光闪闪得到了塞拉丝蒂亚公主“不用定期交友谊报告”的许可。 -暮光闪闪的姐姐是谁? 于是暮光闪闪在后面的剧情中的主角地位越来越得不到明显的体现。 -暮光闪闪的姐姐是谁? 在SE3E13中,因为破解了白胡子星璇留下的神秘魔法而被加冕成为了天角兽(公主),被尊称为“闪闪公主”。 -暮光闪闪的姐姐是谁? 当小星座熊在小马镇引起恐慌的时候,暮光闪闪运用了自身强大的魔法将水库举起后装满牛奶,用牛奶将小星座熊安抚后,连着巨型奶瓶和小星座熊一起送回了小星座熊居住的山洞。 -我想知道红谷十二庭有哪些金融机构? 红谷十二庭是由汪氏集团旗下子公司江西尤金房地产开发有限公司携手城发投资共同开发的精品社区,项目占地面积约380亩,总建筑面积约41万平方米。 -我想知道红谷十二庭有哪些金融机构? 项目以建设人文型、生态型居住环境为规划目标;创造一个布局合理、功能齐全、交通便捷、绿意盎然、生活方便,有文化内涵的居住区。 -我想知道红谷十二庭有哪些金融机构? 金融机构:工商银行、建设银行、农业银行、中国银行红谷滩支行、商业银行红谷滩支行等 -我想知道红谷十二庭有哪些金融机构? 周边公园:沿乌砂河50米宽绿化带、乌砂河水岸公园、秋水广场、赣江市民公园 -我想知道红谷十二庭有哪些金融机构? 周边医院:新建县人民医院、开心人药店、中寰医院 -我想知道红谷十二庭有哪些金融机构? 周边学校:育新小学红谷滩校区、南师附小红谷滩校区、实验小学红谷滩校区中学:南昌二中红谷滩校区、南昌五中、新建二中、竞秀贵族学校 -我想知道红谷十二庭有哪些金融机构? 周边公共交通:112、204、211、219、222、227、238、501等20多辆公交车在本项目社区门前停靠 -我想知道红谷十二庭有哪些金融机构? 红谷十二庭处在南昌一江两城中的西城中心,位属红谷滩CBD文化公园中心——马兰圩中心组团,红谷滩中心区、红角洲、新建县三区交汇处,南临南期友好路、东接红谷滩中心区、西靠乌砂河水岸公园(50米宽,1000米长)。 -我想知道红谷十二庭有哪些金融机构? 交通便捷,景观资源丰富,生活配套设施齐全,出则繁华,入则幽静,是现代人居的理想地段。 -我想知道红谷十二庭有哪些金融机构? 红谷十二庭户型图 -苏琳最开始进入智通实业是担任什么职位? 现任广东智通人才连锁股份有限公司总裁,清华大学高级工商管理硕士。 -苏琳最开始进入智通实业是担任什么职位? 1994年,加入智通实业,从总经理秘书做起。 -苏琳最开始进入智通实业是担任什么职位? 1995年,智通实业决定进入人才服务行业,被启用去负责新公司的筹建及运营工作,在苏琳的努力下,智通人才智力开发有限公司成立。 -苏琳最开始进入智通实业是担任什么职位? 
2003年,面对同城对手的激烈竞争,苏琳冷静对待,领导智通先后接管、并购了同城的腾龙、安达盛人才市场,,“品牌运作,连锁经营,差异制胜”成为苏琳屡屡制胜的法宝。 -苏琳最开始进入智通实业是担任什么职位? 2006年,苏琳先是将智通人才升级为“东莞市智通人才连锁有限公司”,一举成为广东省人才市场目前惟一的连锁机构,随后在东莞同时开设长安、松山湖、清溪等镇区分部,至此智通在东莞共有6个分部。 -苏琳最开始进入智通实业是担任什么职位? 一番大刀阔斧完成东莞布局后,苏琳确定下一个更为高远的目标——进军珠三角,向全国发展连锁机构。 -苏琳最开始进入智通实业是担任什么职位? 到2011年末,苏琳领导的智通人才已在珠三角的东莞、佛山、江门、中山等地,长三角的南京、宁波、合肥等地,中西部的南昌、长沙、武汉、重庆、西安等地设立了20多家连锁经营网点。 -苏琳最开始进入智通实业是担任什么职位? 除了财务副总裁之外,苏琳是智通人才核心管理高层当中唯一的女性,不管是要约采访的记者还是刚刚加入智通的员工,见到苏琳的第一面,都会有一种惊艳的感觉,“一位女企业家居然非常美丽和时尚?!” -苏琳最开始进入智通实业是担任什么职位? 智通管理高层的另外6位男性成员,有一次同时接受一家知名媒体采访时,共同表达了对自己老板的“爱慕”之情,苏琳听后莞尔一笑,指着在座的这几位高层说道“其实,我更爱他们!” -苏琳最开始进入智通实业是担任什么职位? 这种具有独特领导魅力的表述让这位记者唏嘘不已,同时由这样的一个细节让他感受到了智通管理团队的协作力量。 -谁知道黄沙中心小学的邮政编码是多少? 学校于1954年始建于棕树湾村,当时借用一间民房做教室,取名为“黄沙小学”,只有教师1人,学生8人。 -谁知道黄沙中心小学的邮政编码是多少? 1958年在大跃进精神的指导下,实行大集体,全乡集中办学,发展到12个班,300多学生,20名教职工。 -谁知道黄沙中心小学的邮政编码是多少? 1959年解散。 -谁知道黄沙中心小学的邮政编码是多少? 1959年下半年,在上级的扶持下,建了6间木房,搬到1960年学校所在地,有6名教师,3个班,60名学生。 -谁知道黄沙中心小学的邮政编码是多少? 1968年,开始招收一个初中班,“黄沙小学”改名为 “附小”。 -谁知道黄沙中心小学的邮政编码是多少? 当时已发展到5个班,8名教师,110多名学生。 -谁知道黄沙中心小学的邮政编码是多少? 增建土木结构教室两间。 -谁知道黄沙中心小学的邮政编码是多少? 1986年,初中、小学分开办学。 -谁知道黄沙中心小学的邮政编码是多少? 增建部分教师宿舍和教室,办学条件稍有改善,学校初具规模。 -谁知道黄沙中心小学的邮政编码是多少? 1996年,我校在市、县领导及希望工程主管部门的关怀下,决定改为“黄沙希望小学”并拨款32万元,新建一栋4层,12间教室的教学楼,教学条件大有改善。 -谁知道黄沙中心小学的邮政编码是多少? 当时发展到10个班,学生300多人,教职工19人,小学高级教师3人,一级教师7人,二级教师9人。 -谁知道黄沙中心小学的邮政编码是多少? 2003年下半年由于农村教育体制改革,撤销教育组,更名为“黄沙中心小学”。 -谁知道黄沙中心小学的邮政编码是多少? 学校现有在校生177人(含学前42人),设有学前至六年级共7个教学班。 -谁知道黄沙中心小学的邮政编码是多少? 有教师19人,其中大专以上学历11人,中师6人;小学高级教师14人,一级教师5人。 -谁知道黄沙中心小学的邮政编码是多少? 学校校园占地面积2050平方米,生均达15.29平方米,校舍建筑面积1645平方米,生均12.27平方米;设有教师办公室、自然实验、电教室(合二为一)、微机室、图书阅览室(合二为一)、体育室、广播室、少先队活动室。 -谁知道黄沙中心小学的邮政编码是多少? 广西壮族自治区桂林市临桂县黄沙瑶族乡黄沙街 邮编:541113[1] -伊藤实华的职业是什么? 伊藤实华(1984年3月25日-)是日本的女性声优。 -伊藤实华的职业是什么? THREE TREE所属,东京都出身,身长149cm,体重39kg,血型AB型。 -伊藤实华的职业是什么? ポルノグラフィティのLION(森男) -伊藤实华的职业是什么? 2000年 -伊藤实华的职业是什么? 犬夜叉(枫(少女时代)) -伊藤实华的职业是什么? 幻影死神(西亚梨沙) -伊藤实华的职业是什么? 2001年 -伊藤实华的职业是什么? NOIR(ロザリー) -伊藤实华的职业是什么? 2002年 -伊藤实华的职业是什么? 水瓶战记(柠檬) -伊藤实华的职业是什么? 返乡战士(エイファ) -伊藤实华的职业是什么? 2003年 -伊藤实华的职业是什么? 奇诺之旅(女子A(悲しい国)) -伊藤实华的职业是什么? 2004年 -伊藤实华的职业是什么? 爱你宝贝(坂下ミキ) -伊藤实华的职业是什么? Get Ride! アムドライバー(イヴァン・ニルギース幼少期) -伊藤实华的职业是什么? スクールランブル(花井春树(幼少时代)) -伊藤实华的职业是什么? 2005年 -伊藤实华的职业是什么? 光速蒙面侠21(虎吉) -伊藤实华的职业是什么? 搞笑漫画日和(男子トイレの精、パン美先生) -伊藤实华的职业是什么? 银牙伝说WEED(テル) -伊藤实华的职业是什么? 魔女的考验(真部カレン、守山太郎) -伊藤实华的职业是什么? BUZZER BEATER(レニー) -伊藤实华的职业是什么? 虫师(“眼福眼祸”さき、“草を踏む音”沢(幼少时代)) -伊藤实华的职业是什么? 2006年 -伊藤实华的职业是什么? 魔女之刃(娜梅) -伊藤实华的职业是什么? 反斗小王子(远藤レイラ) -伊藤实华的职业是什么? 搞笑漫画日和2(パン美先生、フグ子、ダンサー、ヤマトの妹、女性) -伊藤实华的职业是什么? 人造昆虫カブトボーグ V×V(ベネチアンの弟、东ルリ、园儿A) -伊藤实华的职业是什么? 2007年 -爆胎监测与安全控制系统英文是什么? 爆胎监测与安全控制系统(Blow-out Monitoring and Brake System),是吉利全球首创,并拥有自主知识产权及专利的一项安全技术。 -爆胎监测与安全控制系统英文是什么? 这项技术主要是出于防止高速爆胎所导致的车辆失控而设计。 -爆胎监测与安全控制系统英文是什么? BMBS爆胎监测与安全控制系统技术于2004年1月28日正式获得中国发明专利授权。 -爆胎监测与安全控制系统英文是什么? 2008年第一代BMBS系统正式与世人见面,BMBS汇集国内外汽车力学、控制学、人体生理学、电子信息学等方面的专家和工程技术人员经过一百余辆试验车累计行程超过五百万公里的可靠性验证,以确保产品的可靠性。 -爆胎监测与安全控制系统英文是什么? BMBS技术方案的核心即是采用智能化自动控制系统,弥补驾驶员生理局限,在爆胎后反应时间为0.5秒,替代驾驶员实施行车制动,保障行车安全。 -爆胎监测与安全控制系统英文是什么? BMBS系统由控制系统和显示系统两大部分组成,控制系统由BMBS开关、BMBS主机、BMBS分机、BMBS真空助力器四部分组成;显示系统由GPS显示、仪表指示灯、语言提示、制动双闪灯组成。 -爆胎监测与安全控制系统英文是什么? 当轮胎气压高于或低于限值时,控制器声光提示胎压异常。 -爆胎监测与安全控制系统英文是什么? 轮胎温度过高时,控制器发出信号提示轮胎温度过高。 -爆胎监测与安全控制系统英文是什么? 发射器电量不足时,控制器显示低电压报警。 -爆胎监测与安全控制系统英文是什么? 发射器受到干扰长期不发射信号时,控制器显示无信号报警。 -爆胎监测与安全控制系统英文是什么? 当汽车电门钥匙接通时,BMBS首先进入自检程序,检测系统各部分功能是否正常,如不正常,BMBS报警灯常亮。 -走读干部现象在哪里比较多? 走读干部一般是指县乡两级干部家住县城以上的城市,本人在县城或者乡镇工作,要么晚出早归,要么周一去单位上班、周五回家过周末。 -走读干部现象在哪里比较多? 对于这种现象,社会上的议论多是批评性的,认为这些干部脱离群众、作风漂浮、官僚主义,造成行政成本增加和腐败。 -走读干部现象在哪里比较多? 截至2014年10月,共有6484名“走读干部”在专项整治中被查处。 -走读干部现象在哪里比较多? 这是中央首次大规模集中处理这一长期遭诟病的干部作风问题。 -走读干部现象在哪里比较多? 
干部“走读”问题主要在乡镇地区比较突出,城市地区则较少。 -走读干部现象在哪里比较多? 从历史成因和各地反映的情况来看,产生“走读”现象的主要原因大致有四种: -走读干部现象在哪里比较多? 现今绝大多数乡村都有通往乡镇和县城的石子公路甚至柏油公路,这无疑为农村干部的出行创造了便利条件,为“干部像候鸟,频往家里跑”创造了客观条件。 -走读干部现象在哪里比较多? 选调生、公务员队伍大多是学历较高的大学毕业生,曾在高校所在地的城市生活,不少人向往城市生活,他们不安心长期扎根基层,而是将基层当作跳板,因此他们往往成为“走读”的主力军。 -走读干部现象在哪里比较多? 公仆意识、服务意识淡化,是“走读”现象滋生的主观原因。 -走读干部现象在哪里比较多? 有些党员干部感到自己长期在基层工作,该为自己和家庭想想了。 -走读干部现象在哪里比较多? 于是,不深入群众认真调查研究、认真听取群众意见、认真解决群众的实际困难,也就不难理解了。 -走读干部现象在哪里比较多? 县级党政组织对乡镇领导干部管理的弱化和为基层服务不到位,导致“走读”问题得不到应有的制度约束,是“走读”问题滋长的组织原因。[2] -走读干部现象在哪里比较多? 近些年来,我国一些地方的“干部走读”现象较为普遍,社会上对此议走读干部论颇多。 -走读干部现象在哪里比较多? 所谓“干部走读”,一般是指县乡两级干部家住县城以上的城市,本人在县城或者乡镇工作,要么早出晚归,要么周一去单位上班、周五回家过周末。 -走读干部现象在哪里比较多? 对于这种现象,社会上的议论多是批评性的,认为这些干部脱离群众、作风漂浮、官僚主义,造成行政成本增加和腐败。 -走读干部现象在哪里比较多? 干部走读之所以成为“千夫所指”,是因为这种行为增加了行政成本。 -走读干部现象在哪里比较多? 从根子上说,干部走读是城乡发展不平衡的产物,“人往高处走,水往低处流”,有了更加舒适的生活环境,不管是为了自己生活条件改善也好,还是因为子女教育也好,农村人口向城镇转移,这是必然结果。 -走读干部现象在哪里比较多? “干部走读”的另一个重要原因,是干部人事制度改革。 -走读干部现象在哪里比较多? 目前公务员队伍“凡进必考”,考上公务员的大多是学历较高的大学毕业生,这些大学毕业生来自各个全国各地,一部分在本地结婚生子,沉淀下来;一部分把公务员作为跳板,到基层后或考研,或再参加省考、国考,或想办法调回原籍。 -走读干部现象在哪里比较多? 再加上一些下派干部、异地交流任职干部,构成了看似庞大的“走读”队伍。 -走读干部现象在哪里比较多? 那么,“干部走读”有哪些弊端呢? -走读干部现象在哪里比较多? 一是这些干部人在基层,心在城市,缺乏长期作战的思想,工作不安心。 -走读干部现象在哪里比较多? 周一来上班,周五回家转,对基层工作缺乏热情和感情;二是长期在省市直机关工作,对基层工作不熟悉不了解,工作不热心;三是长期走读,基层干群有工作难汇报,有困难难解决,群众不开心;四是干部来回走读,公车私驾,私费公报,把大量的经济负担转嫁给基层;五是对这些走读干部,基层管不了,上级监督难,节假日期间到哪里去、做什么事,基本处于失控和真空状态,各级组织和基层干群不放心。 -走读干部现象在哪里比较多? 特别需要引起警觉的是,由于少数走读干部有临时思想,满足于“当维持会长”,得过且过混日子,热衷于做一些急功近利、砸锅求铁的短期行为和政绩工程,不愿做打基础、管长远的实事好事,甚至怠政、疏政和懒于理政,影响了党和政府各项方针政策措施的落实,导致基层无政府主义、自由主义抬头,削弱了党和政府的领导,等到矛盾激化甚至不可收拾的时候,处理已是来之不及。 -走读干部现象在哪里比较多? 权利要与义务相等,不能只有义务而没有权利,或是只有权利没有义务。 -走读干部现象在哪里比较多? 如何真正彻底解决乡镇干部“走读”的现象呢? -走读干部现象在哪里比较多? 那就必须让乡镇基层干部义务与权利相等。 -走读干部现象在哪里比较多? 如果不能解决基层干部待遇等问题,即使干部住村,工作上也不会有什么进展的。 -走读干部现象在哪里比较多? 所以,在政治上关心,在生活上照顾,在待遇上提高。 -走读干部现象在哪里比较多? 如,提高基层干部的工资待遇,增加通讯、交通补助;帮助解决子女入学及老人赡养问题;提拔干部优先考虑基层干部;干部退休时的待遇至少不低于机关干部等等。 -化州市良光镇东岸小学学风是什么? 学校全体教职工爱岗敬业,团结拼搏,勇于开拓,大胆创新,进行教育教学改革,努力开辟第二课堂的教学路子,并开通了网络校校通的交流合作方式。 -化州市良光镇东岸小学学风是什么? 现学校教师正在为创建安全文明校园而努力。 -化州市良光镇东岸小学学风是什么? 东岸小学位置偏僻,地处贫穷落后,是良光镇最偏远的学校,学校,下辖分教点——东心埇小学,[1]?。 -化州市良光镇东岸小学学风是什么? 学校2011年有教师22人,学生231人。 -化州市良光镇东岸小学学风是什么? 小学高级教师8人,小学一级教师10人,未定级教师4人,大专学历的教师6人,其余的都具有中师学历。 -化州市良光镇东岸小学学风是什么? 全校共设12个班,学校课程按标准开设。 -化州市良光镇东岸小学学风是什么? 东岸小学原来是一所破旧不堪,教学质量非常差的薄弱学校。 -化州市良光镇东岸小学学风是什么? 近几年来,在各级政府、教育部门及社会各界热心人士鼎力支持下,学校领导大胆改革创新,致力提高教学质量和教师水平,并加大经费投入,大大改善了办学条件,使学校由差变好,实现了大跨越。 -化州市良光镇东岸小学学风是什么? 学校建设性方面。 -化州市良光镇东岸小学学风是什么? 东岸小学属于革命老区学校,始建于1980年,从东心埇村祠堂搬到这个校址,1990年建造一幢建筑面积为800平方米的南面教学楼, 1998年老促会支持从北面建造一幢1800平方米的教学大楼。 -化州市良光镇东岸小学学风是什么? 学校在管理方面表现方面颇具特色,实现了各项制度的日常化和规范化。 -化州市良光镇东岸小学学风是什么? 学校领导有较强的事业心和责任感,讲求民主与合作,勤政廉政,依法治校,树立了服务意识。 -化州市良光镇东岸小学学风是什么? 学校一贯实施“德育为先,以人为本”的教育方针,制定了“团结,律已,拼搏,创新”的校训。 -化州市良光镇东岸小学学风是什么? 教育风为“爱岗敬业,乐于奉献”,学风为“乐学,勤学,巧学,会学”。 -化州市良光镇东岸小学学风是什么? 校内营造了尊师重教的氛围,形成了良好的校风和学风。 -化州市良光镇东岸小学学风是什么? 教师们爱岗敬业,师德高尚,治学严谨,教研教改气氛浓厚,获得喜人的教研成果。 -化州市良光镇东岸小学学风是什么? 近几年来,教师撰写的教育教学论文共10篇获得县市级以上奖励,获了镇级以上奖励的有100人次。 -化州市良光镇东岸小学学风是什么? 学校德育工作成绩显著,多年被评为“安全事故为零”的学校,良光镇先进学校。 -化州市良光镇东岸小学学风是什么? 特别是教学质量大大提高了。 -化州市良光镇东岸小学学风是什么? 这些成绩得到了上级及群众的充分肯定。 -化州市良光镇东岸小学学风是什么? 1.学校环境欠美观有序,学校大门口及校道有待改造。 -化州市良光镇东岸小学学风是什么? 2.学校管理制度有待改进,部分教师业务水平有待提高。 -化州市良光镇东岸小学学风是什么? 3.教师宿舍、教室及学生宿舍欠缺。 -化州市良光镇东岸小学学风是什么? 4.运动场不够规范,各类体育器材及设施需要增加。 -化州市良光镇东岸小学学风是什么? 5.学生活动空间少,见识面窄,视野不够开阔。 -化州市良光镇东岸小学学风是什么? 1.努力营造和谐的教育教学新气氛。 -化州市良光镇东岸小学学风是什么? 建立科学的管理制度,坚持“与时俱进,以人为本”,真正实现领导对教师,教师对学生之间进行“德治与情治”;学校的人文环境做到“文明,和谐,清新”;德育环境做到“自尊,律已,律人”;心理环境做到“安全,谦虚,奋发”;交际环境做到“团结合作,真诚助人”;景物环境做到“宜人,有序。” -化州市良光镇东岸小学学风是什么? 营造学校与育人的新特色。 -我很好奇发射管的输出功率怎么样? 产生或放大高频功率的静电控制电子管,有时也称振荡管。 -我很好奇发射管的输出功率怎么样? 用于音频或开关电路中的发射管称调制管。 -我很好奇发射管的输出功率怎么样? 
发射管是无线电广播、通信、电视发射设备和工业高频设备中的主要电子器件。 -我很好奇发射管的输出功率怎么样? 输出功率和工作频率是发射管的基本技术指标。 -我很好奇发射管的输出功率怎么样? 广播、通信和工业设备的发射管,工作频率一般在30兆赫以下,输出功率在1919年为2千瓦以下,1930年达300千瓦,70年代初已超过1000千瓦,效率高达80%以上。 -我很好奇发射管的输出功率怎么样? 发射管工作频率提高时,输出功率和效率都会降低,因此1936年首次实用的脉冲雷达工作频率仅28兆赫,80年代则已达 400兆赫以上。 -我很好奇发射管的输出功率怎么样? 40年代电视发射管的工作频率为数十兆赫,而80年代初,优良的电视发射管可在1000兆赫下工作,输出功率达20千瓦,效率为40%。 -我很好奇发射管的输出功率怎么样? 平面电极结构的小功率发射三极管可在更高的频率下工作。 -我很好奇发射管的输出功率怎么样? 发射管多采用同心圆筒电极结构。 -我很好奇发射管的输出功率怎么样? 阴极在最内层,向外依次为各个栅极和阳极。 -我很好奇发射管的输出功率怎么样? 图中,自左至右为阴极、第一栅、第二栅、栅极阴极组装件和装入阳极后的整个管子。 -我很好奇发射管的输出功率怎么样? 发射管 -我很好奇发射管的输出功率怎么样? 中小功率发射管多采用间热式氧化物阴极。 -我很好奇发射管的输出功率怎么样? 大功率发射管一般采用碳化钍钨丝阴极,有螺旋、直条或网笼等结构形式。 -我很好奇发射管的输出功率怎么样? 图为网笼式阴极。 -我很好奇发射管的输出功率怎么样? 栅极多用钼丝或钨丝绕制,或用钼片经电加工等方法制造。 -我很好奇发射管的输出功率怎么样? 栅极表面经镀金(或铂)或涂敷锆粉等处理,以降低栅极电子发射,使发射管稳定工作。 -我很好奇发射管的输出功率怎么样? 用气相沉积方法制造的石墨栅极,具有良好的性能。 -我很好奇发射管的输出功率怎么样? 发射管阳极直流输入功率转化为高频输出功率的部分约为75%,其余25%成为阳极热损耗,因此对发射管的阳极必须进行冷却。 -我很好奇发射管的输出功率怎么样? 中小功率发射管的阳极采取自然冷却方式,用镍、钼或石墨等材料制造,装在管壳之内,工作温度可达 600℃。 -我很好奇发射管的输出功率怎么样? 大功率发射管的阳极都用铜制成,并作为真空密封管壳的一部分,采用各种强制冷却方式。 -我很好奇发射管的输出功率怎么样? 各种冷却方式下每平方厘米阳极内表面的散热能力为:水冷100瓦;风冷30瓦;蒸发冷却250瓦;超蒸发冷却1000瓦以上,80年代已制成阳极损耗功率为1250千瓦的超蒸发冷却发射管。 -我很好奇发射管的输出功率怎么样? 发射管也常以冷却方式命名,如风冷发射管、水冷发射管和蒸发冷却发射管。 -我很好奇发射管的输出功率怎么样? 发射管管壳用玻璃或陶瓷制造。 -我很好奇发射管的输出功率怎么样? 小功率发射管内使用含钡的吸气剂;大功率发射管则采用锆、钛、钽等吸气材料,管内压强约为10帕量级。 -我很好奇发射管的输出功率怎么样? 发射管寿命取决于阴极发射电子的能力。 -我很好奇发射管的输出功率怎么样? 大功率发射管寿命最高记录可达8万小时。 -我很好奇发射管的输出功率怎么样? 发射四极管的放大作用和输出输入电路间的隔离效果优于三极管,应用最广。 -我很好奇发射管的输出功率怎么样? 工业高频振荡器普遍采用三极管。 -我很好奇发射管的输出功率怎么样? 五极管多用在小功率范围中。 -鲁能领秀城中央公园有23层,2梯3户的是几号楼? 鲁能领秀城中央公园位于鲁能领秀城景观中轴之上,总占地161.55亩,总建筑面积约40万平米,容积率为2.70,由22栋小高层、高层组成;其绿地率高达35.2%,环境优美,产品更加注重品质化、人性化和自然生态化,是鲁能领秀城的生态人居典范。 -鲁能领秀城中央公园有23层,2梯3户的是几号楼? 中央公园[1] 学区准现房,坐享鲁能领秀城成熟配套,成熟生活一步到位。 -鲁能领秀城中央公园有23层,2梯3户的是几号楼? 经典板式小高层,103㎡2+1房仅22席,稀市推出,错过再无;92㎡经典两房、137㎡舒适三房压轴登场! -鲁能领秀城中央公园有23层,2梯3户的是几号楼? 物业公司: -鲁能领秀城中央公园有23层,2梯3户的是几号楼? 济南凯瑞物业公司;深圳长城物业公司;北京盛世物业有限公司 -鲁能领秀城中央公园有23层,2梯3户的是几号楼? 绿化率: -鲁能领秀城中央公园有23层,2梯3户的是几号楼? 42% -鲁能领秀城中央公园有23层,2梯3户的是几号楼? 容积率: -鲁能领秀城中央公园有23层,2梯3户的是几号楼? 2.70 -鲁能领秀城中央公园有23层,2梯3户的是几号楼? 暖气: -鲁能领秀城中央公园有23层,2梯3户的是几号楼? 集中供暖 -鲁能领秀城中央公园有23层,2梯3户的是几号楼? 楼座展示:中央公园由22栋小高层、高层组成,3、16、17号楼分别是11层小高层,18层和28层的高层。 -鲁能领秀城中央公园有23层,2梯3户的是几号楼? 4号楼是23层,2梯3户。 -鲁能领秀城中央公园有23层,2梯3户的是几号楼? 项目位置: -鬼青蛙在哪里有收录详情? 鬼青蛙这张卡可以从手卡把这张卡以外的1只水属性怪兽丢弃,从手卡特殊召唤。 -鬼青蛙在哪里有收录详情? 这张卡召唤·反转召唤·特殊召唤成功时,可以从自己的卡组·场上选1只水族·水属性·2星以下的怪兽送去墓地。 -鬼青蛙在哪里有收录详情? 此外,1回合1次,可以通过让自己场上1只怪兽回到手卡,这个回合通常召唤外加上只有1次,自己可以把「鬼青蛙」以外的1只名字带有「青蛙」的怪兽召唤。[1] -鬼青蛙在哪里有收录详情? 游戏王卡包收录详情 -鬼青蛙在哪里有收录详情? [09/09/18] -西湖区有多大? 西湖区是江西省南昌市市辖区。 -西湖区有多大? 为南昌市中心城区之一,有着2200多年历史,是一个物华天宝、人杰地灵的古老城区。 -西湖区有多大? 2004年南昌市老城区区划调整后,西湖区东起京九铁路线与青山湖区毗邻,南以洪城路东段、抚河路南段、象湖以及南隔堤为界与青云谱区、南昌县接壤,西凭赣江中心线与红谷滩新区交界,北沿中山路、北京西路与东湖区相连,所辖面积34.5平方公里,常住人口43万,管辖1个镇、10个街道办事处,设12个行政村、100个社区。 -西湖区有多大? (图)西湖区[南昌市] -西湖区有多大? 西湖原为汉代豫章群古太湖的一部分,唐贞元15年(公元799年)洪恩桥的架设将东太湖分隔成东西两部分,洪恩桥以西谓之西湖,西湖区由此而得名。 -西湖区有多大? 西湖区在1926年南昌设市后分别称第四、五部分,六、七部分。 -西湖区有多大? 1949年解放初期分别称第三、四区。 -西湖区有多大? 1955年分别称抚河区、西湖区。 -西湖区有多大? 1980年两区合并称西湖区。[1] -西湖区有多大? 辖:西湖街道、丁公路街道、广外街道、系马桩街道、绳金塔街道、朝阳洲街道、禾草街街道、十字街街道、瓦子角街道、三眼井街道、上海路街道、筷子巷街道、南站街道。[1] -西湖区有多大? 2002年9月,由原筷子巷街道和原禾草街街道合并设立南浦街道,原广外街道与瓦子角街道的一部分合并设立广润门街道。 -西湖区有多大? 2002年12月1日设立桃源街道。 -西湖区有多大? 2004年区划调整前的西湖区区域:东与青山湖区湖坊乡插花接壤;西临赣江与红谷滩新区隔江相望;南以建设路为界,和青云谱区毗邻;北连中山路,北京西路,与东湖区交界。[1] -西湖区有多大? 2002年9月,由原筷子巷街道和原禾草街街道合并设立南浦街道,原广外街道与瓦子角街道的一部分合并设立广润门街道。 -西湖区有多大? 2002年12月1日设立桃源街道。 -西湖区有多大? 2004年区划调整前的西湖区区域:东与青山湖区湖坊乡插花接壤;西临赣江与红谷滩新区隔江相望;南以建设路为界,和青云谱区毗邻;北连中山路,北京西路,与东湖区交界。 -西湖区有多大? 2004年9月7日,国务院批准(国函[2004]70号)调整南昌市市辖区部分行政区划:将西湖区朝阳洲街道的西船居委会划归东湖区管辖。 -西湖区有多大? 将青山湖区的桃花镇和湖坊镇的同盟村划归西湖区管辖。 -西湖区有多大? 
将西湖区十字街街道的谷市街、洪城路、南关口、九四、新丰5个居委会,上海路街道的草珊瑚集团、南昌肠衣厂、电子计算机厂、江西涤纶厂、江地基础公司、曙光、商标彩印厂、南昌市染整厂、江南蓄电池厂、四机床厂、二进、国乐新村12个居委会,南站街道的解放西路东居委会划归青云谱区管辖。 -西湖区有多大? 将西湖区上海路街道的轻化所、洪钢、省人民检察院、电信城东分局、安康、省机械施工公司、省水利设计院、省安装公司、南方电动工具厂、江西橡胶厂、上海路北、南昌电池厂、东华计量所、南昌搪瓷厂、上海路新村、华安针织总厂、江西五金厂、三波电机厂、水文地质大队、二六○厂、省卫生学校、新世纪、上海路住宅区北、塔子桥北、南航、上海路住宅区南、沿河、南昌阀门厂28个居委会,丁公路街道的新魏路、半边街、师大南路、顺化门、岔道口东路、师大、广电厅、手表厂、鸿顺9个居委会,南站街道的工人新村北、工人新村南、商苑、洪都中大道、铁路第三、铁路第四、铁路第六7个居委会划归青山湖区管辖。 -西湖区有多大? 调整后,西湖区辖绳金塔、桃源、朝阳洲、广润门、南浦、西湖、系马桩、十字街、丁公路、南站10个街道和桃花镇,区人民政府驻孺子路。 -西湖区有多大? 调整前,西湖区面积31平方千米,人口52万。 -西湖区有多大? (图)西湖区[南昌市] -西湖区有多大? 西湖区位于江西省省会南昌市的中心地带,具有广阔的发展空间和庞大的消费群体,商贸旅游、娱乐服务业等到各个行业都蕴藏着无限商机,投资前景十分广阔。 -西湖区有多大? 不仅水、电价格低廉,劳动力资源丰富,人均工资和房产价格都比沿海城市低,城区拥有良好的人居环境、低廉的投资成本,巨大的发展潜力。 -西湖区有多大? 105、316、320国道和京九铁路贯穿全境,把南北东西交通连成一线;民航可与上海、北京、广州、深圳、厦门、温州等到地通航,并开通了南昌-新加坡第一条国际航线;水运依托赣江可直达长江各港口;邮电通讯便捷,程控电话、数字微波、图文传真进入国际通讯网络;商检、海关、口岸等涉外机构齐全;水、电、气供应充足。 -西湖区有多大? (图)西湖区[南昌市] -西湖区有多大? 西湖区,是江西省省会南昌市的中心城区,面积34.8平方公里,常住人口51.9万人,辖桃花镇、朝农管理处及10个街道,设13个行政村,116个社区居委会,20个家委会。[2] -西湖区有多大? 2005年11月16日,南昌市《关于同意西湖区桃花镇、桃源、十字街街道办事处行政区划进行调整的批复》 -西湖区有多大? 1、同意将桃花镇的三道闸居委会划归桃源街道办事处管辖。 -青藏虎耳草花期什么时候? 青藏虎耳草多年生草本,高4-11.5厘米,丛生。 -青藏虎耳草花期什么时候? 花期7-8月。 -青藏虎耳草花期什么时候? 分布于甘肃(祁连山地)、青海(黄南、海南、海北)和西藏(加查)。 -青藏虎耳草花期什么时候? 生于海拔3 700-4 250米的林下、高山草甸和高山碎石隙。[1] -青藏虎耳草花期什么时候? 多年生草本,高4-11.5厘米,丛生。 -青藏虎耳草花期什么时候? 茎不分枝,具褐色卷曲柔毛。 -青藏虎耳草花期什么时候? 基生叶具柄,叶片卵形、椭圆形至长圆形,长15-25毫米,宽4-8毫米,腹面无毛,背面和边缘具褐色卷曲柔毛,叶柄长1-3厘米,基部扩大,边缘具褐色卷曲柔毛;茎生叶卵形至椭圆形,长1.5-2厘米,向上渐变小。 -青藏虎耳草花期什么时候? 聚伞花序伞房状,具2-6花;花梗长5-19毫米,密被褐色卷曲柔毛;萼片在花期反曲,卵形至狭卵形,长2.5-4.2毫米,宽1.5-2毫米,先端钝,两面无毛,边缘具褐色卷曲柔毛,3-5脉于先端不汇合;花瓣腹面淡黄色且其中下部具红色斑点,背面紫红色,卵形、狭卵形至近长圆形,长2.5-5.2毫米,宽1.5-2.1毫米,先端钝,基部具长0.5-1毫米之爪,3-5(-7)脉,具2痂体;雄蕊长2-3.6毫米,花丝钻形;子房半下位,周围具环状花盘,花柱长1-1.5毫米。 -青藏虎耳草花期什么时候? 生于高山草甸、碎石间。 -青藏虎耳草花期什么时候? 分布青海、西藏、甘肃、四川等地。 -青藏虎耳草花期什么时候? [1] -青藏虎耳草花期什么时候? 顶峰虎耳草Saxifraga cacuminum Harry Sm. -青藏虎耳草花期什么时候? 对叶虎耳Saxifraga contraria Harry Sm. -青藏虎耳草花期什么时候? 狭瓣虎耳草Saxifraga pseudohirculus Engl. -青藏虎耳草花期什么时候? 唐古特虎耳草Saxifraga tangutica Engl. -青藏虎耳草花期什么时候? 宽叶虎耳草(变种)Saxifraga tangutica Engl. var. platyphylla (Harry Sm.) J. T. Pan -青藏虎耳草花期什么时候? 唐古特虎耳草(原变种)Saxifraga tangutica Engl. var. tangutica -青藏虎耳草花期什么时候? 西藏虎耳草Saxifraga tibetica Losinsk.[1] -青藏虎耳草花期什么时候? Saxifraga przewalskii Engl. in Bull. Acad. Sci. St. -Petersb. 29:115. 1883: Engl et Irmsch. in Bot. Jahrb. 48:580. f. 5E-H. 1912 et in Engl. Pflanzenr. 67(IV. 117): 107. f. 21 E-H. 1916; J. T. Pan in Acta Phytotax. Sin. 16(2): 16. 1978;中国高等植物图鉴补编2: 30. 1983; 西藏植物志 2: 483. 1985. [1] -生产一支欧文冲锋枪需要多少钱? 欧文冲锋枪 Owen Gun 1945年,在新不列颠手持欧文冲锋枪的澳大利亚士兵 类型 冲锋枪 原产国 ?澳大利亚 服役记录 服役期间 1941年-1960年代 用户 参见使用国 参与战役 第二次世界大战 马来亚紧急状态 朝鲜战争 越南战争 1964年罗德西亚布什战争 生产历史 研发者 伊夫林·欧文(Evelyn Owen) 研发日期 1931年-1939年 生产商 约翰·莱萨特工厂 利特高轻武器工厂 单位制造费用 $ 30/枝 生产日期 1941年-1945年 制造数量 45,000-50,000 枝 衍生型 Mk 1/42 Mk 1/43 Mk 2/43 基本规格 总重 空枪: Mk 1/42:4.24 千克(9.35 磅) Mk 1/43:3.99 千克(8.8 磅) Mk 2/43:3.47 千克(7.65 磅) 全长 806 毫米(31.73 英吋) 枪管长度 247 毫米(9.72 英吋) 弹药 制式:9 × 19 毫米 原型:.38/200 原型:.45 ACP 口径 9 × 19 毫米:9 毫米(.357 英吋) .38/200:9.65 毫米(.38 英吋) .45 ACP:11.43 毫米(.45 英吋) 枪管 1 根,膛线7 条,右旋 枪机种类 直接反冲作用 开放式枪机 发射速率 理论射速: Mk 1/42:700 发/分钟 Mk 1/43:680 发/分钟 Mk 2/43:600 发/分钟 实际射速:120 发/分钟 枪口初速 380-420 米/秒(1,246.72-1,377.95 英尺/秒) 有效射程 瞄具装定射程:91.44 米(100 码) 最大有效射程:123 米(134.51 码) 最大射程 200 米(218.72 码) 供弹方式 32/33 发可拆卸式弹匣 瞄准具型式 机械瞄具:向右偏置的觇孔式照门和片状准星 欧文冲锋枪(英语:Owen Gun,正式名称:Owen Machine Carbine,以下简称为“欧文枪”)是一枝由伊夫林·(埃沃)·欧文(英语:Evelyn (Evo) Owen)于1939年研制、澳大利亚的首枝冲锋枪,制式型发射9 × 19 毫米鲁格手枪子弹。 -生产一支欧文冲锋枪需要多少钱? 欧文冲锋枪是澳大利亚唯一设计和主要服役的二战冲锋枪,并从1943年由澳大利亚陆军所使用,直到1960年代中期。 -生产一支欧文冲锋枪需要多少钱? 
由新南威尔士州卧龙岗市出身的欧文枪发明者,伊夫林·欧文,在24岁时于1939年7月向悉尼维多利亚军营的澳大利亚陆军军械官员展示了他所设计的.22 LR口径“卡宾机枪”原型枪。 -生产一支欧文冲锋枪需要多少钱? 该枪却被澳大利亚陆军所拒绝,因为澳大利亚陆军在当时没有承认冲锋枪的价值。 -生产一支欧文冲锋枪需要多少钱? 随着战争的爆发,欧文加入了澳大利亚军队,并且成为一名列兵。 -生产一支欧文冲锋枪需要多少钱? 1940年9月,欧文的邻居,文森特·沃德尔(英语:Vincent Wardell),看到欧文家楼梯后面搁著一个麻布袋,里面放著一枝欧文枪的原型枪。 -生产一支欧文冲锋枪需要多少钱? 而文森特·沃德尔是坎布拉港的大型钢制品厂莱萨特公司的经理,他向欧文的父亲表明了他对其儿子的粗心大意感到痛心,但无论如何仍然解释了这款武器的历史。 -生产一支欧文冲锋枪需要多少钱? 沃德尔对欧文枪的简洁的设计留下了深刻的印象。 -生产一支欧文冲锋枪需要多少钱? 沃德尔安排欧文转调到陆军发明部(英语:Army Inventions Board),并重新开始在枪上的工作。 -生产一支欧文冲锋枪需要多少钱? 军队仍然持续地从负面角度查看该武器,但同时政府开始采取越来越有利的观点。 -生产一支欧文冲锋枪需要多少钱? 该欧文枪原型配备了装在顶部的弹鼓,后来让位给装在顶部的弹匣使用。 -生产一支欧文冲锋枪需要多少钱? 口径的选择亦花了一些时间去解决。 -生产一支欧文冲锋枪需要多少钱? 由于陆军有大批量的柯尔特.45 ACP子弹,它们决定欧文枪需要采用这种口径。 -生产一支欧文冲锋枪需要多少钱? 直到在1941年9月19日官方举办试验时,约翰·莱萨特工厂制成了9 毫米、.38/200和.45 ACP三种口径版本。 -生产一支欧文冲锋枪需要多少钱? 而从美、英进口的斯登冲锋枪和汤普森冲锋枪在试验中作为基准使用。 -生产一支欧文冲锋枪需要多少钱? 作为测试的一部分,所有的枪支都浸没在泥浆里,并以沙土覆盖,以模拟他们将会被使用时最恶劣的环境。 -生产一支欧文冲锋枪需要多少钱? 欧文枪是唯一在这测试中这样对待以后仍可正常操作的冲锋枪。 -生产一支欧文冲锋枪需要多少钱? 虽然测试表现出欧文枪具有比汤普森冲锋枪和司登冲锋枪更优秀的可靠性,陆军没有对其口径作出决定。 -生产一支欧文冲锋枪需要多少钱? 结果它在上级政府干预以后,陆军才下令9 毫米的衍生型为正式口径,并在1941年11月20日正式被澳大利亚陆军采用。 -生产一支欧文冲锋枪需要多少钱? 在欧文枪的寿命期间,其可靠性在澳大利亚部队中赢得了“军人的至爱”(英语:Digger's Darling)的绰号,亦有人传言它受到美军高度青睐。 -生产一支欧文冲锋枪需要多少钱? 欧文枪是在1942年开始正式由坎布拉港和纽卡斯尔的约翰·莱萨特工厂投入生产,在生产高峰期每个星期生产800 支。 -生产一支欧文冲锋枪需要多少钱? 1942年3月至1943年2月之间,莱萨特生产了28,000 枝欧文枪。 -生产一支欧文冲锋枪需要多少钱? 然而,最初的一批弹药类型竟然是错误的,以至10,000 枝欧文枪无法提供弹药。 -生产一支欧文冲锋枪需要多少钱? 政府再一次推翻军方的官僚主义作风??,并让弹药通过其最后的生产阶段,以及运送到当时在新几内亚与日军战斗的澳大利亚部队的手中。 -生产一支欧文冲锋枪需要多少钱? 在1941年至1945年间生产了约50,000 枝欧文枪。 -生产一支欧文冲锋枪需要多少钱? 在战争期间,欧文枪的平均生产成本为$ 30。[1] -生产一支欧文冲锋枪需要多少钱? 虽然它是有点笨重,因为其可靠性,欧文枪在士兵当中变得非常流行。 -生产一支欧文冲锋枪需要多少钱? 它是如此成功,它也被新西兰、英国和美国订购。[2] -生产一支欧文冲锋枪需要多少钱? 欧文枪后来也被澳大利亚部队在朝鲜战争和越南战争,[3]特别是步兵组的侦察兵。 -生产一支欧文冲锋枪需要多少钱? 这仍然是一枝制式的澳大利亚陆军武器,直到1960年代中期,它被F1冲锋枪所取代。 -第二届中国光伏摄影大赛因为什么政策而开始的? 光伏发电不仅是全球能源科技和产业发展的重要方向,也是我国具有国际竞争优势的战略性新兴产业,是我国保障能源安全、治理环境污染、应对气候变化的战略性选择。 -第二届中国光伏摄影大赛因为什么政策而开始的? 2013年7月以来,国家出台了《关于促进光伏产业健康发展的若干意见》等一系列政策,大力推进分布式光伏发电的应用,光伏发电有望走进千家万户,融入百姓民生。 -第二届中国光伏摄影大赛因为什么政策而开始的? 大赛主办方以此为契机,开启了“第二届中国光伏摄影大赛”的征程。 -悬赏任务有哪些类型? 悬赏任务,威客网站上一种任务模式,由雇主在威客网站发布任务,提供一定数额的赏金,以吸引威客们参与。 -悬赏任务有哪些类型? 悬赏任务数额一般在几十到几千不等,但也有几万甚至几十万的任务。 -悬赏任务有哪些类型? 主要以提交的作品的质量好坏作为中标标准,当然其中也带有雇主的主观喜好,中标人数较少,多为一个或几个,因此竞争激烈。 -悬赏任务有哪些类型? 大型悬赏任务赏金数额巨大,中标者也较多,但参与人也很多,对于身有一技之长的威客来讲,悬赏任务十分适合。 -悬赏任务有哪些类型? 悬赏任务的类型主要包括:设计类、文案类、取名类、网站类、编程类、推广类等等。 -悬赏任务有哪些类型? 每一类所适合的威客人群不同,报酬的多少也不同,比如设计类的报酬就比较高,一般都几百到几千,而推广类的计件任务报酬比较少,一般也就几块钱,但花费的时间很少,技术要求也很低。 -悬赏任务有哪些类型? 1.注册—登陆 -悬赏任务有哪些类型? 2.点击“我要发悬赏”—按照发布流程及提示提交任务要求。 -悬赏任务有哪些类型? 悬赏模式选择->网站托管赏金模式。 -悬赏任务有哪些类型? 威客网站客服稍后会跟发布者联系确认任务要求。 -悬赏任务有哪些类型? 3.没有问题之后就可以预付赏金进行任务发布。 -悬赏任务有哪些类型? 4.会员参与并提交稿件。 -悬赏任务有哪些类型? 5.发布者需要跟会员互动(每个提交稿件的会员都可以),解决问题,完善稿件,初步筛选稿件。 -悬赏任务有哪些类型? 6.任务发布期结束,进入选稿期(在筛选的稿件中选择最后满意的) -悬赏任务有哪些类型? 7.发布者不满意现有稿件可选定一个会员修改至满意为止,或者加价延期重新开放任务进行征稿。 -悬赏任务有哪些类型? (重复第六步)没有问题后进入下一步。 -悬赏任务有哪些类型? 8:中标会员交源文件给发布者—发布者确认—任务结束—网站将赏金付给中标会员。 -悬赏任务有哪些类型? 1、任务发布者自由定价,自由确定悬赏时间,自由发布任务要求,自主确定中标会员和中标方案。 -悬赏任务有哪些类型? 2、任务发布者100%预付任务赏金,让竞标者坚信您的诚意和诚信。 -悬赏任务有哪些类型? 3、任务赏金分配原则:任务一经发布,网站收取20%发布费,中标会员获得赏金的80%。 -悬赏任务有哪些类型? 4、每个任务最终都会选定至少一个作品中标,至少一个竞标者获得赏金。 -悬赏任务有哪些类型? 5、任务发布者若未征集到满意作品,可以加价延期征集,也可让会员修改,会员也可以删除任务。 -悬赏任务有哪些类型? 6、任务发布者自己所在组织的任何人均不能以任何形式参加自己所发布的任务,一经发现则视为任务发布者委托威客网按照网站规则选稿。 -悬赏任务有哪些类型? 7、任务悬赏总金额低于100元(含100元)的任务,悬赏时间最多为7天。 -悬赏任务有哪些类型? 所有任务最长时间不超过30天(特殊任务除外),任务总金额不得低于50元。 -悬赏任务有哪些类型? 8、网赚类、注册类任务总金额不能低于300元人民币,计件任务每个稿件的平均单价不能低于1元人民币。 -悬赏任务有哪些类型? 9、延期任务只有3次加价机会,第1次加价不得低于任务金额的10%,第2次加价不得低于任务总金额的20%,第3次不得低于任务总金额的50%。 -悬赏任务有哪些类型? 每次延期不能超过15天,加价金额不低于50元,特殊任务可以适当加长。 -悬赏任务有哪些类型? 如果为计件任务,且不是网赚类任务,将免费延期,直至征集完规定数量的作品为止。 -悬赏任务有哪些类型? 10、如果威客以交接源文件要挟任务发布者,威客网将扣除威客相关信用值,并取消其中标资格,同时任务将免费延长相应的时间继续征集作品 。 -江湖令由哪些平台运营? 
《江湖令》是以隋唐时期为背景的RPG角色扮演类网页游戏。 -江湖令由哪些平台运营? 集角色扮演、策略、冒险等多种游戏元素为一体,画面精美犹如客户端游戏,融合历史、江湖、武功、恩仇多种特色元素,是款不可多得的精品游戏大作。 -江湖令由哪些平台运营? 由ya247平台、91wan游戏平台、2918、4399游戏平台、37wan、6711、兄弟玩网页游戏平台,49you、Y8Y9平台、8090游戏等平台运营的,由07177游戏网发布媒体资讯的网页游戏。 -江湖令由哪些平台运营? 网页游戏《江湖令》由51游戏社区运营,是以隋唐时期为背景的RPG角色扮演类网页游戏。 -江湖令由哪些平台运营? 集角色扮演、策略、冒险等多种游戏元素为一体,画面精美犹如客户端游戏,融合历史、江湖、武功、恩仇多种特色元素,是款不可多得的精品游戏大作… -江湖令由哪些平台运营? 背景故事: -江湖令由哪些平台运营? 隋朝末年,隋炀帝暴政,天下民不聊生,义军四起。 -江湖令由哪些平台运营? 在这动荡的时代中,百姓生活苦不堪言,多少人流离失所,家破人亡。 -江湖令由哪些平台运营? 天下三大势力---飞羽营、上清宫、侠隐岛,也值此机会扩张势力,派出弟子出来行走江湖。 -江湖令由哪些平台运营? 你便是这些弟子中的普通一员,在这群雄并起的年代,你将如何选择自己的未来。 -江湖令由哪些平台运营? 所有的故事,便从瓦岗寨/江都大营开始…… -江湖令由哪些平台运营? 势力: -江湖令由哪些平台运营? ①、飞羽营:【外功、根骨】 -江湖令由哪些平台运营? 南北朝时期,由北方政权创立的一个民间军事团体,经过多年的发展,逐渐成为江湖一大势力。 -江湖令由哪些平台运营? ②、上清宫:【外功、身法】 -江湖令由哪些平台运营? 道家圣地,宫中弟子讲求清静无为,以一种隐世的方式修炼,但身在此乱世,亦也不能独善其身。 -江湖令由哪些平台运营? ③、侠隐岛:【根骨、内力】 -江湖令由哪些平台运营? 位于偏远海岛上的一个世家,岛内弟子大多武功高强,但从不进入江湖行走,适逢乱世,现今岛主也决意作一翻作为。 -江湖令由哪些平台运营? 两大阵营: -江湖令由哪些平台运营? 义军:隋唐末期,百姓生活苦不堪言,有多个有志之士组成义军,对抗当朝暴君,希望建立一个适合百姓安居乐业的天地。 -江湖令由哪些平台运营? 隋军:战争一起即天下打乱,隋军首先要镇压四起的义军,同时在内部慢慢改变现有的朝廷,让天下再次恢复到昔日的安定。 -江湖令由哪些平台运营? 一、宠物品质 -江湖令由哪些平台运营? 宠物的品质分为:灵兽,妖兽,仙兽,圣兽,神兽 -江湖令由哪些平台运营? 二、宠物获取途径 -江湖令由哪些平台运营? 完成任务奖励宠物(其他途径待定)。 -江湖令由哪些平台运营? 三、宠物融合 -江湖令由哪些平台运营? 1、在主界面下方的【宠/骑】按钮进入宠物界面,再点击【融合】即可进入融合界面进行融合,在融合界面可选择要融合的宠物进行融合 -江湖令由哪些平台运营? 2、融合后主宠的形态不变; -江湖令由哪些平台运营? 3、融合后宠物的成长,品质,技能,经验,成长经验,等级都继承成长高的宠物; -江湖令由哪些平台运营? 4、融合宠物技能冲突,则保留成长值高的宠物技能,如果不冲突则叠加在空余的技能位置。 -请问土耳其足球超级联赛是什么时候成立的? 土耳其足球超级联赛(土耳其文:Türkiye 1. Süper Futbol Ligi)是土耳其足球协会管理的职业足球联赛,通常简称“土超”,也是土耳其足球联赛中最高级别。 -请问土耳其足球超级联赛是什么时候成立的? 目前,土超联赛队伍共有18支。 -请问土耳其足球超级联赛是什么时候成立的? 土耳其足球超级联赛 -请问土耳其足球超级联赛是什么时候成立的? 运动项目 足球 -请问土耳其足球超级联赛是什么时候成立的? 成立年份 1959年 -请问土耳其足球超级联赛是什么时候成立的? 参赛队数 18队 -请问土耳其足球超级联赛是什么时候成立的? 国家 土耳其 -请问土耳其足球超级联赛是什么时候成立的? 现任冠军 费内巴切足球俱乐部(2010-2011) -请问土耳其足球超级联赛是什么时候成立的? 夺冠最多队伍 费内巴切足球俱乐部(18次) -请问土耳其足球超级联赛是什么时候成立的? 土耳其足球超级联赛(Türkiye 1. Süper Futbol Ligi)是土耳其足球协会管理的职业足球联赛,通常简称「土超」,也是土耳其足球联赛中最高级别。 -请问土耳其足球超级联赛是什么时候成立的? 土超联赛队伍共有18支。 -请问土耳其足球超级联赛是什么时候成立的? 土超联赛成立于1959年,成立之前土耳其国有多个地区性联赛。 -请问土耳其足球超级联赛是什么时候成立的? 土超联赛成立后便把各地方联赛制度统一起来。 -请问土耳其足球超级联赛是什么时候成立的? 一般土超联赛由八月开始至五月结束,12月至1月会有歇冬期。 -请问土耳其足球超级联赛是什么时候成立的? 十八支球队会互相对叠,各有主场和作客两部分,采计分制。 -请问土耳其足球超级联赛是什么时候成立的? 联赛枋最底的三支球队会降到土耳其足球甲级联赛作赛。 -请问土耳其足球超级联赛是什么时候成立的? 由2005-06年球季起,土超联赛的冠、亚军会取得参加欧洲联赛冠军杯的资格。 -请问土耳其足球超级联赛是什么时候成立的? 成立至今土超联赛乃由两支著名球会所垄断──加拉塔萨雷足球俱乐部和费内巴切足球俱乐部,截至2009-2010赛季,双方各赢得冠军均为17次。 -请问土耳其足球超级联赛是什么时候成立的? 土超联赛共有18支球队,采取双循环得分制,每场比赛胜方得3分,负方0分,平局双方各得1分。 -请问土耳其足球超级联赛是什么时候成立的? 如果两支球队积分相同,对战成绩好的排名靠前,其次按照净胜球来决定;如果有三支以上的球队分数相同,则按照以下标准来确定排名:1、几支队伍间对战的得分,2、几支队伍间对战的净胜球数,3、总净胜球数。 -请问土耳其足球超级联赛是什么时候成立的? 联赛第1名直接参加下个赛季冠军杯小组赛,第2名参加下个赛季冠军杯资格赛第三轮,第3名进入下个赛季欧洲联赛资格赛第三轮,第4名进入下个赛季欧洲联赛资格赛第二轮,最后三名降入下个赛季的土甲联赛。 -请问土耳其足球超级联赛是什么时候成立的? 该赛季的土耳其杯冠军可参加下个赛季欧洲联赛资格赛第四轮,如果冠军已获得冠军杯资格,则亚军可参加下个赛季欧洲联赛资格赛第四轮,否则名额递补给联赛。 -请问土耳其足球超级联赛是什么时候成立的? 2010年/2011年 费内巴切 -请问土耳其足球超级联赛是什么时候成立的? 2009年/2010年 布尔萨体育(又译贝莎) -请问土耳其足球超级联赛是什么时候成立的? 2008年/2009年 贝西克塔斯 -请问土耳其足球超级联赛是什么时候成立的? 2007年/2008年 加拉塔萨雷 -请问土耳其足球超级联赛是什么时候成立的? 2006年/2007年 费内巴切 -请问土耳其足球超级联赛是什么时候成立的? 2005年/2006年 加拉塔沙雷 -请问土耳其足球超级联赛是什么时候成立的? 2004年/2005年 费内巴切(又译费伦巴治) -请问土耳其足球超级联赛是什么时候成立的? 2003年/2004年 费内巴切 -cid 作Customer IDentity解时是什么意思? ? CID 是 Customer IDentity 的简称,简单来说就是手机的平台版本. CID紧跟IMEI存储在手机的OTP(One Time Programmable)芯片中. CID 后面的数字代表的是索尼爱立信手机软件保护版本号,新的CID不断被使用,以用来防止手机被非索尼爱立信官方的维修程序拿来解锁/刷机/篡改 -cid 作Customer IDentity解时是什么意思? ? CID 是 Customer IDentity 的简称,简单来说就是手机的平台版本. CID紧跟IMEI存储在手机的OTP(One Time Programmable)芯片中. CID 后面的数字代表的是索尼爱立信手机软件保护版本号,新的CID不断被使用,以用来防止手机被非索尼爱立信官方的维修程序拿来解锁/刷机/篡改 -cid 作Customer IDentity解时是什么意思? ? (英)刑事调查局,香港警察的重案组 -cid 作Customer IDentity解时是什么意思? ? 
Criminal Investigation Department -cid 作Customer IDentity解时是什么意思? ? 佩枪: -cid 作Customer IDentity解时是什么意思? ? 香港警察的CID(刑事侦缉队),各区重案组的探员装备短管点38左轮手枪,其特点是便于收藏,而且不容易卡壳,重量轻,其缺点是装弹量少,只有6发,而且换子弹较慢,威力也一般,如果碰上54式手枪或者M9手枪明显处于下风。 -cid 作Customer IDentity解时是什么意思? ? 香港警察的“刑事侦查”(Criminal Investigation Department)部门,早于1983年起已经不叫做C.I.D.的了,1983年香港警察队的重整架构,撤销了C.I.D. ( Criminal Investigation Dept.) “刑事侦缉处”,将“刑事侦查”部门归入去“行动处”内,是“行动处”内的一个分支部门,叫“刑事部”( Crime Wing )。 -cid 作Customer IDentity解时是什么意思? ? 再于90年代的一次警队重整架构,香港警队成立了新的「刑事及保安处」,再将“刑事侦查”部门归入目前的「刑事及保安处」的“处”级单位,是归入这个“处”下的一个部门,亦叫“刑事部” ( Crime Wing ),由一个助理警务处长(刑事)领导。 -cid 作Customer IDentity解时是什么意思? ? 但是时至今天,CID虽已经是一个老旧的名称,香港市民、甚至香港警察都是习惯性的沿用这个历史上的叫法 . -cid 作Customer IDentity解时是什么意思? ? CID格式是美国Adobe公司发表的最新字库格式,它具有易扩充、速度快、兼容性好、简便、灵活等特点,已成为国内开发中文字库的热点,也为用户使用字库提供质量更好,数量更多的字体。 -cid 作Customer IDentity解时是什么意思? ? CID (Character identifier)就是字符识别码,在组成方式上分成CIDFont,CMap表两部分。 -cid 作Customer IDentity解时是什么意思? ? CIDFont文件即总字符集,包括了一种特定语言中所有常用的字符,把这些字符排序,它们在总字符集中排列的次序号就是各个字符的CID标识码(Index);CMap(Character Map)表即字符映像文件,将字符的编码(Code)映像到字符的CID标识码(Index)。 -cid 作Customer IDentity解时是什么意思? ? CID字库完全针对大字符集市场设计,其基本过程为:先根据Code,在CMap表查到Index,然后在CIDFont文件找到相应的字形数据。 -本町位于什么地方? 本条目记述台湾日治时期,各都市之本町。 -本町位于什么地方? 为台湾日治时期台北市之行政区,共分一~四丁目,在表町之西。 -本町位于什么地方? 以现在的位置来看,本町位于现台北市中正区的西北角,约位于忠孝西路一段往西至台北邮局东侧。 -本町位于什么地方? 再向南至开封街一段,沿此路线向西至开封街一段60号,顺60号到汉口街一段向东到现在华南银行总行附近画一条直线到衡阳路。 -本町位于什么地方? 再向东至重庆南路一段,由重庆南路一段回到原点这个范围内。 -本町位于什么地方? 另外,重庆南路一段在当时名为“本町通”。 -本町位于什么地方? 此地方自日治时期起,就是繁华的商业地区,当时也有三和银行、台北专卖分局、日本石油等重要商业机构。 -本町位于什么地方? 其中,专卖分局是战后二二八事件的主要起始点。 -本町位于什么地方? 台湾贮蓄银行(一丁目) -本町位于什么地方? 三和银行(二丁目) -本町位于什么地方? 专卖局台北分局(三丁目) -本町位于什么地方? 日本石油(四丁目) -本町位于什么地方? 为台湾日治时期台南市之行政区。 -本町位于什么地方? 范围包括清代旧街名枋桥头前、枋桥头后、鞋、草花、天公埕、竹仔、下大埕、帽仔、武馆、统领巷、大井头、内宫后、内南町。 -本町位于什么地方? 为清代台南城最繁华的区域。 -本町位于什么地方? 台南公会堂 -本町位于什么地方? 北极殿 -本町位于什么地方? 开基武庙 -本町位于什么地方? 町名改正 -本町位于什么地方? 这是一个与台湾相关的小作品。 -本町位于什么地方? 你可以通过编辑或修订扩充其内容。 -《行走的观点:埃及》的条形码是多少? 出版社: 上海社会科学院出版社; 第1版 (2006年5月1日) -《行走的观点:埃及》的条形码是多少? 丛书名: 时代建筑视觉旅行丛书 -《行走的观点:埃及》的条形码是多少? 条形码: 9787806818640 -《行走的观点:埃及》的条形码是多少? 尺寸: 18 x 13.1 x 0.7 cm -《行走的观点:埃及》的条形码是多少? 重量: 181 g -《行走的观点:埃及》的条形码是多少? 漂浮在沙与海市蜃楼之上的金字塔曾经是否是你的一个梦。 -《行走的观点:埃及》的条形码是多少? 埃及,这片蕴蓄了5000年文明的土地,本书为你撩开它神秘的纱。 -《行走的观点:埃及》的条形码是多少? 诸神、金字塔、神庙、狮身人面像、法老、艳后吸引着我们的注意力;缠绵悱恻的象形文字、医学、雕刻等留给我们的文明,不断引发我们对古代文明的惊喜和赞叹。 -《行走的观点:埃及》的条形码是多少? 尼罗河畔的奇异之旅,数千年的古老文明,尽收在你的眼底…… -《行走的观点:埃及》的条形码是多少? 本书集历史、文化、地理等知识于一体,并以优美、流畅文笔,简明扼要地阐述了埃及的地理环境、政治经济、历史沿革、文化艺术,以大量富有艺术感染力的彩色照片,生动形象地展示了埃及最具特色的名胜古迹、风土人情和自然风光。 -《行走的观点:埃及》的条形码是多少? 古埃及历史 -老挝人民军的工兵部队有几个营? 老挝人民军前身为老挝爱国战线领导的“寮国战斗部队”(即“巴特寮”),始建于1949年1月20日,1965年10月改名为老挝人民解放军,1982年7月改称现名。 -老挝人民军的工兵部队有几个营? 最高领导机构是中央国防和治安委员会,朱马里·赛雅颂任主席,隆再·皮吉任国防部长。 -老挝人民军的工兵部队有几个营? 实行义务兵役制,服役期最少18个月。[1] -老挝人民军的工兵部队有几个营? ?老挝军队在老挝社会中有较好的地位和保障,工资待遇比地方政府工作人员略高。 -老挝人民军的工兵部队有几个营? 武装部队总兵力约6万人,其中陆军约5万人,主力部队编为5个步兵师;空军2000多人;海军(内河巡逻部队)1000多人;部队机关院校5000人。[1] -老挝人民军的工兵部队有几个营? 老挝人民军军旗 -老挝人民军的工兵部队有几个营? 1991年8月14日通过的《老挝人民民主共和国宪法》第11条规定:国家执行保卫国防和维护社会安宁的政策。 -老挝人民军的工兵部队有几个营? 全体公民和国防力量、治安力量必须发扬忠于祖国、忠于人民的精神,履行保卫革命成果、保卫人民生命财产及和平劳动的任务,积极参加国家建设事业。 -老挝人民军的工兵部队有几个营? 最高领导机构是中央国防和治安委员会。 -老挝人民军的工兵部队有几个营? 主席由老挝人民革命党中央委员会总书记兼任。 -老挝人民军的工兵部队有几个营? 老挝陆军成立最早,兵力最多,约有5万人。 -老挝人民军的工兵部队有几个营? 其中主力部队步兵师5个、7个独立团、30多个营、65个独立连。 -老挝人民军的工兵部队有几个营? 地方部队30余个营及县属部队。 -老挝人民军的工兵部队有几个营? 地面炮兵2个团,10多个营。 -老挝人民军的工兵部队有几个营? 高射炮兵1个团9个营。 -老挝人民军的工兵部队有几个营? 导弹部队2个营。 -老挝人民军的工兵部队有几个营? 装甲兵7个营。 -老挝人民军的工兵部队有几个营? 特工部队6个营。 -老挝人民军的工兵部队有几个营? 通讯部队9个营。 -老挝人民军的工兵部队有几个营? 工兵部队6个营。 -老挝人民军的工兵部队有几个营? 基建工程兵2个团13个营。 -老挝人民军的工兵部队有几个营? 运输部队7个营。 -老挝人民军的工兵部队有几个营? 陆军的装备基本是中国和前苏联援助的装备和部分从抗美战争中缴获的美式装备。 -老挝人民军的工兵部队有几个营? 老挝内河部队总兵力约1700人,装备有内河船艇110多艘,编成4个艇队。 -老挝人民军的工兵部队有几个营? 
有芒宽、巴能、纳坎、他曲、南盖、巴色等8个基地。 -老挝人民军的工兵部队有几个营? 空军于1975年8月组建,现有2个团、11个飞行大队,总兵力约2000人。 -老挝人民军的工兵部队有几个营? 装备有各种飞机140架,其中主要由前苏联提供和从万象政权的皇家空军手中接管。 -老挝人民军的工兵部队有几个营? 随着军队建设质量的提高,老挝人民军对外军事合作步伐也日益扩大,近年来先后与俄罗斯、印度、马来西亚、越南、菲律宾等国拓展了军事交流与合作的内容。 -老挝人民军的工兵部队有几个营? 2003年1月,印度决定向老挝援助一批军事装备和物资,并承诺提供技术帮助。 -老挝人民军的工兵部队有几个营? 2003年6月,老挝向俄罗斯订购了一批新式防空武器;2003年4月,老挝与越南签署了越南帮助老挝培训军事指挥干部和特种部队以及完成军队通信系统改造等多项协议。 -《焚心之城》的主角是谁? 《焚心之城》[1] 为网络作家老子扛过枪创作的一部都市类小说,目前正在创世中文网连载中。 -《焚心之城》的主角是谁? 乡下大男孩薛城,是一个不甘于生活现状的混混,他混过、爱过、也深深地被伤害过。 -《焚心之城》的主角是谁? 本料此生当浑浑噩噩,拼搏街头。 -《焚心之城》的主角是谁? 高考的成绩却给了他一点渺茫的希望,二月后,大学如期吹响了他进城的号角。 -《焚心之城》的主角是谁? 繁华的都市,热血的人生,冷眼嘲笑中,他发誓再不做一个平常人! -《焚心之城》的主角是谁? 江北小城,黑河大地,他要行走过的每一个角落都有他的传说。 -《焚心之城》的主角是谁? 扯出一面旗,拉一帮兄弟,做男人,就要多一份担当,活一口傲气。 -《焚心之城》的主角是谁? (日期截止到2014年10月23日凌晨) -请问香港利丰集团是什么时候成立的? 香港利丰集团前身是广州的华资贸易 (1906 - 1949) ,利丰是香港历史最悠久的出口贸易商号之一。 -请问香港利丰集团是什么时候成立的? 于1906年,冯柏燎先生和李道明先生在广州创立了利丰贸易公司;是当时中国第一家华资的对外贸易出口商。 -请问香港利丰集团是什么时候成立的? 利丰于1906年创立,初时只从事瓷器及丝绸生意;一年之后,增添了其它的货品,包括竹器、藤器、玉石、象牙及其它手工艺品,包括烟花爆竹类别。 -请问香港利丰集团是什么时候成立的? 在早期的对外贸易,中国南方内河港因水深不足不能行驶远洋船,反之香港港口水深岸阔,占尽地利。 -请问香港利丰集团是什么时候成立的? 因此,在香港成立分公司的责任,落在冯柏燎先生的三子冯汉柱先生身上。 -请问香港利丰集团是什么时候成立的? 1937年12月28日,利丰(1937)有限公司正式在香港创立。 -请问香港利丰集团是什么时候成立的? 第二次世界大战期间,利丰暂停贸易业务。 -请问香港利丰集团是什么时候成立的? 1943年,随着创办人冯柏燎先生去世后,业务移交给冯氏家族第二代。 -请问香港利丰集团是什么时候成立的? 之后,向来不参与业务管理的合伙人李道明先生宣布退休,将所拥有的利丰股权全部卖给冯氏家族。 -请问香港利丰集团是什么时候成立的? 目前由哈佛冯家两兄弟William Fung , Victor Fung和CEO Bruce Rockowitz 管理。 -请问香港利丰集团是什么时候成立的? 截止到2012年,集团旗下有利亚﹝零售﹞有限公司、利和集团、利邦时装有限公司、利越时装有限公司、利丰贸易有限公司。 -请问香港利丰集团是什么时候成立的? 利亚(零售)连锁,业务包括大家所熟悉的:OK便利店、玩具〝反〞斗城和圣安娜饼屋;范围包括香港、台湾、新加坡、马来西亚、至中国大陆及东南亚其它市场逾600多家店 -请问香港利丰集团是什么时候成立的? 利和集团,IDS以专业物流服务为根基,为客户提供经销,物流,制造服务领域内的一系列服务项目。 -请问香港利丰集团是什么时候成立的? 业务网络覆盖大中华区,东盟,美国及英国,经营着90多个经销中心,在中国设有18个经销公司,10,000家现代经销门店。 -请问香港利丰集团是什么时候成立的? 利邦(上海)时装贸易有限公司为大中华区其中一家大型男士服装零售集团。 -请问香港利丰集团是什么时候成立的? 现在在中国大陆、香港、台湾和澳门收购经营11个包括Cerruti 1881,Gieves & Hawkes,Kent & curwen和D’urban 等中档到高档的男士服装品牌,全国有超过350间门店设于各一线城市之高级商场及百货公司。 -请问香港利丰集团是什么时候成立的? 利越(上海)服装商贸有限公司隶属于Branded Lifestyle,负责中国大陆地区LEO里奥(意大利)、GIBO捷宝(意大利)、UFFIZI古杰师(意大利)、OVVIO奥维路(意大利)、Roots绿适(加拿大,全球服装排名第四)品牌销售业务 -请问香港利丰集团是什么时候成立的? 利丰(贸易)1995年收购了英之杰采购服务,1999年收购太古贸易有限公司(Swire & Maclain) 和金巴莉有限公司(Camberley),2000年和2002年分别收购香港采购出口集团Colby Group及Janco Oversea Limited,大大扩张了在美国及欧洲的顾客群,自2008年经济危机起一直到现在,收购多家欧、美、印、非等地区的时尚品牌,如英国品牌Visage,仅2011年上半年6个月就完成26个品牌的收购。 -请问香港利丰集团是什么时候成立的? 2004年利丰与Levi Strauss & Co.签订特许经营协议 -请问香港利丰集团是什么时候成立的? 2005年利丰伙拍Daymon Worldwide为全球供应私有品牌和特许品牌 -请问香港利丰集团是什么时候成立的? 2006年收购Rossetti手袋业务及Oxford Womenswear Group 强化美国批发业务 -请问香港利丰集团是什么时候成立的? 2007年收购Tommy Hilfiher全球采购业务,收购CGroup、Peter Black International LTD、Regetta USA LLC和American Marketing Enterprice -请问香港利丰集团是什么时候成立的? 2008年收购Kent&Curwen全球特许经营权,收购Van Zeeland,Inc和Miles Fashion Group -请问香港利丰集团是什么时候成立的? 2009年收购加拿大休闲品牌Roots ,收购Wear Me Appearl,LLC。 -请问香港利丰集团是什么时候成立的? 与Hudson's Bay、Wolverine Worldwide Inc、Talbots、Liz Claiborne达成了采购协议 -请问香港利丰集团是什么时候成立的? 2010年收购Oxford apparel Visage Group LTD -请问香港利丰集团是什么时候成立的? 2011年一月收购土耳其Modium、美国女性时尚Beyond Productions,三月收购贸易公司Celissa 、玩具公司Techno Source USA, Inc.、卡通品牌产品TVMania和法国著名时装一线品牌Cerruti 1881,五月收购Loyaltex Apparel Ltd.、女装Hampshire Designers和英国彩妆Collection 2000,六月收购家私贸易Exim Designs Co., Ltd.,七月收购家庭旅行产业Union Rich USA, LLC和设计公司Lloyd Textile Fashion Company Limited,八月收购童装Fishman & Tobin和Crimzon Rose,九月收购家私贸易True Innovations, LLC、日用品企业Midway Enterprises和Wonderful World。 -请问香港利丰集团是什么时候成立的? 十二月与USPA – U.S. Polo Association签署授权协议。 -请问香港利丰集团是什么时候成立的? 利丰的精神:积极进取,不断认识并争取有利于客户和自身进步的机会;以行动为主导,对客户、供应商及职工的需求作出快速的决定。 -请问香港利丰集团是什么时候成立的? 利丰的最终目标:在产品采购、销售、流转的各环节建立全球性队伍提供多元化服务,利丰成员有效合作,共达目标。 -如何使魔兽变种akt不被查杀? 
Trojan/PSW.Moshou.akt“魔兽”变种akt是“魔兽”木马家族的最新成员之一,采用Delphi 6.0-7.0编写,并经过加壳处理。 -如何使魔兽变种akt不被查杀? “魔兽”变种akt运行后,自我复制到被感染计算机的指定目录下。 -如何使魔兽变种akt不被查杀? 修改注册表,实现木马开机自动运行。 -如何使魔兽变种akt不被查杀? 自我注入到被感染计算机的“explorer.exe”、“notepad.exe”等用户级权限的进程中加载运行,隐藏自我,防止被查杀。 -如何使魔兽变种akt不被查杀? 在后台秘密监视用户打开的窗口标题,盗取网络游戏《魔兽世界》玩家的游戏帐号、游戏密码、角色等级、装备信息、金钱数量等信息,并在后台将窃取到的玩家信息发送到骇客指定的远程服务器上,致使玩家游戏帐号、装备物品、金钱等丢失,给游戏玩家造成非常大的损失。 -丙种球蛋白能预防什么病情? 丙种球蛋白预防传染性肝炎,预防麻疹等病毒性疾病感染,治疗先天性丙种球蛋白缺乏症 ,与抗生素合并使用,可提高对某些严重细菌性和病毒性疾病感染的疗效。 -丙种球蛋白能预防什么病情? 中文简称:“丙球” -丙种球蛋白能预防什么病情? 英文名称:γ-globulin、gamma globulin -丙种球蛋白能预防什么病情? 【别名】 免疫血清球蛋白,普通免疫球蛋白,人血丙种球蛋白,丙种球蛋白,静脉注射用人免疫球蛋白(pH4) -丙种球蛋白能预防什么病情? 注:由于人血中的免疫球蛋白大多数为丙种球蛋白(γ-球蛋白),有时丙种球蛋白也被混称为“免疫球蛋白”(immunoglobulin) 。 -丙种球蛋白能预防什么病情? 冻干制剂应为白色或灰白色的疏松体,液体制剂和冻干制剂溶解后,溶液应为接近无色或淡黄色的澄明液体,微带乳光。 -丙种球蛋白能预防什么病情? 但不应含有异物或摇不散的沉淀。 -丙种球蛋白能预防什么病情? 注射丙种球蛋白是一种被动免疫疗法。 -丙种球蛋白能预防什么病情? 它是把免疫球蛋白内含有的大量抗体输给受者,使之从低或无免疫状态很快达到暂时免疫保护状态。 -丙种球蛋白能预防什么病情? 由于抗体与抗原相互作用起到直接中和毒素与杀死细菌和病毒。 -丙种球蛋白能预防什么病情? 因此免疫球蛋白制品对预防细菌、病毒性感染有一定的作用[1]。 -丙种球蛋白能预防什么病情? 人免疫球蛋白的生物半衰期为16~24天。 -丙种球蛋白能预防什么病情? 1、丙种球蛋白[2]含有健康人群血清所具有的各种抗体,因而有增强机体抵抗力以预防感染的作用。 -丙种球蛋白能预防什么病情? 2、主要治疗先天性丙种球蛋白缺乏症和免疫缺陷病 -丙种球蛋白能预防什么病情? 3、预防传染性肝炎,如甲型肝炎和乙型肝炎等。 -丙种球蛋白能预防什么病情? 4、用于麻疹、水痘、腮腺炎、带状疱疹等病毒感染和细菌感染的防治 -丙种球蛋白能预防什么病情? 5、也可用于哮喘、过敏性鼻炎、湿疹等内源性过敏性疾病。 -丙种球蛋白能预防什么病情? 6、与抗生素合并使用,可提高对某些严重细菌性和病毒性疾病感染的疗效。 -丙种球蛋白能预防什么病情? 7、川崎病,又称皮肤粘膜淋巴结综合征,常见于儿童,丙种球蛋白是主要的治疗药物。 -丙种球蛋白能预防什么病情? 1、对免疫球蛋白过敏或有其他严重过敏史者。 -丙种球蛋白能预防什么病情? 2、有IgA抗体的选择性IgA缺乏者。 -丙种球蛋白能预防什么病情? 3、发烧患者禁用或慎用。 -浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (1997年9月1日浙江省第八届人民代表大会常务委员会第三十九次会议通过 1997年9月9日浙江省第八届人民代表大会常务委员会公告第六十九号公布自公布之日起施行) -浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 为了保护人的生命和健康,发扬人道主义精神,促进社会发展与和平进步事业,根据《中华人民共和国红十字会法》,结合本省实际,制定本办法。 -浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 本省县级以上按行政区域建立的红十字会,是中国红十字会的地方组织,是从事人道主义工作的社会救助团体,依法取得社会团体法人资格,设置工作机构,配备专职工作人员,依照《中国红十字会章程》独立自主地开展工作。 -浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 全省性行业根据需要可以建立行业红十字会,配备专职或兼职工作人员。 -浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 街道、乡(镇)、机关、团体、学校、企业、事业单位根据需要,可以依照《中国红十字会章程》建立红十字会的基层组织。 -浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 上级红十字会指导下级红十字会的工作。 -浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 县级以上地方红十字会指导所在行政区域行业红十字会和基层红十字会的工作。 -浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 人民政府对红十字会给予支持和资助,保障红十字会依法履行职责,并对其活动进行监督;红十字会协助人民政府开展与其职责有关的活动。 -浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 全社会都应当关心和支持红十字事业。 -浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 本省公民和单位承认《中国红十字会章程》并缴纳会费的,可以自愿参加红十字会,成为红十字会的个人会员或团体会员。 -浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 个人会员由本人申请,基层红十字会批准,发给会员证;团体会员由单位申请,县级以上红十字会批准,发给团体会员证。 -浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 个人会员和团体会员应当遵守《中华人民共和国红十字会法》和《中国红十字会章程》,热心红十字事业,履行会员的义务,并享有会员的权利。 -浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 县级以上红十字会理事会由会员代表大会民主选举产生。 -浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 理事会民主选举产生会长和副会长;根据会长提名,决定秘书长、副秘书长人选。 -浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 县级以上红十字会可以设名誉会长、名誉副会长和名誉理事,由同级红十字会理事会聘请。 -浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 省、市(地)红十字会根据独立、平等、互相尊重的原则,发展同境外、国外地方红十字会和红新月会的友好往来和合作关系。 -浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 红十字会履行下列职责: -浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (一)宣传、贯彻《中华人民共和国红十字会法》和本办法; -浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (二)开展救灾的准备工作,筹措救灾款物;在自然灾害和突发事件中,对伤病人员和其他受害者进行救助; -浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (三)普及卫生救护和防病知识,进行初级卫生救护培训,对交通、电力、建筑、矿山等容易发生意外伤害的单位进行现场救护培训; -浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (四)组织群众参加现场救护; -浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 
(五)参与输血献血工作,推动无偿献血; -浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (六)开展红十字青少年活动; -浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (七)根据中国红十字会总会部署,参加国际人道主义救援工作; -浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (八)依照国际红十字和红新月运动的基本原则,完成同级人民政府和上级红十字会委托的有关事宜; -浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (九)《中华人民共和国红十宇会法》和《中国红十字会章程》规定的其他职责。 -浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? 第八条 红十字会经费的主要来源: -浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (一)红十字会会员缴纳的会费; -浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (二)接受国内外组织和个人捐赠的款物; -浙江省实施《中华人民共和国红十字会法》办法在浙江省第八届人民代表大会常务委员会第几次会议通过的? (三)红十字会的动产、不动产以及兴办社会福利事业和经济实体的收入; -宝湖庭院绿化率多少? 建发·宝湖庭院位于银川市金凤区核心地带—正源南街与长城中路交汇处向东500米。 -宝湖庭院绿化率多少? 项目已于2012年4月开工建设,总占地约4.2万平方米,总建筑面积约11.2万平方米,容积率2.14,绿化率35%,预计可入住630户。 -宝湖庭院绿化率多少? “建发·宝湖庭院”是银川建发集团股份有限公司继“建发·宝湖湾”之后,在宝湖湖区的又一力作。 -宝湖庭院绿化率多少? 项目周边发展成熟,东有唐徕渠景观水道,西临银川市交通主干道正源街;南侧与宝湖湿地公园遥相呼应。 -宝湖庭院绿化率多少? “宝湖庭院”项目公共交通资源丰富:15路、21路、35路、38路、43路公交车贯穿银川市各地,出行便利。 -宝湖庭院绿化率多少? 距离新百良田购物广场约1公里,工人疗养院600米,宝湖公园1公里,唐徕渠景观水道500米。 -宝湖庭院绿化率多少? 项目位置优越,购物、餐饮、医疗、交通、休闲等生活资源丰富。[1] -宝湖庭院绿化率多少? 建发·宝湖庭院建筑及景观设置传承建发一贯“简约、大气”的风格:搂间距宽广,确保每一座楼宇视野开阔通透。 -宝湖庭院绿化率多少? 楼宇位置错落有置,外立面设计大气沉稳别致。 -宝湖庭院绿化率多少? 项目内部休闲绿地、景观小品点缀其中,道路及停车系统设计合理,停车及通行条件便利。 -宝湖庭院绿化率多少? 社区会所、幼儿园、活动室、医疗服务中心等生活配套一应俱全。 -宝湖庭院绿化率多少? 行政区域:金凤区 -大月兔(中秋艺术作品)的作者还有哪些代表作? 大月兔是荷兰“大黄鸭”之父弗洛伦泰因·霍夫曼打造的大型装置艺术作品,该作品首次亮相于台湾桃园大园乡海军基地,为了迎接中秋节的到来;在展览期间,海军基地也首次对外开放。 -大月兔(中秋艺术作品)的作者还有哪些代表作? 霍夫曼觉得中国神话中捣杵的玉兔很有想象力,于是特别创作了“月兔”,这也是“月兔”新作第一次展出。[1] -大月兔(中秋艺术作品)的作者还有哪些代表作? ?2014年9月15日因工人施工不慎,遭火烧毁。[2] -大月兔(中秋艺术作品)的作者还有哪些代表作? “大月兔”外表采用的杜邦防水纸、会随风飘动,内部以木材加保丽龙框架支撑做成。 -大月兔(中秋艺术作品)的作者还有哪些代表作? 兔毛用防水纸做成,材质完全防水,不怕日晒雨淋。[3 -大月兔(中秋艺术作品)的作者还有哪些代表作? -4] -大月兔(中秋艺术作品)的作者还有哪些代表作? 25米的“月兔”倚靠在机 -大月兔(中秋艺术作品)的作者还有哪些代表作? 堡上望着天空,像在思考又像赏月。 -大月兔(中秋艺术作品)的作者还有哪些代表作? 月兔斜躺在机堡上,意在思考生命、边做白日梦,编织自己的故事。[3] -大月兔(中秋艺术作品)的作者还有哪些代表作? 台湾桃园大园乡海军基地也首度对外开放。 -大月兔(中秋艺术作品)的作者还有哪些代表作? 428公顷的海军基地中,地景艺术节使用约40公顷,展场包括过去军机机堡、跑道等,由于这处基地过去警备森严,不对外开放,这次结合地景艺术展出,也可一窥过去是黑猫中队基地的神秘面纱。 -大月兔(中秋艺术作品)的作者还有哪些代表作? 2014年9月2日,桃园县政府文化局举行“踩线团”,让 -大月兔(中秋艺术作品)的作者还有哪些代表作? 大月兔 -大月兔(中秋艺术作品)的作者还有哪些代表作? 各项地景艺术作品呈现在媒体眼中,虽然“月兔”仍在进行最后的细节赶工,但横躺在机堡上的“月兔”雏形已经完工。[5] -大月兔(中秋艺术作品)的作者还有哪些代表作? “这么大”、“好可爱呦”是不少踩线团成员对“月兔”的直觉;尤其在蓝天的衬托及前方绿草的组合下,呈现犹如真实版的爱丽丝梦游仙境。[6] -大月兔(中秋艺术作品)的作者还有哪些代表作? 霍夫曼的作品大月兔,“从平凡中,创作出不平凡的视觉”,创造出观赏者打从心中油然而生的幸福感,拉近观赏者的距离。[6] -大月兔(中秋艺术作品)的作者还有哪些代表作? 2014年9月15日早 -大月兔(中秋艺术作品)的作者还有哪些代表作? 上,施工人员要将月兔拆解,搬离海军基地草皮时,疑施工拆除的卡车,在拆除过程,故障起火,起火的卡车不慎延烧到兔子,造成兔子起火燃烧,消防队员即刻抢救,白色的大月兔立即变成焦黑的火烧兔。[7] -大月兔(中秋艺术作品)的作者还有哪些代表作? 桃园县府表示相当遗憾及难过,也不排除向包商求偿,也已将此事告知霍夫曼。[2] -大月兔(中秋艺术作品)的作者还有哪些代表作? ?[8] -大月兔(中秋艺术作品)的作者还有哪些代表作? 弗洛伦泰因·霍夫曼,荷兰艺术家,以在公共空间创作巨大造型 -大月兔(中秋艺术作品)的作者还有哪些代表作? 物的艺术项目见长。 -大月兔(中秋艺术作品)的作者还有哪些代表作? 代表作品包括“胖猴子”(2010年在巴西圣保罗展出)、“大黄兔”(2011年在瑞典厄勒布鲁展出)、粉红猫(2014年5月在上海亮相)、大黄鸭(Rubber Duck)、月兔等。 -英国耆卫保险公司有多少保险客户? 英国耆卫保险公司(Old Mutual plc)成立于1845年,一直在伦敦证券交易所(伦敦证券交易所:OML)作第一上市,也是全球排名第32位(按营业收入排名)的保险公司(人寿/健康)。 -英国耆卫保险公司有多少保险客户? 公司是全球财富500强公司之一,也是被列入英国金融时报100指数的金融服务集团之一。 -英国耆卫保险公司有多少保险客户? Old Mutual 是一家国际金融服务公司,拥有近320万个保险客户,240万个银行储户,270,000个短期保险客户以及700,000个信托客户 -英国耆卫保险公司有多少保险客户? 英国耆卫保险公司(Old Mutual)是一家国际金融服务公司,总部设在伦敦,主要为全球客户提供长期储蓄的解决方案、资产管理、短期保险和金融服务等,目前业务遍及全球34个国家。[1] -英国耆卫保险公司有多少保险客户? 主要包括人寿保险,资产管理,银行等。 -英国耆卫保险公司有多少保险客户? 1845年,Old Mutual在好望角成立。 -英国耆卫保险公司有多少保险客户? 1870年,董事长Charles Bell设计了Old Mutual公司的标记。 -英国耆卫保险公司有多少保险客户? 1910年,南非从英联邦独立出来。 -英国耆卫保险公司有多少保险客户? Old Mutual的董事长John X. Merriman被选为国家总理。 -英国耆卫保险公司有多少保险客户? 1927年,Old Mutual在Harare成立它的第一个事务所。 -英国耆卫保险公司有多少保险客户? 1960年,Old Mutual在南非成立了Mutual Unit信托公司,用来管理公司的信托业务。 -英国耆卫保险公司有多少保险客户? 1970年,Old Mutual的收入超过100百万R。 -英国耆卫保险公司有多少保险客户? 
1980年,Old Mutual成为南非第一大人寿保险公司,年收入达10亿R。 -英国耆卫保险公司有多少保险客户? 1991年,Old Mutual在美国财富周刊上评选的全球保险公司中名列第38位。 -英国耆卫保险公司有多少保险客户? 1995年,Old Mutual在美国波士顿建立投资顾问公司,同年、又在香港和Guernsey建立事务所。 -英国耆卫保险公司有多少保险客户? 作为一项加强与其母公司联系的举措,OMNIA公司(百慕大)荣幸的更名为Old Mutual 公司(百慕大) 。 -英国耆卫保险公司有多少保险客户? 这一新的名称和企业识别清晰地展示出公司成为其世界金融机构合作伙伴强有力支持的决心。 -英国耆卫保险公司有多少保险客户? 2003 年4月,该公司被Old Mutual plc公司收购,更名为Sage Life(百慕大)公司并闻名于世,公司为Old Mutual公司提供了一个新的销售渠道,补充了其现有的以美元计价的产品线和分销系统。 -英国耆卫保险公司有多少保险客户? 达到了一个重要里程碑是公司成功的一个例证: 2005年6月3日公司资产超过10亿美元成为公司的一个主要里程碑,也是公司成功的一个例证。 -英国耆卫保险公司有多少保险客户? Old Mutual (百慕大)为客户提供一系列的投资产品。 -英国耆卫保险公司有多少保险客户? 在其开放的结构下,客户除了能够参与由Old Mutual会员管理的方案外,还能够参与由一些世界顶尖投资机构提供的投资选择。 -英国耆卫保险公司有多少保险客户? 首席执行官John Clifford对此发表评论说:“过去的两年对于Old Mutual家族来说是稳固发展的两年,更名是迫在眉睫的事情。 -英国耆卫保险公司有多少保险客户? 通过采用其名字和形象上的相似,Old Mutual (百慕大)进一步强化了与母公司的联系。” -英国耆卫保险公司有多少保险客户? Clifford补充道:“我相信Old Mutual全球品牌认可度和Old Mutual(百慕大)产品专业知识的结合将在未来的日子里进一步推动公司的成功。” -英国耆卫保险公司有多少保险客户? 随着公司更名而来的是公司网站的全新改版,设计投资选择信息、陈述、销售方案、营销材料和公告板块。 -英国耆卫保险公司有多少保险客户? 在美国购买不到OMNIA投资产品,该产品也不向美国公民或居民以及百慕大居民提供。 -英国耆卫保险公司有多少保险客户? 这些产品不对任何要约未得到批准的区域中的任何人,以及进行此要约或询价为非法行为的个人构成要约或询价。 -英国耆卫保险公司有多少保险客户? 关于Old Mutual(百慕大)公司 -英国耆卫保险公司有多少保险客户? Old Mutual(百慕大)公司总部位于百慕大,公司面向非美国居民及公民以及非百慕大居民,通过遍布世界的各个市场的金融机构开发和销售保险和投资方案。 -英国耆卫保险公司有多少保险客户? 这些方案由Old Mutual(百慕大)公司直接做出,向投资者提供各种投资选择和战略,同时提供死亡和其他受益保证。 -谁知道北京的淡定哥做了什么? 尼日利亚足球队守门员恩耶马被封淡定哥,原因是2010年南非世界杯上1:2落后希腊队时,对方前锋已经突破到禁区,其仍头依门柱发呆,其从容淡定令人吃惊。 -谁知道北京的淡定哥做了什么? 淡定哥 -谁知道北京的淡定哥做了什么? 在2010年6月17日的世界杯赛场上,尼日利亚1比2不敌希腊队,但尼日利亚门将恩耶马(英文名:Vincent Enyeama)在赛场上的“淡定”表现令人惊奇。 -谁知道北京的淡定哥做了什么? 随后,网友将赛场照片发布于各大论坛,恩耶马迅速窜红,并被网友称为“淡定哥”。 -谁知道北京的淡定哥做了什么? 淡定哥 -谁知道北京的淡定哥做了什么? 从网友上传得照片中可以看到,“淡定哥”在面临对方前锋突袭至小禁区之时,还靠在球门柱上发呆,其“淡定”程度的确非一般人所能及。 -谁知道北京的淡定哥做了什么? 恩耶马是尼日利亚国家队的主力守门员,目前效力于以色列的特拉维夫哈普尔队。 -谁知道北京的淡定哥做了什么? 1999年,恩耶马在尼日利亚国内的伊波姆星队开始职业生涯,后辗转恩伊姆巴、Iwuanyanwu民族等队,从07年开始,他为特拉维夫效力。 -谁知道北京的淡定哥做了什么? 恩耶马的尼日利亚国脚生涯始于2002年,截至2010年1月底,他为国家队出场已超过50次。 -谁知道北京的淡定哥做了什么? 当地时间2011年1月4日,国际足球历史与统计协会(IFFHS)公布了2010年度世界最佳门将,恩耶马(尼日利亚,特拉维夫夏普尔)10票排第十一 -谁知道北京的淡定哥做了什么? 此词经国家语言资源监测与研究中心等机构专家审定入选2010年年度新词语,并收录到《中国语言生活状况报告》中。 -谁知道北京的淡定哥做了什么? 提示性释义:对遇事从容镇定、处变不惊的男性的戏称。 -谁知道北京的淡定哥做了什么? 例句:上海现“淡定哥”:百米外爆炸他仍专注垂钓(2010年10月20日腾讯网http://news.qq.com/a/20101020/000646.htm) -谁知道北京的淡定哥做了什么? 2011年度新人物 -谁知道北京的淡定哥做了什么? 1、淡定哥(北京) -谁知道北京的淡定哥做了什么? 7月24日傍晚,北京市出现大范围降雨天气,位于通州北苑路出现积水,公交车也难逃被淹。 -谁知道北京的淡定哥做了什么? 李欣摄图片来源:新华网一辆私家车深陷积水,车主索性盘坐在自己的汽车上抽烟等待救援。 -谁知道北京的淡定哥做了什么? 私家车主索性盘坐在自己的车上抽烟等待救援,被网友称“淡定哥” -谁知道北京的淡定哥做了什么? 2、淡定哥——林峰 -谁知道北京的淡定哥做了什么? 在2011年7月23日的动车追尾事故中,绍兴人杨峰(@杨峰特快)在事故中失去了5位亲人:怀孕7个月的妻子、未出世的孩子、岳母、妻姐和外甥女,他的岳父也在事故中受伤正在治疗。 -谁知道北京的淡定哥做了什么? 他披麻戴孝出现在事故现场,要求将家人的死因弄个明白。 -谁知道北京的淡定哥做了什么? 但在第一轮谈判过后,表示:“请原谅我,如果我再坚持,我将失去我最后的第六个亲人。” -谁知道北京的淡定哥做了什么? 如果他继续“纠缠”铁道部,他治疗中的岳父将会“被死亡”。 -谁知道北京的淡定哥做了什么? 很多博友就此批评杨峰,并讽刺其为“淡定哥”。 -071型船坞登陆舰的北约代号是什么? 071型船坞登陆舰(英语:Type 071 Amphibious Transport Dock,北约代号:Yuzhao-class,中文:玉昭级,或以首舰昆仑山号称之为昆仑山级船坞登陆舰),是中国人民解放军海军隶下的大型多功能两栖船坞登陆舰,可作为登陆艇的母舰,用以运送士兵、步兵战车、主战坦克等展开登陆作战,也可搭载两栖车辆,具备大型直升机起降甲板及操作设施。 -071型船坞登陆舰的北约代号是什么? 071型两栖登陆舰是中国首次建造的万吨级作战舰艇,亦为中国大型多功能两栖舰船的开山之作,也可以说是中国万吨级以上大型作战舰艇的试验之作,该舰的建造使中国海军的两栖舰船实力有了质的提升。 -071型船坞登陆舰的北约代号是什么? 在本世纪以前中国海军原有的两栖舰队以一 -071型船坞登陆舰的北约代号是什么? 早期071模型 -071型船坞登陆舰的北约代号是什么? 千至四千吨级登陆舰为主要骨干,这些舰艇吨位小、筹载量有限,直升机操作能力非常欠缺,舰上自卫武装普遍老旧,对于现代化两栖登陆作战可说有很多不足。 -071型船坞登陆舰的北约代号是什么? 为了应对新时期的国际国内形势,中国在本世纪初期紧急强化两栖作战能力,包括短时间内密集建造072、074系列登陆舰,同时也首度设计一种新型船坞登陆舰,型号为071。[1] -071型船坞登陆舰的北约代号是什么? 在两栖作战行动中,这些舰只不得不采取最危险的 -071型船坞登陆舰的北约代号是什么? 舾装中的昆仑山号 -071型船坞登陆舰的北约代号是什么? 敌前登陆方式实施两栖作战行动,必须与敌人预定阻击力量进行面对面的战斗,在台湾地区或者亚洲其他国家的沿海,几乎没有可用而不设防的海滩登陆地带,并且各国或者地区的陆军在战时,可能会很快控制这些易于登陆的海难和港口,这样就限制住了中国海军两栖登陆部队的实际登陆作战能力。 -071型船坞登陆舰的北约代号是什么? 
071型登陆舰正是为了更快和更多样化的登陆作战而开发的新型登陆舰艇。[2] -071型船坞登陆舰的北约代号是什么? 071型两栖船坞登陆舰具有十分良好的整体隐身能力, -071型船坞登陆舰的北约代号是什么? 071型概念图 -071型船坞登陆舰的北约代号是什么? 该舰外部线条简洁干练,而且舰体外形下部外倾、上部带有一定角度的内倾,从而形成雷达隐身性能良好的菱形横剖面。 -071型船坞登陆舰的北约代号是什么? 舰体为高干舷平甲板型,长宽比较小,舰身宽满,采用大飞剪型舰首及楔形舰尾,舰的上层建筑位于舰体中间部位,后部是大型直升机甲板,适航性能非常突出。 -071型船坞登陆舰的北约代号是什么? 顶甲板上各类电子设备和武器系统布局十分简洁干净,各系统的突出物很少。 -071型船坞登陆舰的北约代号是什么? 该舰的两座烟囱实行左右分布式设置在舰体两侧,既考虑了隐身特点,也十分新颖。[3] -071型船坞登陆舰的北约代号是什么? 1号甲板及上层建筑物主要设置有指挥室、控 -071型船坞登陆舰的北约代号是什么? 舰尾俯视 -071型船坞登陆舰的北约代号是什么? 制舱、医疗救护舱及一些居住舱,其中医疗救护舱设置有完备的战场救护设施,可以在舰上为伤病员提供紧急手术和野战救护能力。 -071型船坞登陆舰的北约代号是什么? 2号甲板主要是舰员和部分登陆人员的居住舱、办公室及厨房。 -071型船坞登陆舰的北约代号是什么? 主甲板以下则是登陆舱,分前后两段,前段是装甲车辆储存舱,共两层,可以储存登陆装甲车辆和一些其它物资,在进出口处还设有一小型升降机,用于两层之间的移动装卸用。 -071型船坞登陆舰的北约代号是什么? 前段车辆储存舱外壁左右各设有一折叠式装载舱门,所有装载车辆在码头可通过该门直接装载或者登陆上岸。 -071型船坞登陆舰的北约代号是什么? 后段是一个巨型船坞登陆舱,总长约70米,主要用来停泊大小型气垫登陆艇、机械登陆艇或车辆人员登陆艇。[4] -071型船坞登陆舰的北约代号是什么? 自卫武装方面,舰艏设有一门PJ-26型76mm舰炮( -071型船坞登陆舰的北约代号是什么? 井冈山号舰首主炮 -071型船坞登陆舰的北约代号是什么? 俄罗斯AK-176M的中国仿制版,亦被054A采用) , 四具与052B/C相同的726-4 18联装干扰弹发射器分置于舰首两侧以及上层结构两侧,近迫防御则依赖四座布置于上层结构的AK-630 30mm防空机炮 。 -071型船坞登陆舰的北约代号是什么? 原本071模型的舰桥前方设有一座八联装海红-7短程防空导弹发射器,不过071首舰直到出海试航与2009年4月下旬的海上阅兵式中,都未装上此一武器。 -071型船坞登陆舰的北约代号是什么? 电子装备方面, 舰桥后方主桅杆顶配置一具363S型E/F频2D对空/平面搜索雷达 、一具Racal Decca RM-1290 I频导航雷达,后桅杆顶装备一具拥有球型外罩的364型(SR-64)X频2D对空/对海搜索雷达,此外还有一具LR-66C舰炮射控雷达、一具负责导引AK-630机炮的TR-47C型火炮射控雷达等。[5] -071型船坞登陆舰的北约代号是什么? 071型自卫武装布置 -071型船坞登陆舰的北约代号是什么? 071首舰昆仑山号于2006年6月开 -我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 竹溪县人大常委会办公室:承担人民代表大会会议、常委会会议、主任会议和常委会党组会议(简称“四会”)的筹备和服务工作。 -我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责常委会组成人员视察活动的联系服务工作。 -我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 受主任会议委托,拟定有关议案草案。 -我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 承担常委会人事任免的具体工作,负责机关人事管理和离退休干部的管理与服务。 -我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 承担县人大机关的行政事务和后勤保障工作,负责机关的安全保卫、文电处理、档案、保密、文印工作。 -我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 承担县人大常委会同市人大常委会及乡镇人大的工作联系。 -我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责信息反馈工作。 -我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 了解宪法、法律、法规和本级人大及其常委会的决议、决定实施情况及常委会成员提出建议办理情况,及时向常委会和主任会议报告。 -我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 承担人大宣传工作,负责人大常委会会议宣传的组织和联系。 -我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 组织协调各专门工作委员会开展工作。 -我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 承办上级交办的其他工作。 -我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 办公室下设五个科,即秘书科、调研科、人事任免科、综合科、老干部科。 -我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 教科文卫工作委员会:负责人大教科文卫工作的日常联系、督办、信息收集反馈和业务指导工作。 -我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责教科文卫方面法律法规贯彻和人大工作情况的宣传、调研工作。 -我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 承担人大常委会教科文卫方面会议议题调查的组织联系和调研材料的起草工作。 -我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 承担教科文卫方面规范性备案文件的初审工作,侧重对教科文卫行政执法个案监督业务承办工作。 -我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责常委会组成人员和人大代表对教科文卫工作方面检查、视察的组织联系工作。 -我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 承办上级交办的其他工作。 -我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 代表工作委员会:负责与县人大代表和上级人大代表的联系、情况收集交流工作。 -我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责《代表法》的宣传贯彻和贯彻实施情况的调查研究工作。 -我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责县人大代表法律法规和人民代表大会制度知识学习的组织和指导工作。 -我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责常委会主任、副主任和委员走访联系人大代表的组织、联系工作。 -我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责组织人大系统的干部培训。 -我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责乡镇人大主席团工作的联系和指导。 -我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责人大代表建议、批评和意见办理工作的联系和督办落实。 -我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责人大代表开展活动的组织、联系工作。 -我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 承办上级交办的其他工作。 -我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 财政经济工作委员会:负责人大财政经济工作的日常联系、督办、信息收集反馈和业务指导工作。 -我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 负责财政经济方面法律法规贯彻和人大工作情况的宣传、调研工作。 -我很好奇竹溪县人大常委会财政经济工作委员会是负责做什么的? 对国民经济计划和财政预算编制情况进行初审。 -我想知道武汉常住人口有多少? 武汉,简称“汉”,湖北省省会。 -我想知道武汉常住人口有多少? 它是武昌、汉口、汉阳三镇统称。 -我想知道武汉常住人口有多少? 世界第三大河长江及其最长支流汉江横贯市区,将武汉一分为三,形成武昌、汉口、汉阳,三镇跨江鼎立的格局。 -我想知道武汉常住人口有多少? 唐朝诗人李白在此写下“黄鹤楼中吹玉笛,江城五月落梅花”,因此武汉自古又称“江城”。 -我想知道武汉常住人口有多少? 武汉是中国15个副省级城市之一,全国七大中心城市之一,全市常住人口858万人。 -我想知道武汉常住人口有多少? 华中地区最大都市,华中金融中心、交通中心、文化中心,长江中下游特大城市。 -我想知道武汉常住人口有多少? 武汉城市圈的中心城市。 -我想知道武汉常住人口有多少? 
[3]武昌、汉口、汉阳三地被俗称武汉三镇。 -我想知道武汉常住人口有多少? 武汉西与仙桃市、洪湖市相接,东与鄂州市、黄石市接壤,南与咸宁市相连,北与孝感市相接,形似一只自西向东的蝴蝶形状。 -我想知道武汉常住人口有多少? 在中国经济地理圈内,武汉处于优越的中心位置是中国地理上的“心脏”,故被称为“九省通衢”之地。 -我想知道武汉常住人口有多少? 武汉市历史悠久,古有夏汭、鄂渚之名。 -我想知道武汉常住人口有多少? 武汉地区考古发现的历史可以上溯距今6000年的新石器时代,其考古发现有东湖放鹰台遗址的含有稻壳的红烧土、石斧、石锛以及鱼叉。 -我想知道武汉常住人口有多少? 市郊黄陂区境内的盘龙城遗址是距今约3500年前的商朝方国宫城,是迄今中国发现及保存最完整的商代古城之一。 -我想知道武汉常住人口有多少? 现代武汉的城市起源,是东汉末年的位于今汉阳的卻月城、鲁山城,和在今武昌蛇山的夏口城。 -我想知道武汉常住人口有多少? 东汉末年,地方军阀刘表派黄祖为江夏太守,将郡治设在位于今汉阳龟山的卻月城中。 -我想知道武汉常住人口有多少? 卻月城是武汉市区内已知的最早城堡。 -我想知道武汉常住人口有多少? 223年,东吴孙权在武昌蛇山修筑夏口城,同时在城内的黄鹄矶上修筑了一座瞭望塔——黄鹤楼。 -我想知道武汉常住人口有多少? 苏轼在《前赤壁赋》中说的“西望夏口,东望武昌”中的夏口就是指武汉(而当时的武昌则是今天的鄂州)。 -我想知道武汉常住人口有多少? 南朝时,夏口扩建为郢州,成为郢州的治所。 -我想知道武汉常住人口有多少? 隋置江夏县和汉阳县,分别以武昌,汉阳为治所。 -我想知道武汉常住人口有多少? 唐时江夏和汉阳分别升为鄂州和沔州的州治,成为长江沿岸的商业重镇。 -我想知道武汉常住人口有多少? 江城之称亦始于隋唐。 -我想知道武汉常住人口有多少? 两宋时武昌属鄂州,汉阳汉口属汉阳郡。 -我想知道武汉常住人口有多少? 经过发掘,武汉出土了大量唐朝墓葬,在武昌马房山和岳家咀出土了灰陶四神砖以及灰陶十二生肖俑等。 -我想知道武汉常住人口有多少? 宋代武汉的制瓷业发达。 -我想知道武汉常住人口有多少? 在市郊江夏区梁子湖旁发现了宋代瓷窑群100多座,烧制的瓷器品种很多,釉色以青白瓷为主。 -我想知道武汉常住人口有多少? 南宋诗人陆游在经过武昌时,写下“市邑雄富,列肆繁错,城外南市亦数里,虽钱塘、建康不能过,隐然一大都会也”来描写武昌的繁华。 -我想知道武汉常住人口有多少? 南宋抗金将领岳飞驻防鄂州(今武昌)8年,在此兴师北伐。 -我想知道武汉常住人口有多少? 元世祖至元十八年(1281年),武昌成为湖广行省的省治。 -我想知道武汉常住人口有多少? 这是武汉第一次成为一级行政单位(相当于现代的省一级)的治所。 -列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 列夫·达维多维奇,托洛茨基是联共(布)党内和第三国际时期反对派的领导人,托派"第四国际"的创始人和领导人。 -列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 列夫·达维多维奇·托洛茨基 -列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 列夫·达维多维奇·托洛茨基(俄国与国际历史上最重要的无产阶级革命家之一,二十世纪国际共产主义运动中最具争议的、也是备受污蔑的左翼反对派领袖,他以对古典马克思主义“不断革命论”的独创性发展闻名于世,第三共产国际和第四国际的主要缔造者之一(第三国际前三次代表大会的宣言执笔人)。 -列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 在1905年俄国革命中被工人群众推举为彼得堡苏维埃主席(而当时布尔什维克多数干部却还在讨论是否支持苏维埃,这些干部后来被赶回俄国的列宁痛击)。 -列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 1917年革命托洛茨基率领“区联派”与列宁派联合,并再次被工人推举为彼得格勒苏维埃主席。 -列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 对于十月革命这场20世纪最重大的社会革命,托洛茨基赢得了不朽的历史地位。 -列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 后来成了托洛茨基死敌的斯大林,当时作为革命组织领导者之一却写道:“起义的一切实际组织工作是在彼得格勒苏维埃主席托洛茨基同志直接指挥之下完成的。 -列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 我们可以确切地说,卫戍部队之迅速站在苏维埃方面来,革命军事委员会的工作之所以搞得这样好,党认为这首先要归功于托洛茨基同志。” -列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? (值得一提的是,若干年后,当反托成为政治需要时,此类评价都从斯大林文章中删掉了。 -列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? )甚至连后来狂热的斯大林派雅克·沙杜尔,当时却也写道:“托洛茨基在十月起义中居支配地位,是起义的钢铁灵魂。” -列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? (苏汉诺夫《革命札记》第6卷P76。 -列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? )不仅在起义中,而且在无产阶级政权的捍卫、巩固方面和国际共产主义革命方面,托洛茨基也作出了极其卓越的贡献(外交官-苏联国际革命政策的负责人、苏联红军缔造者以及共产国际缔造者)。 -列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 革命后若干年里,托洛茨基与列宁的画像时常双双并列挂在一起;十月革命之后到列宁病逝之前,布尔什维克历次全国代表大会上,代表大会发言结束均高呼口号:“我们的领袖列宁和托洛茨基万岁!” -列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 在欧美共运中托洛茨基的威望非常高。 -列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 后人常常认为托洛茨基只是一个知识分子文人,实际上他文武双全,而且谙熟军事指挥艺术,并且亲临战场。 -列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 正是他作为十月革命的最高军事领袖(在十月革命期间他与士兵一起在战壕里作战),并且在1918年缔造并指挥苏联红军,是一个杰出的军事家(列宁曾对朋友说,除了托洛茨基,谁还能给我迅速地造成一支上百万人的强大军队? -列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? )。 -列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 在内战期间,他甚至坐装甲列车冒着枪林弹雨亲临战场指挥作战,差点挨炸死;当反革命军队进攻彼得堡时,当时的彼得堡领导人季诺维也夫吓得半死,托洛茨基却从容不迫指挥作战。 -列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 同时托洛茨基又是一个高明的外交家,他曾强硬地要求英国政府释放因反战宣传被囚禁在英国的俄国流亡革命者,否则就不许英国公民离开俄国,连英国政府方面都觉得此举无懈可击;他并且把居高临下的法国到访者当场轰出他的办公室(革命前法国一直是俄国的头号债主与政治操纵者),却彬彬有礼地欢迎前来缓和冲突的法国大使;而在十月革命前夕,他对工人代表议会质询的答复既保守了即将起义的军事秘密,又鼓舞了革命者的战斗意志,同时严格遵循现代民主与公开原则,这些政治答复被波兰人多伊彻誉为“外交辞令的杰作”(伊·多伊彻的托氏传记<先知三部曲·武装的先知>第九章P335,第十一章P390)。 -列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 托洛茨基在国民经济管理与研究工作中颇有创造:是苏俄新经济政策的首先提议者以及社会主义计划经济的首先实践者。 -列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 1928年斯大林迟迟开始的计划经济实验,是对1923年以托洛茨基为首的左翼反对派经济纲领的拙劣剽窃和粗暴翻版。 -列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 因为统治者的政策迟到,使得新经济政策到1928年已产生了一个威胁政权生存的农村资产阶级,而苏俄工人阶级国家不得不强力解决——而且是不得不借助已蜕化为官僚集团的强力来解决冲突——结果导致了1929年到30年代初的大饥荒和对农民的大量冤枉错杀。 -列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 另外,他还对文学理论有很高的造诣,其著作<文学与革命>甚至影响了整整一代的国际左翼知识分子(包括中国的鲁迅、王实味等人)。 -列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 
他在哈佛大学图书馆留下了100多卷的<托洛茨基全集>,其生动而真诚的自传和大量私人日记、信件,给人留下了研究人类生活各个方面的宝贵财富,更是追求社会进步与解放的历史道路上的重要知识库之一。 -列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 托洛茨基1879年10月26日生于乌克兰赫尔松县富裕农民家庭,祖籍是犹太人。 -列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 原姓布隆施泰因。 -列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 1896年开始参加工人运动。 -列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 1897年 ,参加建立南俄工人协会 ,反对沙皇专制制度。 -列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 1898年 在尼古拉也夫组织工人团体,被流放至西伯利亚。 -列夫·达维多维奇·托洛茨基是什么时候开始参加工人运动的? 1902年秋以署名托洛茨基之假护照逃到伦敦,参加V.I.列宁、G.V.普列汉诺夫等人主编的<火星报>的工作。 -谁知道洞庭湖大桥有多长? 洞庭湖大桥,位于洞庭湖与长江交汇处,东接岳阳市区洞庭大道和107国道、京珠高速公路,西连省道306线,是国内目前最长的内河公路桥。 -谁知道洞庭湖大桥有多长? 路桥全长10173.82m,其中桥长5747.82m,桥宽20m,西双向四车道,是我国第一座三塔双索面斜拉大桥,亚洲首座不等高三塔双斜索面预应力混凝土漂浮体系斜拉桥。 -谁知道洞庭湖大桥有多长? 洞庭湖大桥是我国最长的内河公路桥,大桥横跨东洞庭湖区,全长10174.2米,主桥梁长5747.8米。 -谁知道洞庭湖大桥有多长? 大桥的通车使湘、鄂间公路干线大为畅通,并为洞庭湖区运输抗洪抢险物资提供了一条快速通道该桥设计先进,新颖,造型美观,各项技求指标先进,且为首次在国内特大型桥梁中采用主塔斜拉桥结构体系。 -谁知道洞庭湖大桥有多长? 洞庭湖大桥是湖区人民的造福桥,装点湘北门户的形象桥,对优化交通网络绪构,发展区域经济,保障防汛救灾,缩短鄂、豫、陕等省、市西部车辆南下的运距,拓展岳阳城区的主骨架,提升岳阳城市品位,增强城市辐射力,有着十分重要的意义。 -谁知道洞庭湖大桥有多长? 自1996年12月开工以来,共有10支施工队伍和两支监理队伍参与了大桥的建设。 -谁知道洞庭湖大桥有多长? 主桥桥面高52米(黄海),设计通航等级Ⅲ级。 -谁知道洞庭湖大桥有多长? 主桥桥型为不等高三塔、双索面空间索、全飘浮体系的预应力钢筋混凝土肋板梁式结构的斜拉桥,跨径为130+310+310+130米。 -谁知道洞庭湖大桥有多长? 索塔为双室宝石型断面,中塔高为125.684米,两边塔高为99.311米。 -谁知道洞庭湖大桥有多长? 三塔基础为3米和3.2米大直径钻孔灌注桩。 -谁知道洞庭湖大桥有多长? 引桥为连续梁桥,跨径20至50米,基础直径为1.8和2.5米钻孔灌注桩。 -谁知道洞庭湖大桥有多长? 该桥设计先进、新颖、造型美观,各项技求指标先进,且为首次在国内特大型桥梁中采用主塔斜拉桥结构体系,岳阳洞庭湖大桥是我国首次采用不等高三塔斜拉桥桥型的特大桥,设计先进,施工难度大位居亚洲之首,是湖南省桥梁界的一大科研项目。 -谁知道洞庭湖大桥有多长? 洞庭湖大桥设计为三塔斜拉桥,空间双斜面索,主梁采用前支点挂篮施工,并按各种工况模拟挂篮受力进行现场试验,获得了大量有关挂篮受力性能和实际刚度的计算参数,作为施工控制参数。 -谁知道洞庭湖大桥有多长? 利用组合式模型单元,推导了斜拉桥分离式双肋平板主梁的单元刚度矩阵,并进行了岳阳洞庭湖大桥的空间受力分析,结果表明此种单元精度满足工程要求,同时在施工工艺方面也积累了成功经验。 -谁知道洞庭湖大桥有多长? 洞庭湖大桥的通车使湘、鄂间公路干线大为畅通,并为洞庭湖区抗洪抢险物资运输提供了一条快速通道。 -谁知道洞庭湖大桥有多长? 湖大桥设计先进,造型美丽,科技含量高。 -谁知道洞庭湖大桥有多长? 洞庭大桥还是一道美丽的风景线,大桥沿岸风景与岳阳楼,君山岛、洞庭湖等风景名胜融为一体,交相辉映,成为世人了解岳阳的又一崭新窗口,也具有特别旅游资源。 -谁知道洞庭湖大桥有多长? 洞庭湖大桥多塔斜拉桥新技术研究荣获国家科学技术进步二等奖、湖南省科学技术进步一等奖,并获第五届詹天佑大奖。 -谁知道洞庭湖大桥有多长? 大桥在中国土木工程学会2004年第16届年会上入选首届《中国十佳桥梁》,名列斜拉桥第二位。 -谁知道洞庭湖大桥有多长? 2001年荣获湖南省建设厅优秀设计一等奖,省优秀勘察一等奖。 -谁知道洞庭湖大桥有多长? 2003年荣获国家优秀工程设计金奖, "十佳学术活动"奖。 -天气预报员的布景师是谁? 芝加哥天气预报员大卫(尼古拉斯·凯奇),被他的粉丝们热爱,也被诅咒--这些人在天气不好的时候会迁怒于他,而大部分时候,大卫都是在预报坏天气。 -天气预报员的布景师是谁? ?不过,这也没什么,当一家国家早间新闻节目叫他去面试的时候,大卫的事业似乎又将再创新高。 -天气预报员的布景师是谁? 芝加哥天气预报员大卫(尼古拉斯·凯奇),被他的粉丝们热爱,也被诅咒--这些人在天气不好的时候会迁怒于他,而大部分时候,大卫都是在预报坏天气。 -天气预报员的布景师是谁? 不过,这也没什么,当一家国家早间新闻节目叫他去面试的时候,大卫的事业似乎又将再创新高。 -天气预报员的布景师是谁? 在电视节目上,大卫永远微笑,自信而光鲜,就像每一个成功的电视人一样,说起收入,他也绝对不落人后。 -天气预报员的布景师是谁? 不过,大卫的个人生活可就不那么如意了。 -天气预报员的布景师是谁? 与妻子劳伦(霍普·戴维斯)的离婚一直让他痛苦;儿子迈克吸大麻上瘾,正在进行戒毒,可戒毒顾问却对迈克有着异样的感情;女儿雪莉则体重惊人,总是愁眉苦脸、孤独寂寞;大卫的父亲罗伯特(迈克尔·凯恩),一个世界著名的小说家,虽然罗伯特不想再让大卫觉得负担过重,可正是他的名声让大卫的一生都仿佛处在他的阴影之下,更何况,罗伯特就快重病死了。 -天气预报员的布景师是谁? 和妻子的离婚、父亲的疾病、和孩子之间完全不和谐的关系,都让大卫每天头疼,而每次当他越想控制局面,一切就越加复杂。 -天气预报员的布景师是谁? 然而就在最后人们再也不会向他扔快餐,或许是因为他总是背着弓箭在大街上走。 -天气预报员的布景师是谁? 最后,面对那份高额工作的接受意味着又一个新生活的开始。 -天气预报员的布景师是谁? 也许,生活就像天气,想怎么样就怎么样,完全不可预料。 -天气预报员的布景师是谁? 导 演:戈尔·维宾斯基 Gore Verbinski -天气预报员的布景师是谁? 编 剧:Steve Conrad .....(written by) -天气预报员的布景师是谁? 演 员:尼古拉斯·凯奇 Nicolas Cage .....David Spritz -天气预报员的布景师是谁? 尼古拉斯·霍尔特 Nicholas Hoult .....Mike -天气预报员的布景师是谁? 迈克尔·凯恩 Michael Caine .....Robert Spritzel -天气预报员的布景师是谁? 杰蒙妮·德拉佩纳 Gemmenne de la Peña .....Shelly -天气预报员的布景师是谁? 霍普·戴维斯 Hope Davis .....Noreen -天气预报员的布景师是谁? 迈克尔·瑞斯玻利 Michael Rispoli .....Russ -天气预报员的布景师是谁? 原创音乐:James S. Levine .....(co-composer) (as James Levine) -天气预报员的布景师是谁? 汉斯·兹米尔 Hans Zimmer -天气预报员的布景师是谁? 摄 影:Phedon Papamichael -天气预报员的布景师是谁? 剪 辑:Craig Wood -天气预报员的布景师是谁? 选角导演:Denise Chamian -天气预报员的布景师是谁? 艺术指导:Tom Duffield -天气预报员的布景师是谁? 美术设计:Patrick M. Sullivan Jr. .....(as Patrick Sullivan) -天气预报员的布景师是谁? 布景师 :Rosemary Brandenburg -天气预报员的布景师是谁? 服装设计:Penny Rose -天气预报员的布景师是谁? 视觉特效:Charles Gibson -天气预报员的布景师是谁? 
David Sosalla .....Pacific Title & Art Studio -韩国国家男子足球队教练是谁? 韩国国家足球队,全名大韩民国足球国家代表队(???? ?? ?????),为韩国足球协会所于1928年成立,并于1948年加入国际足球协会。 -韩国国家男子足球队教练是谁? 韩国队自1986年世界杯开始,从未缺席任何一届决赛周。 -韩国国家男子足球队教练是谁? 在2002年世界杯,韩国在主场之利淘汰了葡萄牙、意大利及西班牙三支欧洲强队,最后夺得了殿军,是亚洲球队有史以来最好成绩。 -韩国国家男子足球队教练是谁? 在2010年世界杯,韩国也在首圈分组赛压倒希腊及尼日利亚出线次圈,再次晋身十六强,但以1-2败给乌拉圭出局。 -韩国国家男子足球队教练是谁? 北京时间2014年6月27日3时,巴西世界杯小组赛H组最后一轮赛事韩国对阵比利时,韩国队0-1不敌比利时,3场1平2负积1分垫底出局。 -韩国国家男子足球队教练是谁? 球队教练:洪明甫 -韩国国家男子足球队教练是谁? 韩国国家足球队,全名大韩民国足球国家代表队(韩国国家男子足球队???? ?? ?????),为韩国足球协会所于1928年成立,并于1948年加入国际足联。 -韩国国家男子足球队教练是谁? 韩国队是众多亚洲球队中,在世界杯表现最好,他们自1986年世界杯开始,从未缺席任何一届决赛周。 -韩国国家男子足球队教练是谁? 在2002年世界杯,韩国在主场之利淘汰了葡萄牙、意大利及西班牙三支欧洲强队,最后夺得了殿军,是亚洲球队有史以来最好成绩。 -韩国国家男子足球队教练是谁? 在2010年世界杯,韩国也在首圈分组赛压倒希腊及尼日利亚出线次圈,再次晋身十六强,但以1-2败给乌拉圭出局。 -韩国国家男子足球队教练是谁? 2014年世界杯外围赛,韩国在首轮分组赛以首名出线次轮分组赛,与伊朗、卡塔尔、乌兹别克以及黎巴嫩争逐两个直接出线决赛周资格,最后韩国仅以较佳的得失球差压倒乌兹别克,以小组次名取得2014年世界杯决赛周参赛资格,也是韩国连续八次晋身世界杯决赛周。 -韩国国家男子足球队教练是谁? 虽然韩国队在世界杯成绩为亚洲之冠,但在亚洲杯足球赛的成绩却远不及世界杯。 -韩国国家男子足球队教练是谁? 韩国只在首两届亚洲杯(1956年及1960年)夺冠,之后五十多年未能再度称霸亚洲杯,而自1992年更从未打入过决赛,与另一支东亚强队日本近二十年来四度在亚洲杯夺冠成强烈对比。[1] -韩国国家男子足球队教练是谁? 人物简介 -韩国国家男子足球队教练是谁? 车范根(1953年5月22日-)曾是大韩民国有名的锋线选手,他被欧洲媒体喻为亚洲最佳输出球员之一,他也被认为是世界最佳足球员之一。 -韩国国家男子足球队教练是谁? 他被国际足球史料与数据协会评选为20世纪亚洲最佳球员。 -韩国国家男子足球队教练是谁? 他在85-86赛季是德甲的最有价值球员,直到1999年为止他都是德甲外国球员入球纪录保持者。 -韩国国家男子足球队教练是谁? 德国的球迷一直没办法正确说出他名字的发音,所以球车范根(左)迷都以炸弹车(Cha Boom)称呼他。 -韩国国家男子足球队教练是谁? 这也代表了他强大的禁区得分能力。 -韩国国家男子足球队教练是谁? 职业生涯 -韩国国家男子足球队教练是谁? 车范根生于大韩民国京畿道的华城市,他在1971年于韩国空军俱乐部开始了他的足球员生涯;同年他入选了韩国19岁以下国家足球队(U-19)。 -韩国国家男子足球队教练是谁? 隔年他就加入了韩国国家足球队,他是有史以来加入国家队最年轻的球员。 -韩国国家男子足球队教练是谁? 车范根在27岁时前往德国发展,当时德甲被认为是世界上最好的足球联赛。 -韩国国家男子足球队教练是谁? 他在1978年12月加入了达姆施塔特,不过他在那里只待了不到一年就转到当时的德甲巨人法兰克福。 -韩国国家男子足球队教练是谁? 车范根很快在新俱乐部立足,他帮助球队赢得79-80赛季的欧洲足协杯。 -韩国国家男子足球队教练是谁? 在那个赛季过后,他成为德甲薪水第三高的球员,不过在1981年对上勒沃库森的一场比赛上,他的膝盖严重受伤,几乎毁了他的足球生涯。 -韩国国家男子足球队教练是谁? 在1983年车范根转投勒沃库森;他在这取得很高的成就,他成为85-86赛季德甲的最有价值球员,并且在1988年帮助球队拿下欧洲足协杯,也是他个人第二个欧洲足协杯。 -韩国国家男子足球队教练是谁? 他在决赛对垒西班牙人扮演追平比分的关键角色,而球会才在点球大战上胜出。 -韩国国家男子足球队教练是谁? 车范根在1989年退休,他在308场的德甲比赛中进了98球,一度是德甲外国球员的入球纪录。 -韩国国家男子足球队教练是谁? 执教生涯 -国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 国立台湾科技大学,简称台湾科大、台科大或台科,是位于台湾台北市大安区的台湾第一所高等技职体系大专院校,现为台湾最知名的科技大学,校本部比邻国立台湾大学。 -国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 该校已于2005年、2008年持续入选教育部的“发展国际一流大学及顶尖研究中心计划”。 -国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? “国立”台湾工业技术学院成立于“民国”六十三年(1974)八月一日,为台湾地区第一所技术职业教育高等学府。 -国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 建校之目的,在因应台湾地区经济与工业迅速发展之需求,以培养高级工程技术及管理人才为目标,同时建立完整之技术职业教育体系。 -国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? “国立”台湾工业技术学院成立于“民国”六十三年(1974)八月一日,为台湾地区第一所技术职业教育高等学府。 -国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 建校之目的,在因应台湾地区经济与工业迅速发展之需求,以培养高级工程技术及管理人才为目标,同时建立完整之技术职业教育体系。 -国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 本校校地约44.5公顷,校本部位于台北市基隆路四段四十三号,。 -国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 民国68年成立硕士班,民国71年成立博士班,现有大学部学生5,664人,研究生4,458人,专任教师451位。 -国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 2001年在台湾地区教育部筹划之研究型大学(“国立”大学研究所基础教育重点改善计画)中,成为全台首批之9所大学之一 。 -国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 自2005年更在“教育部”所推动“五年五百亿 顶尖大学”计划下,遴选为适合发展成“顶尖研究中心”的11所研究型大学之一。 -国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 国立台湾科技大学部设有二年制、四年制及工程在职人员进修班等三种学制;凡二专、三专及五专等专科学校以上之毕业生,皆可以报考本校大学部二年制,而高职、高中毕业生,可以报考本校大学部四年制。 -国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 工业管理、电子工程、机械工程、营建工程及应用外语系等,则设有在职人员进修班学制,其招生对象为在职人员,利用夜间及暑假期间上课。 -国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 凡在本校大学部修毕应修学分且成绩及格者皆授予学士学位。 -国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 
国立台湾科技大学目前设有工程、电资、管理、设计、人文社会及精诚荣誉等六个学院,分别有机械、材料科学与工程、营建、化工、电子、电机、资工、工管、企管、资管、建筑、工商业设计、应用外语等13个系及校内招生之财务金融学士学位学程、科技管理学士学位学程;全校、工程、电资、管理、创意设计等五个不分系菁英班及光电研究所、管理研究所、财务金融研究所、科技管理研究所、管理学院MBA、数位学习教育研究所、医学工程研究所、自动化及控制研究所、工程技术研究所、专利研究所等独立研究所,此外尚有人文学科负责人文及社会类等课程之教学,通识学科负责法律、音乐、环保类等课程之教学,以及师资培育中心专以培养学生未来担任中等学校工、商、管理、设计等科之合格教师,合计23个独立系所、师资培育中心、人文学科及通识学科等教学单位。 -国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 国立台湾科技大学至今各系所毕业校友已达约56,456位,毕业生出路包含出国继续深造、在台深造以及投身于产业界。 -国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 由于实作经验丰富,理论基础完备,工作态度认真,毕业校友担任政府要职、大学教授、大学校长及企业主管者众多,深受各界的肯定。 -国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 工商业设计系副教授孙春望与硕一生全明远耗时两个月自制之三分钟动画短片“立体悲剧”。 -国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 本片入选有“动画奥斯卡”之称的“ACM SIGGRAPH”国际动画展,并获得观众票选第一名,这也是台湾首次入选及获奖的短片。 -国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 击败了好莱坞知名导演史蒂芬·史匹柏的“世界大战”、乔治卢卡斯的“星际大战三部曲”、梦工厂出品的动画“马达加斯加”、军机缠斗片“机战未来”及美国太空总署、柏克莱加州大学等好莱坞名片及顶尖学术单位制作的短片。 -国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 2009年荣获有工业设计界奥斯卡奖之称的“德国iF设计大奖”国立台湾科技大学设计学院获得大学排名的全球第二,仅次于韩国三星美术设计学院“SADI”。 -国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 总体排名 依据《泰晤士高等教育》(THES-QS)在2009年的世界大学排名调查,台科大排名全世界第351名,在台湾所有大学中排名第五,仅次于台大,清大,成大及阳明,并且是台湾唯一进入世界四百大名校的科技大学。 -国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 依据在欧洲拥有广大声誉的“Eduniversal商学院排名网”2008年的资料,台湾有七所大学的商管学院被分别列入世界1000大商学院,其中台科大位在“卓越商学院”(EXCELLENT Business Schools,国内主要)之列,“推荐程度”(Recommendation Rate)为全台第四,仅次于台大、政大、中山,与交大并列。 -国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 目前设有工程、电资、管理、设计、人文社会及精诚荣誉学院等六个学院。 -国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? 预计于竹北新校区设立产学合作学院及应用理学院。 -国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? ●台湾建筑科技中心 -国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? ●智慧型机械人研究中心科技成果展示(15张) -国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? ●台湾彩卷与博彩研究中心 -国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? ●电力电子技术研发中心 -国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? ●NCP-Taiwan办公室 -国立台湾科技大学副教授自制的动画“立体悲剧”入选的“ACM SIGGRAPH”国际动画展还有什么别称? ●资通安全研究与教学中心 -在日本,神道最初属于什么信仰? 神道又称天道,语出《易经》“大观在上,顺而巽,中正以观天下。 -在日本,神道最初属于什么信仰? 观,盥而不荐,有孚顒若,下观而化也。 -在日本,神道最初属于什么信仰? 观天之神道,而四时不忒,圣人以神道设教,而天下服矣”。 -在日本,神道最初属于什么信仰? 自汉以降,神道又指“墓前开道,建石柱以为标”。 -在日本,神道最初属于什么信仰? 在中医中,神道,经穴名。 -在日本,神道最初属于什么信仰? 出《针灸甲乙经》。 -在日本,神道最初属于什么信仰? 别名冲道。 -在日本,神道最初属于什么信仰? 属督脉。 -在日本,神道最初属于什么信仰? 宗教中,神道是日本的本土传统民族宗教,最初以自然崇拜为主,属于泛灵多神信仰(精灵崇拜),视自然界各种动植物为神祇。 -在日本,神道最初属于什么信仰? 神道又称天道,语出《易经》“大观在上,顺而巽,中正以观天下。 -在日本,神道最初属于什么信仰? 观,盥而不荐,有孚顒若,下观而化也。 -在日本,神道最初属于什么信仰? 观天之神道,而四时不忒,圣人以神道设教,而天下服矣”。 -在日本,神道最初属于什么信仰? 自汉以降,神道又指“墓前开道,建石柱以为标”。 -在日本,神道最初属于什么信仰? 在中医中,神道,经穴名。 -在日本,神道最初属于什么信仰? 出《针灸甲乙经》。 -在日本,神道最初属于什么信仰? 别名冲道。 -在日本,神道最初属于什么信仰? 属督脉。 -在日本,神道最初属于什么信仰? 宗教中,神道是日本的本土传统民族宗教,最初以自然崇拜为主,属于泛灵多神信仰(精灵崇拜),视自然界各种动植物为神祇。 -在日本,神道最初属于什么信仰? 谓鬼神赐福降灾神妙莫测之道。 -在日本,神道最初属于什么信仰? 《易·观》:“观天之神道,而四时不忒,圣人以神道设教,而天下服矣。” -在日本,神道最初属于什么信仰? 孔颖达 疏:“微妙无方,理不可知,目不可见,不知所以然而然,谓之神道。” -在日本,神道最初属于什么信仰? 《文选·王延寿<鲁灵光殿赋>》:“敷皇极以创业,协神道而大宁。” -在日本,神道最初属于什么信仰? 张载 注:“协和神明之道,而天下大宁。” -在日本,神道最初属于什么信仰? 南朝 梁 刘勰 《文心雕龙·正纬》:“夫神道阐幽,天命微显。” -在日本,神道最初属于什么信仰? 鲁迅 《中国小说史略》第五篇:“﹝ 干宝 ﹞尝感於其父婢死而再生,及其兄气绝复苏,自言见天神事,乃撰《搜神记》二十卷,以‘发明神道之不诬’。” -在日本,神道最初属于什么信仰? 神道设教 观卦里面蕴含着《易经》固有的诸如神道设教、用舍行藏、以德化民等思想,是孔子把这些思想发掘出来。 -在日本,神道最初属于什么信仰? 「据此是孔子见当时之人,惑于吉凶祸福,而卜筮之史,加以穿凿傅会,故演易系辞,明义理,切人事,借卜筮以教后人,所谓以神道设教,其所发明者,实即羲文之义理,而非别有义理,亦非羲文并无义理,至孔子始言义理也,当即朱子之言而小变之曰,易为卜筮作,实为义理作,伏羲文王之易,有占而无文,与今人用火珠林起课者相似,孔子加卦爻辞如签辞,纯以理言,实即羲文本意,则其说分明无误矣。」 -在日本,神道最初属于什么信仰? 孔子所发掘的《易经》思想与孔子在《论语》书中表现出来的思想完全一致。 -在日本,神道最初属于什么信仰? 《易传》的思想反映了孔子的思想,这个思想是《周易》的,也是孔子的。 -在日本,神道最初属于什么信仰? 在《周易》和孔子看来,神不是有意识的人格化的上帝。 -奥林匹克里昂获得了几连霸? 
里昂 Lyon 全名 Olympique lyonnais 绰号 Les Gones、OL 成立 1950年 城市 法国,里昂 主场 热尔兰球场(Stade Gerland) 容纳人数 41,044人 主席 奥拉斯 主教练 雷米·加尔德 联赛 法国足球甲级联赛 2013–14 法甲,第 5 位 网站 官方网站 主场球衣 客场球衣 第三球衣 日尔兰体育场 奥林匹克里昂(Olympique lyonnais,简称:OL及Lyon,中文简称里昂)是一间位于法国东南部罗纳-阿尔卑斯区的里昂市的足球会,成立于1950年8月3日,前身为里昂·奥林匹克(Lyon Olympique)体育俱乐部其中一个分支的足球队,1889年离开体育俱乐部自立门户成立新俱乐部,但官方网站表示俱乐部于1950年正式成立。 -奥林匹克里昂获得了几连霸? 现时在法国足球甲级联赛比赛,俱乐部同时设立男子及女子足球队。 -奥林匹克里昂获得了几连霸? 里昂是首届法国足球甲级联赛成员之一,可惜名列第十五位而降落乙组,1951年以乙级联赛冠军获得创会后首次锦标。 -奥林匹克里昂获得了几连霸? 球队在法国足球史上没有取得辉煌成绩,比较优异的算是六十年代曾杀入欧洲杯赛冠军杯四强,及3度晋身法国杯决赛并2次成功获冠。 -奥林匹克里昂获得了几连霸? 直至九十年代末里昂由辛天尼带领,先连续取得联赛头三名,到2002年终于首次登上法国顶级联赛冠军宝座,同年勒冈(Paul Le Guen)接替执教法国国家足球队的辛天尼,他其后继续带领里昂保持气势,加上队中球员小儒尼尼奧、迪亚拉、克里斯蒂亞諾·馬克斯·戈麥斯、迈克尔·埃辛、西德尼·戈武及门将格雷戈里·库佩表现突出,2003年至2005年横扫3届联赛冠军,创下连续四年夺得联赛锦标,平了1960年代末圣艾蒂安及1990年代初马赛的四连冠纪录。 -奥林匹克里昂获得了几连霸? 2005年前利物浦主教练热拉尔·霍利尔重返法国担任新任主教练,并加入葡萄牙中场蒂亚戈,和前巴伦西亚前锋约翰·卡鲁。 -奥林匹克里昂获得了几连霸? 他亦成功带领里昂赢得一届法甲冠军。 -奥林匹克里昂获得了几连霸? 2007年里昂成为首支上市的法国足球俱乐部,招股价21至24.4欧元,发行370万股,集资8400万欧元[1]。 -奥林匹克里昂获得了几连霸? 2007年4月21日,联赛次名图卢兹二比三不敌雷恩,令处于榜首的里昂领先次席多达17分距离,里昂因此提前六轮联赛庆祝俱乐部连续第六年夺得联赛冠军,亦是欧洲五大联赛(英格兰、德国、西班牙、意大利及法国)历史上首支联赛六连冠队伍[2]。 -奥林匹克里昂获得了几连霸? 在2007-08年赛季,里昂再一次成功卫冕联赛锦标,达成七连霸伟业。 -奥林匹克里昂获得了几连霸? 不过在2008-09赛季,里昂排名法甲第三位,联赛冠军被波尔多所获得。 -奥林匹克里昂获得了几连霸? 于2010年4月,里昂以两回合3比2的比分于欧洲冠军联赛击败波尔多跻身四强,此乃里昂首次晋级此项顶级杯赛的四强阶段。 -奥林匹克里昂获得了几连霸? 粗体字为新加盟球员 -奥林匹克里昂获得了几连霸? 以下球员名单更新于2014年8月27日,球员编号参照 官方网站,夏季转会窗为6月9日至8月31日 -火柴人刺杀行动怎么才能过关? 移动鼠标控制瞄准,点击鼠标左键进行射击。 -火柴人刺杀行动怎么才能过关? 游戏加载完成后点击STARTGAME-然后点击STARTMISSION即可开始游戏。 -火柴人刺杀行动怎么才能过关? 这里不仅仅考验的是你的枪法而且最重要的是你的智慧,喜欢火柴人类型游戏的玩家可以进来小试身手。 -火柴人刺杀行动怎么才能过关? 控制瞄准,刺杀游戏中的目标人物即可过关哦。 -你知道2月14日西方情人节是因何起源的吗? 情人节(英语:Valentine's Day),情人节的起源有多个版本,其中一个说法是在公元三世纪,古罗马暴君为了征召更多士兵,禁止婚礼,一名叫瓦伦丁Valentine的修士不理禁令,秘密替人主持婚礼,结果被收监,最后处死。 -你知道2月14日西方情人节是因何起源的吗? 而他死的那天就是2月14日,为纪念Valentine的勇敢精神,人们将每年的2月14日定为Valentine的纪念日。 -你知道2月14日西方情人节是因何起源的吗? 因此成了后来的“情人节”。 -你知道2月14日西方情人节是因何起源的吗? 另外,据记载,教宗在公元496年废除牧神节,把2月14日定为圣瓦伦丁日,即是St.Valentine's Day,后来成为是西方的节日之一。 -你知道2月14日西方情人节是因何起源的吗? 中文名称:情人节 -你知道2月14日西方情人节是因何起源的吗? 外文名称:Valentine‘s Day -你知道2月14日西方情人节是因何起源的吗? 别名:情人节圣瓦伦丁节 -你知道2月14日西方情人节是因何起源的吗? 公历日期:2月14日 -你知道2月14日西方情人节是因何起源的吗? 起源时间:公元270年2月14日 -你知道2月14日西方情人节是因何起源的吗? 起源事件:人们为了纪念为情人做主而牺牲的瓦伦丁神父,把他遇害的那一天(2月14日)称为情人节。 -你知道2月14日西方情人节是因何起源的吗? 地区:欧美地区 -你知道2月14日西方情人节是因何起源的吗? 宗教:基督教 -你知道2月14日西方情人节是因何起源的吗? 其他信息:西方的传统节日之一。 -你知道2月14日西方情人节是因何起源的吗? 男女在这一天互送礼物(如贺卡和玫瑰花等)用以表达爱意或友好。 -你知道2月14日西方情人节是因何起源的吗? 据台湾“今日台湾人讨厌情人节新闻网”报道,西洋情人节即将来到,求职网进行“办公室恋情及情人节调查”发现,在目前全台上班族的感情状态中,有情人相伴的比率约5成5,4成5的上班族单身;较出乎意料的结果是,情人节以近3成(28%)的占比,登上最讨厌的节日第一名,端午节以24.3%居第二;农历年则以18.2%居第三;第四名是圣诞节,占12.4%。 -你知道2月14日西方情人节是因何起源的吗? 调查指出,情人节对单身族来说,不仅成为压力,也显得更加孤单,在情人节当天,单身的上班族有将近4成(39.1%)的人在家看电视度过,近两成(18.7%)上网聊天,有1成4(14.8%)的人,不畏满街闪光,勇气十足出门看电影,近1成(9.7%)的上班族选择留在公司加班;另外有 5.4%的人,会在情人节当天积极参加联谊,希望能改变自己的感情状态。 -你知道2月14日西方情人节是因何起源的吗? 情侣们在情人节当天,庆祝方式以吃浪漫大餐最多(37.1%),不过有近3成(27%)的情侣,在情人节当天不会特别庆祝情人节,且这个比率远比第三名的旅游(占比11.5%)高出1倍以上。 -你知道2月14日西方情人节是因何起源的吗? 在情人节当天庆祝的开销上,可以说是小资男女当道,选择1000元(新台币,下同)以内的上班族最多占33.1%,情人节当天的花费上班族的平均花费是2473元,大手笔花费上万元以上庆祝情人节的,占比只有2.5%。 -你知道2月14日西方情人节是因何起源的吗? 情人节的起源众说纷纭,而为纪念罗马教士瓦伦丁是其中一个普遍的说法。 -你知道2月14日西方情人节是因何起源的吗? 据《世界图书百科全书》(World Book Encyclopedia)数据指出:“在公元200年时期,罗马皇帝克劳狄二世禁止年轻男子结婚。 -你知道2月14日西方情人节是因何起源的吗? 他认为未婚男子可以成为更优良的士兵。 -你知道2月14日西方情人节是因何起源的吗? 一位名叫瓦伦丁的教士违反了皇帝的命令,秘密为年轻男子主持婚礼,引起皇帝不满,结果被收监,据说瓦伦丁于公元269年2月14日被处决。 -你知道2月14日西方情人节是因何起源的吗? 另外,据《天主教百科全书》(The Catholic情人节 Encyclopedia)指出,公元496年,教宗圣基拉西乌斯一世在公元第五世纪末叶废除了牧神节,把2月14日定为圣瓦伦丁日。” -你知道2月14日西方情人节是因何起源的吗? 这个节日现今以“圣瓦伦丁节”——亦即情人节的姿态盛行起来。 -你知道2月14日西方情人节是因何起源的吗? 但是在第2次梵蒂冈大公会议后,1969年的典礼改革上,整理了一堆在史实上不确定是否真实存在的人物以后,圣瓦伦丁日就被废除了。 -你知道2月14日西方情人节是因何起源的吗? 现在天主教圣人历已经没有圣瓦伦丁日(St. 
Valentine's Day)。 -你知道2月14日西方情人节是因何起源的吗? 根据《布卢姆尔的警句与寓言辞典》记载:“圣瓦伦丁是个罗马教士,由于援助受逼害的基督徒而身陷险境,后来他归信基督教,最后被处死,卒于二月十四日”古代庆祝情人节的习俗与瓦伦丁拉上关系,可能是纯属巧合而已。 -你知道2月14日西方情人节是因何起源的吗? 事实上,这个节日很可能与古罗马的牧神节或雀鸟交配的季节有关。 -你知道2月14日西方情人节是因何起源的吗? 情人节的特色是情侣互相馈赠礼物。 -你知道2月14日西方情人节是因何起源的吗? 时至今日,人们则喜欢以情人卡向爱人表达情意。 -防卫大学每年招收多少学生? 防卫大学的前身是保安大学。 -防卫大学每年招收多少学生? 防卫大学是日本自卫队培养陆、海、空三军初级军官的学校,被称为日军"军官的摇篮"。 -防卫大学每年招收多少学生? 防卫大学是日军的重点院校。 -防卫大学每年招收多少学生? 日本历届内阁首相都要到防卫大学视察"训示",并亲自向学生颁发毕业证书。 -防卫大学每年招收多少学生? 日军四分之一的军官、三分之一的将官从这里走出。 -防卫大学每年招收多少学生? 防卫大学毕业生已成为日军军官的中坚力量。 -防卫大学每年招收多少学生? 防卫大学每年从地方招收18岁至21岁的应届高中毕业生和同等学历的青年。 -防卫大学每年招收多少学生? 每年招生名额为530名。 -防卫大学每年招收多少学生? 1950年 8月,日本组建警察预备队,1952年改为保安队。 -防卫大学每年招收多少学生? 为了充实保安队干部队伍,提高干部军政素质,1953年4月成立了保安大学,校址设在三浦半岛的久里滨。 -防卫大学每年招收多少学生? 1954年7月1日保安厅改为防卫厅。 -防卫大学每年招收多少学生? 在保安队基础上,日本建立了陆、海、空三军自卫队,保安大学遂改名为防卫大学,1955年迁至三浦半岛东南方的小原台。 -防卫大学每年招收多少学生? 学校直属防卫厅领导。 -防卫大学每年招收多少学生? 防卫大学的教育方针是:要求学生德智体全面发展,倡导学生崇尚知识和正义,培养学生具有指挥各种部队的能力。 -防卫大学每年招收多少学生? 防卫大学每年招生名额为530名,其中陆军300名,海军100名,空军130名。 -防卫大学每年招收多少学生? 根据自卫队向妇女敞开军官大门的决定,防卫大学1992年首次招收女学员35名。 -防卫大学每年招收多少学生? 考试分两次进行。 -防卫大学每年招收多少学生? 第一次,每年11月份进行学科考试;第二次,12月份进行口试和体检。 -防卫大学每年招收多少学生? 学校按陆、海、空三军分别设大学本科班和理工科研究生班。 -防卫大学每年招收多少学生? 本科班学制4年,又分为理工和人文社会学两大科。 -防卫大学每年招收多少学生? 学员入学后先分科,530人中有460人专攻理科,70人专攻文科。 -防卫大学每年招收多少学生? 第1学年按专科学习一般大学课程和一般军事知识。 -防卫大学每年招收多少学生? 第2学年以后在军事上开始区分军种,学员分别学习陆、海、空军的专门课程。 -防卫大学每年招收多少学生? 文化课和军事课的比例是6:l。 -防卫大学每年招收多少学生? 文化课程有人文、社会、自然、外语、电气工程、机械工程、土木建筑工程、应用化学、应用物理、航空、航海等。 -防卫大学每年招收多少学生? 军事训练课每学年6周,按一年四季有比例地安排教学内容,对学生进行军事技术和体能训练。 -防卫大学每年招收多少学生? 理工科研究生班,每年招生1期,学制2年,每期招收90人,设电子工程、航空工程、兵器制造等7个专业,课程按一般大学硕士课程标准设置。 -防卫大学每年招收多少学生? 防卫大学的课程和训练都十分紧张。 -防卫大学每年招收多少学生? 近年来,为了增强防卫大学的吸引力,克服考生逐年减少的倾向广泛征集优秀人才,学校进行了一些改革,改变入学考试办法,各高中校长以内部呈报的形式向防卫大学推荐品学兼优的学生;减少学生入学考试科目,放宽对报考防卫大学的学生的视力要求;降低学分数(大约降低30学分);改善学生宿舍条件。 -防卫大学每年招收多少学生? 防卫大学的学生生活紧张而愉快。 -《威鲁贝鲁的物语》官网是什么? 10年前大战后,威鲁贝鲁国一致辛勤的保护着得来不易的和平,但是与邻国圣卡特拉斯国的关系却不断的紧张,战争即将爆发。 -《威鲁贝鲁的物语》官网是什么? 为了避免战争,威鲁贝鲁国王海特鲁王决定将自己最大的女儿公主莉塔嫁给圣卡特拉斯国的王子格鲁尼亚。 -《威鲁贝鲁的物语》官网是什么? 但是莉塔却刺伤了政治婚姻的对象格鲁尼亚王子逃了出去,这事激怒了圣卡特拉斯国的国王兰帕诺夫王,并下令14天之内抓到王女并执行公开处刑来谢罪,不然两国就要开战。 -《威鲁贝鲁的物语》官网是什么? 《威鲁贝鲁的物语~Sisters of Wellber~》 -《威鲁贝鲁的物语》官网是什么? (Sisters of Wellber) -《威鲁贝鲁的物语》官网是什么? 日文名 ウエルベールの物语 -《威鲁贝鲁的物语》官网是什么? 官方网站 http://www.avexmovie.jp/lineup/wellber/ -《威鲁贝鲁的物语》官网是什么? 为了回避发生战争这个最坏的结果,莉塔下定决心去中立国古利达姆。 diff --git a/examples/text_graph/erniesage/link_prediction.py b/examples/text_graph/erniesage/link_prediction.py deleted file mode 100644 index 2ad8b2faecfa..000000000000 --- a/examples/text_graph/erniesage/link_prediction.py +++ /dev/null @@ -1,177 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
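-
-# Entry script for the ErnieSage link-prediction example: do_train() runs the
-# training loop and periodically saves checkpoints, while do_predict() loads a
-# trained model and writes node embeddings to part files; the two paths are
-# switched by the --do_predict flag parsed in __main__ below.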
- -import argparse -import io -import os -import random -import time -from functools import partial - -import numpy as np -import paddle -import pgl -import yaml -from data import GraphDataLoader, PredictData, TrainData, batch_fn -from easydict import EasyDict as edict -from models import ErnieSageForLinkPrediction - -from paddlenlp.transformers import ErnieTinyTokenizer, ErnieTokenizer -from paddlenlp.utils.log import logger - -MODEL_CLASSES = { - "ernie-tiny": (ErnieSageForLinkPrediction, ErnieTinyTokenizer), - "ernie-1.0": (ErnieSageForLinkPrediction, ErnieTokenizer), -} - - -def set_seed(config): - random.seed(config.seed) - np.random.seed(config.seed) - paddle.seed(config.seed) - - -def load_data(graph_data_path): - base_graph = pgl.Graph.load(graph_data_path) - term_ids = np.load(os.path.join(graph_data_path, "term_ids.npy"), mmap_mode="r") - return base_graph, term_ids - - -def do_train(config): - paddle.set_device(config.device) - if paddle.distributed.get_world_size() > 1: - paddle.distributed.init_parallel_env() - set_seed(config) - - base_graph, term_ids = load_data(config.graph_work_path) - collate_fn = partial(batch_fn, samples=config.samples, base_graph=base_graph, term_ids=term_ids) - - # mode = "train" - train_ds = TrainData(config.graph_work_path) - - model_class, tokenizer_class = MODEL_CLASSES[config.model_name_or_path] - tokenizer = tokenizer_class.from_pretrained(config.model_name_or_path) - config.cls_token_id = tokenizer.cls_token_id - - model = model_class.from_pretrained(config.model_name_or_path, config_file=config) - model = paddle.DataParallel(model) - - train_loader = GraphDataLoader( - train_ds, batch_size=config.batch_size, shuffle=True, num_workers=config.sample_workers, collate_fn=collate_fn - ) - - optimizer = paddle.optimizer.Adam(learning_rate=config.lr, parameters=model.parameters()) - - rank = paddle.distributed.get_rank() - global_step = 0 - tic_train = time.time() - for epoch in range(config.epoch): - for step, (graphs, datas) in enumerate(train_loader): - global_step += 1 - loss, outputs = model(graphs, datas) - if global_step % config.log_per_step == 0: - logger.info( - "global step %d, epoch: %d, batch: %d, loss: %f, speed: %.2f step/s" - % (global_step, epoch, step, loss, config.log_per_step / (time.time() - tic_train)) - ) - tic_train = time.time() - loss.backward() - optimizer.step() - optimizer.clear_grad() - if global_step % config.save_per_step == 0: - if rank == 0: - output_dir = os.path.join(config.output_path, "model_%d" % global_step) - if not os.path.exists(output_dir): - os.makedirs(output_dir) - model._layers.save_pretrained(output_dir) - if rank == 0: - output_dir = os.path.join(config.output_path, "last") - if not os.path.exists(output_dir): - os.makedirs(output_dir) - model._layers.save_pretrained(output_dir) - - -def tostr(data_array): - return " ".join(["%.5lf" % d for d in data_array]) - - -@paddle.no_grad() -def do_predict(config): - paddle.set_device(config.device) - if paddle.distributed.get_world_size() > 1: - paddle.distributed.init_parallel_env() - set_seed(config) - - # mode = "predict" - num_nodes = int(np.load(os.path.join(config.graph_work_path, "num_nodes.npy"))) - - base_graph, term_ids = load_data(config.graph_work_path) - collate_fn = partial(batch_fn, samples=config.samples, base_graph=base_graph, term_ids=term_ids) - - model_class, tokenizer_class = MODEL_CLASSES[config.model_name_or_path] - tokenizer = tokenizer_class.from_pretrained(config.model_name_or_path) - config.cls_token_id = tokenizer.cls_token_id - - 
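-    # Load the trained parameters from config.infer_model here, rather than the
-    # pretrained model name used in do_train().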
model = model_class.from_pretrained(config.infer_model, config_file=config) - - model = paddle.DataParallel(model) - predict_ds = PredictData(num_nodes) - - predict_loader = GraphDataLoader( - predict_ds, - batch_size=config.infer_batch_size, - shuffle=True, - num_workers=config.sample_workers, - collate_fn=collate_fn, - ) - - trainer_id = paddle.distributed.get_rank() - id2str = io.open(os.path.join(config.graph_work_path, "terms.txt"), encoding=config.encoding).readlines() - if not os.path.exists(config.output_path): - os.mkdir(config.output_path) - fout = io.open("%s/part-%s" % (config.output_path, trainer_id), "w", encoding="utf8") - - global_step = 0 - epoch = 0 - tic_train = time.time() - model.eval() - for step, (graphs, datas) in enumerate(predict_loader): - global_step += 1 - loss, outputs = model(graphs, datas) - for user_feat, user_real_index in zip(outputs[0].numpy(), outputs[3].numpy()): - sri = id2str[int(user_real_index)].strip("\n") - line = "{}\t{}\n".format(sri, tostr(user_feat)) - fout.write(line) - if global_step % config.log_per_step == 0: - logger.info( - "predict step %d, epoch: %d, batch: %d, loss: %f, speed: %.2f step/s" - % (global_step, epoch, step, loss, config.log_per_step / (time.time() - tic_train)) - ) - tic_train = time.time() - fout.close() - - -if __name__ == "__main__": - parser = argparse.ArgumentParser(description="main") - parser.add_argument("--conf", type=str, default="./config.yaml") - parser.add_argument("--do_predict", action="store_true", default=False) - args = parser.parse_args() - config = edict(yaml.load(open(args.conf), Loader=yaml.FullLoader)) - - assert config.device in ["gpu", "cpu"], "Device should be gpu/cpu, but got %s." % config.device - logger.info(config) - if args.do_predict: - do_predict(config) - else: - do_train(config) diff --git a/examples/text_graph/erniesage/models/conv.py b/examples/text_graph/erniesage/models/conv.py deleted file mode 100644 index 8ec0c61d7b0a..000000000000 --- a/examples/text_graph/erniesage/models/conv.py +++ /dev/null @@ -1,174 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import paddle -import paddle.nn as nn -import paddle.nn.functional as F - - -class GraphSageConv(nn.Layer): - """GraphSAGE is a general inductive framework that leverages node feature - information (e.g., text attributes) to efficiently generate node embeddings - for previously unseen data. - - Paper reference: - Hamilton, Will, Zhitao Ying, and Jure Leskovec. - "Inductive representation learning on large graphs." - Advances in neural information processing systems. 2017. - """ - - def __init__(self, input_size, hidden_size, learning_rate, aggr_func="sum"): - super(GraphSageConv, self).__init__() - assert aggr_func in [ - "sum", - "mean", - "max", - "min", - ], "Only support 'sum', 'mean', 'max', 'min' built-in receive function." 
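-        # The aggregator name is expanded to one of pgl's built-in reducers
-        # (reduce_sum / reduce_mean / reduce_max / reduce_min) and resolved via
-        # getattr on the received message in _recv_func below.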
- self.aggr_func = "reduce_%s" % aggr_func - - self.self_linear = nn.Linear( - input_size, hidden_size, weight_attr=paddle.ParamAttr(learning_rate=learning_rate) - ) - self.neigh_linear = nn.Linear( - input_size, hidden_size, weight_attr=paddle.ParamAttr(learning_rate=learning_rate) - ) - - def forward(self, graph, feature, act=None): - def _send_func(src_feat, dst_feat, edge_feat): - return {"msg": src_feat["h"]} - - def _recv_func(message): - return getattr(message, self.aggr_func)(message["msg"]) - - msg = graph.send(_send_func, src_feat={"h": feature}) - neigh_feature = graph.recv(reduce_func=_recv_func, msg=msg) - - self_feature = self.self_linear(feature) - neigh_feature = self.neigh_linear(neigh_feature) - output = self_feature + neigh_feature - if act is not None: - output = getattr(F, act)(output) - - output = F.normalize(output, axis=1) - return output - - -class ErnieSageV2Conv(nn.Layer): - """ErnieSage (abbreviation of ERNIE SAmple aggreGatE), a model proposed by the PGL team. - ErnieSageV2: Ernie is applied to the EDGE of the text graph. - """ - - def __init__(self, ernie, input_size, hidden_size, learning_rate, cls_token_id=1, aggr_func="sum"): - """ErnieSageV2: Ernie is applied to the EDGE of the text graph. - - Args: - ernie (nn.Layer): the ernie model. - input_size (int): input size of feature tensor. - hidden_size (int): hidden size of the Conv layers. - learning_rate (float): learning rate. - aggr_func (str): aggregate function. 'sum', 'mean', 'max' avaliable. - """ - super(ErnieSageV2Conv, self).__init__() - assert aggr_func in [ - "sum", - "mean", - "max", - "min", - ], "Only support 'sum', 'mean', 'max', 'min' built-in receive function." - self.aggr_func = "reduce_%s" % aggr_func - self.cls_token_id = cls_token_id - self.self_linear = nn.Linear( - input_size, hidden_size, weight_attr=paddle.ParamAttr(learning_rate=learning_rate) - ) - self.neigh_linear = nn.Linear( - input_size, hidden_size, weight_attr=paddle.ParamAttr(learning_rate=learning_rate) - ) - - self.ernie = ernie - - def ernie_send(self, src_feat, dst_feat, edge_feat): - """Apply ernie model on the edge. - - Args: - src_feat (Tensor Dict): src feature tensor dict. - dst_feat (Tensor Dict): dst feature tensor dict. - edge_feat (Tensor Dict): edge feature tensor dict. - - Returns: - Tensor Dict: tensor dict which use 'msg' as the key. - """ - # input_ids - cls = paddle.full(shape=[src_feat["term_ids"].shape[0], 1], dtype="int64", fill_value=self.cls_token_id) - src_ids = paddle.concat([cls, src_feat["term_ids"]], 1) - - dst_ids = dst_feat["term_ids"] - - # sent_ids - sent_ids = paddle.concat([paddle.zeros_like(src_ids), paddle.ones_like(dst_ids)], 1) - term_ids = paddle.concat([src_ids, dst_ids], 1) - - # build position_ids - input_mask = paddle.cast(term_ids > 0, "int64") - position_ids = paddle.cumsum(input_mask, axis=1) - 1 - - outputs = self.ernie(term_ids, sent_ids, position_ids) - feature = outputs[1] - return {"msg": feature} - - def send_recv(self, graph, term_ids): - """Message Passing of erniesage v2. - - Args: - graph (Graph): the Graph object. - feature (Tensor): the node feature tensor. - - Returns: - Tensor: the self and neighbor feature tensors. 
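-
-        Note: the neighbor feature aggregates the ernie_send() messages computed
-        on each edge, while the self feature comes from running ernie on
-        [CLS] + term_ids of the nodes themselves.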
- """ - - def _recv_func(message): - return getattr(message, self.aggr_func)(message["msg"]) - - msg = graph.send(self.ernie_send, node_feat={"term_ids": term_ids}) - neigh_feature = graph.recv(reduce_func=_recv_func, msg=msg) - - cls = paddle.full(shape=[term_ids.shape[0], 1], dtype="int64", fill_value=self.cls_token_id) - term_ids = paddle.concat([cls, term_ids], 1) - term_ids.stop_gradient = True - outputs = self.ernie(term_ids, paddle.zeros_like(term_ids)) - self_feature = outputs[1] - - return self_feature, neigh_feature - - def forward(self, graph, term_ids, act="relu"): - """Forward funciton of Conv layer. - - Args: - graph (Graph): Graph object. - feature (Tensor): node feture. - act (str, optional): activation function. Defaults to 'relu'. - - Returns: - Tensor: feature after conv. - """ - - self_feature, neigh_feature = self.send_recv(graph, term_ids) - self_feature = self.self_linear(self_feature) - neigh_feature = self.neigh_linear(neigh_feature) - output = self_feature + neigh_feature - if act is not None: - output = getattr(F, act)(output) - output = F.normalize(output, axis=1) - return output diff --git a/examples/text_graph/erniesage/models/encoder.py b/examples/text_graph/erniesage/models/encoder.py deleted file mode 100644 index 9363beb43a45..000000000000 --- a/examples/text_graph/erniesage/models/encoder.py +++ /dev/null @@ -1,133 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import paddle -import paddle.nn as nn -import paddle.nn.functional as F -from models.conv import ErnieSageV2Conv, GraphSageConv - - -class Encoder(nn.Layer): - """Base class - Chose different type ErnieSage class. - """ - - def __init__(self, config): - """init function - - Args: - config (Dict): all configs. - """ - super(Encoder, self).__init__() - self.config = config - # Don't add ernie to self, oterwise, there will be more copies of ernie weights - # self.ernie = ernie - - @classmethod - def factory(cls, config, ernie): - """Classmethod for ernie sage model. - - Args: - config (Dict): all configs. - ernie (nn.Layer): the ernie model. - - Raises: - ValueError: Invalid ernie sage model type. - - Returns: - Class: real model class. - """ - model_type = config.model_type - if model_type == "ErnieSageV2": - return ErnieSageV2Encoder(config, ernie) - else: - raise ValueError("Invalid ernie sage model type") - - def forward(self, *args, **kwargs): - raise NotImplementedError - - -class ErnieSageV2Encoder(Encoder): - def __init__(self, config, ernie): - """Ernie sage v2 encoder - - Args: - config (Dict): all config. - ernie (nn.Layer): the ernie model. 
- """ - super(ErnieSageV2Encoder, self).__init__(config) - # Don't add ernie to self, oterwise, there will be more copies of ernie weights - # self.ernie = ernie - self.convs = nn.LayerList() - fc_lr = self.config.lr / 0.001 - erniesage_conv = ErnieSageV2Conv( - ernie, - ernie.config["hidden_size"], - self.config.hidden_size, - learning_rate=fc_lr, - cls_token_id=self.config.cls_token_id, - aggr_func="sum", - ) - self.convs.append(erniesage_conv) - for i in range(1, self.config.num_layers): - layer = GraphSageConv( - self.config.hidden_size, self.config.hidden_size, learning_rate=fc_lr, aggr_func="sum" - ) - self.convs.append(layer) - - if self.config.final_fc: - self.linear = nn.Linear( - self.config.hidden_size, self.config.hidden_size, weight_attr=paddle.ParamAttr(learning_rate=fc_lr) - ) - - def take_final_feature(self, feature, index): - """Gather the final feature. - - Args: - feature (Tensor): the total featue tensor. - index (Tensor): the index to gather. - - Returns: - Tensor: final result tensor. - """ - feat = paddle.gather(feature, index) - if self.config.final_fc: - feat = self.linear(feat) - if self.config.final_l2_norm: - feat = F.normalize(feat, axis=1) - return feat - - def forward(self, graphs, term_ids, inputs): - """forward train function of the model. - - Args: - graphs (Graph List): list of graph tensors. - inputs (Tensor List): list of input tensors. - - Returns: - Tensor List: list of final feature tensors. - """ - # term_ids for ErnieSageConv is the raw feature. - feature = term_ids - for i in range(len(graphs), self.config.num_layers): - graphs.append(graphs[0]) - for i in range(0, self.config.num_layers): - if i == self.config.num_layers - 1 and i != 0: - act = None - else: - act = "leaky_relu" - feature = self.convs[i](graphs[i], feature, act) - - final_feats = [self.take_final_feature(feature, x) for x in inputs] - return final_feats diff --git a/examples/text_graph/erniesage/models/loss.py b/examples/text_graph/erniesage/models/loss.py deleted file mode 100644 index 3648c27821c1..000000000000 --- a/examples/text_graph/erniesage/models/loss.py +++ /dev/null @@ -1,69 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import paddle -import paddle.nn as nn -import paddle.nn.functional as F - - -def LossFactory(config): - """Choose different type of loss by config - - Args: - config (Dict): config file. - - Raises: - ValueError: invalid loss type. - - Returns: - Class: the real class object. 
- """ - loss_type = config.loss_type - if loss_type == "hinge": - return HingeLoss(config.margin) - elif loss_type == "softmax_with_cross_entropy": - return SoftmaxWithCrossEntropy() - else: - raise ValueError("invalid loss type") - - -class SoftmaxWithCrossEntropy(nn.Layer): - """softmax with cross entropy loss""" - - def __init__(self, config): - super(SoftmaxWithCrossEntropy, self).__init__() - - def forward(self, logits, label): - return F.cross_entropy(logits, label, reduction="mean") - - -class HingeLoss(nn.Layer): - """Hinge Loss for the pos and neg.""" - - def __init__(self, margin): - super(HingeLoss, self).__init__() - self.margin = margin - - def forward(self, pos, neg): - """forward function - - Args: - pos (Tensor): pos score. - neg (Tensor): neg score. - - Returns: - Tensor: final hinge loss. - """ - loss = paddle.mean(F.relu(neg - pos + self.margin)) - return loss diff --git a/examples/text_graph/erniesage/models/model.py b/examples/text_graph/erniesage/models/model.py deleted file mode 100755 index 4884baacc860..000000000000 --- a/examples/text_graph/erniesage/models/model.py +++ /dev/null @@ -1,68 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import paddle -from models.encoder import Encoder -from models.loss import LossFactory - -from paddlenlp.transformers import ErnieModel, ErniePretrainedModel - -__all__ = ["ErnieSageForLinkPrediction"] - - -class ErnieSageForLinkPrediction(ErniePretrainedModel): - """ErnieSage for link prediction task.""" - - def __init__(self, config, config_file): - """Model which Based on the PaddleNLP PretrainedModel - - Note: - 1. the ernie must be the first argument. - 2. must set self.XX = ernie to load weights. - 3. the self.config keyword is taken by PretrainedModel class. - - Args: - ernie (nn.Layer): the submodule layer of ernie model. - config (Dict): the config file - """ - super(ErnieSageForLinkPrediction, self).__init__(config) - self.config_file = config_file - self.ernie = ErnieModel(config) - self.encoder = Encoder.factory(self.config_file, self.ernie) - self.loss_func = LossFactory(self.config_file) - - def forward(self, graphs, data): - """Forward function of link prediction task. - - Args: - graphs (Graph List): the Graph list. - data (Tensor List): other input of the model. - - Returns: - Tensor: loss and output tensors. 
- """ - term_ids, user_index, pos_item_index, neg_item_index, user_real_index, pos_item_real_index = data - # encoder model - outputs = self.encoder(graphs, term_ids, [user_index, pos_item_index, neg_item_index]) - user_feat, pos_item_feat, neg_item_feat = outputs - - # calc loss - if self.config_file.neg_type == "batch_neg": - neg_item_feat = pos_item_feat - - pos = paddle.sum(user_feat * pos_item_feat, -1, keepdim=True) # [B, 1] - neg = paddle.matmul(user_feat, neg_item_feat, transpose_y=True) # [B, B] - loss = self.loss_func(pos, neg) - # return loss, outputs - return loss, outputs + [user_real_index, pos_item_real_index] diff --git a/examples/text_graph/erniesage/preprocessing/dump_graph.py b/examples/text_graph/erniesage/preprocessing/dump_graph.py deleted file mode 100644 index d2de5674a63f..000000000000 --- a/examples/text_graph/erniesage/preprocessing/dump_graph.py +++ /dev/null @@ -1,154 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import argparse -import io -import os -from functools import partial -from io import open - -import numpy as np -import pgl -import yaml -from easydict import EasyDict as edict -from pgl.graph_kernel import alias_sample_build_table -from pgl.utils.logger import log - -from paddlenlp.transformers import ErnieTinyTokenizer, ErnieTokenizer - -TOKENIZER_CLASSES = { - "ernie-tiny": ErnieTinyTokenizer, - "ernie-1.0": ErnieTokenizer, -} - - -def term2id(string, tokenizer, max_seqlen): - # string = string.split("\t")[1] - tokens = tokenizer._tokenize(string) - ids = tokenizer.convert_tokens_to_ids(tokens) - ids = ids[: max_seqlen - 1] - ids = ids + [tokenizer.sep_token_id] - ids = ids + [tokenizer.pad_token_id] * (max_seqlen - len(ids)) - return ids - - -def load_graph(config, str2id, term_file, terms, item_distribution): - edges = [] - with io.open(config.graph_data, encoding=config.encoding) as f: - for idx, line in enumerate(f): - if idx % 100000 == 0: - log.info("%s readed %s lines" % (config.graph_data, idx)) - slots = [] - for col_idx, col in enumerate(line.strip("\n").split("\t")): - s = col[: config.max_seqlen] - if s not in str2id: - str2id[s] = len(str2id) - term_file.write(str(col_idx) + "\t" + col + "\n") - item_distribution.append(0) - slots.append(str2id[s]) - - src = slots[0] - dst = slots[1] - edges.append((src, dst)) - edges.append((dst, src)) - item_distribution[dst] += 1 - edges = np.array(edges, dtype="int64") - return edges - - -def load_link_prediction_train_data(config, str2id, term_file, terms, item_distribution): - train_data = [] - neg_samples = [] - with io.open(config.train_data, encoding=config.encoding) as f: - for idx, line in enumerate(f): - if idx % 100000 == 0: - log.info("%s readed %s lines" % (config.train_data, idx)) - slots = [] - for col_idx, col in enumerate(line.strip("\n").split("\t")): - s = col[: config.max_seqlen] - if s not in str2id: - str2id[s] = len(str2id) - term_file.write(str(col_idx) + "\t" + col + "\n") - item_distribution.append(0) - 
slots.append(str2id[s]) - - src = slots[0] - dst = slots[1] - neg_samples.append(slots[2:]) - train_data.append((src, dst)) - train_data = np.array(train_data, dtype="int64") - np.save(os.path.join(config.graph_work_path, "train_data.npy"), train_data) - if len(neg_samples) != 0: - np.save(os.path.join(config.graph_work_path, "neg_samples.npy"), np.array(neg_samples)) - - -def dump_graph(config): - if not os.path.exists(config.graph_work_path): - os.makedirs(config.graph_work_path) - str2id = dict() - term_file = io.open(os.path.join(config.graph_work_path, "terms.txt"), "w", encoding=config.encoding) - terms = [] - item_distribution = [] - - edges = load_graph(config, str2id, term_file, terms, item_distribution) - if config.task == "link_prediction": - load_link_prediction_train_data(config, str2id, term_file, terms, item_distribution) - else: - raise ValueError - - term_file.close() - num_nodes = len(str2id) - str2id.clear() - - log.info("Building graph...") - graph = pgl.graph.Graph(num_nodes=num_nodes, edges=edges) - # indegree = graph.indegree() - graph.indegree() - graph.outdegree() - graph.dump(config.graph_work_path) - - # dump alias sample table - item_distribution = np.array(item_distribution) - item_distribution = np.sqrt(item_distribution) - distribution = 1.0 * item_distribution / item_distribution.sum() - alias, events = alias_sample_build_table(distribution) - np.save(os.path.join(config.graph_work_path, "alias.npy"), alias) - np.save(os.path.join(config.graph_work_path, "events.npy"), events) - log.info("End Build Graph") - - -def dump_node_feat(config): - log.info("Dump node feat starting...") - id2str = [ - line.strip("\n").split("\t")[-1] - for line in io.open(os.path.join(config.graph_work_path, "terms.txt"), encoding=config.encoding) - ] - # pool = multiprocessing.Pool() - - tokenizer_class = TOKENIZER_CLASSES[config.model_name_or_path] - tokenizer = tokenizer_class.from_pretrained(config.model_name_or_path) - fn = partial(term2id, tokenizer=tokenizer, max_seqlen=config.max_seqlen) - term_ids = [fn(x) for x in id2str] - - np.save(os.path.join(config.graph_work_path, "term_ids.npy"), np.array(term_ids, np.uint16)) - log.info("Dump node feat done.") - - -if __name__ == "__main__": - parser = argparse.ArgumentParser(description="main") - parser.add_argument("--conf", type=str, default="./config.yaml") - args = parser.parse_args() - config = edict(yaml.load(open(args.conf), Loader=yaml.FullLoader)) - dump_graph(config) - dump_node_feat(config) diff --git a/examples/text_to_knowledge/README.md b/examples/text_to_knowledge/README.md deleted file mode 100644 index 39d41248a007..000000000000 --- a/examples/text_to_knowledge/README.md +++ /dev/null @@ -1,168 +0,0 @@ -# 解语(Text to Knowledge) - -[解语官网](https://www.paddlepaddle.org.cn/textToKnowledge) - -解语(Text to Knowledge)是首个覆盖中文全词类的知识库(百科知识树)及知识标注与挖掘框架,拥有可描述所有中文词汇的词类体系、中文知识标注工具集,以及更适用于中文挖掘任务的预训练语言模型。 - - -覆盖中文全词类的知识库和知识标注工具能够帮助你面对更加多元的应用场景,方便地融合自有知识体系,显著提升中文文本解析和挖掘效果,并能够更容易地利用知识增强机器学习模型效果。解语经过大规模工业应用验证,在实际业务中取得了良好的应用效果,适合通用领域中文文本理解任务。 - -image - - -**解语由以下三部分构成:** - -- [百科知识树(TermTree)](./termtree) :包括能够描述所有中文词汇的TermType词类体系,以及Term关系和属性值。 -- 中文知识标注工具集:包括[词类知识标注工具(WordTag)](./wordtag) 和[名词短语标注工具(NPTag)](./nptag),[适用于中文文本挖掘的预训练语言模型(ERNIE-CTM)](./ernie-ctm),为中文文本解析提供词类序列标注框架,结合百科知识树可实现定制化词类序列标注。 -- 中文知识挖掘方案:包括[知识模板挖掘工具](./wordtag-ie),旨在提供灵活可配置,可快速定制的中文知识挖掘方案。 - -**本次发布的解语开源试用版包括:** - -- 百科知识树(TermTree)V1.0试用版:包括简化版的TermType词类体系,和约100w的term集。 -- 中文词类知识标注工具(WordTag)V1.0版。 -- 名词短语标注工具(NPTag)V1.0版。 -- 中文预训练语言模型(ERNIE-CTM)V1.0版。 - - 
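-
-作为快速上手的参考,下面给出一个最小的一键标注示意(基于下文“自定义模型一键预测”一节中出现的 Taskflow 接口;默认任务会加载 WordTag 词类知识标注模型,返回字段以下文的 JSON 示例为准):
-
-```python
-from paddlenlp import Taskflow
-
-# "knowledge_mining" 任务默认加载 WordTag 模型,完成词类序列标注
-wordtag = Taskflow("knowledge_mining")
-
-results = wordtag("美人鱼是周星驰执导的电影")
-# 返回的每个 item 带有 wordtag_label(词类标签),多数 item 还带有 termid,
-# 可据此关联(term-linking)到百科知识树(TermTree)
-print(results)
-```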
----- - -## 解语的应用场景 - -解语可直接用于各类中文文本解析与挖掘任务,提升文本解析与挖掘精度;也可以作为中文文本特征生成器,为各类机器学习模型提供文本特征。 - -中文词类知识标注工具(WordTag)整合了传统中文解析的**分词**、**词性标注**、**命名实体识别**的能力,能够将任意中文句子解析为**完整的词类序列**。结合百科知识树(TermTree),可为应用提供一套通用的知识关联(term-linking)框架,方便应用适配关联自己的应用知识图谱,更好地将知识用于中文自然语言处理(NLP)任务。 - -![解语示例](doc/img/text_to_knowledge_example.png) - - -### 应用场景A:文本挖掘/解析模板生成与匹配 - -虽然近年来,深度学习模型尤其是预训练语言模型的广泛使用大幅提升了各项中文NLP任务效果,但在实际的工业应用中,单独使用深度学习模型往往达不到应用需求,还需要结合规则模型以提升精度以及解决恶劣case,如,知识图谱构建、query解析、语义一致性判定等应用。 - -在这些应用中,文本挖掘/解析模板是最常用的规则模型。WordTag包含了覆盖中文所有词汇的词类标注体系,在生成模板以及模板匹配上有着天然的优势。用户可以根据WordTag标注的样本词类序列,自动生成或配置更加丰富、精准的挖掘/解析模板,然后对目标文本使用WordTag标注,即可利用模板进行匹配,从而大大降低人工配置模板的代价,显著提升生产效率。 - -例如,输入文本:*美人鱼是周星驰执导的电影*,得到预测结果: - -```json -{ - "text": "美人鱼是周星驰执导的电影", - "items": [ - { - "item": "美人鱼", - "offset": 0, - "wordtag_label": "作品类_实体", - "length": 3, - "termid": "作品与出版物_eb_美人鱼" - }, - { - "item": "是", - "offset": 3, - "wordtag_label": "肯定词", - "length": 1, - "termid": "肯定否定词_cb_是" - }, - { - "item": "周星驰", - "offset": 4, - "wordtag_label": "人物类_实体", - "length": 3, - "termid": "人物_eb_周星驰" - }, - { - "item": "执导", - "offset": 7, - "wordtag_label": "场景事件", - "length": 2, - "termid": "场景事件_cb_执导" - }, - { - "item": "的", - "offset": 9, - "wordtag_label": "助词", - "length": 1, - "termid": "助词_cb_的" - }, - { - "item": "电影", - "offset": 10, - "wordtag_label": "作品类_概念", - "length": 2, - "termid": "影视作品_cb_电影" - } - ] -} -``` - -将上述标注结果中的词类序列取出,去除虚词、标点等与语义无关的词,可将抽取出的词类直接构造成为挖掘匹配模板: - -``` -[作品类_实体][肯定词|是][人物类_实体][场景事件|执导][作品类_概念|电影] -``` - -利用该模板,以及结合TermTree进行概念扩展,可以匹配出所有该句式的文本,例如: - -> 《狂人日记》是鲁迅创作的第一个短篇白话日记体小说 -> -> 《澳门风云》是王晶创作执导的合家欢贺岁喜剧赌片 -> -> 《千王之王2000》是一部王晶于1999年执导的喜剧电影 -> -> 《射雕英雄传》是金庸创作的长篇武侠小说 - -WordTag的标注结果中,区分了“人物类\_实体”和“人物类\_概念”,以及“作品类\_实体”和“作品类\_概念”,使得模板生成更为精准。同时,TermTree中也区分了命名实体词(eb: entity base)与非实体词(cb: concept base),这样,可以利用TermTree分别进行实体扩展(e.g., 周星驰->王晶)和概念扩展(e.g., 电影->小说),生成更加丰富多样的模板,支持更细化的应用场景。 - -### 应用场景B:词类知识增强的深度学习模型 - -词类特征同时也是一类重要的文本特征,可为原始文本token提供有效的边界信息、归组信息,减少样本中的噪音,防止模型过拟合;还可作为层次泛化特征,弥补统计共现特征的不足。 - -在深度学习模型应用中,可将WordTag产出的词类作为embedding特征,直接叠加到文本token上,作为深度学习模型的输入;在BERT等模型中,也可以将词类作为文本序列中的一部分,利用position id和可见性矩阵控制token和词类特征之间的可见性,作为深度学习模型的输入。 - -### 应用场景C:知识图谱关联(term-linking) - -随着知识图谱技术的普及和越来越多应用知识图谱数据的发布,如何利用知识提升NLP任务效果,成为近年来NLP研究的热点方向。文本与图谱知识结合的前提是将图谱中的实体准确link到文本上,这是知识图谱应用的一大难点。现有的方案多是基于某个特定图谱实现的,缺乏通用的图谱关联解决方案。我们尝试使用“**WordTag+TermTree**”提供一套通用的图谱关联(term-linking)技术框架。 - -**NOTE:** 为了避免歧义,我们 **用term统一指代图谱收录的各类实体、概念、术语**。 - -为了能够适配应用中的不同实体集(例如,不同的企业有不同的人物实体集合,不同的小说站有不同的小说实体集合),我们将term-linking拆分为两个步骤: - -- 第一步是基于词类的linking,主要解决“同名概念词/实体词”、“不同类的同名词”消歧问题,这一步只使用文本本身特征和词类特征,不使用图谱中的实体属性值(SPO)知识,从而支持切换不同应用图谱; -- 第二步是同类同名实体词的linking,主要解决同类下不同属性值的实体消歧问题,这一步需要使用实体词的SPO知识(一般用于实体特征表示计算,以及文本-实体相似度计算)。 - -“WordTag+TermTree”的开源版提供了第一步的解决示例,第二步由于依赖于特定图谱的SPO知识,暂时无法提供通用工具,未来可能提供通用解决方案。 - -### 应用场景D:文本分类和文本挖掘样本优化 - -工业NLP应用场景中,文本分类、文本挖掘是最常见的任务。虽然,预训练语言模型的技术进步大幅提升了小样本学习的效果,但要达到理想的工业应用效果,还是需要大规模高精度监督训练样本。 - -人工标注可以产出高精度小规模训练样本。半监督学习等技术可以帮助用户基于人工标准样本快速扩充样本规模,但无法保证样本精度。这种情况下,可以使用“WordTag+TermTree”辅助筛选和修正样本,提升样本精度,例如: - -- 使用WordTag产出样本模板,再利用TermTree进行泛化约束,筛选出高置信度的样本,或者过滤不合格的样本; - -- 利用词类关系检测类别与样本的一致性,比如,医疗类文本与“疾病损伤、药物、医疗卫生机构”等词类相关,可以利用TermTree知识筛选出该类别高置信度的样本。 - -此外,统计模型容易拟合高频term,导致在低频term上泛化效果不好,这时可以利用TermTree筛选样本,提升样本平衡性,从而提升模型泛化能力。 - -## 后续计划 - -1. 发布百科知识树(TermTree)正式版数据,建立知识共建社区,支持用户提交应用词表/应用图谱 & 定制化TermTree, [TermTree下载链接](https://kg-concept.bj.bcebos.com/TermTree/TermTree.V1.0.tar.gz); -2. 持续优化ERNIE-CTM预训练模型,支持多种参数规模模型发布,探索更好的适配中文解析挖掘任务的预训练模型; -3. 
持续优化中文文本知识标注工具集,提供更加精准的知识标注服务;发布多粒度标注工具,支持更加丰富的应用场景。 - -## 在论文中引用解语 - -如果您的工作成果中使用了解语,请增加下述引用。我们非常乐于看到解语对您的工作带来帮助。 - -``` -@article{zhao2020TermTree, - title={TermTree and Knowledge Annotation Framework for Chinese Language Understanding}, - author={Zhao, Min and Qin, Huapeng and Zhang, Guoxin and Lyu, Yajuan and Zhu, Yong}, - technical report={Baidu, Inc. TR:2020-KG-TermTree}, - year={2020} -} -``` - - - -## 问题与反馈 - -解语在持续优化中,如果您有任何建议或问题,欢迎提交issue到Github。 diff --git a/examples/text_to_knowledge/doc/img/ernie_ctm_inputs.png b/examples/text_to_knowledge/doc/img/ernie_ctm_inputs.png deleted file mode 100644 index f5ec1073b759..000000000000 Binary files a/examples/text_to_knowledge/doc/img/ernie_ctm_inputs.png and /dev/null differ diff --git a/examples/text_to_knowledge/doc/img/ernie_ctm_model.png b/examples/text_to_knowledge/doc/img/ernie_ctm_model.png deleted file mode 100644 index 2d886e91f593..000000000000 Binary files a/examples/text_to_knowledge/doc/img/ernie_ctm_model.png and /dev/null differ diff --git a/examples/text_to_knowledge/doc/img/text_to_knowledge.png b/examples/text_to_knowledge/doc/img/text_to_knowledge.png deleted file mode 100644 index 2a158a0b256d..000000000000 Binary files a/examples/text_to_knowledge/doc/img/text_to_knowledge.png and /dev/null differ diff --git a/examples/text_to_knowledge/doc/img/text_to_knowledge_example.png b/examples/text_to_knowledge/doc/img/text_to_knowledge_example.png deleted file mode 100644 index bf2e2212268b..000000000000 Binary files a/examples/text_to_knowledge/doc/img/text_to_knowledge_example.png and /dev/null differ diff --git a/examples/text_to_knowledge/doc/img/wordtag_example.png b/examples/text_to_knowledge/doc/img/wordtag_example.png deleted file mode 100644 index b415962dda24..000000000000 Binary files a/examples/text_to_knowledge/doc/img/wordtag_example.png and /dev/null differ diff --git a/examples/text_to_knowledge/doc/img/wordtag_model.png b/examples/text_to_knowledge/doc/img/wordtag_model.png deleted file mode 100644 index 705e9b7d05a0..000000000000 Binary files a/examples/text_to_knowledge/doc/img/wordtag_model.png and /dev/null differ diff --git a/examples/text_to_knowledge/ernie-ctm/README.md b/examples/text_to_knowledge/ernie-ctm/README.md deleted file mode 100644 index 1e6a85355c60..000000000000 --- a/examples/text_to_knowledge/ernie-ctm/README.md +++ /dev/null @@ -1,167 +0,0 @@ - -# 解语:ERNIE-CTM(ERNIE for **Chinese Text Mining**) - -ERNIE-CTM是适用于中文文本挖掘任务的预训练语言模型,拥有更全面的汉字字表集合,更优的中文文本挖掘任务表现,与PaddleNLP深度结合,提供更加便捷的应用实践。 - -## ERNIE-CTM特点 - -- 全面的中文汉字字表扩充 - - ERNIE-CTM的字符集包含2万+汉字,以及中文常用符号(常用标点、汉语拼音、编号)、部分外语符号(假名、单位)等,大幅减少中文解析挖掘任务中UNK(未识别字符)引发的标注问题。同时,ERNIE-CTM使用了embedding分解,可以更加灵活地扩充应用字表。 -- 更加适配中文文本挖掘任务 - - ERNIE-CTM中在每个表示后面添加了全局信息,在序列特征上叠加了全局的信息,使得在文本挖掘任务上有更加强力的表现。 -- 支持多种特征训练的模型结构 - - ERNIE-CTM的模型结构中,支持多种特征训练,用户可按照自己的需求任意添加任务及对应特征训练模型,而无需考虑任务之间的冲突所造成的灾难性遗忘。 - - - -## ERNIE-CTM模型介绍 - -### 模型结构 - -ERNIE-CTM的模型结构大体与BERT相同,都是双向transformer结构。区别是,ERNIE-CTM为能灵活扩充字表,采用了ALBERT的embedding分解,将embedding层分解为128维,参数列表如下: - -| 模型 | embedding size | hidden size | hidden layers | vocab size | -| -------------- | -------------- | ----------- | ------------- | ---------- | -| ERNIE-CTM-base | 128 | 768 | 12 | 23000 | - -ERNIE-CTM以字粒度建模,英文区分大小写,其输入表示如下: - -![ERNIE-CTM输入](../doc/img/ernie_ctm_inputs.png) - -其中,`[CLS{n}]`是ERNIE-CTM预留出的全局观察位,其中`n`从0开始计数,该全局观察位用于不同的训练任务,建模不同的语义特征,在下游任务中,可以结合使用,如使用attention筛选/融合特征,以达到更好的效果。而在灵活使用`[CLS{n}]`的时候,为中途增减任务token时不影响文本输入,所有的`[CLS{n}]`的位置编码均为0,且可以使用可见性矩阵(visible 
matrix)控制`[CLS{n}]`位置的特征对序列中其他位置,以及其他全局观察位的可见性,以获得更加灵活、独立的特征表示。
-
-本次开源的ERNIE-CTM-base模型中,使用了两个全局观察位`[CLS0]`和`[CLS1]`,具体作用见下文预训练任务介绍。
-
-### 预训练任务
-
-ERNIE-CTM使用的预训练任务为掩码语言模型(Masked Language Model,MLM)及ALBERT所使用的句子顺序预测(Sentence Order Prediction,SOP)。
-
-其中`[CLS0]`用于训练SOP任务,训练方式如ALBERT中所述:正例为同一篇文章中的两个连续句子,负例为同一篇文章中两个连续句子的顺序翻转。
-
-`[CLS1]`作为全局的监督信号,应用于MLM任务中。训练MLM任务前,将`[CLS1]`特征表示拼接在所有的序列表示之后,通过线性层融合,成为最终的序列表示,之后再预测MLM任务。所以,ERNIE-CTM最终输出的文本序列表示中,都融合了`[CLS1]`的特征表示。最终的序列表示带有全句的特征,一定程度上可避免序列中全局特征捕捉不足;同时,`[CLS1]`最终的表示中也充分融合了句子内容的信息,弥补了SOP任务对文本主题信息捕捉不足的缺陷。
-
-![ERNIE-CTM总体结构](../doc/img/ernie_ctm_model.png)
-
-### WordTag增量训练
-
-在ERNIE-CTM微调任务中我们提供了一个基于[WordTag](../wordtag)的百科知识标注任务,该任务旨在对中文文本进行词类知识标注,其词类体系覆盖所有中文词汇,包括各类实体词与非实体词(如概念、实体/专名、语法词等)。除了使用已有的WordTag工具对通用中文文本进行词类知识标注,WordTag同样支持用户使用自己的数据进行增量训练,下面是在WordTag模型上进行增量训练的具体示例流程。
-
-#### 代码结构说明
-
-```text
-ernie-ctm/
-├── data_process.py # 训练数据处理脚本
-├── metric.py # 模型效果验证指标脚本
-├── predict.py # 预测脚本
-├── README.md # 使用说明
-├── train.py # 训练脚本
-└── utils.py # 工具函数
-```
-
-#### 数据准备
-
-我们提供了少量样本用于示例增量训练。执行以下命令,下载并解压示例数据集:
-
-```bash
-wget https://bj.bcebos.com/paddlenlp/datasets/wordtag_dataset_v3.tar.gz && tar -zxvf wordtag_dataset_v3.tar.gz
-```
-解压之后
-
-```text
-data/
-├── dev.txt # 验证集
-├── tags.txt # WordTag标签集合
-└── train.txt # 训练集
-```
-
-训练样本示例如下,每个单词以“/type”的形式标记其词类或实体类别,单词之间使用空格作为切分标记:
-
-```text
-砚台/物体类 与/连词 笔/物体类 、/w 墨/物体类 、/w 纸/物体类 是/肯定词 中国/世界地区类 传统/修饰词 的/助词 文房四宝/词汇用语 。/w
-《/w 全球化与中国:理论与发展趋势/作品类_实体 》/w 是/肯定词 2010年/时间类 经济管理出版社/组织机构类 出版/场景事件 的/助词 图书/作品类_概念 ,/w 作者/人物类_概念 是/肯定词 余永定/人物类_实体 、/w 路爱国/人物类_实体 、/w 高海红/人物类_实体 。/w
-```
-
-#### 模型训练
-
-```shell
-python -m paddle.distributed.launch --gpus "0" train.py \
-    --max_seq_len 128 \
-    --batch_size 32 \
-    --learning_rate 5e-5 \
-    --num_train_epochs 3 \
-    --logging_steps 10 \
-    --save_steps 100 \
-    --output_dir ./output \
-    --device "gpu"
-```
-
-其中参数释义如下:
-- `max_seq_len` 表示最大句子长度,超过该长度将被截断。
-- `batch_size` 表示每次迭代**每张卡**上的样本数目。
-- `learning_rate` 表示基础学习率大小,将与learning rate scheduler产生的值相乘作为当前学习率。
-- `num_train_epochs` 表示训练轮数。
-- `logging_steps` 表示日志打印间隔。
-- `save_steps` 表示模型保存及评估间隔。
-- `output_dir` 表示模型保存路径。
-- `device` 表示训练使用的设备,'gpu'表示使用GPU,'xpu'表示使用百度昆仑卡,'cpu'表示使用CPU。
-
-
-#### 模型预测
-
-```shell
-export CUDA_VISIBLE_DEVICES=0
-python -m paddle.distributed.launch --gpus "0" predict.py \
-    --params_path ./output/model_300/model_state.pdparams \
-    --batch_size 32 \
-    --device "gpu"
-```
-
-## 自定义模型一键预测
-
-Taskflow支持加载增量训练后的模型进行一键预测,通过`task_path`指定用户自定义路径即可。
-
-文件组成:
-```text
-custom_task_path/
-├── model_state.pdparams
-├── model_config.json
-└── tags.txt
-```
-
-```python
-from paddlenlp import Taskflow
-
-my_wordtag = Taskflow("knowledge_mining", task_path="./custom_task_path/")
-
-my_wordtag("美人鱼是周星驰执导的一部电影")
-# [{'text': '美人鱼是周星驰执导的一部电影', 'items': [{'item': '美人鱼', 'offset': 0, 'wordtag_label': '作品类_实体', 'length': 3, 'termid': '作品与出版物_eb_美人鱼'}, {'item': '是', 'offset': 3, 'wordtag_label': '肯定词', 'length': 1, 'termid': '肯定否定词_cb_是'}, {'item': '周星驰', 'offset': 4, 'wordtag_label': '人物类_实体', 'length': 3, 'termid': '人物_eb_周星驰'}, {'item': '执导', 'offset': 7, 'wordtag_label': '场景事件', 'length': 2, 'termid': '场景事件_cb_执导'}, {'item': '的', 'offset': 9, 'wordtag_label': '助词', 'length': 1, 'termid': '助词_cb_的'}, {'item': '一部', 'offset': 10, 'wordtag_label': '数量词', 'length': 2}, {'item': '电影', 'offset': 12, 'wordtag_label': '作品类_概念', 'length': 2, 'termid': '影视作品_cb_电影'}]}]
-```
-
-
-## ERNIE-CTM后续计划
-
-
-1. 提升预训练语料的多样性(开源版主要使用了百度百科语料),持续优化预训练模型
-2. 发布其他参数量的预训练模型(tiny、large等),便于不同场景应用
-3. 
维护开源社区,探索模型优化方向,整合优秀idea - - - -## 在论文中引用ERNIE-CTM - -如果您的工作成果中使用了ERNIE-CTM,请增加下述引用。我们非常乐于看到ERNIE-CTM对您的工作带来帮助。 -``` -@article{zhao2020TermTree, - title={TermTree and Knowledge Annotation Framework for Chinese Language Understanding}, - author={Zhao, Min and Qin, Huapeng and Zhang, Guoxin and Lyu, Yajuan and Zhu, Yong}, - technical report={Baidu, Inc. TR:2020-KG-TermTree}, - year={2020} -} -``` - - - -## 问题与反馈 - -ERNIE-CTM在持续优化中,如果您有任何建议或问题,欢迎提交issue到Github。 diff --git a/examples/text_to_knowledge/ernie-ctm/data_process.py b/examples/text_to_knowledge/ernie-ctm/data_process.py deleted file mode 100644 index f40243dabb70..000000000000 --- a/examples/text_to_knowledge/ernie-ctm/data_process.py +++ /dev/null @@ -1,93 +0,0 @@ -# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import paddle - - -def load_dict(dict_path): - vocab = {} - i = 0 - with open(dict_path, "r", encoding="utf-8") as fin: - for line in fin: - vocab[line.strip()] = i - i += 1 - return vocab - - -def convert_example(example, tokenizer, max_seq_len, tags_to_idx=None, summary_num=2, is_test=False): - tokens = example["tokens"] - tokenized_input = tokenizer(tokens, return_length=True, is_split_into_words="token", max_seq_len=max_seq_len) - - if is_test: - return tokenized_input["input_ids"], tokenized_input["token_type_ids"], tokenized_input["seq_len"] - - tags = example["tags"] - if len(tokenized_input["input_ids"]) - 1 - summary_num < len(tags): - tags = tags[: len(tokenized_input["input_ids"]) - 1 - summary_num] - # '[CLS]' and '[SEP]' will get label 'O' - tags = ["O"] * (summary_num) + tags + ["O"] - tags += ["O"] * (len(tokenized_input["input_ids"]) - len(tags)) - tokenized_input["tags"] = [tags_to_idx[x] for x in tags] - return ( - tokenized_input["input_ids"], - tokenized_input["token_type_ids"], - tokenized_input["seq_len"], - tokenized_input["tags"], - ) - - -def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): - if trans_fn: - dataset = dataset.map(trans_fn) - - shuffle = True if mode == "train" else False - if mode == "train": - batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) - else: - batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) - - return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) - - -def read_custom_data(filename): - """Reads data""" - with open(filename, "r", encoding="utf-8") as f: - for line in f: - example = transfer_str_to_example(line.strip()) - yield example - - -def transfer_str_to_example(sample): - text = "" - tags = [] - items = sample.split(" ") - items = [item.rsplit("/", 1) for item in items] - for w, t in items: - text += w - if len(w) == 1: - tags.append(f"S-{t}") - else: - l = len(w) - for j in range(l): - if j == 0: - tags.append(f"B-{t}") - elif j == l - 1: - tags.append(f"E-{t}") - else: - tags.append(f"I-{t}") - res = { - "tokens": 
list(text), - "tags": tags, - } - return res diff --git a/examples/text_to_knowledge/ernie-ctm/metric.py b/examples/text_to_knowledge/ernie-ctm/metric.py deleted file mode 100644 index 4341bd743a27..000000000000 --- a/examples/text_to_knowledge/ernie-ctm/metric.py +++ /dev/null @@ -1,219 +0,0 @@ -# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -from typing import List, Tuple - -import paddle - - -class SequenceAccuracy(paddle.metric.Metric): - """ - Masked language model pre-train task accuracy. - """ - - def __init__(self): - super(SequenceAccuracy, self).__init__() - self.correct_k = 0 - self.total = 0 - - def compute(self, pred, label, ignore_index): - pred = paddle.argmax(pred, 1) - active_acc = label.reshape([-1]) != ignore_index - active_pred = pred.masked_select(active_acc) - active_labels = label.masked_select(active_acc) - correct = active_pred.equal(active_labels) - return correct - - def update(self, correct): - self.correct_k += correct.cast("float32").sum(0) - self.total += correct.shape[0] - - def reset(self): - self.correct_k = 0 - self.total = 0 - - def accumulate(self): - return float(self.correct_k) / self.total - - def name(self): - return "Masked Language Model Accuracy" - - -def wordseg_hard_acc(list_a: List[Tuple[str, str]], list_b: List[Tuple[str, str]]) -> float: - """ - Calculate extra metrics of word-seg - - Args: - list_a: prediction list - list_b: real list - - Returns: - acc: the extra accuracy - """ - p, q = 0, 0 - a_l, b_l = 0, 0 - acc = 0.0 - while q < len(list_b) and p < len(list_a): - a_r = a_l + len(list_a[p][0]) - 1 - b_r = b_l + len(list_b[q][0]) - 1 - if a_r < b_l: - p += 1 - a_l = a_r + 1 - continue - if b_r < a_l: - q += 1 - b_l = b_r + 1 - continue - if a_l == b_l and a_r == b_r: - acc += 1.0 - p += 1 - q += 1 - a_l = a_r + 1 - b_l = b_r + 1 - continue - p += 1 - return acc - - -def wordtag_hard_acc(list_a: List[Tuple[str, str]], list_b: List[Tuple[str, str]]) -> float: - """ - Calculate extra metrics of word-tag - - Args: - list_a: prediction list - list_b: real list - - Returns: - acc: the extra accuracy - """ - p, q = 0, 0 - a_l, b_l = 0, 0 - acc = 0.0 - while q < len(list_b) and p < len(list_a): - a_r = a_l + len(list_a[p][0]) - 1 - b_r = b_l + len(list_b[q][0]) - 1 - if a_r < b_l: - p += 1 - a_l = a_r + 1 - continue - if b_r < a_l: - q += 1 - b_l = b_r + 1 - continue - if a_l == b_l and a_r == b_r: - if list_a[p][-1] == list_b[q][-1]: - acc += 1.0 - p += 1 - q += 1 - a_l, b_l = a_r + 1, b_r + 1 - continue - p += 1 - return acc - - -def wordtag_soft_acc(list_a: List[Tuple[str, str]], list_b: List[Tuple[str, str]]) -> float: - """ - Calculate extra metrics of word-tag - - Args: - list_a: prediction list - list_b: real list - - Returns: - acc: the extra accuracy - """ - p, q = 0, 0 - a_l, b_l = 0, 0 - acc = 0.0 - while q < len(list_b) and p < len(list_a): - a_r = a_l + len(list_a[p][0]) - 1 - b_r = b_l + len(list_b[q][0]) - 1 - if a_r < b_l: - p += 1 - a_l = a_r + 1 - continue - if b_r < 
a_l: - q += 1 - b_l = b_r + 1 - continue - if a_l == b_l and a_r == b_r: - if list_a[p][-1] == list_b[q][-1]: - acc += 1.0 - elif list_b[q][-1].startswith(list_a[p][-1]): - acc += 1.0 - elif list_b[q] == "词汇用语": - acc += 1.0 - p += 1 - q += 1 - a_l, b_l = a_r + 1, b_r + 1 - continue - p += 1 - return acc - - -def wordseg_soft_acc(list_a: List[Tuple[str, str]], list_b: List[Tuple[str, str]]) -> float: - """ - Calculate extra metrics of word-seg - - Args: - list_a: prediction list - list_b: real list - - Returns: - acc: the extra accuracy - """ - i, j = 0, 0 - acc = 0.0 - a_l, b_l = 0, 0 - while i < len(list_a) and j < len(list_b): - a_r = a_l + len(list_a[i][0]) - 1 - b_r = b_l + len(list_b[j][0]) - 1 - if a_r < b_l: - i += 1 - a_l = a_r + 1 - continue - if b_r < a_l: - j += 1 - b_l = b_r + 1 - continue - if a_l == b_l and a_r == b_r: - acc += 1.0 - a_l, b_l = a_r + 1, b_r + 1 - i, j = i + 1, j + 1 - continue - if a_l == b_l and a_r < b_r: - cnt = 0.0 - tmp_a_r = a_r - for k in range(i + 1, len(list_a)): - tmp_a_r += len(list_a[k]) - cnt += 1.0 - if tmp_a_r == b_r: - acc += cnt - i, j = k + 1, j + 1 - a_l, b_l = tmp_a_r + 1, b_r + 1 - break - i += 1 - continue - if a_l == b_l and a_r > b_r: - tmp_b_r = b_r - for k in range(j + 1, len(list_b)): - tmp_b_r += len(list_b[k]) - if tmp_b_r == a_r: - acc += 1.0 - i, j = i + 1, k + 1 - a_l, b_l = a_r + 1, tmp_b_r + 1 - break - j += 1 - continue - i += 1 - return acc diff --git a/examples/text_to_knowledge/ernie-ctm/predict.py b/examples/text_to_knowledge/ernie-ctm/predict.py deleted file mode 100644 index 8620f9c393a0..000000000000 --- a/examples/text_to_knowledge/ernie-ctm/predict.py +++ /dev/null @@ -1,88 +0,0 @@ -# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import argparse -import os - -import paddle -from data_process import convert_example, load_dict -from utils import decode - -from paddlenlp.data import Pad, Stack, Tuple -from paddlenlp.transformers import ErnieCtmTokenizer, ErnieCtmWordtagModel - -# yapf: disable -parser = argparse.ArgumentParser() -parser.add_argument("--params_path", type=str, default="./output/model_300/model_state.pdparams", required=True, help="The path to model parameters to be loaded.") -parser.add_argument("--data_dir", type=str, default="./data", help="The input data dir, should contain name_category_map.json.") -parser.add_argument("--max_seq_len", type=int, default=64, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") -parser.add_argument("--batch_size", type=int, default=32, help="Batch size per GPU/CPU for training.") -parser.add_argument('--device', type=str, choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") -args = parser.parse_args() -# yapf: enable - - -def do_predict(data, model, tokenizer, viterbi_decoder, tags_to_idx, idx_to_tags, batch_size=1, summary_num=2): - - examples = [] - for text in data: - example = {"tokens": list(text)} - input_ids, token_type_ids, seq_len = convert_example(example, tokenizer, args.max_seq_len, is_test=True) - - examples.append((input_ids, token_type_ids, seq_len)) - - batches = [examples[idx : idx + batch_size] for idx in range(0, len(examples), batch_size)] - - batchify_fn = lambda samples, fn=Tuple( # noqa: E731 - Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input_ids - Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # token_type_ids - Stack(dtype="int64"), # seq_len - ): fn(samples) - - all_pred_tags = [] - - model.eval() - for batch in batches: - input_ids, token_type_ids, seq_len = batchify_fn(batch) - input_ids = paddle.to_tensor(input_ids) - token_type_ids = paddle.to_tensor(token_type_ids) - seq_len = paddle.to_tensor(seq_len) - pred_tags = model(input_ids, token_type_ids, lengths=seq_len)[0] - all_pred_tags.extend(pred_tags.numpy().tolist()) - results = decode(data, all_pred_tags, summary_num, idx_to_tags) - return results - - -if __name__ == "__main__": - paddle.set_device(args.device) - - data = [ - "美人鱼是周星驰执导的一部电影", - ] - - tags_to_idx = load_dict(os.path.join(args.data_dir, "tags.txt")) - idx_to_tags = dict(zip(*(tags_to_idx.values(), tags_to_idx.keys()))) - - model = ErnieCtmWordtagModel.from_pretrained("wordtag", num_tag=len(tags_to_idx)) - tokenizer = ErnieCtmTokenizer.from_pretrained("wordtag") - - if args.params_path and os.path.isfile(args.params_path): - state_dict = paddle.load(args.params_path) - model.set_dict(state_dict) - print("Loaded parameters from %s" % args.params_path) - - results = do_predict( - data, model, tokenizer, model.viterbi_decoder, tags_to_idx, idx_to_tags, batch_size=args.batch_size - ) - print(results) diff --git a/examples/text_to_knowledge/ernie-ctm/train.py b/examples/text_to_knowledge/ernie-ctm/train.py deleted file mode 100644 index a67b5d09486f..000000000000 --- a/examples/text_to_knowledge/ernie-ctm/train.py +++ /dev/null @@ -1,203 +0,0 @@ -# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
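-
-# Fine-tunes ErnieCtmWordtagModel on custom WordTag-style data. A sketch of the
-# expected --data_dir layout (matching the sample dataset described in the
-# README above):
-#   train.txt / dev.txt: one sample per line, tokens annotated as "word/type"
-#       and separated by spaces, e.g. "砚台/物体类 与/连词 笔/物体类 ...";
-#   tags.txt: the tag vocabulary, one tag per line, loaded by load_dict().
-# read_custom_data() in data_process.py expands every "word/type" annotation
-# into per-character BIOES tags ("S-type" for a single-character word,
-# otherwise "B-type" ... "E-type") before tokenization.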
- -import argparse -import os -import random -import time -from functools import partial - -import numpy as np -import paddle -from data_process import convert_example, create_dataloader, load_dict, read_custom_data -from metric import SequenceAccuracy - -from paddlenlp.data import Pad, Stack, Tuple -from paddlenlp.datasets import load_dataset -from paddlenlp.transformers import ( - ErnieCtmTokenizer, - ErnieCtmWordtagModel, - LinearDecayWithWarmup, -) -from paddlenlp.utils.log import logger - - -def parse_args(): - parser = argparse.ArgumentParser() - - # yapf: disable - parser.add_argument("--data_dir", default="./data", type=str, help="The input data dir, should contain train.json.") - parser.add_argument("--init_from_ckpt", default=None, type=str, help="The path of checkpoint to be loaded.") - parser.add_argument("--output_dir", default="./output", type=str, help="The output directory where the model predictions and checkpoints will be written.",) - parser.add_argument("--max_seq_len", default=128, type=int, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.", ) - parser.add_argument("--learning_rate", default=5e-5, type=float, help="The initial learning rate for Adam.") - parser.add_argument("--num_train_epochs", default=3, type=int, help="Total number of training epochs to perform.", ) - parser.add_argument("--logging_steps", type=int, default=5, help="Log every X updates steps.") - parser.add_argument("--save_steps", type=int, default=100, help="Save checkpoint every X updates steps.") - parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.", ) - parser.add_argument("--weight_decay", default=0.0, type=float, help="Weight decay if we apply some.") - parser.add_argument("--warmup_steps", default=0, type=int, help="Linear warmup over warmup_steps. 
If > 0: Override warmup_proportion") - parser.add_argument("--warmup_proportion", default=0.0, type=float, help="Linear warmup proportion over total steps.") - parser.add_argument("--adam_epsilon", default=1e-6, type=float, help="Epsilon for Adam optimizer.") - parser.add_argument("--seed", default=1000, type=int, help="random seed for initialization") - parser.add_argument("--device", default="gpu", type=str, help="The device to select to train the model, is must be cpu/gpu/xpu.") - # yapf: enable - - args = parser.parse_args() - return args - - -def set_seed(seed): - """sets random seed""" - random.seed(seed) - np.random.seed(seed) - paddle.seed(seed) - - -@paddle.no_grad() -def evaluate(model, metric, data_loader, tags, tags_to_idx): - model.eval() - metric.reset() - losses = [] - for batch in data_loader(): - input_ids, token_type_ids, seq_len, tags = batch - loss, seq_logits = model(input_ids, token_type_ids, lengths=seq_len, tag_labels=tags)[:2] - loss = loss.mean() - losses.append(loss.numpy()) - - correct = metric.compute( - pred=seq_logits.reshape([-1, len(tags_to_idx)]), label=tags.reshape([-1]), ignore_index=tags_to_idx["O"] - ) - metric.update(correct) - acc = metric.accumulate() - logger.info("eval loss: %.5f, acc: %.5f" % (np.mean(losses), acc)) - model.train() - metric.reset() - - -def do_train(args): - paddle.set_device(args.device) - rank = paddle.distributed.get_rank() - if paddle.distributed.get_world_size() > 1: - paddle.distributed.init_parallel_env() - - set_seed(args.seed) - - train_ds = load_dataset( - read_custom_data, filename=os.path.join(args.data_dir, "train.txt"), is_test=False, lazy=False - ) - dev_ds = load_dataset(read_custom_data, filename=os.path.join(args.data_dir, "dev.txt"), is_test=False, lazy=False) - tags_to_idx = load_dict(os.path.join(args.data_dir, "tags.txt")) - - tokenizer = ErnieCtmTokenizer.from_pretrained("wordtag") - model = ErnieCtmWordtagModel.from_pretrained("wordtag", num_labels=len(tags_to_idx)) - - trans_func = partial(convert_example, tokenizer=tokenizer, max_seq_len=args.max_seq_len, tags_to_idx=tags_to_idx) - - def batchify_fn(samples): - fn = Tuple( # noqa: E731 - Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input_ids - Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # token_type_ids - Stack(dtype="int64"), # seq_len - Pad(axis=0, pad_val=tags_to_idx["O"], dtype="int64"), # tags - ) - return fn(samples) - - train_data_loader = create_dataloader( - train_ds, mode="train", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func - ) - - dev_data_loader = create_dataloader( - dev_ds, mode="dev", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func - ) - - if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): - state_dict = paddle.load(args.init_from_ckpt) - model.set_dict(state_dict) - - if paddle.distributed.get_world_size() > 1: - model = paddle.DataParallel(model) - - num_training_steps = len(train_data_loader) * args.num_train_epochs - warmup = args.warmup_steps if args.warmup_steps > 0 else args.warmup_proportion - lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, warmup) - - decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] - optimizer = paddle.optimizer.AdamW( - learning_rate=lr_scheduler, - epsilon=args.adam_epsilon, - parameters=model.parameters(), - weight_decay=args.weight_decay, - apply_decay_param_fun=lambda x: x in decay_params, - ) - - logger.info("Total 
steps: %s" % num_training_steps) - logger.info("WarmUp steps: %s" % warmup) - - metric = SequenceAccuracy() - - total_loss = 0 - global_step = 0 - - for epoch in range(1, args.num_train_epochs + 1): - logger.info(f"Epoch {epoch} beginnig") - start_time = time.time() - - for total_step, batch in enumerate(train_data_loader): - global_step += 1 - input_ids, token_type_ids, seq_len, tags = batch - - loss = model(input_ids, token_type_ids, lengths=seq_len, tag_labels=tags)[0] - - loss = loss.mean() - total_loss += loss - loss.backward() - - optimizer.step() - optimizer.clear_grad() - lr_scheduler.step() - - if global_step % args.logging_steps == 0 and rank == 0: - end_time = time.time() - speed = float(args.logging_steps) / (end_time - start_time) - logger.info( - "global step %d, epoch: %d, loss: %.5f, speed: %.2f step/s" - % (global_step, epoch, total_loss / args.logging_steps, speed) - ) - start_time = time.time() - total_loss = 0 - - if (global_step % args.save_steps == 0 or global_step == num_training_steps) and rank == 0: - output_dir = os.path.join(args.output_dir, "model_%d" % (global_step)) - if not os.path.exists(output_dir): - os.makedirs(output_dir) - model_to_save = model._layers if isinstance(model, paddle.DataParallel) else model - model_to_save.save_pretrained(output_dir) - tokenizer.save_pretrained(output_dir) - - evaluate(model, metric, dev_data_loader, tags, tags_to_idx) - - -def print_arguments(args): - """print arguments""" - print("----------- Configuration Arguments -----------") - for arg, value in sorted(vars(args).items()): - print("%s: %s" % (arg, value)) - print("------------------------------------------------") - - -if __name__ == "__main__": - args = parse_args() - print_arguments(args) - do_train(args) diff --git a/examples/text_to_knowledge/ernie-ctm/utils.py b/examples/text_to_knowledge/ernie-ctm/utils.py deleted file mode 100644 index 293945bef203..000000000000 --- a/examples/text_to_knowledge/ernie-ctm/utils.py +++ /dev/null @@ -1,48 +0,0 @@ -# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
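-
-# Decoding helpers for WordTag predictions. decode() maps per-position tag ids
-# back to words with a BIOES-style scheme: "B-X"/"I-X"/"E-X" mark the
-# beginning/inside/end of a multi-character word of class X, "S-X" marks a
-# single-character word, and "O" is used for the special and padding positions.
-# As an illustration, "周星驰" tagged
-# ["B-人物类_实体", "I-人物类_实体", "E-人物类_实体"] is merged into a single
-# item {"item": "周星驰", "wordtag_label": "人物类_实体"}; reset_offset() then
-# fills in each item's character offset and length.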
-
-
-def reset_offset(pred_words):
-    for i in range(0, len(pred_words)):
-        if i > 0:
-            pred_words[i]["offset"] = pred_words[i - 1]["offset"] + len(pred_words[i - 1]["item"])
-        pred_words[i]["length"] = len(pred_words[i]["item"])
-    return pred_words
-
-
-def decode(texts, all_pred_tags, summary_num, idx_to_tags):
-    batch_results = []
-    for i, pred_tags in enumerate(all_pred_tags):
-        pred_words, pred_word = [], []
-
-        for j, tag in enumerate(pred_tags[summary_num:-1]):
-            if j >= len(texts[i]):
-                break
-            pred_label = idx_to_tags[tag]
-            if pred_label.find("-") != -1:
-                _, label = pred_label.split("-")
-            else:
-                label = pred_label
-            if pred_label.startswith("S") or pred_label.startswith("O"):
-                pred_words.append({"item": texts[i][j], "offset": 0, "wordtag_label": label})
-            else:
-                pred_word.append(texts[i][j])
-            if pred_label.startswith("E"):
-                pred_words.append({"item": "".join(pred_word), "offset": 0, "wordtag_label": label})
-                del pred_word[:]
-
-        pred_words = reset_offset(pred_words)
-        result = {"text": texts[i], "items": pred_words}
-        batch_results.append(result)
-    return batch_results
diff --git a/examples/text_to_knowledge/nptag/README.md b/examples/text_to_knowledge/nptag/README.md
deleted file mode 100644
index aece4d724dfc..000000000000
--- a/examples/text_to_knowledge/nptag/README.md
+++ /dev/null
@@ -1,160 +0,0 @@
-# 解语:NPTag(名词短语标注工具)
-
-NPTag(名词短语标注工具)是首个能够覆盖所有中文名词性词汇及短语的细粒度知识标注工具,旨在解决NLP中名词性短语收录不足所导致的OOV(out-of-vocabulary,超出收录词表)问题。其标注结果可直接用于构造知识特征,辅助各类NLP任务。
-
-## NPTag特点
-
-- 包含2000+细粒度类别,覆盖所有中文名词性短语的词类体系,更丰富的知识标注结果
-  - NPTag使用的词类体系覆盖所有中文名词性短语,并对所有类目做了更细粒度的识别(如注射剂、鱼类、博物馆等),共包含2000+细粒度类别,且可以直接关联百科知识树。
-- 可自由定制的分类框架
-  - NPTag开源版标注使用的词类体系是我们在实践中对**百科词条**分类应用较好的一个版本,用户可以自由定制自己的词类体系和训练样本,构建自己的NPTag,以获得更好的适配效果。例如,可按照自定义的类别构造训练样本,使用小学习率、短训练周期微调NPTag模型,即可获得自己定制的NPTag工具。
-
-## NPTag模型介绍
-
-NPTag基于[ERNIE-CTM](../ernie-ctm)+prompt训练而成,使用启发式搜索解码,保证分类结果都在标签体系之内。
-
-### finetune任务
-
-在微调任务中提供了一个中文名词短语标注的任务,旨在对中文名词短语进行细粒度分类。
-
-#### 代码结构说明
-
-```text
-nptag/
-├── deploy # 部署
-│   └── python
-│       └── predict.py # python预测部署示例
-├── data.py # 训练数据处理脚本
-├── export_model.py # 模型导出脚本
-├── metric.py # 模型效果验证指标脚本
-├── predict.py # 预测脚本
-├── README.md # 使用说明
-├── train.py # 训练脚本
-└── utils.py # 工具函数
-```
-
-#### 数据准备
-
-执行以下命令,下载并解压示例数据集:
-
-```bash
-wget https://bj.bcebos.com/paddlenlp/paddlenlp/datasets/nptag_dataset.tar.gz && tar -zxvf nptag_dataset.tar.gz
-```
-
-解压之后
-```text
-data/
-├── name_category_map.json # NPTag标签文件
-├── dev.txt # 验证集
-└── train.txt # 训练集
-```
-
-数据集`train.txt`和`dev.txt`格式示例如下,每行一个样本,文本与标签之间以制表符(\t)分隔:
-```
-石竹 植物
-杂链聚合物 化学物质
-罗伯特·布雷森 人
-```
-
-标签文件`name_category_map.json`格式示例如下。其中key为细粒度标签,即NPTag的预测结果;value为粗粒度标签,示例中对应WordTag的标签集合,用户可以根据场景需要自定义修改该标签映射:
-```
-{
-    "植物": "生物类_植物",
-    "化学物质": "物体类_化学物质",
-    "人": "人物类_实体"
-}
-```
-
-#### 模型训练
-```bash
-python -m paddle.distributed.launch --gpus "0" train.py \
-    --batch_size 64 \
-    --learning_rate 1e-6 \
-    --num_train_epochs 3 \
-    --logging_steps 10 \
-    --save_steps 100 \
-    --output_dir ./output \
-    --device "gpu"
-```
-
-可支持配置的参数:
-- `data_dir`: 数据集文件路径,默认数据集存放在当前目录data文件夹下。
-- `init_from_ckpt`: 模型参数路径,热启动模型训练,默认为None。
-- `output_dir`: 模型保存路径,默认保存在当前目录的output文件夹下。
-- `max_seq_len`: 模型使用的最大序列长度,默认为64。
-- `learning_rate`: finetune的最大学习率,默认为1e-6。
-- `num_train_epochs`: 表示训练轮数,默认为3。
-- `logging_steps`: 日志打印步数间隔,默认为10。
-- `save_steps`: 模型保存的步数间隔,默认为100。
-- `batch_size`: 批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数,默认为64。
-- `weight_decay`: 控制正则项力度的参数,用于防止过拟合,默认为0.0。
-- `warmup_proportion`: 学习率warmup策略的比例,如果0.1,则学习率会在前10%训练step的过程中从0慢慢增长到learning_rate, 
而后再缓慢衰减,默认为0.0。 -- `adam_epsilon`: Adam优化器的参数,避免分母为零,默认为1e-8。 -- `seed`: 随机种子,默认为1000。 -- `device`: 选用什么设备进行训练,可选cpu或gpu。如使用gpu训练则参数gpus指定GPU卡号。 - -### 基于动态图的预测 - -```shell -export CUDA_VISIBLE_DEVICES=0 -python -m paddle.distributed.launch --gpus "0" predict.py \ - --device=gpu \ - --params_path ./output/model_100/model_state.pdparams -``` - -### 基于静态图的预测部署 - -使用动态图训练结束之后,可以将动态图参数导出成静态图参数,从而获得最优的预测部署性能,执行如下命令完成动态图转换静态图的功能: -```shell -python export_model.py --params_path=./output/model_100/model_state.pdparams --output_path=./export -``` - -导出静态图模型之后,可以用于部署,`deploy/python/predict.py`脚本提供了python部署预测示例。运行方式: -```shell -python deploy/python/predict.py --model_dir=./export -``` - -## Taskflow一键预测 - -除了以上的finetune示例,Taskflow内置了一个百度基于大规模标注汉语短语数据集训练的名词短语标注工具`NPTag`。用户可以方便地使用该工具完成对中文名词短语的一键预测。 - -```python -from paddlenlp import Taskflow - -nptag = Taskflow("knowledge_mining", model="nptag") -nptag("糖醋排骨") -''' -[{'text': '糖醋排骨', 'label': '菜品'}] -''' - -nptag(["糖醋排骨", "红曲霉菌"]) -''' -[{'text': '糖醋排骨', 'label': '菜品'}, {'text': '红曲霉菌', 'label': '微生物'}] -''' - -# 输出粗粒度类别标签`category`,即WordTag的词汇标签。 -nptag = Taskflow("knowledge_mining", model="nptag", linking=True) -nptag(["糖醋排骨", "红曲霉菌"]) - -''' -[{'text': '糖醋排骨', 'label': '菜品', 'category': '饮食类_菜品'}, {'text': '红曲霉菌', 'label': '微生物', 'category': '生物类_微生物'}] -''' -``` - -## 在论文中引用NPTag - -如果您的工作成果中使用了NPTag,请增加下述引用。我们非常乐于看到解语对您的工作带来帮助。 - -``` -@article{zhao2020TermTree, - title={TermTree and Knowledge Annotation Framework for Chinese Language Understanding}, - author={Zhao, Min and Qin, Huapeng and Zhang, Guoxin and Lyu, Yajuan and Zhu, Yong}, - technical report={Baidu, Inc. TR:2020-KG-TermTree}, - year={2020} -} -``` - - -## 问题与反馈 - -解语在持续优化中,如果您有任何建议或问题,欢迎提交issue到Github。 diff --git a/examples/text_to_knowledge/nptag/data.py b/examples/text_to_knowledge/nptag/data.py deleted file mode 100644 index 4d482fe69f8e..000000000000 --- a/examples/text_to_knowledge/nptag/data.py +++ /dev/null @@ -1,81 +0,0 @@ -# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import numpy as np -import paddle - - -def convert_example(example, tokenzier, max_seq_len=512, max_cls_len=5, summary_num=2, is_test=False): - """ - Builds model inputs from a sequence for noun phrase classification task. - A prompt template is added to the end of the sequence. - - Prompt template: - - - ``[是] + [MASK] * max_cls_len`` - - Model input example: - - - ``[CLS0][CLS1] X [是][MASK]...[MASK][SEP]`` - - where X is the input text. - - Args: - example(obj:`list[str]`): List of input data, containing text and label if it have label. - tokenizer(obj:`PretrainedTokenizer`): This tokenizer inherits from :class:`~paddlenlp.transformers.PretrainedTokenizer` - which contains most of the methods. Users should refer to the superclass for more information regarding methods. - max_seq_len(obj:`int`): The maximum total input sequence length after tokenization. - Sequences longer than this will be truncated, sequences shorter will be padded. 
- max_cls_len(obj:`int`): The maximum length of labels. - summary_num(obj:`int`): The number of summary tokens, e.g. `[CLS0]` and `[CLS1]`. - is_test(obj:`bool`): If True, it will not return the label. - - """ - - if len(example["text"]) + max_cls_len + 1 + summary_num + 1 > max_seq_len: - example["text"] = example["text"][: (max_seq_len - (max_cls_len + 1 + summary_num + 1))] - - tokens = list(example["text"]) + ["是"] + ["[MASK]"] * max_cls_len - inputs = tokenzier(tokens, return_length=True, is_split_into_words="token", max_length=max_seq_len) - - label_indices = list(range(inputs["seq_len"] - 1 - max_cls_len, inputs["seq_len"] - 1)) - - if is_test: - return inputs["input_ids"], inputs["token_type_ids"], label_indices - - label_tokens = list(example["label"]) + ["[PAD]"] * (max_cls_len - len(example["label"])) - labels = np.full([inputs["seq_len"]], fill_value=-100, dtype=np.int64) - labels[label_indices] = tokenzier.convert_tokens_to_ids(label_tokens) - return inputs["input_ids"], inputs["token_type_ids"], labels - - -def create_dataloader(dataset, mode="train", batch_size=1, batchify_fn=None, trans_fn=None): - if trans_fn: - dataset = dataset.map(trans_fn) - - shuffle = True if mode == "train" else False - if mode == "train": - batch_sampler = paddle.io.DistributedBatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) - else: - batch_sampler = paddle.io.BatchSampler(dataset, batch_size=batch_size, shuffle=shuffle) - - return paddle.io.DataLoader(dataset=dataset, batch_sampler=batch_sampler, collate_fn=batchify_fn, return_list=True) - - -def read_custom_data(filename): - """Reads data""" - with open(filename, "r", encoding="utf-8") as f: - for line in f: - text, label = line.strip().split("\t") - yield {"text": text, "label": label} diff --git a/examples/text_to_knowledge/nptag/deploy/python/predict.py b/examples/text_to_knowledge/nptag/deploy/python/predict.py deleted file mode 100644 index 5bc505639132..000000000000 --- a/examples/text_to_knowledge/nptag/deploy/python/predict.py +++ /dev/null @@ -1,150 +0,0 @@ -# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import argparse -import os -import sys - -import paddle - -from paddlenlp.data import Pad, Stack, Tuple -from paddlenlp.transformers import ErnieCtmTokenizer - -sys.path.append("./") - -from data import convert_example # noqa: E402 -from utils import construct_dict_map, decode, find_topk, search # noqa: E402 - -# fmt: off -parser = argparse.ArgumentParser() -parser.add_argument("--model_dir", type=str, required=True, default="./export/", help="The directory to static model.") -parser.add_argument("--data_dir", type=str, default="./data", help="The input data dir, should contain name_category_map.json.") -parser.add_argument("--max_seq_len", type=int, default=64, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") -parser.add_argument("--batch_size", type=int, default=3, help="Batch size per GPU/CPU for training.") -parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu'], default="gpu", help="Select which device to train model, defaults to gpu.") -args = parser.parse_args() -# fmt: on - - -class Predictor(object): - def __init__(self, model_dir, device): - model_file = model_dir + "/inference.pdmodel" - params_file = model_dir + "/inference.pdiparams" - - if not os.path.exists(model_file): - raise ValueError("not find model file path {}".format(model_file)) - if not os.path.exists(params_file): - raise ValueError("not find params file path {}".format(params_file)) - config = paddle.inference.Config(model_file, params_file) - # Disable IR optimization for NPTag - config.switch_ir_optim(False) - - if device == "gpu": - # set GPU configs accordingly - config.enable_use_gpu(100, 0) - elif device == "cpu": - # set CPU configs accordingly, - # such as enable_mkldnn, set_cpu_math_library_num_threads - config.disable_gpu() - elif device == "xpu": - # set XPU configs accordingly - config.enable_xpu(100) - config.switch_use_feed_fetch_ops(False) - self.predictor = paddle.inference.create_predictor(config) - - self.input_handles = [self.predictor.get_input_handle(name) for name in self.predictor.get_input_names()] - - self.output_handle = self.predictor.get_output_handle(self.predictor.get_output_names()[0]) - - def predict(self, data, tokenizer): - examples = [] - for text in data: - example = {"text": text} - input_ids, token_type_ids, label_indices = convert_example( - example, tokenizer, max_seq_len=args.max_seq_len, is_test=True - ) - examples.append((input_ids, token_type_ids, label_indices)) - - batches = [examples[idx : idx + args.batch_size] for idx in range(0, len(examples), args.batch_size)] - - batchify_fn = lambda samples, fn=Tuple( - Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input_ids - Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # token_type_ids - Stack(dtype="int64"), # label_indices - ): fn(samples) - - name_dict, bk_tree, id_vocabs, vocab_ids = construct_dict_map( - tokenizer, os.path.join(args.data_dir, "name_category_map.json") - ) - - all_scores_can = [] - all_preds_can = [] - pred_ids = [] - - for batch in batches: - input_ids, token_type_ids, label_indices = batchify_fn(batch) - self.input_handles[0].copy_from_cpu(input_ids) - self.input_handles[1].copy_from_cpu(token_type_ids) - self.predictor.run() - logits = self.output_handle.copy_to_cpu() - - for i, l in zip(label_indices, logits): - score = l[i[0] : i[-1] + 1, vocab_ids] - # Find topk candidates of scores and predicted indices. 
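-            # `score` has shape (max_cls_len, len(vocab_ids)): one row per
-            # [MASK] position, with columns restricted to characters from the
-            # label vocabulary. Keeping the top-4 characters per position lets
-            # the code below first try the greedy top-1 label string and, when
-            # it is not in name_category_map.json, re-rank all 4^5 candidate
-            # paths with search() before falling back to the BK-tree
-            # edit-distance lookup.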
- score_can, pred_id_can = find_topk(score, k=4, axis=-1) - - all_scores_can.extend([score_can.tolist()]) - all_preds_can.extend([pred_id_can.tolist()]) - pred_ids.extend([pred_id_can[:, 0].tolist()]) - - results = [] - for i, d in enumerate(data): - label = decode(pred_ids[i], id_vocabs) - result = { - "text": d, - "label": label, - } - if label not in name_dict: - scores_can = all_scores_can[i] - pred_ids_can = all_preds_can[i] - labels_can = search(scores_can, pred_ids_can, 0, [], 0) - labels_can.sort(key=lambda d: -d[1]) - for labels in labels_can: - cls_label_can = decode(labels[0], id_vocabs) - if cls_label_can in name_dict: - result["label"] = cls_label_can - break - else: - labels_can = bk_tree.search_similar_word(label) - result["label"] = labels_can[0][0] - - result["category"] = name_dict[result["label"]] - results.append(result) - return results - - -if __name__ == "__main__": - # Define predictor to do prediction. - predictor = Predictor(args.model_dir, args.device) - - tokenizer = ErnieCtmTokenizer.from_pretrained("nptag") - - data = [ - "刘德华", - "快乐薯片", - "自适应共振理论映射", - ] - - results = predictor.predict(data, tokenizer) - print(results) diff --git a/examples/text_to_knowledge/nptag/export_model.py b/examples/text_to_knowledge/nptag/export_model.py deleted file mode 100644 index 7956b0032260..000000000000 --- a/examples/text_to_knowledge/nptag/export_model.py +++ /dev/null @@ -1,47 +0,0 @@ -# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import argparse -import os - -import paddle -from paddlenlp.transformers import ErnieCtmNptagModel - -# yapf: disable -parser = argparse.ArgumentParser() -parser.add_argument("--params_path", type=str, required=True, default='./output/model_100/model_state.pdparams', help="The path to model parameters to be loaded.") -parser.add_argument("--output_path", type=str, default='./export', help="The path of model parameter in static graph to be saved.") -args = parser.parse_args() -# yapf: enable - -if __name__ == "__main__": - model = ErnieCtmNptagModel.from_pretrained("nptag") - - if args.params_path and os.path.isfile(args.params_path): - state_dict = paddle.load(args.params_path) - model.set_dict(state_dict) - print("Loaded parameters from %s" % args.params_path) - model.eval() - - # Convert to static graph with specific input description - model = paddle.jit.to_static( - model, - input_spec=[ - paddle.static.InputSpec(shape=[None, None], dtype="int64"), # input_ids - paddle.static.InputSpec(shape=[None, None], dtype="int64"), # token_type_ids - ], - ) - # Save in static graph model. - save_path = os.path.join(args.output_path, "inference") - paddle.jit.save(model, save_path) diff --git a/examples/text_to_knowledge/nptag/metric.py b/examples/text_to_knowledge/nptag/metric.py deleted file mode 100644 index 8e9ceaf02aee..000000000000 --- a/examples/text_to_knowledge/nptag/metric.py +++ /dev/null @@ -1,55 +0,0 @@ -# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. 
-# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import paddle - - -class NPTagAccuracy(paddle.metric.Metric): - """ - Accuracy for NPTag Prompt Model. - """ - - def __init__(self): - super(NPTagAccuracy, self).__init__() - self.reset() - - def reset(self): - self.corrects = 0 - self.total = 0 - - def compute(self, preds, labels): - correct = [] - for pred, label in zip(preds, labels): - real_pred, real_label = ([] for _ in range(2)) - for i in range(len(label)): - if label[i] == -100 or label[i] == 0: - continue - real_pred.append(pred[i]) - real_label.append(label[i]) - - if all(real_pred[i] == real_label[i] for i in range(len(real_label))): - correct.append(1) - else: - correct.append(0) - return correct - - def update(self, correct): - self.corrects += sum(correct) - self.total += len(correct) - - def accumulate(self): - return float(self.corrects) / self.total - - def name(self): - return "NPTag Prompt Model Accuracy" diff --git a/examples/text_to_knowledge/nptag/predict.py b/examples/text_to_knowledge/nptag/predict.py deleted file mode 100644 index ea791b59a4bf..000000000000 --- a/examples/text_to_knowledge/nptag/predict.py +++ /dev/null @@ -1,125 +0,0 @@ -# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import argparse -import os - -import paddle -from data import convert_example -from utils import construct_dict_map, decode, find_topk, search - -from paddlenlp.data import Pad, Stack, Tuple -from paddlenlp.transformers import ErnieCtmNptagModel, ErnieCtmTokenizer - -# yapf: disable -parser = argparse.ArgumentParser() -parser.add_argument("--params_path", type=str, default="./output/model_100/model_state.pdparams", required=True, help="The path to model parameters to be loaded.") -parser.add_argument("--data_dir", type=str, default="./data", help="The input data dir, should contain name_category_map.json.") -parser.add_argument("--max_seq_len", type=int, default=64, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.") -parser.add_argument("--batch_size", type=int, default=32, help="Batch size per GPU/CPU for training.") -parser.add_argument('--device', type=str, choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") -args = parser.parse_args() -# yapf: enable - - -def do_predict(data, model, tokenizer, batch_size=1, max_cls_len=5, summary_num=2): - examples = [] - for text in data: - example = {"text": text} - input_ids, token_type_ids, label_indices = convert_example( - example, tokenizer, max_seq_len=args.max_seq_len, is_test=True - ) - examples.append((input_ids, token_type_ids, label_indices)) - - batches = [examples[idx : idx + batch_size] for idx in range(0, len(examples), batch_size)] - - batchify_fn = lambda samples, fn=Tuple( # noqa: E731 - Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input_ids - Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # token_type_ids - Stack(dtype="int64"), # label_indices - ): fn(samples) - - name_dict, bk_tree, id_vocabs, vocab_ids = construct_dict_map( - tokenizer, os.path.join(args.data_dir, "name_category_map.json") - ) - - all_scores_can = [] - all_preds_can = [] - pred_ids = [] - - model.eval() - for batch in batches: - input_ids, token_type_ids, label_indices = batchify_fn(batch) - - input_ids = paddle.to_tensor(input_ids) - token_type_ids = paddle.to_tensor(token_type_ids) - logits = model(input_ids, token_type_ids)[0].numpy() - for i, l in zip(label_indices, logits): - score = l[i[0] : i[-1] + 1, vocab_ids] - # Find topk candidates of scores and predicted indices. - score_can, pred_id_can = find_topk(score, k=4, axis=-1) - - all_scores_can.extend([score_can.tolist()]) - all_preds_can.extend([pred_id_can.tolist()]) - pred_ids.extend([pred_id_can[:, 0].tolist()]) - - results = [] - for i, d in enumerate(data): - label = decode(pred_ids[i], id_vocabs) - - result = { - "text": d, - "label": label, - } - - if label not in name_dict: - scores_can = all_scores_can[i] - pred_ids_can = all_preds_can[i] - labels_can = search(scores_can, pred_ids_can, 0, [], 0) - labels_can.sort(key=lambda d: -d[1]) - for labels in labels_can: - cls_label_can = decode(labels[0], id_vocabs) - if cls_label_can in name_dict: - result["label"] = cls_label_can - break - else: - labels_can = bk_tree.search_similar_word(label) - if len(labels_can) != 0: - result["label"] = labels_can[0][0] - - if result["label"] in name_dict: - result["category"] = name_dict[result["label"]] - results.append(result) - return results - - -if __name__ == "__main__": - paddle.set_device(args.device) - - data = [ - "刘德华", - "快乐薯片", - "自适应共振理论映射", - ] - - model = ErnieCtmNptagModel.from_pretrained("nptag") - tokenizer = ErnieCtmTokenizer.from_pretrained("nptag") - - if args.params_path and os.path.isfile(args.params_path): - state_dict = paddle.load(args.params_path) - model.set_dict(state_dict) - print("Loaded parameters from %s" % args.params_path) - - results = do_predict(data, model, tokenizer, batch_size=args.batch_size) - print(results) diff --git a/examples/text_to_knowledge/nptag/train.py b/examples/text_to_knowledge/nptag/train.py deleted file mode 100644 index d76c809d6010..000000000000 --- a/examples/text_to_knowledge/nptag/train.py +++ /dev/null @@ -1,191 +0,0 @@ -# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. 
-# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import argparse -import os -import random -import time -from functools import partial - -import numpy as np -import paddle -import paddle.nn.functional as F -from data import convert_example, create_dataloader, read_custom_data -from metric import NPTagAccuracy - -from paddlenlp.data import Pad, Tuple -from paddlenlp.datasets import load_dataset -from paddlenlp.transformers import ( - ErnieCtmNptagModel, - ErnieCtmTokenizer, - LinearDecayWithWarmup, -) -from paddlenlp.utils.log import logger - - -def parse_args(): - parser = argparse.ArgumentParser() - - # yapf: disable - parser.add_argument("--data_dir", type=str, default="./data", help="The input data dir, should contain train.json and dev.json.") - parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.") - parser.add_argument("--output_dir", type=str, default="./output", help="The output directory where the model predictions and checkpoints will be written.",) - parser.add_argument("--max_seq_len", type=int, default=64, help="The maximum total input sequence length after tokenization. Sequences longer than this will be truncated, sequences shorter will be padded.", ) - parser.add_argument("--learning_rate", type=float, default=1e-6, help="The initial learning rate for Adam.") - parser.add_argument("--num_train_epochs", type=int, default=3, help="Total number of training epochs to perform.", ) - parser.add_argument("--logging_steps", type=int, default=10, help="Log every X updates steps.") - parser.add_argument("--save_steps", type=int, default=100, help="Save checkpoint every X updates steps.") - parser.add_argument("--batch_size", type=int, default=64, help="Batch size per GPU/CPU for training.", ) - parser.add_argument("--weight_decay", type=float, default=0.0, help="Weight decay if we apply some.") - parser.add_argument("--warmup_proportion", type=float, default=0.0, help="Linear warmup proportion over total steps.") - parser.add_argument("--adam_epsilon", type=float, default=1e-8, help="Epsilon for Adam optimizer.") - parser.add_argument("--seed", type=int, default=1000, help="random seed for initialization") - parser.add_argument("--device", type=str, default="gpu", choices=["cpu", "gpu"], help="The device to select to train the model, is must be cpu/gpu/xpu.") - # yapf: enable - - args = parser.parse_args() - return args - - -def set_seed(seed): - """sets random seed""" - random.seed(seed) - np.random.seed(seed) - paddle.seed(seed) - - -@paddle.no_grad() -def evaluate(model, metric, criterion, data_loader, vocab_size): - model.eval() - metric.reset() - losses = [] - for batch in data_loader(): - input_ids, token_type_ids, labels = batch - outputs = model(input_ids, token_type_ids) - logits = outputs[0] - loss = criterion(logits.reshape([-1, vocab_size]), labels.reshape([-1])) - losses.append(loss.numpy()) - probs = F.softmax(logits, axis=-1) - preds = paddle.argmax(probs, axis=-1).numpy() - correct = metric.compute(preds, labels) - 
metric.update(correct) - acc = metric.accumulate() - logger.info("eval loss: %.5f, acc: %.5f" % (np.mean(losses), acc)) - model.train() - metric.reset() - - -def do_train(args): - paddle.set_device(args.device) - rank = paddle.distributed.get_rank() - if paddle.distributed.get_world_size() > 1: - paddle.distributed.init_parallel_env() - - set_seed(args.seed) - - train_ds = load_dataset( - read_custom_data, filename=os.path.join(args.data_dir, "train.txt"), is_test=False, lazy=False - ) - dev_ds = load_dataset(read_custom_data, filename=os.path.join(args.data_dir, "dev.txt"), is_test=False, lazy=False) - - tokenizer = ErnieCtmTokenizer.from_pretrained("nptag") - model = ErnieCtmNptagModel.from_pretrained("nptag") - vocab_size = model.ernie_ctm.config["vocab_size"] - - trans_func = partial(convert_example, tokenzier=tokenizer, max_seq_len=args.max_seq_len) - - batchify_fn = lambda samples, fn=Tuple( # noqa: E731 - Pad(axis=0, pad_val=tokenizer.pad_token_id, dtype="int64"), # input_ids - Pad(axis=0, pad_val=tokenizer.pad_token_type_id, dtype="int64"), # token_type_ids - Pad(axis=0, pad_val=-100, dtype="int64"), # labels - ): fn(samples) - - train_data_loader = create_dataloader( - train_ds, mode="train", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func - ) - - dev_data_loader = create_dataloader( - dev_ds, mode="dev", batch_size=args.batch_size, batchify_fn=batchify_fn, trans_fn=trans_func - ) - - if args.init_from_ckpt and os.path.isfile(args.init_from_ckpt): - state_dict = paddle.load(args.init_from_ckpt) - model.set_dict(state_dict) - model = paddle.DataParallel(model) - num_training_steps = len(train_data_loader) * args.num_train_epochs - - lr_scheduler = LinearDecayWithWarmup(args.learning_rate, num_training_steps, args.warmup_proportion) - - decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])] - optimizer = paddle.optimizer.AdamW( - learning_rate=lr_scheduler, - epsilon=args.adam_epsilon, - parameters=model.parameters(), - weight_decay=args.weight_decay, - apply_decay_param_fun=lambda x: x in decay_params, - ) - - logger.info("Total steps: %s" % num_training_steps) - - metric = NPTagAccuracy() - criterion = paddle.nn.CrossEntropyLoss() - - global_step = 0 - for epoch in range(1, args.num_train_epochs + 1): - logger.info(f"Epoch {epoch} beginnig") - start_time = time.time() - - for step, batch in enumerate(train_data_loader): - global_step += 1 - input_ids, token_type_ids, labels = batch - outputs = model(input_ids, token_type_ids) - logits = outputs[0] - loss = criterion(logits.reshape([-1, vocab_size]), labels.reshape([-1])) - - loss.backward() - optimizer.step() - optimizer.clear_grad() - lr_scheduler.step() - - if global_step % args.logging_steps == 0 and rank == 0: - end_time = time.time() - speed = float(args.logging_steps) / (end_time - start_time) - logger.info( - "global step %d, epoch: %d, loss: %.5f, speed: %.2f step/s" - % (global_step, epoch, loss.item(), speed) - ) - start_time = time.time() - - if (global_step % args.save_steps == 0 or global_step == num_training_steps) and rank == 0: - output_dir = os.path.join(args.output_dir, "model_%d" % (global_step)) - if not os.path.exists(output_dir): - os.makedirs(output_dir) - model._layers.save_pretrained(output_dir) - tokenizer.save_pretrained(output_dir) - - evaluate(model, metric, criterion, dev_data_loader, vocab_size) - - -def print_arguments(args): - """print arguments""" - print("----------- Configuration Arguments -----------") - for arg, 
value in sorted(vars(args).items()): - print("%s: %s" % (arg, value)) - print("------------------------------------------------") - - -if __name__ == "__main__": - args = parse_args() - print_arguments(args) - do_train(args) diff --git a/examples/text_to_knowledge/nptag/utils.py b/examples/text_to_knowledge/nptag/utils.py deleted file mode 100644 index ad2e233e3223..000000000000 --- a/examples/text_to_knowledge/nptag/utils.py +++ /dev/null @@ -1,195 +0,0 @@ -# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import json -from collections import OrderedDict -from typing import List - -import numpy as np - - -def construct_dict_map(tokenizer, name_dict_path): - """Construct dict map""" - with open(name_dict_path, encoding="utf-8") as fp: - name_dict = json.load(fp) - cls_vocabs = OrderedDict() - bk_tree = BurkhardKellerTree() - for k in name_dict: - bk_tree.add(k) - for c in k: - if c not in cls_vocabs: - cls_vocabs[c] = len(cls_vocabs) - cls_vocabs["[PAD]"] = len(cls_vocabs) - id_vocabs = dict(zip(cls_vocabs.values(), cls_vocabs.keys())) - vocab_ids = tokenizer.vocab.to_indices(list(cls_vocabs.keys())) - return name_dict, bk_tree, id_vocabs, vocab_ids - - -def decode(pred_ids, id_vocabs): - tokens = [id_vocabs[i] for i in pred_ids] - valid_token = [] - for token in tokens: - if token == "[PAD]": - break - valid_token.append(token) - return "".join(valid_token) - - -def search(scores_can, pred_ids_can, depth, path, score): - if depth >= 5: - return [(path, score)] - res = [] - for i in range(len(pred_ids_can[0])): - tmp_res = search( - scores_can, pred_ids_can, depth + 1, path + [pred_ids_can[depth][i]], score + scores_can[depth][i] - ) - res.extend(tmp_res) - return res - - -def find_topk(a, k, axis=-1, largest=True, sorted=True): - if axis is None: - axis_size = a.size - else: - axis_size = a.shape[axis] - assert 1 <= k <= axis_size - - a = np.asanyarray(a) - if largest: - index_array = np.argpartition(a, axis_size - k, axis=axis) - topk_indices = np.take(index_array, -np.arange(k) - 1, axis=axis) - else: - index_array = np.argpartition(a, k - 1, axis=axis) - topk_indices = np.take(index_array, np.arange(k), axis=axis) - topk_values = np.take_along_axis(a, topk_indices, axis=axis) - if sorted: - sorted_indices_in_topk = np.argsort(topk_values, axis=axis) - if largest: - sorted_indices_in_topk = np.flip(sorted_indices_in_topk, axis=axis) - sorted_topk_values = np.take_along_axis(topk_values, sorted_indices_in_topk, axis=axis) - sorted_topk_indices = np.take_along_axis(topk_indices, sorted_indices_in_topk, axis=axis) - return sorted_topk_values, sorted_topk_indices - return topk_values, topk_indices - - -def levenstein_distance(s1: str, s2: str) -> int: - """Calculate minimal Levenstein distance between s1 and s2. - - Args: - s1 (str): string - s2 (str): string - - Returns: - int: the minimal distance. 
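-
-    Example (illustrative):
-        levenstein_distance("kitten", "sitting") returns 3
-        (substitute "k"->"s", substitute "e"->"i", then insert "g").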
- """ - m, n = len(s1) + 1, len(s2) + 1 - - # Initialize - dp = [[0] * n for i in range(m)] - dp[0][0] = 0 - for i in range(1, m): - dp[i][0] = dp[i - 1][0] + 1 - for j in range(1, n): - dp[0][j] = dp[0][j - 1] + 1 - - for i in range(1, m): - for j in range(1, n): - if s1[i - 1] != s2[j - 1]: - dp[i][j] = min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1]) + 1 - else: - dp[i][j] = dp[i - 1][j - 1] - return dp[m - 1][n - 1] - - -class BurkhardKellerNode(object): - """Node implementatation for BK-Tree. A BK-Tree node stores the information of current word, and its approximate words calculated by levenstein distance. - - Args: - word (str): word of current node. - """ - - def __init__(self, word: str): - self.word = word - self.next = {} - - -class BurkhardKellerTree(object): - """Implementataion of BK-Tree""" - - def __init__(self): - self.root = None - self.nodes = {} - - def __add(self, cur_node: BurkhardKellerNode, word: str): - """Insert a word into current tree. If tree is empty, set this word to root. - - Args: - word (str): word to be inserted. - """ - if self.root is None: - self.root = BurkhardKellerNode(word) - return - if word in self.nodes: - return - dist = levenstein_distance(word, cur_node.word) - if dist not in cur_node.next: - self.nodes[word] = cur_node.next[dist] = BurkhardKellerNode(word) - else: - self.__add(cur_node.next[dist], word) - - def add(self, word: str): - """Insert a word into current tree. If tree is empty, set this word to root. - - Args: - word (str): word to be inserted. - """ - return self.__add(self.root, word) - - def __search_similar_word(self, cur_node: BurkhardKellerNode, s: str, threshold: int = 2) -> List[str]: - res = [] - if cur_node is None: - return res - dist = levenstein_distance(cur_node.word, s) - if dist <= threshold: - res.append((cur_node.word, dist)) - start = max(dist - threshold, 1) - while start < dist + threshold: - tmp_res = self.__search_similar_word(cur_node.next.get(start, None), s)[:] - res.extend(tmp_res) - start += 1 - return res - - def search_similar_word(self, word: str) -> List[str]: - """Search the most similar (minimal levenstain distance) word between `s`. - - Args: - s (str): target word - - Returns: - List[str]: similar words. - """ - res = self.__search_similar_word(self.root, word) - - def max_prefix(s1: str, s2: str) -> int: - res = 0 - length = min(len(s1), len(s2)) - for i in range(length): - if s1[i] == s2[i]: - res += 1 - else: - break - return res - - res.sort(key=lambda d: (d[1], -max_prefix(d[0], word))) - return res diff --git a/examples/text_to_knowledge/termtree/README.md b/examples/text_to_knowledge/termtree/README.md deleted file mode 100644 index 7fd126b43fed..000000000000 --- a/examples/text_to_knowledge/termtree/README.md +++ /dev/null @@ -1,271 +0,0 @@ -# 解语:TermTree(百科知识树) -TermTree(百科知识树)是一个描述所有中文词汇(包括概念、实体/专名、领域术语、语法词等,统一称之为Term)的树状知识库,完整的TermTree由两部分构成: - -> I. TermType词类体系:覆盖所有中文词汇词类的树状知识体系,是对中文词汇集合的一种全划分层次表示; -> -> II. Term关系和属性值:描述具体Term之间关系和Term属性值网状图谱,用于整合各应用知识图谱; - -本次发布的TermTreeV1.0试用版是TermTree的一个常用子集,包括两部分内容: - -> A. 简化版的TermType词类体系,由160+ termtype(三层结构)和 7000+ subtype构成。 -> -> B. 
约100w的term集(挂接在TermType词类体系下),包括大多数常用概念(src=cb,基础概念库,termtype准确率为98%)和一部分高频百科实体(src=eb,基础实体库,termtype准确率为95%)。 -> -> 开源版不包括Term关系和属性值,但给出了实体的百科词条链接,应用方可以利用百科链接整合其他知识图谱使用。 - -我们提供了TermTreeV1.0试用版的下载链接供大家使用,[下载链接](https://kg-concept.bj.bcebos.com/TermTree/TermTree.V1.0.tar.gz) 。 - -**注:** 与其他常见应用知识图谱不同,TermTree的核心是概念词,而非专名实体词。因为,在中文文本中,概念词的含义是相对稳定的,而专名实体词随应用变化(例如,不同电商有不同的商品实体集,不同的小说站有不同的小说实体集),因此,TermTree通过 “提供常用概念集 + 可插拔的应用实体集/应用知识图谱” 来达到支持不同的应用适配。 - -## 自定义TermTree - -`termtree.py`文件中的TermTree类支持TermTree的加载、增加、保存操作,因为涉及到数据结构整体性和一致性,暂不支持删除和修改操作。下面提供了离线维护自定义TermTree的代码示例 - -### 文件准备 - -首先下载已有的TermTreeV1.0 -```shell -wget https://kg-concept.bj.bcebos.com/TermTree/TermTree.V1.0.tar.gz && tar -zxvf TermTree.V1.0.tar.gz -``` - -### TermTree维护与修改 - -加载TermTreeV1.0,增加新的term -```python -from termtree import TermTree - -# 加载百科知识树 -termtree = TermTree.from_dir("termtree_type.csv", "TermTree.V1.0") - -# 增加新term: 平原上的火焰 -termtree.add_term(term="平原上的火焰", - base="eb", - term_type="影视作品") - -# 保存修改, 执行后将在当前路径生成文件`termtree_data`,即新的自定义TermTree -termtree.save("./") -``` - -#### API说明 - -- ```python - def add_term() - ``` - -- **参数** - - term (str): 待增加的term名称。 - - base (str): term属于概念词(cb)还是实体词(eb)。 - - term_type (str): term的主类别。 - - sub_type (Optional[List[str]], optional): term的辅助类别或细分类别,非必选。 - - sub_terms (Optional[List[str]], optional): 用于描述同类同名的term集,非必选。 - - alias (Optional[List[str]], optional): term的常用别名,非必选。 - - alias_ext (Optional[List[str]], optional): term的常用扩展别名,非必选。 - - data (Optional[Dict[str, Any]], optional): 以dict形式构造该term节点,非必选。 - - -### 自定义Term-Linking - -Taskflow支持使用自定义TermTree实现自定义Term-Linking,该示例中"平原上的火焰"的Term-Linking如下: -作品类_实体(wordtag_label) -> 影视作品_eb_平原上的火焰(term_id) - -通过`task_path`定义用户自定义路径,文件组成: -```text -custom_task_path/ -├── termtree_type.csv -└── termtree_data -``` - -使用Taskflow加载自定义TermTree来进行预测: - -```python -from paddlenlp import Taskflow - -wordtag = Taskflow("knowledge_mining", task_path="./custom_task_path/") - -wordtag("《平原上的火焰》是今年新上映的电影") -# [{'text': '《平原上的火焰》是今年新上映的电影', 'items': [{'item': '《', 'offset': 0, 'wordtag_label': 'w', 'length': 1}, {'item': '平原上的火焰', 'offset': 1, 'wordtag_label': '作品类_实体', 'length': 6, 'termid': '影视作品_eb_平原上的火焰'}, {'item': '》', 'offset': 7, 'wordtag_label': 'w', 'length': 1}, {'item': '是', 'offset': 8, 'wordtag_label': '肯定词', 'length': 1, 'termid': '肯定否定词_cb_是'}, {'item': '今年', 'offset': 9, 'wordtag_label': '时间类', 'length': 2, 'termid': '时间阶段_cb_今年'}, {'item': '新', 'offset': 11, 'wordtag_label': '修饰词', 'length': 1, 'termid': '修饰词_cb_新'}, {'item': '上映', 'offset': 12, 'wordtag_label': '场景事件', 'length': 2, 'termid': '场景事件_cb_上映'}, {'item': '的', 'offset': 14, 'wordtag_label': '助词', 'length': 1, 'termid': '助词_cb_的'}, {'item': '电影', 'offset': 15, 'wordtag_label': '作品类_概念', 'length': 2, 'termid': '影视作品_cb_电影'}]}] -``` - -## 常见问题 - -**常见问题1:为什么TermTree采用树状结构(Tree),而不是网状结构(Net/Graph)?** - -- 树结构是对知识空间的全划分,网状结构是对相关关系的描述和提炼。树结构更方便做到对词类体系的全面描述。 - -- 树结构适合概念层次的泛化推理,网状结构适合相关性的泛化推理。树结构的知识对统计相关知识有很好的互补作用,在应用中能够更好地弥补统计模型的不足。 -- 两者可以结合表示和使用:Term集合整体以树结构组织(TermType词类体系),Term间的关系用网状结构描述(Term关系和属性值)。可以将TermTree视为中文词汇的层次描述框架,应用知识图谱可以基于TermType词类体系方便地整合到TermTree。 - -**常见问题2:为什么TermTree叫做百科知识树?是否只能用于描述百科知识?** - -- 一方面,Term可以泛指任意概念、实体/专名、领域术语、语法词等,用“百科”是为了表达Term的多样性,而不是限定Term的来源,Term可以来自任意中文文本; -- 另一方面,各类别的词汇都可以在百科词条中找到样例,用“百科”也是为了表示对所有中文词汇词类的描述能力。 - -**常见问题3:中文词汇词类描述体系有很多,为什么采用这个体系?** - -- TermTree的词类体系是在大规模工业应用实践(如百科文本解析挖掘、query理解)中打磨出来的中文词类体系,在理论上可能不是一个完备体系,但很适合通用领域中文解析挖掘任务。 - - -## TermTree字段说明 - -| 字段 | 说明 | 备注 | -| ------------ | 
------------------------------------------------------------ | ------------------------------------------------------------ | -| id | 【必有】唯一标识符 | 可基于termid生成 | -| term | 【必有】term的名字 | | -| termid | 【必有】term的id(唯一),构造方式为termtype_src_term | 采用显式构造id的方式,便于应用数据扩展和整合 | -| src | 【必有】term的来源库,当前包括两个基础库cb和eb。其中cb为基础概念库(concept base,收录常用词汇用语,可作为各类应用的基础集),eb为基础实体库(entity base, 收录常见命名实体,可根据应用需求扩展) | cb、eb的划分标准不同应用不一样,可根据需求调整;应用方也可以构造自己的应用库,与cb、eb整合使用。 | -| termtype | 【必有】term的主类别,详细描述参见 [termtree\_type](./termtree_type.csv) | 多上位的term会选择其中一个作为termtype,其他上位作为subtype,方便应用筛选 | -| subtype | 【非必须】term的辅助类别或细分类别 | 如果应用特别关注某个subtype,也可以将其升级为termtype使用(需要相应更新termid和id) | -| subterms | 【非必须】用于描述同类同名的term集,若“termtype+src”下term只对应一个实例,则subterms为空;若“termtype+src”下term对应多个实例,则subterms记录这些实例,其字段与term相同 | 不需要区分subterm的两种常见场景:1. 应用只需词类特征;2. 上下文信息不足,无法区分具体实例 | -| subterms_num | 【非必须】subterms中的subterm数量 | 如果没有subterm,则值为0 | -| alias | 【非必须】term的常用别名 | 通常为歧义小的别名 | -| alias\_ext | 【非必须】term的常用扩展别名,经常是term或alias的一个子片段,单独出现有其他含义,结合上下文可识别为别名。 | 通常为歧义大的别名,便于应用筛选使用。e.g., 四维彩超的alias_ext“四维” | -| links | 【非必须】该term对应的其他term的id,可以是本知识库中的id,也可以是其他知识库如百度百科id | 如果是本知识库中的id,则表示两者可以指代同一实体 | - -## 数据示例 -```json -// 示例1:无subterms的term -{ - "id": "c472a6fe74eb2008c4e5b958a047eb5c", - "termid": "植物_cb_苹果", - "term": "苹果", - "src": "cb", - "termtype": "植物", - "subtype": [], - "subterms": [], - "subterms_num": 0, - "alias": [ - "苹果树" - ], - "alias_ext": [], - "links": [ - { - "bdbkUrl": [ - "http://baike.baidu.com/item/%E8%8B%B9%E6%9E%9C/14822460" - ] - } - ] -} - -// 示例2:有subterms的term -{ - "id": "824716062a4d74efc0897d676700a24e", - "termid": "影视作品_eb_苹果", - "term": "苹果", - "src": "eb", - "termtype": "影视作品", - "subtype": [], - "subterms": [ - { - "id": "9bb5b38dc50233b1ccd28d1c33c37605", - "subtype": [ - "影视作品_cb_电影", - "影视动漫作品_cb_剧情片" - ], - "alias": [], - "alias_ext": [], - "links": [ - { - "bdbkUrl": [ - "http://baike.baidu.com/item/%E8%8B%B9%E6%9E%9C/6011191" - ] - } - ] - }, - { - "id": "688dc07cc98f02cbd4d21e2700290590", - "subtype": [ - "影视作品_cb_韩国电影" - ], - "alias": [], - "alias_ext": [], - "links": [ - { - "bdbkUrl": [ - "http://baike.baidu.com/item/%E8%8B%B9%E6%9E%9C/6011208" - ] - } - ] - }, - { - "id": "bbf4abe6ac412b181eac383333ca9fef", - "subtype": [ - "影视作品_cb_剧情电影" - ], - "alias": [], - "alias_ext": [], - "links": [ - { - "bdbkUrl": [ - "http://baike.baidu.com/item/%E8%8B%B9%E6%9E%9C/6011176" - ] - } - ] - } - ], - "subterms_num": 3, - "alias": [], - "alias_ext": [], - "links": [] -} -``` - -## TermTree特点 - - 1. 将所有中文词汇放在一个统一类别体系下表示,包括**概念、实体/专名、领域术语、语法词**。 -- 解决传统标注技术下(e.g., 词性标注、命名实体识别),概念、实体、词性特征难以统一计算的问题。 - - 2. 为中文精准解析挖掘服务的词汇类别体系,以全面覆盖**百科词条、搜索query、新闻资讯**中出现的中文词汇为目标,支持通用场景文本理解。 - - 应用可以通过指定词表的TermType,方便地整合到TermTree中,定制应用特化版。 - - 3. 尽可能收录常用概念词,并区分常用概念词(src=cb)和专名实体词(src=eb),以解决专名实体与概念在计算中容易混淆的问题。为此,特别补充收录了很多百科中缺少的概念词。 - - 例:“琴房(歌曲类实体)” VS. “琴房(区域场所类概念)” - - 例:“甩掉(歌曲类实体)” VS. “甩掉(场景事件类概念)” - - 4. 将同类同名实体拆分为term和subterm两层(参见数据示例),term作为给定termtype下所有同名实体的表示,subterm作为同类同名实体集中每一个具体实体的表示: - - 一方面解决文本中信息不足无法区分具体实体时的标注问题; - - 一方面减少同名词汇的消歧计算代价(只需要计算同类下的同名实体,有效解决概念词和实体词识别混淆的问题) - - 5. 
为重要的概念/实体构建完整上位归类路径(**注:** TermTreeV1.0试用版暂不包括),用于细粒度特征计算和知识推断,参见以下示例 - - | term | 类别| src| 上位归类路径示例 | - |---|---|---|---| - |苹果 | 植物类|cb|苹果 → 苹果属 → 蔷薇科 → 蔷薇目 → 双子叶植物纲 → 被子植物门 → 种子植物 → 植物界 → 真核生物域 → 生物| - | 黄香蕉苹果| 饮食类|cb|黄香蕉苹果 →苹果 →水果 → 蔬果和菌藻类 →食材 →食物 →饮食| - |甲型流感 | 疾病类|cb|甲型流感 → 流行性感冒 → 感冒 → 呼吸道感染 → 呼吸系统疾病 → 疾病损伤 → 生物疾病| - |甲型流感病毒| 微生物类|cb|甲型流感病毒 → 流行性感冒病毒 → 正粘病毒科 → RNA病毒 → 生物病毒 → 病原微生物 → 微生物 → 生物| - |琴房| 区域场所类|cb|琴房 → 音乐室 → 活动室 →活动场所 →区域场所| - |琴房| 音乐类|eb|琴房 → 歌曲 →音乐作品 →艺术作品 →作品 → 作品与出版物| - |认同感 | 生活用语类|cb|认同感 →正面感受 → 感受 → 知觉感受 → 个体描述 → 生活用语| - | 认同感| 图书类|eb|认同感 →书籍 →图书 →书刊 →出版物 → 作品与出版物| - |佛罗伦萨足球俱乐部| 体育组织机构|eb|佛罗伦萨足球俱乐部 →意大利足球联赛球队→职业足球俱乐部→足球俱乐部 →足球队 →球队 →运动队 →体育组织机构 →组织机构| - |佛罗伦萨市 | 世界地区类|cb|佛罗伦萨市 →托斯卡纳大区 →意大利 →南欧 →欧洲 →地球区域 →世界地区| - |言情小说 | 小说类|cb|言情小说 →情感小说 →小说 →文学作品 →作品 →作品与出版物| - | 言情小说| 音乐类|eb|言情小说 → 歌曲 →音乐作品 →艺术作品 →作品 → 作品与出版物| -> **注:** TermType词类体系可视为所有上位归类路径的集合。 - -## TermTree应用方式 - -1. 直接作为词表使用,利用termtype和subtype筛选应用所需的词表(停用词表、黑白名单、概念扩展词表等)。 -2. 结合中文文本知识标注工具(WordTag等)使用,用于文本词类特征生成、挖掘/解析pattern生成、样本构建和优化等等,参见"[解语的应用场景](../)"。 -3. 整合应用知识图谱,为应用知识图谱提供通用词汇知识补充。 - -## TermTree后续规划 - -1. 数据覆盖扩展到全量百度百科词条,提升TermType归类准确率,便于应用方筛选构建应用适配的TermTree; -2. 建立知识共建社区,支持用户提交自己的term词表,生成定制版TermTree。 - - -## 在论文中引用TermTree -如果您的工作成果中使用了TermTree,请增加下述引用。我们非常乐于看到TermTree对您的工作带来帮助。 -``` -@article{zhao2020TermTree, - title={TermTree and Knowledge Annotation Framework for Chinese Language Understanding}, - author={Zhao, Min and Qin, Huapeng and Zhang, Guoxin and Lyu, Yajuan and Zhu, Yong}, - technical report={Baidu, Inc. TR:2020-KG-TermTree}, - year={2020} -} -``` - -## 问题与反馈 - -百科知识树在持续扩充优化中,如果您有任何建议或发现数据问题,欢迎提交issue到Github。 diff --git a/examples/text_to_knowledge/termtree/termtree.py b/examples/text_to_knowledge/termtree/termtree.py deleted file mode 100644 index 7b09795ef113..000000000000 --- a/examples/text_to_knowledge/termtree/termtree.py +++ /dev/null @@ -1,416 +0,0 @@ -# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import csv -import json -import os -import warnings -from typing import Any, Dict, List, Optional, Tuple, Union - - -class TermTreeNode(object): - """Defination of term node. All members are protected, to keep rigorism of data struct. - - Args: - sid (str): term id of node. - term (str): term, common name of this term. - base (str): `cb` indicates concept base, `eb` indicates entity base. - term_type (Optional[str], optional): type of this term, constructs hirechical of `term` node. Defaults to None. - hyper (Optional[str], optional): parent type of a `type` node. Defaults to None. - node_type (str, optional): type statement of node, `type` or `term`. Defaults to "term". - alias (Optional[List[str]], optional): alias of this term. Defaults to None. - alias_ext (Optional[List[str]], optional): extended alias of this term, CANNOT be used in matching. - Defaults to None. - sub_type (Optional[List[str]], optional): grouped by some term. Defaults to None. - sub_term (Optional[List[str]], optional): some lower term. 
Defaults to None. - data (Optional[Dict[str, Any]], optional): to sore full imformation of a term. Defaults to None. - - """ - - def __init__( - self, - sid: str, - term: str, - base: str, - node_type: str = "term", - term_type: Optional[str] = None, - hyper: Optional[str] = None, - level: Optional[int] = None, - alias: Optional[List[str]] = None, - alias_ext: Optional[List[str]] = None, - sub_type: Optional[List[str]] = None, - sub_term: Optional[List[str]] = None, - data: Optional[Dict[str, Any]] = None, - ): - self._sid = sid - self._term = term - self._base = base - self._term_type = term_type - self._hyper = hyper - self._sub_term = sub_term if sub_term is not None else [] - self._sub_type = sub_type if sub_type is not None else [] - self._alias = alias if alias is not None else [] - self._alias_ext = alias_ext if alias_ext is not None else [] - self._data = data - self._level = level - self._node_type = node_type - self._sons = set() - - def __str__(self): - if self._data is not None: - return json.dumps(self._data, ensure_ascii=False) - else: - res = { - "termid": self._sid, - "term": self._term, - "src": self._base, - "alias": self._alias, - "alias_ext": self._alias_ext, - "termtype": self._term_type, - "subterms": self._sub_term, - "subtype": self._sub_type, - "links": [], - } - return json.dumps(res, ensure_ascii=False) - - @property - def sid(self): - return self._sid - - @property - def term(self): - return self._term - - @property - def base(self): - return self._base - - @property - def alias(self): - return self._alias - - @property - def alias_ext(self): - return self._alias_ext - - @property - def termtype(self): - return self._term_type - - @property - def subtype(self): - return self._sub_type - - @property - def subterm(self): - return self._sub_term - - @property - def hyper(self): - return self._hyper - - @property - def level(self): - return self._level - - @property - def sons(self): - return self._sons - - @property - def node_type(self): - return self._node_type - - def add_son(self, son_name): - self._sons.add(son_name) - - @classmethod - def from_dict(cls, data: Dict[str, Any]): - """Build a node from dictionary data. - - Args: - data (Dict[str, Any]): Dictionary data contain all k-v data. - - Returns: - [type]: TermTree node object. - """ - return cls( - sid=data["termid"], - term=data["term"], - base=data["src"], - term_type=data["termtype"], - sub_type=data["subtype"], - sub_term=data["subterms"], - alias=data["alias"], - alias_ext=data["alias_ext"], - data=data, - ) - - @classmethod - def from_json(cls, json_str: str): - """Build a node from JSON string. - - Args: - json_str (str): JSON string formatted by TermTree data. - - Returns: - [type]: TermTree node object. 
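- Example (fields as in the TermTree data samples below): - TermTreeNode.from_json('{"termid": "植物_cb_苹果", "term": "苹果", "src": "cb", "termtype": "植物", "subtype": [], "subterms": [], "alias": ["苹果树"], "alias_ext": [], "links": []}')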
- """ - dict_data = json.loads(json_str) - return cls.from_dict(dict_data) - - -class TermTree(object): - """TermTree class.""" - - def __init__(self): - self._nodes: Dict[str, TermTreeNode] = {} - self._root = TermTreeNode(sid="root", term="root", base="cb", node_type="root", level=0) - self._nodes["root"] = self.root - self._index = {} - - def __build_sons(self): - for node in self._nodes: - self.__build_son(self._nodes[node]) - - def __getitem__(self, item): - return self._nodes[item] - - def __contains__(self, item): - return item in self._nodes - - def __iter__(self): - return self._nodes.__iter__() - - @property - def root(self): - return self._root - - def __load_type(self, file_path: str): - with open(file_path, "rt", newline="", encoding="utf8") as csvfile: - file_handler = csv.DictReader(csvfile, delimiter="\t") - for row in file_handler: - if row["type-1"] not in self: - self.add_type(type_name=row["type-1"], hyper_type="root") - if row["type-2"] != "" and row["type-2"] not in self: - self.add_type(type_name=row["type-2"], hyper_type=row["type-1"]) - if row["type-3"] != "" and row["type-3"] not in self: - self.add_type(type_name=row["type-3"], hyper_type=row["type-2"]) - - def __judge_term_node(self, node: TermTreeNode) -> bool: - if node.termtype not in self: - raise ValueError(f"Term type of new node {node.termtype} does not exists.") - if node.sid in self: - warnings.warn(f"{node.sid} exists, will be replaced by new node.") - - def add_term( - self, - term: Optional[str] = None, - base: Optional[str] = None, - term_type: Optional[str] = None, - sub_type: Optional[List[str]] = None, - sub_term: Optional[List[str]] = None, - alias: Optional[List[str]] = None, - alias_ext: Optional[List[str]] = None, - data: Optional[Dict[str, Any]] = None, - ): - """Add a term into TermTree. - - Args: - term (str): common name of name. - base (str): term is concept or entity. - term_type (str): term type of this term - sub_type (Optional[List[str]], optional): sub type of this term, must exists in TermTree. Defaults to None. - sub_terms (Optional[List[str]], optional): sub terms of this term. Defaults to None. - alias (Optional[List[str]], optional): alias of this term. Defaults to None. - alias_ext (Optional[List[str]], optional): . Defaults to None. - data (Optional[Dict[str, Any]], optional): [description]. Defaults to None. - """ - if data is not None: - new_node = TermTreeNode.from_dict(data) - else: - new_node = TermTreeNode( - sid=f"{term_type}_{base}_{term}", - term=term, - base=base, - term_type=term_type, - sub_term=sub_term, - sub_type=sub_type, - alias=alias, - alias_ext=alias_ext, - node_type="term", - ) - self.__judge_term_node(new_node) - self._nodes[new_node.sid] = new_node - self.__build_index(new_node) - - def add_type(self, type_name, hyper_type): - if type_name in self._nodes: - raise ValueError(f"Term Type {type_name} exists.") - if hyper_type not in self._nodes: - raise ValueError(f"Hyper type {hyper_type} does not exist, please add it first.") - if self._nodes[hyper_type].level == 3: - raise ValueError( - "Term type schema must be 3-LEVEL, 3rd level type node should not be a parent of type node." 
- ) - self._nodes[type_name] = TermTreeNode( - sid=type_name, - term=type_name, - base=None, - hyper=hyper_type, - node_type="type", - level=self._nodes[hyper_type].level + 1, - ) - self.__build_index(self._nodes[type_name]) - - def __load_file(self, file_path: str): - with open(file_path, encoding="utf-8") as fp: - for line in fp: - data = json.loads(line) - self.add_term(data=data) - - def __build_son(self, node: TermTreeNode): - """Build sons of a node - - Args: - node (TermTreeNode): son node. - """ - type_node = None - if node.termtype is not None: - type_node = self._nodes[node.termtype] - elif node.hyper is not None: - type_node = self._nodes[node.hyper] - if type_node is not None: - type_node.add_son(node.sid) - for sub_type in node.subtype: - sub_type_node = self._nodes[sub_type] - sub_type_node.add_son(node.sid) - - def build_son(self, node: str): - self.__build_son(self[node]) - - def __build_index(self, node: TermTreeNode): - if node.term not in self._index: - self._index[node.term] = [] - self._index[node.term].append(node.sid) - for alia in node.alias: - if alia not in self._index: - self._index[alia] = [] - self._index[alia].append(node.sid) - - def __judge_hyper(self, source_id, target_id) -> bool: - queue = [source_id] - visited_node = {source_id} - while len(queue) > 0: - cur_id = queue.pop(0) - if cur_id == target_id: - return True - cur_node = self._nodes[cur_id] - edge = [] - if cur_node.hyper is not None: - edge.append(cur_node.hyper) - if cur_node.termtype is not None: - edge.append(cur_node.termtype) - edge.extend(cur_node.subtype) - for next_id in edge: - if next_id not in visited_node: - queue.append(next_id) - visited_node.add(next_id) - return False - - def find_term(self, term: str, term_type: Optional[str] = None) -> Tuple[bool, Union[List[str], None]]: - """Find a term in Term Tree. If term not exists, return None. - If `term_type` is not None, will find term with this type. - - Args: - term (str): term to look up. - term_type (Optional[str], optional): find term in this term_type. Defaults to None. - - Returns: - Union[None, List[str]]: [description] - """ - if term not in self._index: - return False, None - else: - if term_type is None: - return True, self._index[term] - else: - out = [] - for term_id in self._index[term]: - if self.__judge_hyper(term_id, term_type) is True: - out.append(term_id) - if len(out) > 0: - return True, out - else: - return False, None - - def build_from_dir(self, term_schema_path, term_data_path, linking=True): - """Build TermTree from a directory which should contain type schema and term data. - - Args: - dir ([type]): [description] - """ - self.__load_type(term_schema_path) - if linking: - self.__load_file(term_data_path) - self.__build_sons() - - @classmethod - def from_dir(cls, term_schema_path, term_data_path, linking=True) -> "TermTree": - """Build TermTree from a directory which should contain type schema and term data. 
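- Note: despite the wording above, this takes two file paths rather than a directory: the type schema CSV (term_schema_path) and the term data file (term_data_path).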
- - Args: - source_dir ([type]): [description] - - Returns: - TermTree: [description] - """ - term_tree = cls() - term_tree.build_from_dir(term_schema_path, term_data_path, linking) - return term_tree - - def __dfs(self, cur_id: str, depth: int, path: Dict[str, str], writer: csv.DictWriter): - cur_node = self._nodes[cur_id] - if cur_node.node_type == "term": - return - if depth > 0: - path[f"type-{depth}"] = cur_id - if path["type-1"] != "": - writer.writerow(path) - for son in cur_node.sons: - self.__dfs(son, depth + 1, path, writer) - if depth > 0: - path[f"type-{depth}"] = "" - - def save(self, save_dir): - """Save term tree to directory `save_dir` - - Args: - save_dir ([type]): Directory. - """ - if os.path.exists(save_dir) is False: - os.makedirs(save_dir, exist_ok=True) - out_path = {} - for i in range(1, 3): - out_path[f"type-{i}"] = "" - with open(f"{save_dir}/termtree_type.csv", "wt", encoding="utf-8", newline="") as fp: - fieldnames = ["type-1", "type-2", "type-3"] - csv_writer = csv.DictWriter(fp, delimiter="\t", fieldnames=fieldnames) - csv_writer.writeheader() - self.__dfs("root", 0, out_path, csv_writer) - with open(f"{save_dir}/termtree_data", "w", encoding="utf-8", newline="") as fp: - for nid in self: - node = self[nid] - if node.node_type == "term": - print(node, file=fp) diff --git a/examples/text_to_knowledge/termtree/termtree_type.csv b/examples/text_to_knowledge/termtree/termtree_type.csv deleted file mode 100644 index 0ab88eefe103..000000000000 --- a/examples/text_to_knowledge/termtree/termtree_type.csv +++ /dev/null @@ -1,164 +0,0 @@ -type-1 type-2 type-3 说明 主要对应词性 subtype示例(开放集合) src(C表示cb、E表示eb) -角色 n/普通名词;nr/人名 群体 C/E -角色 人物 人物类概念、人物类实体 nr/人名 职业角色、历史人物、行业人物 C/E -角色 民族族群 民族和族群 五十六个民族 C -角色 虚拟角色 非现实的角色 虚拟人物、虚拟生物 C/E -作品与出版物 作品类概念、作品类实体 nw/作品名 拓片 C/E -作品与出版物 游戏 电子游戏、视频小游戏、网页游戏 C/E -作品与出版物 影视动漫作品 视频作品 C/E -作品与出版物 影视动漫作品 动漫作品 漫画、动画 C/E -作品与出版物 影视动漫作品 影视作品 电影、电视剧 C/E -作品与出版物 影视动漫作品 视频节目 脱口秀节目、新闻类节目、访谈节目 C/E -作品与出版物 音乐 歌曲、音乐专辑 C/E -作品与出版物 小说 网络小说、言情小说 C/E -作品与出版物 诗歌 诗词 C/E -作品与出版物 计算机软件 工具软件、办公软件 C/E -作品与出版物 舞蹈 C/E -作品与出版物 美术 雕塑作品、油画作品、工艺美术作品 C/E -作品与出版物 图书 书籍、词典、教材 C/E -作品与出版物 刊物 报纸、期刊 C/E -作品与出版物 文件 文书 C/E -作品与出版物 作品IP C/E -区域场所 ns/地名 宗教场所、建筑物 C/E -区域场所 景点 公园、植物园、动物园、博物馆 C/E -区域场所 楼盘住宅 商业楼盘、住宅楼盘、住宅小区 C/E -区域场所 交通场所 机场、车站、港口、交通道路、交通线路 C/E -区域场所 住宿场所 酒店、旅馆 C/E -区域场所 餐饮场所 咖啡馆、餐馆 C/E -区域场所 网上组织机构场所 网站、虚拟社区 C/E -位置方位 位置方位词 f/方位名词;s/处所名词 C -组织机构 组织机构类 nt/机构团体名 委员会、论坛 C/E -组织机构 演艺团体 乐队、艺术团、偶像组合 C/E -组织机构 国家机关 政府部门、党政机关 C/E -组织机构 企事业单位 公司、厂商、企业 C/E -组织机构 教育组织机构 学校、大学、幼儿园、培训机构 C/E -组织机构 居民服务机构 母婴护理机构、婚介机构、美容护理机构、家政服务机构 C/E -组织机构 医疗卫生机构 医院、药店、诊所、科室 C/E -组织机构 体育组织机构 运动队、体育俱乐部 C/E -组织机构 金融组织机构 银行、交易所、投资机构 C/E -组织机构 军事组织机构 部队、军区 C/E -品牌 品牌名 n/普通名词;nz/其他专名 C/E -品牌 汽车品牌 C/E -品牌 手机品牌 C/E -品牌 个护用品品牌 护肤品牌、彩妆品牌 C/E -物体与物品 包括物品和物质 n/普通名词;nz/其他专名 物体构造、化学物质 C/E -物体与物品 物品 飞机、船舶、轴承、摄影器材 C/E -物体与物品 物品 汽车 C/E -物体与物品 物品 手机 C/E -物体与物品 物品 美容美发用品 化妆品、美发用品 C/E -物体与物品 物品 电子电器产品 计算机、家用电器 C/E -物体与物品 物品 衣物饰品 服装、箱包、鞋靴、饰品配件 C/E -物体与物品 物品 兵器 武器、导弹、冷兵器.. 
C/E -物体与物品 设施 C/E -饮食 饮食类 n/普通名词;nz/其他专名 食材 C/E -饮食 菜品 菜品类 汤品、面食 C/E -饮食 饮品 饮品类 茶叶、酒、饮料和冷饮类 C/E -生物 生物类 n/普通名词;nz/其他专名 C -生物 动物 动物类 猫、狗、鸟纲、昆虫纲、鱼纲 C -生物 植物 植物类 C -生物 微生物 微生物类 真菌、细菌、生物病毒 C -世界地区 世界地区,包括地球外区域 ns/地名 首都、地球山脉、地球河流、地球岛屿、历史地区 C -世界地区 中国地区 中国地区 中国省级行政区、中国省会 C -世界地区 国家 现代意义的国家 现存国家、历史国家 C -虚拟事物 非现实事物 n/普通名词;nz/其他专名 虚拟场所、虚拟场景 C/E -虚拟事物 虚拟物品 虚拟宝物、游戏卡牌、游戏道具、游戏装备 C/E -文化 文化相关的特定类目 n/普通名词;nz/其他专名 C -文化 姓氏与人名 中文姓氏、英文姓氏 C -文化 语言文字 汉语、方言 C -文化 民俗 方术、数术、十二生肖、占星学星座、周易六十四卦 C -文化 政权朝代 历史政权、中国朝代 C -文化 制度政策协议 C/E -文化 奖项赛事活动 奖项、活动 C/E -事件 事件类 n/普通名词;nz/其他专名 展览、会议、案件、事故、战争 C/E -术语 领域术语、专名 nz/其他专名 C -术语 编码符号指标 价格、符号、信号、度量单位、邮政编码..... C -术语 教育用语 C -术语 教育用语 学科 C -术语 教育用语 学历学位 学历、学位 C -术语 教育用语 专业 C -术语 游戏用语 C -术语 游戏用语 麻将术语 C -术语 医药学术语 中医术语、西医术语、医学指标、诊断治疗方法 C -术语 医药学术语 医疗美容项目 C -术语 医药学术语 药物 中药、西药 C -术语 医药学术语 疾病损伤 疾病、疾病症状 C -术语 医药学术语 动物疾病 C -术语 金融术语 股票术语、证券术语、保险术语、银行术语 C -术语 金融术语 股票 C/E -术语 金融术语 保险 C/E -术语 金融术语 基金 C/E -术语 金融术语 银行卡 借记卡、信用卡 C/E -术语 经济术语 会计术语 C -术语 法律术语 C -术语 法律术语 法律法规 法律体系、法律、法规 C/E -术语 法律术语 罪名 C -术语 体育术语 围棋术语、象棋术语、篮球术语 C -术语 体育术语 体育运动项目 球类运动、武术功夫 C -术语 赌博博彩用语 赌博用语 C -术语 赌博博彩用语 彩票 C -术语 天文学术语 星系、恒星 C -术语 天文学术语 星座 八十八星座 C -术语 天文学术语 星体 小行星 C -术语 生物学术语 C -术语 生物学术语 动物体构造 动物器官系统、骨 C -术语 生物学术语 植物病虫害 植物病害、植物虫害 C -术语 机械工程术语 机械制造术语 C -术语 机械工程术语 汽车术语 C -术语 大气科学术语 气象学术语、气候学术语 C -术语 大气科学术语 台风 C/E -术语 计算机术语 病毒程序、计算机网络术语、编程技术术语 C -术语 文化术语 摄影术语、音乐术语、文学术语 C -术语 数学术语 数学概念、数学公式、几何学术语 C -术语 物理术语 电学术语、力学术语 C -术语 化学术语 化学结构 C -术语 统计术语 数理统计术语 C -术语 地学术语 地理学术语、地质学术语 C -术语 农业学术语 土壤学术语 C -术语 心理学术语 心理现象 C -术语 语言学术语 语法、词法、音韵学术语 C -术语 建筑术语 土木工程术语、装修术语 C -术语 军事术语 C -术语 政治术语 C -术语 哲学术语 哲学理论、伦理学术语、逻辑学术语 C -术语 宗教术语 道教术语、佛教术语 C -术语 通信术语 C -术语 材料科学术语 C -术语 航空科技术语 C -术语 水利科技术语 水利工程 C -术语 测绘学术语 测量术语 C -术语 电力术语 C -术语 社会学术语 C -术语 交通术语 船舶工程术语 C -术语 钓鱼术语 C -术语 ACGN术语 C -生活用语 日常生活中常用词 n/普通名词;nz/其他专名 信息知识资料、标识物、行业、服务 C -生活用语 情绪 C -生活用语 态度 C -生活用语 表情 笑、哭、眼神 C -生活用语 人物造型 妆容、发型 C -生活用语 个性特点 C -生活用语 颜色 C -生活用语 场景事件 包括常见动词 v/普通动词;vn/名动词;vd/动副词 考试 C/E -时间阶段 时间相关词 t/时间名词 时间、年代、世纪... 
C -时间阶段 地质年代 C -时间阶段 特殊日 农历二十四节气、假日、节日、纪念日 C -词汇用语 语法词类、汉字、成语等,用于兜底 n/普通名词 C -词汇用语 汉字 汉字字表 C/E -词汇用语 成语 成语词表 C/E -词汇用语 俗语 非成语的俗语 歇后语、顺口溜、谚语 C/E -词汇用语 诗句 诗句 C/E -词汇用语 介词 介词 p/介词 C -词汇用语 助词 助词 u/助词 C -词汇用语 代词 代词 r/代词 C -词汇用语 连词 连词 c/连词 C -词汇用语 副词 副词 d/副词 C -词汇用语 疑问词 疑问词 C -词汇用语 肯定否定词 常用肯定词和否定词 C -词汇用语 量词 量词 q/量词 C -词汇用语 数量词 数量词 m/数量词 C -词汇用语 叹词 叹词 C -词汇用语 拟声词 拟声词 C -词汇用语 修饰词 修饰词,包括常见形容词 n/普通名词;a/形容词;ad/副形词;an/名形词 C -词汇用语 汉字偏旁部首 汉字偏旁部首 C -词汇用语 日文假名 日文假名 平假名、片假名 C -词汇用语 汉语拼音 汉语拼音字母 C diff --git a/examples/text_to_knowledge/wordtag-ie/README.md b/examples/text_to_knowledge/wordtag-ie/README.md deleted file mode 100644 index fd75d6670da3..000000000000 --- a/examples/text_to_knowledge/wordtag-ie/README.md +++ /dev/null @@ -1,135 +0,0 @@ -# 解语:WordTag-IE(基于中文词类知识的信息抽取工具) - -WordTag-IE(基于中文词类知识的信息抽取工具)是在WordTag标注结果之上实现的信息抽取工具,旨在提供一个灵活、可配置,能够精准、全面覆盖简单句式的**规则信息抽取工具**。我们已提供了通用配置,可覆盖一些常见的抽取句式。用户也可以根据我们提供的配置方法,完成自己的配置,应用于自己的领域、专业文本。其产出数据,可作为模型的训练样本,也可以直接当作挖掘结果使用。 - -![](https://user-images.githubusercontent.com/1371212/172542329-754cb4f9-3526-400b-be6e-d60e078af872.png) - -## WordTag-IE特点 - -- **灵活、方便的配置,即时生效** - - WordTag-IE是在WordTag标注结果的基础上,完全使用规则实现的关系抽取工具。其配置完全基于WordTag的词类知识以及TermTree中的词表实现,实现了灵活、简单配置,且保证了产出数据的一致性 - -## 使用示例 - -在WordTag的任务中基础上可以打开`with_ie` 开关即可输出信息抽取的结果, 下面是使用PaddleNLP Taskflow使用WordTag-IE的使用示例。 -```python ->>> from paddlenlp import Taskflow ->>> wordtag_ie = Taskflow("knowledge_mining", with_ie=True) ->>> wordtag_ie('《忘了所有》是一首由王杰作词、作曲并演唱的歌曲,收录在专辑同名《忘了所有》中,由波丽佳音唱片于1996年08月31日发行。') -[[{'text': '《忘了所有》是一首由王杰作词、作曲并演唱的歌曲,收录在专辑同名《忘了所有》中,由波丽佳音唱片于1996年08月31日发行。', 'items': [{'item': '《', 'offset': 0, 'wordtag_label': 'w', 'length': 1}, {'item': '忘了所有', 'offset': 1, 'wordtag_label': '作品类_实体', 'length': 4}, {'item': '》', 'offset': 5, 'wordtag_label': 'w', 'length': 1}, {'item': '是', 'offset': 6, 'wordtag_label': '肯定词', 'length': 1}, {'item': '一首', 'offset': 7, 'wordtag_label': '数量词_单位数量词', 'length': 2}, {'item': '由', 'offset': 9, 'wordtag_label': '介词', 'length': 1}, {'item': '王杰', 'offset': 10, 'wordtag_label': '人物类_实体', 'length': 2}, {'item': '作词', 'offset': 12, 'wordtag_label': '场景事件', 'length': 2}, {'item': '、', 'offset': 14, 'wordtag_label': 'w', 'length': 1}, {'item': '作曲', 'offset': 15, 'wordtag_label': '场景事件', 'length': 2}, {'item': '并', 'offset': 17, 'wordtag_label': '连词', 'length': 1}, {'item': '演唱', 'offset': 18, 'wordtag_label': '场景事件', 'length': 2}, {'item': '的', 'offset': 20, 'wordtag_label': '助词', 'length': 1}, {'item': '歌曲', 'offset': 21, 'wordtag_label': '作品类_概念', 'length': 2}, {'item': ',', 'offset': 23, 'wordtag_label': 'w', 'length': 1}, {'item': '收录', 'offset': 24, 'wordtag_label': '场景事件', 'length': 2}, {'item': '在', 'offset': 26, 'wordtag_label': '介词', 'length': 1}, {'item': '专辑', 'offset': 27, 'wordtag_label': '作品类_概念', 'length': 2}, {'item': '同名', 'offset': 29, 'wordtag_label': '场景事件', 'length': 2}, {'item': '《', 'offset': 31, 'wordtag_label': 'w', 'length': 1}, {'item': '忘了所有', 'offset': 32, 'wordtag_label': '作品类_实体', 'length': 4}, {'item': '》', 'offset': 36, 'wordtag_label': 'w', 'length': 1}, {'item': '中', 'offset': 37, 'wordtag_label': '词汇用语', 'length': 1}, {'item': ',', 'offset': 38, 'wordtag_label': 'w', 'length': 1}, {'item': '由', 'offset': 39, 'wordtag_label': '介词', 'length': 1}, {'item': '波丽佳音', 'offset': 40, 'wordtag_label': '人物类_实体', 'length': 4}, {'item': '唱片', 'offset': 44, 'wordtag_label': '作品类_概念', 'length': 2}, {'item': '于', 'offset': 46, 'wordtag_label': '介词', 'length': 1}, {'item': '1996年08月31日', 'offset': 47, 'wordtag_label': '时间类_具体时间', 'length': 11}, 
{'item': '发行', 'offset': 58, 'wordtag_label': '场景事件', 'length': 2}, {'item': '。', 'offset': 60, 'wordtag_label': 'w', 'length': 1}]}], [[{'HEAD_ROLE': {'item': '王杰', 'offset': 10, 'type': '人物类_实体'}, 'TAIL_ROLE': [{'item': '忘了所有', 'type': '作品类_实体', 'offset': 1}], 'GROUP': '创作', 'TRIG': [{'item': '作词', 'offset': 12}, {'item': '作曲', 'offset': 15}, {'item': '演唱', 'offset': 18}], 'SRC': 'REVERSE'}, {'HEAD_ROLE': {'item': '忘了所有', 'type': '作品类_实体', 'offset': 1}, 'TAIL_ROLE': [{'item': '王杰', 'offset': 10, 'type': '人物类_实体'}], 'GROUP': '创作者', 'SRC': 'HTG', 'TRIG': [{'item': '作词', 'offset': 12}, {'item': '作曲', 'offset': 15}, {'item': '演唱', 'offset': 18}]}, {'HEAD_ROLE': {'item': '忘了所有', 'type': '作品类_实体', 'offset': 1}, 'TAIL_ROLE': [{'item': '歌曲', 'offset': 21, 'type': '作品类_概念'}], 'GROUP': '类型', 'SRC': 'TAIL'}, {'HEAD_ROLE': {'item': '忘了所有', 'offset': 32, 'type': '作品类_实体'}, 'TAIL_ROLE': [{'item': '忘了所有', 'type': '作品类_实体', 'offset': 1}], 'GROUP': '收录', 'TRIG': [{'item': '收录', 'offset': 24}], 'SRC': 'REVERSE'}, {'HEAD_ROLE': {'item': '忘了所有', 'type': '作品类_实体', 'offset': 1}, 'TAIL_ROLE': [{'item': '忘了所有', 'offset': 32, 'type': '作品类_实体'}], 'GROUP': '收录于', 'SRC': 'HGT', 'TRIG': [{'item': '收录', 'offset': 24}]}, {'HEAD_ROLE': {'item': '忘了所有', 'offset': 32, 'type': '作品类_实体'}, 'TAIL_ROLE': [{'item': '王杰', 'type': '人物类_实体', 'offset': 10}], 'GROUP': '创作者', 'TRIG': [{'item': '专辑', 'offset': 27}], 'SRC': 'REVERSE'}, {'HEAD_ROLE': {'item': '王杰', 'type': '人物类_实体', 'offset': 10}, 'TAIL_ROLE': [{'item': '忘了所有', 'offset': 32, 'type': '作品类_实体'}], 'GROUP': '创作', 'SRC': 'HGT', 'TRIG': [{'item': '专辑', 'offset': 27}]}, {'HEAD_ROLE': {'item': '忘了所有', 'type': '作品类_实体', 'offset': 32}, 'TAIL_ROLE': [{'item': '唱片', 'offset': 44, 'type': '作品类_概念'}], 'GROUP': '类型', 'SRC': 'TAIL'}]]] -``` -同时可以通过 `schema` 来配置相关关系类型, 抽取自定义的关系组合 - -``` python ->>> from pprint import pprint ->>> schema = [ - { - "head_role": "作品类_实体", #头实体词类 - "group": "创作者", #关系名 - "tail_role": [ - { - "main": [ - "人物类_实体" #尾实体词类 - ], - "support": [] #相关词类,可作为该关系的补充,不可作为尾实体独立存在 - } - ], - "trig_word": [ - "作词", #触发词,对于没有触发词,而是由头尾实体直接触发的关系,可为null - ], - "trig_type": "trigger", #trigger表明由触发词触发,tail表明为尾实体触发 - "reverse": False, #是否为反向配置,即尾实体实际是头,头实体实际是尾 - "trig_direction": "B", #触发P的方向,表示在自然表达中,尾实体在触发词的哪一边,L为左,R为右,B为双向都有可能,默认为B - "rel_group": "创作" #对应的反关系,即头尾实体对调后,对应的关系,用于逻辑推断 - }] ->>> wordtag_ie.set_schema(schema) ->>> pprint(wordtag_ie('《忘了所有》是一首由王杰作词、作曲并演唱的歌曲,收录在专辑同名《忘了所有》中,由波丽佳音唱片于1996年08月31日发行。')[1]) -[[{'GROUP': '创作', - 'HEAD_ROLE': {'item': '王杰', 'offset': 10, 'type': '人物类_实体'}, - 'SRC': 'REVERSE', - 'TAIL_ROLE': [{'item': '忘了所有', 'offset': 1, 'type': '作品类_实体'}], - 'TRIG': [{'item': '作词', 'offset': 12}]}, - {'GROUP': '创作者', - 'HEAD_ROLE': {'item': '忘了所有', 'offset': 1, 'type': '作品类_实体'}, - 'SRC': 'HTG', - 'TAIL_ROLE': [{'item': '王杰', 'offset': 10, 'type': '人物类_实体'}], - 'TRIG': [{'item': '作词', 'offset': 12}]}]] -``` - -## 配置示例 - -我们提供了配置示例文件[demo_config.json](./demo_config.json),用户可以直接基于这个文件实现自己想要的配置。 - -我们以“出版方”这个关系为例: - -```json -{ - "head_role": "作品类_实体", //头实体词类 - "group": "出版方", //关系名 - "tail_role": [ - { - "main": [ - "组织机构类" - ], //尾实体词类 - "support": [ - "时间类_具体时间" - ] //相关词类,可作为该关系的补充,不可作为尾实体独立存在 - } - ], //尾实体配置 - "trig_word": [ - "出版" - ], //触发词,对于没有触发词,而是由头尾实体直接触发的关系,可为null - "trig_direction": "L", //触发P的方向,表示在自然表达中,尾实体在触发词的哪一边,L为左,R为右,B为双向都有可能,默认为B - "trig_type": "trigger", //trigger表明由触发词触发,tail表明为尾实体触发 - "reverse": false, //是否为反向配置,即尾实体实际是头,头实体实际是尾 - "rel_group": "出版" //对应的反关系,即头尾实体对调后,对应的关系,用于逻辑推断 -} -``` - -### 配置原则 - -1. 
文本中的头实体(head_role)一定在尾实体(tail_role)的前面(即左边),可以通过配置反向标记(reverse)和反向关系名(rel_group)生成反关系 -2. 两种触发模式:触发词触发(trig_type为trigger)和尾实体触发(trig_type为tail),两者的触发方向(trig_direction)配置不同 - - - 触发词的触发方向约束的是文本中尾实体在触发词的左边还是右边,默认是双向触发(B),可以配置向左触发(L)或向右(R)触发,以提升挖掘精度 - - - 尾实体触发不用配置方向,因为必然在头实体之后 -## 实现方法 - -使用WordTag的标注结果,相当于已实现将**无限的词收拢到了有限的词类体系中**,而待抽取的关系,则变成了仅发生在词类与词类之间,便可以枚举出来。例如,`人物类_实体`与`作品类_实体`之间的关系可以是“创作”,而“创作”的触发词(如作词、作曲、演唱、执导、出演等)或触发pattern,则可以通过知识库枚举得到,如此,则实现了灵活配置。 - -那么,接下来一个问题则是,我们如何从现在的序列解析结果中,得到关系三元组数据呢? - -要解决这个问题,我们依旧要从中文语言学的成果中寻找答案:中文更偏孤立语,注重**意合**,依靠词序和词之间的意义联系成句,词性、句法特征弱。也就是说,我们在解析的时候,可以尝试摒弃所谓句法特征,只是从次序上下手。于是,我们发现,只需要覆盖好 SPO 的几种常用表达顺序,单向搜索,即可覆盖大部分简单句。 - -例如,对于`<张艺谋,创作,十面埋伏>`这一 SPO 三元组,常用表达顺序有如下几种: - -- S-P-O:张艺谋执导了《十面埋伏》。 -- S-O-P:张艺谋是《十面埋伏》的导演。 -- O-S-P:《十面埋伏》是张艺谋执导的电影。 -- O-P-S:《十面埋伏》的导演是张艺谋。 - -然而,这种模式仍然过于复杂,如遇到多组 SPO 关系并存的文本,如果要完全照顾到这四种表达顺序,则很容易发生混乱,难以得到严格对应的三元组。所以,我们设计了**互反关系**的概念,即头实体和尾实体对调后,对应的反向关系。例如三元组`<张艺谋,创作,十面埋伏>`,则存在一个反向三元组`<十面埋伏,创作者,三元组>`。那么,当我们找到一个头实体之后,只需要考虑它之后的部分(即 `S-P-O` 和 `S-O-P` 两种表达顺序)就行了。 - -另外,我们认为,规范表达中,关系触发和尾实体一定实在同一个短语中出现,所以,触发关系之后,寻找尾实体的过程中,我们仅搜索与触发在同一个短语中的实体及相关元素。 - -## 后续计划 - -- 实现基于语义结构的抽取,覆盖复杂句式 - -## 在论文中引用WordTag-IE - -如果您的工作成果中使用了WordTag-IE,请增加下述引用。我们非常乐于看到WordTag-IE对您的工作带来帮助。 - -``` -@article{qin2022WordTag-IE, - title={WordTag-IE: a Rule-based Tool for Chinese Information Extraction}, - author={Qin, Huapeng and Zhao, Min and Tang, Wei}, - technical report={Baidu, Inc. TR:2022-KG-WordTag-IE}, - year={2022} -} -``` - -## 问题与反馈 - -WordTag-IE在持续优化中,如果您有任何建议或问题,欢迎提交issue到Github。 diff --git a/examples/text_to_knowledge/wordtag-ie/demo_config.json b/examples/text_to_knowledge/wordtag-ie/demo_config.json deleted file mode 100644 index e8d4f9377cdf..000000000000 --- a/examples/text_to_knowledge/wordtag-ie/demo_config.json +++ /dev/null @@ -1,955 +0,0 @@ -[ - { - "head_role": "人物类_实体", - "group": "名字", - "tail_role": [ - { - "main": [ - "文化类_姓氏与人名", - "其他角色类", - "人物类_实体" - ], - "support": [] - } - ], - "trig_word": [ - "原名" - ], - "trig_type": "trigger", - "reverse": false, - "trig_direction": "R" - }, - { - "head_role": "人物类_实体", - "group": "性别", - "tail_role": [ - { - "main": [ - "信息资料" - ], - "support": [] - } - ], - "trig_word": [ - "男", - "女" - ], - "trig_type": "role", - "reverse": false, - "trig_direction": null - }, - { - "head_role": "人物类_实体", - "group": "出生于", - "tail_role": [ - { - "main": [ - "时间类_具体时间" - ], - "support": [ - "世界地区类" - ] - }, - { - "main": [ - "世界地区类" - ], - "support": [ - "时间类_具体时间" - ] - } - ], - "trig_word": [ - "出生", - "出生于", - "生于" - ], - "trig_type": "trigger", - "reverse": false, - "trig_direction": "B" - }, - { - "head_role": "时间类_具体时间", - "group": "出生于", - "tail_role": [ - { - "main": [ - "人物类_实体" - ], - "support": [ - "世界地区类" - ] - } - ], - "trig_word": [ - "出生", - "出生于", - "生于" - ], - "trig_type": "trigger", - "reverse": true, - "trig_direction": "B" - }, - { - "head_role": "人物类_实体", - "group": "参加工作时间", - "tail_role": [ - { - "main": [ - "时间类_具体时间" - ], - "support": [] - } - ], - "trig_word": [ - "参加工作" - ], - "trig_type": "trigger", - "reverse": false, - "trig_direction": "L" - }, - { - "head_role": "人物类_实体", - "group": "入党时间", - "tail_role": [ - { - "main": [ - "时间类_具体时间" - ], - "support": [] - } - ], - "trig_word": [ - "入党" - ], - "trig_type": "trigger", - "reverse": false, - "trig_direction": "L" - }, - { - "head_role": "人物类_实体", - "group": "加入组织", - "tail_role": [ - { - "main": [ - "组织机构类", - "组织机构类_概念" - ], - "support": [ - "时间类_具体时间" - ] - } - ], - "trig_word": [ - "加入", - "参加" - ], - "trig_type": "trigger", - 
"reverse": false, - "trig_direction": "R" - }, - { - "head_role": "时间类_具体时间", - "group": "加入组织", - "tail_role": [ - { - "main": [ - "人物类_实体" - ], - "support": [ - "组织机构类", - "组织机构类_概念" - ] - } - ], - "trig_word": [ - "加入", - "参加" - ], - "trig_type": "trigger", - "reverse": true, - "trig_direction": "R" - }, - { - "head_role": "人物类_实体", - "group": "享年", - "tail_role": [ - { - "main": [ - "数量词" - ], - "support": [] - } - ], - "trig_word": [ - "年仅" - ], - "trig_type": "trigger", - "reverse": false, - "trig_direction": "R" - }, - { - "head_role": "人物类_实体", - "group": "创作", - "tail_role": [ - { - "main": [ - "作品类_实体" - ], - "support": [] - } - ], - "trig_word": [ - "创作", - "监制", - "监制人", - "作词", - "作词人", - "作曲", - "作曲人", - "编曲", - "演唱", - "演唱者", - "制作人", - "制作", - "制片人", - "制片", - "主持人", - "主持", - "导演", - "执导", - "编剧", - "作者", - "所著", - "主编", - "撰写", - "编著", - "编撰" - ], - "trig_type": "trigger", - "reverse": false, - "trig_direction": "B", - "rel_group": "创作者" - }, - { - "head_role": "人物类_实体", - "group": "出演", - "tail_role": [ - { - "main": [ - "作品类_实体" - ], - "support": [] - } - ], - "trig_word": [ - "配音" - ], - "trig_type": "trigger", - "reverse": false, - "trig_direction": "B", - "rel_group": "演员" - }, - { - "head_role": "人物类_实体", - "group": "饰演", - "tail_role": [ - { - "main": [ - "其他角色类", - "人物类_实体" - ], - "support": [ - "作品类_实体" - ] - } - ], - "trig_word": [ - "扮演", - "饰演", - "饰" - ], - "trig_type": "trigger", - "reverse": false, - "trig_direction": "B" - }, - { - "head_role": "人物类_实体", - "group": "代言", - "tail_role": [ - { - "main": [ - "品牌名" - ], - "support": [] - } - ], - "trig_word": [ - "代言" - ], - "trig_type": "trigger", - "reverse": false, - "trig_direction": "B", - "rel_group": "代言人" - }, - { - "head_role": "人物类_实体", - "group": "创建", - "tail_role": [ - { - "main": [ - "组织机构类" - ], - "support": [] - } - ], - "trig_word": [ - "创办", - "创建" - ], - "trig_type": "trigger", - "reverse": false, - "trig_direction": "R", - "rel_group": "创建人" - }, - { - "head_role": "人物类_实体", - "group": "获奖", - "tail_role": [ - { - "main": [ - "文化类_奖项赛事活动" - ], - "support": [ - "作品类_实体", - "数量词_序数词" - ] - } - ], - "trig_word": [ - "获", - "获得", - "荣获", - "获颁" - ], - "trig_type": "trigger", - "reverse": false, - "trig_direction": "R" - }, - { - "head_role": "作品类_实体", - "group": "类型", - "tail_role": [ - { - "main": [ - "作品类_概念" - ], - "support": [] - } - ], - "trig_word": [ - "是" - ], - "trig_type": "trigger", - "reverse": false, - "trig_direction": "R" - }, - { - "head_role": "作品类_实体", - "group": "出品方", - "tail_role": [ - { - "main": [ - "组织机构类" - ], - "support": [ - "时间类_具体时间" - ] - } - ], - "trig_word": [ - "出品" - ], - "trig_type": "trigger", - "reverse": false, - "trig_direction": "L" - }, - { - "head_role": "作品类_实体", - "group": "出版方", - "tail_role": [ - { - "main": [ - "组织机构类" - ], - "support": [ - "时间类_具体时间" - ] - } - ], - "trig_word": [ - "出版" - ], - "trig_type": "trigger", - "reverse": false, - "trig_direction": "L" - }, - { - "head_role": "作品类_实体", - "group": "发表于", - "tail_role": [ - { - "main": [ - "场所类_网上场所" - ], - "support": [] - } - ], - "trig_word": [ - "发表", - "连载", - "发表于", - "连载于" - ], - "trig_type": "trigger", - "reverse": false, - "trig_direction": "B" - }, - { - "head_role": "作品类_实体", - "group": "创作者", - "tail_role": [ - { - "main": [ - "人物类_实体" - ], - "support": [] - } - ], - "trig_word": [ - "创作", - "监制", - "监制人", - "作词", - "作词人", - "作曲", - "作曲人", - "编曲", - "演唱", - "演唱者", - "制作人", - "制作", - "制片人", - "制片", - "主持人", - "主持", - "导演", - "执导", - "编剧", - "作者", - "所著", - "主编", - "撰写", - "编著", - 
"编撰" - ], - "trig_type": "trigger", - "reverse": false, - "trig_direction": "B", - "rel_group": "创作" - }, - { - "head_role": "作品类_实体", - "group": "演员", - "tail_role": [ - { - "main": [ - "人物类_实体" - ], - "support": [] - } - ], - "trig_word": [ - "配音" - ], - "trig_type": "trigger", - "reverse": false, - "trig_direction": "B", - "rel_group": "出演" - }, - { - "head_role": "作品类_实体", - "group": "收录于", - "tail_role": [ - { - "main": [ - "作品类_实体" - ], - "support": [] - } - ], - "trig_word": [ - "收录于" - ], - "trig_type": "trigger", - "reverse": false, - "trig_direction": "B" - }, - { - "head_role": "作品类_实体", - "group": "改编自", - "tail_role": [ - { - "main": [ - "作品类_实体", - "作品类_概念" - ], - "support": [] - } - ], - "trig_word": [ - "改编", - "改编自" - ], - "trig_type": "trigger", - "reverse": false, - "trig_direction": "B" - }, - { - "head_role": "作品类_实体", - "group": "获奖", - "tail_role": [ - { - "main": [ - "文化类_奖项赛事活动" - ], - "support": [] - } - ], - "trig_word": [ - "获", - "获得", - "荣获", - "获颁" - ], - "trig_type": "trigger", - "reverse": false, - "trig_direction": "R" - }, - { - "head_role": "作品类_实体", - "group": "上市于", - "tail_role": [ - { - "main": [ - "时间类_具体时间" - ], - "support": [] - } - ], - "trig_word": [ - "上市" - ], - "trig_type": "trigger", - "reverse": false, - "trig_direction": "L" - }, - { - "head_role": "组织机构类", - "group": "创建于", - "tail_role": [ - { - "main": [ - "时间类_具体时间" - ], - "support": [ - "世界地区类", - "组织机构类_国家机关" - ] - }, - { - "main": [ - "世界地区类", - "组织机构类_国家机关" - ], - "support": [ - "时间类_具体时间" - ] - } - ], - "trig_word": [ - "成立", - "创办", - "创建", - "建立", - "登记成立", - "成立登记" - ], - "trig_type": "trigger", - "reverse": false, - "trig_direction": "B" - }, - { - "head_role": "组织机构类", - "group": "创建人", - "tail_role": [ - { - "main": [ - "人物类_实体" - ], - "support": [] - } - ], - "trig_word": [ - "创办", - "创建" - ], - "trig_type": "trigger", - "reverse": false, - "trig_direction": "L", - "rel_group": "创建" - }, - { - "head_role": "组织机构类", - "group": "上市于", - "tail_role": [ - { - "main": [ - "时间类_具体时间" - ], - "support": [ - "{[education:场外交易市场]}" - ] - }, - { - "main": [ - "{[education:场外交易市场]}" - ], - "support": [ - "时间类_具体时间" - ] - } - ], - "trig_word": [ - "上市" - ], - "trig_type": "trigger", - "reverse": false, - "trig_direction": "L" - }, - { - "head_role": "组织机构类", - "group": "成立于", - "tail_role": [ - { - "main": [ - "时间类_具体时间" - ], - "support": [ - "世界地区类" - ] - }, - { - "main": [ - "世界地区类" - ], - "support": [ - "时间类_具体时间" - ] - } - ], - "trig_word": [ - "成立于", - "成立" - ], - "trig_type": "trigger", - "reverse": false, - "trig_direction": "B" - }, - { - "head_role": "时间类_具体时间", - "group": "成立于", - "tail_role": [ - { - "main": [ - "组织机构类" - ], - "support": [ - "世界地区类" - ] - } - ], - "trig_word": [ - "成立于", - "成立" - ], - "trig_type": "trigger", - "reverse": true, - "trig_direction": "L" - }, - { - "head_role": "组织机构类", - "group": "所属组织", - "tail_role": [ - { - "main": [ - "组织机构类" - ], - "support": [] - } - ], - "trig_word": [ - "隶属", - "隶属于" - ], - "trig_type": "trigger", - "reverse": false, - "trig_direction": "R" - }, - { - "head_role": "世界地区类", - "group": "所属地区", - "tail_role": [ - { - "main": [ - "世界地区类" - ], - "support": [] - } - ], - "trig_word": [ - "首都", - "省会", - "首府" - ], - "trig_type": "trigger", - "reverse": false, - "trig_direction": "L" - }, - { - "head_role": "世界地区类", - "group": "所属地区", - "tail_role": [ - { - "main": [ - "世界地区类" - ], - "support": [] - } - ], - "trig_word": [ - "隶属", - "隶属于" - ], - "trig_type": "trigger", - "reverse": false, - "trig_direction": "R" - }, - { - 
"head_role": "世界地区类", - "group": "所属地区", - "tail_role": [ - { - "main": [ - "世界地区类" - ], - "support": [] - } - ], - "trig_word": [ - "下辖" - ], - "trig_type": "trigger", - "reverse": false, - "trig_direction": "L" - }, - { - "head_role": "世界地区类", - "group": "官方语言", - "tail_role": [ - { - "main": [ - "文化类_语言文字" - ], - "support": [] - } - ], - "trig_word": [ - "官方语言" - ], - "trig_type": "trigger", - "reverse": false, - "trig_direction": "B" - }, - { - "head_role": "世界地区类", - "group": "海拔", - "tail_role": [ - { - "main": [ - "数量词" - ], - "support": [] - } - ], - "trig_word": [ - "海拔" - ], - "trig_type": "trigger", - "reverse": false, - "trig_direction": "R" - }, - { - "head_role": "世界地区类", - "group": "面积", - "tail_role": [ - { - "main": [ - "数量词" - ], - "support": [] - } - ], - "trig_word": [ - "面积", - "占地" - ], - "trig_type": "trigger", - "reverse": false, - "trig_direction": "R" - }, - { - "head_role": "场所类", - "group": "类型", - "tail_role": [ - { - "main": [ - "场所类_概念" - ], - "support": [] - } - ], - "trig_word": [ - "是" - ], - "trig_type": "trigger", - "reverse": false, - "trig_direction": "R" - }, - { - "head_role": "场所类", - "group": "面积", - "tail_role": [ - { - "main": [ - "数量词" - ], - "support": [] - } - ], - "trig_word": [ - "面积", - "占地" - ], - "trig_type": "trigger", - "reverse": false, - "trig_direction": "R" - }, - { - "head_role": "物体类", - "group": "类型", - "tail_role": [ - { - "main": [ - "物体类_概念" - ], - "support": [] - } - ], - "trig_word": [ - "是" - ], - "trig_type": "trigger", - "reverse": false, - "trig_direction": "R" - }, - { - "head_role": "物体类_兵器", - "group": "类型", - "tail_role": [ - { - "main": [ - "物体类_兵器" - ], - "support": [] - } - ], - "trig_word": [ - "是" - ], - "trig_type": "trigger", - "reverse": false, - "trig_direction": "R" - }, - { - "head_role": "物体类", - "group": "上市于", - "tail_role": [ - { - "main": [ - "时间类_具体时间" - ], - "support": [] - } - ], - "trig_word": [ - "上市" - ], - "trig_type": "trigger", - "reverse": false, - "trig_direction": "B" - }, - { - "head_role": "物体类", - "group": "制造方", - "tail_role": [ - { - "main": [ - "组织机构类_企事业单位" - ], - "support": [ - "时间类_具体时间" - ] - } - ], - "trig_word": [ - "生产", - "制造", - "推出", - "发布" - ], - "trig_type": "trigger", - "reverse": false, - "trig_direction": "L" - }, - { - "head_role": "品牌名", - "group": "类型", - "tail_role": [ - { - "main": [ - "品牌名_品牌类型" - ], - "support": [ - "世界地区类_国家" - ] - } - ], - "trig_word": [ - "是" - ], - "trig_type": "trigger", - "reverse": false, - "trig_direction": "R" - } -] diff --git a/examples/text_to_knowledge/wordtag/README.md b/examples/text_to_knowledge/wordtag/README.md deleted file mode 100644 index ce194b26e07b..000000000000 --- a/examples/text_to_knowledge/wordtag/README.md +++ /dev/null @@ -1,248 +0,0 @@ -# 解语:WordTag(中文词类知识标注工具) - -WordTag(中文词类知识标注工具)是首个能够覆盖所有中文词汇的词类知识标注工具,旨在为中文文本解析提供全面、丰富的知识标注结果,可以应用于模板(挖掘模板、解析模板)生成与匹配、知识挖掘(新词发现、关系挖掘)等自然语言处理任务中,提升文本解析与挖掘精度;也可以作为中文文本特征生成器,为各类机器学习模型提供文本特征。 - -![wordtag示例](../doc/img/wordtag_example.png) - -## WordTag特点 - -- **覆盖所有中文词汇的词类体系,更丰富的知识标注结果** - - WordTag使用的词类体系为覆盖所有中文词汇的词类体系,包括各类实体词与非实体词(如概念、实体/专名、语法词等)。WordTag开源版对部分类目(如组织机构等),做了更细类目的划分识别(如,医疗卫生机构、体育组织机构),对仅使用文本信息难以细分的类目(如人物类、作品类、品牌名等),不做更细粒度的词类识别。用户需要细粒度的词类识别时,可利用百科知识树的类别体系自行定制。 -- **整合百科知识树链接结果,获得更丰富的标注知识** - - 如上图示例所示,各个切分标注结果中,除词类标注外,还整合了百科知识树的链接结果,用户可以结合百科知识树数据共同使用:如,利用百科知识树中的subtype获得更细的上位粒度,利用term的百科信息获得更加丰富的知识等。 -- **可定制的词类序列标注框架** - - 
WordTag开源版标注使用的词类体系是我们在实践中对**百科文本**解析应用较好的一个版本,不同类型文本(如,搜索query、新闻资讯)的词类分布不同,用户可以利用百科知识树定制自己的词类体系和训练样本,构建自己的WordTag应用版,以获得更好的适配效果。例如,可将自定义的词表按照百科知识树的字段定义好,挂接/整合到百科知识树上,即可使用自己的Term数据定制标注样本和标注任务。 - -## 模型结构 - -模型使用[ERNIE-CTM](../ernie-ctm)+CRF训练而成,预测时使用viterbi解码,模型结构如下: - -wordtag模型结构 - - -## Term-Linking实现 - -WordTag提供从文本到百科知识树的链接方法,即Term-Linking,只需将term词类体系与百科知识树数据加载到工具中,即可在解析结果中得到term-linking结果。 - -为了能够适配应用中的不同实体集(例如,不同的企业有不同的人物实体集合,不同的小说站有不同的小说实体集合),我们将term-linking拆分为两个步骤: - -- 第一步是基于词类的linking,主要解决“同名概念词/实体词”、“不同类的同名词”消歧问题,这一步只使用文本本身特征和词类特征,不使用图谱中的实体属性值(SPO)知识,从而支持切换不同应用知识图谱; -- 第二步是同类同名实体词的linking,主要解决同类下不同属性值的实体消歧问题,这一步需要使用实体词的SPO知识(一般用于实体特征表示计算,以及文本-实体相似度计算)。 - -“WordTag+百科知识树”的开源版提供了第一步的解决示例,第二步由于依赖于特定图谱的SPO知识,无法提供通用工具,未来可能提供通用解决方案。 - -WordTag模型对所有的词预测到上位词类之后,会直接根据预测到的词类,映射到term体系(映射表参见代码配置),查找相应的term,进行link。用户也可根据自己的数据分布,定制term-linking策略: - -- link到自己定制的term词表:只需将term词表按照TermTree挂接好之后更换数据即可; -- 调整WordTag预测词类与term词表的映射关系(如,增加自定义类别):在代码配置中直接调整映射表即可。 - -## WordTag类别标签集合 - -WordTag共包含91种词性及专名类别标签,标签集合如下表 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
| WordTag标签集合(共91类) | | | | | | |
| --- | --- | --- | --- | --- | --- | --- |
| 人物类_实体 | 组织机构类_军事组织机构_概念 | 文化类_制度政策协议 | 位置方位 | 术语类_医药学术语 | 信息资料_性别 | 否定词 |
| 人物类_概念 | 组织机构类_医疗卫生机构 | 文化类_姓氏与人名 | 世界地区类 | 术语类_生物体 | 链接地址 | 数量词 |
| 作品类_实体 | 组织机构类_医疗卫生机构_概念 | 生物类 | 世界地区类_国家 | 疾病损伤类 | 个性特征 | 数量词_序数词 |
| 作品类_概念 | 组织机构类_教育组织机构 | 生物类_植物 | 世界地区类_区划概念 | 疾病损伤类_植物病虫害 | 感官特征 | 数量词_单位数量词 |
| 组织机构类 | 组织机构类_教育组织机构_概念 | 生物类_动物 | 世界地区类_地理概念 | 宇宙类 | 场景事件 | 叹词 |
| 组织机构类_概念 | 物体类 | 品牌名 | 饮食类 | 事件类 | 介词 | 拟声词 |
| 组织机构类_企事业单位 | 物体类_概念 | 品牌名_品牌类型 | 饮食类_菜品 | 时间类 | 介词_方位介词 | 修饰词 |
| 组织机构类_企事业单位_概念 | 物体类_兵器 | 场所类 | 饮食类_饮品 | 时间类_特殊日 | 助词 | 修饰词_性质 |
| 组织机构类_国家机关 | 物体类_化学物质 | 场所类_概念 | 药物类 | 时间类_朝代 | 代词 | 修饰词_类型 |
| 组织机构类_国家机关_概念 | 其他角色类 | 场所类_交通场所 | 药物类_中药 | 时间类_具体时间 | 连词 | 修饰词_化 |
| 组织机构类_体育组织机构 | 文化类 | 场所类_交通场所_概念 | 术语类 | 时间类_时长 | 副词 | 外语单词 |
| 组织机构类_体育组织机构_概念 | 文化类_语言文字 | 场所类_网上场所 | 术语类_术语类型 | 词汇用语 | 疑问词 | 汉语拼音 |
| 组织机构类_军事组织机构 | 文化类_奖项赛事活动 | 场所类_网上场所_概念 | 术语类_符号指标类 | 信息资料 | 肯定词 | w(标点) |
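上表中的标签名即预测结果中 `wordtag_label` 字段的取值。结合下文"WordTag示例代码"中的 Taskflow 接口,下面给出一个按标签筛选实体词的最小示例(仅作用法示意,筛选条件 `endswith("_实体")` 可按需替换为其他标签规则):

```python
from paddlenlp import Taskflow

# 加载WordTag标注工具,词类标签体系见上表
wordtag = Taskflow("knowledge_mining", model="wordtag", linking=True)

# 只保留标签以"_实体"结尾的切分结果,实现简单的实体词抽取
res = wordtag(["《孤女》是2010年九州出版社出版的小说,作者是余兼羽。"])
entities = [item for item in res[0]["items"] if item["wordtag_label"].endswith("_实体")]
print(entities)
# 预期可筛出"孤女"(作品类_实体)、"余兼羽"(人物类_实体)等实体词
```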
- - -## WordTag应用场景 - -参见"[解语的应用场景](../)" - - -## WordTag示例代码 -下面提供了WordTag模型进行文本到百科知识树链接的示例程序。 - -### Term-Linking示例程序 - -Term-Linking示例程序可以对无标签数据启动模型预测, 例如想对下面几段文本进行百科知识树的链接解析 -``` -"《孤女》是2010年九州出版社出版的小说,作者是余兼羽。", -"热梅茶是一道以梅子为主要原料制作的茶饮" -``` - -执行下面的脚本即可快速获取上面两段文本的百科知识树链接的结果 - -```python -from paddlenlp import Taskflow -wordtag = Taskflow("knowledge_mining", model="wordtag", linking=True) -wordtag(["热梅茶是一道以梅子为主要原料制作的茶饮", - "《孤女》是2010年九州出版社出版的小说,作者是余兼羽"]) -# Support the input text directly -wordtag("热梅茶是一道以梅子为主要原料制作的茶饮") - -``` -下面是运行WordTag工具后的知识链接的预测结果 - -```json -[{'text': '《孤女》是2010年九州出版社出版的小说,作者是余兼羽。', 'items': [{'item': '《', 'offset': 0, 'wordtag_label': 'w', 'length': 1}, {'item': '孤女', 'offset': 1, 'wordtag_label': '作品类_实体', 'length': 2, 'termid': '小说_eb_孤女'}, {'item': '》', 'offset': 3, 'wordtag_label': 'w', 'length': 1}, {'item': '是', 'offset': 4, 'wordtag_label': '肯定词', 'length': 1, 'termid': '肯定否定词_cb_是'}, {'item': '2010年', 'offset': 5, 'wordtag_label': '时间类', 'length': 5, 'termid': '时间阶段_cb_2010年'}, {'item': '九州出版社', 'offset': 10, 'wordtag_label': '组织机构类', 'length': 5, 'termid': '组织机构_eb_九州出版社'}, {'item': '出版', 'offset': 15, 'wordtag_label': '场景事件', 'length': 2, 'termid': '场景事件_cb_出版'}, {'item': '的', 'offset': 17, 'wordtag_label': '助词', 'length': 1, 'termid': '助词_cb_的'}, {'item': '小说', 'offset': 18, 'wordtag_label': '作品类_概念', 'length': 2, 'termid': '小说_cb_小说'}, {'item': ',', 'offset': 20, 'wordtag_label': 'w', 'length': 1}, {'item': '作者', 'offset': 21, 'wordtag_label': '人物类_概念', 'length': 2, 'termid': '人物_cb_作者'}, {'item': '是', 'offset': 23, 'wordtag_label': '肯定词', 'length': 1, 'termid': '肯定否定词_cb_是'}, {'item': '余兼羽', 'offset': 24, 'wordtag_label': '人物类_实体', 'length': 3}, {'item': '。', 'offset': 27, 'wordtag_label': 'w', 'length': 1}]}, {'text': '热梅茶是一道以梅子为主要原料制作的茶饮', 'items': [{'item': '热梅茶', 'offset': 0, 'wordtag_label': '饮食类_饮品', 'length': 3}, {'item': '是', 'offset': 3, 'wordtag_label': '肯定词', 'length': 1, 'termid': '肯定否定词_cb_是'}, {'item': '一道', 'offset': 4, 'wordtag_label': '数量词', 'length': 2}, {'item': '以', 'offset': 6, 'wordtag_label': '介词', 'length': 1, 'termid': '介词_cb_以'}, {'item': '梅子', 'offset': 7, 'wordtag_label': '饮食类', 'length': 2, 'termid': '饮食_cb_梅'}, {'item': '为', 'offset': 9, 'wordtag_label': '肯定词', 'length': 1, 'termid': '肯定否定词_cb_为'}, {'item': '主要原料', 'offset': 10, 'wordtag_label': '物体类', 'length': 4, 'termid': '物品_cb_主要原料'}, {'item': '制作', 'offset': 14, 'wordtag_label': '场景事件', 'length': 2, 'termid': '场景事件_cb_制作'}, {'item': '的', 'offset': 16, 'wordtag_label': '助词', 'length': 1, 'termid': '助词_cb_的'}, {'item': '茶饮', 'offset': 17, 'wordtag_label': '饮食类_饮品', 'length': 2, 'termid': '饮品_cb_茶饮'}]}] -{'text': '热梅茶是一道以梅子为主要原料制作的茶饮', 'items': [{'item': '热梅茶', 'offset': 0, 'wordtag_label': '饮食类_饮品', 'length': 3}, {'item': '是', 'offset': 3, 'wordtag_label': '肯定词', 'length': 1, 'termid': '肯定否定词_cb_是'}, {'item': '一道', 'offset': 4, 'wordtag_label': '数量词', 'length': 2}, {'item': '以', 'offset': 6, 'wordtag_label': '介词', 'length': 1, 'termid': '介词_cb_以'}, {'item': '梅子', 'offset': 7, 'wordtag_label': '饮食类', 'length': 2, 'termid': '饮食_cb_梅'}, {'item': '为', 'offset': 9, 'wordtag_label': '肯定词', 'length': 1, 'termid': '肯定否定词_cb_为'}, {'item': '主要原料', 'offset': 10, 'wordtag_label': '物体类', 'length': 4, 'termid': '物品_cb_主要原料'}, {'item': '制作', 'offset': 14, 'wordtag_label': '场景事件', 'length': 2, 'termid': '场景事件_cb_制作'}, {'item': '的', 'offset': 16, 'wordtag_label': '助词', 'length': 1, 'termid': '助词_cb_的'}, {'item': '茶饮', 'offset': 17, 'wordtag_label': '饮食类_饮品', 'length': 2, 'termid': 
'饮品_cb_茶饮'}]} -``` - -同时我们也提供了基于上述taskflow的python执行脚本,具体的执行方式如下: -```shell -python predict.py --max_seq_len 128 --batch_size 2 -``` -其中参数释义如下: -- `max_seq_len` 表示最大句子长度,超过该长度将被截断。 -- `batch_size` 表示每个预测批次的样本数目。 - -## WordTag进阶使用 - -### 自定义模型一键预测 - -用户可以使用自有数据对WordTag模型进行增量训练,然后使用Taskflow进行一键预测,参见[WordTag增量训练示例](../ernie-ctm)。 - -### 自定义Term-Linking - -Taskflow默认使用TermTreeV1.0实现Term-Linking, 用户也可以基于自己的TermTree实现Term-Linking,参见[自定义TermTree](../termtree)。 - -## Release Note - -- 2022.06:新增25个细化词类,用于下游挖掘任务 - -## WordTag后续计划 - -1. 持续优化知识标注模型,获得更加精准的标注结果; -2. 发布多粒度、多种参数规模的知识标注模型; -3. 提供细粒度term及subterm消歧的解决方案。 - - -## 在论文中引用WordTag - -如果您的工作成果中使用了WordTag,请增加下述引用。我们非常乐于看到WordTag对您的工作带来帮助。 -``` -@article{zhao2020TermTree, - title={TermTree and Knowledge Annotation Framework for Chinese Language Understanding}, - author={Zhao, Min and Qin, Huapeng and Zhang, Guoxin and Lyu, Yajuan and Zhu, Yong}, - technical report={Baidu, Inc. TR:2020-KG-TermTree}, - year={2020} -} -``` - - - -## 问题与反馈 - -WordTag在持续优化中,如果您有任何建议或问题,欢迎提交issue到Github。 diff --git a/examples/text_to_knowledge/wordtag/predict.py b/examples/text_to_knowledge/wordtag/predict.py deleted file mode 100644 index 5f4fe353782d..000000000000 --- a/examples/text_to_knowledge/wordtag/predict.py +++ /dev/null @@ -1,56 +0,0 @@ -# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -import argparse - -import paddle - -from paddlenlp import Taskflow - - -def parse_args(): - parser = argparse.ArgumentParser() - - # fmt: off - parser.add_argument("--max_seq_len", default=128, type=int, help="The maximum total input sequence length after tokenization. 
Sequences longer than this will be truncated, sequences shorter will be padded.", ) - parser.add_argument("--batch_size", default=32, type=int, help="Batch size per GPU/CPU for training.", ) - parser.add_argument("--device", default="gpu", type=str, choices=["cpu", "gpu", "xpu"], help="The device to select to train the model, is must be cpu/gpu/xpu.") - # fmt: on - - args = parser.parse_args() - return args - - -def do_predict(args): - paddle.set_device(args.device) - wordtag = Taskflow( - "knowledge_mining", model="wordtag", batch_size=args.batch_size, max_seq_length=args.max_seq_len, linking=True - ) - txts = ["《孤女》是2010年九州出版社出版的小说,作者是余兼羽。", "热梅茶是一道以梅子为主要原料制作的茶饮"] - res = wordtag(txts) - print(res) - - -def print_arguments(args): - """print arguments""" - print("----------- Configuration Arguments -----------") - for arg, value in sorted(vars(args).items()): - print("%s: %s" % (arg, value)) - print("------------------------------------------------") - - -if __name__ == "__main__": - args = parse_args() - print_arguments(args) - do_predict(args) diff --git a/examples/benchmark/ceval/README.md b/legacy/examples/benchmark/ceval/README.md similarity index 100% rename from examples/benchmark/ceval/README.md rename to legacy/examples/benchmark/ceval/README.md diff --git a/examples/benchmark/ceval/eval.py b/legacy/examples/benchmark/ceval/eval.py similarity index 100% rename from examples/benchmark/ceval/eval.py rename to legacy/examples/benchmark/ceval/eval.py diff --git a/examples/benchmark/ceval/evaluator.py b/legacy/examples/benchmark/ceval/evaluator.py similarity index 100% rename from examples/benchmark/ceval/evaluator.py rename to legacy/examples/benchmark/ceval/evaluator.py diff --git a/examples/benchmark/ceval/model_evaluator.py b/legacy/examples/benchmark/ceval/model_evaluator.py similarity index 100% rename from examples/benchmark/ceval/model_evaluator.py rename to legacy/examples/benchmark/ceval/model_evaluator.py diff --git a/examples/benchmark/ceval/subject_mapping.json b/legacy/examples/benchmark/ceval/subject_mapping.json similarity index 100% rename from examples/benchmark/ceval/subject_mapping.json rename to legacy/examples/benchmark/ceval/subject_mapping.json diff --git a/examples/benchmark/clue/README.md b/legacy/examples/benchmark/clue/README.md similarity index 100% rename from examples/benchmark/clue/README.md rename to legacy/examples/benchmark/clue/README.md diff --git a/examples/benchmark/clue/classification/run_clue_classifier.py b/legacy/examples/benchmark/clue/classification/run_clue_classifier.py similarity index 100% rename from examples/benchmark/clue/classification/run_clue_classifier.py rename to legacy/examples/benchmark/clue/classification/run_clue_classifier.py diff --git a/examples/benchmark/clue/classification/run_clue_classifier_trainer.py b/legacy/examples/benchmark/clue/classification/run_clue_classifier_trainer.py similarity index 100% rename from examples/benchmark/clue/classification/run_clue_classifier_trainer.py rename to legacy/examples/benchmark/clue/classification/run_clue_classifier_trainer.py diff --git a/examples/benchmark/clue/grid_search_tools/draw_pic.py b/legacy/examples/benchmark/clue/grid_search_tools/draw_pic.py similarity index 100% rename from examples/benchmark/clue/grid_search_tools/draw_pic.py rename to legacy/examples/benchmark/clue/grid_search_tools/draw_pic.py diff --git a/examples/benchmark/clue/grid_search_tools/extract_result.sh b/legacy/examples/benchmark/clue/grid_search_tools/extract_result.sh similarity index 100% rename 
from examples/benchmark/clue/grid_search_tools/extract_result.sh rename to legacy/examples/benchmark/clue/grid_search_tools/extract_result.sh diff --git a/examples/benchmark/clue/grid_search_tools/grid_search.py b/legacy/examples/benchmark/clue/grid_search_tools/grid_search.py similarity index 100% rename from examples/benchmark/clue/grid_search_tools/grid_search.py rename to legacy/examples/benchmark/clue/grid_search_tools/grid_search.py diff --git a/examples/benchmark/clue/grid_search_tools/run_cls.sh b/legacy/examples/benchmark/clue/grid_search_tools/run_cls.sh similarity index 100% rename from examples/benchmark/clue/grid_search_tools/run_cls.sh rename to legacy/examples/benchmark/clue/grid_search_tools/run_cls.sh diff --git a/examples/benchmark/clue/grid_search_tools/run_mrc.sh b/legacy/examples/benchmark/clue/grid_search_tools/run_mrc.sh similarity index 100% rename from examples/benchmark/clue/grid_search_tools/run_mrc.sh rename to legacy/examples/benchmark/clue/grid_search_tools/run_mrc.sh diff --git a/examples/benchmark/clue/grid_search_tools/warmup_dataset_and_model.py b/legacy/examples/benchmark/clue/grid_search_tools/warmup_dataset_and_model.py similarity index 100% rename from examples/benchmark/clue/grid_search_tools/warmup_dataset_and_model.py rename to legacy/examples/benchmark/clue/grid_search_tools/warmup_dataset_and_model.py diff --git a/examples/benchmark/clue/mrc/run_c3.py b/legacy/examples/benchmark/clue/mrc/run_c3.py similarity index 100% rename from examples/benchmark/clue/mrc/run_c3.py rename to legacy/examples/benchmark/clue/mrc/run_c3.py diff --git a/examples/benchmark/clue/mrc/run_chid.py b/legacy/examples/benchmark/clue/mrc/run_chid.py similarity index 100% rename from examples/benchmark/clue/mrc/run_chid.py rename to legacy/examples/benchmark/clue/mrc/run_chid.py diff --git a/examples/benchmark/clue/mrc/run_cmrc2018.py b/legacy/examples/benchmark/clue/mrc/run_cmrc2018.py similarity index 100% rename from examples/benchmark/clue/mrc/run_cmrc2018.py rename to legacy/examples/benchmark/clue/mrc/run_cmrc2018.py diff --git a/examples/benchmark/glue/README.md b/legacy/examples/benchmark/glue/README.md similarity index 100% rename from examples/benchmark/glue/README.md rename to legacy/examples/benchmark/glue/README.md diff --git a/examples/benchmark/glue/run_glue.py b/legacy/examples/benchmark/glue/run_glue.py similarity index 100% rename from examples/benchmark/glue/run_glue.py rename to legacy/examples/benchmark/glue/run_glue.py diff --git a/examples/benchmark/glue/run_glue_trainer.py b/legacy/examples/benchmark/glue/run_glue_trainer.py similarity index 100% rename from examples/benchmark/glue/run_glue_trainer.py rename to legacy/examples/benchmark/glue/run_glue_trainer.py diff --git a/examples/benchmark/peft/README.md b/legacy/examples/benchmark/peft/README.md similarity index 100% rename from examples/benchmark/peft/README.md rename to legacy/examples/benchmark/peft/README.md diff --git a/examples/benchmark/peft/paddle/benchmark.py b/legacy/examples/benchmark/peft/paddle/benchmark.py similarity index 100% rename from examples/benchmark/peft/paddle/benchmark.py rename to legacy/examples/benchmark/peft/paddle/benchmark.py diff --git a/examples/benchmark/peft/paddle/inference_benchmark.py b/legacy/examples/benchmark/peft/paddle/inference_benchmark.py similarity index 100% rename from examples/benchmark/peft/paddle/inference_benchmark.py rename to legacy/examples/benchmark/peft/paddle/inference_benchmark.py diff --git a/examples/benchmark/peft/paddle/utils.py 
b/legacy/examples/benchmark/peft/paddle/utils.py similarity index 100% rename from examples/benchmark/peft/paddle/utils.py rename to legacy/examples/benchmark/peft/paddle/utils.py diff --git a/examples/benchmark/peft/torch/benchmark.py b/legacy/examples/benchmark/peft/torch/benchmark.py similarity index 100% rename from examples/benchmark/peft/torch/benchmark.py rename to legacy/examples/benchmark/peft/torch/benchmark.py diff --git a/examples/benchmark/peft/torch/ds_config_stage2.json b/legacy/examples/benchmark/peft/torch/ds_config_stage2.json similarity index 100% rename from examples/benchmark/peft/torch/ds_config_stage2.json rename to legacy/examples/benchmark/peft/torch/ds_config_stage2.json diff --git a/examples/benchmark/peft/torch/ds_config_stage3.json b/legacy/examples/benchmark/peft/torch/ds_config_stage3.json similarity index 100% rename from examples/benchmark/peft/torch/ds_config_stage3.json rename to legacy/examples/benchmark/peft/torch/ds_config_stage3.json diff --git a/examples/benchmark/peft/torch/inference_benchmark.py b/legacy/examples/benchmark/peft/torch/inference_benchmark.py similarity index 100% rename from examples/benchmark/peft/torch/inference_benchmark.py rename to legacy/examples/benchmark/peft/torch/inference_benchmark.py diff --git a/examples/benchmark/peft/torch/requirements.txt b/legacy/examples/benchmark/peft/torch/requirements.txt similarity index 100% rename from examples/benchmark/peft/torch/requirements.txt rename to legacy/examples/benchmark/peft/torch/requirements.txt diff --git a/examples/benchmark/peft/torch/utils.py b/legacy/examples/benchmark/peft/torch/utils.py similarity index 100% rename from examples/benchmark/peft/torch/utils.py rename to legacy/examples/benchmark/peft/torch/utils.py diff --git a/examples/benchmark/wiki_lambada/README.md b/legacy/examples/benchmark/wiki_lambada/README.md similarity index 100% rename from examples/benchmark/wiki_lambada/README.md rename to legacy/examples/benchmark/wiki_lambada/README.md diff --git a/examples/benchmark/wiki_lambada/eval.py b/legacy/examples/benchmark/wiki_lambada/eval.py similarity index 100% rename from examples/benchmark/wiki_lambada/eval.py rename to legacy/examples/benchmark/wiki_lambada/eval.py diff --git a/examples/dialogue/dgu/README.md b/legacy/examples/dialogue/README_DGU.md similarity index 97% rename from examples/dialogue/dgu/README.md rename to legacy/examples/dialogue/README_DGU.md index 90a9f57ec209..98541b51fb59 100644 --- a/examples/dialogue/dgu/README.md +++ b/legacy/examples/dialogue/README_DGU.md @@ -1,5 +1,7 @@ # 对话通用理解模型 (DGU, Dialogue General Understanding) +> **注意** 部分内容在PaddleNLP 3.0以后不再进行维护,更多历史内容请参考[PaddleNLP 2.8](https://github.com/PaddlePaddle/PaddleNLP/tree/release/2.8/examples/dialogue)。 + ## 模型简介 对话系统 (Dialogue System) 常常需要根据应用场景的变化去解决多种多样的任务。任务的多样性(意图识别、槽填充、行为识别、状态追踪等等),以及领域训练数据的稀少,给Dialogue System的研究和应用带来了巨大的困难和挑战,要使得Dialogue System得到更好的发展,需要开发一个通用的对话理解模型。为此,我们给出了基于BERT的对话通用理解模型 (DGU: Dialogue General Understanding),通过实验表明,使用base-model (BERT)并结合常见的学习范式,就可以在几乎全部对话理解任务上取得比肩甚至超越各个领域业内最好的模型的效果,展现了学习一个通用对话理解模型的巨大潜力。 diff --git a/examples/dialogue/lic2021_baseline/README.md b/legacy/examples/dialogue/README_LIC2021.md similarity index 97% rename from examples/dialogue/lic2021_baseline/README.md rename to legacy/examples/dialogue/README_LIC2021.md index b0354b0f18d0..3a882a2b2637 100644 --- a/examples/dialogue/lic2021_baseline/README.md +++ b/legacy/examples/dialogue/README_LIC2021.md @@ -1,5 +1,7 @@ # LIC 2021对话比赛baseline +> **注意** 部分内容在PaddleNLP 
3.0以后不再进行维护,更多历史内容请参考[PaddleNLP 2.8](https://github.com/PaddlePaddle/PaddleNLP/tree/release/2.8/examples/dialogue)。
+
 ## 模型简介
 近年来,人机对话系统受到了学术界和产业界的广泛关注并取得了不错的发展。开放域对话系统旨在建立一个开放域的多轮对话系统,使得机器可以流畅自然地与人进行语言交互,既可以进行日常问候类的闲聊,又可以完成特定功能,以使得开放域对话系统具有实际应用价值,例如进行对话式推荐,或围绕一个主题进行深入的知识对话等。具体的说,开放域对话可以继续拆分为支持不同功能的对话形式,例如对话式推荐,知识对话技术等,如何解决并有效融合以上多个技能面临诸多挑战。
diff --git a/examples/dialogue/plato-2/README.md b/legacy/examples/dialogue/README_PLATO-2.md
similarity index 92%
rename from examples/dialogue/plato-2/README.md
rename to legacy/examples/dialogue/README_PLATO-2.md
index 20d6e6644d70..479586c861a6 100644
--- a/examples/dialogue/plato-2/README.md
+++ b/legacy/examples/dialogue/README_PLATO-2.md
@@ -1,5 +1,7 @@
 # PLATO-2
 
+> **注意** 部分内容在PaddleNLP 3.0以后不再进行维护,更多历史内容请参考[PaddleNLP 2.8](https://github.com/PaddlePaddle/PaddleNLP/tree/release/2.8/examples/dialogue)。
+
 ## 模型简介
 
 构建高质量的开放领域(Open-Domain)的对话机器人,使得它能用自然语言与人自由地交流,这一直是自然语言处理领域终极目标之一。
diff --git a/legacy/examples/dialogue/README_PLATO-XL.md b/legacy/examples/dialogue/README_PLATO-XL.md
new file mode 100644
index 000000000000..ac28944d6b1d
--- /dev/null
+++ b/legacy/examples/dialogue/README_PLATO-XL.md
@@ -0,0 +1,150 @@
+# PLATO-XL: Exploring the Large-scale Pre-training of Dialogue Generation
+
+> **注意** 部分内容在PaddleNLP 3.0以后不再进行维护,更多历史内容请参考[PaddleNLP 2.8](https://github.com/PaddlePaddle/PaddleNLP/tree/release/2.8/examples/dialogue)。
+
+## 模型简介
+
+构建高质量的开放领域(Open-Domain)的对话机器人,使得它能用自然语言与人自由地交流,这一直是自然语言处理领域终极目标之一。
+
+PLATO-XL 是业界首个开源的百亿超大规模开放域对话预训练模型,其使用了参数高效(encoder-decoder共享参数)的 UnifiedTransformer(prefix LM)模型架构,将模型参数量提升到了11B量级,经过了十亿级样本对话数据的预训练,并引入role embedding区分多方对话中的对话角色提升预训练效果,最终模型闲聊测试效果超过了众多代表性的对话模型。可以直接使用 PLATO-XL 构建高质量的开放领域对话机器人。
+
+PaddleNLP 内置了 PLATO-XL 英文预训练模型以供使用。由于 PLATO-XL 模型规模较大,这使得其在预测时生成对话回复的时间较长,并且 11B 的参数量也可能超出部分型号 GPU 的显存容量,这是大模型推理与落地存在的普遍和关键问题。PaddleNLP FastGeneration 为 PLATO-XL 提供了 GPU 上的高性能生成加速能力,并且支持模型并行(张量并行)推理,允许通过多张小显存容量的 GPU 使用百亿大模型;相比单卡,代码中也只需增加 `enable_ft_para()` 一行,此外模型并行还能进一步提升预测速度。
+
+本项目提供了 PLATO-XL 英文模型使用 PaddleNLP FastGeneration 进行高性能预测的使用示例。PLATO-XL 的训练及更多内容请参考 [PaddlePaddle/Knover](https://github.com/PaddlePaddle/Knover/tree/develop/projects/PLATO-XL)。
+
+## 开始运行
+### 单卡高性能推理
+
+`infer.py` 是 PLATO-XL 高性能预测使用示例脚本,可以使用如下命令运行:
+
+```shell
+python infer.py --topk 4 --max_out_len 64 --use_faster --use_fp16
+```
+
+该脚本各个参数含义如下:
+
+- `topk` 用于Top-K采样策略,采样时将只从概率最高的K个token中采样,默认为1,即greedy search。
+- `topp` 用于Top-P采样策略,采样时将只从概率最高且累加概率不超过该值的token中采样,默认为1.0。
+- `max_out_len` 指定生成的最大长度,默认为64。
+- `min_out_len` 指定生成的最小长度,默认为1。
+- `temperature` 用于调整预测概率分布,默认为1.0,即保持模型原有的预测概率。
+- `use_faster` 是否使用 FastGeneration 高性能推理。
+- `use_fp16` 是否使用FP16,只在使用FastGeneration时生效。
+
+脚本中使用了一条如下的多轮对话样本数据,由 `List[str]` 表示,其中每个 `str` 表示一句话,将根据历史对话内容生成回复。
+
+```python
+    history = [
+        "hi , Mary ! What do you usually like to do in your spare time ?",
+        "well , I spend a lot of time watching movies .",
+        "what a confidence ! I always watch a lot of movies , too .",
+        "oh really , Frank ? What kind of movies do you like ?"
+    ]
+```
+
+**注意** 由于 PLATO-XL 模型较大,单卡预测至少需要22G显存(使用FP16时),且模型下载需要一定时间(FP32的权重文件约41G)。
+
+### 多卡并行推理
+
+多卡并行推理当前依赖 MPI([MPICH](https://www.mpich.org)、[OpenMPI](https://www.open-mpi.org)均可)和[NCCL](https://developer.nvidia.com/nccl),如需使用还请先安装依赖。安装完成后仍然使用 `infer.py` 来进行预测,相比单卡时不同的只是通过 mpi 来启动运行,如下:
+
+```shell
+mpirun -n 4 python infer.py --topk 4 --max_out_len 64 --use_faster --use_fp16
+```
+
+其中 `-n 4` 指明使用的进程和 GPU 数;`n` 设置为 1 时仍将进行单卡推理。由于多卡并行推理与单卡推理使用不同的依赖库,第一次运行时将重新进行 JIT 编译。
+
+### 性能测试
+
+`infer.py` 中同时提供了性能测试的支持,在上面预测命令的基础上加上 `--profile` 即可,如下:
+
+```shell
+mpirun -n 4 python infer.py --batch_size 8 --min_out_len 20 --max_out_len 20 --topk 1 --use_faster --use_fp16 --profile
+```
+
+此外还可以指定 `batch_size` 和 `min_out_len` 来得到特定输入输出大小下的性能,性能测试将给出循环运行多次的平均时延。以下为单卡高性能推理和 4 卡张量并行推理的性能数据(V100,CUDA 10.2,输入长度 60、输出长度 20),可以看出 4 卡并行速度约为单卡的 2 倍。
+
+**PLATO-XL 高性能推理速度(单位:ms/batch)**
+
+| batch size | K | FastGeneration 1卡 FP16 | FastGeneration 4卡 FP16 | 多卡并行 SpeedUp |
+|:---:|:---:|:---:|:---:|:---:|
+| 1 | 1 | 706.937 | 348.653 | 2.027 |
+| 1 | 10 | 707.514 | 348.699 | 2.029 |
+| 4 | 1 | 768.597 | 384.730 | 1.997 |
+| 4 | 10 | 770.008 | 385.244 | 1.998 |
+| 8 | 1 | 862.017 | 418.313 | 2.060 |
+| 8 | 10 | 866.490 | 418.965 | 2.068 |
+| 16 | 1 | 1016.362 | 486.974 | 2.087 |
+| 16 | 10 | 1060.472 | 488.156 | 2.172 |
+| 32 | 1 | 1325.700 | 606.770 | 2.184 |
+| 32 | 10 | 1326.222 | 608.479 | 2.179 |
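+
+上表基于 `infer.py` 测得。如果只是想理解 PLATO-XL 的基本调用流程,下面给出一个示意性的最小动态图示例:它使用 PaddleNLP 通用的 `UnifiedTransformerLMHeadModel` / `UnifiedTransformerTokenizer` 接口,未启用 FastGeneration 与模型并行,也省略了 `infer.py` 中更完整的回复后处理,仅供参考,实际使用请以 `infer.py` 为准:
+
+```python
+import paddle
+from paddlenlp.transformers import (
+    UnifiedTransformerLMHeadModel,
+    UnifiedTransformerTokenizer,
+)
+
+# 加载 PLATO-XL 预训练权重与分词器(首次运行需要下载较大的权重文件)
+model = UnifiedTransformerLMHeadModel.from_pretrained("plato-xl")
+tokenizer = UnifiedTransformerTokenizer.from_pretrained("plato-xl")
+model.eval()
+
+history = [
+    "hi , Mary ! What do you usually like to do in your spare time ?",
+    "well , I spend a lot of time watching movies .",
+]
+
+# 将多轮对话历史编码为模型输入,并在末尾加入回复的起始标记
+inputs = tokenizer.dialogue_encode(
+    history,
+    add_start_token_as_response=True,
+    return_tensors=True,
+    is_split_into_words=False,
+)
+
+# Top-K 采样生成回复,大致对应 infer.py 的 --topk / --max_out_len 参数
+with paddle.no_grad():
+    ids, scores = model.generate(
+        input_ids=inputs["input_ids"],
+        token_type_ids=inputs["token_type_ids"],
+        position_ids=inputs["position_ids"],
+        attention_mask=inputs["attention_mask"],
+        max_length=64,
+        decode_strategy="sampling",
+        top_k=4,
+    )
+
+# 简化的后处理:直接把生成的 token 还原成文本(未做子词合并与结束符截断)
+tokens = tokenizer.convert_ids_to_tokens(ids[0].numpy().tolist())
+print(" ".join(tokens))
+```
+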
+ +## Reference + +1. Bao S, He H, Wang F, et al. PLATO-XL: Exploring the Large-scale Pre-training of Dialogue Generation[J]. arXiv preprint arXiv:2109.09519, 2021. diff --git a/examples/dialogue/unified_transformer/README.md b/legacy/examples/dialogue/README_UnifiedTransformer.md similarity index 98% rename from examples/dialogue/unified_transformer/README.md rename to legacy/examples/dialogue/README_UnifiedTransformer.md index 45cf6f32b239..ea139132423b 100644 --- a/examples/dialogue/unified_transformer/README.md +++ b/legacy/examples/dialogue/README_UnifiedTransformer.md @@ -1,5 +1,7 @@ # UnifiedTransformer +> **注意** 部分内容在PaddleNLP 3.0以后不再进行维护,更多历史内容请参考[PaddleNLP 2.8](https://github.com/PaddlePaddle/PaddleNLP/tree/release/2.8/examples/dialogue)。 + ## 模型简介 近年来,人机对话系统受到了学术界和产业界的广泛关注并取得了不错的发展。开放域对话系统旨在建立一个开放域的多轮对话系统,使得机器可以流畅自然地与人进行语言交互,既可以进行日常问候类的闲聊,又可以完成特定功能,以使得开放域对话系统具有实际应用价值。具体的说,开放域对话可以继续拆分为支持不同功能的对话形式,例如对话式推荐,知识对话技术等,如何解决并有效融合以上多个技能面临诸多挑战。 diff --git a/examples/information_extraction/DuEE/README.md b/legacy/examples/information_extraction/DuEE/README.md similarity index 100% rename from examples/information_extraction/DuEE/README.md rename to legacy/examples/information_extraction/DuEE/README.md diff --git a/examples/information_extraction/DuEE/classifier.py b/legacy/examples/information_extraction/DuEE/classifier.py similarity index 100% rename from examples/information_extraction/DuEE/classifier.py rename to legacy/examples/information_extraction/DuEE/classifier.py diff --git a/examples/information_extraction/DuEE/duee_1_data_prepare.py b/legacy/examples/information_extraction/DuEE/duee_1_data_prepare.py similarity index 100% rename from examples/information_extraction/DuEE/duee_1_data_prepare.py rename to legacy/examples/information_extraction/DuEE/duee_1_data_prepare.py diff --git a/examples/information_extraction/DuEE/duee_1_postprocess.py b/legacy/examples/information_extraction/DuEE/duee_1_postprocess.py similarity index 100% rename from examples/information_extraction/DuEE/duee_1_postprocess.py rename to legacy/examples/information_extraction/DuEE/duee_1_postprocess.py diff --git a/examples/information_extraction/DuEE/duee_fin_data_prepare.py b/legacy/examples/information_extraction/DuEE/duee_fin_data_prepare.py similarity index 100% rename from examples/information_extraction/DuEE/duee_fin_data_prepare.py rename to legacy/examples/information_extraction/DuEE/duee_fin_data_prepare.py diff --git a/examples/information_extraction/DuEE/duee_fin_postprocess.py b/legacy/examples/information_extraction/DuEE/duee_fin_postprocess.py similarity index 100% rename from examples/information_extraction/DuEE/duee_fin_postprocess.py rename to legacy/examples/information_extraction/DuEE/duee_fin_postprocess.py diff --git a/examples/information_extraction/DuEE/pictures/DuEE-Fin/ee.png b/legacy/examples/information_extraction/DuEE/pictures/DuEE-Fin/ee.png similarity index 100% rename from examples/information_extraction/DuEE/pictures/DuEE-Fin/ee.png rename to legacy/examples/information_extraction/DuEE/pictures/DuEE-Fin/ee.png diff --git a/examples/information_extraction/DuEE/pictures/DuEE-Fin/enum_model.png b/legacy/examples/information_extraction/DuEE/pictures/DuEE-Fin/enum_model.png similarity index 100% rename from examples/information_extraction/DuEE/pictures/DuEE-Fin/enum_model.png rename to legacy/examples/information_extraction/DuEE/pictures/DuEE-Fin/enum_model.png diff --git a/examples/information_extraction/DuEE/pictures/DuEE-Fin/role_model.png 
b/legacy/examples/information_extraction/DuEE/pictures/DuEE-Fin/role_model.png similarity index 100% rename from examples/information_extraction/DuEE/pictures/DuEE-Fin/role_model.png rename to legacy/examples/information_extraction/DuEE/pictures/DuEE-Fin/role_model.png diff --git a/examples/information_extraction/DuEE/pictures/DuEE-Fin/trigger_model.png b/legacy/examples/information_extraction/DuEE/pictures/DuEE-Fin/trigger_model.png similarity index 100% rename from examples/information_extraction/DuEE/pictures/DuEE-Fin/trigger_model.png rename to legacy/examples/information_extraction/DuEE/pictures/DuEE-Fin/trigger_model.png diff --git a/examples/information_extraction/DuEE/run_classifier.sh b/legacy/examples/information_extraction/DuEE/run_classifier.sh similarity index 77% rename from examples/information_extraction/DuEE/run_classifier.sh rename to legacy/examples/information_extraction/DuEE/run_classifier.sh index a75fcaef4635..a5b0c8f604ab 100644 --- a/examples/information_extraction/DuEE/run_classifier.sh +++ b/legacy/examples/information_extraction/DuEE/run_classifier.sh @@ -1,3 +1,17 @@ +# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + data_dir=${1} conf_path=${2} ckpt_dir=${3} diff --git a/examples/information_extraction/DuEE/run_duee_1.sh b/legacy/examples/information_extraction/DuEE/run_duee_1.sh similarity index 78% rename from examples/information_extraction/DuEE/run_duee_1.sh rename to legacy/examples/information_extraction/DuEE/run_duee_1.sh index 4d599594123c..6522bdf8bc85 100644 --- a/examples/information_extraction/DuEE/run_duee_1.sh +++ b/legacy/examples/information_extraction/DuEE/run_duee_1.sh @@ -1,3 +1,17 @@ +# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + dataset_name=DuEE1.0 data_dir=./data/${dataset_name} conf_dir=./conf/${dataset_name} diff --git a/examples/information_extraction/DuEE/run_duee_fin.sh b/legacy/examples/information_extraction/DuEE/run_duee_fin.sh similarity index 82% rename from examples/information_extraction/DuEE/run_duee_fin.sh rename to legacy/examples/information_extraction/DuEE/run_duee_fin.sh index 5b0b7b12e4a4..2ef7472e9941 100644 --- a/examples/information_extraction/DuEE/run_duee_fin.sh +++ b/legacy/examples/information_extraction/DuEE/run_duee_fin.sh @@ -1,3 +1,17 @@ +# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + dataset_name=DuEE-Fin data_dir=./data/${dataset_name} conf_dir=./conf/${dataset_name} diff --git a/examples/information_extraction/DuEE/run_sequence_labeling.sh b/legacy/examples/information_extraction/DuEE/run_sequence_labeling.sh similarity index 76% rename from examples/information_extraction/DuEE/run_sequence_labeling.sh rename to legacy/examples/information_extraction/DuEE/run_sequence_labeling.sh index 05f3e337f3a9..18380a07b2b7 100644 --- a/examples/information_extraction/DuEE/run_sequence_labeling.sh +++ b/legacy/examples/information_extraction/DuEE/run_sequence_labeling.sh @@ -1,3 +1,16 @@ +# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
data_dir=$1 conf_path=$2 diff --git a/examples/information_extraction/DuEE/sequence_labeling.py b/legacy/examples/information_extraction/DuEE/sequence_labeling.py similarity index 100% rename from examples/information_extraction/DuEE/sequence_labeling.py rename to legacy/examples/information_extraction/DuEE/sequence_labeling.py diff --git a/examples/information_extraction/DuEE/utils.py b/legacy/examples/information_extraction/DuEE/utils.py similarity index 100% rename from examples/information_extraction/DuEE/utils.py rename to legacy/examples/information_extraction/DuEE/utils.py diff --git a/examples/information_extraction/DuIE/README.md b/legacy/examples/information_extraction/DuIE/README.md similarity index 100% rename from examples/information_extraction/DuIE/README.md rename to legacy/examples/information_extraction/DuIE/README.md diff --git a/examples/information_extraction/DuIE/data/id2spo.json b/legacy/examples/information_extraction/DuIE/data/id2spo.json similarity index 100% rename from examples/information_extraction/DuIE/data/id2spo.json rename to legacy/examples/information_extraction/DuIE/data/id2spo.json diff --git a/examples/information_extraction/DuIE/data/predicate2id.json b/legacy/examples/information_extraction/DuIE/data/predicate2id.json similarity index 100% rename from examples/information_extraction/DuIE/data/predicate2id.json rename to legacy/examples/information_extraction/DuIE/data/predicate2id.json diff --git a/examples/information_extraction/DuIE/data_loader.py b/legacy/examples/information_extraction/DuIE/data_loader.py similarity index 100% rename from examples/information_extraction/DuIE/data_loader.py rename to legacy/examples/information_extraction/DuIE/data_loader.py diff --git a/examples/information_extraction/DuIE/extract_chinese_and_punct.py b/legacy/examples/information_extraction/DuIE/extract_chinese_and_punct.py similarity index 100% rename from examples/information_extraction/DuIE/extract_chinese_and_punct.py rename to legacy/examples/information_extraction/DuIE/extract_chinese_and_punct.py diff --git a/examples/information_extraction/DuIE/images/tagging_strategy.png b/legacy/examples/information_extraction/DuIE/images/tagging_strategy.png similarity index 100% rename from examples/information_extraction/DuIE/images/tagging_strategy.png rename to legacy/examples/information_extraction/DuIE/images/tagging_strategy.png diff --git a/examples/text_graph/erniesage/data/__init__.py b/legacy/examples/information_extraction/DuIE/predict.sh similarity index 58% rename from examples/text_graph/erniesage/data/__init__.py rename to legacy/examples/information_extraction/DuIE/predict.sh index bd14d6954839..247f5759179a 100644 --- a/examples/text_graph/erniesage/data/__init__.py +++ b/legacy/examples/information_extraction/DuIE/predict.sh @@ -1,19 +1,28 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# +# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved. +# # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at -# +# # http://www.apache.org/licenses/LICENSE-2.0 -# +# # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
-from data import dataset, graph_reader +set -eux + +export CUDA_VISIBLE_DEVICES=0 +export BATCH_SIZE=64 +export CKPT=./checkpoints/model_90000.pdparams +export DATASET_FILE=./data/test1.json + +python run_duie.py \ + --do_predict \ + --init_checkpoint $CKPT \ + --predict_data_file $DATASET_FILE \ + --max_seq_length 128 \ + --batch_size $BATCH_SIZE -__all__ = [] -__all__ += dataset.__all__ -__all__ += graph_reader.__all__ diff --git a/examples/information_extraction/DuIE/re_official_evaluation.py b/legacy/examples/information_extraction/DuIE/re_official_evaluation.py similarity index 100% rename from examples/information_extraction/DuIE/re_official_evaluation.py rename to legacy/examples/information_extraction/DuIE/re_official_evaluation.py diff --git a/examples/information_extraction/DuIE/run_duie.py b/legacy/examples/information_extraction/DuIE/run_duie.py similarity index 100% rename from examples/information_extraction/DuIE/run_duie.py rename to legacy/examples/information_extraction/DuIE/run_duie.py diff --git a/examples/information_extraction/DuIE/train.sh b/legacy/examples/information_extraction/DuIE/train.sh similarity index 51% rename from examples/information_extraction/DuIE/train.sh rename to legacy/examples/information_extraction/DuIE/train.sh index 89a69e9ab9fb..5cb04ae55e7f 100644 --- a/examples/information_extraction/DuIE/train.sh +++ b/legacy/examples/information_extraction/DuIE/train.sh @@ -1,3 +1,17 @@ +# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + set -eux export BATCH_SIZE=8 diff --git a/examples/information_extraction/DuIE/utils.py b/legacy/examples/information_extraction/DuIE/utils.py similarity index 100% rename from examples/information_extraction/DuIE/utils.py rename to legacy/examples/information_extraction/DuIE/utils.py diff --git a/examples/information_extraction/DuUIE/README.md b/legacy/examples/information_extraction/DuUIE/README.md similarity index 100% rename from examples/information_extraction/DuUIE/README.md rename to legacy/examples/information_extraction/DuUIE/README.md diff --git a/examples/information_extraction/DuUIE/config/multi-task-duuie.yaml b/legacy/examples/information_extraction/DuUIE/config/multi-task-duuie.yaml similarity index 100% rename from examples/information_extraction/DuUIE/config/multi-task-duuie.yaml rename to legacy/examples/information_extraction/DuUIE/config/multi-task-duuie.yaml diff --git a/examples/information_extraction/DuUIE/inference.py b/legacy/examples/information_extraction/DuUIE/inference.py similarity index 98% rename from examples/information_extraction/DuUIE/inference.py rename to legacy/examples/information_extraction/DuUIE/inference.py index b0a0662837c0..1d70d484995c 100644 --- a/examples/information_extraction/DuUIE/inference.py +++ b/legacy/examples/information_extraction/DuUIE/inference.py @@ -16,17 +16,17 @@ # limitations under the License. 
import json -import os import math -from tqdm import tqdm +import os import paddle +from tqdm import tqdm +from uie.evaluation.sel2record import MapConfig, RecordSchema, SEL2Record +from uie.seq2struct.t5_bert_tokenizer import T5BertTokenizer + from paddlenlp.data import Pad from paddlenlp.transformers import T5ForConditionalGeneration -from uie.evaluation.sel2record import RecordSchema, MapConfig, SEL2Record -from uie.seq2struct.t5_bert_tokenizer import T5BertTokenizer - special_to_remove = {"<pad>", "</s>"} diff --git a/examples/information_extraction/DuUIE/process_data.py b/legacy/examples/information_extraction/DuUIE/process_data.py similarity index 99% rename from examples/information_extraction/DuUIE/process_data.py rename to legacy/examples/information_extraction/DuUIE/process_data.py index 0269223ccb7a..84b6f0280843 100644 --- a/examples/information_extraction/DuUIE/process_data.py +++ b/legacy/examples/information_extraction/DuUIE/process_data.py @@ -16,11 +16,12 @@ # limitations under the License. import copy -from typing import List, Dict -from collections import defaultdict -import yaml import json import os +from collections import defaultdict +from typing import Dict, List + +import yaml from uie.evaluation.sel2record import RecordSchema, merge_schema diff --git a/examples/information_extraction/DuUIE/requirements.txt b/legacy/examples/information_extraction/DuUIE/requirements.txt similarity index 100% rename from examples/information_extraction/DuUIE/requirements.txt rename to legacy/examples/information_extraction/DuUIE/requirements.txt diff --git a/examples/information_extraction/DuUIE/run_seq2struct.py b/legacy/examples/information_extraction/DuUIE/run_seq2struct.py similarity index 100% rename from examples/information_extraction/DuUIE/run_seq2struct.py rename to legacy/examples/information_extraction/DuUIE/run_seq2struct.py diff --git a/examples/information_extraction/DuUIE/uie/__init__.py b/legacy/examples/information_extraction/DuUIE/uie/__init__.py similarity index 100% rename from examples/information_extraction/DuUIE/uie/__init__.py rename to legacy/examples/information_extraction/DuUIE/uie/__init__.py diff --git a/examples/information_extraction/DuUIE/uie/evaluation/__init__.py b/legacy/examples/information_extraction/DuUIE/uie/evaluation/__init__.py similarity index 100% rename from examples/information_extraction/DuUIE/uie/evaluation/__init__.py rename to legacy/examples/information_extraction/DuUIE/uie/evaluation/__init__.py diff --git a/examples/information_extraction/DuUIE/uie/evaluation/constants.py b/legacy/examples/information_extraction/DuUIE/uie/evaluation/constants.py similarity index 100% rename from examples/information_extraction/DuUIE/uie/evaluation/constants.py rename to legacy/examples/information_extraction/DuUIE/uie/evaluation/constants.py diff --git a/examples/information_extraction/DuUIE/uie/evaluation/scorer.py b/legacy/examples/information_extraction/DuUIE/uie/evaluation/scorer.py similarity index 100% rename from examples/information_extraction/DuUIE/uie/evaluation/scorer.py rename to legacy/examples/information_extraction/DuUIE/uie/evaluation/scorer.py diff --git a/examples/information_extraction/DuUIE/uie/evaluation/sel2record.py b/legacy/examples/information_extraction/DuUIE/uie/evaluation/sel2record.py similarity index 99% rename from examples/information_extraction/DuUIE/uie/evaluation/sel2record.py rename to legacy/examples/information_extraction/DuUIE/uie/evaluation/sel2record.py index 77885638d508..6d499e077e47 100644 ---
a/examples/information_extraction/DuUIE/uie/evaluation/sel2record.py +++ b/legacy/examples/information_extraction/DuUIE/uie/evaluation/sel2record.py @@ -15,16 +15,23 @@ # See the License for the specific language governing permissions and # limitations under the License. -from typing import Tuple, List, Dict -from collections import defaultdict, OrderedDict, Counter -import os -import numpy +import json import logging +import os import re -import json +from collections import Counter, OrderedDict, defaultdict +from typing import Dict, List, Tuple + +import numpy from nltk.tree import ParentedTree -from uie.evaluation.constants import span_start, type_start, type_end, null_span, offset_map_strategy -from uie.evaluation.scorer import EntityScorer, RelationScorer, EventScorer +from uie.evaluation.constants import ( + null_span, + offset_map_strategy, + span_start, + type_end, + type_start, +) +from uie.evaluation.scorer import EntityScorer, EventScorer, RelationScorer logger = logging.getLogger("__main__") diff --git a/examples/information_extraction/DuUIE/uie/seq2struct/__init__.py b/legacy/examples/information_extraction/DuUIE/uie/seq2struct/__init__.py similarity index 100% rename from examples/information_extraction/DuUIE/uie/seq2struct/__init__.py rename to legacy/examples/information_extraction/DuUIE/uie/seq2struct/__init__.py diff --git a/examples/information_extraction/DuUIE/uie/seq2struct/data_collator.py b/legacy/examples/information_extraction/DuUIE/uie/seq2struct/data_collator.py similarity index 100% rename from examples/information_extraction/DuUIE/uie/seq2struct/data_collator.py rename to legacy/examples/information_extraction/DuUIE/uie/seq2struct/data_collator.py diff --git a/examples/information_extraction/DuUIE/uie/seq2struct/t5_bert_tokenizer.py b/legacy/examples/information_extraction/DuUIE/uie/seq2struct/t5_bert_tokenizer.py similarity index 99% rename from examples/information_extraction/DuUIE/uie/seq2struct/t5_bert_tokenizer.py rename to legacy/examples/information_extraction/DuUIE/uie/seq2struct/t5_bert_tokenizer.py index cb761d8093f2..97603044288a 100644 --- a/examples/information_extraction/DuUIE/uie/seq2struct/t5_bert_tokenizer.py +++ b/legacy/examples/information_extraction/DuUIE/uie/seq2struct/t5_bert_tokenizer.py @@ -16,9 +16,10 @@ # limitations under the License. 
import logging -from typing import Optional, Union, List +from typing import List, Optional, Union from paddle import Tensor + from paddlenlp.transformers import BertTokenizer logger = logging.getLogger(__name__) diff --git a/examples/information_extraction/DuUIE/uie/seq2struct/utils.py b/legacy/examples/information_extraction/DuUIE/uie/seq2struct/utils.py similarity index 100% rename from examples/information_extraction/DuUIE/uie/seq2struct/utils.py rename to legacy/examples/information_extraction/DuUIE/uie/seq2struct/utils.py diff --git a/examples/information_extraction/msra_ner/README.md b/legacy/examples/information_extraction/msra_ner/README.md similarity index 100% rename from examples/information_extraction/msra_ner/README.md rename to legacy/examples/information_extraction/msra_ner/README.md diff --git a/examples/information_extraction/msra_ner/eval.py b/legacy/examples/information_extraction/msra_ner/eval.py similarity index 100% rename from examples/information_extraction/msra_ner/eval.py rename to legacy/examples/information_extraction/msra_ner/eval.py diff --git a/examples/information_extraction/msra_ner/predict.py b/legacy/examples/information_extraction/msra_ner/predict.py similarity index 100% rename from examples/information_extraction/msra_ner/predict.py rename to legacy/examples/information_extraction/msra_ner/predict.py diff --git a/examples/information_extraction/msra_ner/train.py b/legacy/examples/information_extraction/msra_ner/train.py similarity index 100% rename from examples/information_extraction/msra_ner/train.py rename to legacy/examples/information_extraction/msra_ner/train.py diff --git a/examples/machine_reading_comprehension/DuReader-robust/README.md b/legacy/examples/machine_reading_comprehension/DuReader-robust/README.md similarity index 100% rename from examples/machine_reading_comprehension/DuReader-robust/README.md rename to legacy/examples/machine_reading_comprehension/DuReader-robust/README.md diff --git a/examples/machine_reading_comprehension/DuReader-robust/args.py b/legacy/examples/machine_reading_comprehension/DuReader-robust/args.py similarity index 100% rename from examples/machine_reading_comprehension/DuReader-robust/args.py rename to legacy/examples/machine_reading_comprehension/DuReader-robust/args.py diff --git a/examples/machine_reading_comprehension/DuReader-robust/run_du.py b/legacy/examples/machine_reading_comprehension/DuReader-robust/run_du.py similarity index 100% rename from examples/machine_reading_comprehension/DuReader-robust/run_du.py rename to legacy/examples/machine_reading_comprehension/DuReader-robust/run_du.py diff --git a/examples/machine_reading_comprehension/DuReader-yesno/README.md b/legacy/examples/machine_reading_comprehension/DuReader-yesno/README.md similarity index 100% rename from examples/machine_reading_comprehension/DuReader-yesno/README.md rename to legacy/examples/machine_reading_comprehension/DuReader-yesno/README.md diff --git a/examples/machine_reading_comprehension/DuReader-yesno/args.py b/legacy/examples/machine_reading_comprehension/DuReader-yesno/args.py similarity index 81% rename from examples/machine_reading_comprehension/DuReader-yesno/args.py rename to legacy/examples/machine_reading_comprehension/DuReader-yesno/args.py index e460d925a1a0..af6727bc9681 100644 --- a/examples/machine_reading_comprehension/DuReader-yesno/args.py +++ b/legacy/examples/machine_reading_comprehension/DuReader-yesno/args.py @@ -1,3 +1,17 @@ +# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + import argparse diff --git a/examples/machine_reading_comprehension/DuReader-yesno/run_du.py b/legacy/examples/machine_reading_comprehension/DuReader-yesno/run_du.py similarity index 100% rename from examples/machine_reading_comprehension/DuReader-yesno/run_du.py rename to legacy/examples/machine_reading_comprehension/DuReader-yesno/run_du.py diff --git a/examples/machine_reading_comprehension/SQuAD/README.md b/legacy/examples/machine_reading_comprehension/SQuAD/README.md similarity index 100% rename from examples/machine_reading_comprehension/SQuAD/README.md rename to legacy/examples/machine_reading_comprehension/SQuAD/README.md diff --git a/examples/machine_reading_comprehension/SQuAD/args.py b/legacy/examples/machine_reading_comprehension/SQuAD/args.py similarity index 100% rename from examples/machine_reading_comprehension/SQuAD/args.py rename to legacy/examples/machine_reading_comprehension/SQuAD/args.py diff --git a/examples/machine_reading_comprehension/SQuAD/deploy/python/predict.py b/legacy/examples/machine_reading_comprehension/SQuAD/deploy/python/predict.py similarity index 100% rename from examples/machine_reading_comprehension/SQuAD/deploy/python/predict.py rename to legacy/examples/machine_reading_comprehension/SQuAD/deploy/python/predict.py diff --git a/examples/machine_reading_comprehension/SQuAD/export_model.py b/legacy/examples/machine_reading_comprehension/SQuAD/export_model.py similarity index 99% rename from examples/machine_reading_comprehension/SQuAD/export_model.py rename to legacy/examples/machine_reading_comprehension/SQuAD/export_model.py index 7f1ad38135c9..9144e3e0491c 100644 --- a/examples/machine_reading_comprehension/SQuAD/export_model.py +++ b/legacy/examples/machine_reading_comprehension/SQuAD/export_model.py @@ -16,7 +16,6 @@ import os import paddle - from run_squad import MODEL_CLASSES diff --git a/examples/machine_reading_comprehension/SQuAD/run_squad.py b/legacy/examples/machine_reading_comprehension/SQuAD/run_squad.py similarity index 100% rename from examples/machine_reading_comprehension/SQuAD/run_squad.py rename to legacy/examples/machine_reading_comprehension/SQuAD/run_squad.py diff --git a/examples/machine_translation/README.md b/legacy/examples/machine_translation/README.md similarity index 100% rename from examples/machine_translation/README.md rename to legacy/examples/machine_translation/README.md diff --git a/examples/machine_translation/preprocessor/prepare-iwslt14.sh b/legacy/examples/machine_translation/preprocessor/prepare-iwslt14.sh similarity index 100% rename from examples/machine_translation/preprocessor/prepare-iwslt14.sh rename to legacy/examples/machine_translation/preprocessor/prepare-iwslt14.sh diff --git a/examples/machine_translation/preprocessor/prepare-wmt14en2de.sh b/legacy/examples/machine_translation/preprocessor/prepare-wmt14en2de.sh similarity index 100% rename from examples/machine_translation/preprocessor/prepare-wmt14en2de.sh rename to 
legacy/examples/machine_translation/preprocessor/prepare-wmt14en2de.sh diff --git a/examples/machine_translation/preprocessor/prepare-wmt14en2fr.sh b/legacy/examples/machine_translation/preprocessor/prepare-wmt14en2fr.sh similarity index 100% rename from examples/machine_translation/preprocessor/prepare-wmt14en2fr.sh rename to legacy/examples/machine_translation/preprocessor/prepare-wmt14en2fr.sh diff --git a/examples/machine_translation/preprocessor/preprocessor.py b/legacy/examples/machine_translation/preprocessor/preprocessor.py similarity index 100% rename from examples/machine_translation/preprocessor/preprocessor.py rename to legacy/examples/machine_translation/preprocessor/preprocessor.py diff --git a/examples/machine_translation/requirements.txt b/legacy/examples/machine_translation/requirements.txt similarity index 100% rename from examples/machine_translation/requirements.txt rename to legacy/examples/machine_translation/requirements.txt diff --git a/examples/machine_translation/transformer/README.md b/legacy/examples/machine_translation/transformer/README.md similarity index 100% rename from examples/machine_translation/transformer/README.md rename to legacy/examples/machine_translation/transformer/README.md diff --git a/examples/machine_translation/transformer/configs/transformer.base.yaml b/legacy/examples/machine_translation/transformer/configs/transformer.base.yaml similarity index 100% rename from examples/machine_translation/transformer/configs/transformer.base.yaml rename to legacy/examples/machine_translation/transformer/configs/transformer.base.yaml diff --git a/examples/machine_translation/transformer/configs/transformer.big.yaml b/legacy/examples/machine_translation/transformer/configs/transformer.big.yaml similarity index 100% rename from examples/machine_translation/transformer/configs/transformer.big.yaml rename to legacy/examples/machine_translation/transformer/configs/transformer.big.yaml diff --git a/examples/machine_translation/transformer/deploy/cpp/CMakeLists.txt b/legacy/examples/machine_translation/transformer/deploy/cpp/CMakeLists.txt similarity index 100% rename from examples/machine_translation/transformer/deploy/cpp/CMakeLists.txt rename to legacy/examples/machine_translation/transformer/deploy/cpp/CMakeLists.txt diff --git a/examples/machine_translation/transformer/deploy/cpp/README.md b/legacy/examples/machine_translation/transformer/deploy/cpp/README.md similarity index 100% rename from examples/machine_translation/transformer/deploy/cpp/README.md rename to legacy/examples/machine_translation/transformer/deploy/cpp/README.md diff --git a/examples/machine_translation/transformer/deploy/cpp/helper.h b/legacy/examples/machine_translation/transformer/deploy/cpp/helper.h similarity index 69% rename from examples/machine_translation/transformer/deploy/cpp/helper.h rename to legacy/examples/machine_translation/transformer/deploy/cpp/helper.h index 6b4d41f82d8d..b46ccfa68ae9 100644 --- a/examples/machine_translation/transformer/deploy/cpp/helper.h +++ b/legacy/examples/machine_translation/transformer/deploy/cpp/helper.h @@ -1,3 +1,17 @@ +// Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved. +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. 
+// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + #pragma once #include #include diff --git a/examples/machine_translation/transformer/deploy/cpp/run.sh b/legacy/examples/machine_translation/transformer/deploy/cpp/run.sh similarity index 56% rename from examples/machine_translation/transformer/deploy/cpp/run.sh rename to legacy/examples/machine_translation/transformer/deploy/cpp/run.sh index 6c59397b5cad..0c2bded97eda 100644 --- a/examples/machine_translation/transformer/deploy/cpp/run.sh +++ b/legacy/examples/machine_translation/transformer/deploy/cpp/run.sh @@ -1,4 +1,19 @@ #!/bin/bash + +# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + # Whether to use mkl or gpu WITH_MKL=ON DEVICE='gpu' diff --git a/examples/machine_translation/transformer/deploy/cpp/run_impl.sh b/legacy/examples/machine_translation/transformer/deploy/cpp/run_impl.sh similarity index 50% rename from examples/machine_translation/transformer/deploy/cpp/run_impl.sh rename to legacy/examples/machine_translation/transformer/deploy/cpp/run_impl.sh index e8c5782af25e..a715925a37e6 100755 --- a/examples/machine_translation/transformer/deploy/cpp/run_impl.sh +++ b/legacy/examples/machine_translation/transformer/deploy/cpp/run_impl.sh @@ -1,4 +1,19 @@ #!/bin/bash + +# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + mkdir -p build cd build rm -rf * diff --git a/examples/machine_translation/transformer/deploy/cpp/transformer_e2e.cc b/legacy/examples/machine_translation/transformer/deploy/cpp/transformer_e2e.cc similarity index 93% rename from examples/machine_translation/transformer/deploy/cpp/transformer_e2e.cc rename to legacy/examples/machine_translation/transformer/deploy/cpp/transformer_e2e.cc index 2f21391f3cab..c0eadf0aef74 100644 --- a/examples/machine_translation/transformer/deploy/cpp/transformer_e2e.cc +++ b/legacy/examples/machine_translation/transformer/deploy/cpp/transformer_e2e.cc @@ -1,3 +1,17 @@ +// Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved. 
+// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + #include #include #include diff --git a/examples/machine_translation/transformer/deploy/python/README.md b/legacy/examples/machine_translation/transformer/deploy/python/README.md similarity index 100% rename from examples/machine_translation/transformer/deploy/python/README.md rename to legacy/examples/machine_translation/transformer/deploy/python/README.md diff --git a/examples/machine_translation/transformer/deploy/python/benchmark.sh b/legacy/examples/machine_translation/transformer/deploy/python/benchmark.sh similarity index 64% rename from examples/machine_translation/transformer/deploy/python/benchmark.sh rename to legacy/examples/machine_translation/transformer/deploy/python/benchmark.sh index 0b9b8c482995..4d953a467f51 100644 --- a/examples/machine_translation/transformer/deploy/python/benchmark.sh +++ b/legacy/examples/machine_translation/transformer/deploy/python/benchmark.sh @@ -1,4 +1,19 @@ #!/bin/bash + +# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ model_dir=${1} model=${2} mkdir -p output_pipeline diff --git a/examples/machine_translation/transformer/deploy/python/inference.py b/legacy/examples/machine_translation/transformer/deploy/python/inference.py similarity index 100% rename from examples/machine_translation/transformer/deploy/python/inference.py rename to legacy/examples/machine_translation/transformer/deploy/python/inference.py diff --git a/examples/machine_translation/transformer/deploy/python/tls/benchmark_utils.py b/legacy/examples/machine_translation/transformer/deploy/python/tls/benchmark_utils.py similarity index 100% rename from examples/machine_translation/transformer/deploy/python/tls/benchmark_utils.py rename to legacy/examples/machine_translation/transformer/deploy/python/tls/benchmark_utils.py diff --git a/examples/machine_translation/transformer/deploy/python/tls/recorder.py b/legacy/examples/machine_translation/transformer/deploy/python/tls/recorder.py similarity index 100% rename from examples/machine_translation/transformer/deploy/python/tls/recorder.py rename to legacy/examples/machine_translation/transformer/deploy/python/tls/recorder.py diff --git a/examples/machine_translation/transformer/deploy/serving/README.md b/legacy/examples/machine_translation/transformer/deploy/serving/README.md similarity index 100% rename from examples/machine_translation/transformer/deploy/serving/README.md rename to legacy/examples/machine_translation/transformer/deploy/serving/README.md diff --git a/examples/machine_translation/transformer/deploy/serving/benchmark.py b/legacy/examples/machine_translation/transformer/deploy/serving/benchmark.py similarity index 100% rename from examples/machine_translation/transformer/deploy/serving/benchmark.py rename to legacy/examples/machine_translation/transformer/deploy/serving/benchmark.py diff --git a/examples/machine_translation/transformer/deploy/serving/benchmark_serving.sh b/legacy/examples/machine_translation/transformer/deploy/serving/benchmark_serving.sh similarity index 71% rename from examples/machine_translation/transformer/deploy/serving/benchmark_serving.sh rename to legacy/examples/machine_translation/transformer/deploy/serving/benchmark_serving.sh index 7f57ef5a0873..4b5f686b2f47 100644 --- a/examples/machine_translation/transformer/deploy/serving/benchmark_serving.sh +++ b/legacy/examples/machine_translation/transformer/deploy/serving/benchmark_serving.sh @@ -1,3 +1,17 @@ +# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ modelname="transformer" export FLAGS_profile_pipeline=1 # HTTP diff --git a/examples/machine_translation/transformer/deploy/serving/export_serving_model.py b/legacy/examples/machine_translation/transformer/deploy/serving/export_serving_model.py similarity index 56% rename from examples/machine_translation/transformer/deploy/serving/export_serving_model.py rename to legacy/examples/machine_translation/transformer/deploy/serving/export_serving_model.py index 97e0a526dc80..feb38dda199e 100644 --- a/examples/machine_translation/transformer/deploy/serving/export_serving_model.py +++ b/legacy/examples/machine_translation/transformer/deploy/serving/export_serving_model.py @@ -1,4 +1,19 @@ +# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + import argparse + import paddle import paddle_serving_client.io as serving_io diff --git a/examples/machine_translation/transformer/deploy/serving/transformer_reader.py b/legacy/examples/machine_translation/transformer/deploy/serving/transformer_reader.py similarity index 73% rename from examples/machine_translation/transformer/deploy/serving/transformer_reader.py rename to legacy/examples/machine_translation/transformer/deploy/serving/transformer_reader.py index 2b295e7d6c37..b613a5906138 100644 --- a/examples/machine_translation/transformer/deploy/serving/transformer_reader.py +++ b/legacy/examples/machine_translation/transformer/deploy/serving/transformer_reader.py @@ -1,7 +1,21 @@ +# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+
 import numpy as np
-from paddlenlp.datasets import load_dataset
 from paddlenlp.data import Pad, Vocab
+from paddlenlp.datasets import load_dataset


 class TransformerReader(object):
diff --git a/examples/machine_translation/transformer/deploy/serving/transformer_web_client.py b/legacy/examples/machine_translation/transformer/deploy/serving/transformer_web_client.py
similarity index 100%
rename from examples/machine_translation/transformer/deploy/serving/transformer_web_client.py
rename to legacy/examples/machine_translation/transformer/deploy/serving/transformer_web_client.py
diff --git a/examples/machine_translation/transformer/deploy/serving/transformer_web_server.py b/legacy/examples/machine_translation/transformer/deploy/serving/transformer_web_server.py
similarity index 100%
rename from examples/machine_translation/transformer/deploy/serving/transformer_web_server.py
rename to legacy/examples/machine_translation/transformer/deploy/serving/transformer_web_server.py
diff --git a/examples/machine_translation/transformer/deploy/serving/utils/recorder.py b/legacy/examples/machine_translation/transformer/deploy/serving/utils/recorder.py
similarity index 67%
rename from examples/machine_translation/transformer/deploy/serving/utils/recorder.py
rename to legacy/examples/machine_translation/transformer/deploy/serving/utils/recorder.py
index 70a156a5e4f8..454abaf0e101 100644
--- a/examples/machine_translation/transformer/deploy/serving/utils/recorder.py
+++ b/legacy/examples/machine_translation/transformer/deploy/serving/utils/recorder.py
@@ -1,4 +1,19 @@
+# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 import time
+
 import paddle
diff --git a/examples/machine_translation/transformer/export_model.py b/legacy/examples/machine_translation/transformer/export_model.py
similarity index 100%
rename from examples/machine_translation/transformer/export_model.py
rename to legacy/examples/machine_translation/transformer/export_model.py
diff --git a/examples/machine_translation/transformer/fast_transformer/README.md b/legacy/examples/machine_translation/transformer/fast_transformer/README.md
similarity index 100%
rename from examples/machine_translation/transformer/fast_transformer/README.md
rename to legacy/examples/machine_translation/transformer/fast_transformer/README.md
diff --git a/examples/machine_translation/transformer/fast_transformer/encoder_decoding_predict.py b/legacy/examples/machine_translation/transformer/fast_transformer/encoder_decoding_predict.py
similarity index 100%
rename from examples/machine_translation/transformer/fast_transformer/encoder_decoding_predict.py
rename to legacy/examples/machine_translation/transformer/fast_transformer/encoder_decoding_predict.py
diff --git a/examples/machine_translation/transformer/fast_transformer/export_model.py b/legacy/examples/machine_translation/transformer/fast_transformer/export_model.py
similarity index 100%
rename from examples/machine_translation/transformer/fast_transformer/export_model.py
rename to legacy/examples/machine_translation/transformer/fast_transformer/export_model.py
diff --git a/examples/machine_translation/transformer/images/multi_head_attention.png b/legacy/examples/machine_translation/transformer/images/multi_head_attention.png
similarity index 100%
rename from examples/machine_translation/transformer/images/multi_head_attention.png
rename to legacy/examples/machine_translation/transformer/images/multi_head_attention.png
diff --git a/examples/machine_translation/transformer/images/transformer_network.png b/legacy/examples/machine_translation/transformer/images/transformer_network.png
similarity index 100%
rename from examples/machine_translation/transformer/images/transformer_network.png
rename to legacy/examples/machine_translation/transformer/images/transformer_network.png
diff --git a/examples/machine_translation/transformer/predict.py b/legacy/examples/machine_translation/transformer/predict.py
similarity index 100%
rename from examples/machine_translation/transformer/predict.py
rename to legacy/examples/machine_translation/transformer/predict.py
diff --git a/examples/machine_translation/transformer/reader.py b/legacy/examples/machine_translation/transformer/reader.py
similarity index 100%
rename from examples/machine_translation/transformer/reader.py
rename to legacy/examples/machine_translation/transformer/reader.py
diff --git a/examples/machine_translation/transformer/static/predict.py b/legacy/examples/machine_translation/transformer/static/predict.py
similarity index 100%
rename from examples/machine_translation/transformer/static/predict.py
rename to legacy/examples/machine_translation/transformer/static/predict.py
diff --git a/examples/machine_translation/transformer/static/train.py b/legacy/examples/machine_translation/transformer/static/train.py
similarity index 100%
rename from examples/machine_translation/transformer/static/train.py
rename to legacy/examples/machine_translation/transformer/static/train.py
diff --git a/legacy/examples/machine_translation/transformer/tls/distributed_utils.py b/legacy/examples/machine_translation/transformer/tls/distributed_utils.py
new file mode 100644
index 000000000000..67a9ae4c7cee
--- /dev/null
+++ b/legacy/examples/machine_translation/transformer/tls/distributed_utils.py
@@ -0,0 +1,33 @@
+# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import paddle
+import paddle.distributed as dist
+
+
+def all_gather_tokens(data):
+    """Gathers num of tokens from all nodes.
+    `data` should be a tensor of num of tokens.
+    """
+    if dist.get_world_size() < 2:
+        return data
+    if not hasattr(all_gather_tokens, "_in_buffer") or all_gather_tokens._in_buffer is None:
+        all_gather_tokens._in_buffer = data
+        all_gather_tokens._out_buffers = []
+    in_buffer = all_gather_tokens._in_buffer
+    out_buffers = all_gather_tokens._out_buffers
+
+    dist.all_gather(out_buffers, in_buffer)
+
+    return paddle.add_n(out_buffers)
diff --git a/examples/machine_translation/transformer/tls/record.py b/legacy/examples/machine_translation/transformer/tls/record.py
similarity index 50%
rename from examples/machine_translation/transformer/tls/record.py
rename to legacy/examples/machine_translation/transformer/tls/record.py
index d1ddc738a528..a5a6dddc7139 100644
--- a/examples/machine_translation/transformer/tls/record.py
+++ b/legacy/examples/machine_translation/transformer/tls/record.py
@@ -1,3 +1,18 @@
+# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
 class AverageStatistical(object):
     def __init__(self):
         self.reset()
diff --git a/examples/machine_translation/transformer/tls/to_static.py b/legacy/examples/machine_translation/transformer/tls/to_static.py
similarity index 100%
rename from examples/machine_translation/transformer/tls/to_static.py
rename to legacy/examples/machine_translation/transformer/tls/to_static.py
diff --git a/examples/machine_translation/transformer/train.py b/legacy/examples/machine_translation/transformer/train.py
similarity index 100%
rename from examples/machine_translation/transformer/train.py
rename to legacy/examples/machine_translation/transformer/train.py
diff --git a/examples/model_compression/minilmv2/README.md b/legacy/examples/model_compression/minilmv2/README.md
similarity index 100%
rename from examples/model_compression/minilmv2/README.md
rename to legacy/examples/model_compression/minilmv2/README.md
diff --git a/examples/model_compression/minilmv2/general_distill.py b/legacy/examples/model_compression/minilmv2/general_distill.py
similarity index 100%
rename from examples/model_compression/minilmv2/general_distill.py
rename to legacy/examples/model_compression/minilmv2/general_distill.py
diff --git a/examples/model_compression/minilmv2/run_clue.py b/legacy/examples/model_compression/minilmv2/run_clue.py
similarity index 100%
rename from examples/model_compression/minilmv2/run_clue.py
rename to legacy/examples/model_compression/minilmv2/run_clue.py
diff --git a/examples/model_compression/ofa/README.md b/legacy/examples/model_compression/ofa/README.md
similarity index 100%
rename from examples/model_compression/ofa/README.md
rename to legacy/examples/model_compression/ofa/README.md
diff --git a/examples/model_compression/ofa/export_model.py b/legacy/examples/model_compression/ofa/export_model.py
similarity index 100%
rename from examples/model_compression/ofa/export_model.py
rename to legacy/examples/model_compression/ofa/export_model.py
diff --git a/examples/model_compression/ofa/imgs/ofa_bert.jpg b/legacy/examples/model_compression/ofa/imgs/ofa_bert.jpg
similarity index 100%
rename from examples/model_compression/ofa/imgs/ofa_bert.jpg
rename to legacy/examples/model_compression/ofa/imgs/ofa_bert.jpg
diff --git a/examples/model_compression/ofa/run_glue_ofa.py b/legacy/examples/model_compression/ofa/run_glue_ofa.py
similarity index 100%
rename from examples/model_compression/ofa/run_glue_ofa.py
rename to legacy/examples/model_compression/ofa/run_glue_ofa.py
diff --git a/examples/model_compression/ofa/run_glue_ofa_depth.py b/legacy/examples/model_compression/ofa/run_glue_ofa_depth.py
similarity index 100%
rename from examples/model_compression/ofa/run_glue_ofa_depth.py
rename to legacy/examples/model_compression/ofa/run_glue_ofa_depth.py
diff --git a/examples/model_compression/pp-minilm/README.md b/legacy/examples/model_compression/pp-minilm/README.md
similarity index 100%
rename from examples/model_compression/pp-minilm/README.md
rename to legacy/examples/model_compression/pp-minilm/README.md
diff --git a/examples/model_compression/pp-minilm/data.py b/legacy/examples/model_compression/pp-minilm/data.py
similarity index 100%
rename from examples/model_compression/pp-minilm/data.py
rename to legacy/examples/model_compression/pp-minilm/data.py
diff --git a/examples/model_compression/pp-minilm/deploy/python/infer.py b/legacy/examples/model_compression/pp-minilm/deploy/python/infer.py
similarity index 100%
rename from examples/model_compression/pp-minilm/deploy/python/infer.py
rename to legacy/examples/model_compression/pp-minilm/deploy/python/infer.py
diff --git a/examples/model_compression/pp-minilm/deploy/python/infer_all.sh b/legacy/examples/model_compression/pp-minilm/deploy/python/infer_all.sh
similarity index 100%
rename from examples/model_compression/pp-minilm/deploy/python/infer_all.sh
rename to legacy/examples/model_compression/pp-minilm/deploy/python/infer_all.sh
diff --git a/examples/model_compression/pp-minilm/deploy/python/infer_perf.sh b/legacy/examples/model_compression/pp-minilm/deploy/python/infer_perf.sh
similarity index 100%
rename from examples/model_compression/pp-minilm/deploy/python/infer_perf.sh
rename to legacy/examples/model_compression/pp-minilm/deploy/python/infer_perf.sh
diff --git a/examples/model_compression/pp-minilm/deploy/serving/README.md b/legacy/examples/model_compression/pp-minilm/deploy/serving/README.md
similarity index 100%
rename from examples/model_compression/pp-minilm/deploy/serving/README.md
rename to legacy/examples/model_compression/pp-minilm/deploy/serving/README.md
diff --git a/examples/model_compression/pp-minilm/deploy/serving/config_nlp.yml b/legacy/examples/model_compression/pp-minilm/deploy/serving/config_nlp.yml
similarity index 100%
rename from examples/model_compression/pp-minilm/deploy/serving/config_nlp.yml
rename to legacy/examples/model_compression/pp-minilm/deploy/serving/config_nlp.yml
diff --git a/examples/model_compression/pp-minilm/deploy/serving/export_to_serving.py b/legacy/examples/model_compression/pp-minilm/deploy/serving/export_to_serving.py
similarity index 100%
rename from examples/model_compression/pp-minilm/deploy/serving/export_to_serving.py
rename to legacy/examples/model_compression/pp-minilm/deploy/serving/export_to_serving.py
diff --git a/examples/model_compression/pp-minilm/deploy/serving/rpc_client.py b/legacy/examples/model_compression/pp-minilm/deploy/serving/rpc_client.py
similarity index 100%
rename from examples/model_compression/pp-minilm/deploy/serving/rpc_client.py
rename to legacy/examples/model_compression/pp-minilm/deploy/serving/rpc_client.py
index c975d13265e7..442391f969b2 100644
--- a/examples/model_compression/pp-minilm/deploy/serving/rpc_client.py
+++ b/legacy/examples/model_compression/pp-minilm/deploy/serving/rpc_client.py
@@ -12,8 +12,8 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-from paddle_serving_server.pipeline import PipelineClient
 import numpy as np
+from paddle_serving_server.pipeline import PipelineClient
 
 client = PipelineClient()
 client.connect(["127.0.0.1:8091"])
diff --git a/examples/model_compression/pp-minilm/deploy/serving/web_service.py b/legacy/examples/model_compression/pp-minilm/deploy/serving/web_service.py
similarity index 100%
rename from examples/model_compression/pp-minilm/deploy/serving/web_service.py
rename to legacy/examples/model_compression/pp-minilm/deploy/serving/web_service.py
diff --git a/examples/model_compression/pp-minilm/finetuning/export_model.py b/legacy/examples/model_compression/pp-minilm/finetuning/export_model.py
similarity index 100%
rename from examples/model_compression/pp-minilm/finetuning/export_model.py
rename to legacy/examples/model_compression/pp-minilm/finetuning/export_model.py
diff --git a/examples/model_compression/pp-minilm/finetuning/run_all_search.sh b/legacy/examples/model_compression/pp-minilm/finetuning/run_all_search.sh
similarity index 68%
rename from examples/model_compression/pp-minilm/finetuning/run_all_search.sh
rename to legacy/examples/model_compression/pp-minilm/finetuning/run_all_search.sh
index c09a288a2fad..39364d1f4733 100644
--- a/examples/model_compression/pp-minilm/finetuning/run_all_search.sh
+++ b/legacy/examples/model_compression/pp-minilm/finetuning/run_all_search.sh
@@ -1,3 +1,17 @@
+# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 # $1 means GENERAL_DIR
 mkdir -p $1/afqmc
 mkdir -p $1/tnews
diff --git a/examples/model_compression/pp-minilm/finetuning/run_clue.py b/legacy/examples/model_compression/pp-minilm/finetuning/run_clue.py
similarity index 100%
rename from examples/model_compression/pp-minilm/finetuning/run_clue.py
rename to legacy/examples/model_compression/pp-minilm/finetuning/run_clue.py
diff --git a/examples/model_compression/pp-minilm/finetuning/run_clue.sh b/legacy/examples/model_compression/pp-minilm/finetuning/run_clue.sh
similarity index 50%
rename from examples/model_compression/pp-minilm/finetuning/run_clue.sh
rename to legacy/examples/model_compression/pp-minilm/finetuning/run_clue.sh
index de7f577ee6af..f8e5061e657d 100644
--- a/examples/model_compression/pp-minilm/finetuning/run_clue.sh
+++ b/legacy/examples/model_compression/pp-minilm/finetuning/run_clue.sh
@@ -1,3 +1,16 @@
+# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
 
 export TASK_NAME=$1
 export LR=$2
diff --git a/examples/model_compression/pp-minilm/finetuning/run_one_search.sh b/legacy/examples/model_compression/pp-minilm/finetuning/run_one_search.sh
similarity index 68%
rename from examples/model_compression/pp-minilm/finetuning/run_one_search.sh
rename to legacy/examples/model_compression/pp-minilm/finetuning/run_one_search.sh
index fbb5261d2f31..c15fef531b9c 100644
--- a/examples/model_compression/pp-minilm/finetuning/run_one_search.sh
+++ b/legacy/examples/model_compression/pp-minilm/finetuning/run_one_search.sh
@@ -1,3 +1,17 @@
+# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 OUTPUT_DIR=$1
 TASK_NAME=$2
diff --git a/examples/model_compression/pp-minilm/general_distill/README.md b/legacy/examples/model_compression/pp-minilm/general_distill/README.md
similarity index 100%
rename from examples/model_compression/pp-minilm/general_distill/README.md
rename to legacy/examples/model_compression/pp-minilm/general_distill/README.md
diff --git a/examples/model_compression/pp-minilm/general_distill/general_distill.py b/legacy/examples/model_compression/pp-minilm/general_distill/general_distill.py
similarity index 100%
rename from examples/model_compression/pp-minilm/general_distill/general_distill.py
rename to legacy/examples/model_compression/pp-minilm/general_distill/general_distill.py
diff --git a/examples/model_compression/pp-minilm/general_distill/run.sh b/legacy/examples/model_compression/pp-minilm/general_distill/run.sh
similarity index 100%
rename from examples/model_compression/pp-minilm/general_distill/run.sh
rename to legacy/examples/model_compression/pp-minilm/general_distill/run.sh
diff --git a/examples/model_compression/pp-minilm/pp-minilm.png b/legacy/examples/model_compression/pp-minilm/pp-minilm.png
similarity index 100%
rename from examples/model_compression/pp-minilm/pp-minilm.png
rename to legacy/examples/model_compression/pp-minilm/pp-minilm.png
diff --git a/examples/model_compression/pp-minilm/pruning/export.sh b/legacy/examples/model_compression/pp-minilm/pruning/export.sh
similarity index 100%
rename from examples/model_compression/pp-minilm/pruning/export.sh
rename to legacy/examples/model_compression/pp-minilm/pruning/export.sh
diff --git a/examples/model_compression/pp-minilm/pruning/export_all.sh b/legacy/examples/model_compression/pp-minilm/pruning/export_all.sh
similarity index 100%
rename from examples/model_compression/pp-minilm/pruning/export_all.sh
rename to legacy/examples/model_compression/pp-minilm/pruning/export_all.sh
diff --git a/examples/model_compression/pp-minilm/pruning/export_model.py b/legacy/examples/model_compression/pp-minilm/pruning/export_model.py
similarity index 100%
rename from examples/model_compression/pp-minilm/pruning/export_model.py
rename to legacy/examples/model_compression/pp-minilm/pruning/export_model.py
diff --git a/examples/model_compression/pp-minilm/pruning/prune.py b/legacy/examples/model_compression/pp-minilm/pruning/prune.py
similarity index 100%
rename from examples/model_compression/pp-minilm/pruning/prune.py
rename to legacy/examples/model_compression/pp-minilm/pruning/prune.py
diff --git a/examples/model_compression/pp-minilm/pruning/prune.sh b/legacy/examples/model_compression/pp-minilm/pruning/prune.sh
similarity index 100%
rename from examples/model_compression/pp-minilm/pruning/prune.sh
rename to legacy/examples/model_compression/pp-minilm/pruning/prune.sh
diff --git a/examples/model_compression/pp-minilm/quantization/quant_all.sh b/legacy/examples/model_compression/pp-minilm/quantization/quant_all.sh
similarity index 100%
rename from examples/model_compression/pp-minilm/quantization/quant_all.sh
rename to legacy/examples/model_compression/pp-minilm/quantization/quant_all.sh
diff --git a/examples/model_compression/pp-minilm/quantization/quant_post.py b/legacy/examples/model_compression/pp-minilm/quantization/quant_post.py
similarity index 100%
rename from examples/model_compression/pp-minilm/quantization/quant_post.py
rename to legacy/examples/model_compression/pp-minilm/quantization/quant_post.py
diff --git a/examples/question_generation/README.md b/legacy/examples/question_generation/README.md
similarity index 100%
rename from examples/question_generation/README.md
rename to legacy/examples/question_generation/README.md
diff --git a/examples/question_generation/t5/README.md b/legacy/examples/question_generation/t5/README.md
similarity index 100%
rename from examples/question_generation/t5/README.md
rename to legacy/examples/question_generation/t5/README.md
diff --git a/examples/question_generation/t5/predict.py b/legacy/examples/question_generation/t5/predict.py
similarity index 100%
rename from examples/question_generation/t5/predict.py
rename to legacy/examples/question_generation/t5/predict.py
diff --git a/examples/question_generation/t5/requirements.txt b/legacy/examples/question_generation/t5/requirements.txt
similarity index 100%
rename from examples/question_generation/t5/requirements.txt
rename to legacy/examples/question_generation/t5/requirements.txt
diff --git a/examples/question_generation/t5/train.py b/legacy/examples/question_generation/t5/train.py
similarity index 100%
rename from examples/question_generation/t5/train.py
rename to legacy/examples/question_generation/t5/train.py
diff --git a/examples/question_generation/t5/utils.py b/legacy/examples/question_generation/t5/utils.py
similarity index 100%
rename from examples/question_generation/t5/utils.py
rename to legacy/examples/question_generation/t5/utils.py
diff --git a/examples/question_generation/unimo-text/README.md b/legacy/examples/question_generation/unimo-text/README.md
similarity index 100%
rename from examples/question_generation/unimo-text/README.md
rename to legacy/examples/question_generation/unimo-text/README.md
diff --git a/examples/question_generation/unimo-text/deploy/paddle_inference/README.md b/legacy/examples/question_generation/unimo-text/deploy/paddle_inference/README.md
similarity index 100%
rename from examples/question_generation/unimo-text/deploy/paddle_inference/README.md
rename to legacy/examples/question_generation/unimo-text/deploy/paddle_inference/README.md
diff --git a/examples/question_generation/unimo-text/deploy/paddle_inference/infer_utils.py b/legacy/examples/question_generation/unimo-text/deploy/paddle_inference/infer_utils.py
similarity index 100%
rename from examples/question_generation/unimo-text/deploy/paddle_inference/infer_utils.py
rename to legacy/examples/question_generation/unimo-text/deploy/paddle_inference/infer_utils.py
diff --git a/examples/question_generation/unimo-text/deploy/paddle_inference/inference.py b/legacy/examples/question_generation/unimo-text/deploy/paddle_inference/inference.py
similarity index 100%
rename from examples/question_generation/unimo-text/deploy/paddle_inference/inference.py
rename to legacy/examples/question_generation/unimo-text/deploy/paddle_inference/inference.py
diff --git a/examples/question_generation/unimo-text/deploy/paddle_serving/README.md b/legacy/examples/question_generation/unimo-text/deploy/paddle_serving/README.md
similarity index 100%
rename from examples/question_generation/unimo-text/deploy/paddle_serving/README.md
rename to legacy/examples/question_generation/unimo-text/deploy/paddle_serving/README.md
diff --git a/examples/question_generation/unimo-text/deploy/paddle_serving/config.yml b/legacy/examples/question_generation/unimo-text/deploy/paddle_serving/config.yml
similarity index 100%
rename from examples/question_generation/unimo-text/deploy/paddle_serving/config.yml
rename to legacy/examples/question_generation/unimo-text/deploy/paddle_serving/config.yml
diff --git a/examples/question_generation/unimo-text/deploy/paddle_serving/infer_utils.py b/legacy/examples/question_generation/unimo-text/deploy/paddle_serving/infer_utils.py
similarity index 100%
rename from examples/question_generation/unimo-text/deploy/paddle_serving/infer_utils.py
rename to legacy/examples/question_generation/unimo-text/deploy/paddle_serving/infer_utils.py
diff --git a/examples/question_generation/unimo-text/deploy/paddle_serving/pipeline_client.py b/legacy/examples/question_generation/unimo-text/deploy/paddle_serving/pipeline_client.py
similarity index 100%
rename from examples/question_generation/unimo-text/deploy/paddle_serving/pipeline_client.py
rename to legacy/examples/question_generation/unimo-text/deploy/paddle_serving/pipeline_client.py
diff --git a/examples/question_generation/unimo-text/deploy/paddle_serving/pipeline_service.py b/legacy/examples/question_generation/unimo-text/deploy/paddle_serving/pipeline_service.py
similarity index 100%
rename from examples/question_generation/unimo-text/deploy/paddle_serving/pipeline_service.py
rename to legacy/examples/question_generation/unimo-text/deploy/paddle_serving/pipeline_service.py
diff --git a/examples/question_generation/unimo-text/export_model.py b/legacy/examples/question_generation/unimo-text/export_model.py
similarity index 100%
rename from examples/question_generation/unimo-text/export_model.py
rename to legacy/examples/question_generation/unimo-text/export_model.py
diff --git a/examples/question_generation/unimo-text/gen_utils.py b/legacy/examples/question_generation/unimo-text/gen_utils.py
similarity index 99%
rename from examples/question_generation/unimo-text/gen_utils.py
rename to legacy/examples/question_generation/unimo-text/gen_utils.py
index ecc75584d89f..f7868500987e 100644
--- a/examples/question_generation/unimo-text/gen_utils.py
+++ b/legacy/examples/question_generation/unimo-text/gen_utils.py
@@ -16,10 +16,10 @@
 from functools import partial
 
 import numpy as np
-
 import paddle
 import paddle.distributed as dist
-from paddle.io import DataLoader, DistributedBatchSampler, BatchSampler
+from paddle.io import BatchSampler, DataLoader, DistributedBatchSampler
+
 from paddlenlp.data import Pad
diff --git a/examples/question_generation/unimo-text/predict.py b/legacy/examples/question_generation/unimo-text/predict.py
similarity index 100%
rename from examples/question_generation/unimo-text/predict.py
rename to legacy/examples/question_generation/unimo-text/predict.py
diff --git a/examples/question_generation/unimo-text/requirements.txt b/legacy/examples/question_generation/unimo-text/requirements.txt
similarity index 100%
rename from examples/question_generation/unimo-text/requirements.txt
rename to legacy/examples/question_generation/unimo-text/requirements.txt
diff --git a/examples/question_generation/unimo-text/train.py b/legacy/examples/question_generation/unimo-text/train.py
similarity index 100%
rename from examples/question_generation/unimo-text/train.py
rename to legacy/examples/question_generation/unimo-text/train.py
diff --git a/examples/semantic_indexing/NQdataset.py b/legacy/examples/semantic_indexing/NQdataset.py
similarity index 98%
rename from examples/semantic_indexing/NQdataset.py
rename to legacy/examples/semantic_indexing/NQdataset.py
index 58efe8156ce1..ca1de6adc23f 100644
--- a/examples/semantic_indexing/NQdataset.py
+++ b/legacy/examples/semantic_indexing/NQdataset.py
@@ -86,7 +86,7 @@ def _read_json_data(self, dataPath):
     def __getitem__(self, index):
         json_sample_data = self.data[index]
         r = BiEncoderSample()
-        r.query = self._porcess_query(json_sample_data["question"])
+        r.query = self._process_query(json_sample_data["question"])
 
         positive_ctxs = json_sample_data["positive_ctxs"]
 
@@ -106,7 +106,7 @@ def create_passage(ctx):
 
         return r
 
-    def _porcess_query(self, query):
+    def _process_query(self, query):
         query = normalize_question(query)
 
         if self.query_special_suffix and not query.endswith(self.query_special_suffix):
diff --git a/examples/semantic_indexing/README.md b/legacy/examples/semantic_indexing/README.md
similarity index 73%
rename from examples/semantic_indexing/README.md
rename to legacy/examples/semantic_indexing/README.md
index 9b37dd24b737..f411744ceb0f 100644
--- a/examples/semantic_indexing/README.md
+++ b/legacy/examples/semantic_indexing/README.md
@@ -7,21 +7,21 @@
 我们基于 ERNIE1.0 热启,分别采用 [In-batch negatives](https://arxiv.org/abs/2004.04906) 策略和 HardestNeg 策略开源了 [batch_neg_v1.0](https://bj.bcebos.com/paddlenlp/models/semantic_index/batch_neg_v1.0.tar) 和 [hardest_neg_v1.0](https://bj.bcebos.com/paddlenlp/models/semantic_index/hardest_neg_v1.0.tar) 模型,相比 Baseline 模型效果有显著提升:
 
 ## 效果评估
-| 模型 | Recall@10 | Recall@50 |策略简要说明|
-| ------------ | ------------ | ------------ |--------- |
-| Baseline | 46.99 | 60.84 | 标准 pair-wise 训练范式,通过随机采样产生负样本|
-| [In-batch negatives](https://arxiv.org/abs/2004.04906) | 51.20(**+4.21**) | 67.24(**+6.4**) | 在 Batch 内同时使用 batch_size 个负样本进行训练|
-| HardestNeg | 50.22(**+3.23**) | 65.17(**+4.33**) | 在 Batch 内先挖掘最难负样本,然后进行 pair-wise 训练 |
+| 模型 | Recall@10 | Recall@50 | 策略简要说明 |
+|--------------------------------------------------------|------------------|------------------|---------------------------------------------------------------------------------------|
+| Baseline | 46.99 | 60.84 | 标准 pair-wise 训练范式,通过随机采样产生负样本 |
+| [In-batch negatives](https://arxiv.org/abs/2004.04906) | 51.20(**+4.21**) | 67.24(**+6.4**) | 在 Batch 内同时使用 batch_size 个负样本进行训练 |
+| HardestNeg | 50.22(**+3.23**) | 65.17(**+4.33**) | 在 Batch 内先挖掘最难负样本,然后进行 pair-wise 训练 |
 
 ## 语义索引预训练模型下载
 以下模型结构参数为: TrasformerLayer:12, Hidden:768, Heads:12, OutputEmbSize: 256
-|Model|训练参数配置|硬件|MD5|
-| ------------ | ------------ | ------------ |-----------|
-|[batch_neg_v1.0](https://bj.bcebos.com/paddlenlp/models/semantic_index/batch_neg_v1.0.tar)| margin:0.2 scale:30 epoch:3 lr:5E-5 bs:128 max_len:64 | 单卡v100-16g |da1bb1487bd3fd6a53b8ef95c278f3e6|
-|[hardest_neg_v1.0](https://bj.bcebos.com/paddlenlp/models/semantic_index/hardest_neg_v1.0.tar)|margin:0.2 epoch:3 lr:5E-5 bs:128 max_len:64 |单卡v100-16g|b535d890110ea608c8562c525a0b84b5|
+| Model | 训练参数配置 | 硬件 | MD5 |
+|------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------|----------------------------------------------|----------------------------------|
+| [batch_neg_v1.0](https://bj.bcebos.com/paddlenlp/models/semantic_index/batch_neg_v1.0.tar) | margin:0.2 scale:30 epoch:3 lr:5E-5 bs:128 max_len:64 | 单卡v100-16g | da1bb1487bd3fd6a53b8ef95c278f3e6 |
+| [hardest_neg_v1.0](https://bj.bcebos.com/paddlenlp/models/semantic_index/hardest_neg_v1.0.tar) | margin:0.2 epoch:3 lr:5E-5 bs:128 max_len:64 | 单卡v100-16g | b535d890110ea608c8562c525a0b84b5 |
 
 ## 数据准备
@@ -47,11 +47,11 @@
 ### 数据下载
-|数据|描述|数量|MD5|
-| ------------ | ------------ | ------------ | -------- |
-| [训练集(semantic_pair_train.tsv)](https://bj.bcebos.com/paddlenlp/models/semantic_index/semantic_pair_train.tsv) | 每行为语义相似的文本 Pair 构成的训练集 |222546|590286f695200160350cc5838cb34f00|
-|[评估集(same_semantic.tsv)](https://bj.bcebos.com/paddlenlp/models/semantic_index/same_semantic.tsv)|每行为语义相似文本 Pair 构成的评估集|10255|86ec1fd5234d944177574372dcf780c5|
-|[召回库(corpus_file)](https://bj.bcebos.com/paddlenlp/models/semantic_index/corpus_file)|每行为单条文本构成的召回库|313714|a3fbc3421b5aeb939809876fc7beeaa8|
+| 数据 | 描述 | 数量 | MD5 |
+|--------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------|--------|----------------------------------|
+| [训练集(semantic_pair_train.tsv)](https://bj.bcebos.com/paddlenlp/models/semantic_index/semantic_pair_train.tsv) | 每行为语义相似的文本 Pair 构成的训练集 | 222546 | 590286f695200160350cc5838cb34f00 |
+| [评估集(same_semantic.tsv)](https://bj.bcebos.com/paddlenlp/models/semantic_index/same_semantic.tsv) | 每行为语义相似文本 Pair 构成的评估集 | 10255 | 86ec1fd5234d944177574372dcf780c5 |
+| [召回库(corpus_file)](https://bj.bcebos.com/paddlenlp/models/semantic_index/corpus_file) | 每行为单条文本构成的召回库 | 313714 | a3fbc3421b5aeb939809876fc7beeaa8 |
 
 ## 项目依赖:
@@ -242,17 +242,17 @@ python -u -m paddle.distributed.launch --gpus "0" \
 详细性能评测数据如下表:
-| batch size | max_seq_len | Paddle 前向(ms)|FT FP32(ms) | FT FP16(ms) |Speedup(FT FP32/Paddle)|Speedup(FT FP16/Paddle)|
-| ---------- | ----------- | ------------------- | ------------------- |------------------ |------------------ |------------------ |
-| 16 | 16 | 23.56 | 5.40 | 5.38 | 4.36| 4.38|
-| 16 | 32 | 22.34 | 8.11 | 5.57|2.75|4.01|
-| 16 | 64 | 22.79 | 14.84 |5.39|1.54|4.23|
-| 32 | 16 | 23.41 | 8.16 |5.30|2.87|4.42|
-| 32 | 32 | 22.67 | 14.84 |6.21|1.53|3.65|
-| 32 | 64 | 33.49 | 28.53 |6.05|1.17|5.54|
-| 64 | 16 | 22.60 | 14.81 |5.59|1.53|4.04|
-| 64 | 32 | 33.52 | 28.22 |6.24|1.19|5.37|
-| 64 | 64 | 62.62 | 55.25 |11.55|1.13|5.42|
+| batch size | max_seq_len | Paddle 前向(ms) | FT FP32(ms) | FT FP16(ms) | Speedup(FT FP32/Paddle) | Speedup(FT FP16/Paddle) |
+|------------|-------------|-----------------|-------------|-------------|-------------------------|-------------------------|
+| 16 | 16 | 23.56 | 5.40 | 5.38 | 4.36 | 4.38 |
+| 16 | 32 | 22.34 | 8.11 | 5.57 | 2.75 | 4.01 |
+| 16 | 64 | 22.79 | 14.84 | 5.39 | 1.54 | 4.23 |
+| 32 | 16 | 23.41 | 8.16 | 5.30 | 2.87 | 4.42 |
+| 32 | 32 | 22.67 | 14.84 | 6.21 | 1.53 | 3.65 |
+| 32 | 64 | 33.49 | 28.53 | 6.05 | 1.17 | 5.54 |
+| 64 | 16 | 22.60 | 14.81 | 5.59 | 1.53 | 4.04 |
+| 64 | 32 | 33.52 | 28.22 | 6.24 | 1.19 | 5.37 |
+| 64 | 64 | 62.62 | 55.25 | 11.55 | 1.13 | 5.42 |
 
 Note: 测试环境如下
 ```
diff --git a/examples/semantic_indexing/README_gradient_cache.md b/legacy/examples/semantic_indexing/README_gradient_cache.md
similarity index 100%
rename from examples/semantic_indexing/README_gradient_cache.md
rename to legacy/examples/semantic_indexing/README_gradient_cache.md
diff --git a/examples/semantic_indexing/ance/model.py b/legacy/examples/semantic_indexing/ance/model.py
similarity index 100%
rename from examples/semantic_indexing/ance/model.py
rename to legacy/examples/semantic_indexing/ance/model.py
diff --git a/examples/semantic_indexing/ann_util.py b/legacy/examples/semantic_indexing/ann_util.py
similarity index 99%
rename from examples/semantic_indexing/ann_util.py
rename to legacy/examples/semantic_indexing/ann_util.py
index 55c608d3e58c..652d38c91010 100644
--- a/examples/semantic_indexing/ann_util.py
+++ b/legacy/examples/semantic_indexing/ann_util.py
@@ -14,8 +14,9 @@
 
 # coding=UTF-8
 
-import numpy as np
 import hnswlib
+import numpy as np
+
 from paddlenlp.utils.log import logger
diff --git a/examples/semantic_indexing/base_model.py b/legacy/examples/semantic_indexing/base_model.py
similarity index 100%
rename from examples/semantic_indexing/base_model.py
rename to legacy/examples/semantic_indexing/base_model.py
diff --git a/examples/semantic_indexing/batch_negative/model.py b/legacy/examples/semantic_indexing/batch_negative/model.py
similarity index 94%
rename from examples/semantic_indexing/batch_negative/model.py
rename to legacy/examples/semantic_indexing/batch_negative/model.py
index fd87c6d8363e..a091f0d2d730 100644
--- a/examples/semantic_indexing/batch_negative/model.py
+++ b/legacy/examples/semantic_indexing/batch_negative/model.py
@@ -23,7 +23,7 @@ def __init__(self, pretrained_model, dropout=None, margin=0.3, scale=30, output_
         self.margin = margin
         # Used scaling cosine similarity to ease converge
-        self.sacle = scale
+        self.scale = scale
 
     def forward(
         self,
@@ -47,7 +47,7 @@ def forward(
 
         cosine_sim = paddle.matmul(query_cls_embedding, title_cls_embedding, transpose_y=True)
 
-        # substract margin from all positive samples cosine_sim()
+        # subtract margin from all positive samples cosine_sim()
         margin_diag = paddle.full(
             shape=[query_cls_embedding.shape[0]], fill_value=self.margin, dtype=paddle.get_default_dtype()
         )
@@ -55,7 +55,7 @@ def forward(
         cosine_sim = cosine_sim - paddle.diag(margin_diag)
 
         # scale cosine to ease training converge
-        cosine_sim *= self.sacle
+        cosine_sim *= self.scale
 
         labels = paddle.arange(0, query_cls_embedding.shape[0], dtype="int64")
         labels = paddle.reshape(labels, shape=[-1, 1])
diff --git a/examples/semantic_indexing/biencoder_base_model.py b/legacy/examples/semantic_indexing/biencoder_base_model.py
similarity index 100%
rename from examples/semantic_indexing/biencoder_base_model.py
rename to legacy/examples/semantic_indexing/biencoder_base_model.py
diff --git a/examples/semantic_indexing/data.py b/legacy/examples/semantic_indexing/data.py
similarity index 94%
rename from examples/semantic_indexing/data.py
rename to legacy/examples/semantic_indexing/data.py
index c8e2e232f370..7bc340d3bcb0 100644
--- a/examples/semantic_indexing/data.py
+++ b/legacy/examples/semantic_indexing/data.py
@@ -84,7 +84,7 @@ def read_text_triplet(data_path):
 # ANN - active learning ------------------------------------------------------
 def get_latest_checkpoint(args):
     """
-    Return: (latest_checkpint_path, global_step)
+    Return: (latest_checkpoint_path, global_step)
     """
     if not os.path.exists(args.save_dir):
         return args.init_from_ckpt, 0
@@ -114,7 +114,7 @@ def get_latest_ann_data(ann_data_dir):
     def valid_checkpoint(step):
         ann_data_file = os.path.join(ann_data_dir, step, "new_ann_data")
-        # succed_flag_file is an empty file that indicates ann data has been generated
+        # succeed_flag_file is an empty file that indicates ann data has been generated
         succeed_flag_file = os.path.join(ann_data_dir, step, "succeed_flag_file")
         return os.path.exists(succeed_flag_file) and os.path.exists(ann_data_file)
 
@@ -122,7 +122,7 @@ def valid_checkpoint(step):
     if len(ann_data_steps) > 0:
         latest_ann_data_file = os.path.join(ann_data_dir, str(max(ann_data_steps)), "new_ann_data")
-        logger.info("Using lateset ann_data_file:{}".format(latest_ann_data_file))
+        logger.info("Using latest ann_data_file:{}".format(latest_ann_data_file))
         return latest_ann_data_file, max(ann_data_steps)
 
     logger.info("no new ann_data, return (None, -1)")
@@ -142,8 +142,8 @@ def gen_text_file(similar_text_pair_file):
     texts = []
     with open(similar_text_pair_file, "r", encoding="utf-8") as f:
         for line in f:
-            splited_line = line.rstrip().split("\t")
-            if len(splited_line) != 2:
+            splitted_line = line.rstrip().split("\t")
+            if len(splitted_line) != 2:
                 continue
 
             text, similar_text = line.rstrip().split("\t")
diff --git a/examples/semantic_indexing/dense_retriever.py b/legacy/examples/semantic_indexing/dense_retriever.py
similarity index 100%
rename from examples/semantic_indexing/dense_retriever.py
rename to legacy/examples/semantic_indexing/dense_retriever.py
diff --git a/examples/semantic_indexing/evaluate.py b/legacy/examples/semantic_indexing/evaluate.py
similarity index 95%
rename from examples/semantic_indexing/evaluate.py
rename to legacy/examples/semantic_indexing/evaluate.py
index bfa086e521c9..2c1fc5a949df 100644
--- a/examples/semantic_indexing/evaluate.py
+++ b/legacy/examples/semantic_indexing/evaluate.py
@@ -21,7 +21,7 @@
     "--similar_text_pair",
     type=str,
     default="",
-    help="The full path of similat pair file",
+    help="The full path of similar pair file",
 )
 parser.add_argument(
     "--recall_result_file",
@@ -33,7 +33,7 @@
     "--recall_num",
     type=int,
     default=10,
-    help="Most similair number of doc recalled from corpus per query",
+    help="Most similar number of doc recalled from corpus per query",
 )
 args = parser.parse_args()
diff --git a/examples/semantic_indexing/faiss_indexer.py b/legacy/examples/semantic_indexing/faiss_indexer.py
similarity index 97%
rename from examples/semantic_indexing/faiss_indexer.py
rename to legacy/examples/semantic_indexing/faiss_indexer.py
index a0a2eb9aa7db..3ab0b18f0c9a 100644
--- a/examples/semantic_indexing/faiss_indexer.py
+++ b/legacy/examples/semantic_indexing/faiss_indexer.py
@@ -20,14 +20,14 @@
 # This source code is licensed under the license found in the
 # LICENSE file in the root directory of this source tree.
 """
- FAISS-based index components for dense retriver
+ FAISS-based index components for dense retriever
 """
 
-import os
-import time
 import logging
+import os
 import pickle
-from typing import List, Tuple, Iterator
+import time
+from typing import Iterator, List, Tuple
 
 import faiss
 import numpy as np
@@ -143,7 +143,7 @@ def __init__(
         super(DenseHNSWFlatIndexer, self).__init__(buffer_size=buffer_size)
 
         # IndexHNSWFlat supports L2 similarity only
-        # so we have to apply DOT -> L2 similairy space conversion with the help of an extra dimension
+        # so we have to apply DOT -> L2 similarity space conversion with the help of an extra dimension
         index = faiss.IndexHNSWFlat(vector_sz + 1, store_n)
         index.hnsw.efSearch = ef_search
         index.hnsw.efConstruction = ef_construction
diff --git a/examples/semantic_indexing/fast_predict.py b/legacy/examples/semantic_indexing/fast_predict.py
similarity index 100%
rename from examples/semantic_indexing/fast_predict.py
rename to legacy/examples/semantic_indexing/fast_predict.py
diff --git a/examples/semantic_indexing/generate_dense_embeddings.py b/legacy/examples/semantic_indexing/generate_dense_embeddings.py
similarity index 100%
rename from examples/semantic_indexing/generate_dense_embeddings.py
rename to legacy/examples/semantic_indexing/generate_dense_embeddings.py
diff --git a/examples/semantic_indexing/gradient_cache/model.py b/legacy/examples/semantic_indexing/gradient_cache/model.py
similarity index 96%
rename from examples/semantic_indexing/gradient_cache/model.py
rename to legacy/examples/semantic_indexing/gradient_cache/model.py
index 04745d097889..9c5388c7665e 100644
--- a/examples/semantic_indexing/gradient_cache/model.py
+++ b/legacy/examples/semantic_indexing/gradient_cache/model.py
@@ -22,7 +22,7 @@ def __init__(self, pretrained_model, dropout=None, margin=0.3, scale=30, output_
         super().__init__(pretrained_model, dropout, output_emb_size)
         self.margin = margin
         # Used scaling cosine similarity to ease converge
-        self.sacle = scale
+        self.scale = scale
 
     def get_pooled_embedding_with_no_grad(
         self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None
@@ -77,7 +77,7 @@ def forward(
 
         cosine_sim = paddle.matmul(query_cls_embedding, title_cls_embedding, transpose_y=True)
 
-        # substract margin from all positive samples cosine_sim()
+        # subtract margin from all positive samples cosine_sim()
         margin_diag = paddle.full(
             shape=[query_cls_embedding.shape[0]], fill_value=self.margin, dtype=paddle.get_default_dtype()
         )
@@ -85,7 +85,7 @@ def forward(
         cosine_sim = cosine_sim - paddle.diag(margin_diag)
 
         # scale cosine to ease training converge
-        cosine_sim *= self.sacle
+        cosine_sim *= self.scale
 
         labels = paddle.arange(0, query_cls_embedding.shape[0], dtype="int64")
         labels = paddle.reshape(labels, shape=[-1, 1])
diff --git a/examples/semantic_indexing/hardest_negative/model.py b/legacy/examples/semantic_indexing/hardest_negative/model.py
similarity index 89%
rename from examples/semantic_indexing/hardest_negative/model.py
rename to legacy/examples/semantic_indexing/hardest_negative/model.py
index ce4db41341c2..3e7676e4214b 100644
--- a/examples/semantic_indexing/hardest_negative/model.py
+++ b/legacy/examples/semantic_indexing/hardest_negative/model.py
@@ -46,12 +46,12 @@ def forward(
 
         pos_sim = paddle.max(cosine_sim, axis=-1)
 
-        # subtract 10000 from all diagnal elements of cosine_sim
-        mask_socre = paddle.full(
+        # subtract 10000 from all diagonal elements of cosine_sim
+        mask_score = paddle.full(
             shape=[query_cls_embedding.shape[0]], fill_value=10000, dtype=paddle.get_default_dtype()
         )
-        tmp_cosin_sim = cosine_sim - paddle.diag(mask_socre)
-        hardest_negative_sim = paddle.max(tmp_cosin_sim, axis=-1)
+        tmp_cosine_sim = cosine_sim - paddle.diag(mask_score)
+        hardest_negative_sim = paddle.max(tmp_cosine_sim, axis=-1)
 
         labels = paddle.full(shape=[query_cls_embedding.shape[0]], fill_value=1.0, dtype="float32")
diff --git a/examples/semantic_indexing/predict.py b/legacy/examples/semantic_indexing/predict.py
similarity index 98%
rename from examples/semantic_indexing/predict.py
rename to legacy/examples/semantic_indexing/predict.py
index 741bb4ffdf45..0459c0120922 100644
--- a/examples/semantic_indexing/predict.py
+++ b/legacy/examples/semantic_indexing/predict.py
@@ -116,6 +116,6 @@ def predict(model, data_loader):
     if args.use_fp16:
         convert_to_fp16(model.ptm.encoder)
 
-    cosin_sim = predict(model, valid_data_loader)
-    for idx, cosine in enumerate(cosin_sim):
+    cosine_sim = predict(model, valid_data_loader)
+    for idx, cosine in enumerate(cosine_sim):
         print("{}".format(cosine))
diff --git a/examples/semantic_indexing/qa_validation.py b/legacy/examples/semantic_indexing/qa_validation.py
similarity index 97%
rename from examples/semantic_indexing/qa_validation.py
rename to legacy/examples/semantic_indexing/qa_validation.py
index e4be203ec57a..499730673002 100644
--- a/examples/semantic_indexing/qa_validation.py
+++ b/legacy/examples/semantic_indexing/qa_validation.py
@@ -12,7 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
""" - Set of utilities for Q&A results validation tasks - Retriver passage validation and Reader predicted answer validation + Set of utilities for Q&A results validation tasks - Retriever passage validation and Reader predicted answer validation """ import collections @@ -21,7 +21,8 @@ import unicodedata from functools import partial from multiprocessing import Pool as ProcessPool -from typing import Tuple, List, Dict +from typing import Dict, List, Tuple + import regex as re from tokenizers import SimpleTokenizer diff --git a/examples/semantic_indexing/recall.py b/legacy/examples/semantic_indexing/recall.py similarity index 100% rename from examples/semantic_indexing/recall.py rename to legacy/examples/semantic_indexing/recall.py diff --git a/examples/semantic_indexing/requirements.txt b/legacy/examples/semantic_indexing/requirements.txt similarity index 100% rename from examples/semantic_indexing/requirements.txt rename to legacy/examples/semantic_indexing/requirements.txt diff --git a/examples/semantic_indexing/run_ann_data_gen.py b/legacy/examples/semantic_indexing/run_ann_data_gen.py similarity index 100% rename from examples/semantic_indexing/run_ann_data_gen.py rename to legacy/examples/semantic_indexing/run_ann_data_gen.py diff --git a/examples/semantic_indexing/tokenizers.py b/legacy/examples/semantic_indexing/tokenizers.py similarity index 100% rename from examples/semantic_indexing/tokenizers.py rename to legacy/examples/semantic_indexing/tokenizers.py diff --git a/examples/semantic_indexing/train_ance.py b/legacy/examples/semantic_indexing/train_ance.py similarity index 100% rename from examples/semantic_indexing/train_ance.py rename to legacy/examples/semantic_indexing/train_ance.py diff --git a/examples/semantic_indexing/train_batch_neg.py b/legacy/examples/semantic_indexing/train_batch_neg.py similarity index 100% rename from examples/semantic_indexing/train_batch_neg.py rename to legacy/examples/semantic_indexing/train_batch_neg.py diff --git a/examples/semantic_indexing/train_gradient_cache.py b/legacy/examples/semantic_indexing/train_gradient_cache.py similarity index 100% rename from examples/semantic_indexing/train_gradient_cache.py rename to legacy/examples/semantic_indexing/train_gradient_cache.py diff --git a/examples/semantic_indexing/train_gradient_cache_DPR.py b/legacy/examples/semantic_indexing/train_gradient_cache_DPR.py similarity index 100% rename from examples/semantic_indexing/train_gradient_cache_DPR.py rename to legacy/examples/semantic_indexing/train_gradient_cache_DPR.py diff --git a/examples/semantic_indexing/train_hardest_neg.py b/legacy/examples/semantic_indexing/train_hardest_neg.py similarity index 100% rename from examples/semantic_indexing/train_hardest_neg.py rename to legacy/examples/semantic_indexing/train_hardest_neg.py diff --git a/examples/sentiment_analysis/skep/README.md b/legacy/examples/sentiment_analysis/skep/README.md similarity index 100% rename from examples/sentiment_analysis/skep/README.md rename to legacy/examples/sentiment_analysis/skep/README.md diff --git a/examples/sentiment_analysis/skep/deploy/python/predict.py b/legacy/examples/sentiment_analysis/skep/deploy/python/predict.py similarity index 100% rename from examples/sentiment_analysis/skep/deploy/python/predict.py rename to legacy/examples/sentiment_analysis/skep/deploy/python/predict.py diff --git a/examples/sentiment_analysis/skep/export_model.py b/legacy/examples/sentiment_analysis/skep/export_model.py similarity index 100% rename from 
examples/sentiment_analysis/skep/export_model.py rename to legacy/examples/sentiment_analysis/skep/export_model.py diff --git a/examples/sentiment_analysis/skep/predict_aspect.py b/legacy/examples/sentiment_analysis/skep/predict_aspect.py similarity index 100% rename from examples/sentiment_analysis/skep/predict_aspect.py rename to legacy/examples/sentiment_analysis/skep/predict_aspect.py diff --git a/examples/sentiment_analysis/skep/predict_opinion.py b/legacy/examples/sentiment_analysis/skep/predict_opinion.py similarity index 100% rename from examples/sentiment_analysis/skep/predict_opinion.py rename to legacy/examples/sentiment_analysis/skep/predict_opinion.py diff --git a/examples/sentiment_analysis/skep/predict_sentence.py b/legacy/examples/sentiment_analysis/skep/predict_sentence.py similarity index 100% rename from examples/sentiment_analysis/skep/predict_sentence.py rename to legacy/examples/sentiment_analysis/skep/predict_sentence.py diff --git a/examples/sentiment_analysis/skep/train_aspect.py b/legacy/examples/sentiment_analysis/skep/train_aspect.py similarity index 100% rename from examples/sentiment_analysis/skep/train_aspect.py rename to legacy/examples/sentiment_analysis/skep/train_aspect.py diff --git a/examples/sentiment_analysis/skep/train_opinion.py b/legacy/examples/sentiment_analysis/skep/train_opinion.py similarity index 100% rename from examples/sentiment_analysis/skep/train_opinion.py rename to legacy/examples/sentiment_analysis/skep/train_opinion.py diff --git a/examples/sentiment_analysis/skep/train_sentence.py b/legacy/examples/sentiment_analysis/skep/train_sentence.py similarity index 100% rename from examples/sentiment_analysis/skep/train_sentence.py rename to legacy/examples/sentiment_analysis/skep/train_sentence.py diff --git a/examples/simultaneous_translation/stacl/README.md b/legacy/examples/simultaneous_translation/stacl/README.md similarity index 91% rename from examples/simultaneous_translation/stacl/README.md rename to legacy/examples/simultaneous_translation/stacl/README.md index 554b251a4b61..1bd391caa538 100644 --- a/examples/simultaneous_translation/stacl/README.md +++ b/legacy/examples/simultaneous_translation/stacl/README.md @@ -133,13 +133,13 @@ perl mosesdecoder/scripts/generic/multi-bleu.perl newstest2017.tok.en < predict. 
## 模型下载(更新中) 我们提供基于NIST(中->英,共2M中英句对)预训练模型,供大家下载,下载后需解压使用。 -| Wait-k策略 | 模型连接 | 4-ref BLEU on NIST 2008| -| ------------ | --------------- |---------| -| Wait-1 | [下载](https://bj.bcebos.com/paddlenlp/models/stacl/nist_zhen_full_w1.tar.gz) |30.94| -| Wait-3 |[下载](https://bj.bcebos.com/paddlenlp/models/stacl/nist_zhen_full_w3.tar.gz) |34.24 | -| Wait-5 |[下载](https://bj.bcebos.com/paddlenlp/models/stacl/nist_zhen_full_w5.tar.gz) |36.30 | -| Wait-7 |[下载](https://bj.bcebos.com/paddlenlp/models/stacl/nist_zhen_full_w7.tar.gz) |37.84 | -| Wait_-1(整句模型) |[下载](https://bj.bcebos.com/paddlenlp/models/stacl/nist_zhen_full_sent.tar.gz) |41.41 | +| Wait-k策略 | 模型连接 | 4-ref BLEU on NIST 2008 | +|-------------------|---------------------------------------------------------------------------------|-------------------------| +| Wait-1 | [下载](https://bj.bcebos.com/paddlenlp/models/stacl/nist_zhen_full_w1.tar.gz) | 30.94 | +| Wait-3 | [下载](https://bj.bcebos.com/paddlenlp/models/stacl/nist_zhen_full_w3.tar.gz) | 34.24 | +| Wait-5 | [下载](https://bj.bcebos.com/paddlenlp/models/stacl/nist_zhen_full_w5.tar.gz) | 36.30 | +| Wait-7 | [下载](https://bj.bcebos.com/paddlenlp/models/stacl/nist_zhen_full_w7.tar.gz) | 37.84 | +| Wait_-1(整句模型) | [下载](https://bj.bcebos.com/paddlenlp/models/stacl/nist_zhen_full_sent.tar.gz) | 41.41 | 词表下载:[source vocab](https://bj.bcebos.com/paddlenlp/models/stacl/nist.20k.zh.vocab) ,[target vocab](https://bj.bcebos.com/paddlenlp/models/stacl/nist.10k.en.vocab) ## Demo展示 diff --git a/examples/simultaneous_translation/stacl/config/transformer.yaml b/legacy/examples/simultaneous_translation/stacl/config/transformer.yaml similarity index 100% rename from examples/simultaneous_translation/stacl/config/transformer.yaml rename to legacy/examples/simultaneous_translation/stacl/config/transformer.yaml diff --git a/examples/simultaneous_translation/stacl/demo/README.md b/legacy/examples/simultaneous_translation/stacl/demo/README.md similarity index 100% rename from examples/simultaneous_translation/stacl/demo/README.md rename to legacy/examples/simultaneous_translation/stacl/demo/README.md diff --git a/examples/simultaneous_translation/stacl/demo/README_ai.md b/legacy/examples/simultaneous_translation/stacl/demo/README_ai.md similarity index 100% rename from examples/simultaneous_translation/stacl/demo/README_ai.md rename to legacy/examples/simultaneous_translation/stacl/demo/README_ai.md diff --git a/examples/simultaneous_translation/stacl/demo/const.py b/legacy/examples/simultaneous_translation/stacl/demo/const.py similarity index 100% rename from examples/simultaneous_translation/stacl/demo/const.py rename to legacy/examples/simultaneous_translation/stacl/demo/const.py diff --git a/examples/simultaneous_translation/stacl/demo/demo.py b/legacy/examples/simultaneous_translation/stacl/demo/demo.py similarity index 100% rename from examples/simultaneous_translation/stacl/demo/demo.py rename to legacy/examples/simultaneous_translation/stacl/demo/demo.py diff --git a/examples/simultaneous_translation/stacl/demo/images/paddlenlp.png b/legacy/examples/simultaneous_translation/stacl/demo/images/paddlenlp.png similarity index 100% rename from examples/simultaneous_translation/stacl/demo/images/paddlenlp.png rename to legacy/examples/simultaneous_translation/stacl/demo/images/paddlenlp.png diff --git a/examples/simultaneous_translation/stacl/demo/images/speech_demo_show.gif b/legacy/examples/simultaneous_translation/stacl/demo/images/speech_demo_show.gif similarity index 100% rename from 
examples/simultaneous_translation/stacl/demo/images/speech_demo_show.gif rename to legacy/examples/simultaneous_translation/stacl/demo/images/speech_demo_show.gif diff --git a/examples/simultaneous_translation/stacl/demo/images/step1.png b/legacy/examples/simultaneous_translation/stacl/demo/images/step1.png similarity index 100% rename from examples/simultaneous_translation/stacl/demo/images/step1.png rename to legacy/examples/simultaneous_translation/stacl/demo/images/step1.png diff --git a/examples/simultaneous_translation/stacl/demo/images/step2.png b/legacy/examples/simultaneous_translation/stacl/demo/images/step2.png similarity index 100% rename from examples/simultaneous_translation/stacl/demo/images/step2.png rename to legacy/examples/simultaneous_translation/stacl/demo/images/step2.png diff --git a/examples/simultaneous_translation/stacl/demo/images/step3.png b/legacy/examples/simultaneous_translation/stacl/demo/images/step3.png similarity index 100% rename from examples/simultaneous_translation/stacl/demo/images/step3.png rename to legacy/examples/simultaneous_translation/stacl/demo/images/step3.png diff --git a/examples/simultaneous_translation/stacl/demo/images/step4.png b/legacy/examples/simultaneous_translation/stacl/demo/images/step4.png similarity index 100% rename from examples/simultaneous_translation/stacl/demo/images/step4.png rename to legacy/examples/simultaneous_translation/stacl/demo/images/step4.png diff --git a/examples/simultaneous_translation/stacl/demo/images/step5.png b/legacy/examples/simultaneous_translation/stacl/demo/images/step5.png similarity index 100% rename from examples/simultaneous_translation/stacl/demo/images/step5.png rename to legacy/examples/simultaneous_translation/stacl/demo/images/step5.png diff --git a/examples/simultaneous_translation/stacl/demo/images/step6.png b/legacy/examples/simultaneous_translation/stacl/demo/images/step6.png similarity index 100% rename from examples/simultaneous_translation/stacl/demo/images/step6.png rename to legacy/examples/simultaneous_translation/stacl/demo/images/step6.png diff --git a/examples/simultaneous_translation/stacl/demo/images/step7.png b/legacy/examples/simultaneous_translation/stacl/demo/images/step7.png similarity index 100% rename from examples/simultaneous_translation/stacl/demo/images/step7.png rename to legacy/examples/simultaneous_translation/stacl/demo/images/step7.png diff --git a/examples/simultaneous_translation/stacl/demo/images/step8.png b/legacy/examples/simultaneous_translation/stacl/demo/images/step8.png similarity index 100% rename from examples/simultaneous_translation/stacl/demo/images/step8.png rename to legacy/examples/simultaneous_translation/stacl/demo/images/step8.png diff --git a/examples/simultaneous_translation/stacl/demo/images/text_demo_show.gif b/legacy/examples/simultaneous_translation/stacl/demo/images/text_demo_show.gif similarity index 100% rename from examples/simultaneous_translation/stacl/demo/images/text_demo_show.gif rename to legacy/examples/simultaneous_translation/stacl/demo/images/text_demo_show.gif diff --git a/examples/simultaneous_translation/stacl/demo/model_demo.py b/legacy/examples/simultaneous_translation/stacl/demo/model_demo.py similarity index 100% rename from examples/simultaneous_translation/stacl/demo/model_demo.py rename to legacy/examples/simultaneous_translation/stacl/demo/model_demo.py diff --git a/examples/simultaneous_translation/stacl/demo/requirements.txt b/legacy/examples/simultaneous_translation/stacl/demo/requirements.txt similarity 
index 100% rename from examples/simultaneous_translation/stacl/demo/requirements.txt rename to legacy/examples/simultaneous_translation/stacl/demo/requirements.txt diff --git a/examples/simultaneous_translation/stacl/demo/transformer_demo.yaml b/legacy/examples/simultaneous_translation/stacl/demo/transformer_demo.yaml similarity index 100% rename from examples/simultaneous_translation/stacl/demo/transformer_demo.yaml rename to legacy/examples/simultaneous_translation/stacl/demo/transformer_demo.yaml diff --git a/examples/simultaneous_translation/stacl/images/STACL_architecture.png b/legacy/examples/simultaneous_translation/stacl/images/STACL_architecture.png similarity index 100% rename from examples/simultaneous_translation/stacl/images/STACL_architecture.png rename to legacy/examples/simultaneous_translation/stacl/images/STACL_architecture.png diff --git a/examples/simultaneous_translation/stacl/images/example.png b/legacy/examples/simultaneous_translation/stacl/images/example.png similarity index 100% rename from examples/simultaneous_translation/stacl/images/example.png rename to legacy/examples/simultaneous_translation/stacl/images/example.png diff --git a/examples/simultaneous_translation/stacl/model.py b/legacy/examples/simultaneous_translation/stacl/model.py similarity index 100% rename from examples/simultaneous_translation/stacl/model.py rename to legacy/examples/simultaneous_translation/stacl/model.py diff --git a/examples/simultaneous_translation/stacl/predict.py b/legacy/examples/simultaneous_translation/stacl/predict.py similarity index 99% rename from examples/simultaneous_translation/stacl/predict.py rename to legacy/examples/simultaneous_translation/stacl/predict.py index 8f2e3da9e404..e2a6f7256158 100644 --- a/examples/simultaneous_translation/stacl/predict.py +++ b/legacy/examples/simultaneous_translation/stacl/predict.py @@ -12,17 +12,18 @@ # See the License for the specific language governing permissions and # limitations under the License. -import os import argparse +import os from pprint import pprint -import yaml -from attrdict import AttrDict import paddle -from paddlenlp.transformers import position_encoding_init import reader +import yaml +from attrdict import AttrDict from model import SimultaneousTransformer +from paddlenlp.transformers import position_encoding_init + def parse_args(): parser = argparse.ArgumentParser() diff --git a/examples/simultaneous_translation/stacl/reader.py b/legacy/examples/simultaneous_translation/stacl/reader.py similarity index 99% rename from examples/simultaneous_translation/stacl/reader.py rename to legacy/examples/simultaneous_translation/stacl/reader.py index cb71b2bfb212..cb0ab155d8ba 100644 --- a/examples/simultaneous_translation/stacl/reader.py +++ b/legacy/examples/simultaneous_translation/stacl/reader.py @@ -13,8 +13,10 @@ # limitations under the License. 
from functools import partial + from paddle.io import DataLoader -from paddlenlp.data import Vocab, Pad + +from paddlenlp.data import Pad, Vocab from paddlenlp.data.sampler import SamplerHelper from paddlenlp.datasets import load_dataset diff --git a/examples/simultaneous_translation/stacl/requirements.txt b/legacy/examples/simultaneous_translation/stacl/requirements.txt similarity index 100% rename from examples/simultaneous_translation/stacl/requirements.txt rename to legacy/examples/simultaneous_translation/stacl/requirements.txt diff --git a/examples/simultaneous_translation/stacl/train.py b/legacy/examples/simultaneous_translation/stacl/train.py similarity index 99% rename from examples/simultaneous_translation/stacl/train.py rename to legacy/examples/simultaneous_translation/stacl/train.py index 09ecb03001a9..6fd3a80fda71 100644 --- a/examples/simultaneous_translation/stacl/train.py +++ b/legacy/examples/simultaneous_translation/stacl/train.py @@ -12,23 +12,22 @@ # See the License for the specific language governing permissions and # limitations under the License. +import argparse import os import time - -import argparse from pprint import pprint -import numpy as np -import yaml -from attrdict import AttrDict +import numpy as np import paddle import paddle.distributed as dist -from paddlenlp.utils.log import logger - import reader -from model import SimultaneousTransformer, CrossEntropyCriterion +import yaml +from attrdict import AttrDict +from model import CrossEntropyCriterion, SimultaneousTransformer from utils.record import AverageStatistical +from paddlenlp.utils.log import logger + def parse_args(): parser = argparse.ArgumentParser() diff --git a/examples/text_graph/erniesage/models/__init__.py b/legacy/examples/simultaneous_translation/stacl/utils/__init__.py similarity index 80% rename from examples/text_graph/erniesage/models/__init__.py rename to legacy/examples/simultaneous_translation/stacl/utils/__init__.py index 4b02ff01793b..fd05a9208165 100644 --- a/examples/text_graph/erniesage/models/__init__.py +++ b/legacy/examples/simultaneous_translation/stacl/utils/__init__.py @@ -1,4 +1,4 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. +# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved. # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. @@ -11,8 +11,3 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
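Likewise, the reader.py and train.py hunks above only regroup imports; `Pad` and `Vocab` still come from `paddlenlp.data`. As a reminder of what those two helpers do — the vocab file name and special tokens below are placeholders, not values taken from the diff — a minimal sketch:

```python
from paddlenlp.data import Pad, Vocab

# Build a token-to-id mapping from a plain-text vocab file (one token per line).
# The special tokens here are hypothetical examples.
vocab = Vocab.load_vocabulary(
    "vocab_all.bpe.33708", unk_token="<unk>", bos_token="<s>", eos_token="<e>"
)

# Pad is a batchify function: it right-pads variable-length id sequences
# into one rectangular numpy array, here using the eos id as the pad value.
pad = Pad(pad_val=vocab.to_indices("<e>"))
batch = pad([[2, 5, 7], [2, 9]])
print(batch.shape)  # (2, 3)
```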
- -from models import model - -__all__ = [] -__all__ += model.__all__ diff --git a/examples/simultaneous_translation/stacl/utils/record.py b/legacy/examples/simultaneous_translation/stacl/utils/record.py similarity index 100% rename from examples/simultaneous_translation/stacl/utils/record.py rename to legacy/examples/simultaneous_translation/stacl/utils/record.py diff --git a/examples/simultaneous_translation/stacl/utils/tokenizer.py b/legacy/examples/simultaneous_translation/stacl/utils/tokenizer.py similarity index 100% rename from examples/simultaneous_translation/stacl/utils/tokenizer.py rename to legacy/examples/simultaneous_translation/stacl/utils/tokenizer.py diff --git a/examples/torch_migration/README.md b/legacy/examples/torch_migration/README.md similarity index 100% rename from examples/torch_migration/README.md rename to legacy/examples/torch_migration/README.md diff --git a/examples/torch_migration/docs/ThesisReproduction_NLP.md b/legacy/examples/torch_migration/docs/ThesisReproduction_NLP.md similarity index 100% rename from examples/torch_migration/docs/ThesisReproduction_NLP.md rename to legacy/examples/torch_migration/docs/ThesisReproduction_NLP.md diff --git a/examples/torch_migration/pipeline/Step1/README.md b/legacy/examples/torch_migration/pipeline/Step1/README.md similarity index 100% rename from examples/torch_migration/pipeline/Step1/README.md rename to legacy/examples/torch_migration/pipeline/Step1/README.md diff --git a/examples/torch_migration/pipeline/Step1/check_step1.py b/legacy/examples/torch_migration/pipeline/Step1/check_step1.py similarity index 100% rename from examples/torch_migration/pipeline/Step1/check_step1.py rename to legacy/examples/torch_migration/pipeline/Step1/check_step1.py diff --git a/examples/torch_migration/pipeline/Step1/pd_forward_bert.py b/legacy/examples/torch_migration/pipeline/Step1/pd_forward_bert.py similarity index 100% rename from examples/torch_migration/pipeline/Step1/pd_forward_bert.py rename to legacy/examples/torch_migration/pipeline/Step1/pd_forward_bert.py diff --git a/examples/torch_migration/pipeline/Step1/pt_forward_bert.py b/legacy/examples/torch_migration/pipeline/Step1/pt_forward_bert.py similarity index 100% rename from examples/torch_migration/pipeline/Step1/pt_forward_bert.py rename to legacy/examples/torch_migration/pipeline/Step1/pt_forward_bert.py diff --git a/examples/torch_migration/pipeline/Step1/torch2paddle.py b/legacy/examples/torch_migration/pipeline/Step1/torch2paddle.py similarity index 99% rename from examples/torch_migration/pipeline/Step1/torch2paddle.py rename to legacy/examples/torch_migration/pipeline/Step1/torch2paddle.py index 4a2b4977051b..b395486e83eb 100644 --- a/examples/torch_migration/pipeline/Step1/torch2paddle.py +++ b/legacy/examples/torch_migration/pipeline/Step1/torch2paddle.py @@ -17,9 +17,10 @@ import numpy as np import paddle import torch -from paddlenlp.transformers import BertForPretraining as PDBertForMaskedLM from transformers import BertForMaskedLM as PTBertForMaskedLM +from paddlenlp.transformers import BertForPretraining as PDBertForMaskedLM + def convert_pytorch_checkpoint_to_paddle( pytorch_checkpoint_path="pytorch_model.bin", diff --git a/examples/torch_migration/pipeline/Step2/README.md b/legacy/examples/torch_migration/pipeline/Step2/README.md similarity index 100% rename from examples/torch_migration/pipeline/Step2/README.md rename to legacy/examples/torch_migration/pipeline/Step2/README.md diff --git a/examples/torch_migration/pipeline/Step2/accuracy.py 
b/legacy/examples/torch_migration/pipeline/Step2/accuracy.py similarity index 100% rename from examples/torch_migration/pipeline/Step2/accuracy.py rename to legacy/examples/torch_migration/pipeline/Step2/accuracy.py diff --git a/examples/torch_migration/pipeline/Step2/check_step2.py b/legacy/examples/torch_migration/pipeline/Step2/check_step2.py similarity index 100% rename from examples/torch_migration/pipeline/Step2/check_step2.py rename to legacy/examples/torch_migration/pipeline/Step2/check_step2.py diff --git a/examples/torch_migration/pipeline/Step2/demo_sst2_sentence/demo.tsv b/legacy/examples/torch_migration/pipeline/Step2/demo_sst2_sentence/demo.tsv similarity index 100% rename from examples/torch_migration/pipeline/Step2/demo_sst2_sentence/demo.tsv rename to legacy/examples/torch_migration/pipeline/Step2/demo_sst2_sentence/demo.tsv diff --git a/examples/torch_migration/pipeline/Step2/predict.py b/legacy/examples/torch_migration/pipeline/Step2/predict.py similarity index 100% rename from examples/torch_migration/pipeline/Step2/predict.py rename to legacy/examples/torch_migration/pipeline/Step2/predict.py diff --git a/examples/torch_migration/pipeline/Step2/test_data.py b/legacy/examples/torch_migration/pipeline/Step2/test_data.py similarity index 100% rename from examples/torch_migration/pipeline/Step2/test_data.py rename to legacy/examples/torch_migration/pipeline/Step2/test_data.py diff --git a/examples/torch_migration/pipeline/Step2/test_metric.py b/legacy/examples/torch_migration/pipeline/Step2/test_metric.py similarity index 100% rename from examples/torch_migration/pipeline/Step2/test_metric.py rename to legacy/examples/torch_migration/pipeline/Step2/test_metric.py diff --git a/examples/torch_migration/pipeline/Step3/README.md b/legacy/examples/torch_migration/pipeline/Step3/README.md similarity index 100% rename from examples/torch_migration/pipeline/Step3/README.md rename to legacy/examples/torch_migration/pipeline/Step3/README.md diff --git a/examples/torch_migration/pipeline/Step3/check_step3.py b/legacy/examples/torch_migration/pipeline/Step3/check_step3.py similarity index 100% rename from examples/torch_migration/pipeline/Step3/check_step3.py rename to legacy/examples/torch_migration/pipeline/Step3/check_step3.py diff --git a/examples/torch_migration/pipeline/Step3/paddle_loss.py b/legacy/examples/torch_migration/pipeline/Step3/paddle_loss.py similarity index 100% rename from examples/torch_migration/pipeline/Step3/paddle_loss.py rename to legacy/examples/torch_migration/pipeline/Step3/paddle_loss.py diff --git a/examples/torch_migration/pipeline/Step3/torch_loss.py b/legacy/examples/torch_migration/pipeline/Step3/torch_loss.py similarity index 100% rename from examples/torch_migration/pipeline/Step3/torch_loss.py rename to legacy/examples/torch_migration/pipeline/Step3/torch_loss.py diff --git a/examples/torch_migration/pipeline/Step4/README.md b/legacy/examples/torch_migration/pipeline/Step4/README.md similarity index 100% rename from examples/torch_migration/pipeline/Step4/README.md rename to legacy/examples/torch_migration/pipeline/Step4/README.md diff --git a/examples/torch_migration/pipeline/Step4/check_step4.py b/legacy/examples/torch_migration/pipeline/Step4/check_step4.py similarity index 100% rename from examples/torch_migration/pipeline/Step4/check_step4.py rename to legacy/examples/torch_migration/pipeline/Step4/check_step4.py diff --git a/examples/torch_migration/pipeline/Step4/test_bp.py b/legacy/examples/torch_migration/pipeline/Step4/test_bp.py 
similarity index 100% rename from examples/torch_migration/pipeline/Step4/test_bp.py rename to legacy/examples/torch_migration/pipeline/Step4/test_bp.py diff --git a/examples/torch_migration/pipeline/Step4/test_lr_scheduler.py b/legacy/examples/torch_migration/pipeline/Step4/test_lr_scheduler.py similarity index 100% rename from examples/torch_migration/pipeline/Step4/test_lr_scheduler.py rename to legacy/examples/torch_migration/pipeline/Step4/test_lr_scheduler.py diff --git a/examples/torch_migration/pipeline/Step5/README.md b/legacy/examples/torch_migration/pipeline/Step5/README.md similarity index 100% rename from examples/torch_migration/pipeline/Step5/README.md rename to legacy/examples/torch_migration/pipeline/Step5/README.md diff --git a/examples/torch_migration/pipeline/Step5/bert_paddle/train.py b/legacy/examples/torch_migration/pipeline/Step5/bert_paddle/train.py similarity index 100% rename from examples/torch_migration/pipeline/Step5/bert_paddle/train.py rename to legacy/examples/torch_migration/pipeline/Step5/bert_paddle/train.py diff --git a/examples/torch_migration/pipeline/Step5/bert_paddle/train.sh b/legacy/examples/torch_migration/pipeline/Step5/bert_paddle/train.sh similarity index 100% rename from examples/torch_migration/pipeline/Step5/bert_paddle/train.sh rename to legacy/examples/torch_migration/pipeline/Step5/bert_paddle/train.sh diff --git a/examples/torch_migration/pipeline/Step5/bert_paddle/utils.py b/legacy/examples/torch_migration/pipeline/Step5/bert_paddle/utils.py similarity index 100% rename from examples/torch_migration/pipeline/Step5/bert_paddle/utils.py rename to legacy/examples/torch_migration/pipeline/Step5/bert_paddle/utils.py diff --git a/examples/torch_migration/pipeline/Step5/bert_torch/accuracy.py b/legacy/examples/torch_migration/pipeline/Step5/bert_torch/accuracy.py similarity index 100% rename from examples/torch_migration/pipeline/Step5/bert_torch/accuracy.py rename to legacy/examples/torch_migration/pipeline/Step5/bert_torch/accuracy.py diff --git a/examples/torch_migration/pipeline/Step5/bert_torch/glue.py b/legacy/examples/torch_migration/pipeline/Step5/bert_torch/glue.py similarity index 100% rename from examples/torch_migration/pipeline/Step5/bert_torch/glue.py rename to legacy/examples/torch_migration/pipeline/Step5/bert_torch/glue.py diff --git a/examples/torch_migration/pipeline/Step5/bert_torch/train.py b/legacy/examples/torch_migration/pipeline/Step5/bert_torch/train.py similarity index 100% rename from examples/torch_migration/pipeline/Step5/bert_torch/train.py rename to legacy/examples/torch_migration/pipeline/Step5/bert_torch/train.py diff --git a/examples/torch_migration/pipeline/Step5/bert_torch/train.sh b/legacy/examples/torch_migration/pipeline/Step5/bert_torch/train.sh similarity index 100% rename from examples/torch_migration/pipeline/Step5/bert_torch/train.sh rename to legacy/examples/torch_migration/pipeline/Step5/bert_torch/train.sh diff --git a/examples/torch_migration/pipeline/Step5/bert_torch/utils.py b/legacy/examples/torch_migration/pipeline/Step5/bert_torch/utils.py similarity index 100% rename from examples/torch_migration/pipeline/Step5/bert_torch/utils.py rename to legacy/examples/torch_migration/pipeline/Step5/bert_torch/utils.py diff --git a/examples/torch_migration/pipeline/Step5/check_step5.py b/legacy/examples/torch_migration/pipeline/Step5/check_step5.py similarity index 100% rename from examples/torch_migration/pipeline/Step5/check_step5.py rename to 
legacy/examples/torch_migration/pipeline/Step5/check_step5.py diff --git a/examples/torch_migration/pipeline/classifier_weights/generate_classifier_weights.py b/legacy/examples/torch_migration/pipeline/classifier_weights/generate_classifier_weights.py similarity index 100% rename from examples/torch_migration/pipeline/classifier_weights/generate_classifier_weights.py rename to legacy/examples/torch_migration/pipeline/classifier_weights/generate_classifier_weights.py diff --git a/examples/torch_migration/pipeline/fake_data/gen_fake_data.py b/legacy/examples/torch_migration/pipeline/fake_data/gen_fake_data.py similarity index 100% rename from examples/torch_migration/pipeline/fake_data/gen_fake_data.py rename to legacy/examples/torch_migration/pipeline/fake_data/gen_fake_data.py diff --git a/examples/torch_migration/pipeline/models/pd_bert.py b/legacy/examples/torch_migration/pipeline/models/pd_bert.py similarity index 100% rename from examples/torch_migration/pipeline/models/pd_bert.py rename to legacy/examples/torch_migration/pipeline/models/pd_bert.py diff --git a/examples/torch_migration/pipeline/models/pt_bert.py b/legacy/examples/torch_migration/pipeline/models/pt_bert.py similarity index 100% rename from examples/torch_migration/pipeline/models/pt_bert.py rename to legacy/examples/torch_migration/pipeline/models/pt_bert.py diff --git a/examples/torch_migration/pipeline/reprod_log_demo/check_log_diff.py b/legacy/examples/torch_migration/pipeline/reprod_log_demo/check_log_diff.py similarity index 100% rename from examples/torch_migration/pipeline/reprod_log_demo/check_log_diff.py rename to legacy/examples/torch_migration/pipeline/reprod_log_demo/check_log_diff.py diff --git a/examples/torch_migration/pipeline/reprod_log_demo/write_log.py b/legacy/examples/torch_migration/pipeline/reprod_log_demo/write_log.py similarity index 100% rename from examples/torch_migration/pipeline/reprod_log_demo/write_log.py rename to legacy/examples/torch_migration/pipeline/reprod_log_demo/write_log.py diff --git a/examples/torch_migration/pipeline/weights/torch2paddle.py b/legacy/examples/torch_migration/pipeline/weights/torch2paddle.py similarity index 99% rename from examples/torch_migration/pipeline/weights/torch2paddle.py rename to legacy/examples/torch_migration/pipeline/weights/torch2paddle.py index 74511fea26e9..3a8d472064bd 100644 --- a/examples/torch_migration/pipeline/weights/torch2paddle.py +++ b/legacy/examples/torch_migration/pipeline/weights/torch2paddle.py @@ -17,9 +17,10 @@ import numpy as np import paddle import torch -from paddlenlp.transformers import BertForPretraining as PDBertForMaskedLM from transformers import BertForMaskedLM as PTBertForMaskedLM +from paddlenlp.transformers import BertForPretraining as PDBertForMaskedLM + def convert_pytorch_checkpoint_to_paddle( pytorch_checkpoint_path="pytorch_model.bin", diff --git a/examples/torch_migration/pipeline/weights/torch_bert_weight.py b/legacy/examples/torch_migration/pipeline/weights/torch_bert_weight.py similarity index 100% rename from examples/torch_migration/pipeline/weights/torch_bert_weight.py rename to legacy/examples/torch_migration/pipeline/weights/torch_bert_weight.py index 819229e156a5..b1cf6f1881c3 100644 --- a/examples/torch_migration/pipeline/weights/torch_bert_weight.py +++ b/legacy/examples/torch_migration/pipeline/weights/torch_bert_weight.py @@ -12,8 +12,8 @@ # See the License for the specific language governing permissions and # limitations under the License. 
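The two torch2paddle.py hunks above (pipeline/Step1 and pipeline/weights) again change only the import order; the `convert_pytorch_checkpoint_to_paddle` entry point is untouched, and its body is not part of this diff. Purely as a hedged sketch of what such a converter typically does — the key renaming between the HuggingFace and PaddleNLP naming schemes is omitted, and the transpose rule is a simplification, so this is not the script's actual code:

```python
import paddle
import torch


def convert_pytorch_checkpoint_to_paddle(
    pytorch_checkpoint_path="pytorch_model.bin",
    paddle_dump_path="model_state.pdparams",
):
    # Load the torch checkpoint on CPU so no GPU is required.
    pytorch_state = torch.load(pytorch_checkpoint_path, map_location="cpu")
    paddle_state = {}
    for name, tensor in pytorch_state.items():
        array = tensor.numpy()
        # torch nn.Linear stores weight as [out, in]; paddle expects [in, out],
        # so 2-D non-embedding weights are transposed (simplified heuristic).
        if name.endswith(".weight") and array.ndim == 2 and "embeddings" not in name:
            array = array.T
        paddle_state[name] = paddle.to_tensor(array)
    paddle.save(paddle_state, paddle_dump_path)
```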
-from transformers import BertModel
 import torch
+from transformers import BertModel

hf_model = BertModel.from_pretrained("bert-base-uncased")
hf_model.eval()
diff --git a/examples/torch_migration/requirements.txt b/legacy/examples/torch_migration/requirements.txt
similarity index 100%
rename from examples/torch_migration/requirements.txt
rename to legacy/examples/torch_migration/requirements.txt
diff --git a/scripts/regression/ci_case.sh b/scripts/regression/ci_case.sh
index e19a42f8a756..44f41d61d212 100644
--- a/scripts/regression/ci_case.sh
+++ b/scripts/regression/ci_case.sh
@@ -22,412 +22,396 @@ export CXX_COMPILER_PATH=$(which g++)
export CC=$(which gcc)
export CXX=$(which g++)
-if [ ! -d "model_logs" ];then
+if [ ! -d "model_logs" ]; then
    mkdir model_logs
fi
-if [ ! -d "unittest_logs" ];then
-    mkdir model_logs
+if [ ! -d "unittest_logs" ]; then
+    mkdir unittest_logs
fi
-print_info(){
-if [ $1 -ne 0 ];then
-    if [[ $2 =~ 'tests' ]];then
-        mv ${nlp_dir}/unittest_logs/$3.log ${nlp_dir}/unittest_logs/$3_FAIL.log
-        echo -e "\033[31m ${nlp_dir}/unittest_logs/$3_FAIL \033[0m"
-        cat ${nlp_dir}/unittest_logs/$3_FAIL.log
+print_info() {
+    if [ $1 -ne 0 ]; then
+        if [[ $2 =~ 'tests' ]]; then
+            mv ${nlp_dir}/unittest_logs/$3.log ${nlp_dir}/unittest_logs/$3_FAIL.log
+            echo -e "\033[31m ${nlp_dir}/unittest_logs/$3_FAIL \033[0m"
+            cat ${nlp_dir}/unittest_logs/$3_FAIL.log
+        else
+            mv ${log_path}/$2 ${log_path}/$2_FAIL.log
+            echo -e "\033[31m ${log_path}/$2_FAIL \033[0m"
+            cat ${log_path}/$2_FAIL.log
+        fi
+    elif [[ $2 =~ 'tests' ]]; then
+        echo -e "\033[32m ${log_path}/$3_SUCCESS \033[0m"
    else
-        mv ${log_path}/$2 ${log_path}/$2_FAIL.log
-        echo -e "\033[31m ${log_path}/$2_FAIL \033[0m"
-        cat ${log_path}/$2_FAIL.log
+        echo -e "\033[32m ${log_path}/$2_SUCCESS \033[0m"
    fi
-elif [[ $2 =~ 'tests' ]];then
-    echo -e "\033[32m ${log_path}/$3_SUCCESS \033[0m"
-else
-    echo -e "\033[32m ${log_path}/$2_SUCCESS \033[0m"
-fi
}
# case list
-# 1 waybill_ie (无可控参数,数据集外置)
-waybill_ie(){
-cd ${nlp_dir}/examples/information_extraction/waybill_ie/
-export CUDA_VISIBLE_DEVICES=${cudaid1}
-# BiGRU +CRF star training
-time (
-python download.py --data_dir ./waybill_ie
-python run_bigru_crf.py >${log_path}/waybill_ie_bigru_crf) >>${log_path}/waybill_ie_bigru_crf 2>&1
-print_info $? waybill_ie_bigru_crf
-# ERNIE +RF star training
-time (python run_ernie.py >${log_path}/waybill_ie_ernie) >>${log_path}/waybill_ie_ernie 2>&1
-print_info $? waybill_ie_ernie
-# ERNIE +CRF star training
-time (python run_ernie_crf.py >${log_path}/waybill_ie_ernie_crf) >>${log_path}/waybill_ie_ernie_crf 2>&1
-print_info $? waybill_ie_ernie_crf
-}
# 2 msra_ner (no tunable parameters, built-in dataset)
-msra_ner(){
-cd ${nlp_dir}/examples/information_extraction/msra_ner/
-export CUDA_VISIBLE_DEVICES=${cudaid2}
-## train
-time (python -m paddle.distributed.launch ./train.py \
-    --model_type bert \
-    --model_name_or_path bert-base-multilingual-uncased \
-    --dataset msra_ner \
-    --max_seq_length 128 \
-    --batch_size 16 \
-    --learning_rate 2e-5 \
-    --num_train_epochs 1 \
-    --logging_steps 1 \
-    --max_steps 2 \
-    --save_steps 2 \
-    --output_dir ./tmp/msra_ner/ \
-    --device gpu >${log_path}/msra_ner_train) >>${log_path}/msra_ner_train 2>&1
-print_info $? msra_ner_train
-## eval
-time (python -u ./eval.py \
-    --model_name_or_path bert-base-multilingual-uncased \
-    --max_seq_length 128 \
-    --batch_size 16 \
-    --device gpu \
-    --init_checkpoint_path ./tmp/msra_ner/model_2.pdparams >${log_path}/msra_ner_eval) >>${log_path}/msra_ner_eval 2>&1
-print_info $?
msra_ner_eval -## predict -time (python -u ./predict.py \ - --model_name_or_path bert-base-multilingual-uncased \ - --max_seq_length 128 \ - --batch_size 16 \ - --device gpu \ - --init_checkpoint_path ./tmp/msra_ner/model_2.pdparams >${log_path}/msra_ner_predict) >>${log_path}/msra_ner_predict 2>&1 -print_info $? msra_ner_predict +msra_ner() { + cd ${nlp_dir}/legacy/examples/information_extraction/msra_ner/ + export CUDA_VISIBLE_DEVICES=${cudaid2} + ## train + time (python -m paddle.distributed.launch ./train.py \ + --model_type bert \ + --model_name_or_path bert-base-multilingual-uncased \ + --dataset msra_ner \ + --max_seq_length 128 \ + --batch_size 16 \ + --learning_rate 2e-5 \ + --num_train_epochs 1 \ + --logging_steps 1 \ + --max_steps 2 \ + --save_steps 2 \ + --output_dir ./tmp/msra_ner/ \ + --device gpu >${log_path}/msra_ner_train) >>${log_path}/msra_ner_train 2>&1 + print_info $? msra_ner_train + ## eval + time (python -u ./eval.py \ + --model_name_or_path bert-base-multilingual-uncased \ + --max_seq_length 128 \ + --batch_size 16 \ + --device gpu \ + --init_checkpoint_path ./tmp/msra_ner/model_2.pdparams >${log_path}/msra_ner_eval) >>${log_path}/msra_ner_eval 2>&1 + print_info $? msra_ner_eval + ## predict + time (python -u ./predict.py \ + --model_name_or_path bert-base-multilingual-uncased \ + --max_seq_length 128 \ + --batch_size 16 \ + --device gpu \ + --init_checkpoint_path ./tmp/msra_ner/model_2.pdparams >${log_path}/msra_ner_predict) >>${log_path}/msra_ner_predict 2>&1 + print_info $? msra_ner_predict } # 3 glue glue() { -cd ${nlp_dir}/examples/benchmark/glue/ -export CUDA_VISIBLE_DEVICES=${cudaid2} -## TASK_SST-2 -export TASK_NAME=SST-2 -time (python -u run_glue.py \ - --model_type bert \ - --model_name_or_path bert-base-uncased \ - --task_name $TASK_NAME \ - --max_seq_length 128 \ - --batch_size 128 \ - --learning_rate 3e-5 \ - --max_steps 1 \ - --logging_steps 1 \ - --save_steps 1 \ - --output_dir ./$TASK_NAME/ \ - --device gpu >${log_path}/glue_${TASK_NAME}_train) >>${log_path}/glue_${TASK_NAME}_train 2>&1 -print_info $? glue_${TASK_NAME}_train + cd ${nlp_dir}/legacy/examples/benchmark/glue/ + export CUDA_VISIBLE_DEVICES=${cudaid2} + ## TASK_SST-2 + export TASK_NAME=SST-2 + time (python -u run_glue.py \ + --model_type bert \ + --model_name_or_path bert-base-uncased \ + --task_name $TASK_NAME \ + --max_seq_length 128 \ + --batch_size 128 \ + --learning_rate 3e-5 \ + --max_steps 1 \ + --logging_steps 1 \ + --save_steps 1 \ + --output_dir ./$TASK_NAME/ \ + --device gpu >${log_path}/glue_${TASK_NAME}_train) >>${log_path}/glue_${TASK_NAME}_train 2>&1 + print_info $? 
glue_${TASK_NAME}_train } # 4 bert bert() { -export CUDA_VISIBLE_DEVICES=${cudaid2} -# cd ${nlp_dir}/model_zoo/bert/ -# wget -q https://paddle-qa.bj.bcebos.com/paddlenlp/bert.tar.gz -# tar -xzvf bert.tar.gz -python -c "import datasets;from datasets import load_dataset; train_dataset=load_dataset('glue', 'sst2', split='train')" -cd ${nlp_dir}/model_zoo/bert/data/ -wget -q https://bj.bcebos.com/paddlenlp/models/transformers/bert/data/training_data.hdf5 -cd ../ -# pretrain -time (python -m paddle.distributed.launch run_pretrain.py \ - --model_type bert \ - --model_name_or_path bert-base-uncased \ - --max_predictions_per_seq 20 \ - --batch_size 16 \ - --learning_rate 1e-4 \ - --weight_decay 1e-2 \ - --adam_epsilon 1e-6 \ - --warmup_steps 10000 \ - --input_dir data/ \ - --output_dir pretrained_models/ \ - --logging_steps 1 \ - --save_steps 1 \ - --max_steps 1 \ - --device gpu \ - --use_amp False >${log_path}/bert_pretrain) >>${log_path}/bert_pretrain 2>&1 -print_info $? bert_pretrain -time (python -m paddle.distributed.launch run_glue_trainer.py \ - --model_name_or_path bert-base-uncased \ - --task_name SST2 \ - --max_seq_length 128 \ - --per_device_train_batch_size 32 \ - --per_device_eval_batch_size 32 \ - --learning_rate 2e-5 \ - --num_train_epochs 3 \ - --logging_steps 1 \ - --save_steps 1 \ - --max_steps 1 \ - --output_dir ./tmp/ \ - --device gpu \ - --fp16 False\ + export CUDA_VISIBLE_DEVICES=${cudaid2} + # cd ${nlp_dir}/model_zoo/bert/ + # wget -q https://paddle-qa.bj.bcebos.com/paddlenlp/bert.tar.gz + # tar -xzvf bert.tar.gz + python -c "import datasets;from datasets import load_dataset; train_dataset=load_dataset('glue', 'sst2', split='train')" + cd ${nlp_dir}/model_zoo/bert/data/ + wget -q https://bj.bcebos.com/paddlenlp/models/transformers/bert/data/training_data.hdf5 + cd ../ + # pretrain + time (python -m paddle.distributed.launch run_pretrain.py \ + --model_type bert \ + --model_name_or_path bert-base-uncased \ + --max_predictions_per_seq 20 \ + --batch_size 16 \ + --learning_rate 1e-4 \ + --weight_decay 1e-2 \ + --adam_epsilon 1e-6 \ + --warmup_steps 10000 \ + --input_dir data/ \ + --output_dir pretrained_models/ \ + --logging_steps 1 \ + --save_steps 1 \ + --max_steps 1 \ + --device gpu \ + --use_amp False >${log_path}/bert_pretrain) >>${log_path}/bert_pretrain 2>&1 + print_info $? bert_pretrain + time (python -m paddle.distributed.launch run_glue_trainer.py \ + --model_name_or_path bert-base-uncased \ + --task_name SST2 \ + --max_seq_length 128 \ + --per_device_train_batch_size 32 \ + --per_device_eval_batch_size 32 \ + --learning_rate 2e-5 \ + --num_train_epochs 3 \ + --logging_steps 1 \ + --save_steps 1 \ + --max_steps 1 \ + --output_dir ./tmp/ \ + --device gpu \ + --fp16 False\ --do_train \ - --do_eval >${log_path}/bert_fintune) >>${log_path}/bert_fintune 2>&1 -print_info $? bert_fintune -time (python -u ./export_model.py \ - --model_type bert \ - --model_path bert-base-uncased \ - --output_path ./infer_model/model >${log_path}/bert_export) >>${log_path}/bert_export 2>&1 -print_info $? bert_export - } + --do_eval >${log_path}/bert_fintune) >>${log_path}/bert_fintune 2>&1 + print_info $? bert_fintune + time (python -u ./export_model.py \ + --model_type bert \ + --model_path bert-base-uncased \ + --output_path ./infer_model/model >${log_path}/bert_export) >>${log_path}/bert_export 2>&1 + print_info $? 
bert_export
+}
# 5 skep (max save steps not tunable, built-in dataset)
-skep () {
-cd ${nlp_dir}/examples/sentiment_analysis/skep/
-export CUDA_VISIBLE_DEVICES=${cudaid2}
-## train_sentence
-time ( python -m paddle.distributed.launch train_sentence.py --batch_size 16 --epochs 1 --model_name "skep_ernie_1.0_large_ch" --device gpu --save_dir ./checkpoints >${log_path}/skep_train_sentence) >>${log_path}/skep_train_sentence 2>&1
-print_info $? skep_train_sentence
-## train_aspect
-time ( python -m paddle.distributed.launch train_aspect.py --batch_size 4 --epochs 1 --device gpu --save_dir ./aspect_checkpoints >${log_path}/skep_train_aspect) >>${log_path}/skep_train_aspect 2>&1
-print_info $? skep_train_aspect
-# # train_opinion
-time ( python -m paddle.distributed.launch train_opinion.py --batch_size 4 --epochs 1 --device gpu --save_dir ./opinion_checkpoints >${log_path}/skep_train_opinion) >>${log_path}/skep_train_opinion 2>&1
-print_info $? skep_train_opinion
-# predict_sentence
-time (python predict_sentence.py --model_name "skep_ernie_1.0_large_ch" --ckpt_dir checkpoints/model_100 >${log_path}/skep_predict_sentence) >>${log_path}/skep_predict_sentence 2>&1
-print_info $? skep_predict_sentence
-## predict_aspect
-time (python predict_aspect.py --device 'gpu' --ckpt_dir ./aspect_checkpoints/model_100 >${log_path}/skep_predict_aspect) >>${log_path}/skep_predict_aspect 2>&1
-print_info $? skep_predict_aspect
-# # predict_opinion
-time (python predict_opinion.py --device 'gpu' --ckpt_dir ./opinion_checkpoints/model_100 >${log_path}/skep_predict_opinion) >>${log_path}/skep_predict_opinion 2>&1
-print_info $? skep_predict_opinion
+skep() {
+    cd ${nlp_dir}/legacy/examples/sentiment_analysis/skep/
+    export CUDA_VISIBLE_DEVICES=${cudaid2}
+    ## train_sentence
+    time (python -m paddle.distributed.launch train_sentence.py --batch_size 16 --epochs 1 --model_name "skep_ernie_1.0_large_ch" --device gpu --save_dir ./checkpoints >${log_path}/skep_train_sentence) >>${log_path}/skep_train_sentence 2>&1
+    print_info $? skep_train_sentence
+    ## train_aspect
+    time (python -m paddle.distributed.launch train_aspect.py --batch_size 4 --epochs 1 --device gpu --save_dir ./aspect_checkpoints >${log_path}/skep_train_aspect) >>${log_path}/skep_train_aspect 2>&1
+    print_info $? skep_train_aspect
+    # # train_opinion
+    time (python -m paddle.distributed.launch train_opinion.py --batch_size 4 --epochs 1 --device gpu --save_dir ./opinion_checkpoints >${log_path}/skep_train_opinion) >>${log_path}/skep_train_opinion 2>&1
+    print_info $? skep_train_opinion
+    # predict_sentence
+    time (python predict_sentence.py --model_name "skep_ernie_1.0_large_ch" --ckpt_dir checkpoints/model_100 >${log_path}/skep_predict_sentence) >>${log_path}/skep_predict_sentence 2>&1
+    print_info $? skep_predict_sentence
+    ## predict_aspect
+    time (python predict_aspect.py --device 'gpu' --ckpt_dir ./aspect_checkpoints/model_100 >${log_path}/skep_predict_aspect) >>${log_path}/skep_predict_aspect 2>&1
+    print_info $? skep_predict_aspect
+    # # predict_opinion
+    time (python predict_opinion.py --device 'gpu' --ckpt_dir ./opinion_checkpoints/model_100 >${log_path}/skep_predict_opinion) >>${log_path}/skep_predict_opinion 2>&1
+    print_info $?
skep_predict_opinion } # 6 bigbird bigbird(){ -cd ${nlp_dir}/examples/language_model/bigbird/ -export CUDA_VISIBLE_DEVICES=${cudaid2} -time (python -m paddle.distributed.launch --log_dir log run_pretrain.py --model_name_or_path bigbird-base-uncased \ - --input_dir "./data" \ - --output_dir "output" \ - --batch_size 4 \ - --weight_decay 0.01 \ - --learning_rate 1e-5 \ - --max_steps 1 \ - --save_steps 1 \ - --logging_steps 1 \ - --max_encoder_length 512 \ - --max_pred_length 75 >${log_path}/bigbird_pretrain) >>${log_path}/bigbird_pretrain 2>&1 + cd ${nlp_dir}/examples/language_model/bigbird/ + export CUDA_VISIBLE_DEVICES=${cudaid2} + time (python -m paddle.distributed.launch --log_dir log run_pretrain.py --model_name_or_path bigbird-base-uncased \ + --input_dir "./data" \ + --output_dir "output" \ + --batch_size 4 \ + --weight_decay 0.01 \ + --learning_rate 1e-5 \ + --max_steps 1 \ + --save_steps 1 \ + --logging_steps 1 \ + --max_encoder_length 512 \ + --max_pred_length 75 >${log_path}/bigbird_pretrain) >>${log_path}/bigbird_pretrain 2>&1 print_info $? bigbird_pretrain } # 7 electra electra(){ -cd ${nlp_dir}/model_zoo/electra/ -export CUDA_VISIBLE_DEVICES=${cudaid2} -export DATA_DIR=./BookCorpus/ -wget -q https://paddle-qa.bj.bcebos.com/paddlenlp/BookCorpus.tar.gz && tar -xzvf BookCorpus.tar.gz -time (python -u ./run_pretrain.py \ - --model_type electra \ - --model_name_or_path electra-small \ - --input_dir ./BookCorpus/ \ - --output_dir ./pretrain_model/ \ - --train_batch_size 64 \ - --learning_rate 5e-4 \ - --max_seq_length 128 \ - --weight_decay 1e-2 \ - --adam_epsilon 1e-6 \ - --warmup_steps 10000 \ - --num_train_epochs 4 \ - --logging_steps 1 \ - --save_steps 1 \ - --max_steps 1 \ - --device gpu >${log_path}/electra_pretrain) >>${log_path}/electra_pretrain 2>&1 -print_info $? electra_pretrain + cd ${nlp_dir}/model_zoo/electra/ + export CUDA_VISIBLE_DEVICES=${cudaid2} + export DATA_DIR=./BookCorpus/ + wget -q https://paddle-qa.bj.bcebos.com/paddlenlp/BookCorpus.tar.gz && tar -xzvf BookCorpus.tar.gz + time (python -u ./run_pretrain.py \ + --model_type electra \ + --model_name_or_path electra-small \ + --input_dir ./BookCorpus/ \ + --output_dir ./pretrain_model/ \ + --train_batch_size 64 \ + --learning_rate 5e-4 \ + --max_seq_length 128 \ + --weight_decay 1e-2 \ + --adam_epsilon 1e-6 \ + --warmup_steps 10000 \ + --num_train_epochs 4 \ + --logging_steps 1 \ + --save_steps 1 \ + --max_steps 1 \ + --device gpu >${log_path}/electra_pretrain) >>${log_path}/electra_pretrain 2>&1 + print_info $? 
electra_pretrain } # 9 ernie ernie(){ -#data process -cd ${nlp_dir}/model_zoo/ernie-1.0/ -mkdir data -cd ./data -wget -q https://paddlenlp.bj.bcebos.com/models/transformers/data_tools/ernie_wudao_0903_92M_ids.npy -wget -q https://paddlenlp.bj.bcebos.com/models/transformers/data_tools/ernie_wudao_0903_92M_idx.npz -cd ../ -mkdir data_ernie_3.0 && cd data_ernie_3.0 -wget https://bj.bcebos.com/paddlenlp/models/transformers/data_tools/wudao_200g_sample_ernie-3.0-base-zh_ids.npy -wget https://bj.bcebos.com/paddlenlp/models/transformers/data_tools/wudao_200g_sample_ernie-3.0-base-zh_idx.npz -cd ../ -# pretrain_trainer -python -u -m paddle.distributed.launch \ - --log_dir "output/trainer_log" \ - run_pretrain_trainer.py \ - --model_type "ernie" \ - --model_name_or_path "ernie-3.0-base-zh" \ - --tokenizer_name_or_path "ernie-3.0-base-zh" \ - --input_dir "./data_ernie_3.0" \ - --output_dir "output/trainer_log" \ - --split 949,50,1 \ - --max_seq_length 512 \ - --per_device_train_batch_size 16 \ - --per_device_eval_batch_size 32 \ - --fp16 \ - --fp16_opt_level "O2" \ - --learning_rate 0.0001 \ - --min_learning_rate 0.00001 \ - --max_steps 2 \ - --save_steps 2 \ - --weight_decay 0.01 \ - --warmup_ratio 0.01 \ - --max_grad_norm 1.0 \ - --logging_steps 1\ + #data process + cd ${nlp_dir}/model_zoo/ernie-1.0/ + mkdir data + cd ./data + wget -q https://paddlenlp.bj.bcebos.com/models/transformers/data_tools/ernie_wudao_0903_92M_ids.npy + wget -q https://paddlenlp.bj.bcebos.com/models/transformers/data_tools/ernie_wudao_0903_92M_idx.npz + cd ../ + mkdir data_ernie_3.0 && cd data_ernie_3.0 + wget https://bj.bcebos.com/paddlenlp/models/transformers/data_tools/wudao_200g_sample_ernie-3.0-base-zh_ids.npy + wget https://bj.bcebos.com/paddlenlp/models/transformers/data_tools/wudao_200g_sample_ernie-3.0-base-zh_idx.npz + cd ../ + # pretrain_trainer + python -u -m paddle.distributed.launch \ + --log_dir "output/trainer_log" \ + run_pretrain_trainer.py \ + --model_type "ernie" \ + --model_name_or_path "ernie-3.0-base-zh" \ + --tokenizer_name_or_path "ernie-3.0-base-zh" \ + --input_dir "./data_ernie_3.0" \ + --output_dir "output/trainer_log" \ + --split 949,50,1 \ + --max_seq_length 512 \ + --per_device_train_batch_size 16 \ + --per_device_eval_batch_size 32 \ + --fp16 \ + --fp16_opt_level "O2" \ + --learning_rate 0.0001 \ + --min_learning_rate 0.00001 \ + --max_steps 2 \ + --save_steps 2 \ + --weight_decay 0.01 \ + --warmup_ratio 0.01 \ + --max_grad_norm 1.0 \ + --logging_steps 1\ --dataloader_num_workers 4 \ - --eval_steps 1000 \ - --report_to "visualdl" \ - --disable_tqdm true \ - --do_train \ - --device "gpu" >${log_path}/ernie_1.0_pretrain_trainer >>${log_path}/ernie_1.0_pretrain_trainer 2>&1 + --eval_steps 1000 \ + --report_to "visualdl" \ + --disable_tqdm true \ + --do_train \ + --device "gpu" >${log_path}/ernie_1.0_pretrain_trainer >>${log_path}/ernie_1.0_pretrain_trainer 2>&1 print_info $? 
ernie_1.0_pretrain_trainer -# pretrain_static -python -u -m paddle.distributed.launch \ - --log_dir "./log" \ - run_pretrain_static.py \ - --model_type "ernie" \ - --model_name_or_path "ernie-1.0-base-zh" \ - --tokenizer_name_or_path "ernie-1.0-base-zh" \ - --input_dir "./data/" \ - --output_dir "./output/" \ - --max_seq_len 512 \ - --micro_batch_size 16 \ - --global_batch_size 32 \ - --sharding_degree 1 \ - --dp_degree 2 \ - --use_sharding false \ - --use_amp true \ - --use_recompute false \ - --max_lr 0.0001 \ - --min_lr 0.00001 \ - --max_steps 4 \ - --save_steps 2 \ - --checkpoint_steps 5000 \ - --decay_steps 3960000 \ - --weight_decay 0.01 \ - --warmup_rate 0.0025 \ - --grad_clip 1.0 \ - --logging_freq 2\ + # pretrain_static + python -u -m paddle.distributed.launch \ + --log_dir "./log" \ + run_pretrain_static.py \ + --model_type "ernie" \ + --model_name_or_path "ernie-1.0-base-zh" \ + --tokenizer_name_or_path "ernie-1.0-base-zh" \ + --input_dir "./data/" \ + --output_dir "./output/" \ + --max_seq_len 512 \ + --micro_batch_size 16 \ + --global_batch_size 32 \ + --sharding_degree 1 \ + --dp_degree 2 \ + --use_sharding false \ + --use_amp true \ + --use_recompute false \ + --max_lr 0.0001 \ + --min_lr 0.00001 \ + --max_steps 4 \ + --save_steps 2 \ + --checkpoint_steps 5000 \ + --decay_steps 3960000 \ + --weight_decay 0.01 \ + --warmup_rate 0.0025 \ + --grad_clip 1.0 \ + --logging_freq 2\ --num_workers 2 \ - --eval_freq 1000 \ - --device "gpu" >${log_path}/ernie_1.0_pretrain_static >>${log_path}/ernie_1.0_pretrain_static 2>&1 + --eval_freq 1000 \ + --device "gpu" >${log_path}/ernie_1.0_pretrain_static >>${log_path}/ernie_1.0_pretrain_static 2>&1 print_info $? ernie_1.0_pretrain_static } # 10 xlnet xlnet(){ -cd ${nlp_dir}/examples/language_model/xlnet/ -export CUDA_VISIBLE_DEVICES=${cudaid2} -time (python -m paddle.distributed.launch ./run_glue.py \ - --model_name_or_path xlnet-base-cased \ - --task_name SST-2 \ - --max_seq_length 128 \ - --batch_size 32 \ - --learning_rate 2e-5 \ - --num_train_epochs 3 \ - --max_steps 1 \ - --logging_steps 1 \ - --save_steps 1 \ - --output_dir ./xlnet/ >${log_path}/xlnet_train) >>${log_path}/xlnet_train 2>&1 -print_info $? xlnet_train + cd ${nlp_dir}/examples/language_model/xlnet/ + export CUDA_VISIBLE_DEVICES=${cudaid2} + time (python -m paddle.distributed.launch ./run_glue.py \ + --model_name_or_path xlnet-base-cased \ + --task_name SST-2 \ + --max_seq_length 128 \ + --batch_size 32 \ + --learning_rate 2e-5 \ + --num_train_epochs 3 \ + --max_steps 1 \ + --logging_steps 1 \ + --save_steps 1 \ + --output_dir ./xlnet/ >${log_path}/xlnet_train) >>${log_path}/xlnet_train 2>&1 + print_info $? xlnet_train } # 11 ofa ofa(){ -cd ${nlp_dir}/examples/model_compression/ofa/ -cd ../../benchmark/glue/ -export CUDA_VISIBLE_DEVICES=${cudaid2} -# finetuing -time (python -u run_glue.py \ - --model_type bert \ - --model_name_or_path bert-base-uncased \ - --task_name SST-2 \ - --max_seq_length 128 \ - --batch_size 32 \ - --learning_rate 2e-5 \ - --num_train_epochs 1 \ - --max_steps 1 \ - --logging_steps 1 \ - --save_steps 1 \ - --output_dir ./ \ - --device gpu >${log_path}/ofa_pretrain) >>${log_path}/ofa_pretrain 2>&1 -print_info $? 
ofa_pretrain
-mv sst-2_ft_model_1.pdparams/ ${nlp_dir}/examples/model_compression/ofa/
-cd -
-#model slim
-export CUDA_VISIBLE_DEVICES=${cudaid2}
-time (python -m paddle.distributed.launch run_glue_ofa.py \
-    --model_type bert \
-    --model_name_or_path ./sst-2_ft_model_1.pdparams/ \
-    --task_name SST-2 --max_seq_length 128 \
-    --batch_size 32 \
-    --learning_rate 2e-5 \
-    --num_train_epochs 1 \
-    --max_steps 1 \
-    --logging_steps 1 \
-    --save_steps 1 \
-    --output_dir ./ofa/SST-2 \
-    --device gpu \
-    --width_mult_list 1.0 0.8333333333333334 0.6666666666666666 0.5 >${log_path}/ofa_slim) >>${log_path}/ofa_slim 2>&1
-print_info $? ofa_slim
+    cd ${nlp_dir}/examples/model_compression/ofa/
+    cd ../../benchmark/glue/
+    export CUDA_VISIBLE_DEVICES=${cudaid2}
+    # finetuning
+    time (python -u run_glue.py \
+        --model_type bert \
+        --model_name_or_path bert-base-uncased \
+        --task_name SST-2 \
+        --max_seq_length 128 \
+        --batch_size 32 \
+        --learning_rate 2e-5 \
+        --num_train_epochs 1 \
+        --max_steps 1 \
+        --logging_steps 1 \
+        --save_steps 1 \
+        --output_dir ./ \
+        --device gpu >${log_path}/ofa_pretrain) >>${log_path}/ofa_pretrain 2>&1
+    print_info $? ofa_pretrain
+    mv sst-2_ft_model_1.pdparams/ ${nlp_dir}/examples/model_compression/ofa/
+    cd -
+    #model slim
+    export CUDA_VISIBLE_DEVICES=${cudaid2}
+    time (python -m paddle.distributed.launch run_glue_ofa.py \
+        --model_type bert \
+        --model_name_or_path ./sst-2_ft_model_1.pdparams/ \
+        --task_name SST-2 --max_seq_length 128 \
+        --batch_size 32 \
+        --learning_rate 2e-5 \
+        --num_train_epochs 1 \
+        --max_steps 1 \
+        --logging_steps 1 \
+        --save_steps 1 \
+        --output_dir ./ofa/SST-2 \
+        --device gpu \
+        --width_mult_list 1.0 0.8333333333333334 0.6666666666666666 0.5 >${log_path}/ofa_slim) >>${log_path}/ofa_slim 2>&1
+    print_info $? ofa_slim
}
# 12 albert
-albert (){
-cd ${nlp_dir}/examples/benchmark/glue/
-export CUDA_VISIBLE_DEVICES=${cudaid2}
-time (python -m paddle.distributed.launch run_glue.py \
-    --model_type albert \
-    --model_name_or_path albert-base-v2 \
+albert() {
+    cd ${nlp_dir}/legacy/examples/benchmark/glue/
+    export CUDA_VISIBLE_DEVICES=${cudaid2}
+    time (python -m paddle.distributed.launch run_glue.py \
+        --model_type albert \
+        --model_name_or_path albert-base-v2 \
        --task_name SST-2 \
-    --max_seq_length 128 \
-    --batch_size 32 \
-    --learning_rate 1e-5 \
-    --max_steps 1 \
-    --warmup_steps 1256 \
-    --logging_steps 1 \
-    --save_steps 1 \
-    --output_dir ./albert/SST-2/ \
+        --max_seq_length 128 \
+        --batch_size 32 \
+        --learning_rate 1e-5 \
+        --max_steps 1 \
+        --warmup_steps 1256 \
+        --logging_steps 1 \
+        --save_steps 1 \
+        --output_dir ./albert/SST-2/ \
        --device gpu >${log_path}/albert_sst-2_train) >>${log_path}/albert_sst-2_train 2>&1
-print_info $? albert_sst-2_train
+    print_info $? albert_sst-2_train
}
# 13 squad
-squad (){
-cd ${nlp_dir}/examples/machine_reading_comprehension/SQuAD/
-export CUDA_VISIBLE_DEVICES=${cudaid1}
-# finetune
-time (python -m paddle.distributed.launch run_squad.py \
-    --model_type bert \
-    --model_name_or_path bert-base-uncased \
-    --max_seq_length 384 \
-    --batch_size 12 \
-    --learning_rate 3e-5 \
-    --num_train_epochs 1 \
-    --max_steps 1 \
-    --logging_steps 1 \
-    --save_steps 1 \
-    --warmup_proportion 0.1 \
-    --weight_decay 0.01 \
-    --output_dir ./tmp/squad/ \
-    --device gpu \
-    --do_train \
-    --do_predict >${log_path}/squad_train) >>${log_path}/squad_train 2>&1
-print_info $?
squad_train -# export model -time (python -u ./export_model.py \ - --model_type bert \ - --model_path ./tmp/squad/model_1/ \ - --output_path ./infer_model/model >${log_path}/squad_export) >>${log_path}/squad_export 2>&1 -print_info $? squad_export -# predict -time (python -u deploy/python/predict.py \ - --model_type bert \ - --model_name_or_path ./infer_model/model \ - --batch_size 2 \ - --max_seq_length 384 >${log_path}/squad_predict) >>${log_path}/squad_predict 2>&1 -print_info $? squad_predict +squad() { + cd ${nlp_dir}/legacy/examples/machine_reading_comprehension/SQuAD/ + export CUDA_VISIBLE_DEVICES=${cudaid1} + # finetune + time (python -m paddle.distributed.launch run_squad.py \ + --model_type bert \ + --model_name_or_path bert-base-uncased \ + --max_seq_length 384 \ + --batch_size 12 \ + --learning_rate 3e-5 \ + --num_train_epochs 1 \ + --max_steps 1 \ + --logging_steps 1 \ + --save_steps 1 \ + --warmup_proportion 0.1 \ + --weight_decay 0.01 \ + --output_dir ./tmp/squad/ \ + --device gpu \ + --do_train \ + --do_predict >${log_path}/squad_train) >>${log_path}/squad_train 2>&1 + print_info $? squad_train + # export model + time (python -u ./export_model.py \ + --model_type bert \ + --model_path ./tmp/squad/model_1/ \ + --output_path ./infer_model/model >${log_path}/squad_export) >>${log_path}/squad_export 2>&1 + print_info $? squad_export + # predict + time (python -u deploy/python/predict.py \ + --model_type bert \ + --model_name_or_path ./infer_model/model \ + --batch_size 2 \ + --max_seq_length 384 >${log_path}/squad_predict) >>${log_path}/squad_predict 2>&1 + print_info $? squad_predict } # 15 lexical_analysis lexical_analysis(){ -export CUDA_VISIBLE_DEVICES=${cudaid2} -cd ${nlp_dir}/examples/lexical_analysis/ -#train -time (python download.py --data_dir ./ ) -time (python -m paddle.distributed.launch train.py \ + export CUDA_VISIBLE_DEVICES=${cudaid2} + cd ${nlp_dir}/examples/lexical_analysis/ + #train + time (python download.py --data_dir ./ ) + time (python -m paddle.distributed.launch train.py \ --data_dir ./lexical_analysis_dataset_tiny \ --model_save_dir ./save_dir \ --epochs 1 \ @@ -435,643 +419,475 @@ time (python -m paddle.distributed.launch train.py \ --logging_steps 1\ --batch_size 32 \ --device gpu >${log_path}/lexical_analysis_train) >>${log_path}/lexical_analysis_train 2>&1 -print_info $? lexical_analysis_train -#export -time (python export_model.py \ - --data_dir=./lexical_analysis_dataset_tiny \ - --params_path=./save_dir/model_15.pdparams \ - --output_path=./infer_model/static_graph_params >${log_path}/lexical_analysis_export) >>${log_path}/lexical_analysis_export 2>&1 -print_info $? lexical_analysis_export -# predict -time (python predict.py --data_dir ./lexical_analysis_dataset_tiny \ + print_info $? lexical_analysis_train + #export + time (python export_model.py \ + --data_dir=./lexical_analysis_dataset_tiny \ + --params_path=./save_dir/model_15.pdparams \ + --output_path=./infer_model/static_graph_params >${log_path}/lexical_analysis_export) >>${log_path}/lexical_analysis_export 2>&1 + print_info $? lexical_analysis_export + # predict + time (python predict.py --data_dir ./lexical_analysis_dataset_tiny \ --init_checkpoint ./save_dir/model_15.pdparams \ --batch_size 32 \ --device gpu >${log_path}/lexical_analysis_predict) >>${log_path}/lexical_analysis_predict 2>&1 -print_info $? 
lexical_analysis_predict -# deploy -time (python deploy/predict.py \ - --model_file=infer_model/static_graph_params.pdmodel \ - --params_file=infer_model/static_graph_params.pdiparams \ - --data_dir lexical_analysis_dataset_tiny >${log_path}/lexical_analysis_deploy) >>${log_path}/lexical_analysis_deploy 2>&1 -print_info $? lexical_analysis_deploy -} -# 16 seq2seq -seq2seq() { -export CUDA_VISIBLE_DEVICES=${cudaid2} -cd ${nlp_dir}/examples/machine_translation/seq2seq/ -# train (1041/steps) 5min -time (python train.py \ - --num_layers 2 \ - --hidden_size 512 \ - --batch_size 128 \ - --max_epoch 1 \ - --log_freq 1 \ - --dropout 0.2 \ - --init_scale 0.1 \ - --max_grad_norm 5.0 \ - --device gpu \ - --model_path ./attention_models >${log_path}/seq2seq_train) >>${log_path}/seq2seq_train 2>&1 -print_info $? seq2seq_train -# predict -time (python predict.py \ - --num_layers 2 \ - --hidden_size 512 \ - --batch_size 128 \ - --dropout 0.2 \ - --init_scale 0.1 \ - --max_grad_norm 5.0 \ - --init_from_ckpt attention_models/0 \ - --infer_output_file infer_output.txt \ - --beam_size 10 \ - --device gpu >${log_path}/seq2seq_predict) >>${log_path}/seq2seq_predict 2>&1 -print_info $? seq2seq_predict -# export -time (python export_model.py \ - --num_layers 2 \ - --hidden_size 512 \ - --batch_size 128 \ - --dropout 0.2 \ - --init_scale 0.1 \ - --max_grad_norm 5.0 \ - --init_from_ckpt attention_models/0.pdparams \ - --beam_size 10 \ - --export_path ./infer_model/model >${log_path}/seq2seq_export) >>${log_path}/seq2seq_export 2>&1 -print_info $? seq2seq_export -# depoly -time (cd deploy/python -python infer.py \ - --export_path ../../infer_model/model \ - --device gpu \ - --batch_size 128 \ - --infer_output_file infer_output.txt >${log_path}/seq2seq_depoly) >>${log_path}/seq2seq_deploy 2>&1 -print_info $? seq2seq_depoly + print_info $? lexical_analysis_predict + # deploy + time (python deploy/predict.py \ + --model_file=infer_model/static_graph_params.pdmodel \ + --params_file=infer_model/static_graph_params.pdiparams \ + --data_dir lexical_analysis_dataset_tiny >${log_path}/lexical_analysis_deploy) >>${log_path}/lexical_analysis_deploy 2>&1 + print_info $? lexical_analysis_deploy } # 18 word_embedding 5min word_embedding(){ -export CUDA_VISIBLE_DEVICES=${cudaid1} -cd ${nlp_dir}/examples/word_embedding/ -# 使用paddlenlp.embeddings.TokenEmbedding -time (python train.py --device='gpu' \ - --lr=5e-4 \ - --batch_size=32 \ - --epochs=1 \ - --use_token_embedding=True \ - --vdl_dir='./vdl_paddlenlp_dir' >${log_path}/word_embedding_paddlenlp_train) >>${log_path}/word_embedding_paddlenlp_train 2>&1 -print_info $? word_embedding_paddlenlp_train -# 使用paddle.nn.Embedding -time (python train.py --device='gpu' \ - --lr=1e-4 \ - --batch_size=32 \ - --epochs=1 \ - --use_token_embedding=False \ - --vdl_dir='./vdl_paddle_dir' >${log_path}/word_embedding_paddle_train) >>${log_path}/word_embedding_paddle_train 2>&1 -print_info $? word_embedding_paddle_train -} -# 19 ernie-ctm -ernie-ctm(){ -export CUDA_VISIBLE_DEVICES=${cudaid1} -cd ${nlp_dir}/examples/text_to_knowledge/ernie-ctm/ -wget https://paddlenlp.bj.bcebos.com/paddlenlp/datasets/wordtag_dataset_v2.tar.gz && tar -zxvf wordtag_dataset_v2.tar.gz -time (python -m paddle.distributed.launch train.py \ - --max_seq_len 128 \ - --batch_size 8 \ - --learning_rate 5e-5 \ - --num_train_epochs 1 \ - --logging_steps 1 \ - --save_steps 100 \ - --output_dir ./output/ \ - --device "gpu" >${log_path}/ernie-ctm_train) >>${log_path}/ernie-ctm_train 2>&1 -print_info $? 
ernie-ctm_train -export CUDA_VISIBLE_DEVICES=${cudaid1} -time (python -m paddle.distributed.launch predict.py \ - --batch_size 32 \ - --params_path ./output/model_125/model_state.pdparams \ - --device "gpu" >${log_path}/ernie-ctm_eval) >>${log_path}/ernie-ctm_eval 2>&1 -print_info $? ernie-ctm_eval -} -# 20 distilbert -distilbert (){ -cd ${nlp_dir}/examples/model_compression/distill_lstm/ -wget -q https://paddle-qa.bj.bcebos.com/SST-2_GLUE.tar -tar -xzvf SST-2_GLUE.tar -time ( - python small.py \ - --task_name sst-2 \ - --vocab_size 30522 \ - --max_epoch 1 \ - --batch_size 64 \ - --lr 1.0 \ - --dropout_prob 0.4 \ - --output_dir small_models/SST-2 \ - --save_steps 10000 \ - --embedding_name w2v.google_news.target.word-word.dim300.en >${log_path}/distilbert_small_train) >>${log_path}/distilbert_small_train 2>&1 -print_info $? distilbert_small_train -time ( - python bert_distill.py \ - --task_name sst-2 \ - --vocab_size 30522 \ - --max_epoch 1 \ - --lr 1.0 \ - --task_name sst-2 \ - --dropout_prob 0.2 \ - --batch_size 128 \ - --model_name bert-base-uncased \ - --output_dir distilled_models/SST-2 \ - --teacher_dir ./SST-2/sst-2_ft_model_1.pdparams/ \ - --save_steps 1000 \ - --n_iter 1 \ - --embedding_name w2v.google_news.target.word-word.dim300.en >${log_path}/distilbert_teacher_train) >>${log_path}/distilbert_teacher_train 2>&1 -print_info $? distilbert_teacher_train -} -fast_transformer(){ -# FT -cd ${nlp_dir}/ -export PYTHONPATH=$PWD/PaddleNLP/:$PYTHONPATH -wget -q https://paddle-qa.bj.bcebos.com/paddle-pipeline/Develop-TagBuild-Infer-Linux-Gpu-Cuda120-Cudnn89-Trt86-Mkl-Avx-Gcc122/latest/paddle_inference.tgz -tar -zxf paddle_inference.tgz -cd ${nlp_dir}/paddlenlp/ops -#python op -mkdir build_tr_so -cd build_tr_so/ -cmake .. -DCMAKE_BUILD_TYPE=Release \ --DCMAKE_C_COMPILER=${C_COMPILER_PATH} \ --DCMAKE_CXX_COMPILER=${CXX_COMPILER_PATH} \ --DPY_CMD=python \ --DPADDLE_LIB=${nlp_dir}/paddle_inference \ --DDEMO=${nlp_dir}/paddlenlp/ops/fast_transformer/src/demo/transformer_e2e.cc \ --DON_INFER=ON -DWITH_MKL=ON -DWITH_ONNXRUNTIME=ON -make -j >${log_path}/transformer_python_FT >>${log_path}/transformer_python_FT 2>&1 -print_info $? transformer_python_FT -cd ../ -#C++ op -mkdir build_tr_cc -cd build_tr_cc/ -cmake .. -DCMAKE_BUILD_TYPE=Release \ --DCMAKE_C_COMPILER=${C_COMPILER_PATH} \ --DCMAKE_CXX_COMPILER=${CXX_COMPILER_PATH} \ --DPADDLE_LIB=${nlp_dir}/paddle_inference -DDEMO=${nlp_dir}/paddlenlp/ops/fast_transformer/src/demo/transformer_e2e.cc \ --DON_INFER=ON -DWITH_MKL=ON -DWITH_ONNXRUNTIME=ON -make -j >${log_path}/transformer_C_FT >>${log_path}/transformer_C_FT 2>&1 -print_info $? transformer_C_FT -#deploy python -cd ${nlp_dir}/examples/machine_translation/transformer/fast_transformer/ -sed -i "s#./trained_models/step_final/#./base_trained_models/step_final/#g" ../configs/transformer.base.yaml -wget -q https://paddlenlp.bj.bcebos.com/models/transformers/transformer/transformer-base-wmt_ende_bpe.tar.gz -tar -zxf transformer-base-wmt_ende_bpe.tar.gz -export FLAGS_fraction_of_gpu_memory_to_use=0.1 -cp -rf ${nlp_dir}/paddlenlp/ops/build_tr_so/third-party/build/fastertransformer/bin/decoding_gemm ./ -./decoding_gemm 8 4 8 64 38512 32 512 0 -#beam_search -python encoder_decoding_predict.py \ - --config ../configs/transformer.base.yaml \ - --decoding_lib ${nlp_dir}/paddlenlp/ops/build_tr_so/lib/libdecoding_op.so \ - --decoding_strategy beam_search \ - --beam_size 5 >${log_path}/transformer_deploy_P_FT >>${log_path}/transformer_deploy_P_FT 2>&1 -print_info $? 
transformer_deploy_P_FT
-#topk
-python encoder_decoding_predict.py \
-    --config ../configs/transformer.base.yaml \
-    --decoding_lib ${nlp_dir}/paddlenlp/ops/build_tr_so/lib/libdecoding_op.so \
-    --decoding_strategy topk_sampling \
-    --topk 3 >topk.log
-#topp
-python encoder_decoding_predict.py \
-    --config ../configs/transformer.base.yaml \
-    --decoding_lib ${nlp_dir}/paddlenlp/ops/build_tr_so/lib/libdecoding_op.so \
-    --decoding_strategy topp_sampling \
-    --topk 0 \
-    --topp 0.1 >topp.log
-#deploy c++
-python export_model.py \
-    --config ../configs/transformer.base.yaml \
-    --decoding_lib ${nlp_dir}/paddlenlp/ops/build_tr_so/lib/libdecoding_op.so \
-    --decoding_strategy beam_search --beam_size 5
-./decoding_gemm 8 5 8 64 38512 256 512 0
-${nlp_dir}/paddlenlp/ops/build_tr_cc/bin/./transformer_e2e -batch_size 8 -gpu_id 0 -model_dir ./infer_model/ -vocab_file ${PPNLP_HOME}/datasets/WMT14ende/WMT14.en-de/wmt14_ende_data_bpe/vocab_all.bpe.33708 \
--data_file ${PPNLP_HOME}/datasets/WMT14ende/WMT14.en-de/wmt14_ende_data_bpe/newstest2014.tok.bpe.33708.en >${log_path}/transformer_deploy_C_FT >>${log_path}/transformer_deploy_C_FT 2>&1
-print_info $? transformer_deploy_C_FT
+    export CUDA_VISIBLE_DEVICES=${cudaid1}
+    cd ${nlp_dir}/examples/word_embedding/
+    # Use paddlenlp.embeddings.TokenEmbedding
+    time (python train.py --device='gpu' \
+        --lr=5e-4 \
+        --batch_size=32 \
+        --epochs=1 \
+        --use_token_embedding=True \
+        --vdl_dir='./vdl_paddlenlp_dir' >${log_path}/word_embedding_paddlenlp_train) >>${log_path}/word_embedding_paddlenlp_train 2>&1
+    print_info $? word_embedding_paddlenlp_train
+    # Use paddle.nn.Embedding
+    time (python train.py --device='gpu' \
+        --lr=1e-4 \
+        --batch_size=32 \
+        --epochs=1 \
+        --use_token_embedding=False \
+        --vdl_dir='./vdl_paddle_dir' >${log_path}/word_embedding_paddle_train) >>${log_path}/word_embedding_paddle_train 2>&1
+    print_info $? word_embedding_paddle_train
+}
+fast_transformer() {
+    # FT
+    cd ${nlp_dir}/
+    export PYTHONPATH=$PWD/PaddleNLP/:$PYTHONPATH
+    wget -q https://paddle-qa.bj.bcebos.com/paddle-pipeline/Develop-TagBuild-Infer-Linux-Gpu-Cuda120-Cudnn89-Trt86-Mkl-Avx-Gcc122/latest/paddle_inference.tgz
+    tar -zxf paddle_inference.tgz
+    cd ${nlp_dir}/paddlenlp/ops
+    #python op
+    mkdir build_tr_so
+    cd build_tr_so/
+    cmake .. -DCMAKE_BUILD_TYPE=Release \
+        -DCMAKE_C_COMPILER=${C_COMPILER_PATH} \
+        -DCMAKE_CXX_COMPILER=${CXX_COMPILER_PATH} \
+        -DPY_CMD=python \
+        -DPADDLE_LIB=${nlp_dir}/paddle_inference \
+        -DDEMO=${nlp_dir}/paddlenlp/ops/fast_transformer/src/demo/transformer_e2e.cc \
+        -DON_INFER=ON -DWITH_MKL=ON -DWITH_ONNXRUNTIME=ON
+    make -j >${log_path}/transformer_python_FT >>${log_path}/transformer_python_FT 2>&1
+    print_info $? transformer_python_FT
+    cd ../
+    #C++ op
+    mkdir build_tr_cc
+    cd build_tr_cc/
+    cmake .. -DCMAKE_BUILD_TYPE=Release \
+        -DCMAKE_C_COMPILER=${C_COMPILER_PATH} \
+        -DCMAKE_CXX_COMPILER=${CXX_COMPILER_PATH} \
+        -DPADDLE_LIB=${nlp_dir}/paddle_inference -DDEMO=${nlp_dir}/paddlenlp/ops/fast_transformer/src/demo/transformer_e2e.cc \
+        -DON_INFER=ON -DWITH_MKL=ON -DWITH_ONNXRUNTIME=ON
+    make -j >${log_path}/transformer_C_FT >>${log_path}/transformer_C_FT 2>&1
+    print_info $?
transformer_C_FT + #deploy python + cd ${nlp_dir}/examples/machine_translation/transformer/fast_transformer/ + sed -i "s#./trained_models/step_final/#./base_trained_models/step_final/#g" ../configs/transformer.base.yaml + wget -q https://paddlenlp.bj.bcebos.com/models/transformers/transformer/transformer-base-wmt_ende_bpe.tar.gz + tar -zxf transformer-base-wmt_ende_bpe.tar.gz + export FLAGS_fraction_of_gpu_memory_to_use=0.1 + cp -rf ${nlp_dir}/paddlenlp/ops/build_tr_so/third-party/build/fastertransformer/bin/decoding_gemm ./ + ./decoding_gemm 8 4 8 64 38512 32 512 0 + #beam_search + python encoder_decoding_predict.py \ + --config ../configs/transformer.base.yaml \ + --decoding_lib ${nlp_dir}/paddlenlp/ops/build_tr_so/lib/libdecoding_op.so \ + --decoding_strategy beam_search \ + --beam_size 5 >${log_path}/transformer_deploy_P_FT >>${log_path}/transformer_deploy_P_FT 2>&1 + print_info $? transformer_deploy_P_FT + #topk + python encoder_decoding_predict.py \ + --config ../configs/transformer.base.yaml \ + --decoding_lib ${nlp_dir}/paddlenlp/ops/build_tr_so/lib/libdecoding_op.so \ + --decoding_strategy topk_sampling \ + --topk 3 >topk.log + #topp + python encoder_decoding_predict.py \ + --config ../configs/transformer.base.yaml \ + --decoding_lib ${nlp_dir}/paddlenlp/ops/build_tr_so/lib/libdecoding_op.so \ + --decoding_strategy topp_sampling \ + --topk 0 \ + --topp 0.1 >topp.log + #deploy c++ + python export_model.py \ + --config ../configs/transformer.base.yaml \ + --decoding_lib ${nlp_dir}/paddlenlp/ops/build_tr_so/lib/libdecoding_op.so \ + --decoding_strategy beam_search --beam_size 5 + ./decoding_gemm 8 5 8 64 38512 256 512 0 + ${nlp_dir}/paddlenlp/ops/build_tr_cc/bin/./transformer_e2e -batch_size 8 -gpu_id 0 -model_dir ./infer_model/ -vocab_file ${PPNLP_HOME}/datasets/WMT14ende/WMT14.en-de/wmt14_ende_data_bpe/vocab_all.bpe.33708 \ + -data_file ${PPNLP_HOME}/datasets/WMT14ende/WMT14.en-de/wmt14_ende_data_bpe/newstest2014.tok.bpe.33708.en >${log_path}/transformer_deploy_C_FT >>${log_path}/transformer_deploy_C_FT 2>&1 + print_info $? 
transformer_deploy_C_FT } # 22 transformer -transformer (){ -cd ${nlp_dir}/examples/machine_translation/transformer/ -wget -q https://paddle-qa.bj.bcebos.com/paddlenlp/WMT14.en-de.partial.tar.gz -tar -xzvf WMT14.en-de.partial.tar.gz -time ( -sed -i "s/save_step: 10000/save_step: 1/g" configs/transformer.base.yaml -sed -i "s/print_step: 100/print_step: 1/g" configs/transformer.base.yaml -sed -i "s/epoch: 30/epoch: 1/g" configs/transformer.base.yaml -sed -i "s/max_iter: None/max_iter: 2/g" configs/transformer.base.yaml -sed -i "s/batch_size: 4096/batch_size: 1000/g" configs/transformer.base.yaml +transformer() { + cd ${nlp_dir}/legacy/examples/machine_translation/transformer/ + wget -q https://paddle-qa.bj.bcebos.com/paddlenlp/WMT14.en-de.partial.tar.gz + tar -xzvf WMT14.en-de.partial.tar.gz + time ( + sed -i "s/save_step: 10000/save_step: 1/g" configs/transformer.base.yaml + sed -i "s/print_step: 100/print_step: 1/g" configs/transformer.base.yaml + sed -i "s/epoch: 30/epoch: 1/g" configs/transformer.base.yaml + sed -i "s/max_iter: None/max_iter: 2/g" configs/transformer.base.yaml + sed -i "s/batch_size: 4096/batch_size: 1000/g" configs/transformer.base.yaml -python train.py --config ./configs/transformer.base.yaml \ - --train_file ${PWD}/WMT14.en-de.partial/train.tok.clean.bpe.en ${PWD}/WMT14.en-de.partial/train.tok.clean.bpe.de \ - --dev_file ${PWD}/WMT14.en-de.partial/dev.tok.bpe.en ${PWD}/WMT14.en-de.partial/dev.tok.bpe.de \ - --vocab_file ${PWD}/WMT14.en-de.partial/vocab_all.bpe.33708 \ - --unk_token "" --bos_token "" --eos_token "" >${log_path}/transformer_train) >>${log_path}/transformer_train 2>&1 -print_info $? transformer_train -#predict -time ( -sed -i 's#init_from_params: "./trained_models/step/"#init_from_params: "./trained_models/step_final/"#g' configs/transformer.base.yaml -python predict.py --config ./configs/transformer.base.yaml \ - --test_file ${PWD}/WMT14.en-de.partial/test.tok.bpe.en ${PWD}/WMT14.en-de.partial/test.tok.bpe.de \ - --without_ft \ - --vocab_file ${PWD}/WMT14.en-de.partial/vocab_all.bpe.33708 \ - --unk_token "" --bos_token "" --eos_token "" >${log_path}/transformer_predict) >>${log_path}/transformer_predict 2>&1 -print_info $? transformer_predict -#export -time ( -python export_model.py --config ./configs/transformer.base.yaml \ - --vocab_file ${PWD}/WMT14.en-de.partial/vocab_all.bpe.33708 \ - --bos_token "" --eos_token "" >${log_path}/transformer_export) >>${log_path}/transformer_export 2>&1 -print_info $? transformer_export -#infer -time ( -python ./deploy/python/inference.py --config ./configs/transformer.base.yaml \ - --profile \ - --test_file ${PWD}/WMT14.en-de.partial/test.tok.bpe.en ${PWD}/WMT14.en-de.partial/test.tok.bpe.de \ - --vocab_file ${PWD}/WMT14.en-de.partial/vocab_all.bpe.33708 \ - --unk_token "" --bos_token "" --eos_token "" >${log_path}/transformer_infer) >>${log_path}/transformer_infer 2>&1 -print_info $? transformer_infer + python train.py --config ./configs/transformer.base.yaml \ + --train_file ${PWD}/WMT14.en-de.partial/train.tok.clean.bpe.en ${PWD}/WMT14.en-de.partial/train.tok.clean.bpe.de \ + --dev_file ${PWD}/WMT14.en-de.partial/dev.tok.bpe.en ${PWD}/WMT14.en-de.partial/dev.tok.bpe.de \ + --vocab_file ${PWD}/WMT14.en-de.partial/vocab_all.bpe.33708 \ + --unk_token "" --bos_token "" --eos_token "" >${log_path}/transformer_train + ) >>${log_path}/transformer_train 2>&1 + print_info $? 
transformer_train + #predict + time ( + sed -i 's#init_from_params: "./trained_models/step/"#init_from_params: "./trained_models/step_final/"#g' configs/transformer.base.yaml + python predict.py --config ./configs/transformer.base.yaml \ + --test_file ${PWD}/WMT14.en-de.partial/test.tok.bpe.en ${PWD}/WMT14.en-de.partial/test.tok.bpe.de \ + --without_ft \ + --vocab_file ${PWD}/WMT14.en-de.partial/vocab_all.bpe.33708 \ + --unk_token "" --bos_token "" --eos_token "" >${log_path}/transformer_predict + ) >>${log_path}/transformer_predict 2>&1 + print_info $? transformer_predict + #export + time ( + python export_model.py --config ./configs/transformer.base.yaml \ + --vocab_file ${PWD}/WMT14.en-de.partial/vocab_all.bpe.33708 \ + --bos_token "" --eos_token "" >${log_path}/transformer_export + ) >>${log_path}/transformer_export 2>&1 + print_info $? transformer_export + #infer + time ( + python ./deploy/python/inference.py --config ./configs/transformer.base.yaml \ + --profile \ + --test_file ${PWD}/WMT14.en-de.partial/test.tok.bpe.en ${PWD}/WMT14.en-de.partial/test.tok.bpe.de \ + --vocab_file ${PWD}/WMT14.en-de.partial/vocab_all.bpe.33708 \ + --unk_token "" --bos_token "" --eos_token "" >${log_path}/transformer_infer + ) >>${log_path}/transformer_infer 2>&1 + print_info $? transformer_infer -# fast_transformer -} -# 23 pet -pet (){ -path="examples/few_shot/pet" -python scripts/regression/ci_normal_case.py ${path} -} -efl(){ -path="examples/few_shot/efl" -python scripts/regression/ci_normal_case.py ${path} -} -p-tuning(){ -path="examples/few_shot/p-tuning" -python scripts/regression/ci_normal_case.py ${path} + # fast_transformer } #25 ernie-doc ernie-doc(){ -cd ${nlp_dir}/model_zoo/ernie-doc/ -export CUDA_VISIBLE_DEVICES=${cudaid2} -time (python -m paddle.distributed.launch --log_dir hyp run_classifier.py --epochs 15 --layerwise_decay 0.7 --learning_rate 5e-5 --batch_size 4 --save_steps 100 --max_steps 100 --dataset hyp --output_dir hyp >${log_path}/ernie-doc_hyp) >>${log_path}/ernie-doc_hyp 2>&1 -print_info $? ernie-doc_hyp -time (python -m paddle.distributed.launch --log_dir cmrc2018 run_mrc.py --batch_size 4 --layerwise_decay 0.8 --dropout 0.2 --learning_rate 4.375e-5 --epochs 1 --save_steps 100 --max_steps 100 --dataset cmrc2018 --output_dir cmrc2018 >${log_path}/ernie-doc_cmrc2018) >>${log_path}/ernie-doc_cmrc2018 2>&1 -print_info $? ernie-doc_cmrc2018 -time (python -m paddle.distributed.launch --log_dir c3 run_mcq.py --learning_rate 6.5e-5 --epochs 1 --save_steps 100 --max_steps 100 --output_dir c3 >${log_path}/ernie-doc_c3) >>${log_path}/ernie-doc_c3 2>&1 -print_info $? ernie-doc_c3 -time (python -m paddle.distributed.launch --log_dir cail/ run_semantic_matching.py --epochs 1 --layerwise_decay 0.8 --learning_rate 1.25e-5 --batch_size 4 --save_steps 100 --max_steps 100 --output_dir cail >${log_path}/ernie-doc_cail) >>${log_path}/ernie-doc_cail 2>&1 -print_info $? ernie-doc_cail -time (python -m paddle.distributed.launch --log_dir msra run_sequence_labeling.py --learning_rate 3e-5 --epochs 1 --save_steps 100 --max_steps 100 --output_dir msra >${log_path}/ernie-doc_msar) >>${log_path}/ernie-doc_msar 2>&1 -print_info $? ernie-doc_msar -time (python run_mrc.py --model_name_or_path ernie-doc-base-zh --dataset dureader_robust --batch_size 8 --learning_rate 2.75e-4 --epochs 1 --save_steps 10 --max_steps 2 --logging_steps 10 --device gpu >${log_path}/ernie-doc_dureader_robust) >>${log_path}/ernie-doc_dureader_robust 2>&1 -print_info $? 
ernie-doc_dureader_robust
+    cd ${nlp_dir}/model_zoo/ernie-doc/
+    export CUDA_VISIBLE_DEVICES=${cudaid2}
+    time (python -m paddle.distributed.launch --log_dir hyp run_classifier.py --epochs 15 --layerwise_decay 0.7 --learning_rate 5e-5 --batch_size 4 --save_steps 100 --max_steps 100 --dataset hyp --output_dir hyp >${log_path}/ernie-doc_hyp) >>${log_path}/ernie-doc_hyp 2>&1
+    print_info $? ernie-doc_hyp
+    time (python -m paddle.distributed.launch --log_dir cmrc2018 run_mrc.py --batch_size 4 --layerwise_decay 0.8 --dropout 0.2 --learning_rate 4.375e-5 --epochs 1 --save_steps 100 --max_steps 100 --dataset cmrc2018 --output_dir cmrc2018 >${log_path}/ernie-doc_cmrc2018) >>${log_path}/ernie-doc_cmrc2018 2>&1
+    print_info $? ernie-doc_cmrc2018
+    time (python -m paddle.distributed.launch --log_dir c3 run_mcq.py --learning_rate 6.5e-5 --epochs 1 --save_steps 100 --max_steps 100 --output_dir c3 >${log_path}/ernie-doc_c3) >>${log_path}/ernie-doc_c3 2>&1
+    print_info $? ernie-doc_c3
+    time (python -m paddle.distributed.launch --log_dir cail/ run_semantic_matching.py --epochs 1 --layerwise_decay 0.8 --learning_rate 1.25e-5 --batch_size 4 --save_steps 100 --max_steps 100 --output_dir cail >${log_path}/ernie-doc_cail) >>${log_path}/ernie-doc_cail 2>&1
+    print_info $? ernie-doc_cail
+    time (python -m paddle.distributed.launch --log_dir msra run_sequence_labeling.py --learning_rate 3e-5 --epochs 1 --save_steps 100 --max_steps 100 --output_dir msra >${log_path}/ernie-doc_msra) >>${log_path}/ernie-doc_msra 2>&1
+    print_info $? ernie-doc_msra
+    time (python run_mrc.py --model_name_or_path ernie-doc-base-zh --dataset dureader_robust --batch_size 8 --learning_rate 2.75e-4 --epochs 1 --save_steps 10 --max_steps 2 --logging_steps 10 --device gpu >${log_path}/ernie-doc_dureader_robust) >>${log_path}/ernie-doc_dureader_robust 2>&1
+    print_info $? ernie-doc_dureader_robust
}
#26 transformer-xl
transformer-xl (){
-cd ${nlp_dir}/examples/language_model/transformer-xl/
-mkdir gen_data && cd gen_data
-wget https://paddle-qa.bj.bcebos.com/paddlenlp/enwik8.tar.gz && tar -zxvf enwik8.tar.gz
-cd ../
-export CUDA_VISIBLE_DEVICES=${cudaid2}
-time (sed -i 's/print_step: 100/print_step: 1/g' configs/enwik8.yaml
-sed -i 's/save_step: 10000/save_step: 3/g' configs/enwik8.yaml
-sed -i 's/batch_size: 16/batch_size: 8/g' configs/enwik8.yaml
-sed -i 's/max_step: 400000/max_step: 3/g' configs/enwik8.yaml
-python -m paddle.distributed.launch train.py --config ./configs/enwik8.yaml >${log_path}/transformer-xl_train_enwik8) >>${log_path}/transformer-xl_train_enwik8 2>&1
-print_info $? transformer-xl_train_enwik8
-time (sed -i 's/batch_size: 8/batch_size: 1/g' configs/enwik8.yaml
-sed -i 's#init_from_params: "./trained_models/step_final/"#init_from_params: "./trained_models/step_3/"#g' configs/enwik8.yaml
-python eval.py --config ./configs/enwik8.yaml >${log_path}/transformer-xl_eval_enwik8) >>${log_path}/transformer-xl_eval_enwik8 2>&1
-print_info $? 
transformer-xl_eval_enwik8 + cd ${nlp_dir}/examples/language_model/transformer-xl/ + mkdir gen_data && cd gen_data + wget https://paddle-qa.bj.bcebos.com/paddlenlp/enwik8.tar.gz && tar -zxvf enwik8.tar.gz + cd ../ + export CUDA_VISIBLE_DEVICES=${cudaid2} + time (sed -i 's/print_step: 100/print_step: 1/g' configs/enwik8.yaml + sed -i 's/save_step: 10000/save_step: 3/g' configs/enwik8.yaml + sed -i 's/batch_size: 16/batch_size: 8/g' configs/enwik8.yaml + sed -i 's/max_step: 400000/max_step: 3/g' configs/enwik8.yaml + python -m paddle.distributed.launch train.py --config ./configs/enwik8.yaml >${log_path}/transformer-xl_train_enwik8) >>${log_path}/transformer-xl_train_enwik8 2>&1 + print_info $? transformer-xl_train_enwik8 + time (sed -i 's/batch_size: 8/batch_size: 1/g' configs/enwik8.yaml + sed -i 's#init_from_params: "./trained_models/step_final/"#init_from_params: "./trained_models/step_3/"#g' configs/enwik8.yaml + python eval.py --config ./configs/enwik8.yaml >${log_path}/transformer-xl_eval_enwik8) >>${log_path}/transformer-xl_eval_enwik8 2>&1 + print_info $? transformer-xl_eval_enwik8 } #28 question_matching question_matching() { -cd ${nlp_dir}/examples/text_matching/question_matching/ -wget -q https://paddle-qa.bj.bcebos.com/paddlenlp/data_v4.tar.gz -tar -xvzf data_v4.tar.gz -export CUDA_VISIBLE_DEVICES=${cudaid2} -#train -time ( -python -u -m paddle.distributed.launch train.py \ - --train_set ./data_v4/train/ALL/train \ - --dev_set ./data_v4/train/ALL/dev \ - --device gpu \ - --eval_step 10 \ - --max_steps 10 \ - --save_dir ./checkpoints \ - --train_batch_size 32 \ - --learning_rate 2E-5 \ - --epochs 1 \ - --rdrop_coef 0.0 >${log_path}/question_matching_train) >>${log_path}/question_matching_train 2>&1 -print_info $? question_matching_train -#predict -time ( -export CUDA_VISIBLE_DEVICES=${cudaid1} -python -u \ - predict.py \ - --device gpu \ - --params_path "./checkpoints/model_10/model_state.pdparams" \ - --batch_size 128 \ - --input_file ./data_v4/test/public_test_A \ - --result_file 0.0_predict_public_result_test_A_re >${log_path}/question_matching_predict) >>${log_path}/question_matching_predict 2>&1 -print_info $? question_matching_predict + cd ${nlp_dir}/examples/text_matching/question_matching/ + wget -q https://paddle-qa.bj.bcebos.com/paddlenlp/data_v4.tar.gz + tar -xvzf data_v4.tar.gz + export CUDA_VISIBLE_DEVICES=${cudaid2} + #train + time ( + python -u -m paddle.distributed.launch train.py \ + --train_set ./data_v4/train/ALL/train \ + --dev_set ./data_v4/train/ALL/dev \ + --device gpu \ + --eval_step 10 \ + --max_steps 10 \ + --save_dir ./checkpoints \ + --train_batch_size 32 \ + --learning_rate 2E-5 \ + --epochs 1 \ + --rdrop_coef 0.0 >${log_path}/question_matching_train) >>${log_path}/question_matching_train 2>&1 + print_info $? question_matching_train + #predict + time ( + export CUDA_VISIBLE_DEVICES=${cudaid1} + python -u \ + predict.py \ + --device gpu \ + --params_path "./checkpoints/model_10/model_state.pdparams" \ + --batch_size 128 \ + --input_file ./data_v4/test/public_test_A \ + --result_file 0.0_predict_public_result_test_A_re >${log_path}/question_matching_predict) >>${log_path}/question_matching_predict 2>&1 + print_info $? 
question_matching_predict
}
# 29 ernie-csc
ernie-csc() {
-export CUDA_VISIBLE_DEVICES=${cudaid2}
-cd ${nlp_dir}/examples/text_correction/ernie-csc
-#dowdnload data
-python download.py --data_dir ./extra_train_ds/ --url https://github.com/wdimmy/Automatic-Corpus-Generation/raw/master/corpus/train.sgml
-#trans xml txt
-python change_sgml_to_txt.py -i extra_train_ds/train.sgml -o extra_train_ds/train.txt
-#2卡训练
-python -m paddle.distributed.launch train.py --batch_size 32 --logging_steps 100 --epochs 1 --learning_rate 5e-5 --model_name_or_path ernie-1.0-base-zh --output_dir ./checkpoints/ --extra_train_ds_dir ./extra_train_ds/ >${log_path}/ernie-csc_train >>${log_path}/ernie-csc_train 2>&1
-print_info $? ernie-csc_train
-#predict
-sh run_sighan_predict.sh >${log_path}/ernie-csc_predict >>${log_path}/ernie-csc_predict 2>&1
-print_info $? ernie-csc_predict
-#export model
-python export_model.py --params_path ./checkpoints/best_model.pdparams --output_path ./infer_model/static_graph_params >${log_path}/ernie-csc_export >>${log_path}/ernie-csc_export 2>&1
-print_info $? ernie-csc_export
-#python deploy
-python predict.py --model_file infer_model/static_graph_params.pdmodel --params_file infer_model/static_graph_params.pdiparams >${log_path}/ernie-csc_deploy >>${log_path}/ernie-csc_deploy 2>&1
-print_info $? ernie-csc_deploy
-}
-#30 nptag
-nptag() {
-cd ${nlp_dir}/examples/text_to_knowledge/nptag/
-wget -q https://paddlenlp.bj.bcebos.com/paddlenlp/datasets/nptag_dataset.tar.gz && tar -zxvf nptag_dataset.tar.gz
-export CUDA_VISIBLE_DEVICES=${cudaid2}
-python -m paddle.distributed.launch train.py \
-    --batch_size 64 \
-    --learning_rate 1e-6 \
-    --num_train_epochs 1 \
-    --logging_steps 10 \
-    --save_steps 100 \
-    --output_dir ./output \
-    --device "gpu" >${log_path}/nptag_train >>${log_path}/nptag_train 2>&1
-print_info $? nptag_train
-export CUDA_VISIBLE_DEVICES=${cudaid2}
-python -m paddle.distributed.launch predict.py \
-    --device=gpu \
-    --params_path ./output/model_100/model_state.pdparams >${log_path}/nptag_predict >>${log_path}/nptag_predict 2>&1
-print_info $? nptag_predict
-python export_model.py --params_path=./output/model_100/model_state.pdparams --output_path=./export >${log_path}/nptag_export >>${log_path}/nptag_export 2>&1
-print_info $? nptag_export
-python deploy/python/predict.py --model_dir=./export >${log_path}/nptag_depoly >>${log_path}/nptag_deploy 2>&1
-print_info $? nptag_depoly
+    export CUDA_VISIBLE_DEVICES=${cudaid2}
+    cd ${nlp_dir}/examples/text_correction/ernie-csc
+    #download data
+    python download.py --data_dir ./extra_train_ds/ --url https://github.com/wdimmy/Automatic-Corpus-Generation/raw/master/corpus/train.sgml
+    #convert sgml to txt
+    python change_sgml_to_txt.py -i extra_train_ds/train.sgml -o extra_train_ds/train.txt
+    #train on 2 GPUs
+    python -m paddle.distributed.launch train.py --batch_size 32 --logging_steps 100 --epochs 1 --learning_rate 5e-5 --model_name_or_path ernie-1.0-base-zh --output_dir ./checkpoints/ --extra_train_ds_dir ./extra_train_ds/ >${log_path}/ernie-csc_train >>${log_path}/ernie-csc_train 2>&1
+    print_info $? ernie-csc_train
+    #predict
+    sh run_sighan_predict.sh >${log_path}/ernie-csc_predict >>${log_path}/ernie-csc_predict 2>&1
+    print_info $? ernie-csc_predict
+    #export model
+    python export_model.py --params_path ./checkpoints/best_model.pdparams --output_path ./infer_model/static_graph_params >${log_path}/ernie-csc_export >>${log_path}/ernie-csc_export 2>&1
+    print_info $? 
ernie-csc_export
+    #python deploy
+    python predict.py --model_file infer_model/static_graph_params.pdmodel --params_file infer_model/static_graph_params.pdiparams >${log_path}/ernie-csc_deploy >>${log_path}/ernie-csc_deploy 2>&1
+    print_info $? ernie-csc_deploy
}
#31 ernie-m
ernie-m() {
-export CUDA_VISIBLE_DEVICES=${cudaid2}
-cd ${nlp_dir}/model_zoo/ernie-m
-# TODO(ouyanghongyu): remove the following scripts later.
-if [ ! -f 'test.py' ];then
-    echo '模型测试文件不存在!'
-    # finetuned for cross-lingual-transfer
-    python -m paddle.distributed.launch --log_dir output_clt run_classifier.py \
-        --do_train \
-        --do_eval \
-        --do_export \
+    export CUDA_VISIBLE_DEVICES=${cudaid2}
+    cd ${nlp_dir}/model_zoo/ernie-m
+    # TODO(ouyanghongyu): remove the following scripts later.
+    if [ ! -f 'test.py' ];then
+        echo 'Model test file does not exist!'
+        # finetuned for cross-lingual-transfer
+        python -m paddle.distributed.launch --log_dir output_clt run_classifier.py \
+            --do_train \
+            --do_eval \
+            --do_export \
+            --device gpu \
+            --task_type cross-lingual-transfer \
+            --model_name_or_path __internal_testing__/ernie-m \
+            --use_test_data True \
+            --test_data_path ../../tests/fixtures/tests_samples/xnli/xnli.jsonl \
+            --output_dir output_clt \
+            --export_model_dir output_clt \
+            --per_device_train_batch_size 8 \
+            --save_steps 1 \
+            --eval_steps 1 \
+            --max_steps 2 \
+            --overwrite_output_dir \
+            --remove_unused_columns False >${log_path}/ernie-m_clt >>${log_path}/ernie-m_clt 2>&1
+        print_info $? ernie-m_clt
+        # finetuned for translate-train-all
+        python -m paddle.distributed.launch --log_dir output_tta run_classifier.py \
+            --do_train \
+            --do_eval \
+            --do_export \
+            --device gpu \
+            --task_type translate-train-all \
+            --model_name_or_path __internal_testing__/ernie-m \
+            --use_test_data True \
+            --test_data_path ../../tests/fixtures/tests_samples/xnli/xnli.jsonl \
+            --output_dir output_tta \
+            --export_model_dir output_tta \
+            --per_device_train_batch_size 8 \
+            --save_steps 1 \
+            --eval_steps 1 \
+            --max_steps 2 \
+            --overwrite_output_dir \
+            --remove_unused_columns False >${log_path}/ernie-m_tta >>${log_path}/ernie-m_tta 2>&1
+        print_info $? ernie-m_tta
+    else
+        python -m pytest ${nlp_dir}/tests/model_zoo/test_ernie_m.py >${log_path}/ernie-m >>${log_path}/ernie-m 2>&1
+        print_info $? ernie-m
+    fi
+}
+#32 clue
+clue() {
+    cd ${nlp_dir}/legacy/examples/benchmark/clue/classification
+    python -u ./run_clue_classifier_trainer.py \
+        --model_name_or_path ernie-3.0-base-zh \
+        --dataset "clue afqmc" \
+        --max_seq_length 128 \
+        --per_device_train_batch_size 32 \
+        --per_device_eval_batch_size 32 \
+        --learning_rate 1e-5 \
+        --num_train_epochs 3 \
+        --logging_steps 1 \
+        --seed 42 \
+        --save_steps 3 \
+        --warmup_ratio 0.1 \
+        --weight_decay 0.01 \
+        --adam_epsilon 1e-8 \
+        --output_dir ./tmp \
        --device gpu \
-        --task_type cross-lingual-transfer \
-        --model_name_or_path __internal_testing__/ernie-m \
-        --use_test_data True \
-        --test_data_path ../../tests/fixtures/tests_samples/xnli/xnli.jsonl \
-        --output_dir output_clt \
-        --export_model_dir output_clt \
-        --per_device_train_batch_size 8 \
-        --save_steps 1 \
-        --eval_steps 1 \
-        --max_steps 2 \
-        --overwrite_output_dir \
-        --remove_unused_columns False >${log_path}/ernie-m_clt >>${log_path}/ernie-m_clt 2>&1
-    print_info $? 
ernie-m_clt
-    # finetuned for translate-train-all
-    python -m paddle.distributed.launch --log_dir output_tta run_classifier.py \
        --do_train \
        --do_eval \
-        --do_export \
-        --device gpu \
-        --task_type translate-train-all \
-        --model_name_or_path __internal_testing__/ernie-m \
-        --use_test_data True \
-        --test_data_path ../../tests/fixtures/tests_samples/xnli/xnli.jsonl \
-        --output_dir output_tta \
-        --export_model_dir output_tta \
-        --per_device_train_batch_size 8 \
+        --metric_for_best_model "eval_accuracy" \
+        --load_best_model_at_end \
+        --save_total_limit 1 \
+        --max_steps 1 >${log_path}/clue-trainer_api >>${log_path}/clue-trainer_api 2>&1
+    print_info $? clue-trainer_api
+    python -u run_clue_classifier.py \
+        --model_name_or_path ernie-3.0-base-zh \
+        --task_name afqmc \
+        --max_seq_length 128 \
+        --batch_size 16 \
+        --learning_rate 3e-5 \
+        --num_train_epochs 3 \
+        --logging_steps 100 \
+        --seed 42 \
+        --save_steps 1 \
-        --eval_steps 1 \
-        --max_steps 2 \
-        --overwrite_output_dir \
-        --remove_unused_columns False >${log_path}/ernie-m_tta >>${log_path}/ernie-m_tta 2>&1
-    print_info $? ernie-m_tta
-else
-    python -m pytest ${nlp_dir}/tests/model_zoo/test_ernie_m.py >${log_path}/ernie-m >>${log_path}/ernie-m 2>&1
-    print_info $? ernie-m
-fi
-}
-#32 clue
-clue (){
-cd ${nlp_dir}/examples/benchmark/clue/classification
-python -u ./run_clue_classifier_trainer.py \
-    --model_name_or_path ernie-3.0-base-zh \
-    --dataset "clue afqmc" \
-    --max_seq_length 128 \
-    --per_device_train_batch_size 32 \
-    --per_device_eval_batch_size 32 \
-    --learning_rate 1e-5 \
-    --num_train_epochs 3 \
-    --logging_steps 1 \
-    --seed 42 \
-    --save_steps 3 \
-    --warmup_ratio 0.1 \
-    --weight_decay 0.01 \
-    --adam_epsilon 1e-8 \
-    --output_dir ./tmp \
-    --device gpu \
-    --do_train \
-    --do_eval \
-    --metric_for_best_model "eval_accuracy" \
-    --load_best_model_at_end \
-    --save_total_limit 1 \
-    --max_steps 1 >${log_path}/clue-trainer_api >>${log_path}/clue-trainer_api 2>&1
-print_info $? clue-tranier_api
-python -u run_clue_classifier.py \
-    --model_name_or_path ernie-3.0-base-zh \
-    --task_name afqmc \
-    --max_seq_length 128 \
-    --batch_size 16 \
-    --learning_rate 3e-5 \
-    --num_train_epochs 3 \
-    --logging_steps 100 \
-    --seed 42 \
-    --save_steps 1 \
-    --warmup_proportion 0.1 \
-    --weight_decay 0.01 \
-    --adam_epsilon 1e-8 \
-    --output_dir ./output/afqmc \
-    --device gpu \
-    --max_steps 1 \
-    --do_train >${log_path}/clue-class >>${log_path}/clue-class 2>&1
-print_info $? clue-class
-cd ${nlp_dir}/examples/benchmark/clue/mrc
-export CUDA_VISIBLE_DEVICES=${cudaid1}
-python -m paddle.distributed.launch run_cmrc2018.py \
-    --model_name_or_path ernie-3.0-base-zh \
-    --batch_size 16 \
-    --learning_rate 3e-5 \
-    --max_seq_length 512 \
-    --num_train_epochs 2 \
-    --do_train \
-    --do_predict \
-    --warmup_proportion 0.1 \
-    --weight_decay 0.01 \
-    --gradient_accumulation_steps 2 \
-    --max_steps 1 \
-    --output_dir ./tmp >${log_path}/clue-mrc >>${log_path}/clue-mrc 2>&1
-print_info $? 
clue-mrc -} -#32 textcnn -textcnn(){ -cd ${nlp_dir}/examples/sentiment_analysis/textcnn -wget https://bj.bcebos.com/paddlenlp/datasets/RobotChat.tar.gz -tar xvf RobotChat.tar.gz -wget https://bj.bcebos.com/paddlenlp/robot_chat_word_dict.txt -wget https://bj.bcebos.com/paddlenlp/models/textcnn.pdparams -python -m paddle.distributed.launch train.py \ - --vocab_path=./robot_chat_word_dict.txt \ - --init_from_ckpt=./textcnn.pdparams \ - --device=gpu \ - --lr=5e-5 \ - --batch_size=64 \ - --epochs=1 \ - --save_dir=./checkpoints \ - --data_path=./RobotChat >${log_path}/textcnn_train >>${log_path}/textcnn_train 2>&1 -print_info $? textcnn_train -python export_model.py --vocab_path=./robot_chat_word_dict.txt --params_path=./checkpoints/final.pdparams \ - --output_path=./static_graph_params >${log_path}/textcnn_export >>${log_path}/textcnn_export 2>&1 -print_info $? export_export -python deploy/python/predict.py --model_file=static_graph_params.pdmodel \ - --params_file=static_graph_params.pdiparams >${log_path}/textcnn_depoly >>${log_path}/textcnn_depoly 2>&1 -print_info $? textcnn_deploy -python predict.py --vocab_path=./robot_chat_word_dict.txt \ - --device=gpu \ - --params_path=./checkpoints/final.pdparams >${log_path}/textcnn_predict >>${log_path}/textcnn_predict 2>&1 -print_info $? textcnn_predict + --warmup_proportion 0.1 \ + --weight_decay 0.01 \ + --adam_epsilon 1e-8 \ + --output_dir ./output/afqmc \ + --device gpu \ + --max_steps 1 \ + --do_train >${log_path}/clue-class >>${log_path}/clue-class 2>&1 + print_info $? clue-class + cd ${nlp_dir}/examples/benchmark/clue/mrc + export CUDA_VISIBLE_DEVICES=${cudaid1} + python -m paddle.distributed.launch run_cmrc2018.py \ + --model_name_or_path ernie-3.0-base-zh \ + --batch_size 16 \ + --learning_rate 3e-5 \ + --max_seq_length 512 \ + --num_train_epochs 2 \ + --do_train \ + --do_predict \ + --warmup_proportion 0.1 \ + --weight_decay 0.01 \ + --gradient_accumulation_steps 2 \ + --max_steps 1 \ + --output_dir ./tmp >${log_path}/clue-mrc >>${log_path}/clue-mrc 2>&1 + print_info $? clue-mrc } #33 taskflow taskflow (){ -cd ${nlp_dir} -python -m pytest tests/taskflow/test_*.py >${nlp_dir}/unittest_logs/taskflow_unittest >>${nlp_dir}/unittest_logs/taskflow_unittest 2>&1 -print_info $? taskflow_unittest -python -m pytest scripts/regression/test_taskflow.py >${log_path}/taskflow >>${log_path}/taskflow 2>&1 -print_info $? taskflow + cd ${nlp_dir} + python -m pytest tests/taskflow/test_*.py >${nlp_dir}/unittest_logs/taskflow_unittest >>${nlp_dir}/unittest_logs/taskflow_unittest 2>&1 + print_info $? taskflow_unittest + python -m pytest scripts/regression/test_taskflow.py >${log_path}/taskflow >>${log_path}/taskflow 2>&1 + print_info $? taskflow } llm(){ -cd ${nlp_dir}/csrc -echo "build paddlenlp_op" -python setup_cuda.py install + cd ${nlp_dir}/csrc + echo "build paddlenlp_op" + python setup_cuda.py install -echo ' Testing all LLMs ' -cd ${nlp_dir} -python -m pytest tests/llm/test_*.py --alluredir=result >${log_path}/llm >>${log_path}/llm 2>&1 -print_info $? llm + echo ' Testing all LLMs ' + cd ${nlp_dir} + python -m pytest tests/llm/test_*.py --alluredir=result >${log_path}/llm >>${log_path}/llm 2>&1 + print_info $? llm } fast_generation(){ -cd ${nlp_dir}/fast_generation/samples -# python codegen_sample.py >${log_path}/fast_generation_codegen >>${log_path}/fast_generation_codegen 2>&1 -# print_info $? 
fast_generation_codegen
+    cd ${nlp_dir}/fast_generation/samples
+    # python codegen_sample.py >${log_path}/fast_generation_codegen >>${log_path}/fast_generation_codegen 2>&1
+    # print_info $? fast_generation_codegen
-python gpt_sample.py >${log_path}/fast_generation_gpt >>${log_path}/fast_generation_gpt 2>&1
-print_info $? fast_generation_gpt
+    python gpt_sample.py >${log_path}/fast_generation_gpt >>${log_path}/fast_generation_gpt 2>&1
+    print_info $? fast_generation_gpt
-python mbart_sample.py >${log_path}/fast_generation_mbart >>${log_path}/fast_generation_mbart 2>&1
-print_info $? fast_generation_mbart
+    python mbart_sample.py >${log_path}/fast_generation_mbart >>${log_path}/fast_generation_mbart 2>&1
+    print_info $? fast_generation_mbart
-python plato_sample.py >${log_path}/fast_generation_plato >>${log_path}/fast_generation_plato 2>&1
-print_info $? fast_generation_plato
+    python plato_sample.py >${log_path}/fast_generation_plato >>${log_path}/fast_generation_plato 2>&1
+    print_info $? fast_generation_plato
-python t5_sample.py --use_faster >${log_path}/fast_generation_t5 >>${log_path}/fast_generation_t5 2>&1
-print_info $? fast_generation_t5
+    python t5_sample.py --use_faster >${log_path}/fast_generation_t5 >>${log_path}/fast_generation_t5 2>&1
+    print_info $? fast_generation_t5
-cd ${nlp_dir}/paddlenlp/ops/fast_transformer/sample/
-python bart_decoding_sample.py >${log_path}/fast_generation_bart >>${log_path}/fast_generation_bart 2>&1
-print_info $? fast_generation_bart
+    cd ${nlp_dir}/paddlenlp/ops/fast_transformer/sample/
+    python bart_decoding_sample.py >${log_path}/fast_generation_bart >>${log_path}/fast_generation_bart 2>&1
+    print_info $? fast_generation_bart
-python t5_export_model_sample.py >${log_path}/t5_export_model_sample >>${log_path}/t5_export_model_sample 2>&1
-print_info $? t5_export_model_sample
+    python t5_export_model_sample.py >${log_path}/t5_export_model_sample >>${log_path}/t5_export_model_sample 2>&1
+    print_info $? t5_export_model_sample
-python t5_export_model_sample.py >${log_path}/t5_export_model_sample >>${log_path}/t5_export_model_sample 2>&1
-print_info $? t5_export_model_sample
-# fast_gpt
-# fast_transformer
+    # fast_gpt
+    # fast_transformer
}
ernie-3.0(){
-cd ${nlp_dir}/model_zoo/ernie-3.0/
-#训练
-python run_seq_cls.py --model_name_or_path ernie-3.0-medium-zh --dataset afqmc --output_dir ./best_models --export_model_dir best_models/ --do_train --do_eval --do_export --config=configs/default.yml --max_steps=2 --save_step=2 >${log_path}/ernie-3.0_train_seq_cls >>${log_path}/ernie-3.0_train_seq_cls 2>&1
-print_info $? ernie-3.0_train_seq_cls
-python run_token_cls.py --model_name_or_path ernie-3.0-medium-zh --dataset msra_ner --output_dir ./best_models --export_model_dir best_models/ --do_train --do_eval --do_export --config=configs/default.yml --max_steps=2 --save_step=2 >${log_path}/ernie-3.0_train_token_cls >>${log_path}/ernie-3.0_train_token_cls 2>&1
-print_info $? ernie-3.0_train_token_cls
-python run_qa.py --model_name_or_path ernie-3.0-medium-zh --dataset cmrc2018 --output_dir ./best_models --export_model_dir best_models/ --do_train --do_eval --do_export --config=configs/default.yml --max_steps=2 --save_step=2 >${log_path}/ernie-3.0_train_qa >>${log_path}/ernie-3.0_train_qa 2>&1
-print_info $? 
ernie-3.0_train_qa
-# 预测
-python run_seq_cls.py --model_name_or_path best_models/afqmc/ --dataset afqmc --output_dir ./best_models --do_predict --config=configs/default.yml >${log_path}/ernie-3.0_predict_seq_cls >>${log_path}/ernie-3.0_predict_seq_cls 2>&1
-print_info $? ernie-3.0_predict_seq_cls
-python run_token_cls.py --model_name_or_path best_models/msra_ner/ --dataset msra_ner --output_dir ./best_models --do_predict --config=configs/default.yml >${log_path}/ernie-3.0_predict_token_cls >>${log_path}/ernie-3.0_predict_token_cls 2>&1
-print_info $? ernie-3.0_predict_token_cls
-python run_qa.py --model_name_or_path best_models/cmrc2018/ --dataset cmrc2018 --output_dir ./best_models --do_predict --config=configs/default.yml >${log_path}/ernie-3.0_predict_qa >>${log_path}/ernie-3.0_predict_qa 2>&1
-print_info $? ernie-3.0_predict_qa
-#压缩
-python compress_seq_cls.py --model_name_or_path best_models/afqmc/ --dataset afqmc --output_dir ./best_models/afqmc --config=configs/default.yml --max_steps 10 --eval_steps 5 --save_steps 5 --save_steps 5 --algo_list mse --batch_size_list 4 >${log_path}/ernie-3.0_compress_seq_cls >>${log_path}/ernie-3.0_compress_seq_cls 2>&1
-print_info $? ernie-3.0_compress_seq_cls
-python compress_token_cls.py --model_name_or_path best_models/msra_ner/ --dataset msra_ner --output_dir ./best_models/msra_ner --config=configs/default.yml --max_steps 10 --eval_steps 5 --save_steps 5 --algo_list mse --batch_size_list 4 >${log_path}/ernie-3.0_compress_token_cls >>${log_path}/ernie-3.0_compress_token_cls 2>&1
-print_info $? ernie-3.0_compress_token_cls
-python compress_qa.py --model_name_or_path best_models/cmrc2018/ --dataset cmrc2018 --output_dir ./best_models/cmrc2018 --config=configs/default.yml --max_steps 10 --eval_steps 5 --save_steps 5 --algo_list mse --batch_size_list 4 >${log_path}/ernie-3.0_compress_qa >>${log_path}/ernie-3.0_compress_qa 2>&1
-print_info $? ernie-3.0_compress_qa
+    cd ${nlp_dir}/model_zoo/ernie-3.0/
+    #train
+    python run_seq_cls.py --model_name_or_path ernie-3.0-medium-zh --dataset afqmc --output_dir ./best_models --export_model_dir best_models/ --do_train --do_eval --do_export --config=configs/default.yml --max_steps=2 --save_step=2 >${log_path}/ernie-3.0_train_seq_cls >>${log_path}/ernie-3.0_train_seq_cls 2>&1
+    print_info $? ernie-3.0_train_seq_cls
+    python run_token_cls.py --model_name_or_path ernie-3.0-medium-zh --dataset msra_ner --output_dir ./best_models --export_model_dir best_models/ --do_train --do_eval --do_export --config=configs/default.yml --max_steps=2 --save_step=2 >${log_path}/ernie-3.0_train_token_cls >>${log_path}/ernie-3.0_train_token_cls 2>&1
+    print_info $? ernie-3.0_train_token_cls
+    python run_qa.py --model_name_or_path ernie-3.0-medium-zh --dataset cmrc2018 --output_dir ./best_models --export_model_dir best_models/ --do_train --do_eval --do_export --config=configs/default.yml --max_steps=2 --save_step=2 >${log_path}/ernie-3.0_train_qa >>${log_path}/ernie-3.0_train_qa 2>&1
+    print_info $? ernie-3.0_train_qa
+    # predict
+    python run_seq_cls.py --model_name_or_path best_models/afqmc/ --dataset afqmc --output_dir ./best_models --do_predict --config=configs/default.yml >${log_path}/ernie-3.0_predict_seq_cls >>${log_path}/ernie-3.0_predict_seq_cls 2>&1
+    print_info $? 
ernie-3.0_predict_seq_cls
+    python run_token_cls.py --model_name_or_path best_models/msra_ner/ --dataset msra_ner --output_dir ./best_models --do_predict --config=configs/default.yml >${log_path}/ernie-3.0_predict_token_cls >>${log_path}/ernie-3.0_predict_token_cls 2>&1
+    print_info $? ernie-3.0_predict_token_cls
+    python run_qa.py --model_name_or_path best_models/cmrc2018/ --dataset cmrc2018 --output_dir ./best_models --do_predict --config=configs/default.yml >${log_path}/ernie-3.0_predict_qa >>${log_path}/ernie-3.0_predict_qa 2>&1
+    print_info $? ernie-3.0_predict_qa
+    #compress
+    python compress_seq_cls.py --model_name_or_path best_models/afqmc/ --dataset afqmc --output_dir ./best_models/afqmc --config=configs/default.yml --max_steps 10 --eval_steps 5 --save_steps 5 --algo_list mse --batch_size_list 4 >${log_path}/ernie-3.0_compress_seq_cls >>${log_path}/ernie-3.0_compress_seq_cls 2>&1
+    print_info $? ernie-3.0_compress_seq_cls
+    python compress_token_cls.py --model_name_or_path best_models/msra_ner/ --dataset msra_ner --output_dir ./best_models/msra_ner --config=configs/default.yml --max_steps 10 --eval_steps 5 --save_steps 5 --algo_list mse --batch_size_list 4 >${log_path}/ernie-3.0_compress_token_cls >>${log_path}/ernie-3.0_compress_token_cls 2>&1
+    print_info $? ernie-3.0_compress_token_cls
+    python compress_qa.py --model_name_or_path best_models/cmrc2018/ --dataset cmrc2018 --output_dir ./best_models/cmrc2018 --config=configs/default.yml --max_steps 10 --eval_steps 5 --save_steps 5 --algo_list mse --batch_size_list 4 >${log_path}/ernie-3.0_compress_qa >>${log_path}/ernie-3.0_compress_qa 2>&1
+    print_info $? ernie-3.0_compress_qa
}
ernie-health(){
-cd ${nlp_dir}/tests/model_zoo/
-if [ ! -f 'test_ernie-health.py' ];then
-    echo '模型测试文件不存在!'
-else
-    python -m pytest tests/model_zoo/test_ernie-health.py >${log_path}/ernie-health_unittest>>${log_path}/ernie-health_unittest 2>&1
-    print_info $? tests ernie-health_unittest
-fi
+    cd ${nlp_dir}/tests/model_zoo/
+    if [ ! -f 'test_ernie-health.py' ];then
+        echo 'Model test file does not exist!'
+    else
+        python -m pytest tests/model_zoo/test_ernie-health.py >${log_path}/ernie-health_unittest >>${log_path}/ernie-health_unittest 2>&1
+        print_info $? ernie-health_unittest
+    fi
}
uie(){
-cd ${nlp_dir}/model_zoo/uie/
-mkdir data && cd data && wget https://bj.bcebos.com/paddlenlp/datasets/uie/doccano_ext.json && cd ../
-python doccano.py --doccano_file ./data/doccano_ext.json --task_type ext --save_dir ./data --splits 0.8 0.2 0 --schema_lang ch >${log_path}/uie_doccano>>${log_path}/uie_doccano 2>&1
-print_info $? uie_doccano
-python -u -m paddle.distributed.launch finetune.py --device gpu --logging_steps 2 --save_steps 2 --eval_steps 2 --seed 42 \
-    --model_name_or_path uie-base --output_dir ./checkpoint/model_best --train_path data/train.txt --dev_path data/dev.txt \
-    --max_seq_length 512 --per_device_eval_batch_size 16 --per_device_train_batch_size 16 --num_train_epochs 100 --learning_rate 1e-5 \
-    --do_train --do_eval --do_export --export_model_dir ./checkpoint/model_best --label_names start_positions end_positions \
-    --overwrite_output_dir --disable_tqdm True --metric_for_best_model eval_f1 --load_best_model_at_end True \
-    --save_total_limit 1 --max_steps 2 >${log_path}/uie_train>>${log_path}/uie_train2>&1
-print_info $? uie_train
-python evaluate.py --model_path ./checkpoint/model_best --test_path ./data/dev.txt --batch_size 16 --max_seq_len 512 >${log_path}/uie_eval>>${log_path}/uie_eval 2>&1
-print_info $? 
uie_eval
+    cd ${nlp_dir}/model_zoo/uie/
+    mkdir data && cd data && wget https://bj.bcebos.com/paddlenlp/datasets/uie/doccano_ext.json && cd ../
+    python doccano.py --doccano_file ./data/doccano_ext.json --task_type ext --save_dir ./data --splits 0.8 0.2 0 --schema_lang ch >${log_path}/uie_doccano >>${log_path}/uie_doccano 2>&1
+    print_info $? uie_doccano
+    python -u -m paddle.distributed.launch finetune.py --device gpu --logging_steps 2 --save_steps 2 --eval_steps 2 --seed 42 \
+        --model_name_or_path uie-base --output_dir ./checkpoint/model_best --train_path data/train.txt --dev_path data/dev.txt \
+        --max_seq_length 512 --per_device_eval_batch_size 16 --per_device_train_batch_size 16 --num_train_epochs 100 --learning_rate 1e-5 \
+        --do_train --do_eval --do_export --export_model_dir ./checkpoint/model_best --label_names start_positions end_positions \
+        --overwrite_output_dir --disable_tqdm True --metric_for_best_model eval_f1 --load_best_model_at_end True \
+        --save_total_limit 1 --max_steps 2 >${log_path}/uie_train >>${log_path}/uie_train 2>&1
+    print_info $? uie_train
+    python evaluate.py --model_path ./checkpoint/model_best --test_path ./data/dev.txt --batch_size 16 --max_seq_len 512 >${log_path}/uie_eval >>${log_path}/uie_eval 2>&1
+    print_info $? uie_eval
}
ernie-layout(){
-cd ${nlp_dir}/model_zoo/ernie-layout/
-# train ner
-python -u run_ner.py --model_name_or_path ernie-layoutx-base-uncased --output_dir ./ernie-layoutx-base-uncased/models/funsd/ \
-    --dataset_name funsd --do_train --do_eval --max_steps 2 --eval_steps 2 --save_steps 2 --save_total_limit 1 --seed 1000 --overwrite_output_dir \
-    --load_best_model_at_end --pattern ner-bio --preprocessing_num_workers 4 --overwrite_cache false --doc_stride 128 --target_size 1000 \
-    --per_device_train_batch_size 4 --per_device_eval_batch_size 4 --learning_rate 2e-5 --lr_scheduler_type constant --gradient_accumulation_steps 1 \
-    --metric_for_best_model eval_f1 --greater_is_better true >${log_path}/ernie-layout_train>>${log_path}/ernie-layout_train 2>&1
-print_info $? ernie-layout_train
-# export ner
-python export_model.py --task_type ner --model_path ./ernie-layoutx-base-uncased/models/funsd/ --output_path ./ner_export >${log_path}/ernie-layout_export>>${log_path}/ernie-layout_export2>&1
-print_info $? ernie-layout_export
-# deploy ner
-cd ${nlp_dir}/model_zoo/ernie-layout/deploy/python
-wget https://bj.bcebos.com/paddlenlp/datasets/document_intelligence/images.zip && unzip images.zip
-python infer.py --model_path_prefix ../../ner_export/inference --task_type ner --lang "en" --batch_size 8 >${log_path}/ernie-layout_deploy>>${log_path}/ernie-layout_deploy 2>&1
-print_info $? ernie-layout_deploy
+    cd ${nlp_dir}/model_zoo/ernie-layout/
+    # train ner
+    python -u run_ner.py --model_name_or_path ernie-layoutx-base-uncased --output_dir ./ernie-layoutx-base-uncased/models/funsd/ \
+        --dataset_name funsd --do_train --do_eval --max_steps 2 --eval_steps 2 --save_steps 2 --save_total_limit 1 --seed 1000 --overwrite_output_dir \
+        --load_best_model_at_end --pattern ner-bio --preprocessing_num_workers 4 --overwrite_cache false --doc_stride 128 --target_size 1000 \
+        --per_device_train_batch_size 4 --per_device_eval_batch_size 4 --learning_rate 2e-5 --lr_scheduler_type constant --gradient_accumulation_steps 1 \
+        --metric_for_best_model eval_f1 --greater_is_better true >${log_path}/ernie-layout_train >>${log_path}/ernie-layout_train 2>&1
+    print_info $? 
ernie-layout_train
+    # export ner
+    python export_model.py --task_type ner --model_path ./ernie-layoutx-base-uncased/models/funsd/ --output_path ./ner_export >${log_path}/ernie-layout_export >>${log_path}/ernie-layout_export 2>&1
+    print_info $? ernie-layout_export
+    # deploy ner
+    cd ${nlp_dir}/model_zoo/ernie-layout/deploy/python
+    wget https://bj.bcebos.com/paddlenlp/datasets/document_intelligence/images.zip && unzip images.zip
+    python infer.py --model_path_prefix ../../ner_export/inference --task_type ner --lang "en" --batch_size 8 >${log_path}/ernie-layout_deploy >>${log_path}/ernie-layout_deploy 2>&1
+    print_info $? ernie-layout_deploy
}
ernie-1.0(){
    ernie
@@ -1082,17 +898,13 @@ ernie_m(){
}
ernie_layout(){
-ernie-layout
+    ernie-layout
}
ernie_csc(){
    ernie-csc
}
-ernie_ctm(){
-    ernie-ctm
-}
-
ernie_doc(){
    ernie-doc
}
@@ -1102,25 +914,25 @@ ernie_health(){
}
segment_parallel_utils(){
-cd ${nlp_dir}
-echo "test segment_parallel_utils, cudaid1:${cudaid1}, cudaid2:${cudaid2}"
-if [[ ${cudaid1} != ${cudaid2} ]]; then
-    time (python -m paddle.distributed.launch tests/transformers/test_segment_parallel_utils.py >${log_path}/segment_parallel_utils) >>${log_path}/segment_parallel_utils 2>&1
-    print_info $? segment_parallel_utils
-else
-    echo "only one gpu:${cudaid1} is set, skip test"
-fi
+    cd ${nlp_dir}
+    echo "test segment_parallel_utils, cudaid1:${cudaid1}, cudaid2:${cudaid2}"
+    if [[ ${cudaid1} != ${cudaid2} ]]; then
+        time (python -m paddle.distributed.launch tests/transformers/test_segment_parallel_utils.py >${log_path}/segment_parallel_utils) >>${log_path}/segment_parallel_utils 2>&1
+        print_info $? segment_parallel_utils
+    else
+        echo "only one gpu:${cudaid1} is set, skip test"
+    fi
}
ring_flash_attention(){
-cd ${nlp_dir}
-echo "test ring_flash_attention, cudaid1:${cudaid1}, cudaid2:${cudaid2}"
-if [[ ${cudaid1} != ${cudaid2} ]]; then
-    time (python -m paddle.distributed.launch tests/transformers/test_ring_flash_attention.py >${log_path}/ring_flash_attention) >>${log_path}/ring_flash_attention 2>&1
-    print_info $? ring_flash_attention
-else
-    echo "only one gpu:${cudaid1} is set, skip test"
-fi
+    cd ${nlp_dir}
+    echo "test ring_flash_attention, cudaid1:${cudaid1}, cudaid2:${cudaid2}"
+    if [[ ${cudaid1} != ${cudaid2} ]]; then
+        time (python -m paddle.distributed.launch tests/transformers/test_ring_flash_attention.py >${log_path}/ring_flash_attention) >>${log_path}/ring_flash_attention 2>&1
+        print_info $? 
ring_flash_attention
+    else
+        echo "only one gpu:${cudaid1} is set, skip test"
+    fi
}
$1
diff --git a/scripts/regression/get_model_list.py b/scripts/regression/get_model_list.py
index 97bfff2c3b5d..3e90ea2bb404 100644
--- a/scripts/regression/get_model_list.py
+++ b/scripts/regression/get_model_list.py
@@ -20,97 +20,65 @@ def get_model_list():
    """
    get model list from
-
+
    """
    CI_MODEL_LIST = [
-        "waybill_ie",
-        "msra_ner",
-        "glue",
+        "DuEE",
+        "DuReader-robust",
+        "DuReader-yesno",
+        "SQuAD",
+        "albert",
        "bert",
-        "skep",
        "bigbird",
+        "clue",
+        "couplet",
+        "doc",
        "electra",
-        "gpt",
+        "elmo",
        "ernie",
-        "xlnet",
-        "ofa",
-        "albert",
-        "squad",
-        "tinybert",
-        "lexical_analysis",
-        "seq2seq",
-        "pretrained_models",
-        "word_embedding",
-        "ernie-ctm",
-        "distilbert",
-        "stacl",
-        "transformer",
-        "simbert",
+        "ernie-1.0",
+        "ernie-csc",
        "ernie-doc",
-        "transformer-xl",
+        "ernie-gen",
+        "ernie-health",
        "ernie-m",
-        "plato-xl",
-        "pointer_summarizer",
-        "question_matching",
+        "ernie_matching",
        "few_shot",
-        "unimo-text",
-        "ernie-csc",
-        "nptag",
-        "ofa",
-        "transformer",
-        "DuIE",
-        "tcn",
-        "word_embedding",
-        "unified_transformer",
-        "lic2021_baseline",
-        "vae-seq2seq",
+        "glue",
+        "gpt",
+        "gpt-3",
+        "lexical_analysis",
+        "minilmv2",
+        "mpnet",
        "msra_ner",
-        "simbert",
-        "clue",
-        "pet",
-        "bert",
-        "ernie-ctm",
-        "DuReader-yesno",
-        "nptag",
-        "semantic_indexing",
-        "seq2seq",
+        "ofa",
        "pointer_summarizer",
-        "bigbird",
-        "unimo-text",
-        "minilmv2",
-        "wordtag",
-        "simcse",
-        "ernie-gen",
-        "distill_lstm",
-        "DuReader-robust",
-        "ernie_matching",
-        "rnn",
-        "ernie-1.0",
-        "stacl",
-        "erniesage",
-        "DuEE",
-        "efl",
-        "doc",
-        "couplet",
-        "rnnlm",
        "pp-minilm",
-        "dgu",
-        "mpnet",
-        "textcnn",
-        "p-tuning",
-        "SQuAD",
-        "elmo",
-        "plato-2",
        "pretrained_models",
+        "question_matching",
+        "rnn",
+        "rnnlm",
+        "semantic_indexing",
        "sentiment_analysis",
-        "ernie-health",
-        "gpt-3",
+        "simbert",
+        "simcse",
+        "skep",
+        "squad",
+        "stacl",
+        "tcn",
+        "tinybert",
+        "transformer",
+        "transformer-xl",
+        "unimo-text",
+        "vae-seq2seq",
+        "word_embedding",
+        "xlnet",
    ]
    examples_second_list = ["model_interpretation", "semantic_indexing", "lexical_analysis", "word_embedding"]
    model_list = os.listdir("model_zoo")
-    examples_list = os.listdir("examples/")
+    examples_list = os.listdir("legacy/examples/")
    app_list = os.listdir("applications/")
    # remove model_list README
diff --git a/scripts/regression/run_ci.sh b/scripts/regression/run_ci.sh
index af2e164947e0..d4490304b7ee 100644
--- a/scripts/regression/run_ci.sh
+++ b/scripts/regression/run_ci.sh
@@ -28,12 +28,12 @@ export APIcase_list=()
 declare -A Normal_dic
 declare -A all_P0case_dic
 declare -A Build_list
-all_P0case_dic=(["waybill_ie"]=3 ["msra_ner"]=15 ["glue"]=2 ["bert"]=2 ["skep"]=10 ["bigbird"]=2 ["electra"]=2 ["gpt"]=2 ["ernie-1.0"]=2 ["xlnet"]=2 \
-["ofa"]=2 ["albert"]=2 ["SQuAD"]=20 ["lexical_analysis"]=5 ["seq2seq"]=5 ["word_embedding"]=5 \
-["ernie-ctm"]=5 ["distilbert"]=5 ["transformer"]=5 ["pet"]=5 ["efl"]=5 ["p-tuning"]=5 ["ernie-doc"]=20 ["transformer-xl"]=5 \
-["question_matching"]=5 ["ernie-csc"]=5 ["nptag"]=5 ["ernie-m"]=5 ["taskflow"]=5 ["clue"]=5 ["textcnn"]=5 \
-["fast_generation"]=10 ["ernie-3.0"]=5 ["ernie-layout"]=5 ["uie"]=5 ["ernie-health"]=5 ["llm"]=5 \
-["ernie"]=2 ["ernie_m"]=5 ["ernie_layout"]=5 ["ernie_csc"]=5 ["ernie_ctm"]=5 ["ernie_doc"]=20 ["ernie_health"]=5 ["segment_parallel_utils"]=5 ["ring_flash_attention"]=5)
+all_P0case_dic=(["msra_ner"]=15 ["glue"]=2 ["bert"]=2 ["skep"]=10 
["bigbird"]=2 ["electra"]=2 ["gpt"]=2 ["ernie-1.0"]=2 ["xlnet"]=2 + ["ofa"]=2 ["albert"]=2 ["SQuAD"]=20 ["lexical_analysis"]=5 ["word_embedding"]=5 + ["transformer"]=5 ["ernie-doc"]=20 ["transformer-xl"]=5 + ["question_matching"]=5 ["ernie-csc"]=5 ["ernie-m"]=5 ["taskflow"]=5 ["clue"]=5 ["textcnn"]=5 + ["fast_generation"]=10 ["ernie-3.0"]=5 ["ernie-layout"]=5 ["uie"]=5 ["ernie-health"]=5 ["llm"]=5 + ["ernie"]=2 ["ernie_m"]=5 ["ernie_layout"]=5 ["ernie_csc"]=5 ["ernie_ctm"]=5 ["ernie_doc"]=20 ["ernie_health"]=5 ["segment_parallel_utils"]=5 ["ring_flash_attention"]=5) #################################### python -m pip config --user set global.index http://pip.baidu-int.com/search/ diff --git a/scripts/regression/run_release.sh b/scripts/regression/run_release.sh index 6aa02bbd8dff..354960ea7fe0 100644 --- a/scripts/regression/run_release.sh +++ b/scripts/regression/run_release.sh @@ -54,7 +54,7 @@ export all_P0case_time=0 declare -A all_P0case_dic get_diff_TO_P0case(){ if [[ ${Testcase} =~ "all" ]];then - P0case_list=(waybill_ie msra_ner glue bert skep bigbird electra gpt ernie-1.0 xlnet ofa squad tinybert lexical_analysis seq2seq \ + P0case_list=(msra_ner glue bert skep bigbird electra gpt ernie-1.0 xlnet ofa squad tinybert lexical_analysis seq2seq \ word_embedding ernie-ctm distilbert stacl transformer simbert ernie-doc transformer-xl pointer_summarizer question_matching ernie-csc \ nptag ernie-m clue taskflow transformers fast_generation ernie-3.0 fast_transformer fast_gpt llama) elif [[ ${Testcase} =~ "p0" ]];then