
[autonlp] text classification fix& add taskflow config file #4896


Merged
merged 4 commits into PaddlePaddle:develop from lugimzzz:autonlp on Feb 21, 2023

Conversation

lugimzzz (Contributor)

PR types

Bug fixes

PR changes

APIs

Description

Fix & support loading Taskflow from a config file

  • Fix the model candidate bug
  • Support loading Taskflow from a config file (a usage sketch follows below)
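
For illustration, a minimal usage sketch of the new config-based loading. The export path is hypothetical, and the exact contents of taskflow_config.json are an assumption based on the diffs below; it is assumed to hold the keyword arguments that Taskflow needs (task name, mode, max_length, ...).

import json
import os

from paddlenlp import Taskflow

export_path = "./autonlp_export"  # hypothetical directory passed to export()

# export() now also writes a taskflow_config.json next to the exported model
with open(os.path.join(export_path, "taskflow_config.json"), encoding="utf-8") as f:
    taskflow_config = json.load(f)

# Assumption: the JSON stores the Taskflow keyword arguments; point task_path
# at the export directory if the config does not already contain it
taskflow_config.setdefault("task_path", export_path)
cls = Taskflow(**taskflow_config)
print(cls("这家餐厅的菜品质量很好"))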

@paddle-bot
paddle-bot bot commented Feb 20, 2023

Thanks for your contribution!

@@ -132,7 +133,7 @@ def _default_prompt_tuning_arguments(self) -> PromptTuningArguments:
     def _model_candidates(self) -> List[Dict[str, Any]]:
         train_batch_size = hp.choice("batch_size", [2, 4, 8, 16, 32])
         chinese_models = hp.choice(
-            "models",
+            "chinese_models",
lugimzzz (Contributor, Author)

Fixes the DuplicateLabel: models error.
[screenshot of the DuplicateLabel: models error]
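
For context: hyperopt requires each hp.choice label within a single search space to be unique, so reusing the label "models" for more than one candidate triggers the error shown above. A minimal standalone reproduction, outside the AutoNLP code (model names are placeholders):

from hyperopt import fmin, hp, tpe

# Two hp.choice nodes sharing the label "models" in the same search space
# make hyperopt raise DuplicateLabel when fmin compiles the space.
space = {
    "chinese": hp.choice("models", ["ernie-3.0-base-zh", "ernie-3.0-medium-zh"]),
    "english": hp.choice("models", ["roberta-base", "distilroberta-base"]),
}

try:
    fmin(fn=lambda sample: 0.0, space=space, algo=tpe.suggest, max_evals=1)
except Exception as err:  # hyperopt.exceptions.DuplicateLabel
    print(type(err).__name__, err)

# Giving each node its own label (here "chinese_models", and later the unified
# "finetune_models"/"prompt_models" suggested below) avoids the collision.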

sijunhe (Collaborator)

Rather than adding another name here and ending up with four different ones, and since the English and Chinese models never appear together, let's just call them finetune_models and prompt_models uniformly.
Also add a unit test here (modeled on model_candidates, containing both a prompt-model hp.choice and a finetune-model hp.choice) so that a similar issue cannot slip past the tests in the future.

lugimzzz (Contributor, Author)

Added the unit test test_default_model_candidate.

@codecov
codecov bot commented Feb 20, 2023

Codecov Report

Merging #4896 (3812224) into develop (d2ffb89) will increase coverage by 1.71%.
The diff coverage is 92.07%.

@@             Coverage Diff             @@
##           develop    #4896      +/-   ##
===========================================
+ Coverage    44.65%   46.36%   +1.71%     
===========================================
  Files          446      448       +2     
  Lines        64375    64619     +244     
===========================================
+ Hits         28744    29960    +1216     
+ Misses       35631    34659     -972     
Impacted Files Coverage Δ
...addlenlp/experimental/autonlp/auto_trainer_base.py 89.91% <ø> (-0.09%) ⬇️
paddlenlp/transformers/ernie/modeling.py 85.30% <88.88%> (+0.08%) ⬆️
paddlenlp/transformers/auto/configuration.py 89.10% <89.10%> (ø)
paddlenlp/transformers/roberta/modeling.py 92.08% <96.77%> (+0.25%) ⬆️
...dlenlp/experimental/autonlp/text_classification.py 97.20% <100.00%> (+0.52%) ⬆️
paddlenlp/transformers/__init__.py 100.00% <100.00%> (ø)
paddlenlp/taskflow/text_classification.py 72.15% <0.00%> (-2.85%) ⬇️
paddlenlp/taskflow/task.py 50.18% <0.00%> (-1.81%) ⬇️
paddlenlp/utils/downloader.py 65.48% <0.00%> (-0.45%) ⬇️
paddlenlp/transformers/generation_utils.py 81.57% <0.00%> (-0.31%) ⬇️
... and 13 more



@@ -538,6 +521,8 @@ def export(self, export_path, trial_id=None):
         if model_config["trainer_type"] == "PromptTrainer":
             trainer.export_model(export_path)
             trainer.model.plm.save_pretrained(os.path.join(export_path, "plm"))
+            mode = "prompt"
+            max_length = model_config.get("PreprocessArguments.max_length", 128)
sijunhe (Collaborator)

Don't hardcode 128; try trainer.model.config.max_position_embeddings instead.

Comment on lines 535 to 536
mode = "finetune"
max_length = trainer.model.config.max_position_embeddings
sijunhe (Collaborator)

The logic here should be consistent with the above, right? i.e. model_config.get("PreprocessArguments.max_length", trainer.model.config.max_position_embeddings)

lugimzzz (Contributor, Author)

Changed both places uniformly to max_length = config.get("PreprocessArguments.max_length", model.config.max_position_embeddings).
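
For illustration, a small standalone sketch of the agreed fallback. The model name is only an example, and config stands in for the flattened trial config used by AutoNLP; only the dict.get fallback pattern comes from the thread above.

from paddlenlp.transformers import AutoModelForSequenceClassification

# Flattened trial config; "PreprocessArguments.max_length" may or may not be set
config = {"TrainingArguments.learning_rate": 3e-5}

model = AutoModelForSequenceClassification.from_pretrained("ernie-3.0-base-zh")

# Prefer the user-specified preprocessing max_length; otherwise fall back to the
# model's positional limit instead of a hardcoded 128
max_length = config.get("PreprocessArguments.max_length", model.config.max_position_embeddings)
print(max_length)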

}

with open(os.path.join(export_path, "taskflow_config.json"), "w", encoding="utf-8") as f:
    json.dump(taskflow_config, f, ensure_ascii=False)

if os.path.exists(self.training_path):
    logger.info("Removing training checkpoints to conserve disk space")
    shutil.rmtree(self.training_path)

logger.info(f"Exported {trial_id} to {export_path}")
sijunhe (Collaborator)

You could log one extra line: "taskflow config saved to {export_path}. You can use the taskflow config to create a Taskflow instance for inference".

lugimzzz (Contributor, Author)

Added.

@@ -298,16 +298,18 @@ def _construct_trainer(self, config, eval_dataset=None) -> Trainer:
             ]
         else:
             callbacks = None
+        max_length = config.get("PreprocessArguments.max_length", 128)
sijunhe (Collaborator)

Is this line redundant (left over)?

@@ -551,6 +553,9 @@ def export(self, export_path, trial_id=None):
 
         with open(os.path.join(export_path, "taskflow_config.json"), "w", encoding="utf-8") as f:
             json.dump(taskflow_config, f, ensure_ascii=False)
+        logger.info(
+            f"taskflow config saved to {export_path}. You can use the taskflow config to create a Taskflow instance for inference"
sijunhe (Collaborator)

Suggested change
-            f"taskflow config saved to {export_path}. You can use the taskflow config to create a Taskflow instance for inference"
+            f"Taskflow config saved to {export_path}. You can use the Taskflow config to create a Taskflow instance for inference"

lugimzzz (Contributor, Author)

Changed.

-        chinese_models = hp.choice(
-            "models",
+        chinese_finetune_models = hp.choice(
+            "finetune_models",
             [
                 "ernie-1.0-large-zh-cw" # 24-layer, 1024-hidden, 16-heads, 272M parameters.
sijunhe (Collaborator)

Missing comma? Good thing the unit test caught it.

lugimzzz (Contributor, Author)

Fixed.
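
For context, the missing comma matters because Python implicitly concatenates adjacent string literals, so the candidate list silently contains one bogus model name instead of two. A standalone illustration (the second model name is just an example):

# Missing comma: the two adjacent string literals are concatenated,
# so this list holds ONE bogus candidate name.
broken = [
    "ernie-1.0-large-zh-cw"  # <- comma missing here
    "ernie-3.0-xbase-zh",
]

# With the comma restored, the list holds the two intended candidates.
fixed = [
    "ernie-1.0-large-zh-cw",
    "ernie-3.0-xbase-zh",
]

print(len(broken), broken)  # 1 ['ernie-1.0-large-zh-cwernie-3.0-xbase-zh']
print(len(fixed), fixed)    # 2 ['ernie-1.0-large-zh-cw', 'ernie-3.0-xbase-zh']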

@@ -264,6 +263,96 @@ def test_multilabel(self, custom_model_candidate, hp_overrides):
         # test training_path
         self.assertFalse(os.path.exists(os.path.join(auto_trainer.training_path)))
 
+    @slow
sijunhe (Collaborator)

This probably needs to be moved below the parameterized decorator to take effect; the test still ran in CI. Please verify locally with pytest tests/experimental/autonlp/test_text_classification.py.

lugimzzz (Contributor, Author)

Verified locally; it does indeed need to be moved below.
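
A minimal sketch of the decorator ordering being discussed. It assumes the test method is expanded with the parameterized package's parameterized.expand and uses an inline stand-in for the test suite's @slow marker; names and parameters here are illustrative, not the actual test file.

import os
import unittest

from parameterized import parameterized


def slow(test_case):
    # Stand-in for the suite's @slow marker: skip unless slow tests are enabled
    return unittest.skipUnless(os.getenv("RUN_SLOW_TEST"), "test is slow")(test_case)


class TestTextClassification(unittest.TestCase):
    # Decorators apply bottom-up: @slow must sit below @parameterized.expand so
    # the skip mark is attached before the parameterized copies are generated;
    # placed above it, the expanded tests lose the mark and still run in CI.
    @parameterized.expand([("finetune",), ("prompt",)])
    @slow
    def test_default_model_candidate(self, mode):
        self.assertIn(mode, ("finetune", "prompt"))


if __name__ == "__main__":
    unittest.main()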

@@ -152,7 +152,6 @@ def _model_candidates(self) -> List[Dict[str, Any]]:
                 "roberta-large", # 24-layer, 1024-hidden, 16-heads, 334M parameters. Case-sensitive
                 "roberta-base", # 12-layer, 768-hidden, 12-heads, 110M parameters. Case-sensitive
                 "distilroberta-base", # 6-layer, 768-hidden, 12-heads, 66M parameters. Case-sensitive
-                "ernie-3.0-tiny-mini-v2-en", # 6-layer, 384-hidden, 12-heads, 27M parameters
lugimzzz (Contributor, Author)

Testing found that loading the tokenizer for "ernie-3.0-tiny-mini-v2-en" raises an error, so it is removed for now.

@sijunhe sijunhe (Collaborator) left a comment

lgtm!

@sijunhe sijunhe merged commit 9497c54 into PaddlePaddle:develop Feb 21, 2023
@lugimzzz lugimzzz deleted the autonlp branch February 21, 2023 07:37