Commit 871278c

[docs] update gkd (#4657)
1 parent cdf2334 commit 871278c

File tree

6 files changed

+7
-5
lines changed


docs/source/Customization/自定义数据集.md

Lines changed: 2 additions & 2 deletions

@@ -80,13 +80,13 @@ query-response format:

  - Note: GRPO passes all extra dataset fields through to the ORM, unlike other training methods, which drop extra fields by default. For example, you can additionally pass in 'solution'. A custom ORM must take one positional argument, completions; the others are keyword arguments passed through from the dataset's extra fields.

  #### GKD
- If `seq_kd` is not enabled (i.e., the parameter is False), the dataset format is as follows:
+ If `seq_kd` is not enabled (i.e., the parameter is False), the dataset format is as follows (you can pre-distill the data with the teacher model in advance):

  ```jsonl
  {"messages": [{"role": "system", "content": "You are a helpful and harmless assistant"}, {"role": "user", "content": "Tell me tomorrow's weather"}, {"role": "assistant", "content": "Tomorrow will be sunny"}]}
  {"messages": [{"role": "system", "content": "You are a helpful and harmless math calculator"}, {"role": "user", "content": "What is 1+1?"}, {"role": "assistant", "content": "It equals 2"}, {"role": "user", "content": "What about adding 1 more?"}, {"role": "assistant", "content": "It equals 3"}]}
  ```

- If `seq_kd` is enabled, the final 'assistant' turn is not required:
+ If `seq_kd` is enabled, the final 'assistant' turn is not required (the teacher model generates the data during training):

  ```jsonl
  {"messages": [{"role": "system", "content": "You are a helpful and harmless assistant"}, {"role": "user", "content": "Tell me tomorrow's weather"}]}
  {"messages": [{"role": "system", "content": "You are a helpful and harmless math calculator"}, {"role": "user", "content": "What is 1+1?"}, {"role": "assistant", "content": "It equals 2"}, {"role": "user", "content": "What about adding 1 more?"}]}
  ```
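The two formats above differ only in whether the final turn is a pre-filled assistant reply. A minimal sketch of a validator for this rule (`check_gkd_sample` is a hypothetical helper for illustration, not part of ms-swift):

```python
import json

def check_gkd_sample(line: str, seq_kd: bool) -> bool:
    """Check one jsonl line against the expected GKD format.

    With seq_kd=False the last turn must be a pre-filled assistant reply;
    with seq_kd=True the last turn should be a user message, since the
    teacher model generates the response during training.
    """
    messages = json.loads(line)["messages"]
    last_role = messages[-1]["role"]
    return last_role == ("user" if seq_kd else "assistant")

sample = ('{"messages": [{"role": "user", "content": "Tell me tomorrow' + "'" + 's weather"}, '
          '{"role": "assistant", "content": "Sunny"}]}')
print(check_gkd_sample(sample, seq_kd=False))  # True: the sample ends with an assistant turn
```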

docs/source/Instruction/命令行参数.md

Lines changed: 1 addition & 0 deletions

@@ -409,6 +409,7 @@ RLHF parameters inherit from the [training parameters](#训练参数).

  - temperature: Default is 0.9. This parameter is used in PPO, GRPO, and GKD.
  - lmbda: Default is 0.5. Used in GKD. The lambda parameter that controls the student data fraction (i.e., the proportion of on-policy student-generated outputs).
  - seq_kd: Default is False. Used in GKD. Controls whether to perform sequence-level knowledge distillation (Sequence-Level KD), which can be viewed as supervised fine-tuning on teacher-generated outputs.
+ - Note: You can run inference on the dataset with the teacher model in advance (accelerated with an inference engine such as vllm/sglang/lmdeploy) and set `seq_kd` to False during training. Alternatively, set `seq_kd` to True to generate sequences with the teacher model during training (this guarantees different generated data across epochs, but is slower).

  #### Reward/Teacher model parameters
  The reward model parameters are used in PPO and GRPO.

docs/source_en/Customization/Custom-dataset.md

Lines changed: 2 additions & 2 deletions

@@ -82,14 +82,14 @@ The following outlines the standard dataset format for ms-swift, where the "syst

  #### GKD

- If `seq_kd` is not enabled (i.e., the parameter is set to `False`), the dataset format should be as follows:
+ If `seq_kd` is not enabled (i.e., the parameter is set to `False`), the dataset format is as follows (you can use a teacher model to pre-distill the data in advance):

  ```jsonl
  {"messages": [{"role": "system", "content": "You are a useful and harmless assistant"}, {"role": "user", "content": "Tell me tomorrow's weather"}, {"role": "assistant", "content": "Tomorrow's weather will be sunny"}]}
  {"messages": [{"role": "system", "content": "You are a useful and harmless math calculator"}, {"role": "user", "content": "What is 1 + 1?"}, {"role": "assistant", "content": "It equals 2"}, {"role": "user", "content": "What about adding 1?"}, {"role": "assistant", "content": "It equals 3"}]}
  ```

- If `seq_kd` is enabled, the final `assistant` turn is not required in the dataset. The format should be:
+ If `seq_kd` is enabled, the final `assistant` turn is not required (the teacher model generates the data during training):

  ```jsonl
  {"messages": [{"role": "system", "content": "You are a useful and harmless assistant"}, {"role": "user", "content": "Tell me tomorrow's weather"}]}
  ```

docs/source_en/Instruction/Command-line-parameters.md

Lines changed: 1 addition & 0 deletions

@@ -419,6 +419,7 @@ RLHF arguments inherit from the [training arguments](#training-arguments).

  - temperature: Default is 0.9; this parameter is used in PPO, GRPO, and GKD.
  - lmbda: Default is 0.5. This parameter is used in GKD. It is the lambda parameter that controls the student data fraction (i.e., the proportion of on-policy student-generated outputs).
  - seq_kd: Default is False. This parameter is used in GKD. It controls whether to perform Sequence-Level KD (which can be viewed as supervised fine-tuning on teacher-generated outputs).
+ - Note: You can run inference on the dataset with the teacher model in advance (accelerated by an inference engine such as vLLM, SGLang, or LMDeploy) and set `seq_kd` to False during training. Alternatively, set `seq_kd` to True to generate sequences with the teacher model during training (this ensures different generated data across epochs, but is slower).
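The offline route in the note above can be sketched as follows. `teacher_generate` is a hypothetical callable standing in for whatever inference engine you use (vLLM, SGLang, LMDeploy); this is an illustrative sketch, not swift's implementation:

```python
import json
from typing import Callable

def pre_distill(in_path: str, out_path: str,
                teacher_generate: Callable[[list], str]) -> None:
    """Offline sequence-KD: complete each sample with a teacher-generated reply.

    Reads jsonl samples whose last turn is a user message, appends the
    teacher's response as the final assistant turn, and writes the result,
    so training can then run with seq_kd=False.
    """
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            messages = json.loads(line)["messages"]
            reply = teacher_generate(messages)  # one teacher inference call per sample
            messages.append({"role": "assistant", "content": reply})
            fout.write(json.dumps({"messages": messages}, ensure_ascii=False) + "\n")
```

Distilling once up front trades per-epoch diversity for speed: every epoch then reuses the same teacher outputs.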

#### Reward/Teacher Model Parameters

swift/llm/dataset/utils.py

Lines changed: 0 additions & 1 deletion

@@ -301,7 +301,6 @@ def __init__(
          'template': template,
          'packing_interval': packing_interval,
          'strict': strict,
-         'version': 'v1',
      })
      self.dataset_name = f'packing-cache-{fingerprint}'
      with safe_ddp_context(None, True):

swift/llm/template/base.py

Lines changed: 1 addition & 0 deletions

@@ -85,6 +85,7 @@ def __init__(
      from .template_meta import TemplateMeta
      from swift.plugin import agent_templates, loss_scale_map
      self._processor_inited = False
+     self._version = 'v1'  # Avoid compatibility issues caused by load_from_cache_file caching.
      self.max_length = max_length

      if not use_chat_template:
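The caching idea behind this change can be illustrated with a small sketch (a hypothetical `fingerprint` helper, not swift's actual implementation): any field folded into the hash, such as a version string, forces a new cache entry whenever it changes, so stale cached preprocessing is never reused.

```python
import hashlib
import json

def fingerprint(config: dict) -> str:
    """Stable hash of a config dict, usable as part of a cache name."""
    payload = json.dumps(config, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode()).hexdigest()[:16]

v1 = fingerprint({"packing_interval": 128, "strict": True, "version": "v1"})
v2 = fingerprint({"packing_interval": 128, "strict": True, "version": "v2"})
print(v1 != v2)  # True: bumping the version yields a different cache name
```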

0 commit comments

Comments
 (0)