Dsv3 dev #10273

Status: Open. Wants to merge 83 commits into base branch develop.

Commits (83)

2ab44c4
support fp8 && refine gemm runtime && opt cpu sync in moe (#10116)
zhangbo9674 Mar 13, 2025
6a203f5
Add distributed run for dsv3 (#10119)
ForFishes Mar 13, 2025
464075a
op fuse flag (#10122)
zhangbo9674 Mar 13, 2025
6c3172a
Add fused_swiglu_act(transpose)_quant op to extern op in gpt-3. (#10124)
Mar 13, 2025
4eeb192
Added topk_to_multihot and grad kernel to prevent CPU Stall. (#10127)
Mar 13, 2025
614e1aa
fix (#10130)
risemeup1 Mar 14, 2025
99c047a
[Distribution] Support DualPipeV for deepseek (#10138)
zhangyuqin1998 Mar 15, 2025
1fbbe2a
Fix cpu stall in permute and unpermute (#10147)
umiswing Mar 16, 2025
f4bf969
Delete fp8 gemm warning (#10134)
zhangbo9674 Mar 17, 2025
9d98f08
Adapt to the fix_cpu_stall (#10154)
zhangyuqin1998 Mar 17, 2025
cb4be8f
Add flag DSV3_USE_FP8_GEMM (#10133)
zhangbo9674 Mar 18, 2025
a1759fe
using gather (#10158)
ForFishes Mar 18, 2025
473ac84
Opt dualpipe overlap (#10173)
zhangyuqin1998 Mar 18, 2025
bfaa97e
fix expert parameters (#10184)
ForFishes Mar 18, 2025
37ec16e
Use Fp8 dispatch for moe layer (#10165)
zhangbo9674 Mar 18, 2025
a360eeb
fix (#10189)
risemeup1 Mar 19, 2025
5ecbd04
fix (#10191)
risemeup1 Mar 19, 2025
314841d
Fix dequant bug and dw compute bug (#10193)
zhangbo9674 Mar 19, 2025
58edb00
optimize ds3 attention impl (#10200)
phlrain Mar 19, 2025
fea2a8c
Adding TokenDispatcherUtils, for MoE token dispatch and regroup, in p…
Mar 19, 2025
cf5f6a5
Add several fused quanted ops in support of FP8 training (#10202)
Mar 19, 2025
0a77769
Optimize attention output linear fp8 memory (#10204)
phlrain Mar 19, 2025
dff10bf
Merge MoEFlexTokenLayer to MoELayer (#10205)
zhangyuqin1998 Mar 19, 2025
a874a9b
add timer for deepep (#10211)
ForFishes Mar 19, 2025
ce00414
fix (#10207)
zhangbo9674 Mar 19, 2025
eb2c988
[revert]Add timer for deepep (#10212)
ForFishes Mar 19, 2025
642869e
fix permute grad bug in pp (#10213)
zhangbo9674 Mar 19, 2025
22e4f9a
Revert "optimize ds3 attention impl (#10200)" (#10208)
phlrain Mar 19, 2025
f566b66
Spliting fusion moe, fix memlory leakage of fusion moe (#10192)
zhangyuqin1998 Mar 20, 2025
c886cb7
Remove dependency on FP8 when using BF16 (#10219)
zhangbo9674 Mar 20, 2025
f27dab3
support overlap for fusion moe (#10220)
zhangyuqin1998 Mar 20, 2025
d536792
Optimize gpu memory usage for UnPermute (#10217)
zhangbo9674 Mar 20, 2025
bb73548
Default config for dsv3 (#10225)
ForFishes Mar 20, 2025
221333f
support first_k_dense_replace for dualpipe (#10226)
zhangyuqin1998 Mar 20, 2025
53bc36b
Fix arch issue(CUDA 222) in hopper arch (#10209)
Mar 20, 2025
996336f
fix maxseqlen to 4097 (#10230)
ForFishes Mar 20, 2025
b8eff33
Adding Tdu FP32 type support (#10250)
Mar 25, 2025
e3ba0cc
fix_leakage (#10254)
zhangyuqin1998 Mar 25, 2025
c578139
Fix 0size for fp8 ExpertNode (#10263)
zhangbo9674 Mar 26, 2025
574b186
Optimizing TokenDispatcherUtils ops' performance. (#10262)
Mar 26, 2025
89c150d
optmize grouped gemm (#10274)
phlrain Mar 26, 2025
ad0a6f6
add record_stream for dispatch and combine output (#10270)
zhangbo9674 Mar 27, 2025
97df5a8
[fp8]zip and group gemm (#10214)
risemeup1 Mar 27, 2025
6dcb2ba
Adding fused_act_dequant kernel. (#10276)
Mar 27, 2025
411eef3
Fp8 speed optmize (#10272)
phlrain Mar 28, 2025
07572b4
Implemented unzip_stable_op to fully substitute permute function (#10…
Mar 28, 2025
e90215b
Adding simple token_zip op to get better performance. (#10300)
Mar 28, 2025
868add1
fix oom (#10307)
zhangbo9674 Mar 31, 2025
599fe9a
fix (#10320)
zhangbo9674 Apr 1, 2025
4770b1f
Adding probs recover functionality to tokens_zip_op (#10308)
Apr 1, 2025
01f08ba
optimize linear keepx quant (#10322)
phlrain Apr 2, 2025
925e330
add num_sms (#10324)
zhangbo9674 Apr 2, 2025
b7723f6
Support Group Gemm Mask (#10280)
risemeup1 Apr 3, 2025
0e158fb
Fix fused_act_dequant op with memset output to 0 (#10339)
Apr 3, 2025
c3c6695
optimize linear keepx quant (#10332)
phlrain Apr 3, 2025
1433482
fix mem leakage (#10344)
zhangyuqin1998 Apr 3, 2025
4113c5d
fix bug of expert init (#10347)
ForFishes Apr 4, 2025
a14cafd
fix bug (#10348)
zhangbo9674 Apr 8, 2025
d3ea4ec
Undo old cupti patch (#10367)
Apr 9, 2025
ab69953
Add FP32 support for zip op, optimize precision in bf16. (#10433)
Apr 21, 2025
6e8b317
Adding swiglu_prod_act_quant op (spaq) (#10463)
May 12, 2025
65d2d3a
optimize fp8 deep gemm tma (#10580)
phlrain May 12, 2025
07d4241
update default sm num (#10586)
phlrain May 13, 2025
ae560af
Support arbitrary num_experts and topk, with bfloat16 zip prob. (#10583)
May 13, 2025
670cbd9
Refine setup.py (#10577)
risemeup1 May 15, 2025
5406b5e
update DeepGEMM (#10429)
chen2016013 May 21, 2025
67b21ae
Support fusion moe (#10507)
risemeup1 May 22, 2025
4bfc44d
Add arbitrary padding to handle extreme inbalance case. (#10623)
May 22, 2025
f705b6b
Add int64_t index type for possible overflow position. (#10663)
May 27, 2025
a4d90ab
support new fa3 api (#10661)
phlrain May 27, 2025
6c206e1
Add fused_stack_transpose_quant kernel (optional transpose) (#10649)
lshpku May 28, 2025
07be865
Adding fused_swiglu_probs_bwd op (#10604)
May 28, 2025
361ef08
fix zip overflow (#10672)
May 29, 2025
f2712b7
Add fused_transpose_split_quant kernel (#10657)
lshpku Jun 3, 2025
85cd9cb
patch possible dequant overflow (#10691)
Jun 3, 2025
4ba0e30
limit stack use to prevent CUDA error 2 (#10696)
Jun 4, 2025
4a23f8b
Disable fast math due to precision issue (#10697)
Jun 4, 2025
42b72cd
fix grid y problem in spaq (#10709)
Jun 5, 2025
80f18ee
refine fp8 code (#10669)
zhangbo9674 Jun 9, 2025
4a872d4
Add cuda stream for fused quant kernel (#10716)
lshpku Jun 12, 2025
7ecb1dc
fix big_tensor issue in swiglu_probs_bwd (#10735)
A-nnonymous Jun 13, 2025
382fa86
fix 0size problem in unzip-zip op (#10755)
A-nnonymous Jun 20, 2025
c466d2e
Add expert subbatch & inplace fused_swiglu_probs_bwd ops (#10757)
sneaxiy Jun 23, 2025

Files changed

9 changes: 9 additions & 0 deletions .pre-commit-config.yaml
@@ -62,3 +62,12 @@ repos:
language: python
files: \.(md|markdown|rst)$
pass_filenames: true

- repo: local
hooks:
- id: clang-format
name: clang-format
description: Format files with ClangFormat.
entry: bash ./tools/codestyle/clang_format.sh -i
language: system
files: \.(c|cc|cxx|cpp|cu|h|hpp|hxx|xpu|kps)$
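A quick way to exercise just this new hook locally, sketched below with the standard pre-commit CLI; this is only an illustration and assumes pre-commit is installed and the command is run from the repository root.

```python
# Minimal sketch: invoke only the newly added clang-format hook on all files.
# Assumes the `pre-commit` package is installed and this runs from the repo root.
import subprocess

result = subprocess.run(
    ["pre-commit", "run", "clang-format", "--all-files"],
    check=False,  # a non-zero exit simply means the hook reformatted or flagged files
)
print("clang-format hook exit code:", result.returncode)
```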
28 changes: 17 additions & 11 deletions llm/config/deepseek-v3/pretrain_argument.json
@@ -1,41 +1,47 @@
 {
-"model_name_or_path": "deepseek-ai/DeepSeek-V3",
+"model_name_or_path": "./model_config/DeepSeek-V3-test",
 "tokenizer_name_or_path": "deepseek-ai/DeepSeek-V3",
 "input_dir": "./data",
 "output_dir": "./checkpoints/pretrain_ckpts",
 "per_device_train_batch_size": 1,
-"gradient_accumulation_steps": 4,
+"gradient_accumulation_steps": 120,
 "per_device_eval_batch_size": 1,
 "tensor_parallel_degree": 1,
 "pipeline_parallel_degree": 1,
-"sharding_parallel_degree": 8,
-"expert_parallel_degree": 4,
+"sharding_parallel_degree": 64,
+"sharding_parallel_config": "split_param enable_fuse_optimizer_states",
+"sharding_comm_buffer_size_MB": 2048,
+"expert_parallel_degree": 64,
 "sharding": "stage1",
 "virtual_pp_degree": 1,
 "sequence_parallel": 0,
 "use_flash_attention": true,
-"max_seq_length": 4096,
+"max_seq_length": 4097,
 "learning_rate": 3e-05,
 "min_learning_rate": 3e-06,
 "warmup_steps": 30,
 "logging_steps": 1,
-"max_steps": 10000,
+"max_steps": 200,
 "save_steps": 5000,
 "eval_steps": 1000,
 "weight_decay": 0.01,
 "bf16": true,
 "fp16_opt_level": "O2",
 "warmup_ratio": 0.01,
 "max_grad_norm": 1.0,
-"dataloader_num_workers": 1,
+"amp_master_grad": 1,
+"dataloader_num_workers": 8,
 "continue_training": 0,
 "do_train": true,
 "do_eval": true,
 "do_predict": false,
 "disable_tqdm": true,
-"recompute": true,
+"recompute": false,
 "distributed_dataloader": 1,
 "recompute_granularity": "full",
 "unified_checkpoint": true,
-"save_total_limit": 2
-}
+"save_total_limit": 2,
+"skip_profile_timer": false,
+"use_fused_rms_norm": true,
+"fuse_attention_ffn": true,
+"use_fused_rope": true
+}
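As a rough sanity check on the updated parallelism settings, the sketch below works out the global batch size implied by the values above. It assumes that with tensor and pipeline parallelism both set to 1, the stage1 sharding degree acts as the data-parallel dimension; the numbers are taken directly from the config.

```python
# Back-of-the-envelope global batch size implied by the updated config.
# Assumption: with no tensor/pipeline parallelism, sharding_parallel_degree
# behaves as the data-parallel degree under stage1 sharding.
per_device_train_batch_size = 1
gradient_accumulation_steps = 120
sharding_parallel_degree = 64      # also the expert_parallel_degree in this config
max_seq_length = 4097

global_batch_sequences = (
    per_device_train_batch_size
    * gradient_accumulation_steps
    * sharding_parallel_degree
)
tokens_per_step = global_batch_sequences * max_seq_length

print(global_batch_sequences)  # 7680 sequences per optimizer step
print(tokens_per_step)         # 31,464,960 tokens per step
```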
67 changes: 67 additions & 0 deletions llm/model_config/DeepSeek-V3-test/config.json
@@ -0,0 +1,67 @@
{
"architectures": [
"DeepseekV3ForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"auto_map": {
"AutoConfig": "configuration_deepseek.DeepseekV3Config",
"AutoModel": "modeling_deepseek.DeepseekV3Model",
"AutoModelForCausalLM": "modeling_deepseek.DeepseekV3ForCausalLM"
},
"aux_loss_alpha": 0.001,
"bos_token_id": 0,
"eos_token_id": 1,
"ep_size": 1,
"first_k_dense_replace": 0,
"hidden_act": "silu",
"hidden_size": 7168,
"initializer_range": 0.02,
"intermediate_size": 18432,
"kv_lora_rank": 512,
"max_position_embeddings": 163840,
"model_type": "deepseek_v3",
"moe_intermediate_size": 2048,
"moe_layer_freq": 1,
"n_group": 8,
"n_routed_experts": 256,
"n_shared_experts": 1,
"norm_topk_prob": true,
"num_attention_heads": 128,
"num_experts_per_tok": 8,
"num_hidden_layers": 2,
"num_key_value_heads": 128,
"num_nextn_predict_layers": 1,
"pretraining_tp": 1,
"q_lora_rank": 1536,
"qk_nope_head_dim": 128,
"qk_rope_head_dim": 64,
"rms_norm_eps": 1e-06,
"rope_scaling": {
"beta_fast": 32,
"beta_slow": 1,
"factor": 40,
"mscale": 1.0,
"mscale_all_dim": 1.0,
"original_max_position_embeddings": 4096,
"type": "yarn"
},
"rope_theta": 10000,
"routed_scaling_factor": 2.5,
"scoring_func": "sigmoid",
"seq_aux": true,
"tie_word_embeddings": false,
"topk_group": 4,
"topk_method": "noaux_tc",
"dtype": "bfloat16",
"transformers_version": "4.33.1",
"use_cache": true,
"v_head_dim": 128,
"vocab_size": 129280,
"using_flex_token": true,
"using_fake_gate": true,
"use_fused_rms_norm": true,
"fuse_attention_ffn": true,
"use_fused_rope": true,
"token_drop_steps": true
}
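To put the MoE fields of this 2-layer test config in perspective, here is a back-of-the-envelope sketch of the routed-expert weight sizes. It assumes each routed expert is a standard SwiGLU FFN with gate, up, and down projections; the result is an estimate only and ignores the shared expert, the router, the MTP head, and attention weights.

```python
# Rough per-layer MoE sizing implied by the test config above (estimate only:
# routed-expert FFN weights, ignoring the shared expert, router and attention).
hidden_size = 7168
moe_intermediate_size = 2048
n_routed_experts = 256
num_experts_per_tok = 8

# Assumed SwiGLU expert = gate_proj + up_proj + down_proj
params_per_expert = 3 * hidden_size * moe_intermediate_size

total_routed = n_routed_experts * params_per_expert          # weights stored per MoE layer
active_per_token = num_experts_per_tok * params_per_expert   # weights touched per token

print(f"{params_per_expert / 1e6:.1f}M params per expert")        # ~44.0M
print(f"{total_routed / 1e9:.2f}B routed params per MoE layer")   # ~11.27B
print(f"{active_per_token / 1e9:.2f}B active per token (top-8)")  # ~0.35B
```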