Dsv3 dev #10273
Open
phlrain wants to merge 83 commits into develop from dsv3_dev (base: develop)
+537,929 −1,577
Conversation
* add distributed run
* fix topo
* add distributed print
* Add fused_swiglu_act(transpose)_quant op to extern op in gpt-3
* Polishing code.
* remove unnecessary lines.
* remove unnecessary lines in cu
* Add padding function to fused_swiglu_act_quant_op
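The fused op folds the SwiGLU activation and the quantization step into a single kernel. A minimal unfused sketch of the assumed semantics in plain Paddle (the function name, the per-tensor scaling scheme, and quant_max are illustrative assumptions, not the op's actual interface):

```python
import paddle
import paddle.nn.functional as F

def swiglu_act_quant_reference(x, quant_max=448.0):
    # Assumed semantics: SwiGLU activation, then FP8-style per-tensor
    # scaling; the fused kernel does both in one pass over memory.
    gate, up = paddle.chunk(x, chunks=2, axis=-1)      # split the last dim in half
    act = F.silu(gate) * up                            # SwiGLU: silu(gate) * up
    scale = paddle.max(paddle.abs(act)) / quant_max    # per-tensor scale (448 ~ FP8 E4M3 max)
    quant = paddle.clip(act / scale, -quant_max, quant_max)
    return quant, scale
```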
* [Distribution] Support DualPipeV for deepseek
* add
* fix
* add
* add

* commit for save
* revert gpu sum in add loss for mtp

* refine
* fix

* add flag DSV3_USE_FP8_GEMM
* fix
* add flag DSV3_USE_FP8_GEMM
* fix
* add fp8 comm
* fix bug
* fix bug
* fix bug
* fix bug
* fix bug
* fix
* fix bug
* fix bug
* replace index_select with gather
* close fuse_moe
* fix dequant bug
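Two items in this batch are concrete enough to illustrate: for 1-D row selection, paddle.gather is a drop-in replacement for paddle.index_select, and the commits suggest the FP8 GEMM path is toggled by the DSV3_USE_FP8_GEMM environment flag. A sketch (how the flag is read is an assumption, not the PR's actual wiring):

```python
import os
import paddle

# Assumed wiring for the flag named in the commits; the real code
# may read it elsewhere or with a different default.
use_fp8_gemm = os.getenv("DSV3_USE_FP8_GEMM", "0") == "1"

x = paddle.randn([8, 16])
idx = paddle.to_tensor([3, 1, 5])

# Before: index_select; after: gather. For a 1-D index along axis 0
# the two select exactly the same rows.
out_a = paddle.index_select(x, index=idx, axis=0)
out_b = paddle.gather(x, index=idx, axis=0)
assert paddle.allclose(out_a, out_b)
```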
* fix dequant bug
* fix bug
* fix bug
* fix
…preparation for GroupedGEMM use in training. (#10190)
* Add regroup_tokens op and optest, fix topk_to_multihot setup.py
* Add test file
* fix miscs
* Add tokens_unzip & weighted_zip in preparation for fp8-groupedgemm
* Added expert_idx output to tokens_unzip op.
* Fix prob datatype issue.
* Implemented double input & output regroup op.
* Further fix bf16 issues.
* Fix implicit bug.
* Change the unzip op to save more useful data
* Refactor and combine tokens_unzip_and_zip.
* Fixed concurrent semaphore bug.
* delete synchronize in zip op, and start adding guided unzip kernel.
* Add fp8 support for unzip op, but cannot fake a tensor for testing.
* Added guided_unzip op.
* Modified guided_unzip to satisfy real usage.
* polish code
* Fix typos and polish & test code
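The unzip/zip pair is the data-movement half of grouped GEMM for MoE: "unzip" reorders token copies so each expert's rows are contiguous, and the weighted "zip" scatters expert outputs back to token order while folding in the routing probabilities. A plain-Paddle sketch of the assumed semantics (names and shapes are illustrative; the real ops also handle fp8 payloads and quantization metadata):

```python
import paddle

def tokens_unzip_reference(tokens, topk_idx, num_experts):
    # tokens: [num_tokens, hidden]; topk_idx: [num_tokens, topk].
    # Reorder (token, k) pairs so rows routed to the same expert are
    # contiguous -- the layout grouped GEMM needs.
    num_tokens, topk = topk_idx.shape
    flat_expert = topk_idx.flatten()                   # expert id per (token, k) pair
    order = paddle.argsort(flat_expert)                # group rows by expert
    token_ids = paddle.arange(num_tokens).repeat_interleave(topk)
    unzipped = paddle.gather(tokens, paddle.gather(token_ids, order))
    counts = paddle.bincount(flat_expert, minlength=num_experts)
    return unzipped, order, counts                     # counts: rows per expert

def weighted_zip_reference(expert_out, order, probs):
    # Reverse of unzip: undo the permutation, weight each copy by its
    # routing probability, and sum over the top-k copies per token.
    num_tokens, topk = probs.shape
    inv = paddle.argsort(order)                        # inverse permutation
    restored = paddle.gather(expert_out, inv) * probs.flatten().unsqueeze(-1)
    return restored.reshape([num_tokens, topk, -1]).sum(axis=1)
```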
* Added two fused ops, refactor some old swiglu code.
* delete unnecessary print.

* optimize attention impl
* optimize_attention_output_linear_fp8_memory

* Support overlap for fusion moe, fix memory leakage of fusion moe
* fix
* fix conflict
Co-authored-by: Pan Zhaowu <panzhaowu@baidu.com>
* First version, passed precision test.
* Add optest.
* restore setup.py
* Adding optional prob for spaq
* Optimized spaq in last-dim 8x cases.
* fix type
* Further improve performance with_prob
* remove unnecessary calculations.

* Add arbitrary expert_num and topk support for unzip and zip.
* Merge bfloat16 zip prob support for flex num_experts and topk

* fix
* fix
* merge
* Update m_grouped_gemm.py
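m_grouped_gemm.py points at the compute half of the same pipeline: one GEMM per expert over that expert's contiguous rows, which a fused kernel executes in a single launch. A loop-of-matmuls reference for the assumed contract (names are illustrative):

```python
import paddle

def m_grouped_gemm_reference(x, weights, group_sizes):
    # x: [sum(group_sizes), k] with rows already expert-contiguous
    # (e.g. from tokens_unzip); weights: one [k, n] matrix per expert.
    outs, start = [], 0
    for w, n in zip(weights, group_sizes):
        outs.append(paddle.matmul(x[start:start + n], w))
        start += n
    return paddle.concat(outs, axis=0)
```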
* support fusion moe
* fix
* fix
* fix
* Add fused swiglu_probs_bwd op
* add o2s as output
* fix 3d tensor input and add vectorization optimizations.
* fix tests of vec4
* Optimize reduce performance
* delete timeline
* Update setup_fp8.py fix arch
* Fix multi-dimension issue.
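A hedged reading of fused swiglu_probs_bwd: the backward pass of a probability-weighted SwiGLU, produced in one kernel. An autograd-based reference of the assumed forward (the shapes, and what the extra o2s output holds, are guesses from the commit text):

```python
import paddle
import paddle.nn.functional as F

def swiglu_probs_bwd_reference(x, probs, grad_out):
    # Assumed forward: swiglu(x) scaled by per-token routing probs.
    # Autograd recovers the gradients the fused kernel would emit.
    x = x.detach()
    x.stop_gradient = False
    probs = probs.detach()
    probs.stop_gradient = False
    gate, up = paddle.chunk(x, chunks=2, axis=-1)
    out = F.silu(gate) * up * probs.unsqueeze(-1)   # probs: [num_tokens]
    out.backward(grad_out)                          # grad_out matches out's shape
    return x.grad, probs.grad
```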
* control stack usage to prevent overflow
* fix
* Disable fast math for fused-op precision issue.
* Disable TDU fm.
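The fast-math change is a build-flag fix: nvcc's --use_fast_math substitutes approximate transcendental and division instructions, which here perturbed the fused ops enough to fail precision tests. A sketch of a Paddle custom-op build that deliberately omits the flag (module and source names are illustrative):

```python
from paddle.utils.cpp_extension import CUDAExtension, setup

setup(
    name="fused_ops",  # illustrative module name
    ext_modules=CUDAExtension(
        sources=["fused_swiglu_probs_bwd.cu"],  # illustrative source list
        # Deliberately no --use_fast_math: its approximate exp/div
        # changed results enough to fail the precision tests.
        extra_compile_args={"nvcc": ["-O3"]},
    ),
)
```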
* refine fp8 code
* fix bug
* fix bug
* refine mem
* fix
* refine mem
* add fuse pass
* add fuse config
* refine tma
* add cinn decorate
* refine
Co-authored-by: zhangyuqin <zhangyuqin@baidu.com>
Before submitting

- Add test cases to the tests folder. If there are codecov issues, please add test cases first.

PR types

PR changes

Description