[XPU] Fix PD + MTP #6495

Merged
Jiang-Jia-Jun merged 10 commits into PaddlePaddle:develop from cmcamdy:xpu_mtp_pd
Feb 27, 2026

@cmcamdy (Collaborator) commented Feb 24, 2026

Motivation

  • On P cards, PD + MTP hangs. This PR fixes that hang.

Modifications

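1. In token_processor.py, in spec mode, the check for whether a task has produced valid values must use the accept num to decide whether to exit early.
2. The speculate_update.xpu operator is aligned with the GPU version. In addition, to fit item 1, D must still update seq_lens_decoder when the stop flag is true. Otherwise, when D emits EOS in its very first round, the task stays stuck in its slot: it is neither run (execute_model) nor scheduled (schedule), and the client eventually times out.
3. Fix the seq_lens_this_time value for D's first round (when it receives the task from P). The correct value is length + 1 (P's mtp = 1). With this fix, stuttering in the output on P cards is reduced.
4. Fix the draft tokens value for D's first round (when it receives the task from P). The draft tokens must be taken from the request; otherwise pre_process reads out unexpected tokens.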
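
Modifications 3 and 4 can be pictured with a small sketch. This is not FastDeploy code: `HandoffTask`, `DecodeSlot`, `init_first_round`, and the `p_mtp_tokens` parameter are hypothetical names used only for illustration, and what `length` counts is taken as given from the description above. The sketch only shows the two first-round assignments on D: `seq_lens_this_time` becomes `length + 1` when P runs with mtp = 1, and the draft tokens are copied from the request instead of being left at whatever the slot held before.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class HandoffTask:
    """Hypothetical task that P hands to D (not a FastDeploy class)."""
    length: int                 # the `length` value named in the PR description
    draft_token_ids: List[int]  # draft tokens carried in the request


@dataclass
class DecodeSlot:
    """Hypothetical per-slot state on the D side."""
    seq_lens_this_time: int = 0
    draft_tokens: List[int] = field(default_factory=list)


def init_first_round(slot: DecodeSlot, task: HandoffTask, p_mtp_tokens: int = 1) -> DecodeSlot:
    """Set up a slot for D's first round after receiving a task from P."""
    # Assumption: the "+ 1" in the PR text comes from P's mtp = 1.
    slot.seq_lens_this_time = task.length + p_mtp_tokens
    # Take draft tokens from the request rather than reusing stale slot contents.
    slot.draft_tokens = list(task.draft_token_ids)
    return slot


# A slot that still holds stale placeholder tokens from an earlier request.
slot = DecodeSlot(draft_tokens=[-1, -1])
slot = init_first_round(slot, HandoffTask(length=1, draft_token_ids=[101, 202]))
print(slot.seq_lens_this_time, slot.draft_tokens)
```

Copying the list, rather than aliasing the request's list, keeps later in-place edits to the slot from changing the request object in this toy model.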

Usage or Command

Accuracy Tests

  • Before fix 4 (prompt: {"role": "user", "content": "你好,你是谁?"}, i.e. "Hello, who are you?"): [image]
  • After fix 4: [image]
  • 21B-A3B TP1, benchmark comparison: [image]

Checklist

  • Add at least one tag in the PR title.
    • Tag list: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot (bot) commented Feb 24, 2026

Thanks for your contribution!

@codecov-commenter commented Feb 24, 2026

Codecov Report

❌ Patch coverage is 0% with 10 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@7b1d787). Learn more about missing BASE report.

Files with missing lines | Patch % | Lines
fastdeploy/worker/gpu_model_runner.py | 0.00% | 4 Missing and 1 partial ⚠️
...tdeploy/model_executor/xpu_pre_and_post_process.py | 0.00% | 4 Missing ⚠️
fastdeploy/spec_decode/mtp.py | 0.00% | 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #6495   +/-   ##
==========================================
  Coverage           ?   70.41%           
==========================================
  Files              ?      394           
  Lines              ?    53860           
  Branches           ?     8463           
==========================================
  Hits               ?    37925           
  Misses             ?    13204           
  Partials           ?     2731           
Flag | Coverage Δ
GPU | 70.41% <0.00%> (?)

@zhupengyang (Collaborator) previously approved these changes Feb 27, 2026

LGTM

@Deleter-D (Collaborator) previously approved these changes Feb 27, 2026

LGTM

Copilot AI (Contributor) left a comment

Pull request overview

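This PR fixes a hang that occurs on XPU cards in PD (Prefill-Decode disaggregation) + MTP (Multi-Token Prediction) mode. It does so through the following changes:

Changes:

  • Fixes the task-validity check in speculative decoding mode so that it uses accept_num to decide whether to exit early
  • Aligns the XPU speculate_update operator with the GPU implementation, and updates seq_lens_decoder when the stop flag is true so that tasks do not get stuck
  • Fixes the seq_lens_this_time and draft_tokens values used when the Decoder receives a Prefill task in its first round
  • Adds several XPU operators to support task recovery and state updates for speculative decoding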
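
The first two changes can be sketched in plain Python. This is an illustration under stated assumptions, not the FastDeploy or XPU kernel code: `collect_accepted_tokens`, `update_seq_lens_decoder`, and the `update_when_stopped` switch are hypothetical, and advancing `seq_lens_decoder` by the accepted count is an assumption, since the PR only says the value must be updated when the stop flag is true.

```python
from typing import Dict, List


def collect_accepted_tokens(token_rows: List[List[int]], accept_nums: List[int]) -> Dict[int, List[int]]:
    """Return the accepted tokens per slot, skipping slots with nothing accepted."""
    accepted: Dict[int, List[int]] = {}
    for slot_id, (row, n) in enumerate(zip(token_rows, accept_nums)):
        if n <= 0:
            # Early exit for this slot: it produced no valid output this step.
            continue
        accepted[slot_id] = row[:n]
    return accepted


def update_seq_lens_decoder(
    seq_lens_decoder: List[int],
    accept_nums: List[int],
    stop_flags: List[bool],
    update_when_stopped: bool = True,
) -> List[int]:
    """Advance decoder lengths; optionally freeze slots whose stop flag is set."""
    updated = []
    for cur, n, stopped in zip(seq_lens_decoder, accept_nums, stop_flags):
        if stopped and not update_when_stopped:
            updated.append(cur)      # behaviour the PR moves away from on D
        else:
            updated.append(cur + n)  # assumed update amount: the accepted count
    return updated


rows = [[5, 6, -1], [2, -1, -1], [-1, -1, -1]]
print(collect_accepted_tokens(rows, [2, 1, 0]))
print(update_seq_lens_decoder([10, 7], [2, 1], [False, True]))
print(update_seq_lens_decoder([10, 7], [2, 1], [False, True], update_when_stopped=False))
```

In this toy model the second slot stops in the same step in which it accepts a token. With `update_when_stopped=False` its length never moves, which is the kind of stale state the PR description links to tasks being left in their slots.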

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 12 comments.

File | Description
fastdeploy/worker/xpu_model_runner.py | In PD mode, initializes seq_lens_this_time and draft_tokens on the D side; changes the skip_save_output logic; adds mask_rollback initialization
fastdeploy/worker/gpu_model_runner.py | Mirrors the XPU change: initializes seq_lens_this_time and draft_tokens in PD mode
fastdeploy/spec_decode/mtp.py | Sets the correct seq_lens_this_time_buffer value in PD mode
fastdeploy/model_executor/xpu_pre_and_post_process.py | Updates the speculate_update call and the speculate_save_output arguments; supports mask_rollback
custom_ops/xpu_ops/src/plugin/src/wrapper/mtp_wrapper/speculate_update.cpp | Adds the wrapper implementation for the speculate_update operator
custom_ops/xpu_ops/src/plugin/src/wrapper/mtp_wrapper/speculate_set_value_by_flags.cpp | Makes parameters non-const to allow write-back
custom_ops/xpu_ops/src/plugin/src/wrapper/mtp_wrapper/speculate_set_stop_value_multi_seqs.cpp | Adds a min_tokens parameter to support a minimum-token limit check
custom_ops/xpu_ops/src/plugin/src/wrapper/mtp_wrapper/recover_spec_decode_task.cpp | Adds the recover_spec_decode_task operator wrapper
custom_ops/xpu_ops/src/plugin/src/kernel/kunlun3cpp/mtp_kernel/speculate_update.xpu | Adds an XPU kernel implementing speculate_update
custom_ops/xpu_ops/src/plugin/src/kernel/kunlun3cpp/mtp_kernel/speculate_set_value_by_flags.xpu | Changes the logic to handle stop_flags correctly
custom_ops/xpu_ops/src/plugin/src/kernel/kunlun3cpp/mtp_kernel/speculate_set_stop_value_multi_seqs.xpu | Adds the min_tokens check logic
custom_ops/xpu_ops/src/plugin/src/kernel/kunlun3cpp/mtp_kernel/speculate_schedule_cache.xpu | Fixes data types and adds a memory barrier
custom_ops/xpu_ops/src/plugin/src/kernel/kunlun3cpp/mtp_kernel/recover_spec_decode_task.xpu | Adds an XPU kernel implementing task recovery
custom_ops/xpu_ops/src/plugin/include/xpu/plugin.h | Adds function declarations and changes parameter types
custom_ops/xpu_ops/src/ops/recover_decode_task.cc | Extends task recovery to cover speculative decoding
custom_ops/xpu_ops/src/ops/pybind/pybind.cc | Updates the Python bindings for the new operator signatures
custom_ops/xpu_ops/src/ops/mtp/speculate_update.cc | Adds the speculate_update operator implementation
custom_ops/xpu_ops/src/ops/mtp/speculate_set_value_by_flags.cc | Makes parameters non-const
custom_ops/xpu_ops/src/ops/mtp/speculate_set_stop_value_multi_seqs.cc | Adds the min_tokens parameter
custom_ops/xpu_ops/src/ops/mtp/speculate_save_output.cc | Extends the parameters to support skip_prefill and preempted_idx
custom_ops/gpu_ops/speculate_decoding/speculate_update.cu | Adds a TODO comment about the seq_lens_decoder update
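
The summary mentions a new min_tokens check in speculate_set_stop_value_multi_seqs but does not spell out its semantics. The sketch below assumes the common meaning of such a parameter, namely that a stop-sequence match is ignored until at least `min_tokens` tokens have been generated. `should_stop` is a hypothetical helper, not the operator's actual interface.

```python
from typing import List


def should_stop(generated: List[int], stop_seq: List[int], min_tokens: int) -> bool:
    """Return True if `generated` ends with `stop_seq` and the minimum length is met."""
    if len(generated) < min_tokens:
        # Assumed semantics: too few tokens so far, so a match does not count yet.
        return False
    if not stop_seq or len(generated) < len(stop_seq):
        return False
    return generated[-len(stop_seq):] == stop_seq


print(should_stop([1, 2, 3], [2, 3], min_tokens=0))
print(should_stop([1, 2, 3], [2, 3], min_tokens=5))
```

Checking the length first means a stop sequence that appears early is simply skipped, and generation continues until the minimum is reached.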

@Jiang-Jia-Jun Jiang-Jia-Jun merged commit 1344727 into PaddlePaddle:develop Feb 27, 2026
20 of 24 checks passed

6 participants