[XPU] Fix PD + MTP #6495
Merged
Jiang-Jia-Jun merged 10 commits into PaddlePaddle:develop on Feb 27, 2026
Codecov Report
❌ Patch coverage is …
Additional details and impacted files
@@ Coverage Diff @@
## develop #6495 +/- ##
==========================================
Coverage ? 70.41%
==========================================
Files ? 394
Lines ? 53860
Branches ? 8463
==========================================
Hits ? 37925
Misses ? 13204
Partials ? 2731
Pull request overview
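This PR fixes a hang that occurs on XPU cards in PD (Prefill-Decode disaggregation) + MTP (Multi-Token Prediction) mode. It does so through the following changes:
Changes:
- Corrects the task-validity check in speculative decoding mode so that the decision to exit early depends on the value of accept_num
- Aligns the XPU speculate_update operator with the GPU implementation, and updates seq_lens_decoder when the stop flag is true so that tasks do not get stuck
- Corrects the seq_lens_this_time and draft_tokens values on the Decoder in the first round, when it receives a task from Prefill
- Adds several XPU operators to support task recovery and state updates for speculative decoding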
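The stop-flag change can be illustrated with a toy model. The sketch below is not the XPU kernel: `Slot`, `toy_speculate_update`, and the `update_when_stopped` switch are hypothetical names, and advancing `seq_len_decoder` by `accept_num` is an assumption about what "updating" means. It only shows the difference between skipping and not skipping the bookkeeping for a slot whose stop flag is already set.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Slot:
    """Hypothetical per-request state; field names are illustrative only."""
    seq_len_decoder: int  # tokens already accounted for on the decode side
    accept_num: int       # tokens accepted in the current speculative step
    stop_flag: bool       # True once the request has finished (e.g. EOS)


def toy_speculate_update(slots: List[Slot], update_when_stopped: bool) -> None:
    """Advance each slot's decoder length by its accepted token count.

    With update_when_stopped=False, a slot that stopped in this step keeps
    its stale length, which models the situation described in the PR.
    """
    for slot in slots:
        if slot.stop_flag and not update_when_stopped:
            continue  # stale state: the length is never advanced
        slot.seq_len_decoder += slot.accept_num


# A request that hits EOS in its first decode round, next to a running one.
slots = [Slot(8, 1, True), Slot(8, 2, False)]
toy_speculate_update(slots, update_when_stopped=True)
print([s.seq_len_decoder for s in slots])  # [9, 10]
```

With `update_when_stopped=False` the first slot would stay at 8, so any logic that waits for its length to change would see no progress.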
Reviewed changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| fastdeploy/worker/xpu_model_runner.py | Initializes seq_lens_this_time and draft_tokens for the D side in PD mode; changes the skip_save_output logic; adds mask_rollback initialization |
| fastdeploy/worker/gpu_model_runner.py | Mirrors the XPU change: initializes seq_lens_this_time and draft_tokens in PD mode |
| fastdeploy/spec_decode/mtp.py | Sets the correct seq_lens_this_time_buffer value in PD mode |
| fastdeploy/model_executor/xpu_pre_and_post_process.py | Updates the speculate_update call and the speculate_save_output arguments; supports mask_rollback |
| custom_ops/xpu_ops/src/plugin/src/wrapper/mtp_wrapper/speculate_update.cpp | Adds the wrapper implementation for the speculate_update operator |
| custom_ops/xpu_ops/src/plugin/src/wrapper/mtp_wrapper/speculate_set_value_by_flags.cpp | Makes parameters non-const to allow write-back |
| custom_ops/xpu_ops/src/plugin/src/wrapper/mtp_wrapper/speculate_set_stop_value_multi_seqs.cpp | Adds a min_tokens parameter to support the minimum-token check |
| custom_ops/xpu_ops/src/plugin/src/wrapper/mtp_wrapper/recover_spec_decode_task.cpp | Adds the wrapper for the recover_spec_decode_task operator |
| custom_ops/xpu_ops/src/plugin/src/kernel/kunlun3cpp/mtp_kernel/speculate_update.xpu | Adds the XPU kernel implementing speculate_update |
| custom_ops/xpu_ops/src/plugin/src/kernel/kunlun3cpp/mtp_kernel/speculate_set_value_by_flags.xpu | Changes the logic so that stop_flags are handled correctly |
| custom_ops/xpu_ops/src/plugin/src/kernel/kunlun3cpp/mtp_kernel/speculate_set_stop_value_multi_seqs.xpu | Adds the min_tokens check |
| custom_ops/xpu_ops/src/plugin/src/kernel/kunlun3cpp/mtp_kernel/speculate_schedule_cache.xpu | Fixes data types and adds a memory barrier |
| custom_ops/xpu_ops/src/plugin/src/kernel/kunlun3cpp/mtp_kernel/recover_spec_decode_task.xpu | Adds the XPU kernel implementing task recovery |
| custom_ops/xpu_ops/src/plugin/include/xpu/plugin.h | Adds function declarations and changes parameter types |
| custom_ops/xpu_ops/src/ops/recover_decode_task.cc | Extends task recovery to speculative decoding |
| custom_ops/xpu_ops/src/ops/pybind/pybind.cc | Updates the Python bindings for the new operator signatures |
| custom_ops/xpu_ops/src/ops/mtp/speculate_update.cc | Adds the speculate_update operator implementation |
| custom_ops/xpu_ops/src/ops/mtp/speculate_set_value_by_flags.cc | Makes parameters non-const |
| custom_ops/xpu_ops/src/ops/mtp/speculate_set_stop_value_multi_seqs.cc | Adds the min_tokens parameter |
| custom_ops/xpu_ops/src/ops/mtp/speculate_save_output.cc | Extends the parameters to support skip_prefill and preempted_idx |
| custom_ops/gpu_ops/speculate_decoding/speculate_update.cu | Adds a TODO comment about updating seq_lens_decoder |
Jiang-Jia-Jun approved these changes on Feb 27, 2026
Motivation
Modifications
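1. In token_processor.py, the spec-mode check for whether a task carries valid values must use the value of accept_num to decide whether to exit early.
2. The speculate_update.xpu operator is aligned with the GPU implementation. In addition, to work with change 1, D still needs to update seq_lens_decoder when the stop flag is true. Otherwise, when D emits EOS in its very first round, the task stays in its slot indefinitely: it is neither run (execute_model) nor scheduled (schedule), and the client eventually times out.
3. Corrects the seq_lens_this_time value on D in the first round (when it receives the task from P). The correct value is length + 1 (with mtp = 1 on P). With this fix, some of the stuttering seen on the P card goes away.
4. Corrects the draft tokens value on D in the first round (when it receives the task from P). It must be taken from the request; otherwise pre_process picks up garbage tokens.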
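Items 3 and 4 can be pictured with a small sketch. It is written under stated assumptions and is not the code in xpu_model_runner.py: `init_decode_slot` is a hypothetical helper, `length` is taken to mean the number of draft tokens handed over from P, and placing the first generated token in position 0 of the draft row is likewise an assumption.

```python
from typing import List, Sequence, Tuple


def init_decode_slot(
    first_token: int,
    request_draft_tokens: Sequence[int],
    max_draft_tokens: int,
    pad_id: int = -1,
) -> Tuple[int, List[int]]:
    """Build first-round decode-side state for one request (illustrative only).

    Returns (seq_len_this_time, draft_row). The draft tokens are copied from
    the request rather than left at whatever the buffer held before.
    """
    length = len(request_draft_tokens)  # assumption: 1 when P runs MTP with one step
    if length + 1 > max_draft_tokens:
        raise ValueError("draft row too small for the tokens in the request")

    draft_row = [pad_id] * max_draft_tokens
    draft_row[0] = first_token                              # assumed layout
    draft_row[1:length + 1] = list(request_draft_tokens)    # taken from the request

    seq_len_this_time = length + 1
    return seq_len_this_time, draft_row


# One draft token from P: the first decode step covers two positions.
print(init_decode_slot(first_token=11, request_draft_tokens=[42], max_draft_tokens=4))
# (2, [11, 42, -1, -1])
```

If the draft row were left uninitialized instead, the positions after the first token would hold stale values, which matches the "garbage tokens" symptom described in item 4.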
Usage or Command
Accuracy Tests
Checklist
Tag list: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
Run pre-commit before commit.
If the PR targets the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.