Integrate DeepSeek Sparse Attention with Tokamax Flash Attention #3087

copybara-service[bot] merged 1 commit into main from
Conversation
🤖 Hi @RissyRan, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.
📋 Review Summary
This pull request successfully integrates DeepSeek Sparse Attention with Tokamax Flash Attention, which provides a notable performance improvement. The changes are well-structured, and the logic for handling the dynamic sparse mask within the flash attention kernel appears correct.
🔍 General Feedback
- The unit tests have been significantly improved to cover both the dot_product and flash attention paths, ensuring parity between the two implementations. The parameterization of the tests is a good addition.
- The movement of the index_mask reshaping logic into the apply_attention_dot function is a clean refactoring.
- The updates to the configuration validation and compile tests are thorough and ensure the new attention mechanism is properly supported.
Overall, this is a high-quality contribution that enhances the performance of sparse attention. The suggestions provided are minor and aimed at improving code clarity and test readability.
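As an aside on the mechanism under review: the sketch below shows how per-query top-k indices can be densified into a 3D boolean mask of the kind a sparse flash-attention kernel could consume. The shapes and helper name are illustrative only, not the MaxText/Tokamax API.

```python
import numpy as np

def topk_indices_to_mask(indices: np.ndarray, kv_len: int) -> np.ndarray:
  """Densify per-query top-k indices into a [batch, q_len, kv_len] bool mask."""
  batch, q_len, _ = indices.shape
  mask = np.zeros((batch, q_len, kv_len), dtype=bool)
  b = np.arange(batch)[:, None, None]  # broadcasts over (batch, q_len, top_k)
  q = np.arange(q_len)[None, :, None]
  mask[b, q, indices] = True
  return mask

# batch=1, q_len=2, top_k=2: each query attends to 2 selected keys out of 4.
idx = np.array([[[0, 2], [1, 3]]])
mask = topk_indices_to_mask(idx, kv_len=4)
assert mask.shape == (1, 2, 4)
assert int(mask.sum()) == 4  # exactly top_k entries set per query row
```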
RissyRan left a comment
@gemini-cli /review
🤖 Hi @RissyRan, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.
📋 Review Summary
This pull request successfully integrates DeepSeek Sparse Attention with Tokamax Flash Attention, which is a valuable performance enhancement. The implementation is solid, and the added tests are comprehensive, ensuring the correctness of the new attention path.
🔍 General Feedback
- The changes are well-structured and easy to follow.
- The extension of the test suite to cover both dot_product and flash attention with various configurations is a great addition and significantly improves the robustness of the implementation.
- The PR description is clear and provides good context, including performance numbers.
Thank you for the tokamax flash integration! Excited to see throughput improving! Overall looks good.
I observed that both dot_product and flash attention require normalization to pass at longer sequence lengths, e.g., seq=128.
Yes, normalization is needed with larger seq. In the test, the torch params are manually initialized with N(0, 1); without init they are all zeros.
maxtext/tests/unit/deepseek32_vs_reference_test.py
Lines 708 to 715 in 95ef3e1
Because of std=1, the logits are much larger than what we usually have, e.g., 2000 (log). I also see that normalization is used for the gpt-oss attention tests.
maxtext/tests/unit/gpt_vs_reference_test.py
Line 445 in 95ef3e1
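A quick way to see why N(0, 1) parameters blow up the logits, and why normalizing queries and keys tames them. This is a toy sketch, not the actual test setup:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128  # head dimension, illustrative

# With N(0, 1) entries, the dot product q . k has std ~ sqrt(d), so raw
# logits are far larger than with properly scaled weights.
q = rng.standard_normal(d)
k = rng.standard_normal(d)
raw_logit = q @ k

# Normalizing q and k bounds the logit: |q_hat . k_hat| <= 1.
qn = q / np.linalg.norm(q)
kn = k / np.linalg.norm(k)
norm_logit = qn @ kn
assert abs(norm_logit) <= 1.0
```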
I performed a manual diff of the final three entries of the last batch and did observe 1-2 outliers with index_topk=4 between the dot_product and flash_attention implementations.
(Optional) Do you think we can add a test that directly compares flash against dot_product, like what we have in attention_test.py?
maxtext/tests/unit/attention_test.py
Lines 478 to 479 in 95ef3e1
either inside deepseek32_vs_reference_test.py (e.g., this can pass) or attention_test.
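Such a parity check could look roughly like this. The helper name and tolerances are hypothetical; the real test would run both attention paths on the same inputs:

```python
import numpy as np

def count_outliers(out_ref, out_test, rtol=1e-2, atol=1e-2):
  """Count elements where the two attention outputs disagree beyond tolerance."""
  close = np.isclose(out_ref, out_test, rtol=rtol, atol=atol)
  return int((~close).sum())

# Stand-ins for dot_product vs. flash_attention outputs on the same inputs.
out_dot = np.ones((2, 4, 8))
out_flash = out_dot.copy()
out_flash[0, 0, 0] += 0.5  # inject one synthetic outlier
assert count_outliers(out_dot, out_dot) == 0
assert count_outliers(out_dot, out_flash) == 1
```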
Thanks for giving that a shot, Shuning! If this normalization holds up, I'd actually prefer to compare it directly against the reference implementation. Since the codebase is evolving so fast, there's a risk of regressions in dot_product as well. Comparing against the reference feels like the safest bet; what do you think?
gagika left a comment
Thanks! One comment, which can be a follow-up fix as well.
Description
Integrate DSA with Tokamax Flash Attention
make_dynamic_splash_mha to support 3D dynamic index masking within the Flash Attention

Tests

python3 -m unittest tests.unit.deepseek32_vs_reference_test - logs: link

Checklist
Before submitting this PR, please make sure (put X in square brackets):
gemini-review label.