
@Abdennacer-Badaoui
Contributor

Description:

Summary

  • Add a kQuantizeBlockwise64 kernel that supports blocksize=64 with 4-bit quantization (FP4/NF4) on both warp32 (RDNA) and warp64 (CDNA) hardware.
  • Previously, blocksize=64 for 4-bit quantization was only supported on consumer RDNA GPUs (warp size 32). Data-center CDNA GPUs (MI300, MI325) could not use it because the existing kernel requires threads == blocksize/2 = 32, which underutilizes the 64-wide wavefront.
  • The new kernel processes 2 quantization blocks of 64 values per thread block using 64 threads, with logical warps of 32 (WarpReduce<float, 32>) performing an independent reduction per quantization block.

Quick comparison

Test configuration:

Device: AMD Instinct MI325X VF
PyTorch: 2.8.0+rocm7.1.0.git7a520360
HIP: 7.1.25424-4179531dcd

FP4 Quantization Error (Mean Absolute Error)

| Shape | Blocksize=128 | Blocksize=64 | Error Reduction |
|---|---|---|---|
| 1K x 1K | 0.102941 | 0.096551 | +6.2% |
| 2K x 2K | 0.102949 | 0.096549 | +6.2% |
| 4K x 4K | 0.102950 | 0.096545 | +6.2% |
| 8K x 4K | 0.102948 | 0.096545 | +6.2% |
| 4K x 11K (LLaMA FFN) | 0.102948 | 0.096545 | +6.2% |
| 4K x 14K (LLaMA2 FFN) | 0.102946 | 0.096545 | +6.2% |

NF4 Quantization Error (Mean Absolute Error)

| Shape | Blocksize=128 | Blocksize=64 | Error Reduction |
|---|---|---|---|
| 1K x 1K | 0.076826 | 0.072796 | +5.2% |
| 2K x 2K | 0.076834 | 0.072794 | +5.3% |
| 4K x 4K | 0.076836 | 0.072794 | +5.3% |
| 8K x 4K | 0.076836 | 0.072796 | +5.3% |
| 4K x 11K (LLaMA FFN) | 0.076835 | 0.072796 | +5.3% |
| 4K x 14K (LLaMA2 FFN) | 0.076835 | 0.072796 | +5.3% |

@github-actions

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
