
Add CUDA kernel support for 4-bit quantization with blocksize=32#1854

Merged
matthewdouglas merged 7 commits into bitsandbytes-foundation:main from Abdennacer-Badaoui:32-blocksize-support on Feb 12, 2026

Conversation

@Abdennacer-Badaoui (Contributor) commented on Feb 4, 2026

Description

Implements a specialized CUDA kernel to support blocksize=32 for 4-bit quantization (FP4/NF4), addressing the feature request in #986.
Smaller block sizes improve quantization accuracy by computing a separate scaling factor for each smaller group of values, reducing quantization error at the cost of slightly higher metadata overhead.
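
To make that tradeoff concrete, here is a minimal PyTorch sketch (my own toy illustration, not the PR's CUDA code, and using a uniform grid instead of the FP4/NF4 codebooks) of how a smaller blocksize gives each group of values its own absmax scale; the helper name is hypothetical:

```python
import torch

def blockwise_absmax_roundtrip(x: torch.Tensor, blocksize: int) -> torch.Tensor:
    """Toy illustration: scale each block by its own absmax, quantize to a
    coarse uniform grid, then dequantize. Not the FP4/NF4 codebook."""
    flat = x.flatten()
    pad = (-flat.numel()) % blocksize
    flat = torch.nn.functional.pad(flat, (0, pad))
    blocks = flat.view(-1, blocksize)
    absmax = blocks.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
    levels = 7.0  # coarse 4-bit-ish grid, just to show the effect of blocksize
    q = torch.round(blocks / absmax * levels) / levels * absmax
    return q.flatten()[: x.numel()].view_as(x)

x = torch.randn(1024, 1024, dtype=torch.float16).float()
for bs in (64, 32):
    err = (blockwise_absmax_roundtrip(x, bs) - x).abs().mean()
    print(f"blocksize={bs}: mean abs error {err:.6f}")
```

With blocksize=32 the scale tracks local magnitudes more closely, at the cost of storing twice as many absmax values.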

Key Changes

New quantization kernel (kQuantizeBlockwise32):

  • Optimized for blocksize=32, processes 2 blocks per warp (32 threads)
  • Threads 0-15 handle block 0, threads 16-31 handle block 1
  • Each block computes an independent scale factor for finer granularity

Dequantization: Reuses the existing generic kernel with the proper dual-scale lookup

Testing: Extended the test suites in tests/test_functional.py, tests/test_linear4bit.py, and tests/test_ops.py
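
Assuming the new blocksize is exposed through the existing functional API in the same way as the other block sizes (argument and attribute names below follow my reading of bitsandbytes.functional and should be double-checked against the installed version), usage would look roughly like this:

```python
import torch
import bitsandbytes.functional as F

x = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")

# Quantize with the newly supported blocksize=32 (NF4 shown; "fp4" works the same way).
packed, quant_state = F.quantize_4bit(x, blocksize=32, quant_type="nf4")

# One absmax scale per 32-value block, i.e. twice the metadata of blocksize=64.
print(quant_state.absmax.numel(), x.numel() // 32)

# Round-trip and inspect the quantization error.
x_hat = F.dequantize_4bit(packed, quant_state)
print((x_hat - x).abs().mean().item())
```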

Quick comparison

Test configuration: torch.float16, CUDA, averaged over 1000 runs per shape

FP4 Quantization Error Comparison

| Shape | Blocksize=64 | Blocksize=32 | Improvement |
|---|---|---|---|
| 1K×1K | 0.096540 | 0.088918 | +7.9% |
| 2K×2K | 0.096548 | 0.088919 | +7.9% |
| 4K×4K | 0.096545 | 0.088919 | +7.9% |
| 8K×4K | 0.096545 | 0.088919 | +7.9% |
| 1K×768 (LLaMA-like) | 0.096547 | 0.088918 | +7.9% |
| 4K×11K (LLaMA FFN) | 0.096546 | 0.088920 | +7.9% |

NF4 Quantization Error Comparison

| Shape | Blocksize=64 | Blocksize=32 | Improvement |
|---|---|---|---|
| 1K×1K | 0.072798 | 0.067750 | +6.9% |
| 2K×2K | 0.072795 | 0.067748 | +6.9% |
| 4K×4K | 0.072795 | 0.067747 | +6.9% |
| 8K×4K | 0.072795 | 0.067748 | +6.9% |
| 1K×768 (LLaMA-like) | 0.072793 | 0.067749 | +6.9% |
| 4K×11K (LLaMA FFN) | 0.072795 | 0.067748 | +6.9% |
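
The description does not include the exact benchmarking script, but a comparison like the one above could plausibly be reproduced with a loop of this shape (the metric is assumed to be mean absolute round-trip error, averaged over repeated random fp16 tensors as stated in the test configuration; adjust to whatever the actual harness uses):

```python
import torch
import bitsandbytes.functional as F

def mean_quant_error(shape, blocksize, quant_type="nf4", runs=1000):
    # Average absolute round-trip error over `runs` random fp16 tensors.
    total = 0.0
    for _ in range(runs):
        x = torch.randn(*shape, dtype=torch.float16, device="cuda")
        packed, state = F.quantize_4bit(x, blocksize=blocksize, quant_type=quant_type)
        total += (F.dequantize_4bit(packed, state) - x).abs().mean().item()
    return total / runs

for bs in (64, 32):
    print(f"blocksize={bs}: {mean_quant_error((1024, 1024), bs, runs=50):.6f}")
```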

@Abdennacer-Badaoui (Contributor, Author)

@matthewdouglas for review :)

@matthewdouglas added the CUDA label (Issues and PRs related to the CUDA backend, excluding installation/support help) on Feb 12, 2026
@github-actions

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@matthewdouglas (Member)

Looks good! Thanks for this!

matthewdouglas merged commit 7b6c76f into bitsandbytes-foundation:main on Feb 12, 2026
85 checks passed