
Add CUDA kernel support for 4-bit quantization with blocksize=32#1854

Merged
matthewdouglas merged 7 commits into bitsandbytes-foundation:main from Abdennacer-Badaoui:32-blocksize-support on Feb 12, 2026

Conversation

@Abdennacer-Badaoui (Contributor) commented on Feb 4, 2026

Description

Implements a specialized CUDA kernel to support blocksize=32 for 4-bit quantization (FP4/NF4), addressing the feature request in #986.
Smaller block sizes improve quantization accuracy by computing a separate scaling factor for each smaller group of values, reducing quantization error at the cost of slightly higher metadata overhead.
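
To make that tradeoff concrete, here is a minimal PyTorch sketch (my own toy illustration, not the PR's CUDA code, and using a uniform grid instead of the FP4/NF4 codebooks) of how a smaller blocksize gives each group of values its own absmax scale; the helper name is hypothetical:

```python
import torch

def blockwise_absmax_roundtrip(x: torch.Tensor, blocksize: int) -> torch.Tensor:
    """Toy illustration: scale each block by its own absmax, quantize to a
    coarse uniform grid, then dequantize. Not the FP4/NF4 codebook."""
    flat = x.flatten()
    pad = (-flat.numel()) % blocksize
    flat = torch.nn.functional.pad(flat, (0, pad))
    blocks = flat.view(-1, blocksize)
    absmax = blocks.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
    levels = 7.0  # coarse 4-bit-ish grid, just to show the effect of blocksize
    q = torch.round(blocks / absmax * levels) / levels * absmax
    return q.flatten()[: x.numel()].view_as(x)

x = torch.randn(1024, 1024, dtype=torch.float16).float()
for bs in (64, 32):
    err = (blockwise_absmax_roundtrip(x, bs) - x).abs().mean()
    print(f"blocksize={bs}: mean abs error {err:.6f}")
```

With blocksize=32 the scale tracks local magnitudes more closely, at the cost of storing twice as many absmax values.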

Key Changes

New quantization kernel (kQuantizeBlockwise32):

  • Optimized for blocksize=32, processes 2 blocks per warp (32 threads)
  • Threads 0-15 handle block 0, threads 16-31 handle block 1
  • Each block computes an independent scale factor for finer granularity

Dequantization: Reuses the existing generic kernel with the proper dual-scale lookup

Testing: Extended the test suites in tests/test_functional.py, tests/test_linear4bit.py, and tests/test_ops.py
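
Assuming the new blocksize is exposed through the existing functional API in the same way as the other block sizes (argument and attribute names below follow my reading of bitsandbytes.functional and should be double-checked against the installed version), usage would look roughly like this:

```python
import torch
import bitsandbytes.functional as F

x = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")

# Quantize with the newly supported blocksize=32 (NF4 shown; "fp4" works the same way).
packed, quant_state = F.quantize_4bit(x, blocksize=32, quant_type="nf4")

# One absmax scale per 32-value block, i.e. twice the metadata of blocksize=64.
print(quant_state.absmax.numel(), x.numel() // 32)

# Round-trip and inspect the quantization error.
x_hat = F.dequantize_4bit(packed, quant_state)
print((x_hat - x).abs().mean().item())
```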

Quick comparison

Test configuration: torch.float16, CUDA, averaged over 1000 runs per shape

FP4 Quantization Error Comparison

| Shape | Blocksize=64 | Blocksize=32 | Improvement |
|---|---|---|---|
| 1K×1K | 0.096540 | 0.088918 | +7.9% |
| 2K×2K | 0.096548 | 0.088919 | +7.9% |
| 4K×4K | 0.096545 | 0.088919 | +7.9% |
| 8K×4K | 0.096545 | 0.088919 | +7.9% |
| 1K×768 (LLaMA-like) | 0.096547 | 0.088918 | +7.9% |
| 4K×11K (LLaMA FFN) | 0.096546 | 0.088920 | +7.9% |

NF4 Quantization Error Comparison

| Shape | Blocksize=64 | Blocksize=32 | Improvement |
|---|---|---|---|
| 1K×1K | 0.072798 | 0.067750 | +6.9% |
| 2K×2K | 0.072795 | 0.067748 | +6.9% |
| 4K×4K | 0.072795 | 0.067747 | +6.9% |
| 8K×4K | 0.072795 | 0.067748 | +6.9% |
| 1K×768 (LLaMA-like) | 0.072793 | 0.067749 | +6.9% |
| 4K×11K (LLaMA FFN) | 0.072795 | 0.067748 | +6.9% |
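
The description does not include the exact benchmarking script, but a comparison like the one above could plausibly be reproduced with a loop of this shape (the metric is assumed to be mean absolute round-trip error, averaged over repeated random fp16 tensors as stated in the test configuration; adjust to whatever the actual harness uses):

```python
import torch
import bitsandbytes.functional as F

def mean_quant_error(shape, blocksize, quant_type="nf4", runs=1000):
    # Average absolute round-trip error over `runs` random fp16 tensors.
    total = 0.0
    for _ in range(runs):
        x = torch.randn(*shape, dtype=torch.float16, device="cuda")
        packed, state = F.quantize_4bit(x, blocksize=blocksize, quant_type=quant_type)
        total += (F.dequantize_4bit(packed, state) - x).abs().mean().item()
    return total / runs

for bs in (64, 32):
    print(f"blocksize={bs}: {mean_quant_error((1024, 1024), bs, runs=50):.6f}")
```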

@Abdennacer-Badaoui (Contributor, Author)

@matthewdouglas for review :)

@matthewdouglas added the CUDA label (Issues and PRs related to the CUDA backend, excluding installation/support help) on Feb 12, 2026
@github-actions

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@matthewdouglas (Member)

Looks good! Thanks for this!

matthewdouglas merged commit 7b6c76f into bitsandbytes-foundation:main on Feb 12, 2026
85 checks passed