# Enable int4 quantization for parakeet (#17061)
Merged

larryliu0820 merged 1 commit into `main` on Feb 3, 2026.

## Conversation
**Dr. CI**: artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17061. As of commit 89808f9 with merge base 9d257c8: 4 new failures, 6 pending, 1 unrelated failure. One broken-trunk job also failed on the merge base; rebasing onto the `viable/strict` branch avoids it.
**mergennachin** (Contributor) requested changes on Jan 30, 2026:

> Also please report size reductions
**mergennachin** approved these changes on Feb 2, 2026.
## Summary

- Add int4/int8 quantization support for Parakeet TDT model export using torchao
- Add storage_offset support in the CUDA AOTI shims to enable quantized weight tensor views
- Extract quantization utilities to a separate module for reusability
## Changes

### Quantization Support for Parakeet

Added support for quantizing the encoder and decoder components with multiple configurations (a rough mapping onto torchao is sketched after this list):

- **Linear layers**: `4w`, `8w`, `8da4w`, `8da8w` quantization configs
- **Embedding layers**: `4w`, `8w` quantization configs
- **Packing formats**: `tile_packed_to_4d` for optimized inference on CUDA
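A non-authoritative sketch of how these config strings might map onto torchao's `quantize_` API; the helper below and the exact config class names are assumptions based on recent torchao releases, not the PR's actual code:

```python
import torch.nn as nn
from torchao.quantization import (
    quantize_,
    Int4WeightOnlyConfig,
    Int8WeightOnlyConfig,
    Int8DynamicActivationInt4WeightConfig,
    Int8DynamicActivationInt8WeightConfig,
)

def quantize_linears(module: nn.Module, qlinear: str, group_size: int = 32) -> None:
    """Hypothetical helper: map a CLI-style config string to a torchao config."""
    configs = {
        "4w": Int4WeightOnlyConfig(group_size=group_size),
        "8w": Int8WeightOnlyConfig(),
        "8da4w": Int8DynamicActivationInt4WeightConfig(group_size=group_size),
        "8da8w": Int8DynamicActivationInt8WeightConfig(),
    }
    # quantize_ mutates the module in place, swapping nn.Linear weights
    # for quantized tensor subclasses.
    quantize_(module, configs[qlinear])
```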
### New CLI Arguments

| Argument | Description |
|----------|-------------|
| `--qlinear_encoder` | Quantization config for encoder linear layers |
| `--qlinear_encoder_group_size` | Group size for encoder quantization (default: 32) |
| `--qlinear_encoder_packing_format` | Packing format for encoder |
| `--qlinear` | Quantization config for decoder linear layers |
| `--qlinear_group_size` | Group size for decoder quantization (default: 32) |
| `--qlinear_packing_format` | Packing format for decoder |
| `--qembedding` | Quantization config for embedding layer |
| `--qembedding_group_size` | Group size for embedding quantization |
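For illustration only, a hypothetical argparse declaration of these flags; the names come from the table above, while the choices and any defaults not stated there are assumptions:

```python
import argparse

parser = argparse.ArgumentParser()
# Encoder linear quantization
parser.add_argument("--qlinear_encoder", choices=["4w", "8w", "8da4w", "8da8w"])
parser.add_argument("--qlinear_encoder_group_size", type=int, default=32)
parser.add_argument("--qlinear_encoder_packing_format")  # e.g. tile_packed_to_4d
# Decoder linear quantization
parser.add_argument("--qlinear", choices=["4w", "8w", "8da4w", "8da8w"])
parser.add_argument("--qlinear_group_size", type=int, default=32)
parser.add_argument("--qlinear_packing_format")  # e.g. tile_packed_to_4d
# Embedding quantization
parser.add_argument("--qembedding", choices=["4w", "8w"])
parser.add_argument("--qembedding_group_size", type=int)  # default not stated above
args = parser.parse_args()
```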
### CUDA AOTI Shim Changes

Modified `aoti_torch__reinterpret_tensor` in `backends/cuda/runtime/shims/memory.cpp` to support non-zero storage offsets, which is required for int4 quantized weight tensors:

- **Removed** the `validate_storage_offset` check that rejected non-zero offsets
- **Added** logic to compute the adjusted data pointer: `base_ptr + storage_offset * element_size`
- **Updated** memory tracking to use `base_data_ptr` for reference counting
- **Added** tracking for the offset `data_ptr` as `NOT_OWN` to enable proper tensor deletion

This enables the CUDA backend to handle tensor views created by torchao's int4 quantization, which uses the `_convert_weight_to_int4pack` and `_weight_int4pack_mm` operations that produce tensors with non-zero storage offsets.
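To illustrate the pointer arithmetic in plain PyTorch (this is not the shim code itself): a view that shares storage with its base tensor reports a non-zero `storage_offset()`, and its data pointer is the base pointer plus `storage_offset * element_size`:

```python
import torch

# A sliced view shares storage with its base tensor but starts partway in.
base = torch.arange(16, dtype=torch.float32)
view = base[4:]

assert view.storage_offset() == 4
# This is the adjustment the shim now performs for quantized weight views:
#   data_ptr = base_ptr + storage_offset * element_size
assert view.data_ptr() == base.data_ptr() + view.storage_offset() * view.element_size()
```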
### Code Organization

- Extracted the quantization logic into a `quantize_model_()` function in `examples/models/parakeet/quantize.py` (mirroring optimum-executorch naming)
- The model is moved to CUDA after preprocessor export when `--backend cuda` is specified
- Example inputs are created on the correct device to match the model parameters (see the sketch after this list)
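A minimal sketch of the device-placement step; the names `model`, `example_inputs`, and `backend` are placeholders, not the export script's actual identifiers:

```python
import torch

def move_for_export(model: torch.nn.Module, example_inputs: tuple, backend: str):
    # Move the model to CUDA only after the preprocessor has been exported.
    device = "cuda" if backend == "cuda" else "cpu"
    model = model.to(device)
    # Build example inputs on the same device so tracing sees consistent placement.
    example_inputs = tuple(
        t.to(device) if isinstance(t, torch.Tensor) else t for t in example_inputs
    )
    return model, example_inputs
```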
## Example Usage

**Int4 Linear with Tile Packing**

```bash
python examples/models/parakeet/export_parakeet_tdt.py \
  --backend cuda \
  --dtype bf16 \
  --qlinear_encoder 4w \
  --qlinear_encoder_packing_format tile_packed_to_4d \
  --qlinear 4w \
  --qlinear_packing_format tile_packed_to_4d \
  --output-dir ./parakeet_int4
```
## Test Plan

- [x] Export with CUDA backend and int4 quantization completes successfully
- [x] Model runs through encoder with storage_offset tensors
- [x] Verify full transcription accuracy matches eager mode
- [x] Verify model size reduction with quantization
Co-authored-by: Cursor <cursoragent@cursor.com>
**Int4 Linear + Int8 Embedding** (the configuration reported under Model Sizes below):

```bash
python examples/models/parakeet/export_parakeet_tdt.py \
  --backend cuda \
  --dtype bf16 \
  --qlinear_encoder 4w \
  --qlinear_encoder_packing_format tile_packed_to_4d \
  --qlinear 4w \
  --qlinear_packing_format tile_packed_to_4d \
  --qembedding 8w \
  --output-dir ./parakeet_int4_emb8
```
## Model Sizes

- bfloat16 (baseline)
- Encoder and decoder int4 group-wise quantized (group size 32)
- Int4 linear + int8 embedding
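As a hedged sketch of how the size reductions could be checked, one can compare the exported program files on disk; the `.pte` file names below are placeholder assumptions, not the export script's actual output names:

```python
import os

# Hypothetical output files from the three configurations listed above.
candidates = {
    "bfloat16 (baseline)": "parakeet_bf16.pte",
    "int4 encoder/decoder (group size 32)": "parakeet_int4.pte",
    "int4 linear + int8 embedding": "parakeet_int4_emb8.pte",
}

for label, path in candidates.items():
    if os.path.exists(path):
        print(f"{label}: {os.path.getsize(path) / 1e6:.1f} MB")
```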