Enable int4 quantization for parakeet #17061

Merged
larryliu0820 merged 1 commit into main from parakeet_int4 on Feb 3, 2026

Conversation

@larryliu0820 (Contributor) commented on Jan 30, 2026

Summary

  • Add int4/int8 quantization support for Parakeet TDT model export using torchao
  • Extract quantization utilities to a separate module for reusability

Changes

Quantization Support for Parakeet

Added support for quantizing the encoder and decoder components with multiple configurations (see the torchao sketch after this list):

  • Linear layers: `4w`, `8w`, `8da4w`, `8da8w` quantization configs
  • Embedding layers: `4w`, `8w` quantization configs
  • Packing formats: `tile_packed_to_4d` for optimized inference on CUDA
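As a rough illustration, here is a minimal sketch of what a `4w` linear config with group size 32 could translate to under torchao's `quantize_` API. This is an assumption for illustration, not the export script's actual wiring, and the `tile_packed_to_4d` packing option is omitted:

```python
import torch
from torchao.quantization import Int4WeightOnlyConfig, quantize_

# Minimal sketch, assuming torchao's quantize_ API; illustrative only,
# not the actual export script code. "4w" means int4 weight-only
# quantization; group_size=32 matches the script's default.
model = torch.nn.Sequential(torch.nn.Linear(256, 256)).to(torch.bfloat16).to("cuda")
quantize_(model, Int4WeightOnlyConfig(group_size=32))
```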

New CLI Arguments

| Argument | Description |
|----------|-------------|
| `--qlinear_encoder` | Quantization config for encoder linear layers |
| `--qlinear_encoder_group_size` | Group size for encoder quantization (default: 32) |
| `--qlinear_encoder_packing_format` | Packing format for encoder |
| `--qlinear` | Quantization config for decoder linear layers |
| `--qlinear_group_size` | Group size for decoder quantization (default: 32) |
| `--qlinear_packing_format` | Packing format for decoder |
| `--qembedding` | Quantization config for embedding layer |
| `--qembedding_group_size` | Group size for embedding quantization |

Code Organization

  • Extracted quantize_model_() function to examples/models/parakeet/quantize.py (mirrors optimum-executorch naming)
  • Model moved to CUDA after preprocessor export when --backend cuda is specified
  • Example inputs created on the correct device to match model parameters (see the sketch below)
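A minimal sketch of that device-placement flow, using a placeholder module rather than the actual Parakeet model:

```python
import torch

# Placeholder standing in for the Parakeet TDT model; the real script
# loads the model and exports the preprocessor on CPU first.
model = torch.nn.Linear(80, 128)

backend = "cuda"
if backend == "cuda" and torch.cuda.is_available():
    # Move to CUDA only after the preprocessor export, as described above.
    model = model.to("cuda")

# Create example inputs on the same device as the model parameters so
# export tracing sees consistent devices.
device = next(model.parameters()).device
example_inputs = (torch.randn(2, 80, device=device),)
```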

Example Usage

Int4 Linear with Tile Packing

```bash
python examples/models/parakeet/export_parakeet_tdt.py \
    --backend cuda \
    --dtype bf16 \
    --qlinear_encoder 4w \
    --qlinear_encoder_packing_format tile_packed_to_4d \
    --qlinear 4w \
    --qlinear_packing_format tile_packed_to_4d \
    --output-dir ./parakeet_int4
```

Int4 Linear + Int8 Embedding

```bash
python examples/models/parakeet/export_parakeet_tdt.py \
    --backend cuda \
    --dtype bf16 \
    --qlinear_encoder 4w \
    --qlinear_encoder_packing_format tile_packed_to_4d \
    --qlinear 4w \
    --qlinear_packing_format tile_packed_to_4d \
    --qembedding 8w \
    --output-dir ./parakeet_int4_emb8
```

Test Plan

  • Export with CUDA backend and int4 quantization completes successfully
  • Model runs through the encoder with storage_offset tensors
  • Full transcription accuracy matches eager mode
  • Model size is reduced by quantization as expected

Model Sizes

bfloat16 (baseline)

```
-rw-r--r--. 1 dev dev 1274592908 Jan 23 07:01 aoti_cuda_blob.ptd
-rw-r--r--. 1 dev dev    4497576 Jan 23 07:01 model.pte
```

Encoder and decoder int4 groupwise (group size 32) quantized

```
-rw-r--r--. 1 dev dev 542162572 Jan 30 07:03 aoti_cuda_blob.ptd
-rw-r--r--. 1 dev dev   4125496 Jan 30 07:03 model.pte
```

Int4 linear + Int8 embedding

```
-rw-r--r--.  1 dev dev 536943756 Feb  2 18:21 aoti_cuda_blob.ptd
-rw-r--r--.  1 dev dev   4134072 Feb  2 18:21 model.pte
```
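In other words, int4 groupwise quantization shrinks the CUDA blob from ~1.27 GB to ~542 MB, roughly a 2.35x (~57%) reduction; quantizing the embedding to int8 saves a further ~5 MB.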

@larryliu0820 requested a review from lucylq as a code owner on January 30, 2026 07:32
@pytorch-bot (bot) commented on Jan 30, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17061

Note: Links to docs will display an error until the docs builds have been completed.

❌ 4 New Failures, 6 Pending, 1 Unrelated Failure

As of commit 89808f9 with merge base 9d257c8:

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla bot added the CLA Signed label on Jan 30, 2026
@larryliu0820 added the release notes: desktop label on Jan 30, 2026
Copilot AI left a comment:

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@larryliu0820 force-pushed the parakeet_int4 branch 2 times, most recently from 4944410 to b52a1b6 on January 30, 2026 19:11
@mergennachin (Contributor) commented:

Also, please report size reductions for:

  • encoder quantized (4w, groupwise)
  • encoder and decoder quantized (4w, groupwise)
  • encoder, decoder (4w, groupwise) and embedding quantized (8w, per-channel)

@larryliu0820 force-pushed the parakeet_int4 branch 3 times, most recently from 3c9465a to 3df475a on February 2, 2026 22:09
@larryliu0820 force-pushed the parakeet_int4 branch 2 times, most recently from 15e2493 to 78a070d on February 2, 2026 23:15
Commit Message

- Add int4/int8 quantization support for Parakeet TDT model export using torchao
- Add storage_offset support in CUDA AOTI shims to enable quantized weight tensor views
- Extract quantization utilities to a separate module for reusability

CUDA AOTI Shim Changes

Modified `aoti_torch__reinterpret_tensor` in `backends/cuda/runtime/shims/memory.cpp` to support non-zero storage offsets, which is required for int4 quantized weight tensors:

- **Removed** the `validate_storage_offset` check that rejected non-zero offsets
- **Added** logic to compute the adjusted data pointer: `base_ptr + storage_offset * element_size`
- **Updated** memory tracking to use `base_data_ptr` for reference counting
- **Added** tracking for offset `data_ptr` as `NOT_OWN` to enable proper tensor deletion

This enables the CUDA backend to handle tensor views created by torchao's int4 quantization, which uses `_convert_weight_to_int4pack` and `_weight_int4pack_mm` operations that produce tensors with non-zero storage offsets.
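As a small runnable illustration of that pointer arithmetic in plain PyTorch (not the shim code itself): a strided view's data pointer is the base pointer plus `storage_offset` scaled by the element size.

```python
import torch

# A view with a non-zero storage offset: the kind of tensor the shim
# change must now handle.
base = torch.arange(12, dtype=torch.int32)
view = base.as_strided(size=(4,), stride=(1,), storage_offset=8)
assert view.storage_offset() == 8

# The adjustment described above: base_ptr + storage_offset * element_size.
adjusted = base.data_ptr() + view.storage_offset() * base.element_size()
assert adjusted == view.data_ptr()
```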


Co-authored-by: Cursor <cursoragent@cursor.com>
@larryliu0820 merged commit c655dc6 into main on Feb 3, 2026
331 of 337 checks passed
@larryliu0820 deleted the parakeet_int4 branch on February 3, 2026 01:04