Enable int4 quantization for parakeet #17061

Merged
larryliu0820 merged 1 commit into main from parakeet_int4 on Feb 3, 2026

Conversation

@larryliu0820 (Contributor) commented on Jan 30, 2026

Summary

  • Add int4/int8 quantization support for Parakeet TDT model export using torchao
  • Extract quantization utilities to a separate module for reusability

Changes

Quantization Support for Parakeet

Added support for quantizing the encoder and decoder components with multiple configurations (see the torchao sketch after this list):

  • Linear layers: `4w`, `8w`, `8da4w`, `8da8w` quantization configs
  • Embedding layers: `4w`, `8w` quantization configs
  • Packing formats: `tile_packed_to_4d` for optimized inference on CUDA
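As a rough illustration, here is a minimal sketch of what a `4w` linear config with group size 32 could translate to under torchao's `quantize_` API. This is an assumption for illustration, not the export script's actual wiring, and the `tile_packed_to_4d` packing option is omitted:

```python
import torch
from torchao.quantization import Int4WeightOnlyConfig, quantize_

# Minimal sketch, assuming torchao's quantize_ API; illustrative only,
# not the actual export script code. "4w" means int4 weight-only
# quantization; group_size=32 matches the script's default.
model = torch.nn.Sequential(torch.nn.Linear(256, 256)).to(torch.bfloat16).to("cuda")
quantize_(model, Int4WeightOnlyConfig(group_size=32))
```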

New CLI Arguments

| Argument | Description |
|----------|-------------|
| `--qlinear_encoder` | Quantization config for encoder linear layers |
| `--qlinear_encoder_group_size` | Group size for encoder quantization (default: 32) |
| `--qlinear_encoder_packing_format` | Packing format for encoder |
| `--qlinear` | Quantization config for decoder linear layers |
| `--qlinear_group_size` | Group size for decoder quantization (default: 32) |
| `--qlinear_packing_format` | Packing format for decoder |
| `--qembedding` | Quantization config for embedding layer |
| `--qembedding_group_size` | Group size for embedding quantization |

Code Organization

  • Extracted quantize_model_() function to examples/models/parakeet/quantize.py (mirrors optimum-executorch naming)
  • Model moved to CUDA after preprocessor export when --backend cuda is specified
  • Example inputs created on the correct device to match model parameters (see the sketch below)
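A minimal sketch of that device-placement flow, using a placeholder module rather than the actual Parakeet model:

```python
import torch

# Placeholder standing in for the Parakeet TDT model; the real script
# loads the model and exports the preprocessor on CPU first.
model = torch.nn.Linear(80, 128)

backend = "cuda"
if backend == "cuda" and torch.cuda.is_available():
    # Move to CUDA only after the preprocessor export, as described above.
    model = model.to("cuda")

# Create example inputs on the same device as the model parameters so
# export tracing sees consistent devices.
device = next(model.parameters()).device
example_inputs = (torch.randn(2, 80, device=device),)
```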

Example Usage

Int4 Linear with Tile Packing

```bash
python examples/models/parakeet/export_parakeet_tdt.py \
    --backend cuda \
    --dtype bf16 \
    --qlinear_encoder 4w \
    --qlinear_encoder_packing_format tile_packed_to_4d \
    --qlinear 4w \
    --qlinear_packing_format tile_packed_to_4d \
    --output-dir ./parakeet_int4
```

Int4 Linear + Int8 Embedding

```bash
python examples/models/parakeet/export_parakeet_tdt.py \
    --backend cuda \
    --dtype bf16 \
    --qlinear_encoder 4w \
    --qlinear_encoder_packing_format tile_packed_to_4d \
    --qlinear 4w \
    --qlinear_packing_format tile_packed_to_4d \
    --qembedding 8w \
    --output-dir ./parakeet_int4_emb8
```

Test Plan

  • Export with CUDA backend and int4 quantization completes successfully
  • Model runs through the encoder with storage_offset tensors
  • Full transcription accuracy matches eager mode
  • Model size is reduced by quantization as expected

Model Sizes

bfloat16 (baseline)

```
-rw-r--r--. 1 dev dev 1274592908 Jan 23 07:01 aoti_cuda_blob.ptd
-rw-r--r--. 1 dev dev    4497576 Jan 23 07:01 model.pte
```

Encoder and decoder int4 groupwise (group size 32) quantized

```
-rw-r--r--. 1 dev dev 542162572 Jan 30 07:03 aoti_cuda_blob.ptd
-rw-r--r--. 1 dev dev   4125496 Jan 30 07:03 model.pte
```

Int4 linear + Int8 embedding

```
-rw-r--r--.  1 dev dev 536943756 Feb  2 18:21 aoti_cuda_blob.ptd
-rw-r--r--.  1 dev dev   4134072 Feb  2 18:21 model.pte
```
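In other words, int4 groupwise quantization shrinks the CUDA blob from ~1.27 GB to ~542 MB, roughly a 2.35x (~57%) reduction; quantizing the embedding to int8 saves a further ~5 MB.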

@larryliu0820 requested a review from lucylq as a code owner on January 30, 2026 07:32
@pytorch-bot (bot) commented on Jan 30, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17061

Note: Links to docs will display an error until the docs builds have been completed.

❌ 4 New Failures, 6 Pending, 1 Unrelated Failure

As of commit 89808f9 with merge base 9d257c8:

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla bot added the CLA Signed label on Jan 30, 2026
@larryliu0820 added the release notes: desktop label on Jan 30, 2026
Copilot AI left a comment:

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@larryliu0820 force-pushed the parakeet_int4 branch 2 times, most recently from 4944410 to b52a1b6 on January 30, 2026 19:11
@mergennachin (Contributor) commented:

Also, please report size reductions for:

  • encoder quantized (4w, groupwise)
  • encoder and decoder quantized (4w, groupwise)
  • encoder, decoder (4w, groupwise) and embedding quantized (8w, per-channel)

@larryliu0820 force-pushed the parakeet_int4 branch 3 times, most recently from 3c9465a to 3df475a on February 2, 2026 22:09
@larryliu0820 force-pushed the parakeet_int4 branch 2 times, most recently from 15e2493 to 78a070d on February 2, 2026 23:15
Commit Message

- Add int4/int8 quantization support for Parakeet TDT model export using torchao
- Add storage_offset support in CUDA AOTI shims to enable quantized weight tensor views
- Extract quantization utilities to a separate module for reusability

CUDA AOTI Shim Changes

Modified `aoti_torch__reinterpret_tensor` in `backends/cuda/runtime/shims/memory.cpp` to support non-zero storage offsets, which is required for int4 quantized weight tensors:

- **Removed** the `validate_storage_offset` check that rejected non-zero offsets
- **Added** logic to compute the adjusted data pointer: `base_ptr + storage_offset * element_size`
- **Updated** memory tracking to use `base_data_ptr` for reference counting
- **Added** tracking for offset `data_ptr` as `NOT_OWN` to enable proper tensor deletion

This enables the CUDA backend to handle tensor views created by torchao's int4 quantization, which uses `_convert_weight_to_int4pack` and `_weight_int4pack_mm` operations that produce tensors with non-zero storage offsets.
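As a small runnable illustration of that pointer arithmetic in plain PyTorch (not the shim code itself): a strided view's data pointer is the base pointer plus `storage_offset` scaled by the element size.

```python
import torch

# A view with a non-zero storage offset: the kind of tensor the shim
# change must now handle.
base = torch.arange(12, dtype=torch.int32)
view = base.as_strided(size=(4,), stride=(1,), storage_offset=8)
assert view.storage_offset() == 8

# The adjustment described above: base_ptr + storage_offset * element_size.
adjusted = base.data_ptr() + view.storage_offset() * base.element_size()
assert adjusted == view.data_ptr()
```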


Co-authored-by: Cursor <cursoragent@cursor.com>
@larryliu0820 merged commit c655dc6 into main on Feb 3, 2026
331 of 337 checks passed
@larryliu0820 deleted the parakeet_int4 branch on February 3, 2026 01:04