
Remove non-blockwise 8-bit optimizer and legacy quantization CUDA/HIP kernels#1880

Merged: matthewdouglas merged 1 commit into main from remove-unused-kernels on Feb 23, 2026

@matthewdouglas (Member)

Summary

Follows up on #1871 (which removed the Python-side APIs) by deleting the corresponding C++, CUDA, and HIP implementation code. No Python callers remain for any of this.

What was removed

CUDA/HIP kernels (csrc/kernels.cu, csrc/kernels.hip):

  • kPreconditionOptimizerStatic8bit1State
  • kOptimizerStatic8bit1State
  • kPreconditionOptimizerStatic8bit2State
  • kOptimizerStatic8bit2State
  • kPercentileClipping<float/half, 2048, 4>
  • kQuantize / kDequantize (global/non-blockwise variants)
  • All associated #define macros and explicit template instantiation blocks

Ops-level launchers (csrc/ops.cu, csrc/ops.hip):

  • optimizerStatic8bit<T, OPTIMIZER>
  • percentileClipping<T>()
  • quantize() / dequantize()

Headers (csrc/kernels.cuh, csrc/kernels_hip.cuh, csrc/ops.cuh, csrc/ops_hip.cuh):

  • All declarations for the above

Python interface (csrc/pythonInterface.cpp):

  • MAKE_FUNC8 + 8 instantiations (adam/momentum/rmsprop/lion × fp32/fp16)
  • MAKE_CFUNC8 + 8 C-exported wrappers (cadam_static_8bit_grad_32/16, etc.)
  • cquantize / cdequantize
  • cpercentile_clipping_g32 / cpercentile_clipping_g16
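To make the MAKE_FUNC8 / MAKE_CFUNC8 pattern concrete, here is a simplified sketch of the general technique, not the removed code. One macro instantiates a templated C++ function for a given (optimizer, dtype) pair. A second macro wraps that instantiation in an `extern "C"` function with an unmangled name, which a loader such as Python's ctypes can look up by string. Everything in it (`Optimizer`, `updateStub`, `MAKE_FUNC`, `MAKE_CFUNC`, and the `c<name>_static_grad_<bits>` naming) is hypothetical. Only fp32 is shown because standard C++ has no half type, and the body performs a plain `p -= lr * g` step as a stand-in for the real 8-bit state update.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical optimizer tags; the real code enumerates adam/momentum/rmsprop/lion.
enum class Optimizer { Adam, Momentum };

// Hypothetical templated launcher. A plain p -= lr * g step stands in for the
// real 8-bit optimizer update, which is not reproduced here.
template <typename T, Optimizer OPT>
void updateStub(T* p, const T* g, float lr, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        p[i] = static_cast<T>(p[i] - lr * g[i]);
    }
}

// MAKE_FUNC stamps out a non-template C++ function for one (optimizer, dtype) pair.
#define MAKE_FUNC(fname, OPT, gtype, gbits)                              \
    void fname##_static_grad_##gbits(gtype* p, const gtype* g, float lr, \
                                     std::size_t n) {                    \
        updateStub<gtype, OPT>(p, g, lr, n);                             \
    }

MAKE_FUNC(adam, Optimizer::Adam, float, 32)
MAKE_FUNC(momentum, Optimizer::Momentum, float, 32)

// MAKE_CFUNC adds an extern "C" entry point with an unmangled name,
// so a ctypes-style loader can find it by its string name.
#define MAKE_CFUNC(fname, gtype, gbits)                                       \
    extern "C" void c##fname##_static_grad_##gbits(gtype* p, const gtype* g, \
                                                   float lr, std::size_t n) { \
        fname##_static_grad_##gbits(p, g, lr, n);                             \
    }

MAKE_CFUNC(adam, float, 32)
MAKE_CFUNC(momentum, float, 32)
```

Each exported symbol exists only through a macro invocation. As a result, removing one family means deleting the template, the macros, and every invocation together.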

The blockwise variants (optimizerStatic8bitBlockwise, kOptimizerStatic8bit1StateBlockwise, kOptimizerStatic8bit2StateBlockwise) are untouched.
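The practical difference between the removed global (non-blockwise) path and the retained blockwise path is how many scaling constants are kept. The sketch below illustrates this with simple linear absmax int8 quantization. That choice is an assumption made for clarity: bitsandbytes maps values through a dynamic quantization code rather than a linear scale, so this is not the library's kernel logic. With one global absmax, a single outlier reduces the resolution available to every other value. With a per-block absmax, the outlier only affects its own block. `Quantized`, `quantizeAbsmax`, and `dequantizeAbsmax` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Linear absmax int8 quantization with one scale per block (hypothetical sketch).
struct Quantized {
    std::vector<std::int8_t> q;  // quantized values in [-127, 127]
    std::vector<float> absmax;   // one scale per block
    std::size_t blockSize;
};

// blockSize == x.size() reproduces the global (non-blockwise) scheme.
Quantized quantizeAbsmax(const std::vector<float>& x, std::size_t blockSize) {
    assert(blockSize > 0);
    Quantized out{std::vector<std::int8_t>(x.size()), {}, blockSize};
    for (std::size_t start = 0; start < x.size(); start += blockSize) {
        std::size_t end = std::min(start + blockSize, x.size());
        float m = 0.0f;
        for (std::size_t i = start; i < end; ++i) m = std::max(m, std::fabs(x[i]));
        out.absmax.push_back(m);
        for (std::size_t i = start; i < end; ++i) {
            float scaled = (m > 0.0f) ? x[i] / m * 127.0f : 0.0f;
            out.q[i] = static_cast<std::int8_t>(std::lround(scaled));
        }
    }
    return out;
}

// Undo the scaling using the absmax of the block each element belongs to.
std::vector<float> dequantizeAbsmax(const Quantized& qz) {
    std::vector<float> x(qz.q.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        float m = qz.absmax[i / qz.blockSize];
        x[i] = qz.q[i] / 127.0f * m;
    }
    return x;
}
```

For example, with `x = {0.01, 0.02, -0.03, 0.04, 100, 1, -2, 3}`, a single global scale (absmax 100) rounds all four small values to 0. With `blockSize = 4`, the first block gets its own absmax of 0.04, and `0.04` round-trips exactly.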

Agent guide updates

Removed stale references to percentile_clipping, the block_wise parameter, optimizer_update_8bit (non-blockwise), and the non-blockwise dispatch branch from agents/api_surface.md, agents/architecture_guide.md, and agents/security_guide.md.
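The dispatch change described for the architecture guide can be pictured roughly as follows. This is a hypothetical illustration, not the bitsandbytes code or the guide's pseudocode: `StateDtype`, `UpdatePath`, and `selectUpdatePath` are invented names. The routing of fp32 state to the 32-bit kernels is an assumption based on the 32-bit ops-path kernels mentioned in the review. Before the removal, uint8 state could go to either a blockwise or a non-blockwise kernel. Afterwards, uint8 has a single route.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical optimizer-state dtypes and kernel families (illustrative names only).
enum class StateDtype { Float32, UInt8 };
enum class UpdatePath { Optimizer32bit, Optimizer8bitBlockwise };

// After the non-blockwise branch is gone, 8-bit (uint8) state always
// selects the blockwise path; there is no flag left to choose otherwise.
UpdatePath selectUpdatePath(StateDtype state) {
    switch (state) {
        case StateDtype::Float32:
            return UpdatePath::Optimizer32bit;
        case StateDtype::UInt8:
            return UpdatePath::Optimizer8bitBlockwise;
    }
    throw std::invalid_argument("unsupported optimizer state dtype");
}
```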

matthewdouglas added this to the v0.50.0 milestone on Feb 23, 2026
matthewdouglas added the ROCm and CUDA labels on Feb 23, 2026
@github-actions commented:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@matthewdouglas (Member, Author) left a comment:

PR Review: #1880 — Remove non-blockwise 8-bit optimizer and legacy quantization CUDA/HIP kernels

Follow-up to #1871: mechanically removes the C++/CUDA/HIP implementations of functions whose Python-side APIs were already deleted. The scope is exactly right — only code with confirmed-zero Python callers is deleted, and the blockwise variants are untouched throughout.

No blocking issues.

What was checked:

  • Python caller verification: grep across the entire Python tree confirms zero remaining references to all removed symbols (optimizer_update_8bit, percentile_clipping, cquantize, cdequantize, cadam_static_8bit_*, etc.). The deletions are safe.

  • Blockwise path untouched: optimizerStatic8bitBlockwise, kOptimizerStatic8bit1StateBlockwise, kOptimizerStatic8bit2StateBlockwise, and all their header declarations/instantiations remain in place. Verified in both kernels.cu and kernels.hip.

  • 32-bit variant not accidentally removed: The remaining kPreconditionOptimizer32bit1State / kPreconditionOptimizer32bit2State references in the diff are the 32-bit ops-path kernels, which are distinct from the deleted 8-bit static ones. No over-deletion.

  • CUDA/HIP symmetry: Both backends receive identical removals (quantize/dequantize launchers, optimizerStatic8bit, percentileClipping, template instantiations, header declarations). ✓

  • The +1 in csrc/ops.cu: Fixes indentation on MAKE_optimizerStatic8bitBlockwise(half, ADAM); which was over-indented inside the now-removed MAKE_optimizerStatic8bit block. Correct.

  • Agent docs: api_surface.md (parameter table and deprecated-symbols table), architecture_guide.md (ops list and dispatch pseudocode), and security_guide.md (trusted hot paths) are all consistently updated. The architecture guide's dispatch pseudocode now accurately shows only the uint8 → blockwise branch.

  • Security: Clear (internal author, no new dependencies)

  • Downstream impact: None (Python API already removed in #1871; no new breakage)

  • Tests: Adequate (no Python callers → no test changes needed; full suite passes)

  • CI: All pass (CUDA 11.8–13.x, ROCm 6.2–7.2, CPU x64/aarch64/macOS/Windows, XPU, Lint)

  • Serialization: Not affected

  • torch.compile: Not affected
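The caller check in the first bullet amounts to searching the Python sources for each removed symbol. Below is a minimal C++17 sketch of such a scan. The symbols come from the review. `findReferences`, the `.py` extension filter, and whole-word regex matching are assumptions. In practice, a `grep -rn` over the tree does the same job.

```cpp
#include <cassert>
#include <cstddef>
#include <filesystem>
#include <fstream>
#include <regex>
#include <sstream>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// One hit: a .py file whose text matches one of the patterns.
struct Hit {
    std::string file;
    std::string pattern;
};

// Scan every .py file under `root` and report which patterns match it.
// Matching is on raw text, so comments and string literals count too,
// which errs toward over-reporting rather than missing a caller.
std::vector<Hit> findReferences(const fs::path& root,
                                const std::vector<std::string>& patterns) {
    std::vector<std::regex> compiled;
    for (const auto& pat : patterns) compiled.emplace_back(pat);
    std::vector<Hit> hits;
    for (const auto& entry : fs::recursive_directory_iterator(root)) {
        if (!entry.is_regular_file() || entry.path().extension() != ".py") continue;
        std::ifstream in(entry.path());
        std::stringstream buf;
        buf << in.rdbuf();
        const std::string text = buf.str();
        for (std::size_t k = 0; k < compiled.size(); ++k) {
            if (std::regex_search(text, compiled[k])) {
                hits.push_back({entry.path().string(), patterns[k]});
            }
        }
    }
    return hits;
}
```

A pattern list for this PR might be `R"(\boptimizer_update_8bit\b)"`, `R"(\bpercentile_clipping\b)"`, `R"(\bcquantize\b)"`, `R"(\bcdequantize\b)"`, and `R"(\bcadam_static_8bit_\w+)"`. The word boundaries keep a removed name from matching retained symbols that merely extend it, such as names with a `_blockwise` suffix.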

matthewdouglas merged commit d5cb0f2 into main on Feb 23, 2026
227 of 229 checks passed