Labels: enhancement (New feature or request)
Description
Summary
Add support for tensor core matrix multiply-accumulate (MMA) operations on modern NVIDIA GPUs.
Motivation
Tensor cores provide massive throughput for matrix operations (e.g., 16x16x16 matrix multiply in a single instruction). This is essential for competitive matmul performance on modern GPUs (Volta and later).
Design considerations
- WMMA (Warp Matrix Multiply Accumulate) API operates on matrix fragments
- Fragment types: a (M×K), b (K×N), c/d (M×N)
- Supported shapes: 16×16×16, 32×8×16, 8×32×16 (varies by GPU architecture)
- Need words for: load fragment, store fragment, MMA compute
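For reference, the three operations listed above (load fragment, MMA compute, store fragment) map directly onto the CUDA WMMA API. A minimal sketch of one warp computing a single 16×16×16 tile D = A·B + C follows; the kernel name and the row/column-major layout choices are illustrative assumptions, not part of this proposal:

```cuda
#include <mma.h>
using namespace nvcuda;

// Illustrative kernel (assumed name/layouts): one warp computes one
// 16x16x16 tile D = A*B + C with f16 inputs and an f32 accumulator.
__global__ void wmma_tile_16x16x16(const half *a, const half *b,
                                   const float *c, float *d) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::load_matrix_sync(a_frag, a, 16);                         // load fragment
    wmma::load_matrix_sync(b_frag, b, 16);                         // load fragment
    wmma::load_matrix_sync(acc_frag, c, 16, wmma::mem_row_major);  // load accumulator
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);            // MMA compute
    wmma::store_matrix_sync(d, acc_frag, 16, wmma::mem_row_major); // store fragment
}
```

Note that the fragment declarations carry the shape (16, 16, 16), element types, and layouts as template parameters, which is the surface the proposed words would need to expose in some form.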
Implementation notes
- Maps to `nvvm.wmma` or `nvvm.mma` intrinsics
- Requires floating-point support (depends on f16/f32 types)
- Fragment storage is distributed across warp lanes
- This is a significant feature that may need its own design document
Priority
Nice to have — needed for peak matmul performance but not for correctness.