Tensor core / MMA intrinsics #11

@tetsuo-cpp

Description

Summary

Add support for tensor core matrix multiply-accumulate (MMA) operations on modern NVIDIA GPUs.

Motivation

Tensor cores provide far higher throughput for matrix operations than ordinary CUDA cores: a single warp-wide instruction performs, e.g., a full 16x16x16 matrix multiply-accumulate. This is essential for competitive matmul performance on modern GPUs (Volta and later).

Design considerations

  • The WMMA (Warp Matrix Multiply Accumulate) API operates on matrix fragments rather than whole matrices
  • Fragment types: a (M×K), b (K×N), c/d (M×N)
  • Supported shapes (M×N×K): 16x16x16, 32x8x16, 8x32x16 (varies by GPU architecture)
  • Need words for: load fragment, store fragment, MMA compute
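For reference, the three proposed words correspond to the load/compute/store operations in CUDA's WMMA API. The sketch below is illustrative only (it is CUDA C++, not this project's syntax) and assumes one common configuration: m16n16k16, f16 inputs, f32 accumulator.

```cuda
#include <mma.h>
using namespace nvcuda;

// One warp computes a 16x16 output tile: D = A * B + C (here C starts at 0).
__global__ void wmma_16x16x16(const half *A, const half *B, float *C,
                              int lda, int ldb, int ldc) {
    // Fragments per the list above: a is M×K, b is K×N, accumulator is M×N.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::fill_fragment(acc_frag, 0.0f);                  // zero accumulator
    wmma::load_matrix_sync(a_frag, A, lda);               // "load fragment"
    wmma::load_matrix_sync(b_frag, B, ldb);               // "load fragment"
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);   // "MMA compute"
    wmma::store_matrix_sync(C, acc_frag, ldc,
                            wmma::mem_row_major);         // "store fragment"
}
```

Note that all three calls are warp-collective: every lane of the warp must execute them with the same arguments.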

Implementation notes

  • Maps to nvvm.wmma or nvvm.mma intrinsics
  • Requires floating-point support (depends on f16/f32 types)
  • Fragment storage is distributed across warp lanes
  • This is a significant feature that may need its own design document
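The lane-distribution point above has a user-visible consequence worth capturing in any design document: a fragment is an opaque, per-lane slice of the tile, so element-wise access only makes sense through the fragment's own element count. A CUDA sketch of this (again illustrative, not this project's API):

```cuda
#include <mma.h>
using namespace nvcuda;

// Scale an accumulator tile in place. Each of the 32 lanes holds only a
// slice of the 16x16 tile (num_elements per lane, e.g. 8 of 256 values);
// which tile element x[i] maps to is architecture-defined and opaque.
__global__ void scale_tile(float *C, int ldc, float alpha) {
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;
    wmma::load_matrix_sync(acc, C, ldc, wmma::mem_row_major);
    for (int i = 0; i < acc.num_elements; ++i)
        acc.x[i] *= alpha;   // per-lane slice; no cross-lane indexing
    wmma::store_matrix_sync(C, acc, ldc, wmma::mem_row_major);
}
```

This is why "load fragment" and "store fragment" need to be distinct words rather than ordinary memory ops: the in-register layout is not something user code can address directly.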

Priority

Nice to have — needed for peak matmul performance but not for correctness.

Metadata

Assignees

No one assigned

Labels

enhancement (New feature or request)
