Warp-level primitives: shuffle and reductions #10

@tetsuo-cpp

Summary

Add warp-level intrinsics for shuffle operations and warp reductions.

Words to implement

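Word       Stack effect               Description
SHFL-DOWN  ( val offset -- result )   Warp shuffle down: read val from the lane at (lane ID + offset)
SHFL-UP    ( val offset -- result )   Warp shuffle up: read val from the lane at (lane ID - offset)
SHFL-XOR   ( val mask -- result )     Warp shuffle XOR (butterfly): read val from the lane at (lane ID XOR mask)
SHFL-IDX   ( val idx -- result )      Warp shuffle from a specific lane: read val from lane idx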
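To make the intended data movement concrete, here is a minimal host-side sketch in Python that models what each word would return across one warp. It is not the GPU implementation: the function names are hypothetical, the warp is a plain list of 32 values, and it assumes the same offset, mask, or idx on every lane (on hardware each lane supplies its own operand). Out-of-range sources return the lane's own value, which mirrors how the CUDA `__shfl_*_sync` functions behave with the default width of 32.

```python
WARP_SIZE = 32  # lanes per warp, as in the issue text

def shfl_down(vals, offset):
    """Lane i reads lane i + offset; lanes past the end keep their own value."""
    return [vals[i + offset] if i + offset < WARP_SIZE else vals[i]
            for i in range(WARP_SIZE)]

def shfl_up(vals, offset):
    """Lane i reads lane i - offset; lanes before the start keep their own value."""
    return [vals[i - offset] if i - offset >= 0 else vals[i]
            for i in range(WARP_SIZE)]

def shfl_xor(vals, mask):
    """Lane i reads lane i XOR mask (butterfly exchange)."""
    return [vals[i ^ mask] if (i ^ mask) < WARP_SIZE else vals[i]
            for i in range(WARP_SIZE)]

def shfl_idx(vals, idx):
    """Every lane reads the value held by lane idx (taken modulo the warp size)."""
    return [vals[idx % WARP_SIZE] for _ in range(WARP_SIZE)]

if __name__ == "__main__":
    lanes = list(range(WARP_SIZE))  # each lane holds its own lane ID
    print(shfl_down(lanes, 1)[:4])
    print(shfl_xor(lanes, 1)[:4])
```

The model assumes non-negative offsets. It is only meant to pin down the stack-effect semantics before the words are lowered to real intrinsics.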

Motivation

  • Needed for high-performance reductions (e.g., sum across a warp without shared memory)
  • Used in split-K matmul variants
  • Warp-level operations avoid shared memory round-trips
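The warp-wide sum mentioned above is usually built from repeated SHFL-DOWN steps with halving offsets (16, 8, 4, 2, 1). The following Python sketch models that pattern on the host, assuming a 32-lane warp held in a list. `shfl_down` and `warp_reduce_sum` are hypothetical names for illustration, not words or functions from this project.

```python
WARP_SIZE = 32

def shfl_down(vals, offset):
    """Model of SHFL-DOWN: lane i reads lane i + offset, or keeps its own value."""
    return [vals[i + offset] if i + offset < WARP_SIZE else vals[i]
            for i in range(WARP_SIZE)]

def warp_reduce_sum(vals):
    """Tree reduction in log2(32) = 5 shuffle steps, with no shared memory."""
    vals = list(vals)
    offset = WARP_SIZE // 2
    while offset > 0:
        shifted = shfl_down(vals, offset)
        # Every lane adds the value it received from the lane `offset` above it.
        vals = [v + s for v, s in zip(vals, shifted)]
        offset //= 2
    # Only lane 0 is guaranteed to hold the complete sum after the last step.
    return vals[0]

if __name__ == "__main__":
    print(warp_reduce_sum(range(WARP_SIZE)))  # sum of 0..31
```

With this pattern, only lane 0 ends up with the total. A butterfly built from SHFL-XOR with the same offsets would leave the total in every lane instead, which is sometimes preferable when all lanes need the result.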

Implementation notes

  • Each word maps to an nvvm.shfl.sync intrinsic in NVVM (the down, up, bfly, and idx shuffle modes respectively)
  • Full warp mask (0xFFFFFFFF) can be the default
  • May also want WARP-SIZE (constant 32) and LANE-ID words
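As a rough illustration of the helper values in these notes, the sketch below models WARP-SIZE, LANE-ID, and the full-warp mask on the host in Python. It assumes a one-dimensional thread block, where the lane ID is the thread index modulo 32, and treats the mask as one bit per lane. The names `lane_id`, `warp_id`, and `lane_in_mask` are hypothetical.

```python
WARP_SIZE = 32          # value the proposed WARP-SIZE word would push
FULL_MASK = 0xFFFFFFFF  # all 32 lanes participate

def lane_id(thread_idx_x):
    """Lane within the warp, assuming a 1-D thread block."""
    return thread_idx_x % WARP_SIZE

def warp_id(thread_idx_x):
    """Warp within the block, assuming a 1-D thread block."""
    return thread_idx_x // WARP_SIZE

def lane_in_mask(mask, lane):
    """True if bit `lane` of the participation mask is set."""
    return (mask >> lane) & 1 == 1

if __name__ == "__main__":
    print(lane_id(37), warp_id(37))
    print(all(lane_in_mask(FULL_MASK, l) for l in range(WARP_SIZE)))
```

Defaulting to the full mask keeps the stack effects of the shuffle words small, at the cost of requiring all 32 lanes to reach the shuffle together.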

Priority

Nice to have: needed for advanced GPU optimization patterns.

Labels: enhancement
