Labels: enhancement (New feature or request)
Description
Summary
Add support for GPU shared memory (__shared__ in CUDA) to enable tiled algorithms.
Motivation
Tiled matmul is the baseline practical GPU implementation. Without shared memory, only naive matmul is possible, which is roughly 100x slower. Shared memory acts as a programmer-managed cache shared by all threads in a block.
Design considerations
- Need a way to declare shared memory (size, element type)
- Need a way to read from and write to shared memory
- Possible syntax: `SHARED <name> <size>` for declaration, reusing `@` and `!` for reads and writes
- MLIR mapping: `memref.alloc` in the GPU shared address space, or `gpu.alloc` with the appropriate memory space
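To make the mapping concrete, here is a rough sketch of what the lowering could produce; the `SHARED` word, the tile size, and the use of `#gpu.address_space<workgroup>` are illustrative assumptions, not a settled design:

```mlir
// Hypothetical Forth source:
//   1024 SHARED tile         \ declare a 1024-element shared buffer
//   ... @ and ! lower to loads/stores on the shared memref
//
// One possible lowering inside the generated GPU kernel:
%tile = memref.alloc() : memref<1024xf32, #gpu.address_space<workgroup>>
%v = memref.load %tile[%i] : memref<1024xf32, #gpu.address_space<workgroup>>
memref.store %v, %tile[%j] : memref<1024xf32, #gpu.address_space<workgroup>>
```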
Implementation notes
- May require extending `!forth.stack` or adding a new shared memory type
- The `memref` should be in the GPU shared address space (address space 3 for NVVM)
- Allocation should be at kernel entry, not per-thread
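One existing mechanism for kernel-entry allocation is the `gpu.func` workgroup attribution, which makes the buffer part of the kernel itself rather than a per-thread `alloc`; the kernel name and tile shape below are illustrative:

```mlir
// The workgroup attribution is allocated once per block at kernel
// entry and shared by every thread in that block.
gpu.func @tiled_matmul_kernel(%arg0: memref<?x?xf32>)
    workgroup(%tile: memref<32x32xf32, #gpu.address_space<workgroup>>)
    kernel {
  // ... tile loads/stores on %tile, with gpu.barrier between phases ...
  gpu.return
}
```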
Priority
High — required for any practical tiled GPU algorithm.