Labels: enhancement (New feature or request)
Description
Summary
Add support for GPU shared memory (__shared__ in CUDA) to enable tiled algorithms.
Motivation
Tiled matmul is the baseline practical GPU implementation. Without shared memory, only naive matmul is possible, which is roughly 100x slower. Shared memory acts as a programmer-managed cache shared by all threads in a block.
Design considerations
- Need a way to declare shared memory (size, element type)
- Need a way to read from and write to shared memory
- Possible syntax: `SHARED <name> <size>` for declaration, reusing `@` and `!` for reads and writes
- MLIR mapping: `memref.alloc` in the GPU shared address space, or `gpu.alloc` with the appropriate memory space
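To make the mapping concrete, here is a rough sketch of what the lowering could produce; the `SHARED` word, the tile size, and the use of `#gpu.address_space<workgroup>` are illustrative assumptions, not a settled design:

```mlir
// Hypothetical Forth source:
//   1024 SHARED tile         \ declare a 1024-element shared buffer
//   ... @ and ! lower to loads/stores on the shared memref
//
// One possible lowering inside the generated GPU kernel:
%tile = memref.alloc() : memref<1024xf32, #gpu.address_space<workgroup>>
%v = memref.load %tile[%i] : memref<1024xf32, #gpu.address_space<workgroup>>
memref.store %v, %tile[%j] : memref<1024xf32, #gpu.address_space<workgroup>>
```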
Implementation notes
- May require extending `!forth.stack` or adding a new shared memory type
- The `memref` should be in the GPU shared address space (address space 3 for NVVM)
- Allocation should be at kernel entry, not per-thread
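One existing mechanism for kernel-entry allocation is the `gpu.func` workgroup attribution, which makes the buffer part of the kernel itself rather than a per-thread `alloc`; the kernel name and tile shape below are illustrative:

```mlir
// The workgroup attribution is allocated once per block at kernel
// entry and shared by every thread in that block.
gpu.func @tiled_matmul_kernel(%arg0: memref<?x?xf32>)
    workgroup(%tile: memref<32x32xf32, #gpu.address_space<workgroup>>)
    kernel {
  // ... tile loads/stores on %tile, with gpu.barrier between phases ...
  gpu.return
}
```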
Priority
High — required for any practical tiled GPU algorithm.