# Grid primitives

Grid-level primitives operate across **all blocks of a single kernel launch** — i.e. the entire device for the duration of one kernel. They sit one tier above block-scope primitives and one tier below "finish the kernel and launch a new one" (the only fully cross-block thread synchronization Quadrants offers).

Grid ops live under `qd.simt.grid`. The namespace currently contains a single op, the device-scope memory fence:

## What's available

| Op                         | CUDA | AMDGPU | Vulkan | Metal |
|----------------------------|------|--------|--------|-------|
| `qd.simt.grid.mem_fence()` | yes  | yes    | yes    | yes\* |

\* On Metal, `grid.mem_fence()` lowers (via MoltenVK / SPIRV-Cross → MSL) to `atomic_thread_fence(metal::memory_scope_device)`. Per the Metal Shading Language specification, that builtin only synchronizes *atomic* and *relaxed-atomic* memory accesses across the device, not plain (non-atomic) loads / stores. So a Metal-portable cross-workgroup producer-consumer pattern needs the produced value itself to be published through an atomic op (e.g. `qd.atomic_or(a[i], 1)`); CUDA, AMDGPU, and native Vulkan (Linux / Windows) order non-atomic stores around `grid.mem_fence()` strictly. The same caveat applies to **Vulkan on macOS**, since on macOS Vulkan is really MoltenVK lowering SPIR-V to MSL — treat `qd.vulkan` on Darwin the same as `qd.metal` for `grid.mem_fence()` purposes. Empirically validate any other pattern, or split into two kernel launches.

Per-backend lowering:

- **CUDA**: `__threadfence()` (NVVM intrinsic `nvvm_membar_gl`).
- **AMDGPU**: an LLVM IR `fence syncscope("agent") acq_rel` synthesized in `llvm_context.cpp::patch_fence` after the bitcode is retargeted to AMDGCN. (We can't use `__builtin_amdgcn_fence` directly because the runtime stubs are compiled host-side.)
- **Vulkan**: SPIR-V `OpMemoryBarrier(ScopeDevice, AcquireRelease | UniformMemory)` emitted by `spirv_codegen.cpp` for the `gridMemoryBarrier` internal op.
- **Metal**: same SPIR-V op as Vulkan; MoltenVK / SPIRV-Cross translates it to MSL `atomic_thread_fence(memory_scope_device)`. Available since MSL 2.0 (macOS 10.13+ / iOS 11+).

### Naming and deprecated alias

The op was previously named `qd.simt.grid.memfence()` (no underscore). It is still callable under that name as a deprecated alias that emits a `DeprecationWarning` on first use; new code should use `qd.simt.grid.mem_fence()`. The alias will be removed in a future release.

## Barrier vs fence at grid scope

There is no `grid.sync()` — Quadrants does not expose a thread-converging barrier across blocks within a single kernel launch. The reasons are practical: CUDA cooperative groups need launch-time opt-in, AMDGPU and SPIR-V either lack a comparable primitive or expose it only under non-portable extensions, and the latency of an in-kernel grid barrier is comparable to a kernel relaunch on most hardware.

What Quadrants does provide at grid scope is a pure **memory fence**:

- **`qd.simt.grid.mem_fence()`** orders the calling thread's memory operations at device scope: writes issued before the fence become visible to threads in **any block of the grid** before writes issued after it, and reads issued after the fence are not reordered ahead of it. It does **not** synchronize threads.
- For full thread synchronization across the grid, finish the current kernel and launch a new one — the implicit kernel-end barrier is the canonical cross-block synchronization in Quadrants. A minimal sketch of this two-launch pattern follows this list.
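For comparison, here is a minimal sketch of that two-launch pattern, using a per-block partials array like the lookback example further down. The kernel bodies are elided and the host-side `launch(...)` helper is a hypothetical placeholder (this section does not document the real Quadrants launch API); the only synchronization relied on is the implicit barrier between the two launches.

```python
@qd.kernel
def write_partials(data, partials) -> None:
    # Phase 1: each block reduces its tile and publishes one partial to global memory.
    bid = qd.simt.block.global_thread_idx() // BLOCK_SIZE
    tid = qd.simt.block.global_thread_idx() % BLOCK_SIZE
    block_sum = ...                  # block-local reduction over this block's tile (elided)
    if tid == 0:
        partials[bid] = block_sum    # plain store; no grid fence needed

@qd.kernel
def consume_partials(partials, out) -> None:
    # Phase 2: by the time this kernel starts, every partial written in phase 1
    # is visible to every block; the kernel boundary did the synchronization.
    bid = qd.simt.block.global_thread_idx() // BLOCK_SIZE
    prefix = 0
    for prev in range(bid):
        prefix += partials[prev]
    ...                              # combine prefix with this block's elements (elided)

# Host side (hypothetical launch helper, for illustration only):
launch(write_partials, data, partials)
launch(consume_partials, partials, out)   # second launch = grid-wide barrier
```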
## Semantics

### `qd.simt.grid.mem_fence()`

A device-scope memory fence. Lowering on each backend is described in "Per-backend lowering" above. No convergence requirement on any backend — safe to call from divergent control flow (e.g. inside `if tid == 0`).

Use this when one block (or one thread per block) needs to publish data to global memory and have the publication be visible to other blocks **without** waiting at a kernel boundary. The canonical use case is the decoupled look-back pattern in Onesweep-style device scans:

```python
@qd.kernel
def lookback_scan(...) -> None:
    bid = qd.simt.block.global_thread_idx() // BLOCK_SIZE
    tid = qd.simt.block.global_thread_idx() % BLOCK_SIZE
    block_sum = ...
    if tid == 0:
        partials[bid] = block_sum
        qd.simt.grid.mem_fence()          # (1) publish partials before the flag
        flags[bid] = STATE_AGGREGATE
    if tid == 0:
        prev = bid - 1
        while prev >= 0:
            while flags[prev] == STATE_INVALID:
                qd.simt.grid.mem_fence()  # (2) force a re-fetch each spin iteration
            qd.simt.grid.mem_fence()      # (3) reader-side fence before reading partials
            block_sum += partials[prev]
            ...
```

The three `grid.mem_fence()` calls are doing different jobs:

1. The first orders the publication: any block that reads `flags[bid] == STATE_AGGREGATE` is guaranteed to also see the published `partials[bid]`.
2. The second is **inside** the spin-wait, and is what forces each iteration to actually re-fetch `flags[prev]` from global memory (see "Spin-wait gotcha" below).
3. The third is the symmetric reader-side fence: after observing the predecessor's flag, the reader needs to refresh its view of `partials[prev]` (the data already lives in global memory, but the fence pins down the ordering).

The fence does not require thread convergence, which is why it appears inside `if tid == 0` without deadlocking — `qd.simt.block.sync()` would deadlock there; `grid.mem_fence()` is safe.

> **Metal / Vulkan-on-macOS portability:** the example as written above relies on the producer's plain (non-atomic) store `flags[bid] = STATE_AGGREGATE` becoming visible to other workgroups once `grid.mem_fence()` retires. CUDA, AMDGPU, and native Vulkan honor this strictly; Metal (and therefore Vulkan-on-macOS) does **not** — `atomic_thread_fence(memory_scope_device)` only orders atomic accesses across the device. To make this idiom Metal-portable, publish through an atomic store (`qd.atomic_or(flags[bid], STATE_AGGREGATE)`) or split the producer and consumer phases into separate kernel launches. A sketch of the atomic variant follows this note.
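Here is a minimal sketch of that atomic flag handshake, reusing the names from the lookback example: the plain flag store becomes `qd.atomic_or(flags[bid], STATE_AGGREGATE)` and the spin-read becomes the `qd.atomic_or(flags[prev], 0)` idiom described under "Spin-wait gotcha" below. Per the footnote under "What's available", a strictly Metal-portable version may also need the payload in `partials` itself published through atomic ops, or the two phases split into separate launches.

```python
if tid == 0:
    # Producer: publish the partial, then raise the flag through an atomic op
    # so Metal's device-scope fence (which only orders atomics) still applies.
    partials[bid] = block_sum
    qd.simt.grid.mem_fence()                    # order the payload before the flag
    qd.atomic_or(flags[bid], STATE_AGGREGATE)   # atomic publish instead of a plain store

if tid == 0:
    # Consumer: spin on an atomic load of the predecessor's flag.
    prev = bid - 1
    while prev >= 0:
        while qd.atomic_or(flags[prev], 0) == STATE_INVALID:
            pass                                # atomic read every iteration; nothing for LICM to hoist
        qd.simt.grid.mem_fence()                # reader-side fence before reading partials
        block_sum += partials[prev]
        ...
```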
#### Spin-wait gotcha

Quadrants ndarray reads compile to plain LLVM loads with no ordering or volatility. A naive spin-wait

```python
while flags[prev] == STATE_INVALID:
    pass
```

is therefore unsafe: LLVM's loop-invariant code motion will hoist the load out of the loop (the loop body has no aliasing writes), so once the first read sees `STATE_INVALID` the loop never observes the producer's update — it spins forever. Two correct spellings:

- **Fence inside the loop body** (used in the example above):

  ```python
  while flags[prev] == STATE_INVALID:
      qd.simt.grid.mem_fence()
  ```

  The fence has compiler-visible global side effects, which prevents LICM from hoisting the `flags[prev]` load across it; each iteration is forced to re-fetch from global memory. Cost: one fence per spin iteration. Works on CUDA, AMDGPU, and native Vulkan; **does not work on Metal / Vulkan-on-macOS** (the producer's plain store may never become visible — see the portability note above). Use the atomic-load idiom below for Metal-portable code.

- **Atomic load via `atomic_or` with zero**:

  ```python
  while qd.atomic_or(flags[prev], 0) == STATE_INVALID:
      pass
  ```

  `qd.atomic_or(x, 0)` is an atomic op that returns the current value without modifying it, which forces a real memory access on every iteration and additionally pins down ordering. Slightly cheaper than a per-iteration full grid fence on most hardware, arguably the cleaner expression of "atomically load this slot", and the only Metal-portable spelling — pair it with an atomic *store* on the producer side (`qd.atomic_or(flags[bid], STATE_AGGREGATE)` instead of a plain store + fence) so that both ends of the handshake go through atomic ops.

## Performance and portability notes

- **GPU-portable.** All four GPU backends implement `grid.mem_fence()` natively (see "Per-backend lowering"). The one residual portability gap is Metal / Vulkan-on-macOS only ordering *atomic* memory accesses across the device — patterns relying on plain stores becoming visible across workgroups must publish through atomics on those targets.
- **Cost scales with the global-cache invalidation domain.** A grid fence drains the L2 (and on some GPUs the L1) caches of all SMs / CUs touching the address. On A100 / H100 the cost is on the order of tens to low hundreds of nanoseconds per call; AMDGPU `acq_rel` agent fences and Vulkan `OpMemoryBarrier(ScopeDevice)` are in the same ballpark on contemporary hardware. In tight loops, prefer batching multiple cross-block updates per fence.
- **Pair with the right ordering of memory ops.** The fence orders the *calling thread*'s memory ops; readers in other blocks need their own fence (or an atomic load) to refresh their view. The producer-fence + consumer-fence pattern is the canonical idiom.
- **Not a substitute for atomics on contended locations.** A fence orders writes but does not serialize them. If multiple blocks write to the same location, you need an atomic regardless of how the fence is placed.

## Related

- `qd.simt.block.*` — the block-scope counterpart, including `qd.simt.block.mem_fence()` (block-scope fence) and `qd.simt.block.sync()` (block-scope barrier).
- `qd.simt.subgroup.*` — subgroup-scope barriers, fences, shuffles, and reductions.
- [parallelization](parallelization.md) — the broader synchronization story; explains how grid-scope fences fit relative to atomics, block barriers, and the kernel boundary.