Subgroup primitives#
Subgroup operations let threads within the same subgroup (warp on NVIDIA, wave on AMD) exchange register values directly, without using shared memory or barriers. They are the building block for fast in-warp data exchange — broadcasts, neighbour exchanges, permutations, reductions — and are used internally by Tile16x16 (see tile16).
Subgroup ops live under qd.simt.subgroup and are written so the same Python source compiles to the right vendor primitive on each backend.
What’s available#
| Op | CUDA | AMDGPU | SPIR-V (Vulkan) | dtypes |
|---|---|---|---|---|
| shuffle | yes | yes | yes | i32, u32, f32, f64, i64, u64 |
| shuffle_down | yes | yes* | yes | i32, u32, f32, f64, i64, u64 |
| reduce_add | yes | yes* | yes | any type supporting + and shuffle_down |
| reduce_all_add | yes | yes | yes | any type supporting + and shuffle |
* AMDGPU shuffle_down (and therefore reduce_add, which is built on it) is currently emulated via ds_bpermute (~50 cycle latency).
The remaining shuffle flavours (shuffle_up, shuffle_xor) are exposed in the Python module but are not yet implemented across backends. Calling them will fail at codegen. Use shuffle with an explicit lane index in the meantime — every shuffle pattern can be expressed that way.
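Until then, a shuffle_up-style exchange can be spelled with shuffle and an explicit lane index. A minimal sketch (shift_up_by_two and the offset of 2 are illustrative, not part of the library):

import quadrants as qd
from quadrants.lang.simt import subgroup

@qd.kernel
def shift_up_by_two(src: qd.types.ndarray(dtype=qd.f32, ndim=1),
                    dst: qd.types.ndarray(dtype=qd.f32, ndim=1)):
    qd.loop_config(block_dim=64)
    for i in range(src.shape[0]):
        lane = subgroup.invocation_id()
        # shuffle_up(value, 2): lane i reads the value held by lane i - 2.
        # Lanes 0 and 1 compute an out-of-range source lane, so their result
        # is implementation-defined (see the shuffle semantics below).
        dst[i] = subgroup.shuffle(src[i], qd.cast(lane - 2, qd.u32))

A shuffle_xor-style exchange is the same idea with lane ^ mask as the index; the neighbour-swap example further down shows it for mask 1.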
The SPIR-V-only no-arg reductions (subgroup.reduce_mul / reduce_min / reduce_max / reduce_and / reduce_or / reduce_xor, plus the original reduce_add(value) with no log2_size) have been removed in favour of the portable sized API described below. For reductions other than sum, build a sized helper on top of shuffle_down / shuffle following the same pattern.
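For instance, a sized product reduction can follow the reduce_add recipe directly. A minimal sketch, hand-unrolled for aligned groups of 4 lanes rather than taking a log2_size template, and using the same imports as the sketch above (reduce_mul4 is a hypothetical helper name; it assumes @qd.func accepts an unannotated scalar argument like the built-in helpers):

@qd.func
def reduce_mul4(value):
    # Same shuffle_down tree shape as reduce_add(value, 2), with * instead of +.
    # Only lane 0 of each aligned group of 4 lanes holds the full product;
    # the other lanes hold partial products.
    value = value * subgroup.shuffle_down(value, qd.u32(2))
    value = value * subgroup.shuffle_down(value, qd.u32(1))
    return value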
Semantics#
All of these ops operate within a single subgroup: they do not move data through memory and do not synchronise across subgroups.
shuffle(value, index)#
Each lane returns the value held by the lane whose subgroup-local id equals index.
- value is a scalar in a register. Supported dtypes are 32-bit and 64-bit signed/unsigned ints and f32/f64. (64-bit types are split into two 32-bit shuffles on AMDGPU; CUDA dispatches to its native 64-bit helpers.)
- index is a u32. If index is out of range for the active subgroup the result is implementation-defined, so pass subgroup.invocation_id()-derived values or known-good lane ids.
shuffle_down(value, offset)#
Lane i returns the value held by lane i + offset. Lanes near the top of the subgroup — where i + offset >= subgroup_size — receive an implementation-defined value (typically their own value), so reduction patterns must only trust lane 0’s final result, or mask out the out-of-range lanes.
- value and offset dtypes: same as shuffle above; offset is a u32.
- Maps to __shfl_down_sync on CUDA and OpGroupNonUniformShuffleDown on SPIR-V. On AMDGPU it is currently emulated with ds_bpermute (see the support matrix above).
Common to both#
- Ops are issued under a full active mask on CUDA (0xFFFFFFFF). Call them from uniform control flow; calling from divergent control flow is undefined on most backends. In other words, all threads have to execute the shuffle (see the sketch after this list).
- Subgroup size varies by backend (32 on NVIDIA, 32 or 64 on AMD, 32 in Vulkan compute on most GPUs).
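A minimal sketch of the safe shape (the kernel name and the per-lane condition are illustrative): every lane issues the shuffle under uniform control flow, and only the use of its result sits behind the divergent branch.

import quadrants as qd
from quadrants.lang.simt import subgroup

@qd.kernel
def guarded_use(src: qd.types.ndarray(dtype=qd.f32, ndim=1),
                dst: qd.types.ndarray(dtype=qd.f32, ndim=1)):
    qd.loop_config(block_dim=64)
    for i in range(src.shape[0]):
        # Every lane executes the shuffle (uniform control flow) ...
        neighbour = subgroup.shuffle_down(src[i], qd.u32(1))
        # ... and only the use of its result is guarded by a per-lane branch.
        if src[i] > 0.0:
            dst[i] = neighbour
        else:
            dst[i] = src[i]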
reduce_add(value, log2_size)#
Sums value across 2**log2_size consecutive lanes via a shuffle_down tree. The result is valid in lane 0 of each group; other lanes hold partial sums and should be considered undefined.
- log2_size is a qd.template() — a compile-time constant. The body unrolls into exactly log2_size shuffle_down + add pairs in the calling kernel's IR, with no runtime loop overhead.
- 2**log2_size must not exceed the active subgroup size on the target (32 on CUDA/Metal and on RDNA, 64 on CDNA). Passing a larger value produces implementation-defined results; it does not error.
- The reduction works on any type that supports + and shuffle_down; in practice this means i32, u32, f32, f64, i64, u64.
- Decorated with @qd.func and inlined into the calling kernel — there is no kernel-launch overhead and no separate symbol to link.
Lanes 1..2**log2_size - 1 receive undefined-but-safe partial sums (they never touch out-of-range lanes because the tree shrinks each step), but only lane 0’s result is meaningful for the caller.
reduce_all_add(value, log2_size)#
Same sum as reduce_add, but broadcast to every lane in each 2**log2_size group. Implemented as a butterfly using shuffle with lane ^ mask, mask stepping through 1, 2, 4, ..., 2**(log2_size-1).
- Same log2_size template + size-cap contract as reduce_add.
- Use this when every lane needs the reduction result (e.g. to divide by the sum, or to branch on it uniformly). It costs exactly the same number of shuffles as reduce_add but leaves the answer in all lanes, so it replaces a reduce_add + shuffle/broadcast pair.
- Uses subgroup.shuffle under the hood.
Examples#
Broadcast lane 0 to all lanes#
import quadrants as qd
from quadrants.lang.simt import subgroup

@qd.kernel
def broadcast(a: qd.types.ndarray(dtype=qd.f32, ndim=1)):
    qd.loop_config(block_dim=64)
    for i in range(a.shape[0]):
        a[i] = subgroup.shuffle(a[i], qd.u32(0))
After the kernel, every lane in a subgroup holds the original value of its lane 0.
Identity shuffle (each lane reads its own id)#
Useful as a sanity check:
@qd.kernel
def identity(src: qd.types.ndarray(dtype=qd.f32, ndim=1),
             dst: qd.types.ndarray(dtype=qd.f32, ndim=1)):
    qd.loop_config(block_dim=64)
    for i in range(src.shape[0]):
        lane = subgroup.invocation_id()
        dst[i] = subgroup.shuffle(src[i], qd.cast(lane, qd.u32))
dst[i] equals src[i] on every lane.
Swap neighbours (xor pattern via explicit lane)#
@qd.kernel
def swap_pairs(src: qd.types.ndarray(dtype=qd.f32, ndim=1),
               dst: qd.types.ndarray(dtype=qd.f32, ndim=1)):
    qd.loop_config(block_dim=64)
    for i in range(src.shape[0]):
        lane = subgroup.invocation_id()
        dst[i] = subgroup.shuffle(src[i], qd.cast(lane ^ 1, qd.u32))
Pairs (0,1), (2,3), … swap their values.
Arbitrary per-lane gather#
@qd.kernel
def reverse4(src: qd.types.ndarray(dtype=qd.f32, ndim=1),
             dst: qd.types.ndarray(dtype=qd.f32, ndim=1)):
    qd.loop_config(block_dim=64)
    for i in range(src.shape[0]):
        lane = subgroup.invocation_id()
        group_base = (lane // 4) * 4
        src_lane = group_base + 3 - lane % 4
        dst[i] = subgroup.shuffle(src[i], qd.cast(src_lane, qd.u32))
Within each group of 4 contiguous lanes the values are reversed.
Tree reduction with shuffle_down#
Classic warp-level sum of 4 values — after the second step, lane 0 of each group of 4 holds the total:
@qd.kernel
def reduce4(src: qd.types.ndarray(dtype=qd.f32, ndim=1),
            dst: qd.types.ndarray(dtype=qd.f32, ndim=1)):
    qd.loop_config(block_dim=64)
    for i in range(src.shape[0]):
        val = src[i]
        val = val + subgroup.shuffle_down(val, qd.u32(2))
        val = val + subgroup.shuffle_down(val, qd.u32(1))
        dst[i] = val
Extend the pattern (offsets 16, 8, 4, 2, 1, …) to reduce a full subgroup; only lane 0’s final value is meaningful, because the lanes near the top read past the end of the subgroup.
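Spelled out for a full 32-lane subgroup, the extended tree looks like the following sketch (reduce32_by_hand is just an illustrative name; as noted above, only lane 0's value is trusted and written out):

@qd.kernel
def reduce32_by_hand(src: qd.types.ndarray(dtype=qd.f32, ndim=1),
                     dst: qd.types.ndarray(dtype=qd.f32, ndim=1)):
    qd.loop_config(block_dim=32)
    for i in range(src.shape[0]):
        val = src[i]
        # Five shuffle_down + add steps cover offsets 16, 8, 4, 2, 1.
        val = val + subgroup.shuffle_down(val, qd.u32(16))
        val = val + subgroup.shuffle_down(val, qd.u32(8))
        val = val + subgroup.shuffle_down(val, qd.u32(4))
        val = val + subgroup.shuffle_down(val, qd.u32(2))
        val = val + subgroup.shuffle_down(val, qd.u32(1))
        # Only lane 0 holds the full sum of its 32 lanes.
        if subgroup.invocation_id() == 0:
            dst[i // 32] = val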
Sum 32 lanes with reduce_add#
The same tree, packaged as a one-liner. Lane 0 of each group of 32 holds the total; other lanes hold partial sums:
@qd.kernel
def sum32(src: qd.types.ndarray(dtype=qd.f32, ndim=1),
          dst: qd.types.ndarray(dtype=qd.f32, ndim=1)):
    qd.loop_config(block_dim=32)
    for i in range(src.shape[0]):
        total = subgroup.reduce_add(src[i], 5)
        if subgroup.invocation_id() == 0:
            dst[i // 32] = total
5 is log2_size; 2**5 == 32 matches the block dim. The body of reduce_add unrolls at trace time into five shuffle_down + add pairs, so the generated IR is identical to a hand-written tree reduction.
Broadcast the sum to all lanes with reduce_all_add#
When every lane needs the reduction result — e.g. to normalise by the sum — use the butterfly variant. No follow-up broadcast needed:
@qd.kernel
def normalize32(a: qd.types.ndarray(dtype=qd.f32, ndim=1)):
    qd.loop_config(block_dim=32)
    for i in range(a.shape[0]):
        total = subgroup.reduce_all_add(a[i], 5)
        a[i] = a[i] / total
Every lane in each group of 32 sees the same total.
Partial-subgroup reductions#
log2_size does not have to match the full subgroup. Sum groups of 8 with reduce_add(v, 3) or groups of 16 with reduce_all_add(v, 4); the caller just ensures 2**log2_size <= subgroup_size, i.e. log2_size up to 5 on CUDA / Metal / RDNA and up to 6 on CDNA.
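For instance, summing each aligned group of 8 lanes might look like this sketch (sum8 is an illustrative name; lane 0 of each group is picked out with invocation_id() % 8):

@qd.kernel
def sum8(src: qd.types.ndarray(dtype=qd.f32, ndim=1),
         dst: qd.types.ndarray(dtype=qd.f32, ndim=1)):
    qd.loop_config(block_dim=32)
    for i in range(src.shape[0]):
        # 2**3 == 8: each aligned group of 8 lanes is reduced independently.
        total = subgroup.reduce_add(src[i], 3)
        # Lane 0 of each group (lanes 0, 8, 16, 24) holds the full sum.
        if subgroup.invocation_id() % 8 == 0:
            dst[i // 8] = total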
Performance notes#
- Shuffles are register-to-register on CUDA (__shfl_sync, __shfl_down_sync) and on SPIR-V where the GPU has hardware support — typically a handful of cycles, no memory traffic.
- AMDGPU shuffle and shuffle_down both go through ds_permute / ds_bpermute today (LDS-routed, roughly tens of cycles).
- reduce_add and reduce_all_add both issue exactly log2_size shuffles and log2_size adds per call. No barriers, no shared memory, no launch overhead (they inline).
- Pick reduce_all_add over reduce_add + broadcast when you need the result in every lane — same number of shuffles as reduce_add alone, and one fewer than a reduce_add followed by a broadcast.
- 64-bit dtypes (i64, u64, f64) are emulated as two 32-bit shuffles on AMDGPU. Prefer 32-bit values when you have a choice.