Atomics#
Atomic read-modify-write operations on a single memory location. They do not synchronize threads; the only ordering they provide is the per-location atomicity of the read-modify-write itself. For cooperative ops across threads see the qd.simt.block.*, qd.simt.subgroup.*, and qd.simt.grid.* namespaces. Bit-counting helpers on integer registers (qd.math.popcnt, qd.math.clz) are documented in math.
What’s available#
All atomic ops follow the same shape: qd.atomic_op(x, y) performs x = op(x, y) atomically and returns the old value of x. x must be a writable memory target (a field element, ndarray element, or matrix slot); scalars and constant expressions are not allowed.
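For concreteness, here is a minimal sketch of that shape. Only the qd.atomic_add call and its return-old semantics come from this page; the module import, the qd.field declarations, and the kernel body are illustrative assumptions:
import quadrants as qd  # assumed import name for the qd namespace

# Hypothetical field declarations; any writable field / ndarray element works as x.
samples = qd.field(qd.i32, shape=4096)
hist = qd.field(qd.i32, shape=64)

@qd.kernel
def histogram():
    for i in range(4096):
        bucket = samples[i] % 64
        # hist[bucket] is a writable memory target; the return value is what
        # hist[bucket] held *before* this thread's increment.
        old = qd.atomic_add(hist[bucket], 1)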
“int” below means any of i32 / u32 / i64 / u64. “Floats” means any of f16 / f32 / f64. Unless otherwise noted, “native” means the op lowers to a single hardware atomic instruction (or its SPIR-V / LLVM-IR equivalent), and “CAS” means a software compare-and-swap loop emitted around a non-atomic compute.
| Op | CUDA | AMDGPU | SPIR-V (Vulkan / Metal) | CPU |
|---|---|---|---|---|
| atomic_add | int / f32 native; f64 native (sm_60+) | int / f32 native; f64 hardware-dependent | int native; f16 / f32 / f64 capability-gated, else CAS | int / f32 / f64 native; f16 via CAS |
| atomic_sub | rewritten to atomic_add | (same) | (same) | (same) |
| atomic_mul | CAS on every dtype | CAS | CAS | CAS |
| atomic_min / atomic_max | int native; floats via CAS | int native; floats via CAS | int native; floats via CAS | int native; floats via CAS |
| atomic_and / atomic_or / atomic_xor | int only (native) | int only (native) | int only (native) | int only (native) |
| atomic_exchange | int / float native (‡) | int / float native (‡) | int native; f32 / f64 global via uint-bitcast (‡) | int / float native (‡) |
| atomic_cas | int native (§) | int native (§) | int native (§) | int native (§) |
A few cross-cutting notes that the cells above abbreviate:
- atomic_sub is not a separate op in the IR. quadrants/ir/frontend_ir.cpp::AtomicOpExpression::flatten rewrites every atomic_sub(x, y) into atomic_add(x, -y) before codegen sees it, so per-backend support and per-dtype behaviour are exactly those of atomic_add.
- CAS-loop ops are noticeably slower than native atomics, especially under contention — every contending thread retries the load + compare-exchange until it wins. Prefer pre-aggregating into a register or shared array and issuing a single atomic at the end of the block where possible.
- f16 floats always use a CAS loop (no native f16 atomic on any backend except SPIR-V with the right capability bit).
- On CPU, “native” does not guarantee a single machine instruction. On x86 and other architectures without hardware float atomics, the compiler backend lowers native float atomic_add (and integer min / max) to a CAS loop in machine code. Under high contention the performance is similar to the explicit “CAS” entries; the difference is that “native” ops benefit from hardware acceleration where available.
- SPIR-V capability bits (spirv_has_atomic_float_add, spirv_has_atomic_float64_add, spirv_has_atomic_float16_add) decide whether atomic_add lowers to native OpAtomicFAddEXT or a uint-backed CAS — the dispatch happens per-call inside quadrants/codegen/spirv/spirv_codegen.cpp.
- i64 / u64 atomic RMW is not portable to Metal; see footnote † below for the details.
† i64 / u64 atomic RMW is not portable to Metal. Metal Shading Language only exposes 64-bit atomics as atomic_fetch_min / atomic_fetch_max on uint64, starting at Apple GPU family 9 (M3 / A17 and newer); atomic_add / sub / mul and the bitwise family are unavailable on every Apple GPU. The Metal RHI today over-advertises spirv_has_atomic_int64 (gated on Apple7 / Mac2 in quadrants/rhi/metal/metal_device.mm), so trying to use 64-bit integer atomics under Metal currently fails at pipeline create time with RhiResult=-1 (“SPIR-V shader was rejected by the backend”). Use i32 / u32 if you need cross-Metal portability. CUDA, AMDGPU, and Vulkan with VK_KHR_shader_atomic_int64 are unaffected.
‡ atomic_exchange on f16, on shared (qd.simt.block.SharedArray) float arrays, and on f64 in workgroup memory is not yet wired up. Global-memory atomic_exchange on every other dtype/backend combination listed above is supported; the SPIR-V path bitcasts through the corresponding uint type so no spirv_has_atomic_float_* capability is required.
§ atomic_cas on f32 / f64 is rejected at compile time (raises QuadrantsTypeError). Integer CAS (i32 / u32 / i64 / u64) is supported on every backend listed in the table above, with the same Metal caveat for i64 / u64 (†) as the rest of the 64-bit integer atomic family.
All atomic ops can be called on either global memory (fields, ndarrays) or block-shared memory (qd.simt.block.SharedArray). They are sequentially consistent on the location they touch; they are not memory fences for the rest of the address space - to publish other writes alongside an atomic, pair the atomic with qd.simt.block.mem_fence() (block scope) or qd.simt.grid.mem_fence() (device scope).
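A sketch of that pairing on the producer side (the field declarations, the compute() placeholder, and the flag protocol are assumptions made for illustration; the documented pieces are the atomic itself and qd.simt.grid.mem_fence()):
payload = qd.field(qd.f32, shape=1024)   # assumed field declarations
done_count = qd.field(qd.i32, shape=())

@qd.kernel
def produce():
    for i in range(1024):
        payload[i] = compute(i)              # plain, non-atomic write (compute() is a placeholder)
        qd.simt.grid.mem_fence()             # publish the payload write device-wide...
        qd.atomic_add(done_count[None], 1)   # ...alongside the atomic that other blocks poll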
Semantics#
qd.atomic_add(x, y) - and the rest of the family#
old = qd.atomic_add(x, y)
# Effect:
# tmp = load(x)
# store(x, op(tmp, y))
# old = tmp
# all three steps execute as a single atomic transaction on x.
Properties common to every qd.atomic_*:
- Returns the old value, not the new one. This matches CUDA’s atomicAdd and is what enables building reservation patterns: slot = qd.atomic_add(counter, 1) gives every thread a unique index (see the sketch after this list).
- Per-location atomicity, no fence on the rest of memory. Writes you issued before an atomic on x are not necessarily visible to other threads after they observe the new x. Pair the atomic with qd.simt.block.mem_fence() or qd.simt.grid.mem_fence() if you need that ordering.
- Vector / matrix arguments fan out element-wise. qd.atomic_add(field_of_vec3, qd.Vector([1.0, 2.0, 3.0])) issues three independent scalar atomic-adds, one per component. There is no all-or-nothing guarantee across the components.
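The reservation pattern from the first bullet, end to end (the qd.field declarations and the stream-compaction framing are illustrative assumptions; only qd.atomic_add’s return-old behaviour comes from this page):
src = qd.field(qd.f32, shape=4096)   # assumed input / output fields
out = qd.field(qd.f32, shape=4096)
count = qd.field(qd.i32, shape=())

@qd.kernel
def compact():
    for i in range(4096):
        v = src[i]
        if v > 0.0:
            slot = qd.atomic_add(count[None], 1)  # old value = this thread's unique slot
            out[slot] = v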
qd.atomic_min(x, y) / qd.atomic_max(x, y)#
Atomically writes back min(x, y) (resp. max(x, y)); returns the old value of x. Float min/max are minNum / maxNum-style: if exactly one operand is NaN, the non-NaN operand wins.
| Backends | | | Both inputs NaN |
|---|---|---|---|
| CPU, CUDA, AMDGPU (LLVM) | CAS over | LLVM | |
| Vulkan, Metal (SPIR-V) | capability-gated, usually unsupported | CAS loop with GLSL | undefined per spec; |
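A typical use is a global maximum over a field. The declarations and loop below are illustrative assumptions; the documented piece is the qd.atomic_max call and its NaN rule:
vals = qd.field(qd.f32, shape=4096)       # assumed field declarations
global_max = qd.field(qd.f32, shape=())

@qd.kernel
def reduce_max():
    for i in range(4096):
        # The old value is returned but unused here; a NaN in vals[i] loses to a
        # non-NaN current maximum per the minNum / maxNum-style rule above.
        qd.atomic_max(global_max[None], vals[i])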
qd.atomic_and(x, y) / qd.atomic_or(x, y) / qd.atomic_xor(x, y)#
Bitwise atomics. Integer dtypes only — passing f32 / f64 raises a type error at compile time.
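For example, setting per-item flags in a shared bitmask (the field layout and the kernel argument are illustrative assumptions):
flags = qd.field(qd.u32, shape=128)   # assumed bitmask field, 32 flags per word

@qd.kernel
def mark(i: qd.i32):
    word = i // 32
    bit = i % 32
    old = qd.atomic_or(flags[word], 1 << bit)   # returns the word as it was before the update
    was_already_set = (old >> bit) & 1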
qd.atomic_sub(x, y) / qd.atomic_mul(x, y)#
Atomic subtract and atomic multiply. atomic_sub is rewritten to atomic_add(x, -y) at IR-construction time (quadrants/ir/frontend_ir.cpp::AtomicOpExpression::flatten), so its per-backend behaviour is identical to atomic_add. atomic_mul always lowers to a CAS loop - no LLVM AtomicRMW or SPIR-V OpAtomic* op corresponds to multiply - and is intentionally not heavily optimised; prefer reducing to a different scheme on hot paths.
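Because atomic_sub keeps the return-old semantics of atomic_add, it supports the mirror-image reservation pattern, e.g. draining a list that was appended to with atomic_add (the fields and the consume() placeholder are assumptions):
items = qd.field(qd.f32, shape=4096)   # assumed: filled earlier via the atomic_add pattern
count = qd.field(qd.i32, shape=())

@qd.kernel
def drain():
    for _ in range(4096):
        idx = qd.atomic_sub(count[None], 1) - 1   # old value minus one = the slot we claimed
        if idx >= 0:
            consume(items[idx])                   # placeholder for the real per-item work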
qd.atomic_exchange(x, y)#
Atomically writes y into x and returns the old value of x. Unlike the other qd.atomic_* ops, the new value of x does not depend on its old value - x is unconditionally overwritten. The exchange always succeeds; there is no retry / failure path.
old = qd.atomic_exchange(x, y)
# Effect:
# tmp = load(x)
# store(x, y)
# old = tmp
# all three steps execute as a single atomic transaction on x.
Lowers to one native instruction on every backend (CUDA atomicExch, AMDGPU buffer_atomic_swap / global_atomic_swap, SPIR-V OpAtomicExchange, x86 xchg). Useful for take-ownership / hand-off patterns:
my_old_task = qd.atomic_exchange(slot, NO_TASK)
if my_old_task != NO_TASK:
    process(my_old_task)
# Whatever was in `slot` is now mine to process; I left NO_TASK behind for the
# next worker. No retry needed - exchange always succeeds.
Vector / matrix arguments fan out per component, same as the rest of the qd.atomic_* family: a qd.atomic_exchange(field_of_vec3, qd.Vector([...])) issues three independent scalar exchanges, one per slot, with no all-or-nothing guarantee across the components.
qd.atomic_cas(x, expected, desired)#
Atomic compare-and-swap: writes desired into x if and only if x currently equals expected, and unconditionally returns the value originally at x. Whether the swap actually fired can be recovered with a single comparison on the returned value:
old = qd.atomic_cas(x, expected, desired)
# Effect:
# tmp = load(x)
# if tmp == expected: store(x, desired)
# old = tmp
# all three steps (load, conditional store, return-old) execute as a single atomic transaction on x.
success = (old == expected)
This is the basic primitive on top of which arbitrary atomic read-modify-write operations can be built with a retry loop. Returning the prior value (rather than a (prior, success) pair) matches CUDA atomicCAS and SPIR-V OpAtomicCompareExchange; the op lowers to one native instruction on every backend (CUDA atomicCAS, AMDGPU *_atomic_cmpswap, SPIR-V OpAtomicCompareExchange, x86 cmpxchg).
CAS-loop pattern for ops the framework doesn’t expose natively (e.g. atomic-max-of-some-derived-quantity):
@qd.kernel
def cas_loop_max():
    # Atomically: x = max(x, candidate). The framework already has atomic_max for
    # primitives, but the same shape works for any reduction whose backend support
    # is missing.
    for _attempt in range(MAX_RETRIES):
        cur = x[None]
        new = qd.max(cur, candidate)
        old = qd.atomic_cas(x[None], cur, new)
        if old == cur:
            break  # CAS landed; we're done.
        # Otherwise some other thread won the race; loop back and re-read.
Currently restricted to integer dtypes (i32 / u32 / i64 / u64); float CAS is rejected at compile time. The Metal i64 / u64 caveat in the support table footnote applies here too. There is no shared-memory CAS path yet.
Performance and portability notes#
- Atomic contention is the silent killer of throughput. The cost of qd.atomic_add(counter, 1) from every thread is dominated by serialization at the location, not by the per-thread arithmetic. If many threads hit the same slot, prefer a two-stage scheme: per-warp / per-block reduction first (qd.simt.block.reduce if available, or qd.simt.subgroup.reduce_add), then a single atomic per warp / block (see the sketch after this list).
- Pair atomics with the right fence scope. A bare atomic only orders the location it touches. To make other writes visible to readers that observe the new atomic value, follow the atomic with a fence: block-scope (qd.simt.block.mem_fence()) for shared-memory publishing, or grid-scope (qd.simt.grid.mem_fence()) for cross-block coordination.
- f64 atomics fall off the fast path on most backends; if you only need monotonic accumulation, consider Kahan summation in registers and a single atomic-add at the end of the block.
- atomic_mul is generally a CAS loop under the hood; don’t put it on the hot path.
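A sketch of the two-stage scheme from the first bullet. qd.simt.subgroup.reduce_add is named above; the lane_id() query, the field declarations, and the launch shape are hypothetical and only illustrate the pattern:
vals = qd.field(qd.f32, shape=1 << 20)   # assumed field declarations
total = qd.field(qd.f32, shape=())

@qd.kernel
def sum_all():
    for i in range(1 << 20):
        # Stage 1: combine within the subgroup; no atomics yet.
        partial = qd.simt.subgroup.reduce_add(vals[i])
        # Stage 2: one atomic per subgroup instead of one per thread.
        # NOTE: lane_id() is a hypothetical lane-index query, used here for illustration only.
        if qd.simt.subgroup.lane_id() == 0:
            qd.atomic_add(total[None], partial)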
Atomic visibility scope across backends#
Every qd.atomic_* is emitted at device-wide scope: visible to all threads on the GPU executing the kernel, but not required to be coherent with the host CPU mid-kernel. The host only observes results once the kernel completes, at which point the launcher’s stream-sync flushes everything regardless. Choosing device scope (rather than the strongest “system” scope) lets every backend lower the op to a single hardware atomic instruction instead of a software CAS retry loop, which matters for correctness as much as for speed: under heavy contention, a CAS loop on a non-converging op like atomic_xor can livelock.
You don’t normally need to think about scope as a user. It’s listed here so the per-backend behaviour is explicit:
| Backend | Scope spelling in the IR |
|---|---|
| CPU (x86_64) | LLVM default (system scope) |
| CUDA (NVPTX) | LLVM default (system scope) |
| AMDGPU | LLVM syncscope("agent") |
| Vulkan / Metal (SPIR-V) | SPIR-V Device scope |
CPU and CUDA lower system-scope atomics directly to a single hardware instruction, so they leave the LLVM default alone. AMDGPU’s LLVM backend, in contrast, refuses to use its native single-instruction atomics at system scope (it would have to add cache-flush instructions that don’t exist for that op), and silently falls back to a CAS loop; setting syncscope("agent") is what unlocks the native flat_atomic_xor / global_atomic_xor / flat_atomic_smin / flat_atomic_add_f32 / … instructions. SPIR-V backends spell the same idea with the Device scope token. The user-visible semantics are identical across all four.
Native instruction vs CAS fallback#
The tables below reflect what the in-tree LLVM emits today for Quadrants’ default targets (x86_64; CUDA sm_60+; AMDGPU gfx942 / MI300X at syncscope("agent"); Vulkan/Metal via SPIR-V). Older / different GFX generations are footnoted.
Integer atomics (i32, u32, i64, u64):
| Op | CPU (x86_64) | CUDA | AMDGPU | Vulkan / Metal (SPIR-V) |
|---|---|---|---|---|
| atomic_add / atomic_sub | ✅ | ✅ | ✅ | ✅ |
| atomic_and / atomic_or / atomic_xor | ✅¹ | ✅ | ✅ | ✅ |
| atomic_min / atomic_max | 🟡 | ✅ | ✅ | ✅ |
| atomic_mul | 🟡 | 🟡 | 🟡 | 🟡 |
Floating-point atomics (f32, f64):
| Op | CPU (x86_64) | CUDA | AMDGPU | Vulkan / Metal (SPIR-V) |
|---|---|---|---|---|
| atomic_add / atomic_sub | 🟡 | ✅ | ✅² | ✅³ |
| atomic_min / atomic_max | 🟡 | 🟡 | 🟡² | 🟡 |
| atomic_mul | 🟡 | 🟡 | 🟡 | 🟡 |
Key:
- ✅ — single hardware atomic instruction (lock-prefixed x86, PTX atom.*, AMDGPU flat_atomic_*, or SPIR-V OpAtomic*).
- 🟡 — software cmpxchg / cmpswap retry loop.
f16 atomics are CAS on every backend (Quadrants forces a CAS loop built from llvm.minnum / llvm.maxnum / fadd), and on Vulkan / Metal are additionally gated on spirv_has_atomic_float16_* device capabilities.
¹ lock and / or / xor are single-instruction on x86, but they don’t expose the old value. When the qd.atomic_* return value is unused (the common case — fire-and-forget update) LLVM emits the single lock op. When the old value is consumed, x86 falls back to a cmpxchg loop.
² AMDGPU float-atomic support is GFX-dependent. Empirically with the bundled LLVM:
- gfx942 (CDNA3 / MI300X, Quadrants’ default AMDGPU target): atomic_add f32 / f64 are native (flat_atomic_add_f32 / _f64), atomic_min / max f64 are native (flat_atomic_min_f64 / max_f64); f32 min / max still expand to CAS.
- gfx906, gfx90a, gfx1030, gfx1100: all f32 / f64 float atomics expand to CAS.
³ SPIR-V float atomic_add lowers to OpAtomicFAddEXT when the matching spirv_has_atomic_float{32,64}_add capability is present on the device, and to a CAS loop with a GLSL.std.450 payload otherwise. Quadrants does not currently emit OpAtomicFMinEXT / OpAtomicFMaxEXT, so float min/max is always CAS on SPIR-V backends.