Atomics#

Atomic read-modify-write operations on a single memory location. They do not synchronize threads; the only ordering they provide is the per-location atomicity of the read-modify-write itself. For cooperative ops across threads see the qd.simt.block.*, qd.simt.subgroup.*, and qd.simt.grid.* namespaces. Bit-counting helpers on integer registers (qd.math.popcnt, qd.math.clz) are documented in math.

What’s available#

All atomic ops follow the same shape: qd.atomic_op(x, y) performs x = op(x, y) atomically and returns the old value of x. x must be a writable memory target (a field element, ndarray element, or matrix slot); scalars and constant expressions are not allowed.
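
A minimal sketch of the calling shape (counter is assumed to be a 0-d f32 field declared elsewhere; this page only shows the x[None] element-access style used in the later examples):

@qd.kernel
def calling_shape():
    old = qd.atomic_add(counter[None], 1.0)  # counter[None] is a writable field element: allowed
    # qd.atomic_add(1.0, 2.0)                # rejected: a scalar literal is not a writable memory target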

“int” below means any of i32 / u32 / i64 / u64 (see footnote † below for i64 / u64 on Metal). “Floats” means any of f16 / f32 / f64. Unless otherwise noted, “native” means the op lowers to a single hardware atomic instruction (or its SPIR-V / LLVM-IR equivalent), and “CAS” means a software compare-and-swap loop emitted around a non-atomic compute.

| Op | CUDA | AMDGPU | SPIR-V (Vulkan / Metal) | CPU |
|----|------|--------|-------------------------|-----|
| atomic_add | int / f32 native; f64 native (sm_60+) | int / f32 native; f64 hardware-dependent | int native; f16 / f32 / f64 capability-gated, else CAS | int / f32 / f64 native; f16 via CAS |
| atomic_sub | rewritten to atomic_add(x, -y) at IR-construction time — see note below | (same) | (same) | (same) |
| atomic_mul | CAS on every dtype | CAS | CAS | CAS |
| atomic_min, atomic_max | int native; floats via CAS | int native; floats via CAS | int native; floats via CAS | int native; floats via CAS |
| atomic_and, atomic_or, atomic_xor | int only (native) | int only (native) | int only (native) | int only (native) |
| atomic_exchange | int / float native (atomicExch) | int / float native (*_atomic_swap) | int native; f32 / f64 global via uint-bitcast OpAtomicExchange; f16, shared float, workgroup f64 deferred‡ | int / float native (xchg) |
| atomic_cas | int native (atomicCAS) | int native (*_atomic_cmpswap) | int native (OpAtomicCompareExchange); f32 / f64 rejected at compile time§ | int native (cmpxchg) |

A few cross-cutting notes that the cells above abbreviate:

  • atomic_sub is not a separate op in the IR. quadrants/ir/frontend_ir.cpp::AtomicOpExpression::flatten rewrites every atomic_sub(x, y) into atomic_add(x, -y) before codegen sees it, so per-backend support and per-dtype behaviour are exactly those of atomic_add.

  • CAS-loop ops are noticeably slower than native atomics, especially under contention — every contending thread retries the load + compare-exchange until it wins. Prefer pre-aggregating into a register or shared array and issuing a single atomic at the end of the block where possible.

  • f16 atomics always lower to a CAS loop, with one exception: SPIR-V backends can emit a native f16 atomic when the matching capability bit (spirv_has_atomic_float16_add) is present.

  • On CPU, “native” does not guarantee a single machine instruction. On x86 and other architectures without hardware float atomics, the compiler backend lowers native float atomic_add (and integer min / max) to a CAS loop in machine code. Under high contention the performance is similar to the explicit “CAS” entries; the difference is that “native” ops benefit from hardware acceleration where available.

  • SPIR-V capability bits (spirv_has_atomic_float_add, spirv_has_atomic_float64_add, spirv_has_atomic_float16_add) decide whether atomic_add lowers to native OpAtomicFAddEXT or a uint-backed CAS — the dispatch happens per-call inside quadrants/codegen/spirv/spirv_codegen.cpp.

† i64 / u64 atomic RMW is not portable to Metal. Metal Shading Language only exposes 64-bit atomics as atomic_fetch_min / atomic_fetch_max on uint64, starting at Apple GPU family 9 (M3 / A17 and newer); atomic_add / sub / mul and the bitwise family are unavailable on every Apple GPU. The Metal RHI today over-advertises spirv_has_atomic_int64 (gated on Apple7 / Mac2 in quadrants/rhi/metal/metal_device.mm), so trying to use 64-bit integer atomics under Metal currently fails at pipeline create time with RhiResult=-1 (“SPIR-V shader was rejected by the backend”). Use i32 / u32 if you need cross-Metal portability. CUDA, AMDGPU, and Vulkan with VK_KHR_shader_atomic_int64 are unaffected.

‡ atomic_exchange on f16, on shared (qd.simt.block.SharedArray) float arrays, and on f64 in workgroup memory is not yet wired up. Global-memory atomic_exchange on every other dtype/backend combination listed above is supported; the SPIR-V path bitcasts through the corresponding uint type so no spirv_has_atomic_float_* capability is required.

§ atomic_cas on f32 / f64 is rejected at compile time (raises QuadrantsTypeError). Integer CAS (i32 / u32 / i64 / u64) is supported on every backend listed in the table above, with the same Metal caveat for i64 / u64 (†) as the rest of the 64-bit integer atomic family.

All atomic ops can be called on either global memory (fields, ndarrays) or block-shared memory (qd.simt.block.SharedArray). They are sequentially consistent on the location they touch; they are not memory fences for the rest of the address space. To publish other writes alongside an atomic, pair the atomic with qd.simt.block.mem_fence() (block scope) or qd.simt.grid.mem_fence() (device scope).
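
A publish/observe sketch of that pairing (data, ready, and the result / consume helpers are illustrative names; atomic_add(x, 0) is used only as an atomic read):

# Writer: publish a payload, then raise a flag.
data[i] = result(i)
qd.simt.grid.mem_fence()           # make the data write visible device-wide first
qd.atomic_exchange(ready[i], 1)    # now readers that see the flag can trust data[i]

# Reader: observe the flag atomically, fence, then read the payload.
if qd.atomic_add(ready[j], 0) == 1:
    qd.simt.grid.mem_fence()
    consume(data[j])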

Semantics#

qd.atomic_add(x, y) and the rest of the family#

old = qd.atomic_add(x, y)
# Effect:
#   tmp = load(x)
#   store(x, op(tmp, y))
#   old = tmp
# all three steps execute as a single atomic transaction on x.

Properties common to every qd.atomic_*:

  • Returns the old value, not the new one. This matches CUDA’s atomicAdd and is what enables building reservation patterns: slot = qd.atomic_add(counter, 1) gives every thread a unique index (see the sketch after this list).

  • Per-location atomicity, no fence on the rest of memory. Writes you issued before an atomic on x are not necessarily visible to other threads after they observe the new x. Pair the atomic with qd.simt.block.mem_fence() or qd.simt.grid.mem_fence() if you need that ordering.

  • Vector / matrix arguments fan out element-wise. qd.atomic_add(field_of_vec3, qd.Vector([1.0, 2.0, 3.0])) issues three independent scalar atomic-adds, one per component. There is no all-or-nothing guarantee across the components.
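
The reservation pattern from the first bullet, written out as a stream-compaction sketch (counter, out, src, and keep are illustrative names):

@qd.kernel
def compact():
    for i in range(n):
        if keep(src[i]):
            slot = qd.atomic_add(counter[None], 1)  # unique output index for this element
            out[slot] = src[i]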

qd.atomic_min(x, y) / qd.atomic_max(x, y)#

Atomically writes back min(x, y) (resp. max(x, y)); returns the old value of x. Float min/max are minNum / maxNum-style: if exactly one operand is NaN, the non-NaN operand wins.
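
Concretely, supposing x[None] currently holds 5.0 (qd.f32_nan below is a hypothetical NaN constant, used only to illustrate the rule):

old = qd.atomic_min(x[None], qd.f32_nan)  # exactly one NaN operand: the non-NaN side wins, x stays 5.0
old = qd.atomic_min(x[None], 3.0)         # ordinary case: x becomes 3.0, old receives the prior 5.0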

| Backends | f16 | f32, f64 | Both inputs NaN |
|----------|-----|----------|-----------------|
| CPU, CUDA, AMDGPU (LLVM) | CAS over llvm.minnum / llvm.maxnum | LLVM atomicrmw fmin / fmax | NaN (per LLVM minnum / maxnum spec) |
| Vulkan, Metal (SPIR-V) | capability-gated, usually unsupported | CAS loop with GLSL FMin / FMax | undefined per spec; NaN in practice |

qd.atomic_and(x, y) / qd.atomic_or(x, y) / qd.atomic_xor(x, y)#

Bitwise atomics. Integer dtypes only — passing f32 / f64 raises a type error at compile time.
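
A typical use is a lock-free bitset; the test-and-set below works because atomic_or returns the old word (flags and bit are illustrative names, with flags a field of u32 words; any dtype casts are omitted):

word = bit // 32
mask = 1 << (bit % 32)
old = qd.atomic_or(flags[word], mask)
was_already_set = (old & mask) != 0  # the old value tells us whether another thread set the bit first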

qd.atomic_sub(x, y) / qd.atomic_mul(x, y)#

Atomic subtract and atomic multiply. atomic_sub is rewritten to atomic_add(x, -y) at IR-construction time (quadrants/ir/frontend_ir.cpp::AtomicOpExpression::flatten), so its per-backend behaviour is identical to atomic_add. atomic_mul always lowers to a CAS loop (no LLVM AtomicRMW or SPIR-V OpAtomic* op corresponds to multiply) and is intentionally not heavily optimised; prefer a different scheme on hot paths.

qd.atomic_exchange(x, y)#

Atomically writes y into x and returns the old value of x. Unlike the other qd.atomic_* ops, the new value of x does not depend on its old value: x is unconditionally overwritten. The exchange always succeeds; there is no retry / failure path.

old = qd.atomic_exchange(x, y)
# Effect:
#   tmp = load(x)
#   store(x, y)
#   old = tmp
# all three steps execute as a single atomic transaction on x.

atomic_exchange lowers to one native instruction on every backend (CUDA atomicExch, AMDGPU buffer_atomic_swap / global_atomic_swap, SPIR-V OpAtomicExchange, x86 xchg). Useful for take-ownership / hand-off patterns:

my_old_task = qd.atomic_exchange(slot, NO_TASK)
if my_old_task != NO_TASK:
    process(my_old_task)
# Whatever was in `slot` is now mine to process; I left NO_TASK behind for the next worker.
# No retry needed: exchange always succeeds.

Vector / matrix arguments fan out per component, same as the rest of the qd.atomic_* family: a qd.atomic_exchange(field_of_vec3, qd.Vector([...])) issues three independent scalar exchanges, one per slot, with no all-or-nothing guarantee across the components.

qd.atomic_cas(x, expected, desired)#

Atomic compare-and-swap: writes desired into x if and only if x currently equals expected, and unconditionally returns the value originally at x. The user recovers whether the swap actually fired with one comparison:

old = qd.atomic_cas(x, expected, desired)
# Effect:
#   tmp = load(x)
#   if tmp == expected: store(x, desired)
#   old = tmp
# all three steps (load, conditional store, return-old) execute as a single atomic transaction on x.

success = (old == expected)

This is the basic primitive on top of which arbitrary atomic read-modify-write operations can be built with a retry loop. Returning the prior value (rather than a (prior, success) pair) matches CUDA atomicCAS and SPIR-V OpAtomicCompareExchange; the op lowers to one native instruction on every backend (CUDA atomicCAS, AMDGPU *_atomic_cmpswap, SPIR-V OpAtomicCompareExchange, x86 cmpxchg).

CAS-loop pattern for ops the framework doesn’t expose natively (e.g. atomic-max-of-some-derived-quantity):

@qd.kernel
def cas_loop_max():
    # Atomically: x = max(x, candidate). The framework already has atomic_max for primitives, but the same
    # shape works for any reduction whose backend support is missing.
    for _attempt in range(MAX_RETRIES):
        cur = x[None]
        new = qd.max(cur, candidate)
        old = qd.atomic_cas(x[None], cur, new)
        if old == cur:
            break  # CAS landed; we're done.
        # Otherwise some other thread won the race; loop back and re-read.

Currently restricted to integer dtypes (i32 / u32 / i64 / u64); float CAS is rejected at compile time. The Metal i64 / u64 caveat in the support table footnote applies here too. There is no shared-memory CAS path yet.

Performance and portability notes#

  • Atomic contention is the silent killer of throughput. The cost of qd.atomic_add(counter, 1) from every thread is dominated by serialization at the location, not by the per-thread arithmetic. If many threads hit the same slot, prefer a two-stage scheme: per-warp / per-block reduction first (qd.simt.block.reduce if available, or qd.simt.subgroup.reduce_add), then a single atomic per warp / block (see the sketch after this list).

  • Pair atomics with the right fence scope. A bare atomic only orders the location it touches. To make other writes visible to readers that observe the new atomic value, fence before the publishing atomic on the writer side and after observing it on the reader side: block-scope (qd.simt.block.mem_fence()) for shared-memory publishing, or grid-scope (qd.simt.grid.mem_fence()) for cross-block coordination.

  • f64 atomics fall off the fast path on most backends; if you only need monotonic accumulation, consider Kahan summation in registers and a single atomic-add at the end of the block.

  • atomic_mul is generally a CAS loop under the hood; don’t put it on the hot path.
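
A sketch of the two-stage scheme from the first bullet, using qd.simt.subgroup.reduce_add from this page plus a hypothetical first-lane predicate (the exact lane-index spelling is an assumption):

@qd.kernel
def two_stage_sum():
    for i in range(n):
        lane_sum = qd.simt.subgroup.reduce_add(src[i])  # stage 1: subgroup-local reduction, no atomics
        if qd.simt.subgroup.lane_id() == 0:             # hypothetical: true for one lane per subgroup
            qd.atomic_add(total[None], lane_sum)        # stage 2: one atomic per subgroup, not per thread

The Kahan suggestion from the f64 bullet has the same shape: keep the running sum and its compensation term in registers across the loop, then issue a single atomic at the end:

s = 0.0  # running sum, in a register
c = 0.0  # Kahan compensation term
for i in range(lo, hi):  # this thread's slice of the input (lo / hi are illustrative)
    y = src[i] - c
    t = s + y
    c = (t - s) - y
    s = t
qd.atomic_add(total[None], s)  # one f64 atomic per thread instead of one per element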

Atomic visibility scope across backends#

Every qd.atomic_* is emitted at device-wide scope: visible to all threads on the GPU executing the kernel, but not required to be coherent with the host CPU mid-kernel. The host only observes results once the kernel completes, at which point the launcher’s stream-sync flushes everything regardless. Choosing device scope (rather than the strongest “system” scope) lets every backend lower the op to a single hardware atomic instruction instead of a software CAS retry loop, which matters for correctness as much as for speed: under heavy contention, a CAS loop on a non-converging op like atomic_xor can livelock.

You don’t normally need to think about scope as a user. It’s listed here so the per-backend behaviour is explicit:

| Backend | Scope spelling in the IR |
|---------|--------------------------|
| CPU (x86_64) | LLVM seq_cst (System) |
| CUDA (NVPTX) | LLVM seq_cst (System) |
| AMDGPU | LLVM seq_cst syncscope("agent") |
| Vulkan / Metal (SPIR-V) | SPIR-V Scope = Device |

CPU and CUDA lower system-scope atomics directly to a single hardware instruction, so they leave the LLVM default alone. AMDGPU’s LLVM backend, in contrast, refuses to use its native single-instruction atomics at system scope (it would have to add cache-flush instructions that don’t exist for that op) and silently falls back to a CAS loop; setting syncscope("agent") is what unlocks the native flat_atomic_xor / global_atomic_xor / flat_atomic_smin / flat_atomic_add_f32 / … instructions. SPIR-V backends spell the same idea with the Device scope token. The user-visible semantics are identical across all four.

Native instruction vs CAS fallback#

The tables below reflect what the in-tree LLVM emits today for Quadrants’ default targets (x86_64; CUDA sm_60+; AMDGPU gfx942 / MI300X at syncscope("agent"); Vulkan/Metal via SPIR-V). Older / different GFX generations are footnoted.

Integer atomics (i32, u32, i64, u64):

| Op | CPU (x86_64) | CUDA | AMDGPU | Vulkan / Metal (SPIR-V) |
|----|--------------|------|--------|--------------------------|
| atomic_add, atomic_sub | ✅ | ✅ | ✅ | ✅ |
| atomic_and, atomic_or, atomic_xor | ✅¹ | ✅ | ✅ | ✅ |
| atomic_min, atomic_max | 🟡 | ✅ | ✅ | ✅ |
| atomic_mul | 🟡 | 🟡 | 🟡 | 🟡 |

Floating-point atomics (f32, f64):

| Op | CPU (x86_64) | CUDA | AMDGPU | Vulkan / Metal (SPIR-V) |
|----|--------------|------|--------|--------------------------|
| atomic_add, atomic_sub | 🟡 | ✅ | ✅² | ✅³ |
| atomic_min, atomic_max | 🟡 | 🟡 | 🟡² | 🟡 |
| atomic_mul | 🟡 | 🟡 | 🟡 | 🟡 |

Key:

  • ✅ — single hardware atomic instruction (lock-prefixed x86, PTX atom.*, AMDGPU flat_atomic_*, or SPIR-V OpAtomic*).

  • 🟡 — software cmpxchg / cmpswap retry loop.

f16 atomics are CAS on every backend (Quadrants forces a CAS loop built from llvm.minnum / llvm.maxnum / fadd), and on Vulkan / Metal are additionally gated on spirv_has_atomic_float16_* device capabilities.

¹ lock and / or / xor are single-instruction on x86, but they don’t expose the old value. When the qd.atomic_* return value is unused (the common case — fire-and-forget update) LLVM emits the single lock op. When the old value is consumed, x86 falls back to a cmpxchg loop.

² AMDGPU float-atomic support is GFX-dependent. Empirically with the bundled LLVM:

  • gfx942 (CDNA3 / MI300X, Quadrants’ default AMDGPU target): atomic_add f32 / f64 are native (flat_atomic_add_f32 / _f64), atomic_min / max f64 are native (flat_atomic_min_f64 / max_f64); f32 min/max still expand to CAS.

  • gfx906, gfx90a, gfx1030, gfx1100: all f32 / f64 float atomics expand to CAS.

³ SPIR-V float atomic_add lowers to OpAtomicFAddEXT when the matching spirv_has_atomic_float{32,64}_add capability is present on the device, and to a CAS loop with a GLSL.std.450 payload otherwise. Quadrants does not currently emit OpAtomicFMinEXT / OpAtomicFMaxEXT, so float min/max is always CAS on SPIR-V backends.