Algorithms#
The algorithms here operate at device-level, using all available gpu cores, and contrast with:
Per-thread linear algorithms: operate on per-thread level
Subgroup-level operations: operate on per-subgroup level
Block-level operations: operate on per-block level
Each function is a kernel-side qd.func functions, which must be launched from inside a qd.kernel. They are fully compatible with graph. Every op requires caller-owned scratch, covered in a later section.
Experimental. These device-wide algorithms are a new addition and should be considered experimental: their signatures and semantics may change in a future release. (This does not apply to the deprecated
parallel_sortandPrefixSumExecutor, which are on their own removal track.)
What’s available#
Op |
What it does |
Call from kernel |
Call from host |
|---|---|---|---|
|
|
yes |
no |
|
|
yes |
no |
|
Stream compaction: copy |
yes |
no |
|
LSB radix sort (32-bit / 64-bit scalar keys, optional key-value). |
yes |
no |
|
Collapse each consecutive run of equal keys into |
yes |
no |
|
Host- and kernel-callable helpers returning the scratch slot count each op needs. |
yes |
yes |
|
Odd-even merge sort (in-place, key or key-value). Deprecated: prefer |
no |
yes |
|
Inclusive in-place prefix sum (i32 only). Deprecated: prefer |
no |
yes |
These algorithms run on all of Quadrants’ GPU backends (CUDA, AMDGPU, Vulkan, Metal); they are GPU-only and are not available on the CPU backend. (On Metal only 32-bit dtypes are supported - no i64 / u64 / f64; PrefixSumExecutor is CUDA / Vulkan only.) The *_scratch_slots helpers are plain integer arithmetic, so they also run on the host / CPU.
Composable @qd.func ops#
Every algorithm is a composable @qd.func (e.g. reduce_add, sort): call it at the top level of your own @qd.kernel, so the op is captured into the same kernel / graph as your surrounding phases. The function is annotated with requires_top_level=True, so attempting to use it outside of top level will throw an exception at compile time. Note that within a qd.loop_do_while counts as being at top-level [FIXME: add qd.checkpoint here too, once/when we merge qd.checkpoint].
Key sizing parameters#
Two parameters control sizing:
live element count,
n: , the number of elements that will actually be handled by the algorithm. This is a runtime value, provided in a scalar tensor. It can be modified without recompiling the kernel.maximum capacity,
log256_max_n: a compile-time constant. One compiled kernel can by used for any live count up to that capacity - see Capacity (log256_max_n).
Because each op runs entirely as device code it does no host-side validation (a size check would force a device-to-host read of the count that defeats graph capture), so you must size its scratch correctly up front - see Scratch space.
Capacity (log256_max_n)#
These functions bake their maximum capacity in as a compile-time constant, log256_max_n. An op compiled for a given log256_max_n correctly handles any live count n with 0 <= n <= 256 ** log256_max_n. Passing in an n outside of these bounds is unchecked undefined behavior.
The capacity grows fast - each level multiplies it by 256:
|
Max count ( |
|---|---|
|
|
|
|
|
|
|
|
log256_max_n = 4 already exceeds what a 32-bit index can address (256 ** 4 == 2 ** 32), so values above 4 can never describe a valid (i32-indexed) input and are rejected with a ValueError; in practice you rarely need more than 3-4.
Although the algorithm functions are qd.funcs, they are top-level funcs, and contain multiple gpu/llvm kernels. The exact number that are launched is a function of log256_max_n, via static for loops. Hence log256_max_n must be a compile time constant.
Pick it from an upper bound, not the current count. Use the smallest log256_max_n whose capacity covers the largest count you will ever feed the captured graph - log256_max_n = ceil(log256(capacity)), floored at 1. Size it against a provisioned upper bound (a buffer capacity, a padded element count, …), not today’s n, so the same graph serves the whole range below it:
def log256_max_n(capacity: int) -> int:
d = 1
while 256 ** d < capacity:
d += 1
return d
Over-specifying is safe. A capacity larger than the live count just adds staircase levels that operate on length-1 buffers - harmless identity no-ops. The only cost is a few extra empty launches and marginally larger scratch, so when in doubt, round up.
Under-specifying is a bug. If
256 ** log256_max_n < nthe op cannot address the tail of the input, and there is no host-side guard to catch it. Always cover your worst case.
Size scratch with the same log256_max_n you compile the op for - reduce_scratch_slots(capacity, log256_max_n), exclusive_scan_scratch_slots(capacity, log256_max_n), sort_scratch_slots(capacity, log256_max_n) - see Scratch space. (The *_scratch_slots helpers can also derive the minimal depth from N for you when you omit the argument - but that path is host-only, so when composing the op into a kernel always pass the explicit depth.)
Scratch space#
The algorithms need scratch space in order to run. This is temporary space used by the algorithms. The caller owns the scratch space. It does not need to be initialized in any way. It does need to exist, and be correctly sized, and typed. Every algorithm takes a mandatory scratch argument, to receive the scratch space.
Ask first, then allocate. Each algorithm ships a companion *_scratch_slots(N) function - branch-free integer arithmetic, no device round-trip - that returns the minimum number of slots needed for a length-N input. Allocate at least that many; allocating more is fine. These functions are both host- and kernel-callable: pass a Python int to size an allocation up front, or call the same function inside a @qd.kernel on a device-read N to recompute the requirement on-device and validate it against scratch.shape[0] without ever reading N back to the host..
Algorithm |
Sizing function |
Scratch dtype |
|---|---|---|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
The slot count is dtype-width-independent (it is a count, not a byte count). For the 4-byte / 8-byte algorithms (reduce, scan) you allocate the same number of slots but in a u32 buffer for 4-byte element dtypes and a u64 buffer for 8-byte ones - the partials are bit_cast to / from the element dtype. select, reduce_by_key_add, and the radix sort always use u32 scratch (they stage counts / indices / tile histograms, which are u32 regardless of the element / key dtype).
import quadrants as qd
N = 1_000_000 # capacity: an upper bound on the live count you will feed the captured graph
D = 3 # log256_max_n: 256**3 = 16,777,216 >= N
# 4-byte reduce: u32 scratch, sized for the same depth D the func is compiled for.
scratch = qd.ndarray(qd.u32, shape=qd.algorithms.reduce_scratch_slots(N, D))
# 8-byte reduce: same slot count, u64 buffer.
scratch64 = qd.ndarray(qd.u64, shape=qd.algorithms.reduce_scratch_slots(N, D))
No on-device check. The composable @qd.func forms run directly as device code, so they do no scratch-sufficiency check. Size scratch correctly up front with the matching helper - reduce_scratch_slots(N, log256_max_n), exclusive_scan_scratch_slots(N, log256_max_n), select_scratch_slots(N), reduce_by_key_scratch_slots(N), or sort_scratch_slots(N, log256_max_n) - for the capacity you compile the op against. An undersized buffer corrupts the output (or reads / writes out of bounds) rather than raising.
Semantics#
The active ops below share a calling convention and several rules; these are stated once in Common conventions, and only the op-specific behavior is repeated per op. The internal algorithm for each op is in Under the hood.
Each op section ends with a runnable toy example. They all assume this prelude:
import numpy as np
import quadrants as qd
qd.init(arch=qd.gpu)
N, D = 8, 1 # 8 elements; D = log256_max_n = 1 -> capacity 256**1 = 256 >= N
Common conventions#
Call site. Every op is a composable @qd.func; call it at the top level of your own @qd.kernel (see Composable @qd.func ops). Never nest the call in ordinary runtime for / if / while control flow: that demotes the op’s internal phase loops out of top-level position and drops the per-phase grid-wide barriers it relies on, silently corrupting the result.
Shared arguments. Every op takes a device-resident live count n and a compile-time capacity log256_max_n; the ops that handle typed values also take the element dtype as an explicit template (dtype / key_dtype / value_dtype):
n- the live count, read on-device (e.g.count[0], orn[()]for a 0-d count). See Key sizing parameters.log256_max_n- the compile-time phase count; the op handles any count<= 256 ** log256_max_n, so one captured graph replays across that whole range. See Capacity (log256_max_n).dtype/key_dtype/value_dtype- the element dtype, passed explicitly because anndarraykernel argument exposes no in-kernel.dtype.scratch- caller-owned workspace; size it with the matching*_scratch_slots(capacity, log256_max_n)helper for the capacity you compile against. See Scratch space.
Tensor polymorphism. Every 1-D tensor argument is an ndarray kernel parameter, so it is polymorphic over qd.field, qd.ndarray, and qd.Tensor.
Scalar dtypes & scratch width (reduce / exclusive_scan). The element dtype is one of {qd.i32, qd.u32, qd.f32, qd.i64, qd.u64, qd.f64}; narrower / wider scalar dtypes (qd.i16, qd.f16, …) and struct dtypes raise NotImplementedError. 4-byte dtypes stage their partials through a u32 scratch and 8-byte dtypes through a u64 scratch (same slot count either way; see Scratch space).
Identity value (reduce / exclusive_scan, min / max). The identity is the value that leaves a result unchanged when combined under the op - add -> 0, min -> the largest representable value, max -> the smallest. It pads out-of-range lanes and seeds exclusive_scan’s out[0]. It is derived in-kernel from the element dtype, so there is no runtime identity argument (mirroring the block.reduce_min / subgroup.reduce_min typed wrappers): 0 for add; for min, +inf (floats) / INT{32,64}_MAX (signed ints) / UINT{32,64}_MAX (unsigned); for max, -inf (floats) / INT{32,64}_MIN (signed ints) / 0 (unsigned).
Floating-point non-associativity. For add on f32 / f64, the device combine order differs from a left-to-right host pass, so results are not bitwise-equal to numpy.sum / numpy.cumsum (nor reproducible across N changes). Tests tolerate a small relative error rather than asserting bitwise equality.
Composition shape. Each op is one call placed among your other top-level phases:
@qd.kernel
def my_pipeline(
arr: qd.types.ndarray(dtype=qd.f32, ndim=1),
out: qd.types.ndarray(dtype=qd.f32, ndim=1),
scratch: qd.types.ndarray(dtype=qd.u32, ndim=1),
count: qd.types.ndarray(dtype=qd.i32, ndim=1),
):
# ... other top-level phases ...
qd.algorithms.reduce_add(arr, out, scratch, count[0], qd.f32, log256_max_n)
# ... more top-level phases ...
The per-op snippets below show only the call line; drop it into a kernel like the one above.
qd.algorithms.reduce_{add,min,max}#
Reduction over a 1-D tensor: out[0] holds sum(arr[0:n]) / min(arr[0:n]) / max(arr[0:n]). Signature reduce_{add,min,max}(arr, out, scratch, n, dtype, log256_max_n).
Arguments (see Common conventions for n / dtype / log256_max_n, out, and tensor polymorphism):
arr: 1-D input tensor.out: 1-element tensor with the same dtype asarr.scratch:reduce_scratch_slots(N, log256_max_n)slots -u32for 4-bytearr,u64for 8-byte.
Constraints (plus the shared scalar-dtype set and f32 / f64 non-associativity from Common conventions):
Shape:
arrmust be 1-D andout.shapemust be(1,); both share the same dtype.
Scratch footprint: reduce_scratch_slots(N) ~ ceil(N / BLOCK_DIM) slots, where BLOCK_DIM = 256 (N = 1G is ~4M slots). See Scratch space.
Example - sum of an array:
arr = qd.field(qd.i32, shape=N)
out = qd.field(qd.i32, shape=1)
scratch = qd.field(qd.u32, shape=qd.algorithms.reduce_scratch_slots(N, D))
count = qd.field(qd.i32, shape=1)
arr.from_numpy(np.array([3, 1, 4, 1, 5, 9, 2, 6], dtype=np.int32))
count.from_numpy(np.array([N], dtype=np.int32))
@qd.kernel
def run():
qd.algorithms.reduce_add(arr, out, scratch, count[0], qd.i32, D)
run()
print(out.to_numpy()[0]) # 31 (3 + 1 + 4 + 1 + 5 + 9 + 2 + 6)
reduce_min / reduce_max are identical apart from the call name (they give 1 and 9 for this input).
qd.algorithms.exclusive_scan_{add,min,max}#
Device-wide exclusive prefix scan over a 1-D tensor: out[i] holds the reduction (sum / min / max) of arr[0:i], and out[0] is always the op’s identity value (0 for add). Signature exclusive_scan_{add,min,max}(arr, out, scratch, n, dtype, log256_max_n).
Arguments (see Common conventions for n / dtype / log256_max_n):
arr/out: 1-D input / output tensors, same shape and dtype;outmust be a distinct buffer fromarr(see constraints).scratch:exclusive_scan_scratch_slots(N, log256_max_n)slots -u32for 4-bytearr,u64for 8-byte.
Constraints (plus the shared scalar-dtype set and add non-associativity from Common conventions):
Shape:
arrandoutmust both be 1-D with the same shape and dtype.No in-place scan:
outmust be a distinct buffer fromarr. Calling without is arrraisesValueError. (The kernels do not protect against same-buffer aliasing; allocating one extra buffer once is cheap relative to the scan itself.)
Scratch footprint: exclusive_scan_scratch_slots(N) ~ ceil(N / BLOCK_DIM) slots for the per-tile partials plus the recursive staircase above them (4112 slots at N = 1M). See Scratch space.
Example - running total:
arr = qd.field(qd.i32, shape=N)
out = qd.field(qd.i32, shape=N) # must be a distinct buffer from arr
scratch = qd.field(qd.u32, shape=qd.algorithms.exclusive_scan_scratch_slots(N, D))
count = qd.field(qd.i32, shape=1)
arr.from_numpy(np.array([3, 1, 4, 1, 5, 9, 2, 6], dtype=np.int32))
count.from_numpy(np.array([N], dtype=np.int32))
@qd.kernel
def run():
qd.algorithms.exclusive_scan_add(arr, out, scratch, count[0], qd.i32, D)
run()
print(out.to_numpy()) # [ 0 3 4 8 9 14 23 25 ] (out[i] = sum(arr[:i]))
qd.algorithms.select#
Stream compaction. Copy every arr[i] whose corresponding flags[i] is 1 into a dense prefix of out, in stable input order, and write the count of selected elements to num_out[0]. Signature select(arr, flags, out, num_out, scratch, n, log256_max_n) - there is no dtype argument: the scatter out[idx] = arr[i] lowers per-field, so select works for scalar and struct element dtypes unchanged.
Arguments (see Common conventions for n / log256_max_n and the num_out host-hop rule):
arr: 1-D input of any scalar dtype in{qd.i32, qd.u32, qd.f32, qd.i64, qd.u64, qd.f64}or anyqd.types.struct(...)/qd.Struct.field({...})composite (e.g. libuipcVector2i/Vector3i/Vector4i/LinearBVHAABB-style structs).flags: 1-Dqd.i32tensor with the same shape asarr. Every entry must be exactly0or1(1selects). The algorithm prefix-sumsflagsdirectly as counts, so non-0/1 values produce wrong indices and a wrongnum_outcount - the caller is responsible for normalization (no implicit normalization pass). Populate it with a kernel applying whatever predicate you want.out: 1-D tensor, same dtype asarr, withlen(out) >= len(arr)so the worst-case all-selected run is safe. Onlyout[0 : num_out[0]]carries meaningful data on return; the tail is left untouched.num_out: 1-elementqd.i32tensor receiving the selected count.scratch:select_scratch_slots(N)slots - alwaysu32, regardless ofarr.dtype. See Scratch space.
Scratch footprint: select_scratch_slots(N) ~ N u32 slots (one write index per input element). See Scratch space.
Example - keep the flagged elements:
arr = qd.field(qd.i32, shape=N)
flags = qd.field(qd.i32, shape=N)
out = qd.field(qd.i32, shape=N) # len(out) >= len(arr)
num_out = qd.field(qd.i32, shape=1)
scratch = qd.field(qd.u32, shape=qd.algorithms.select_scratch_slots(N))
count = qd.field(qd.i32, shape=1)
arr.from_numpy(np.array([10, 11, 12, 13, 14, 15, 16, 17], dtype=np.int32))
flags.from_numpy(np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=np.int32))
count.from_numpy(np.array([N], dtype=np.int32))
@qd.kernel
def run():
qd.algorithms.select(arr, flags, out, num_out, scratch, count[0], D)
run()
k = int(num_out.to_numpy()[0])
print(k) # 4
print(out.to_numpy()[:k]) # [10 12 13 16] (the flagged elements, in input order)
qd.algorithms.sort#
Ascending in-place LSB radix sort over a 1-D tensor of 32-bit or 64-bit scalar keys (u32 / i32 / f32 / u64 / i64 / f64), with optional lock-step permutation of a values tensor (key-value sort). Called as a @qd.func at the top level of your own @qd.kernel so the sort composes with your other phases into one compiled kernel / captured graph:
sort(keys, tmp_keys, values, tmp_values, scratch, n, key_dtype, has_values, end_bit, log256_max_n)
Here n is a 0-d i32 ndarray and the compile-time flags (key_dtype, has_values, end_bit, log256_max_n) are passed explicitly (see Common conventions for n / key_dtype / log256_max_n). Pass real values / tmp_values with has_values=True for a key-value sort; for a keys-only sort pass keys / tmp_keys again in the values / tmp_values slots as placeholders and set has_values=False (every value access is has_values-guarded). The func does no host-side validation or scratch-sufficiency check (a DtoH would defeat graph capture), so size scratch correctly up front.
Arguments:
keys: 1-D tensor (qd.field,qd.ndarray, orqd.Tensor). Sorted in place.tmp_keys: ping-pong workspace, same shape & dtype askeys, distinct buffer. Contents on return are intermediate and should be considered garbage.values,tmp_values: the key-value buffers (any supported scalar dtype, independent of the key dtype, same shape askeys, distinct from each other) whenhas_values=True. For a keys-only sort passkeys/tmp_keyshere as placeholders and sethas_values=False.scratch: required caller-owned 1-Dqd.u32tensor used as the per-pass tile-histogram + scan workspace. Size it withqd.algorithms.sort_scratch_slots(N, log256_max_n)(the footprint is dtype-independent - tile histograms areu32regardless of key width). The func does no sufficiency check, so size it correctly up front. See Scratch space.n: 0-di32ndarray (shape=()) holding the element count on-device (read asn[()]).key_dtype: the key element dtype, passed explicitly (see Common conventions).has_values: compile-time bool - whethervalues/tmp_valuesare real buffers (True) or placeholders (False).end_bit: number of low key bits to sort. Use the full key width (32 for 4-byte keys, 64 for 8-byte) unless the high bits are known to be zero (e.g.16for keys< 2**16, to save passes). Must be a positive multiple of8that yields an even number of digit passes so the result lands back inkeys.log256_max_n: scan depthD(the compile-time capacity; see Common conventions). Sizescratchwith the sameD.
Constraints:
Dtypes: the key dtype and value dtype are each independently one of
{qd.u32, qd.i32, qd.f32, qd.u64, qd.i64, qd.f64}. Narrower scalar dtypes (qd.i16,qd.f16, …) and struct dtypes raiseNotImplementedErrorat compile time. 8-byte keys run 8 digit passes per sort; 4-byte keys run 4. Scratch footprint is the same for both widths (the per-tile histograms areu32regardless).Aliasing:
keysandtmp_keysmust be distinct buffers; same for realvalues/tmp_values.sortdoes not check this (passing the same buffer corrupts the sort).Stability: stable sort - equal keys keep their original input order in the output.
NaN handling (f32): matches
numpy.sort(NaNs land at the end). NaNs are not tested separately and should not be relied on for ordering invariants beyondnumpy.sort.
Scratch footprint: num_blocks * 256 + ... u32 slots (where num_blocks = ceil(N / 256)), plus the scan staircase. Call qd.algorithms.sort_scratch_slots(N, log256_max_n) to get the exact slot count (pure host arithmetic, no device round-trip). See Scratch space.
Example - key-value sort. Unlike the others, sort takes its count as a 0-d tensor (read on-device as n[()]) and its buffers as kernel arguments. Here values rides along with the keys (the original indices), has_values=True, and end_bit=32 is the full i32 key width:
i32_1d = qd.types.ndarray(qd.i32, ndim=1)
u32_1d = qd.types.ndarray(qd.u32, ndim=1)
i32_0d = qd.types.ndarray(qd.i32, ndim=0)
@qd.kernel
def run(keys: i32_1d, tmp_keys: i32_1d, values: i32_1d, tmp_values: i32_1d, scratch: u32_1d, n: i32_0d):
qd.algorithms.sort(keys, tmp_keys, values, tmp_values, scratch, n, qd.i32, True, 32, D)
keys = qd.ndarray(qd.i32, shape=(N,))
tmp_keys = qd.ndarray(qd.i32, shape=(N,))
values = qd.ndarray(qd.i32, shape=(N,))
tmp_values = qd.ndarray(qd.i32, shape=(N,))
scratch = qd.ndarray(qd.u32, shape=(qd.algorithms.sort_scratch_slots(N, D),))
n = qd.ndarray(qd.i32, shape=())
keys.from_numpy(np.array([3, 1, 4, 1, 5, 9, 2, 6], dtype=np.int32))
values.from_numpy(np.arange(N, dtype=np.int32)) # payload: each key's original index
n.fill(N)
run(keys, tmp_keys, values, tmp_values, scratch, n)
print(keys.to_numpy()) # [1 1 2 3 4 5 6 9]
print(values.to_numpy()) # [1 3 6 0 2 4 7 5] (original indices; stable for the tied 1s)
qd.algorithms.reduce_by_key_add#
Collapse every consecutive run of equal keys into a single output entry (unique_key, sum_of_values_in_run). Keys that compare equal but are separated by other keys form separate runs. For a global per-key sum, sort by key first (e.g. with qd.algorithms.sort) and then reduce-by-key. Signature reduce_by_key_add(keys_in, values_in, keys_out, values_out, num_runs, scratch, n, value_dtype, log256_max_n).
Arguments (see Common conventions for n / log256_max_n and the num_runs host-hop rule):
keys_in: 1-D tensor ofu32/i32/f32.values_in: 1-D tensor ofu32/i32/f32, same shape askeys_in.keys_out: 1-D tensor of the same dtype askeys_in, withlen(keys_out) >= len(keys_in)so the worst-case-all-unique run is safe. Onlykeys_out[0 : num_runs[0]]carries meaningful data on return; the tail is untouched.values_out: 1-D tensor of the same dtype asvalues_in, same length requirement. The firstnum_runs[0]slots are overwritten; the tail past that prefix is left untouched.num_runs: 1-elementqd.i32tensor receiving the number of runs.scratch:reduce_by_key_scratch_slots(N)slots - alwaysu32, regardless of key / value dtype. See Scratch space.value_dtype: the values dtype, passed explicitly (used only to write the typed zero intovalues_outbefore the scatter’satomic_add; keys are handled generically).
Constraints:
Dtypes (first land):
keys_in.dtypeandvalues_in.dtypein {qd.i32,qd.u32,qd.f32}. Other dtypes raiseNotImplementedError.Reduction: only
addis exposed for first land.min/maxvariants needatomic_min/atomic_maxforf32, which has spottier cross-backend support; deferred to a follow-up gated on real-world demand.f32 non-associativity: the order of additions inside a run is set by hardware atomic ordering, not host order, so
f32results are not bitwise-equal to a serial scan. Tests tolerate a small relative error.NaN handling (f32 keys):
NaN != NaNis true, so each NaN-keyed element becomes its own run. Consistent with treating NaN as “different from everything”, which matches the run-length-encoding spirit.
Scratch footprint: reduce_by_key_scratch_slots(N) ~ 1.004 * N u32 slots. See Scratch space.
Example - sum consecutive runs of equal keys:
keys_in = qd.field(qd.i32, shape=N)
values_in = qd.field(qd.i32, shape=N)
keys_out = qd.field(qd.i32, shape=N)
values_out = qd.field(qd.i32, shape=N)
num_runs = qd.field(qd.i32, shape=1)
scratch = qd.field(qd.u32, shape=qd.algorithms.reduce_by_key_scratch_slots(N))
count = qd.field(qd.i32, shape=1)
keys_in.from_numpy(np.array([1, 1, 1, 2, 2, 3, 3, 3], dtype=np.int32))
values_in.from_numpy(np.array([5, 2, 1, 4, 4, 6, 1, 1], dtype=np.int32))
count.from_numpy(np.array([N], dtype=np.int32))
@qd.kernel
def run():
qd.algorithms.reduce_by_key_add(keys_in, values_in, keys_out, values_out, num_runs, scratch, count[0], qd.i32, D)
run()
r = int(num_runs.to_numpy()[0])
print(r) # 3
print(keys_out.to_numpy()[:r]) # [1 2 3]
print(values_out.to_numpy()[:r]) # [8 8 8] (5+2+1, 4+4, 6+1+1)
qd.algorithms.parallel_sort(keys, values=None)#
Deprecated. New code should call the LSB radix sort
qd.algorithms.sort(a@qd.func) instead. The radix sort is asymptoticallyO(N log_radix N)rather thanO(N log^2 N), is stable (odd-even merge sort is not), supports 32-bit and 64-bit scalar keys across CUDA / AMDGPU / Vulkan / Metal, and acceptsqd.field,qd.ndarray, andqd.Tensor(parallel_sortis field-only). The only thingparallel_sortis competitive on is very small N (~4K and below); even there the radix path is comparable on modern hardware. To migrate, allocatetmp_keysof the same shape and dtype askeysplus au32scratchbuffer, then callsortat the top level of a kernel (see its section above for the full signature).parallel_sortis kept for one release cycle for backward compat and will be removed thereafter.
In-place sort. Reorders keys ascending; if values is provided, applies the same permutation to values (key-value sort). Both arguments must be 1-D qd.field - parallel_sort reaches into snode.ptr.offset internally, so ndarray is not supported and will fail at compile time with an AttributeError.
import quadrants as qd
keys = qd.field(qd.i32, shape=(N,))
qd.algorithms.parallel_sort(keys)
keys = qd.field(qd.i32, shape=(N,))
vals = qd.field(qd.f32, shape=(N,))
qd.algorithms.parallel_sort(keys, vals)
Algorithm. Batcher’s odd-even merge sort. Time complexity
O(N log^2 N), work-efficient for small / mid-sized arrays.Key dtype. Whatever the key field’s dtype is, as long as
<is meaningful for it (integer and float types).Stability. Odd-even merge sort is not a stable sort - equal keys may be reordered relative to one another. If stability matters, encode tiebreakers into the keys (e.g. pack the original index into the low bits).
Memory. Strictly in-place - no auxiliary buffers from the caller’s perspective.
Performance characteristic. Beats radix-style sorts for small N (roughly N <= 4K).
qd.algorithms.PrefixSumExecutor#
Deprecated. New code should call
qd.algorithms.exclusive_scan_addinstead.PrefixSumExecutoris inclusive-only,i32-only, and CUDA / Vulkan-only; the new functional API covers{i32, u32, f32, i64, u64, f64}on every supported backend and runs the exclusive variant directly. To migrate from inclusive in-place to exclusive out-of-place, drop theExecutorwrapper, allocate a distinctoutfield, and post-process if you actually need the inclusive form (inclusive[i] = exclusive[i] + arr[i]).PrefixSumExecutoris kept for one release cycle for backward compat and will be removed in a future release.
Inclusive in-place prefix sum (scan) over a 1-D i32 field. Construct once with the array length, then call .run(field) to scan.
psum = qd.algorithms.PrefixSumExecutor(N)
arr = qd.field(qd.i32, shape=(N,))
# ... fill arr ...
psum.run(arr)
# arr now holds the inclusive prefix sum: arr[i] = sum(arr_original[0..=i]).
Constructor:
length: int- the fixed number of elements the executor will scan on every.run()call. Internally allocates an auxiliaryqd.field(i32, shape=padded_length)sized to the Kogge-Stone hierarchy (block size = 64).
run(input_arr):
input_arrmust be a 1-Dqd.field(qd.i32, shape=(length,))- its length must match the constructor’slengthexactly.run()always blitslengthelements betweeninput_arrand the internal buffer; passing a shorter field results in out-of-bounds reads / writes (no runtime check today).Returns nothing;
input_arris overwritten with the scan result.
Constraints:
Dtype:
qd.i32only. Calling with any other dtype raisesRuntimeError("Only qd.i32 type is supported for prefix sum.").Inclusive only. No exclusive variant exposed. To convert to exclusive, post-process:
exclusive[i] = inclusive[i] - input_original[i].Backend coverage. CUDA and Vulkan only. AMDGPU and Metal raise
RuntimeError(f"{arch} is not supported for prefix sum.")at compile time.
The implementation is a Kogge-Stone hierarchical scan: per-block inclusive scan on shared memory, then a small recursive scan over per-block totals, then a uniform-add pass to propagate back. This means the executor reuses the underlying buffer across calls, which is why it’s a class (allocate once, run many times) rather than a free function.
No explicit fence is required between a kernel that writes the input and the subsequent .run() call. .run() launches its own kernels under the hood, and the kernel boundary serializes against prior writes from host-launched kernels.
Under the hood#
Implementation detail for the curious - not needed to use the ops. In every case the func emits a fixed-depth (log256_max_n) staircase of phases inside your kernel; each phase is a separate offloaded launch, so correctness relies on the same launch-boundary serialization as your surrounding top-level loops. BLOCK_DIM = 256 throughout. Each op is implemented from scratch in Quadrants; the classical design it follows is cited in its subsection.
reduce_{add,min,max}#
Fixed-depth tree reduction. Each phase uses
BLOCK_DIM = 256threads per block and reduces 256 elements per block viablock.reduce_{add,min,max}.log256_max_n = 1coversN <= 256;2covers up to256^2 = 65536; and so on. Out-of-range lanes contribute the identity value.The output is written to
out[0].
exclusive_scan_{add,min,max}#
Blelloch’s 1990 work-efficient three-pass exclusive scan, realized on the GPU as a balanced per-block tree in the style of Harris, Sengupta & Owens (GPU Gems 3, ch. 39):
Pass 1 - per-block tile reduce of
arrinto the caller’sscratch(one slot per block).Pass 2 - exclusive-scan the partials buffer in place. For
N <= BLOCK_DIM^2(= 65536) a single block does this. For largerN, the staircase recursesD - 2further levels: another tile-reduce on the partials, a recursive scan, then a downsweep that applies the higher-level prefixes.Pass 3 - per-block tile scan + add the block prefix from scratch. Each block re-reads its tile from
arr, runsblock.exclusive_scanto get per-thread tile prefixes, and adds itsblock_prefixfrom the scanned partials.
Total scratch at N = 1M is exclusive_scan_scratch_slots(N) = 4096 + 16 = 4112 slots (~16 KB for 4-byte dtypes, ~32 KB for 8-byte).
select#
The textbook scan-based compaction (a Blelloch 1990 application of scan):
Exclusive scan of
flagsinto the caller’su32scratch, producing per-element write indices. Same staircase phases asexclusive_scan_add(out-of-place:flagsstays intact for the scatter / count, the indices land inscratch[0:N]and the partials above them).Scatter: a phase reads each
(arr[i], flags[i], indices[i])and, if the flag is set, writesout[indices[i]] = arr[i]. No races, by construction of the exclusive scan over 0 / 1 flags.Count tail: a one-thread phase computes
indices[N-1] + flags[N-1]and stores it innum_out[0].
sort#
Classical histogram-scan-scatter LSB radix sort (Knuth, TAOCP Vol. 3 Sec. 5.2.5; the per-pass digit-histogram scan is Blelloch’s) with 8-bit digits, four passes for
u32/i32/f32(eight for the 64-bit dtypes). Each digit pass is three internal kernels:Histogram - every block computes its per-digit count into shared memory, then publishes the 256-bin tile histogram to the shared u32 scratch (digit-major layout:
tile_histograms[d * num_blocks + b]).Scan - in-place exclusive scan over the flat tile_histograms buffer. The digit-major layout makes a single 1-D scan enough to produce per-(digit, block) global offsets.
Scatter - each block ranks its keys via
block.radix_rank_match_atomic_or(wave32 + wave64 clean), looks up the per-(digit, block) global offset from the scan output, and scatters keys (and values, if provided) to the destination buffer.
After each pass
keys<->tmp_keysare swapped. An even pass count lands the sorted keys back inkeys.Signed-integer (
i32/i64) and floating-point (f32/f64) keys are mapped to a sortable unsigned representation (u32/u64) before the first pass and mapped back after the last via in-place “twiddle” kernels (signed: XOR sign bit; float: flip sign bit on positives, flip all bits on negatives - the standard sortable-key transform).u32/u64keys are sorted directly with no twiddle.
reduce_by_key_add#
Scan + scatter + atomic_add over head flags - no segmented-scan primitive needed; the scan is Blelloch’s (same staircase as exclusive_scan_add):
Head-flag pass.
head_flags[i] = 1ifi == 0orkeys[i] != keys[i-1], else0. Written to the caller’su32scratch (bit-cast fromi32).In-place exclusive scan of
head_flags(same staircase phases asexclusive_scan_add). After this,scratch[i] = sum(head_flags[0:i]).Zero-init
values_out[0:N]. The scatter usesatomic_add; slots must start at the additive identity0.Scatter. For each
i, recomputehead_flag(i)fromkeys[i]/keys[i-1], derive the run indexpos = scratch[i] + head_flag(i) - 1(inclusive scan minus 1), and writekeys_out[pos] = keys[i]+atomic_add(values_out[pos], values[i]).Count.
num_runs[0] = scratch[N-1] + head_flag(N-1).