Algorithms#
Device-wide algorithms — primitives that consume and produce whole arrays, executed as one or more kernel launches under the hood. They sit one tier above grid-scope synchronization: they use block, subgroup, and grid primitives internally and expose a high-level entry point that the user calls from host (Python) code, not from inside a kernel.
What’s available#
| Op | What it does | CUDA | AMDGPU | Vulkan | Metal |
|---|---|---|---|---|---|
| `parallel_sort` | Odd-even merge sort (in-place, key or key-value) | yes | yes* | yes | yes* |
| `PrefixSumExecutor` | Inclusive in-place prefix sum (i32 only) | yes | no | yes | no |
* parallel_sort runs anywhere a Quadrants kernel runs; portability is inherited from the underlying kernel infrastructure. AMDGPU and Metal coverage is exercised less heavily than CUDA / Vulkan; report any failures.
Semantics#
qd.algorithms.parallel_sort(keys, values=None)#
In-place sort. Reorders `keys` ascending; if `values` is provided, applies the same permutation to `values` (key-value sort). Both arguments must be 1-D `qd.field`: `parallel_sort` reaches into `snode.ptr.offset` internally, so ndarray is not supported and fails at compile time with an `AttributeError`.
```python
import quadrants as qd

# key-only sort
keys = qd.field(qd.i32, shape=(N,))
qd.algorithms.parallel_sort(keys)

# key-value sort
keys = qd.field(qd.i32, shape=(N,))
vals = qd.field(qd.f32, shape=(N,))
qd.algorithms.parallel_sort(keys, vals)
```
Algorithm. Batcher’s odd-even merge sort. Time complexity O(N log² N); work-efficient for small and mid-sized arrays.
Key dtype. Whatever the key field’s dtype is, as long as `<` is meaningful for it (integer and float types).
Stability. Odd-even merge sort is not a stable sort: equal keys may be reordered relative to one another. If stability matters, encode tiebreakers into the keys (e.g. pack the original index into the low bits).
Memory. Strictly in-place; no auxiliary buffers from the caller’s perspective.
Performance. Beats radix-style sorts for small N (roughly N ≲ 4K).
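As a CPU-side illustration (not the library’s GPU implementation), Batcher’s sorting network for a power-of-two input can be sketched in plain Python. The nested loops enumerate the same compare-exchange pairs a sorting-network kernel would execute in parallel:

```python
def odd_even_merge_sort(a):
    """Batcher's odd-even merge sort; len(a) must be a power of two."""
    n = len(a)
    p = 1
    while p < n:          # merge stages of doubling width
        k = p
        while k >= 1:     # compare distance halves each pass
            for j in range(k % p, n - k, 2 * k):
                for i in range(k):
                    # only compare elements within the same 2p-wide block
                    if (i + j) // (2 * p) == (i + j + k) // (2 * p):
                        if a[i + j] > a[i + j + k]:
                            a[i + j], a[i + j + k] = a[i + j + k], a[i + j]
            k //= 2
        p *= 2
    return a
```

Because the compare-exchange pairs within each pass are disjoint, every inner pass maps directly onto one data-parallel kernel launch.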
qd.algorithms.PrefixSumExecutor#
Inclusive in-place prefix sum (scan) over a 1-D i32 field. Construct once with the array length, then call .run(field) to scan.
```python
psum = qd.algorithms.PrefixSumExecutor(N)
arr = qd.field(qd.i32, shape=(N,))
# ... fill arr ...
psum.run(arr)
# arr now holds the inclusive prefix sum: arr[i] = sum(arr_original[0..=i]).
```
Constructor:
`length: int`: the fixed number of elements the executor will scan on every `.run()` call. Internally allocates an auxiliary `qd.field(i32, shape=padded_length)` sized to the Kogge-Stone hierarchy (block size = 64).
`run(input_arr)`:
`input_arr` must be a 1-D `qd.field(qd.i32, shape=(length,))`; its length must match the constructor’s `length` exactly. `.run()` always blits `length` elements between `input_arr` and the internal buffer, so passing a shorter field results in out-of-bounds reads/writes (no runtime check today). Returns nothing; `input_arr` is overwritten with the scan result.
Constraints:
Dtype: `qd.i32` only. Calling with any other dtype raises `RuntimeError("Only qd.i32 type is supported for prefix sum.")`.
Inclusive only. No exclusive variant is exposed. To convert to exclusive, post-process: `exclusive[i] = inclusive[i] - input_original[i]`.
Backend coverage: CUDA and Vulkan only. AMDGPU and Metal raise `RuntimeError(f"{arch} is not supported for prefix sum.")` at compile time.
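The inclusive-to-exclusive conversion is a simple host-side post-process. A plain-Python illustration with arbitrary sample values:

```python
data = [3, 1, 4, 1, 5]

# inclusive scan: what PrefixSumExecutor.run leaves in the field
inclusive, total = [], 0
for x in data:
    total += x
    inclusive.append(total)          # [3, 4, 8, 9, 14]

# exclusive[i] = inclusive[i] - original[i]
exclusive = [inc - x for inc, x in zip(inclusive, data)]
# [0, 3, 4, 8, 9]
```

Note that the conversion needs the original input, so keep a copy of the field before calling `.run()` if you need the exclusive form.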
The implementation is a Kogge-Stone hierarchical scan: per-block inclusive scan on shared memory, then a small recursive scan over per-block totals, then a uniform-add pass to propagate back. This means the executor reuses the underlying buffer across calls, which is why it’s a class (allocate once, run many times) rather than a free function.
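A minimal sequential model of that three-phase structure (per-block Kogge-Stone scan, recursive scan of per-block totals, uniform add), using a toy block size of 4 in place of the real 64:

```python
BLOCK = 4  # illustration only; the document states the real block size is 64

def block_inclusive_scan(block):
    # Kogge-Stone: every element adds the value `offset` positions to its
    # left, doubling offset each round (log2(len) rounds total).
    out = list(block)
    offset = 1
    while offset < len(out):
        out = [out[i] + (out[i - offset] if i >= offset else 0)
               for i in range(len(out))]
        offset *= 2
    return out

def hierarchical_scan(arr):
    # Phase 1: independent per-block inclusive scans.
    blocks = [block_inclusive_scan(arr[i:i + BLOCK])
              for i in range(0, len(arr), BLOCK)]
    # Phase 2: scan the per-block totals, recursing if they overflow a block.
    totals = [b[-1] for b in blocks]
    scanned_totals = (hierarchical_scan(totals) if len(totals) > BLOCK
                      else block_inclusive_scan(totals))
    # Phase 3: uniform add — each block adds the sum of all preceding blocks.
    out = []
    for bi, b in enumerate(blocks):
        base = scanned_totals[bi - 1] if bi > 0 else 0
        out.extend(v + base for v in b)
    return out
```

On the GPU each phase is a separate kernel launch, which is why the auxiliary buffer is preallocated to the whole padded hierarchy up front.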
No explicit fence is required between a kernel that writes the input and the subsequent .run() call. .run() launches its own kernels under the hood, and the kernel boundary serializes against prior writes from host-launched kernels.
Examples#
Sort indices by per-element key#
```python
N = 1000
keys = qd.field(qd.f32, shape=(N,))
indices = qd.field(qd.i32, shape=(N,))

@qd.kernel
def init() -> None:
    for i in range(N):
        keys[i] = qd.random()
        indices[i] = i

init()
qd.algorithms.parallel_sort(keys, indices)
# keys is now ascending; indices[k] is the original index of the k-th smallest key.
```
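If a stable order is needed, the tiebreaker-packing trick from the stability note above can be sketched host-side in plain Python: widen each non-negative integer key and fold the original index into the low bits, so equal keys stay in their original relative order under any (unstable) comparison sort. On the GPU this would correspond to packing into an i64 key field (a hypothetical layout, not an API the library provides):

```python
keys = [5, 3, 5, 3, 5]   # duplicates; an unstable sort may shuffle them
n_bits = 32              # low bits reserved for the index; must cover len(keys)

# pack: (key << n_bits) | index — equal keys now differ by original index
packed = [(k << n_bits) | i for i, k in enumerate(keys)]
packed.sort()            # stand-in for parallel_sort on the packed keys

# unpack: high bits = key, low bits = original index
order = [(p >> n_bits, p & ((1 << n_bits) - 1)) for p in packed]
# [(3, 1), (3, 3), (5, 0), (5, 2), (5, 4)]
```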
Compact-array offsets via prefix sum#
```python
N = 100_000
flags = qd.field(qd.i32, shape=(N,))    # 0 or 1 per element
offsets = qd.field(qd.i32, shape=(N,))

@qd.kernel
def populate(data: qd.types.NDArray[qd.f32, 1], threshold: qd.f32) -> None:
    for i in range(N):
        flags[i] = 1 if data[i] > threshold else 0

@qd.kernel
def copy_flags() -> None:
    for i in range(N):
        offsets[i] = flags[i]

scan = qd.algorithms.PrefixSumExecutor(N)
populate(data, 0.5)
copy_flags()
scan.run(offsets)
# offsets[i] is now the 1-based output position of element i if it was selected.
```
The compact-output kernel reads offsets[i] (or offsets[i] - flags[i] for 0-based) to decide where to write surviving elements. This is the textbook scan-based select / compact pattern; the only Quadrants-specific note is the i32-only restriction.
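End to end, the select/compact pattern reduces to the following plain-Python stand-in for the GPU kernels (sample data and threshold are arbitrary):

```python
data = [0.9, 0.1, 0.7, 0.3, 0.8]
threshold = 0.5

# flag pass: 1 if the element survives, else 0
flags = [1 if x > threshold else 0 for x in data]    # [1, 0, 1, 0, 1]

# inclusive prefix sum of the flags (what PrefixSumExecutor.run computes)
offsets, total = [], 0
for f in flags:
    total += f
    offsets.append(total)                             # [1, 1, 2, 2, 3]

# compact pass: offsets[i] - 1 is the 0-based slot for selected element i
out = [None] * offsets[-1]
for i, x in enumerate(data):
    if flags[i]:
        out[offsets[i] - 1] = x
# out == [0.9, 0.7, 0.8]
```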