Streams#
Streams allow concurrent execution of GPU operations. By default, all Quadrants kernels launch on the default stream, which serializes everything. With streams, you can run multiple top-level for loops in parallel.
Supported platforms#
| Backend | Supported |
|---|---|
| CUDA | Yes |
| AMDGPU | Yes |
| CPU | No-op |
| Metal | No-op |
| Vulkan | No-op |
On backends without native stream support, stream operations are no-ops and for loops run serially. Code using streams is portable across all backends — it will run without modifications, but serially.
Stream parallelism#
Inside a `@qd.kernel`, each `with qd.stream_parallel():` block runs on its own GPU stream.
```python
import quadrants as qd

qd.init(arch=qd.cuda)

N = 1024
a = qd.field(qd.f32, shape=(N,))
b = qd.field(qd.f32, shape=(N,))
c = qd.field(qd.f32, shape=(N,))

@qd.kernel
def compute_ab():
    with qd.stream_parallel():
        for i in range(N):
            a[i] = compute_a(i)
    with qd.stream_parallel():
        for j in range(N):
            b[j] = compute_b(j)

@qd.kernel
def combine():
    for i in range(N):
        c[i] = a[i] + b[i]

compute_ab()  # the two stream_parallel blocks run concurrently
combine()     # runs after compute_ab() returns — a[] and b[] are ready
```
Consecutive `with qd.stream_parallel():` blocks run concurrently. Multiple `for` loops within a single block share a stream and run serially on it. All streams are synchronized before the kernel returns.
Restrictions#
- All top-level statements in a kernel must be either all `stream_parallel` blocks or all regular statements. Mixing the two at the top level is a compile-time error.
- Nesting `stream_parallel` blocks is not supported.
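The first restriction can be illustrated with a sketch against the API above (`compute_a` and `compute_b` stand in for any per-element computation, as in the earlier example):

```python
@qd.kernel
def valid():  # OK: every top-level statement is a stream_parallel block
    with qd.stream_parallel():
        for i in range(N):
            a[i] = compute_a(i)
    with qd.stream_parallel():
        for j in range(N):
            b[j] = compute_b(j)

@qd.kernel
def invalid():          # compile-time error: a regular top-level loop...
    for i in range(N):  # ...mixed with a stream_parallel block below
        a[i] = compute_a(i)
    with qd.stream_parallel():
        for j in range(N):
            b[j] = compute_b(j)
```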
Explicit streams#
For cases that require manual control — such as launching separate kernels on different streams or interoperating with PyTorch — you can create and manage streams directly.
Creating and using streams#
Any `@qd.kernel` function accepts a special `qd_stream` keyword argument — you do not need to declare it in the kernel signature. The `@qd.kernel` decorator handles it automatically.
```python
@qd.kernel
def my_kernel():
    for i in range(N):
        a[i] = i

s1 = qd.create_stream()
s2 = qd.create_stream()

my_kernel(qd_stream=s1)
my_kernel(qd_stream=s2)

s1.synchronize()
s2.synchronize()

s1.destroy()
s2.destroy()
```
Kernels on different streams may execute concurrently. Call synchronize() to block until all work on a stream completes.
Events#
Events let you express dependencies between streams without full synchronization.
```python
s1 = qd.create_stream()
s2 = qd.create_stream()

@qd.kernel
def produce():
    for i in range(N):
        a[i] = 10.0

@qd.kernel
def consume():
    for i in range(N):
        b[i] = a[i]

produce(qd_stream=s1)

e = qd.create_event()
e.record(s1)           # record when s1 finishes produce()
e.wait(qd_stream=s2)   # s2 waits for that event before proceeding
consume(qd_stream=s2)  # safe to read a[] — produce() is guaranteed complete

s2.synchronize()

e.destroy()
s1.destroy()
s2.destroy()
```
`e.record(stream)` captures the point in `stream`'s execution. `e.wait(qd_stream=stream)` makes `stream` wait until the recorded point is reached. If `qd_stream` is omitted, the default stream waits.
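For example, the default-stream case can be sketched against the API above: record an event on an explicit stream, then call `wait()` with no `qd_stream` argument so the default stream blocks on it.

```python
s = qd.create_stream()
produce(qd_stream=s)   # runs on the explicit stream s

e = qd.create_event()
e.record(s)            # capture the point after produce() on s
e.wait()               # qd_stream omitted: the default stream waits

consume()              # launched on the default stream; a[] is ready

qd.sync()
e.destroy()
s.destroy()
```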
Context managers#
Streams and events support `with` blocks for automatic cleanup:

```python
with qd.create_stream() as s:
    some_func1(qd_stream=s)
# s.destroy() called automatically — waits for in-flight work
```
Synchronization notes#
- `qd.sync()` only waits on the default stream. It does not drain explicit streams. Call `stream.synchronize()` on each stream you need to wait for.
- No automatic synchronization with explicit streams. When using explicit streams, you are responsible for inserting events or `synchronize()` calls when one stream's output is another stream's input. `stream_parallel` handles this automatically.
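A minimal sketch of the first point, reusing `my_kernel` from above: after launching on an explicit stream, `qd.sync()` alone is not enough.

```python
s = qd.create_stream()
my_kernel(qd_stream=s)

qd.sync()        # waits only on the default stream; work on s may still be running
s.synchronize()  # now all work submitted to s has completed

s.destroy()
```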
Limitations#
- Not compatible with graphs. Do not pass `qd_stream` to a kernel decorated with `graph=True` (if you do, a `RuntimeError` will be raised).
- Not compatible with autodiff. Do not pass `qd_stream` to a kernel that uses reverse-mode or forward-mode differentiation, or inside a `qd.ad.Tape` context (if you do, a `RuntimeError` will be raised).