Streams#

Streams allow GPU operations to execute concurrently. By default, every Quadrants kernel launches on the default stream, so all operations run one after another. With streams, multiple top-level for loops can run in parallel.

Supported platforms#

Backend    Supported
CUDA       Yes
AMDGPU     Yes
CPU        No-op
Metal      No-op
Vulkan     No-op

On backends without native stream support, stream operations are no-ops and for loops run serially. Code that uses streams is therefore portable: it runs unmodified on every backend, just without concurrency on the no-op backends.

Stream parallelism#

Inside a @qd.kernel, each with qd.stream_parallel(): block runs on its own GPU stream.

import quadrants as qd

qd.init(arch=qd.cuda)

N = 1024
a = qd.field(qd.f32, shape=(N,))
b = qd.field(qd.f32, shape=(N,))
c = qd.field(qd.f32, shape=(N,))

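# compute_a() and compute_b() are placeholders for your own functions
# that can be called from inside a kernel.
@qd.kernel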
def compute_ab():
    with qd.stream_parallel():
        for i in range(N):
            a[i] = compute_a(i)
    with qd.stream_parallel():
        for j in range(N):
            b[j] = compute_b(j)

@qd.kernel
def combine():
    for i in range(N):
        c[i] = a[i] + b[i]

compute_ab()  # the two stream_parallel blocks run concurrently
combine()     # runs after compute_ab() returns — a[] and b[] are ready

Consecutive with qd.stream_parallel(): blocks run concurrently. Multiple for loops within a single block share a stream and run serially on it. All streams are synchronized before the kernel returns.
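
The ordering rules above can be mimicked in plain Python, which may help when reasoning about them without a GPU. The sketch below is only an analogy and not how Quadrants is implemented: the hypothetical run_kernel helper models each stream_parallel block as a thread, loops inside one block run in order on that thread, and joining every thread stands in for the synchronization that happens before the kernel returns.

```python
import threading

def run_kernel(blocks):
    """Model a kernel whose top level consists only of stream_parallel blocks.

    `blocks` is a list of blocks; each block is a list of zero-argument
    callables that stand in for top-level for loops.
    """
    def run_block(loops):
        # Loops in the same block share a stream, so they run in order.
        for loop in loops:
            loop()

    threads = [threading.Thread(target=run_block, args=(loops,)) for loops in blocks]
    for t in threads:
        t.start()  # different blocks may overlap with each other
    for t in threads:
        t.join()   # every "stream" is drained before the kernel returns

N = 8
a = [0.0] * N
b = [0.0] * N
log = []

def fill_a():
    for i in range(N):
        a[i] = i * 2.0
    log.append("fill_a")

def bump_a():
    for i in range(N):
        a[i] += 1.0
    log.append("bump_a")

def fill_b():
    for j in range(N):
        b[j] = j * 3.0
    log.append("fill_b")

# Block 1 holds two loops (serial on one stream); block 2 holds one loop.
run_kernel([[fill_a, bump_a], [fill_b]])
c = [a[i] + b[i] for i in range(N)]  # safe: both blocks have finished
```

In this model, fill_a always finishes before bump_a starts, while fill_b may run at any time relative to either of them.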

Restrictions#

  • All top-level statements in a kernel must be either all stream_parallel blocks or all regular statements. Mixing the two at the top level is a compile-time error.

  • Nesting stream_parallel blocks is not supported.

Explicit streams#

For cases that require manual control — such as launching separate kernels on different streams or interoperating with PyTorch — you can create and manage streams directly.

Creating and using streams#

Any @qd.kernel function accepts a special qd_stream keyword argument — you do not need to declare it in the kernel signature. The @qd.kernel decorator handles it automatically.

@qd.kernel
def my_kernel():
    for i in range(N):
        a[i] = i

s1 = qd.create_stream()
s2 = qd.create_stream()

my_kernel(qd_stream=s1)
my_kernel(qd_stream=s2)

s1.synchronize()
s2.synchronize()

s1.destroy()
s2.destroy()

Kernels on different streams may execute concurrently. Call synchronize() to block until all work on a stream completes.
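
If you launch several kernels this way, the create, launch, synchronize, destroy sequence can be wrapped in a small helper. The following is a minimal sketch, not part of Quadrants: launch_on_streams is a hypothetical name, and the qd module is passed in as a parameter only so the helper can be exercised with a stand-in object on machines without a GPU backend. It uses nothing beyond the calls shown above.

```python
def launch_on_streams(qd, kernels):
    """Launch each kernel on its own stream, wait for all of them, then clean up."""
    streams = []
    try:
        for kernel in kernels:
            stream = qd.create_stream()
            streams.append(stream)
            kernel(qd_stream=stream)  # kernels on different streams may overlap
        for stream in streams:
            stream.synchronize()      # block until this stream's work is done
    finally:
        # Destroy every stream that was created, even if a launch failed.
        for stream in streams:
            stream.destroy()
```

With Quadrants initialized, launch_on_streams(qd, [my_kernel, my_kernel]) would reproduce the example above.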

Events#

Events let you express dependencies between streams without full synchronization.

s1 = qd.create_stream()
s2 = qd.create_stream()

@qd.kernel
def produce():
    for i in range(N):
        a[i] = 10.0

@qd.kernel
def consume():
    for i in range(N):
        b[i] = a[i]

produce(qd_stream=s1)

e = qd.create_event()
e.record(s1)          # mark the point in s1 right after produce()
e.wait(qd_stream=s2)  # s2 waits for that point before running later work

consume(qd_stream=s2)  # safe to read a[] — produce() is guaranteed complete
s2.synchronize()

e.destroy()
s1.destroy()
s2.destroy()

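e.record(stream) captures the current point in stream's execution, that is, all work submitted to stream so far. e.wait(qd_stream=stream) makes stream wait until that recorded point is reached before it runs anything submitted afterwards. If qd_stream is omitted, the default stream waits.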
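
The same record and wait pair extends to a consumer that depends on more than one producer. The sketch below is an illustration under assumptions rather than a Quadrants API: fan_in is a hypothetical helper, the qd module is passed in so the function can be tried with a stand-in object, and it assumes one stream may wait on several events in turn.

```python
def fan_in(qd, producers, consumer):
    """Run each producer on its own stream, then run consumer after all of them."""
    producer_streams = [qd.create_stream() for _ in producers]
    consumer_stream = qd.create_stream()
    events = []
    try:
        for kernel, stream in zip(producers, producer_streams):
            kernel(qd_stream=stream)
            event = qd.create_event()
            events.append(event)
            event.record(stream)                   # mark the point after this producer
            event.wait(qd_stream=consumer_stream)  # consumer stream waits for that point
        consumer(qd_stream=consumer_stream)        # runs only after every recorded point
        consumer_stream.synchronize()
    finally:
        for event in events:
            event.destroy()
        for stream in producer_streams + [consumer_stream]:
            stream.destroy()
```

For example, with two hypothetical single-loop kernels fill_a and fill_b, fan_in(qd, [fill_a, fill_b], combine) lets the two fills overlap and starts combine() only once both have completed.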

Context managers#

Streams and events support with blocks for automatic cleanup:

with qd.create_stream() as s:
    my_kernel(qd_stream=s)
# s.destroy() is called automatically on exit and waits for in-flight work
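
Several streams and an event can share one with statement. The sketch below rewrites the producer and consumer pattern from the Events section in that style. It assumes that an event's with block yields the event itself, in the same way the stream example yields s; produce_then_consume is a hypothetical helper, and the qd module is passed in so the function can be tried with a stand-in object.

```python
def produce_then_consume(qd, produce, consume):
    """Run produce on one stream and consume on another, ordered by an event."""
    with qd.create_stream() as s1, qd.create_stream() as s2, qd.create_event() as e:
        produce(qd_stream=s1)
        e.record(s1)          # mark the point in s1 after produce()
        e.wait(qd_stream=s2)  # s2 waits for that point
        consume(qd_stream=s2)
        s2.synchronize()
    # Python exits context managers in reverse order: e, then s2, then s1.
```

Compared with explicit destroy() calls, this form also cleans up when a launch raises an exception.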

Synchronization notes#

  • qd.sync() waits only on the default stream. It does not drain explicit streams. Call stream.synchronize() on each stream you need to wait for.

  • No automatic synchronization with explicit streams. You are responsible for inserting events or synchronize() calls whenever one stream's output is another stream's input. stream_parallel, by contrast, synchronizes all of its streams before the kernel returns.
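
Because qd.sync() does not cover explicit streams, a full barrier has to combine both kinds of wait. A minimal sketch, assuming you keep your own list of the streams you created (wait_for_all is a hypothetical helper, and the qd module is passed in so it can be tried with a stand-in object):

```python
def wait_for_all(qd, streams):
    """Block until the given explicit streams and the default stream are idle."""
    for stream in streams:
        stream.synchronize()  # drains one explicit stream
    qd.sync()                 # waits on the default stream only
```

Call it before reading results on the host when work may still be pending on any stream.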

Limitations#

  • Not compatible with graphs. Passing qd_stream to a kernel decorated with graph=True raises a RuntimeError.

  • Not compatible with autodiff. Passing qd_stream to a kernel that uses reverse-mode or forward-mode differentiation, or passing it inside a qd.ad.Tape context, raises a RuntimeError.