Streams#
Streams allow concurrent execution of GPU operations. By default, all Quadrants kernels launch on the default stream, which serializes everything. With streams, you can run multiple top-level for loops in parallel.
Supported platforms#
| Backend | Supported |
|---|---|
| CUDA | Yes |
| AMDGPU | Yes |
| CPU | No-op |
| Metal | No-op |
| Vulkan | No-op |
On backends without native stream support, stream operations are no-ops and for loops run serially. Code using streams is portable across all backends — it will run without modifications, but serially.
Stream parallelism#
Inside a `@qd.kernel`, each `with qd.stream_parallel():` block runs on its own GPU stream.
```python
import quadrants as qd

qd.init(arch=qd.cuda)

N = 1024
a = qd.field(qd.f32, shape=(N,))
b = qd.field(qd.f32, shape=(N,))
c = qd.field(qd.f32, shape=(N,))

@qd.kernel
def compute_ab():
    with qd.stream_parallel():
        for i in range(N):
            a[i] = compute_a(i)
    with qd.stream_parallel():
        for j in range(N):
            b[j] = compute_b(j)

@qd.kernel
def combine():
    for i in range(N):
        c[i] = a[i] + b[i]

compute_ab()  # the two stream_parallel blocks run concurrently
combine()     # runs after compute_ab() returns — a[] and b[] are ready
```
Consecutive `with qd.stream_parallel():` blocks run concurrently. Multiple `for` loops within a single block share a stream and run serially on it. All streams are synchronized before the kernel returns.
Restrictions#
- All top-level statements in a kernel must be either all `stream_parallel` blocks or all regular statements. Mixing the two at the top level is a compile-time error.
- Nesting `stream_parallel` blocks is not supported.
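The first restriction can be illustrated with a sketch against the API above (`compute_a` and `compute_b` stand in for any per-element computation, as in the earlier example):

```python
@qd.kernel
def valid():  # OK: every top-level statement is a stream_parallel block
    with qd.stream_parallel():
        for i in range(N):
            a[i] = compute_a(i)
    with qd.stream_parallel():
        for j in range(N):
            b[j] = compute_b(j)

@qd.kernel
def invalid():          # compile-time error: a regular top-level loop...
    for i in range(N):  # ...mixed with a stream_parallel block below
        a[i] = compute_a(i)
    with qd.stream_parallel():
        for j in range(N):
            b[j] = compute_b(j)
```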
Explicit streams#
For cases that require manual control — such as launching separate kernels on different streams or interoperating with PyTorch — you can create and manage streams directly.
Creating and using streams#
Any `@qd.kernel` function accepts a special `qd_stream` keyword argument — you do not need to declare it in the kernel signature. The `@qd.kernel` decorator handles it automatically.
```python
@qd.kernel
def my_kernel():
    for i in range(N):
        a[i] = i

s1 = qd.create_stream()
s2 = qd.create_stream()

my_kernel(qd_stream=s1)
my_kernel(qd_stream=s2)

s1.synchronize()
s2.synchronize()

s1.destroy()
s2.destroy()
```
Kernels on different streams may execute concurrently. Call synchronize() to block until all work on a stream completes.
Events#
Events let you express dependencies between streams without full synchronization.
```python
s1 = qd.create_stream()
s2 = qd.create_stream()

@qd.kernel
def produce():
    for i in range(N):
        a[i] = 10.0

@qd.kernel
def consume():
    for i in range(N):
        b[i] = a[i]

produce(qd_stream=s1)

e = qd.create_event()
e.record(s1)           # record when s1 finishes produce()
e.wait(qd_stream=s2)   # s2 waits for that event before proceeding
consume(qd_stream=s2)  # safe to read a[] — produce() is guaranteed complete

s2.synchronize()

e.destroy()
s1.destroy()
s2.destroy()
```
`e.record(stream)` captures the point in `stream`'s execution. `e.wait(qd_stream=stream)` makes `stream` wait until the recorded point is reached. If `qd_stream` is omitted, the default stream waits.
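For example, the default-stream case can be sketched against the API above: record an event on an explicit stream, then call `wait()` with no `qd_stream` argument so the default stream blocks on it.

```python
s = qd.create_stream()
produce(qd_stream=s)   # runs on the explicit stream s

e = qd.create_event()
e.record(s)            # capture the point after produce() on s
e.wait()               # qd_stream omitted: the default stream waits

consume()              # launched on the default stream; a[] is ready

qd.sync()
e.destroy()
s.destroy()
```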
Context managers#
Streams and events support `with` blocks for automatic cleanup:

```python
with qd.create_stream() as s:
    some_func1(qd_stream=s)
# s.destroy() called automatically — waits for in-flight work
```
Synchronization notes#
- `qd.sync()` only waits on the default stream. It does not drain explicit streams. Call `stream.synchronize()` on each stream you need to wait for.
- No automatic synchronization with explicit streams. When using explicit streams, you are responsible for inserting events or `synchronize()` calls when one stream's output is another stream's input. `stream_parallel` handles this automatically.
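A minimal sketch of the first point, reusing `my_kernel` from above: after launching on an explicit stream, `qd.sync()` alone is not enough.

```python
s = qd.create_stream()
my_kernel(qd_stream=s)

qd.sync()        # waits only on the default stream; work on s may still be running
s.synchronize()  # now all work submitted to s has completed

s.destroy()
```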
Limitations#
- Not compatible with graphs. Do not pass `qd_stream` to a kernel decorated with `graph=True` (if you do, a `RuntimeError` will be raised).
- Not compatible with autodiff. Do not pass `qd_stream` to a kernel that uses reverse-mode or forward-mode differentiation, or inside a `qd.ad.Tape` context (if you do, a `RuntimeError` will be raised).