# Streams

Streams allow concurrent execution of GPU operations. By default, all Quadrants kernels launch on the default stream, which serializes everything. With streams, you can run multiple top-level for loops in parallel.

## Supported platforms

| Backend | Supported |
|---------|-----------|
| CUDA | Yes |
| AMDGPU | Yes |
| CPU | No-op |
| Metal | No-op |
| Vulkan | No-op |

On backends without native stream support, stream operations are no-ops and for loops run serially. Code using streams is portable across all backends: it runs without modification, but serially on the no-op backends.

## Stream parallelism

Inside a `@qd.kernel`, each `with qd.stream_parallel():` block runs on its own GPU stream.

```python
import quadrants as qd

qd.init(arch=qd.cuda)

N = 1024
a = qd.field(qd.f32, shape=(N,))
b = qd.field(qd.f32, shape=(N,))
c = qd.field(qd.f32, shape=(N,))

# compute_a and compute_b stand in for user-defined device functions (not shown).

@qd.kernel
def compute_ab():
    with qd.stream_parallel():
        for i in range(N):
            a[i] = compute_a(i)
    with qd.stream_parallel():
        for j in range(N):
            b[j] = compute_b(j)

@qd.kernel
def combine():
    for i in range(N):
        c[i] = a[i] + b[i]

compute_ab()  # the two stream_parallel blocks run concurrently
combine()     # runs after compute_ab() returns; a[] and b[] are ready
```

Consecutive `with qd.stream_parallel():` blocks run concurrently. Multiple for loops within a single block share a stream and run serially on it. All streams are synchronized before the kernel returns.

### Restrictions

- All top-level statements in a kernel must be either all `stream_parallel` blocks or all regular statements. Mixing the two at the top level is a compile-time error.
- Nesting `stream_parallel` blocks is not supported.

## Explicit streams

For cases that require manual control, such as launching separate kernels on different streams or interoperating with PyTorch, you can create and manage streams directly.

### Creating and using streams

Any `@qd.kernel` function accepts a special `qd_stream` keyword argument. You do not need to declare it in the kernel signature; the `@qd.kernel` decorator handles it automatically.

```python
@qd.kernel
def my_kernel():
    for i in range(N):
        a[i] = i

s1 = qd.create_stream()
s2 = qd.create_stream()

my_kernel(qd_stream=s1)
my_kernel(qd_stream=s2)

s1.synchronize()
s2.synchronize()

s1.destroy()
s2.destroy()
```

Kernels on different streams may execute concurrently. Call `synchronize()` to block until all work on a stream completes.

### Events

Events let you express dependencies between streams without full synchronization.

```python
s1 = qd.create_stream()
s2 = qd.create_stream()

@qd.kernel
def produce():
    for i in range(N):
        a[i] = 10.0

@qd.kernel
def consume():
    for i in range(N):
        b[i] = a[i]

produce(qd_stream=s1)

e = qd.create_event()
e.record(s1)            # record when s1 finishes produce()
e.wait(qd_stream=s2)    # s2 waits for that event before proceeding

consume(qd_stream=s2)   # safe to read a[]: produce() is guaranteed complete

s2.synchronize()

e.destroy()
s1.destroy()
s2.destroy()
```

`e.record(stream)` captures the current point in `stream`'s execution. `e.wait(qd_stream=stream)` makes `stream` wait until the recorded point is reached. If `qd_stream` is omitted, the default stream waits.

### Context managers

Streams and events support `with` blocks for automatic cleanup:

```python
with qd.create_stream() as s:
    some_func1(qd_stream=s)
# s.destroy() called automatically; it waits for in-flight work
```

## Synchronization notes

- **`qd.sync()` only waits on the default stream.** It does not drain explicit streams. Call `stream.synchronize()` on each stream you need to wait for.
- **No automatic synchronization with explicit streams.** When using explicit streams, you are responsible for inserting events or `synchronize()` calls when one stream's output is another stream's input. `stream_parallel` handles this automatically.

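
Because `qd.sync()` does not drain explicit streams, code that juggles several of them may find a small drain helper convenient. The following is a minimal sketch, not part of Quadrants: `synchronize_all` is a hypothetical name, and it assumes only that each object exposes the `synchronize()` method described above.

```python
from typing import Iterable, Protocol


class SupportsSynchronize(Protocol):
    """Anything with a blocking synchronize() method, e.g. an explicit stream."""

    def synchronize(self) -> None: ...


def synchronize_all(streams: Iterable[SupportsSynchronize]) -> int:
    """Block until every stream in `streams` has finished its queued work.

    Hypothetical helper: it simply calls synchronize() on each stream in turn
    and returns how many streams were drained.
    """
    drained = 0
    for stream in streams:
        stream.synchronize()
        drained += 1
    return drained
```

With the streams from the earlier examples, `synchronize_all([s1, s2])` would stand in for the two separate `synchronize()` calls. It does not replace events: it blocks the host, so it suits the end of a batch of work rather than ordering between streams.
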
## Limitations

- **Not compatible with graphs.** Do not pass `qd_stream` to a kernel decorated with `graph=True` (if you do, a `RuntimeError` will be raised).
- **Not compatible with autodiff.** Do not pass `qd_stream` to a kernel that uses reverse-mode or forward-mode differentiation, or inside a `qd.ad.Tape` context (if you do, a `RuntimeError` will be raised).
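
When the same call site launches both stream-capable kernels and kernels covered by the limitations above, one option is to route launches through a thin wrapper that adds `qd_stream` only when a stream is actually supplied. This is a sketch under assumptions: `launch` is a hypothetical helper, not a Quadrants API, and it does not detect graph or autodiff kernels itself. The caller still decides whether to pass a stream.

```python
from typing import Any, Callable, Optional


def launch(
    kernel: Callable[..., Any],
    *args: Any,
    stream: Optional[Any] = None,
    **kwargs: Any,
) -> Any:
    """Call `kernel`, passing `qd_stream` only when `stream` is not None."""
    if stream is None:
        # Omit the keyword entirely so stream-incompatible kernels never see it.
        return kernel(*args, **kwargs)
    return kernel(*args, qd_stream=stream, **kwargs)
```

A call such as `launch(my_kernel, stream=s1)` would then be equivalent to `my_kernel(qd_stream=s1)`, while `launch(my_kernel)` launches on the default stream.
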