Graph#

Graphs reduce kernel launch overhead by capturing a sequence of GPU operations into a graph, then replaying it in a single launch.

Backend support#

Both graph=True and graph_do_while run on every backend. graph=True is hardware accelerated on CUDA (via CUDA graphs) and AMDGPU (via HIP graphs); graph_do_while is hardware accelerated only on CUDA SM 9.0+ (Hopper). On other backends, graph=True is silently ignored and the kernel runs via the normal launch path. Wherever graph_do_while has no hardware-accelerated path, it falls back to a host-side do-while loop that copies the condition value from GPU to host each iteration, causing a pipeline stall (see Caveats).

Feature	`qd.cuda` SM 9.0+	`qd.cuda` < SM 9.0	`qd.amdgpu`	`qd.metal`	`qd.vulkan`	`qd.cpu`
`graph=True`	hardware accelerated	hardware accelerated	hardware accelerated	runs (no acceleration)	runs (no acceleration)	runs (no acceleration)
`graph_do_while`	hardware accelerated	host fallback	host fallback	host fallback	host fallback	host fallback

AMDGPU graph_do_while falls back to the host-side loop because HIP does not currently expose conditional / while graph nodes (as of ROCm 7.2).
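
The table can also be read as a small lookup. The sketch below is plain Python and does not call Quadrants; the function names and the backend strings are hypothetical, chosen only to mirror the table columns.

```python
from typing import Optional, Tuple

# Backends where graph=True replays a native GPU graph, per the table above.
GRAPH_ACCELERATED = {"cuda", "amdgpu"}


def graph_path(backend: str) -> str:
    """Execution path for graph=True on the given backend (hypothetical helper)."""
    if backend in GRAPH_ACCELERATED:
        return "hardware accelerated"
    return "runs (no acceleration)"


def do_while_path(backend: str, sm: Optional[Tuple[int, int]] = None) -> str:
    """Execution path for graph_do_while; sm is a CUDA (major, minor) pair."""
    if backend == "cuda" and sm is not None and sm >= (9, 0):
        return "hardware accelerated"
    return "host fallback"


print(graph_path("amdgpu"), "/", do_while_path("amdgpu"))
```

For AMDGPU this reports an accelerated graph=True path but a host fallback for graph_do_while, matching the note above.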

Basic usage#

Add graph=True to a @qd.kernel decorator:

@qd.kernel(graph=True)
def my_kernel(
    x: qd.types.ndarray(qd.f32, ndim=1),
    y: qd.types.ndarray(qd.f32, ndim=1),
):
    for i in range(x.shape[0]):
        x[i] = x[i] + 1.0
    for i in range(y.shape[0]):
        y[i] = y[i] + 2.0

The top-level for-loops are compiled into a single graph. The parallelism is unchanged, but launch latency is much lower.

The kernel is used normally — no other API changes are needed:

x = qd.ndarray(qd.f32, shape=(1024,))
y = qd.ndarray(qd.f32, shape=(1024,))

my_kernel(x, y)  # first call: builds and caches the graph
my_kernel(x, y)  # subsequent calls: replays the cached graph

This works the same way on CUDA and AMDGPU. The cache is keyed per (compiled-kernel-specialization, launch-id), so different template instantiations (different field bindings, etc.) get their own cached graph.
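
As a rough mental model of that keying, the following plain-Python sketch builds a graph once per (specialization, launch-id) pair and reuses it afterwards. It is not the Quadrants implementation: the GraphCache class, the string specialization names, and the placeholder graph objects are all assumptions made for illustration.

```python
from typing import Callable, Dict, Hashable, Tuple


class GraphCache:
    """Toy model: one cached graph per (specialization, launch_id) key."""

    def __init__(self) -> None:
        self._graphs: Dict[Tuple[Hashable, int], object] = {}
        self.builds = 0  # how many times a graph had to be built

    def get_or_build(self, specialization: Hashable, launch_id: int,
                     build: Callable[[], object]) -> object:
        key = (specialization, launch_id)
        if key not in self._graphs:
            # First launch for this key: capture and cache the graph.
            self._graphs[key] = build()
            self.builds += 1
        # Later launches with the same key replay the cached graph.
        return self._graphs[key]


cache = GraphCache()
g1 = cache.get_or_build("my_kernel<field_a>", 0, object)  # builds
g2 = cache.get_or_build("my_kernel<field_a>", 0, object)  # reuses g1
g3 = cache.get_or_build("my_kernel<field_b>", 0, object)  # separate entry
print(cache.builds)
```

Two distinct specializations yield two builds, and the repeated launch reuses the first graph.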

Restrictions#

No struct return values. Kernels that return values (e.g. -> qd.i32) cannot use graphs. An error is raised if graph=True is set on such a kernel.
Primal kernels only. The graph=True flag is applied to the primal (forward) kernel only, not its adjoint. Autodiff kernels use the normal launch path.
Device-resident ndarrays. Graph mode bakes device pointers into the cached graph, so all ndarray arguments must be on the GPU. Passing a host-resident ndarray raises an error.
qd_stream is incompatible with graph=True. Choose one or the other.

Passing different arguments#

You can pass different ndarrays to the same kernel on subsequent calls. The cached graph is replayed with the updated arguments — no graph rebuild occurs:

x1 = qd.ndarray(qd.f32, shape=(1024,))
y1 = qd.ndarray(qd.f32, shape=(1024,))
my_kernel(x1, y1)  # builds graph

x2 = qd.ndarray(qd.f32, shape=(1024,))
y2 = qd.ndarray(qd.f32, shape=(1024,))
my_kernel(x2, y2)  # replays graph with new array pointers

Fields as arguments#

When different fields are passed as template arguments, each unique combination of fields produces a separately compiled kernel with its own graph cache entry. There is no interference between them.

GPU-side iteration with `graph_do_while`#

For iterative algorithms (physics solvers, convergence loops), you often want to repeat the kernel body until a condition is met, without returning to the host each iteration. Use while qd.graph_do_while(flag): inside a graph=True kernel:

@qd.kernel(graph=True)
def solve(x: qd.types.ndarray(qd.f32, ndim=1),
          counter: qd.types.ndarray(qd.i32, ndim=0)):
    while qd.graph_do_while(counter):
        for i in range(x.shape[0]):
            x[i] = x[i] + 1.0
        for i in range(1):
            counter[()] = counter[()] - 1

import numpy as np

N = 1024  # example array length
x = qd.ndarray(qd.f32, shape=(N,))
counter = qd.ndarray(qd.i32, shape=())
counter.from_numpy(np.array(10, dtype=np.int32))
solve(x, counter)
# x is now incremented 10 times; counter is 0

The argument to qd.graph_do_while() must be the name of a scalar qd.i32 ndarray parameter. The loop body repeats while this value is non-zero.

On CUDA SM 9.0+ (Hopper), this uses CUDA conditional while nodes — the entire iteration runs on the GPU with no host involvement.
On older CUDA GPUs, AMDGPU, and non-GPU backends, it falls back to a host-side do-while loop (see Caveats and the backend support table).

Patterns#

Counter-based: set the counter to N, decrement each iteration. The body runs exactly N times.

@qd.kernel(graph=True)
def iterate(x: qd.types.ndarray(qd.f32, ndim=1),
            counter: qd.types.ndarray(qd.i32, ndim=0)):
    while qd.graph_do_while(counter):
        for i in range(x.shape[0]):
            x[i] = x[i] + 1.0
        for i in range(1):
            counter[()] = counter[()] - 1

Boolean flag: set a keep_going flag to 1, have the kernel set it to 0 when a convergence criterion is met.

@qd.kernel(graph=True)
def converge(x: qd.types.ndarray(qd.f32, ndim=1),
             keep_going: qd.types.ndarray(qd.i32, ndim=0)):
    while qd.graph_do_while(keep_going):
        for i in range(x.shape[0]):
            # ... do work ...
            pass
        for i in range(1):
            if some_condition(x):
                keep_going[()] = 0

Do-while semantics#

graph_do_while has do-while semantics: the kernel body always executes at least once before the condition is checked. This matches the behavior of CUDA conditional while nodes. For a counter-based loop, the flag value must therefore be >= 1 at launch time. Passing 0 to a kernel that decrements the counter causes an infinite loop, because the body runs first and takes the counter to -1, which is still non-zero when the condition is checked.
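
The consequence for the counter pattern can be checked with a host-only model. This sketch uses ordinary Python integers rather than Quadrants; run_do_while and its max_iters safety cap are hypothetical and exist only so the non-terminating case can be demonstrated safely.

```python
from typing import Tuple


def run_do_while(counter: int, max_iters: int = 1000) -> Tuple[int, int, bool]:
    """Model of a do-while loop whose body decrements the counter.

    Returns (iterations, final_counter, hit_cap).
    """
    iterations = 0
    while True:
        counter -= 1           # loop body runs before any check
        iterations += 1
        if counter == 0:       # condition: keep going while non-zero
            return iterations, counter, False
        if iterations >= max_iters:
            # Safety cap for this model only; the real loop has no such cap.
            return iterations, counter, True


print(run_do_while(10))
print(run_do_while(0, max_iters=100))
```

Starting at 10 the body runs ten times and stops at zero. Starting at 0 the counter goes negative on the first pass and the loop only ends because of the model's cap.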

ndarray vs field#

The parameter used by graph_do_while MUST be an ndarray.

However, other parameters can be any supported Quadrants kernel parameter type.

Restrictions#

The same physical ndarray must be used for the counter parameter on every call. Passing a different ndarray raises an error, because the counter’s device pointer is baked into the graph at creation time.
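
One way to picture this restriction is a graph object that remembers which counter it was built with. The sketch below is a plain-Python model under that assumption; DoWhileGraph, the list standing in for an ndarray, and the ValueError are illustrative only, and the actual error raised by Quadrants may differ.

```python
class DoWhileGraph:
    """Toy model: the counter object is fixed when the graph is first built."""

    def __init__(self) -> None:
        self._counter = None

    def launch(self, counter: list) -> None:
        if self._counter is None:
            # First call: "bake" the counter into the graph.
            self._counter = counter
        elif counter is not self._counter:
            # Later calls must pass the very same object.
            raise ValueError("graph_do_while counter must be the same ndarray on every call")


graph = DoWhileGraph()
counter_a = [10]
counter_b = [10]
graph.launch(counter_a)      # builds
graph.launch(counter_a)      # replays
try:
    graph.launch(counter_b)  # different object with equal contents
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```

Note that equal contents are not enough in this model: identity of the counter object is what matters, mirroring the baked-in device pointer.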

Caveats#

On platforms without native device-side conditional graph nodes, currently CUDA pre-SM 9.0 and AMDGPU (HIP has no conditional / while node API as of ROCm 7.2), the value of the graph_do_while parameter is copied from the GPU to the host each iteration to check whether to continue iterating. This causes a GPU pipeline stall. At the end of each loop iteration, the host must:

wait for the GPU async queue to finish processing
copy the condition value to the host
evaluate the condition value on the host
launch new kernels for the next loop iteration, if not finished yet
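
The steps above can be sketched with a simulated device. Everything here is a stand-in: FakeDevice, its queue, and the sync/copy counters are hypothetical and only illustrate where the host has to wait on each iteration.

```python
from typing import Callable, List


class FakeDevice:
    """Simulated GPU: kernels are queued and only run on synchronize()."""

    def __init__(self, counter: int) -> None:
        self.counter = counter  # stands in for the device-resident flag
        self.pending: List[Callable[["FakeDevice"], None]] = []
        self.syncs = 0
        self.copies = 0

    def launch(self, kernel: Callable[["FakeDevice"], None]) -> None:
        self.pending.append(kernel)  # asynchronous enqueue

    def synchronize(self) -> None:
        for kernel in self.pending:  # drain the queue (the stall)
            kernel(self)
        self.pending.clear()
        self.syncs += 1

    def read_counter(self) -> int:
        self.copies += 1             # device-to-host copy
        return self.counter


def decrement(dev: FakeDevice) -> None:
    dev.counter -= 1


def host_do_while(dev: FakeDevice, body: List[Callable[[FakeDevice], None]]) -> int:
    iterations = 0
    while True:
        for kernel in body:
            dev.launch(kernel)       # launch kernels for this iteration
        dev.synchronize()            # 1. wait for the queue to finish
        flag = dev.read_counter()    # 2. copy the condition value to the host
        iterations += 1
        if flag == 0:                # 3. evaluate the condition on the host
            return iterations        # 4. otherwise loop and launch again


dev = FakeDevice(counter=3)
n = host_do_while(dev, [decrement])
print(n, dev.syncs, dev.copies)
```

Each iteration costs one synchronization and one copy, which is the overhead the hardware-accelerated path avoids.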

Note: the basic graph=True path (without graph_do_while) does not stall the host like this on either CUDA or AMDGPU — the entire kernel sequence runs as a single GPU-side graph replay.

Therefore, on platforms without native support, consider writing a second implementation that works differently, e.g.:

use a fixed number of loop iterations, so that kernel launches do not depend on GPU data; possibly combined with:
making each kernel short-circuit (exit quickly) if the task has already been completed, to avoid running the GPU more than necessary
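
A minimal host-only sketch of that alternative follows. The halving update, the tolerance, and the names step and solve_fixed are assumptions picked for illustration; in a real kernel the done flag would live in device memory and be checked at the top of each kernel.

```python
from typing import Dict


def step(state: Dict[str, float]) -> None:
    """One 'kernel' launch: exits immediately if the task is already done."""
    if state["done"]:
        state["skipped"] += 1  # short-circuit: no real work performed
        return
    state["x"] *= 0.5          # placeholder for the actual work
    state["work"] += 1
    if state["x"] < state["tol"]:
        state["done"] = 1      # convergence reached; later steps will skip


def solve_fixed(x0: float, tol: float, max_iters: int) -> Dict[str, float]:
    state = {"x": x0, "tol": tol, "done": 0, "work": 0, "skipped": 0}
    # Fixed trip count: the host never reads the flag to decide on a launch.
    for _ in range(max_iters):
        step(state)
    return state


result = solve_fixed(x0=1.0, tol=0.1, max_iters=10)
print(result["work"], result["skipped"])
```

The trade-off is that all max_iters launches are still issued, so the fixed count should be a reasonable upper bound on the iterations needed.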
