Graph#
Graphs reduce kernel launch overhead by capturing a sequence of GPU operations into a graph, then replaying it in a single launch.
Backend support#
Both features, graph=True and graph_do_while, run on every backend. graph=True is hardware accelerated on CUDA (via CUDA graphs) and AMDGPU (via HIP graphs). graph_do_while is hardware accelerated only on CUDA SM 9.0+ (Hopper). On other backends, graph=True is silently ignored and the kernel runs via the normal launch path. Wherever graph_do_while has no hardware-accelerated path, it falls back to a host-side do-while loop that copies the condition value from GPU to host each iteration, causing a pipeline stall (see Caveats).
| Feature | CUDA SM 9.0+ | CUDA pre-SM 9.0 | AMDGPU | Other backends |
|---|---|---|---|---|
| graph=True | hardware accelerated | hardware accelerated | hardware accelerated | runs (no acceleration) |
| graph_do_while | hardware accelerated | host fallback | host fallback | host fallback |
AMDGPU graph_do_while falls back to the host-side loop because HIP does not currently expose conditional / while graph nodes (as of ROCm 7.2).
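The support matrix can also be read as a small decision function. The sketch below is plain Python and does not call Quadrants; the function name `graph_paths`, the backend strings, and the `(major, minor)` compute-capability tuple are assumptions made for illustration only.

```python
# Hypothetical helper mirroring the backend support table above.
# Backend names and the cuda_sm tuple are illustrative assumptions.
def graph_paths(backend, cuda_sm=None):
    if backend == "cuda":
        # Conditional while nodes are described as requiring SM 9.0 or newer.
        has_while_nodes = cuda_sm is not None and cuda_sm >= (9, 0)
        return {
            "graph": "hardware accelerated",
            "graph_do_while": "hardware accelerated" if has_while_nodes else "host fallback",
        }
    if backend == "amdgpu":
        return {"graph": "hardware accelerated", "graph_do_while": "host fallback"}
    # Any other backend: graph=True is ignored, graph_do_while loops on the host.
    return {"graph": "runs (no acceleration)", "graph_do_while": "host fallback"}


print(graph_paths("cuda", (9, 0)))
print(graph_paths("cuda", (8, 6)))
print(graph_paths("amdgpu"))
```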
Basic usage#
Add graph=True to a @qd.kernel decorator:
```python
@qd.kernel(graph=True)
def my_kernel(
    x: qd.types.ndarray(qd.f32, ndim=1),
    y: qd.types.ndarray(qd.f32, ndim=1),
):
    for i in range(x.shape[0]):
        x[i] = x[i] + 1.0
    for i in range(y.shape[0]):
        y[i] = y[i] + 2.0
```
The top-level for-loops are compiled into a single graph. Each loop is parallelized exactly as it would be without graph=True, but the launch latency is much lower because the whole sequence is submitted in one launch.
The kernel is used normally — no other API changes are needed:
```python
x = qd.ndarray(qd.f32, shape=(1024,))
y = qd.ndarray(qd.f32, shape=(1024,))
my_kernel(x, y)  # first call: builds and caches the graph
my_kernel(x, y)  # subsequent calls: replays the cached graph
```
This works the same way on CUDA and AMDGPU. The cache is keyed per (compiled-kernel-specialization, launch-id), so different template instantiations (different field bindings, etc.) get their own cached graph.
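The caching behaviour described above can be pictured with a toy model. This is a minimal sketch in plain Python, not the Quadrants implementation: the class `GraphCache`, its counters, and the string keys standing in for a kernel specialization are all hypothetical.

```python
# Toy model of a graph cache keyed by (specialization, launch_id).
# All names here are hypothetical; this is not Quadrants internals.
class GraphCache:
    def __init__(self):
        self._graphs = {}  # (specialization, launch_id) -> captured graph
        self.builds = 0
        self.replays = 0

    def launch(self, specialization, launch_id, args):
        key = (specialization, launch_id)
        if key not in self._graphs:
            # First launch for this key: capture the operations once.
            self._graphs[key] = f"graph<{specialization}#{launch_id}>"
            self.builds += 1
        else:
            # Later launches: reuse the captured graph with the new arguments.
            self.replays += 1
        return self._graphs[key], args


cache = GraphCache()
cache.launch("my_kernel[f32,f32]", 0, ("x1", "y1"))  # builds
cache.launch("my_kernel[f32,f32]", 0, ("x2", "y2"))  # replays with new arguments
cache.launch("my_kernel[f64,f64]", 0, ("x3", "y3"))  # other specialization: builds
print(cache.builds, cache.replays)
```

In this model, new arguments never trigger a rebuild; only a new key does.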
Restrictions#
No struct return values. Kernels that return values (e.g. -> qd.i32) cannot use graphs. An error is raised if graph=True is set on such a kernel.
Primal kernels only. The graph=True flag is applied to the primal (forward) kernel only, not its adjoint. Autodiff kernels use the normal launch path.
Device-resident ndarrays. Graph mode bakes device pointers into the cached graph, so all ndarray arguments must be on the GPU. Passing a host-resident ndarray raises an error.
qd_stream is incompatible with graph=True. Choose one or the other.
Passing different arguments#
You can pass different ndarrays to the same kernel on subsequent calls. The cached graph is replayed with the updated arguments — no graph rebuild occurs:
```python
x1 = qd.ndarray(qd.f32, shape=(1024,))
y1 = qd.ndarray(qd.f32, shape=(1024,))
my_kernel(x1, y1)  # builds graph

x2 = qd.ndarray(qd.f32, shape=(1024,))
y2 = qd.ndarray(qd.f32, shape=(1024,))
my_kernel(x2, y2)  # replays graph with new array pointers
```
Fields as arguments#
When different fields are passed as template arguments, each unique combination of fields produces a separately compiled kernel with its own graph cache entry. There is no interference between them.
GPU-side iteration with graph_do_while#
For iterative algorithms (physics solvers, convergence loops), you often want to repeat the kernel body until a condition is met, without returning to the host each iteration. Use while qd.graph_do_while(flag): inside a graph=True kernel:
```python
@qd.kernel(graph=True)
def solve(x: qd.types.ndarray(qd.f32, ndim=1),
          counter: qd.types.ndarray(qd.i32, ndim=0)):
    while qd.graph_do_while(counter):
        for i in range(x.shape[0]):
            x[i] = x[i] + 1.0
        # Single-iteration loop: update the scalar counter once per pass.
        for i in range(1):
            counter[()] = counter[()] - 1
```

```python
import numpy as np

N = 1024  # example size; any positive length works
x = qd.ndarray(qd.f32, shape=(N,))
counter = qd.ndarray(qd.i32, shape=())
counter.from_numpy(np.array(10, dtype=np.int32))
solve(x, counter)
# x is now incremented 10 times; counter is 0
```
The argument to qd.graph_do_while() must be the name of a scalar qd.i32 ndarray parameter. The loop body repeats while this value is non-zero.
On CUDA SM 9.0+ (Hopper), this uses CUDA conditional while nodes — the entire iteration runs on the GPU with no host involvement.
On older CUDA GPUs, AMDGPU, and non-GPU backends, it falls back to a host-side do-while loop (see Caveats and the backend support table).
Patterns#
Counter-based: set the counter to N, decrement each iteration. The body runs exactly N times.
```python
@qd.kernel(graph=True)
def iterate(x: qd.types.ndarray(qd.f32, ndim=1),
            counter: qd.types.ndarray(qd.i32, ndim=0)):
    while qd.graph_do_while(counter):
        for i in range(x.shape[0]):
            x[i] = x[i] + 1.0
        for i in range(1):
            counter[()] = counter[()] - 1
```
Boolean flag: set a keep_going flag to 1, have the kernel set it to 0 when a convergence criterion is met.
```python
@qd.kernel(graph=True)
def converge(x: qd.types.ndarray(qd.f32, ndim=1),
             keep_going: qd.types.ndarray(qd.i32, ndim=0)):
    while qd.graph_do_while(keep_going):
        for i in range(x.shape[0]):
            # ... do work ...
            pass
        for i in range(1):
            if some_condition(x):
                keep_going[()] = 0
```
Do-while semantics#
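graph_do_while has do-while semantics: the kernel body always executes at least once before the condition is checked. This matches the behavior of CUDA conditional while nodes. The flag value must be >= 1 at launch time. If you pass 0 to a kernel that decrements the counter, the first pass through the body takes the counter to -1 before the first check. The value is then non-zero and keeps moving away from zero, so the loop is effectively infinite.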
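The body-first ordering is easy to check with a pure-Python stand-in. The sketch below does not use Quadrants; `graph_do_while_model` and its `max_iters` safety cap are hypothetical and exist only so that the model terminates when started from 0.

```python
def graph_do_while_model(counter, max_iters=1000):
    """Pure-Python stand-in: run the body first, then test the flag."""
    iterations = 0
    while True:
        counter -= 1      # loop body: decrement the counter
        iterations += 1
        if counter == 0:  # the condition is checked only after the body
            return iterations, counter
        if iterations >= max_iters:
            # Safety cap for this model only; the real loop has no such limit.
            return iterations, counter


print(graph_do_while_model(10))  # (10, 0)
print(graph_do_while_model(1))   # (1, 0)
print(graph_do_while_model(0))   # (1000, -1000): stopped only by the cap
```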
ndarray vs field#
The parameter used by graph_do_while MUST be an ndarray.
However, other parameters can be any supported Quadrants kernel parameter type.
Restrictions#
The same physical ndarray must be used for the counter parameter on every call. Passing a different ndarray raises an error, because the counter’s device pointer is baked into the graph at creation time.
Caveats#
On platforms without native device-side conditional graph nodes — currently CUDA pre-SM 9.0 and AMDGPU (HIP has no conditional / while node API as of ROCm 7.2) — the value of the graph_do_while parameter will be copied from the GPU to the host each iteration, in order to check whether we should continue iterating. This causes a GPU pipeline stall. At the end of each loop iteration:
wait for the GPU async queue to finish processing
copy the condition value to the host
evaluate the condition value on the host
launch new kernels for the next loop iteration, if not finished yet
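The four steps above can be modelled in plain Python to make the cost visible. This is a sketch under stated assumptions: `FakeDevice` and `host_do_while` are hypothetical stand-ins, not Quadrants or CUDA/HIP APIs, and the "device" is just a Python object that counts host round trips.

```python
class FakeDevice:
    """Stand-in for a GPU queue; counts host round trips."""

    def __init__(self, counter):
        self.counter = counter  # value that lives "on the device"
        self.x = 0.0
        self.syncs = 0
        self.copies = 0

    def launch_body(self):
        # One iteration of the kernel body.
        self.x += 1.0
        self.counter -= 1

    def synchronize(self):
        self.syncs += 1

    def read_counter(self):
        self.copies += 1
        return self.counter


def host_do_while(dev):
    while True:
        dev.launch_body()           # launch kernels for this iteration
        dev.synchronize()           # 1. wait for the async queue to drain
        value = dev.read_counter()  # 2. copy the condition value to the host
        if value == 0:              # 3. evaluate it on the host
            break                   # 4. otherwise loop and launch again


dev = FakeDevice(counter=10)
host_do_while(dev)
print(dev.x, dev.syncs, dev.copies)  # 10.0 10 10
```

Every iteration pays one synchronization and one device-to-host copy, which is the stall that native conditional nodes avoid.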
Note: the basic graph=True path (without graph_do_while) does not stall the host like this on either CUDA or AMDGPU — the entire kernel sequence runs as a single GPU-side graph replay.
On platforms that take the host fallback, consider providing a second implementation that does not need to read GPU data on the host, for example:
run a fixed number of loop iterations, so that kernel launches do not depend on GPU data; optionally combined with:
make each kernel short-circuit (exit quickly) if the task has already completed, to avoid running the GPU more than necessary
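A minimal sketch of that alternative, again in plain Python rather than Quadrants: the dictionary `state` stands in for device-resident data, and `step`, `run_fixed`, and the halving "work" are hypothetical placeholders for a real solver.

```python
def step(state):
    """One 'kernel launch'. Exits immediately if the work is already done."""
    if state["done"]:
        state["skipped"] += 1  # short-circuit: nothing left to do
        return
    state["error"] *= 0.5      # placeholder for the real work
    state["work_steps"] += 1
    if state["error"] < state["tol"]:
        state["done"] = True   # the flag stays with the data; the host never reads it


def run_fixed(state, max_iters):
    # The host launches a fixed number of iterations and never waits on device data.
    for _ in range(max_iters):
        step(state)


state = {"error": 1.0, "tol": 0.01, "done": False, "work_steps": 0, "skipped": 0}
run_fixed(state, max_iters=20)
print(state["work_steps"], state["skipped"])  # 7 13
```

The trade-off is that `max_iters` must be chosen large enough for convergence, and the iterations after convergence still cost a launch each, even though they do no work.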