CUDA Graph#
CUDA graphs reduce kernel launch overhead by capturing a sequence of GPU operations into a graph, then replaying it in a single launch. On non-CUDA platforms, the CUDA graph annotation is simply ignored and code runs normally.
Basic usage#
Add `gpu_graph=True` to a `@qd.kernel` decorator:
```python
@qd.kernel(gpu_graph=True)
def my_kernel(
    x: qd.types.ndarray(qd.f32, ndim=1),
    y: qd.types.ndarray(qd.f32, ndim=1),
):
    for i in range(x.shape[0]):
        x[i] = x[i] + 1.0
    for i in range(y.shape[0]):
        y[i] = y[i] + 2.0
```
The top-level for-loops are compiled into a single CUDA graph. The parallelism is the same as before, but the launch latency is much reduced.
The kernel is used normally — no other API changes are needed:
```python
x = qd.ndarray(qd.f32, shape=(1024,))
y = qd.ndarray(qd.f32, shape=(1024,))
my_kernel(x, y)  # first call: builds and caches the graph
my_kernel(x, y)  # subsequent calls: replay the cached graph
```
Restrictions#
- **No return values.** Kernels that return values (e.g. `-> qd.i32`) cannot use CUDA graphs. An error is raised if `gpu_graph=True` is set on such a kernel.
- **Primal kernels only.** The `gpu_graph=True` flag applies to the primal (forward) kernel only, not its adjoint. Autodiff kernels use the normal launch path.
Passing different arguments#
You can pass different ndarrays to the same kernel on subsequent calls. The cached graph is replayed with the updated arguments — no graph rebuild occurs:
```python
x1 = qd.ndarray(qd.f32, shape=(1024,))
y1 = qd.ndarray(qd.f32, shape=(1024,))
my_kernel(x1, y1)  # builds graph

x2 = qd.ndarray(qd.f32, shape=(1024,))
y2 = qd.ndarray(qd.f32, shape=(1024,))
my_kernel(x2, y2)  # replays graph with new array pointers
```
Fields as arguments#
When different fields are passed as template arguments, each unique combination of fields produces a separately compiled kernel with its own graph cache entry. There is no interference between them.
GPU-side iteration with graph_do_while#
For iterative algorithms (physics solvers, convergence loops), you often want to repeat the kernel body until a condition is met, without returning to the host each iteration. Use `while qd.graph_do_while(flag):` inside a `gpu_graph=True` kernel:
```python
@qd.kernel(gpu_graph=True)
def solve(x: qd.types.ndarray(qd.f32, ndim=1),
          counter: qd.types.ndarray(qd.i32, ndim=0)):
    while qd.graph_do_while(counter):
        for i in range(x.shape[0]):
            x[i] = x[i] + 1.0
        for i in range(1):
            counter[()] = counter[()] - 1
```
```python
x = qd.ndarray(qd.f32, shape=(N,))
counter = qd.ndarray(qd.i32, shape=())
counter.from_numpy(np.array(10, dtype=np.int32))

solve(x, counter)
# x is now incremented 10 times; counter is 0
```
The argument to `qd.graph_do_while()` must be the name of a scalar `qd.i32` ndarray parameter. The loop body repeats while this value is non-zero.
On SM 9.0+ (Hopper), this uses CUDA conditional while nodes — the entire iteration runs on the GPU with no host involvement.
On older CUDA GPUs and non-CUDA backends, it falls back to a host-side do-while loop.
Patterns#
**Counter-based:** set the counter to N and decrement it each iteration. The body runs exactly N times.
```python
@qd.kernel(gpu_graph=True)
def iterate(x: qd.types.ndarray(qd.f32, ndim=1),
            counter: qd.types.ndarray(qd.i32, ndim=0)):
    while qd.graph_do_while(counter):
        for i in range(x.shape[0]):
            x[i] = x[i] + 1.0
        for i in range(1):
            counter[()] = counter[()] - 1
```
**Boolean flag:** set a `keep_going` flag to 1, and have the kernel set it to 0 when a convergence criterion is met.
```python
@qd.kernel(gpu_graph=True)
def converge(x: qd.types.ndarray(qd.f32, ndim=1),
             keep_going: qd.types.ndarray(qd.i32, ndim=0)):
    while qd.graph_do_while(keep_going):
        for i in range(x.shape[0]):
            # ... do work ...
            pass
        for i in range(1):
            if some_condition(x):
                keep_going[()] = 0
```
Do-while semantics#
`graph_do_while` has do-while semantics: the kernel body always executes at least once before the condition is checked. This matches the behavior of CUDA conditional while nodes. The flag value must be >= 1 at launch time: passing 0 to a kernel that decrements the counter causes an infinite loop, because the first pass drives the counter to -1 and it never reaches zero again.
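The do-while contract can be illustrated with a plain-Python host-side model (this is a conceptual sketch, not the GPU implementation):

```python
def do_while_model(body, counter):
    """Model of do-while semantics: run `body` at least once,
    then keep running while counter[0] is non-zero."""
    iterations = 0
    while True:
        body()               # body executes BEFORE the condition check
        iterations += 1
        if counter[0] == 0:
            break
    return iterations

counter = [3]  # models the scalar qd.i32 ndarray

def body():
    counter[0] -= 1  # decrement once per pass, as in the counter pattern

runs = do_while_model(body, counter)
# Had counter started at 0, the first pass would drive it to -1 and the
# loop would never terminate -- hence the >= 1 requirement at launch.
```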
ndarray vs field#
The parameter used by `graph_do_while` MUST be an ndarray. Other parameters can be any supported Quadrants kernel parameter type.
Restrictions#
The same physical ndarray must be used for the counter parameter on every call. Passing a different ndarray raises an error, because the counter’s device pointer is baked into the CUDA graph at creation time.
Caveats#
On currently unsupported GPU platforms, such as AMDGPU at the time of writing, the value of the `graph_do_while` parameter is copied from the GPU to the host each iteration, in order to check whether iteration should continue. This causes a GPU pipeline stall. At the end of each loop iteration, the runtime must:

- wait for the GPU async queue to finish processing
- copy the condition value to the host
- evaluate the condition value on the host
- launch new kernels for the next loop iteration, if not finished yet
On unsupported platforms, you might therefore consider a second implementation that works differently, e.g.:

- a fixed number of loop iterations, so kernel launches do not depend on GPU data; combined perhaps with:
- making each kernel short-circuit (exit quickly) if the task is already complete, to avoid running the GPU more than necessary
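The fixed-iteration plus short-circuit idea can be sketched in plain Python, with an ordinary function standing in for the kernel and an illustrative (hypothetical) `done` flag and convergence criterion:

```python
def short_circuit_step(x, done):
    # Exit quickly if the task is already complete, so the extra
    # fixed launches cost almost nothing.
    if done[0]:
        return 0
    work = 0
    for i in range(len(x)):
        x[i] = x[i] + 1.0
        work += 1
    if x[0] >= 3.0:  # illustrative convergence criterion
        done[0] = 1
    return work

x, done = [0.0, 0.0], [0]
total = 0
for _ in range(10):  # fixed launch count: no GPU-to-host readback needed
    total += short_circuit_step(x, done)
# converges after 3 steps; the remaining 7 launches short-circuit
```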