CUDA Graph#

CUDA graphs reduce kernel launch overhead by capturing a sequence of GPU operations into a graph and replaying it in a single launch. On non-CUDA platforms, the CUDA graph annotation is simply ignored, and the code runs normally.

Basic usage#

Add gpu_graph=True to a @qd.kernel decorator:

@qd.kernel(gpu_graph=True)
def my_kernel(
    x: qd.types.ndarray(qd.f32, ndim=1),
    y: qd.types.ndarray(qd.f32, ndim=1),
):
    for i in range(x.shape[0]):
        x[i] = x[i] + 1.0
    for i in range(y.shape[0]):
        y[i] = y[i] + 2.0

The top-level for-loops are compiled into a single CUDA graph. The parallelism is the same as before, but the launch latency is much reduced.

The kernel is used normally — no other API changes are needed:

x = qd.ndarray(qd.f32, shape=(1024,))
y = qd.ndarray(qd.f32, shape=(1024,))

my_kernel(x, y)  # first call: builds and caches the graph
my_kernel(x, y)  # subsequent calls: replays the cached graph

Restrictions#

  • No return values. Kernels that return values (e.g. -> qd.i32) cannot use CUDA graphs. An error is raised if gpu_graph=True is set on such a kernel.

  • Primal kernels only. The gpu_graph=True flag is applied to the primal (forward) kernel only, not its adjoint. Autodiff kernels use the normal launch path.

Passing different arguments#

You can pass different ndarrays to the same kernel on subsequent calls. The cached graph is replayed with the updated arguments — no graph rebuild occurs:

x1 = qd.ndarray(qd.f32, shape=(1024,))
y1 = qd.ndarray(qd.f32, shape=(1024,))
my_kernel(x1, y1)  # builds graph

x2 = qd.ndarray(qd.f32, shape=(1024,))
y2 = qd.ndarray(qd.f32, shape=(1024,))
my_kernel(x2, y2)  # replays graph with new array pointers
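The build-once/replay-many behavior can be illustrated with a host-side analogy in plain Python. This is a sketch, not the Quadrants implementation; GraphCache, builds, and captured are hypothetical names:

```python
class GraphCache:
    """Host-side analogy of build-once graph caching (hypothetical sketch)."""

    def __init__(self, fn):
        self.fn = fn           # the kernel body to "capture"
        self.captured = False
        self.builds = 0        # how many times the graph was built

    def __call__(self, *arrays):
        if not self.captured:
            self.builds += 1   # first call: build and cache the graph
            self.captured = True
        # every call: replay, rebinding the argument pointers
        self.fn(*arrays)

def add_one(x):
    for i in range(len(x)):
        x[i] += 1.0

graphed = GraphCache(add_one)
a = [0.0] * 4
b = [0.0] * 4
graphed(a)   # builds the "graph"
graphed(b)   # replays it with a different array; no rebuild
```

The point of the analogy: the expensive step (building) happens once per kernel, and later calls only swap in new argument pointers before replay.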

Fields as arguments#

When different fields are passed as template arguments, each unique combination of fields produces a separately compiled kernel with its own graph cache entry. There is no interference between them.

GPU-side iteration with graph_do_while#

For iterative algorithms (physics solvers, convergence loops), you often want to repeat the kernel body until a condition is met, without returning to the host each iteration. Use while qd.graph_do_while(flag): inside a gpu_graph=True kernel:

@qd.kernel(gpu_graph=True)
def solve(x: qd.types.ndarray(qd.f32, ndim=1),
          counter: qd.types.ndarray(qd.i32, ndim=0)):
    while qd.graph_do_while(counter):
        for i in range(x.shape[0]):
            x[i] = x[i] + 1.0
        for i in range(1):
            counter[()] = counter[()] - 1

x = qd.ndarray(qd.f32, shape=(1024,))
counter = qd.ndarray(qd.i32, shape=())
counter.from_numpy(np.array(10, dtype=np.int32))
solve(x, counter)
# x is now incremented 10 times; counter is 0

The argument to qd.graph_do_while() must be the name of a scalar qd.i32 ndarray parameter. The loop body repeats while this value is non-zero.

  • On SM 9.0+ (Hopper), this uses CUDA conditional while nodes — the entire iteration runs on the GPU with no host involvement.

  • On older CUDA GPUs and non-CUDA backends, it falls back to a host-side do-while loop.

Patterns#

Counter-based: set the counter to N, decrement each iteration. The body runs exactly N times.

@qd.kernel(gpu_graph=True)
def iterate(x: qd.types.ndarray(qd.f32, ndim=1),
            counter: qd.types.ndarray(qd.i32, ndim=0)):
    while qd.graph_do_while(counter):
        for i in range(x.shape[0]):
            x[i] = x[i] + 1.0
        for i in range(1):
            counter[()] = counter[()] - 1

Boolean flag: set a keep_going flag to 1, have the kernel set it to 0 when a convergence criterion is met.

@qd.kernel(gpu_graph=True)
def converge(x: qd.types.ndarray(qd.f32, ndim=1),
             keep_going: qd.types.ndarray(qd.i32, ndim=0)):
    while qd.graph_do_while(keep_going):
        for i in range(x.shape[0]):
            # ... do work ...
            pass
        for i in range(1):
            if some_condition(x):
                keep_going[()] = 0
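The flag-driven loop can be simulated on the host in plain Python. This is an analogy, not the qd API; halving the values toward zero stands in for the solver step:

```python
import numpy as np

def converge_host(x, tol=1e-6):
    """Host-side analogy of the boolean-flag pattern (hypothetical sketch)."""
    keep_going = np.array(1, dtype=np.int32)   # flag starts at 1
    while True:                                # do-while: body runs first
        x *= 0.5                               # ... do work ...
        if np.max(np.abs(x)) < tol:            # convergence criterion met
            keep_going[()] = 0                 # the "kernel" clears the flag
        if keep_going[()] == 0:                # condition checked after body
            break
    return x

result = converge_host(np.ones(4, dtype=np.float32))
```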

Do-while semantics#

graph_do_while has do-while semantics: the kernel body always executes at least once before the condition is checked. This matches the behavior of CUDA conditional while nodes. The flag value must be >= 1 at launch time: passing 0 to a kernel that decrements the counter causes an infinite loop, because the body runs once, the counter drops to -1, and the non-zero check never fails.
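The ordering can be illustrated in plain Python (a sketch of do-while semantics, not the qd implementation):

```python
def do_while(body, cond):
    # do-while: the body always executes once before cond is first checked
    body()
    while cond():
        body()

# Counter-based: a counter of 3 runs the body exactly 3 times.
counter = {"v": 3}
runs = []
def step():
    runs.append(1)
    counter["v"] -= 1
do_while(step, lambda: counter["v"] != 0)

# At-least-once: even when the condition is already false at launch,
# the body still runs one time before the check.
flag = {"v": 0}
first = []
do_while(lambda: first.append(1), lambda: flag["v"] != 0)
```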

ndarray vs field#

The parameter used by graph_do_while MUST be an ndarray, not a field.

Other parameters can be any supported Quadrants kernel parameter type.

Restrictions#

  • The same physical ndarray must be used for the counter parameter on every call. Passing a different ndarray raises an error, because the counter’s device pointer is baked into the CUDA graph at creation time.

Caveats#

On currently unsupported GPU platforms, such as AMDGPU at the time of writing, the value of the graph_do_while parameter is copied from the GPU to the host each iteration to check whether iteration should continue. This stalls the GPU pipeline. At the end of each loop iteration, the runtime must:

  • wait for the GPU async queue to finish processing

  • copy the condition value to the host

  • evaluate the condition value on the host

  • launch the kernels for the next loop iteration, if not yet finished
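The fallback amounts to a host-driven do-while. A rough sketch with hypothetical names (real code would also synchronize the device queue before the copy):

```python
import numpy as np

def host_side_do_while(launch_iteration, device_flag):
    """Host-driven fallback loop (hypothetical sketch, not the qd runtime)."""
    iterations = 0
    while True:
        launch_iteration()           # enqueue this iteration's kernels
        # wait for the GPU async queue to drain (omitted in this sketch)
        flag = int(device_flag[()])  # copy the condition value to the host
        iterations += 1
        if flag == 0:                # evaluate the condition on the host
            break                    # otherwise: launch the next iteration
    return iterations

flag = np.array(5, dtype=np.int32)
def step():
    flag[()] -= 1                    # stand-in for the kernel body
n = host_side_do_while(step, flag)
```

Each pass through the loop is one full round-trip: enqueue, wait, copy, evaluate, relaunch. That round-trip is exactly the stall the conditional-node path avoids.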

Therefore, on unsupported platforms, consider writing a second implementation that avoids the per-iteration host round-trip, e.g.:

  • run a fixed number of loop iterations, so that kernel launches do not depend on GPU data; perhaps combined with:

  • make each kernel short-circuit and exit quickly if the task is already complete, to avoid running the GPU more than necessary
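Those two ideas combine naturally; here is a host-side sketch with hypothetical names, where an early return stands in for a kernel-side short-circuit:

```python
MAX_ITERS = 16   # fixed launch count: no dependency on GPU data

state = {"x": 0, "target": 3, "work_done": 0}

def step():
    # Short-circuit: exit immediately if the task is already complete,
    # so the extra launches cost little.
    if state["x"] >= state["target"]:
        return
    state["work_done"] += 1
    state["x"] += 1

# Always launch MAX_ITERS times; the host never reads GPU data to decide.
for _ in range(MAX_ITERS):
    step()
```

The launch count is a worst-case bound chosen on the host; correctness does not depend on it being tight, only on it being large enough.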