# CUDA Graph

CUDA graphs reduce kernel launch overhead by capturing a sequence of GPU operations into a graph, then replaying it in a single launch.

On non-CUDA platforms, the CUDA graph annotation is simply ignored, and the code runs normally.

## Basic usage

Add `gpu_graph=True` to a `@qd.kernel` decorator:

```python
@qd.kernel(gpu_graph=True)
def my_kernel(
    x: qd.types.ndarray(qd.f32, ndim=1),
    y: qd.types.ndarray(qd.f32, ndim=1),
):
    # Each top-level for-loop becomes one node in the captured graph.
    for i in range(x.shape[0]):
        x[i] = x[i] + 1.0
    for i in range(y.shape[0]):
        y[i] = y[i] + 2.0
```

The top-level for-loops will be compiled into a single CUDA graph. The parallelism is the same as before, but the launch latency is much reduced.

The kernel is used normally; no other API changes are needed:

```python
x = qd.ndarray(qd.f32, shape=(1024,))
y = qd.ndarray(qd.f32, shape=(1024,))
my_kernel(x, y)  # first call: builds and caches the graph
my_kernel(x, y)  # subsequent calls: replays the cached graph
```

### Restrictions

- **No struct return values.** Kernels that return values (e.g. `-> qd.i32`) cannot use CUDA graphs. An error is raised if `gpu_graph=True` is set on such a kernel.
- **Primal kernels only.** The `gpu_graph=True` flag is applied to the primal (forward) kernel only, not its adjoint. Autodiff kernels use the normal launch path.

### Passing different arguments

You can pass different ndarrays to the same kernel on subsequent calls. The cached graph is replayed with the updated arguments; no graph rebuild occurs:

```python
x1 = qd.ndarray(qd.f32, shape=(1024,))
y1 = qd.ndarray(qd.f32, shape=(1024,))
my_kernel(x1, y1)  # builds graph

x2 = qd.ndarray(qd.f32, shape=(1024,))
y2 = qd.ndarray(qd.f32, shape=(1024,))
my_kernel(x2, y2)  # replays graph with new array pointers
```

### Fields as arguments

When different fields are passed as template arguments, each unique combination of fields produces a separately compiled kernel with its own graph cache entry. There is no interference between them.
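As a mental model only, the caching behaviour described in the last two subsections can be sketched in plain Python. This is not the Quadrants implementation: `ToyGraphCache`, its `launch` method, and the string names standing in for fields and ndarrays are all hypothetical, chosen purely for illustration. The sketch assumes one cache entry per unique combination of template fields, and treats ndarray arguments as values that are swapped in at replay time without a rebuild.

```python
class ToyGraphCache:
    """Toy model of a graph cache (hypothetical, not Quadrants internals)."""

    def __init__(self):
        self.graphs = {}      # key: tuple of template-field names -> cache entry
        self.build_count = 0  # number of times a graph had to be built

    def launch(self, field_names, ndarray_args):
        key = tuple(field_names)
        if key not in self.graphs:
            # First launch for this combination of fields: build a new graph.
            self.build_count += 1
            self.graphs[key] = {"id": self.build_count, "replays": 0}
        else:
            # Same fields as before: replay the cached graph.
            self.graphs[key]["replays"] += 1
        # ndarray arguments are updated on every launch; they never force a rebuild.
        self.graphs[key]["last_args"] = tuple(ndarray_args)
        return self.graphs[key]["id"]


cache = ToyGraphCache()
first = cache.launch(("field_a",), ("x1", "y1"))   # builds graph 1
second = cache.launch(("field_a",), ("x2", "y2"))  # new ndarrays, same graph replayed
third = cache.launch(("field_b",), ("x1", "y1"))   # different field, builds graph 2
```

In this model, only a change in the template fields adds a cache entry; swapping ndarrays reuses the existing one.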
## GPU-side iteration with `graph_do_while`

For iterative algorithms (physics solvers, convergence loops), you often want to repeat the kernel body until a condition is met, without returning to the host each iteration.

Use `while qd.graph_do_while(flag):` inside a `gpu_graph=True` kernel:

```python
import numpy as np

N = 1024  # example array size

@qd.kernel(gpu_graph=True)
def solve(x: qd.types.ndarray(qd.f32, ndim=1), counter: qd.types.ndarray(qd.i32, ndim=0)):
    while qd.graph_do_while(counter):
        for i in range(x.shape[0]):
            x[i] = x[i] + 1.0
        # Single-iteration loop: update the scalar counter once per pass.
        for i in range(1):
            counter[()] = counter[()] - 1

x = qd.ndarray(qd.f32, shape=(N,))
counter = qd.ndarray(qd.i32, shape=())
counter.from_numpy(np.array(10, dtype=np.int32))
solve(x, counter)
# x is now incremented 10 times; counter is 0
```

The argument to `qd.graph_do_while()` must be the name of a scalar `qd.i32` ndarray parameter. The loop body repeats while this value is non-zero.

- On SM 9.0+ (Hopper), this uses CUDA conditional while nodes, so the entire iteration runs on the GPU with no host involvement.
- On older CUDA GPUs and non-CUDA backends, it falls back to a host-side do-while loop.

### Patterns

**Counter-based**: set the counter to N and decrement it each iteration. The body runs exactly N times.

```python
@qd.kernel(gpu_graph=True)
def iterate(x: qd.types.ndarray(qd.f32, ndim=1), counter: qd.types.ndarray(qd.i32, ndim=0)):
    while qd.graph_do_while(counter):
        for i in range(x.shape[0]):
            x[i] = x[i] + 1.0
        for i in range(1):
            counter[()] = counter[()] - 1
```

**Boolean flag**: set a `keep_going` flag to 1, and have the kernel set it to 0 when a convergence criterion is met.

```python
@qd.kernel(gpu_graph=True)
def converge(x: qd.types.ndarray(qd.f32, ndim=1), keep_going: qd.types.ndarray(qd.i32, ndim=0)):
    while qd.graph_do_while(keep_going):
        for i in range(x.shape[0]):
            # ... do work ...
            pass
        for i in range(1):
            # some_condition is a placeholder for your convergence check.
            if some_condition(x):
                keep_going[()] = 0
```

### Do-while semantics

`graph_do_while` has **do-while** semantics: the kernel body always executes at least once before the condition is checked. This matches the behavior of CUDA conditional while nodes.

The flag value must be >= 1 at launch time. Passing 0 with a kernel that decrements the counter will cause an infinite loop: the body runs once, the counter becomes -1, and because -1 is non-zero the loop keeps running.

### ndarray vs field

The parameter used by `graph_do_while` MUST be an ndarray. However, other parameters can be any supported Quadrants kernel parameter type.

### Restrictions

- The same physical ndarray must be used for the counter parameter on every call. Passing a different ndarray raises an error, because the counter's device pointer is baked into the CUDA graph at creation time.

### Caveats

On currently unsupported GPU platforms, such as AMDGPU at the time of writing, the value of the `graph_do_while` parameter is copied from the GPU to the host each iteration, in order to check whether to continue iterating. This causes a GPU pipeline stall. At the end of each loop iteration, the runtime must:

- wait for the GPU async queue to finish processing
- copy the condition value to the host
- evaluate the condition value on the host
- launch new kernels for the next loop iteration, if not finished yet

Therefore, on unsupported platforms, you might consider creating a second implementation that works differently, for example:

- use a fixed number of loop iterations, so that kernel launches do not depend on GPU data; perhaps combined with:
- make each kernel 'short-circuit' (exit quickly) if the task has already been completed, to avoid running the GPU more than necessary
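
The trade-off between the host-checked fallback and the fixed-iteration alternative can be sketched in plain Python. This is a conceptual simulation only, with no Quadrants or GPU calls: `step`, `run_host_checked`, `run_fixed`, and the dictionary-based state are hypothetical names invented for illustration, and the "residual halves each step" workload is an arbitrary stand-in for a real solver.

```python
def make_state():
    return {"residual": 1.0, "done": False, "work_steps": 0}


def step(state):
    """Hypothetical 'kernel': halve the residual and mark completion."""
    if state["done"]:
        return  # short-circuit: the task already finished, exit quickly
    state["residual"] *= 0.5
    state["work_steps"] += 1
    if state["residual"] < 1e-3:
        state["done"] = True


def run_host_checked(state):
    """Fallback style: the host reads the flag back after every iteration."""
    syncs = 0
    while True:
        step(state)
        syncs += 1  # models one GPU-to-host copy (and stall) per iteration
        if state["done"]:
            break
    return syncs


def run_fixed(state, max_iters):
    """Fixed iteration count: launches never depend on data from the device."""
    for _ in range(max_iters):
        step(state)
    return max_iters  # launches issued; no per-iteration readback


a = make_state()
syncs_a = run_host_checked(a)
b = make_state()
launches_b = run_fixed(b, max_iters=32)
```

Both variants perform the same amount of useful work here. The fixed variant issues extra launches that return immediately, in exchange for never stalling on a readback. It only gives the same result when `max_iters` is large enough for the task to finish.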