Performance Dispatch#
@qd.perf_dispatch is a decorator that automatically selects the fastest implementation of a function at runtime. When you have multiple implementations of the same operation (e.g. different algorithmic strategies, or platform-specific kernels), perf_dispatch benchmarks them and picks the winner.
Basic usage#
Define a meta-function with @qd.perf_dispatch. The function body should be empty — it serves only as a prototype declaring the signature and geometry hash:
@qd.perf_dispatch(get_geometry_hash=lambda a, b: hash(a.shape + b.shape))
def my_op(a: qd.types.NDArray[qd.f32, 1], b: qd.types.NDArray[qd.f32, 1]): ...
Then register concrete implementations using @my_op.register. Each implementation must have the same parameter names as the prototype.
@my_op.register
@qd.kernel
def my_op_v1(a: qd.types.NDArray[qd.f32, 1], b: qd.types.NDArray[qd.f32, 1]) -> None:
    for i in range(a.shape[0]):
        b[i] = a[i] * 2
@my_op.register
@qd.kernel
def my_op_v2(a: qd.types.NDArray[qd.f32, 1], b: qd.types.NDArray[qd.f32, 1]) -> None:
    i = 0
    # computes the same result as the other version; just an example of an
    # alternative loop structure
    while i < a.shape[0]:
        b[i] = a[i] + a[i]
        i += 1
Call the meta-function as though calling a standard Quadrants kernel:
my_op(a, b)
On the first several calls, perf_dispatch will cycle through implementations (warming up, then timing). Once all implementations have been timed, the fastest is cached and used until re-evaluation is triggered (see Tuning parameters).
Decorator order#
When registering a @qd.kernel, the @my_op.register decorator must be the topmost decorator:
# Correct
@my_op.register
@qd.kernel
def impl(...) -> None: ...
# Wrong — will raise QuadrantsSyntaxError
@qd.kernel
@my_op.register
def impl(...) -> None: ...
Registering plain Python functions#
Implementations do not have to be @qd.kernel. Plain Python functions work too. You can mix kernel and Python implementations under the same meta-function:
@my_op.register
def my_op_python(a: qd.types.NDArray[qd.f32, 1], b: qd.types.NDArray[qd.f32, 1]) -> None:
    for i in range(a.shape[0]):
        b[i] = a[i] * 2
Return values#
Implementations are not limited to returning None. The return value of whichever implementation runs is passed through to the caller:
@qd.perf_dispatch(get_geometry_hash=lambda a, b: hash(a.shape + b.shape))
def my_op(a: qd.types.NDArray[qd.f32, 1], b: qd.types.NDArray[qd.f32, 1]) -> int: ...
@my_op.register
def my_op_python(a: qd.types.NDArray[qd.f32, 1], b: qd.types.NDArray[qd.f32, 1]) -> int:
    for i in range(a.shape[0]):
        b[i] = a[i] * 2
    return a.shape[0]
count = my_op(a, b) # receives the return value from whichever impl runs
Note that return types must match across all implementations and the meta-function.
Compatibility filtering#
Some implementations may only work under certain conditions (specific platforms, input shapes, etc.). Use the is_compatible parameter to declare when an implementation is eligible:
@my_op.register(is_compatible=lambda a, b: a.shape[0] >= 1024)
@qd.kernel
def my_op_large(a: qd.types.NDArray[qd.f32, 1], b: qd.types.NDArray[qd.f32, 1]) -> None:
    # optimized for large inputs
    ...
is_compatible receives the same arguments as the meta-function and must return True or False. If an implementation cannot handle certain inputs, is_compatible must be provided and must return False for those inputs. Implementations without is_compatible are assumed to always be compatible.
If only one implementation is compatible for a given call, it is used immediately without benchmarking.
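To make the filtering concrete, here is a small plain-Python sketch (illustrative only; `eligible` is a hypothetical helper, not part of the Quadrants API) of how compatible implementations might be selected before benchmarking:

```python
# Hypothetical sketch of compatibility filtering; `eligible` is not a real
# Quadrants API. Each entry pairs an implementation with its is_compatible
# predicate (None means "always compatible").
def eligible(impls, *args):
    return [fn for fn, ok in impls if ok is None or ok(*args)]

# A variant restricted to large inputs, and an unrestricted fallback.
large = (lambda a: "large", lambda a: len(a) >= 1024)
generic = (lambda a: "generic", None)

# For a short input only the generic variant qualifies, so a dispatcher
# could use it immediately without benchmarking.
compatible = eligible([large, generic], [0.0] * 8)
```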
Geometry hash#
The get_geometry_hash function maps call arguments to a hash representing the “geometry” of the input. Different geometries are benchmarked independently, so perf_dispatch can select different winners for different input shapes or configurations.
# Different shapes benchmark independently
@qd.perf_dispatch(get_geometry_hash=lambda a, b: hash(a.shape + b.shape))
def my_op(a: qd.types.NDArray[qd.f32, 1], b: qd.types.NDArray[qd.f32, 1]): ...
Guidelines for get_geometry_hash:

- Return a constant (e.g. 0) if all inputs have the same performance characteristics; a single winner will be chosen.
- Hash input shapes when different shapes may favor different implementations.
- Avoid reading GPU data in the hash function, as this creates a GPU sync point and will severely degrade performance. Prefer metadata like .shape, which is available on the CPU.
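As an illustration of per-geometry caching, here is a plain-Python sketch (the names `winners`, `dispatch`, and `choose_fastest` are made up for this example; this is not the Quadrants internals):

```python
# Hypothetical sketch: the dispatcher keeps one cached winner per geometry hash.
winners = {}  # geometry hash -> implementation chosen for that geometry

def dispatch(get_geometry_hash, choose_fastest, impls, *args):
    key = get_geometry_hash(*args)
    if key not in winners:
        # First call with this geometry: "benchmark" and cache the winner.
        winners[key] = choose_fastest(impls, *args)
    return winners[key](*args)

# Shape-based hash: an 8-element and a 4096-element input land in different
# buckets, so each can end up with a different winner.
geo = lambda a: hash((len(a),))

# Stand-in for real timing: pretend small inputs favor impl_a, large impl_b.
impl_a, impl_b = (lambda a: "a"), (lambda a: "b")
choose = lambda impls, a: impls[0] if len(a) < 1024 else impls[1]

small_result = dispatch(geo, choose, [impl_a, impl_b], [0.0] * 8)     # impl_a
big_result = dispatch(geo, choose, [impl_a, impl_b], [0.0] * 4096)    # impl_b
```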
Tuning parameters#
@qd.perf_dispatch accepts several optional parameters to control the benchmarking process:
| Parameter | Default | Description |
|---|---|---|
| warmup | 3 | Number of untimed warmup calls per implementation before measuring. |
| active | 1 | Number of timed calls per implementation. |
| repeat_after_count | 0 | Re-run benchmarking after this many additional calls. 0 disables. |
| repeat_after_seconds | 1.0 | Re-run benchmarking after this many seconds have elapsed. 0 disables. |
Example with custom tuning:
@qd.perf_dispatch(
    get_geometry_hash=lambda a, b: hash(a.shape),
    warmup=5,
    active=2,
    repeat_after_seconds=60.0,
)
def my_op(a: qd.types.NDArray[qd.f32, 1], b: qd.types.NDArray[qd.f32, 1]): ...
How benchmarking works#
1. Warmup phase: Each compatible implementation is called warmup times in round-robin order. These calls are not timed.
2. Active phase: Each compatible implementation is called active times in round-robin order. The GPU is synchronized before and after each call to get accurate wall-clock measurements.
3. Selection: The implementation with the lowest active-phase time is cached as the winner for that geometry hash.
4. Steady state: Subsequent calls with the same geometry go directly to the cached winner with no overhead.
5. Re-evaluation (optional): After repeat_after_count calls or repeat_after_seconds seconds, the entire warmup + active cycle restarts from scratch, allowing the dispatcher to adapt if conditions change.
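The warmup, active, and selection phases can be sketched in plain Python (illustrative only; `pick_winner` is a hypothetical name, and real Quadrants kernels would additionally need GPU synchronization around the timed calls):

```python
import time

# Illustrative sketch of the warmup -> active -> selection cycle; this is
# not the Quadrants implementation.
def pick_winner(impls, args, warmup=3, active=1):
    # Warmup phase: untimed round-robin calls.
    for _ in range(warmup):
        for fn in impls:
            fn(*args)
    # Active phase: timed round-robin calls.
    totals = {fn: 0.0 for fn in impls}
    for _ in range(active):
        for fn in impls:
            t0 = time.perf_counter()
            fn(*args)
            totals[fn] += time.perf_counter() - t0
    # Selection: lowest total active-phase time wins.
    return min(impls, key=totals.__getitem__)
```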
Forcing a specific implementation#
For debugging or profiling, you can bypass the auto-tuning and force a specific implementation using the QD_PERFDISPATCH_FORCE environment variable:
QD_PERFDISPATCH_FORCE=my_op:my_op_v2 python my_script.py
The format is dispatcher_name:implementation_name, where dispatcher_name is the name of the meta-function and implementation_name is the name of the registered function.
To force implementations for multiple dispatchers, separate entries with commas:
QD_PERFDISPATCH_FORCE=my_op:my_op_v2,transform:transform_v1 python my_script.py
Dispatchers not listed in the env var will benchmark normally.
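A sketch of how this comma-separated dispatcher_name:implementation_name format could be parsed (illustrative; `parse_force` is a made-up name, not the library's own parser):

```python
# Hypothetical parser for the QD_PERFDISPATCH_FORCE format; `parse_force`
# is not part of the Quadrants API.
def parse_force(value):
    forced = {}
    for entry in value.split(","):
        dispatcher, _, impl = entry.partition(":")
        forced[dispatcher.strip()] = impl.strip()
    return forced

parse_force("my_op:my_op_v2,transform:transform_v1")
# -> {"my_op": "my_op_v2", "transform": "transform_v1"}
```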
Discovering available names#
When QD_PERFDISPATCH_FORCE is set, all dispatchers automatically print their name and registered implementations at startup. To discover valid values, set the env var to any value and run the program:
QD_PERFDISPATCH_FORCE=? python my_script.py
This will produce output like:
perf_dispatch 'my_op': registered 'my_op_v1'
perf_dispatch 'my_op': registered 'my_op_v2'
perf_dispatch 'my_op': available implementations: ['my_op_v1', 'my_op_v2']
perf_dispatch 'transform': registered 'transform_v1'
perf_dispatch 'transform': registered 'transform_v2'
perf_dispatch 'transform': available implementations: ['transform_v1', 'transform_v2']
If the requested implementation name doesn’t match any registered function, a warning is printed and the dispatcher falls back to normal benchmarking.
Important notes#
- All registered implementations must produce identical results, including side effects. perf_dispatch does not verify this; incorrect results will be silently returned if implementations disagree.
- Only one implementation runs per call. Implementations do not need to be idempotent.
- perf_dispatch is not thread-safe. Do not call the same meta-function concurrently from multiple threads.
- Set QD_PERFDISPATCH_PRINT_DEBUG=1 to print debug messages showing which implementation was registered and which was selected.
Complete example#
import quadrants as qd
@qd.perf_dispatch(
    get_geometry_hash=lambda data, out: hash(data.shape),
    repeat_after_seconds=0,
)
def transform(data: qd.types.NDArray[qd.f32, 1], out: qd.types.NDArray[qd.f32, 1]): ...
@transform.register
@qd.kernel
def transform_v1(data: qd.types.NDArray[qd.f32, 1], out: qd.types.NDArray[qd.f32, 1]) -> None:
    for i in range(data.shape[0]):
        out[i] = data[i] * 2
@transform.register
@qd.kernel
def transform_v2(data: qd.types.NDArray[qd.f32, 1], out: qd.types.NDArray[qd.f32, 1]) -> None:
    i = 0
    # computes the same result as the other version; just an example of an
    # alternative loop structure
    while i < data.shape[0]:
        out[i] = data[i] + data[i]
        i += 1
data = qd.ndarray(qd.f32, (1024,))
out = qd.ndarray(qd.f32, (1024,))
for _ in range(100):
    transform(data, out)