Performance Dispatch

@qd.perf_dispatch is a decorator that automatically selects the fastest implementation of a function at runtime. When you have multiple implementations of the same operation (e.g. different algorithmic strategies, or platform-specific kernels), perf_dispatch benchmarks them and picks the winner.

Basic usage

Define a meta-function with @qd.perf_dispatch. The function body should be empty — it serves only as a prototype declaring the signature and geometry hash:

@qd.perf_dispatch(get_geometry_hash=lambda a, b: hash(a.shape + b.shape))
def my_op(a: qd.types.NDArray[qd.f32, 1], b: qd.types.NDArray[qd.f32, 1]): ...

Then register concrete implementations using @my_op.register. Each implementation must have the same parameter names as the prototype.

@my_op.register
@qd.kernel
def my_op_v1(a: qd.types.NDArray[qd.f32, 1], b: qd.types.NDArray[qd.f32, 1]) -> None:
    for i in range(a.shape[0]):
        b[i] = a[i] * 2

@my_op.register
@qd.kernel
def my_op_v2(a: qd.types.NDArray[qd.f32, 1], b: qd.types.NDArray[qd.f32, 1]) -> None:
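    i = 0
    # Runs at about the same speed as my_op_v1; shown only as an example
    while i < a.shape[0]:
        b[i] = a[i] + a[i]  # doubling by addition; bit shifts are not defined for f32
        i += 1  # advance the index so the loop terminates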

Call the meta-function as though calling a standard Quadrants kernel:

my_op(a, b)

On the first several calls, perf_dispatch will cycle through implementations (warming up, then timing). Once all implementations have been timed, the fastest is cached and used until re-evaluation is triggered (see Tuning parameters).

Decorator order

When registering a @qd.kernel, the @my_op.register decorator must be the topmost decorator:

# Correct
@my_op.register
@qd.kernel
def impl(...) -> None: ...

# Wrong — will raise QuadrantsSyntaxError
@qd.kernel
@my_op.register
def impl(...) -> None: ...

Registering plain Python functions

Implementations do not have to be @qd.kernel functions. Plain Python functions work too, and you can mix kernel and Python implementations under the same meta-function:

@my_op.register
def my_op_python(a: qd.types.NDArray[qd.f32, 1], b: qd.types.NDArray[qd.f32, 1]) -> None:
    for i in range(a.shape[0]):
        b[i] = a[i] * 2

Return values

Implementations are not limited to returning None. The return value of whichever implementation runs is passed through to the caller:

@qd.perf_dispatch(get_geometry_hash=lambda a, b: hash(a.shape + b.shape))
def my_op(a: qd.types.NDArray[qd.f32, 1], b: qd.types.NDArray[qd.f32, 1]) -> int: ...

@my_op.register
def my_op_python(a: qd.types.NDArray[qd.f32, 1], b: qd.types.NDArray[qd.f32, 1]) -> int:
    for i in range(a.shape[0]):
        b[i] = a[i] * 2
    return a.shape[0]

count = my_op(a, b)  # receives the return value from whichever impl runs

Note that the return type must be the same for the meta-function and for every registered implementation.

Compatibility filtering

Some implementations may only work under certain conditions (specific platforms, input shapes, etc.). Use the is_compatible parameter to declare when an implementation is eligible:

@my_op.register(is_compatible=lambda a, b: a.shape[0] >= 1024)
@qd.kernel
def my_op_large(a: qd.types.NDArray[qd.f32, 1], b: qd.types.NDArray[qd.f32, 1]) -> None:
    # optimized for large inputs
    ...

is_compatible receives the same arguments as the meta-function and must return True or False. If an implementation cannot handle certain inputs, is_compatible must be provided and must return False for those inputs. Implementations without is_compatible are assumed to always be compatible.

If only one implementation is compatible for a given call, it is used immediately without benchmarking.
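The eligibility rule described above can be modelled in a few lines of plain Python. This is a minimal sketch, not the Quadrants implementation: the `eligible` helper and the `(name, predicate)` registry layout are hypothetical, and shapes are passed as plain tuples instead of arrays so the example runs without Quadrants.

```python
from typing import Callable, Optional

# Hypothetical registry layout: (implementation name, optional predicate).
Predicate = Optional[Callable[[tuple, tuple], bool]]

registry: list[tuple[str, Predicate]] = [
    ("my_op_v1", None),  # no is_compatible: assumed always compatible
    ("my_op_large", lambda a_shape, b_shape: a_shape[0] >= 1024),
]


def eligible(a_shape: tuple, b_shape: tuple) -> list[str]:
    """Return the names of the implementations that accept these inputs."""
    return [
        name
        for name, is_compatible in registry
        if is_compatible is None or is_compatible(a_shape, b_shape)
    ]


small = eligible((16,), (16,))      # one candidate: it would be used without benchmarking
large = eligible((4096,), (4096,))  # two candidates: they would be benchmarked
```

In this model, a single-element result corresponds to the case where the implementation is used immediately.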

Geometry hash

The get_geometry_hash function maps call arguments to a hash representing the “geometry” of the input. Different geometries are benchmarked independently, so perf_dispatch can select different winners for different input shapes or configurations.

# Different shapes benchmark independently
@qd.perf_dispatch(get_geometry_hash=lambda a, b: hash(a.shape + b.shape))
def my_op(a: qd.types.NDArray[qd.f32, 1], b: qd.types.NDArray[qd.f32, 1]): ...

Guidelines for get_geometry_hash:

  • Return a constant (e.g. 0) if all inputs have the same performance characteristics — a single winner will be chosen.

  • Hash input shapes when different shapes may favor different implementations.

  • Avoid reading GPU data in the hash function: doing so creates a GPU sync point and will severely degrade performance. Prefer metadata such as .shape, which is available on the CPU.
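To illustrate these guidelines without requiring Quadrants, the sketch below uses a hypothetical `FakeArray` stand-in that exposes only a `.shape` tuple. It shows a shape-based hash separating two geometries and a constant hash collapsing them into one; the function names are illustrative, not part of the library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeArray:
    """Hypothetical stand-in for an ndarray; only the CPU-side shape is modelled."""
    shape: tuple


def shape_hash(a: FakeArray, b: FakeArray) -> int:
    # Uses metadata only, so no device data would be read.
    return hash(a.shape + b.shape)


def constant_hash(a: FakeArray, b: FakeArray) -> int:
    # Every call maps to the same geometry: a single winner for all inputs.
    return 0


small = (FakeArray((16,)), FakeArray((16,)))
large = (FakeArray((4096,)), FakeArray((4096,)))

# Equal shapes share a geometry; different shapes are benchmarked separately.
same_geometry = shape_hash(*small) == shape_hash(FakeArray((16,)), FakeArray((16,)))
distinct_geometry = shape_hash(*small) != shape_hash(*large)
```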

Tuning parameters

@qd.perf_dispatch accepts several optional parameters to control the benchmarking process:

| Parameter | Default | Description |
|---|---|---|
| warmup | 3 | Number of untimed warmup calls per implementation before measuring. |
| active | 1 | Number of timed calls per implementation. |
| repeat_after_count | 0 | Re-run benchmarking after this many additional calls. 0 disables. |
| repeat_after_seconds | 1.0 | Re-run benchmarking after this many seconds have elapsed. 0 disables. |

Example with custom tuning:

@qd.perf_dispatch(
    get_geometry_hash=lambda a, b: hash(a.shape),
    warmup=5,
    active=2,
    repeat_after_seconds=60.0,
)
def my_op(a: qd.types.NDArray[qd.f32, 1], b: qd.types.NDArray[qd.f32, 1]): ...
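The two re-evaluation triggers can be read as a simple predicate. The sketch below is one interpretation of the parameter descriptions, not the library's code: `should_rebenchmark` and its arguments are hypothetical, and it assumes that either trigger alone is sufficient, that a value of 0 disables a trigger, and that the comparison is "greater than or equal".

```python
def should_rebenchmark(
    calls_since_benchmark: int,
    seconds_since_benchmark: float,
    repeat_after_count: int = 0,        # default taken from the parameter list
    repeat_after_seconds: float = 1.0,  # default taken from the parameter list
) -> bool:
    """Return True when either enabled trigger has been reached."""
    count_hit = repeat_after_count > 0 and calls_since_benchmark >= repeat_after_count
    time_hit = repeat_after_seconds > 0 and seconds_since_benchmark >= repeat_after_seconds
    return count_hit or time_hit
```

Under these assumptions, passing `repeat_after_count=0` and `repeat_after_seconds=0` means the first benchmark result is kept indefinitely.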

How benchmarking works

  1. Warmup phase: Each compatible implementation is called warmup times in round-robin order. These calls are not timed.

  2. Active phase: Each compatible implementation is called active times in round-robin order. The GPU is synchronized before and after each call to get accurate wall-clock measurements.

  3. Selection: The implementation with the lowest active-phase time is cached as the winner for that geometry hash.

  4. Steady state: Subsequent calls with the same geometry go directly to the cached winner, with no further warmup or timing.

  5. Re-evaluation (optional): After repeat_after_count calls or repeat_after_seconds seconds, the entire warmup + active cycle restarts from scratch, allowing the dispatcher to adapt if conditions change.
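The first four steps can be sketched as a small pure-Python state machine. This is a toy model written for illustration, not the Quadrants implementation: `ToyDispatcher` and its internals are hypothetical, it times calls with `time.perf_counter` instead of synchronizing a GPU, and it omits compatibility filtering and re-evaluation.

```python
import time
from typing import Callable, Hashable


class ToyDispatcher:
    """Toy model of warmup -> active -> selection -> steady state."""

    def __init__(self, get_geometry_hash: Callable[..., Hashable],
                 warmup: int = 3, active: int = 1):
        self.get_geometry_hash = get_geometry_hash
        self.warmup = warmup
        self.active = active
        self.impls: list[Callable] = []
        self.state: dict[Hashable, dict] = {}  # per-geometry bookkeeping

    def register(self, fn: Callable) -> Callable:
        self.impls.append(fn)
        return fn

    def __call__(self, *args):
        key = self.get_geometry_hash(*args)
        st = self.state.setdefault(key, {"calls": 0, "times": {}, "winner": None})
        if st["winner"] is not None:
            return st["winner"](*args)  # steady state: use the cached winner

        n = len(self.impls)
        index = st["calls"]
        impl = self.impls[index % n]  # round-robin over implementations
        st["calls"] += 1

        if index < self.warmup * n:
            return impl(*args)  # warmup phase: untimed

        start = time.perf_counter()  # active phase: timed
        result = impl(*args)
        elapsed = time.perf_counter() - start
        st["times"][impl] = st["times"].get(impl, 0.0) + elapsed

        if st["calls"] == (self.warmup + self.active) * n:
            # Selection: lowest accumulated active-phase time wins.
            st["winner"] = min(st["times"], key=st["times"].get)
        return result


toy = ToyDispatcher(get_geometry_hash=lambda xs: len(xs), warmup=1, active=1)


@toy.register
def double_mul(xs):
    return [x * 2 for x in xs]


@toy.register
def double_add(xs):
    return [x + x for x in xs]


# Calls 1-2 are warmup, calls 3-4 are timed, calls 5-6 use the cached winner.
results = [toy([1, 2, 3]) for _ in range(6)]
```

Note that every call, including warmup and timed calls, still returns a real result to the caller, which is why all implementations must agree.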

Forcing a specific implementation

For debugging or profiling, you can bypass the auto-tuning and force a specific implementation using the QD_PERFDISPATCH_FORCE environment variable:

QD_PERFDISPATCH_FORCE=my_op:my_op_v2 python my_script.py

The format is dispatcher_name:implementation_name, where dispatcher_name is the name of the meta-function and implementation_name is the name of the registered function.

To force implementations for multiple dispatchers, separate entries with commas:

QD_PERFDISPATCH_FORCE=my_op:my_op_v2,transform:transform_v1 python my_script.py

Dispatchers not listed in the env var will benchmark normally.
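As an illustration of the `dispatcher_name:implementation_name` format, the sketch below parses such a string into a dictionary. `parse_force_spec` is a hypothetical helper, not part of Quadrants, and silently skipping malformed entries is an assumption made for this example only.

```python
def parse_force_spec(spec: str) -> dict[str, str]:
    """Parse 'disp_a:impl_x,disp_b:impl_y' into {'disp_a': 'impl_x', 'disp_b': 'impl_y'}."""
    forced: dict[str, str] = {}
    for entry in spec.split(","):
        dispatcher, sep, implementation = entry.strip().partition(":")
        if not sep or not dispatcher or not implementation:
            continue  # assumption: ignore entries that are not 'name:name'
        forced[dispatcher] = implementation
    return forced


forced = parse_force_spec("my_op:my_op_v2,transform:transform_v1")
```

A dispatcher whose name is absent from the resulting dictionary corresponds to the "benchmark normally" case.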

Discovering available names

When QD_PERFDISPATCH_FORCE is set, all dispatchers automatically print their name and registered implementations at startup. To discover valid values, set the env var to any value and run the program:

QD_PERFDISPATCH_FORCE=? python my_script.py

This will produce output like:

perf_dispatch 'my_op': registered 'my_op_v1'
perf_dispatch 'my_op': registered 'my_op_v2'
perf_dispatch 'my_op': available implementations: ['my_op_v1', 'my_op_v2']
perf_dispatch 'transform': registered 'transform_v1'
perf_dispatch 'transform': registered 'transform_v2'
perf_dispatch 'transform': available implementations: ['transform_v1', 'transform_v2']

If the requested implementation name doesn’t match any registered function, a warning is printed and the dispatcher falls back to normal benchmarking.

Important notes

  • All registered implementations must produce identical results, including side effects. perf_dispatch does not verify this — incorrect results will be silently returned if implementations disagree.

  • Only one implementation runs per call, so implementations do not need to be idempotent.

  • perf_dispatch is not thread-safe. Do not call the same meta-function concurrently from multiple threads.

  • Set QD_PERFDISPATCH_PRINT_DEBUG=1 to print debug messages showing which implementation was registered and which was selected.
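Because agreement between implementations is not verified, it can help to check it in a test before registering them. The following is a minimal sketch that uses plain Python lists in place of Quadrants arrays; `check_implementations_agree` is a hypothetical helper, the in-place `(a, b)` calling convention mirrors the examples in this document, and the floating-point tolerance is an assumption.

```python
import copy
import math
from typing import Callable, Sequence


def check_implementations_agree(
    impls: Sequence[Callable],
    a: list,
    b: list,
    rel_tol: float = 1e-6,
) -> bool:
    """Run each implementation on fresh copies and compare outputs and return values."""
    outputs = []
    for impl in impls:
        a_copy, b_copy = copy.deepcopy(a), copy.deepcopy(b)
        ret = impl(a_copy, b_copy)
        outputs.append((ret, a_copy, b_copy))
    if not outputs:
        return True  # nothing to compare

    ref_ret, ref_a, ref_b = outputs[0]
    for ret, out_a, out_b in outputs[1:]:
        if ret != ref_ret or out_a != ref_a or len(out_b) != len(ref_b):
            return False
        if not all(math.isclose(x, y, rel_tol=rel_tol) for x, y in zip(out_b, ref_b)):
            return False
    return True


def double_mul(a, b):
    for i in range(len(a)):
        b[i] = a[i] * 2


def double_add(a, b):
    for i in range(len(a)):
        b[i] = a[i] + a[i]


def buggy_triple(a, b):
    for i in range(len(a)):
        b[i] = a[i] * 3  # deliberately wrong, to show a detected mismatch


data = [1.0, 2.5, -4.0]
agree = check_implementations_agree([double_mul, double_add], data, [0.0] * 3)
disagree = check_implementations_agree([double_mul, buggy_triple], data, [0.0] * 3)
```

Each implementation receives its own copies of the inputs so that side effects from one run cannot mask a difference in another.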

Complete example

import quadrants as qd

@qd.perf_dispatch(
    get_geometry_hash=lambda data, out: hash(data.shape),
    repeat_after_seconds=0,
)
def transform(data: qd.types.NDArray[qd.f32, 1], out: qd.types.NDArray[qd.f32, 1]): ...

@transform.register
@qd.kernel
def transform_v1(data: qd.types.NDArray[qd.f32, 1], out: qd.types.NDArray[qd.f32, 1]) -> None:
    for i in range(data.shape[0]):
        out[i] = data[i] * 2

@transform.register
@qd.kernel
def transform_v2(data: qd.types.NDArray[qd.f32, 1], out: qd.types.NDArray[qd.f32, 1]) -> None:
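    i = 0
    # Runs at about the same speed as transform_v1; shown only as an example
    while i < data.shape[0]:
        out[i] = data[i] + data[i]  # doubling by addition; bit shifts are not defined for f32
        i += 1  # advance the index so the loop terminates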

data = qd.ndarray(qd.f32, (1024,))
out = qd.ndarray(qd.f32, (1024,))

for _ in range(100):
    transform(data, out)