# Performance Dispatch

`@qd.perf_dispatch` is a decorator that automatically selects the fastest implementation of a function at runtime. When you have multiple implementations of the same operation (e.g. different algorithmic strategies, or platform-specific kernels), `perf_dispatch` benchmarks them and picks the winner.

## Basic usage

Define a meta-function with `@qd.perf_dispatch`. The function body should be empty — it serves only as a prototype declaring the signature and geometry hash:

```python
@qd.perf_dispatch(get_geometry_hash=lambda a, b: hash(a.shape + b.shape))
def my_op(a: qd.types.NDArray[qd.f32, 1], b: qd.types.NDArray[qd.f32, 1]): ...
```

Then register concrete implementations using `@my_op.register`. Each implementation must have the same parameter names as the prototype.

```python
@my_op.register
@qd.kernel
def my_op_v1(a: qd.types.NDArray[qd.f32, 1], b: qd.types.NDArray[qd.f32, 1]) -> None:
    for i in range(a.shape[0]):
        b[i] = a[i] * 2

@my_op.register
@qd.kernel
def my_op_v2(a: qd.types.NDArray[qd.f32, 1], b: qd.types.NDArray[qd.f32, 1]) -> None:
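    i = 0
    # runs at the same speed as my_op_v1; shown only as a second implementation
    while i < a.shape[0]:
        b[i] = a[i] + a[i]  # doubling via addition; `<<` is not defined for f32
        i += 1  # advance the index so the loop terminates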
```

Call the meta-function as though calling a standard Quadrants kernel:

```python
my_op(a, b)
```

On the first several calls, `perf_dispatch` will cycle through implementations (warming up, then timing). Once all implementations have been timed, the fastest is cached and used until re-evaluation is triggered (see [Tuning parameters](#tuning-parameters)).

## Decorator order

When registering a `@qd.kernel`, the `@my_op.register` decorator must be the **topmost** decorator:

```python
# Correct
@my_op.register
@qd.kernel
def impl(...) -> None: ...

# Wrong — will raise QuadrantsSyntaxError
@qd.kernel
@my_op.register
def impl(...) -> None: ...
```

## Registering plain Python functions

Implementations do not have to be `@qd.kernel`. Plain Python functions work too. You can mix kernel and Python implementations under the same meta-function:

```python
@my_op.register
def my_op_python(a: qd.types.NDArray[qd.f32, 1], b: qd.types.NDArray[qd.f32, 1]) -> None:
    for i in range(a.shape[0]):
        b[i] = a[i] * 2
```

## Return values

Implementations are not limited to returning `None`. The return value of whichever implementation runs is passed through to the caller:

```python
@qd.perf_dispatch(get_geometry_hash=lambda a, b: hash(a.shape + b.shape))
def my_op(a: qd.types.NDArray[qd.f32, 1], b: qd.types.NDArray[qd.f32, 1]) -> int: ...

@my_op.register
def my_op_python(a: qd.types.NDArray[qd.f32, 1], b: qd.types.NDArray[qd.f32, 1]) -> int:
    for i in range(a.shape[0]):
        b[i] = a[i] * 2
    return a.shape[0]

count = my_op(a, b)  # receives the return value from whichever impl runs
```

Note that the return type must be the same for the meta-function and for every registered implementation.

## Compatibility filtering

Some implementations may only work under certain conditions (specific platforms, input shapes, etc.). Use the `is_compatible` parameter to declare when an implementation is eligible:

```python
@my_op.register(is_compatible=lambda a, b: a.shape[0] >= 1024)
@qd.kernel
def my_op_large(a: qd.types.NDArray[qd.f32, 1], b: qd.types.NDArray[qd.f32, 1]) -> None:
    # optimized for large inputs
    ...
```

`is_compatible` receives the same arguments as the meta-function and must return `True` or `False`. If an implementation cannot handle certain inputs, `is_compatible` **must** be provided and must return `False` for those inputs. Implementations without `is_compatible` are assumed to always be compatible.

If only one implementation is compatible for a given call, it is used immediately without benchmarking.
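
The filtering rule can be modeled in plain Python. The sketch below is not the library's implementation: `Impl` and `eligible` are hypothetical names, and shapes are passed as plain tuples so the block runs without Quadrants installed.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Impl:
    """Hypothetical stand-in for a registered implementation."""
    name: str
    is_compatible: Optional[Callable[..., bool]] = None


def eligible(impls, *args):
    """Return the implementations whose predicate accepts these arguments.

    An implementation without a predicate is treated as always compatible.
    """
    return [
        impl for impl in impls
        if impl.is_compatible is None or impl.is_compatible(*args)
    ]


impls = [
    Impl("my_op_v1"),
    Impl("my_op_large", is_compatible=lambda a_shape, b_shape: a_shape[0] >= 1024),
]

small = eligible(impls, (16,), (16,))      # only my_op_v1 qualifies
large = eligible(impls, (4096,), (4096,))  # both qualify and would be benchmarked
print([impl.name for impl in small])
print([impl.name for impl in large])
```

In this model, a call with a small input leaves a single candidate, which corresponds to the case where benchmarking is skipped.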

## Geometry hash

The `get_geometry_hash` function maps call arguments to a hash representing the "geometry" of the input. Different geometries are benchmarked independently, so `perf_dispatch` can select different winners for different input shapes or configurations.

```python
# Different shapes benchmark independently
@qd.perf_dispatch(get_geometry_hash=lambda a, b: hash(a.shape + b.shape))
def my_op(a: qd.types.NDArray[qd.f32, 1], b: qd.types.NDArray[qd.f32, 1]): ...
```

Guidelines for `get_geometry_hash`:

- Return a constant (e.g. `0`) if all inputs have the same performance characteristics — a single winner will be chosen.
- Hash input shapes when different shapes may favor different implementations.
- **Avoid reading GPU data** in the hash function, as this creates a GPU sync point and will severely degrade performance. Prefer metadata like `.shape` which is available on the CPU.
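
As an illustration of how the choice of key groups calls together, the sketch below uses a hypothetical `FakeArray` stand-in that exposes only `.shape`. It shows one property of the concatenation form: for 1-D arguments it is unambiguous, but with mixed ranks two different geometries can produce the same key. Hashing a tuple of shapes is one possible alternative, offered here as a suggestion rather than a library requirement.

```python
from typing import NamedTuple, Tuple


class FakeArray(NamedTuple):
    """Hypothetical stand-in exposing only CPU-side shape metadata."""
    shape: Tuple[int, ...]


def concat_key(a, b):
    # Same form as the example above: concatenate the two shapes.
    return a.shape + b.shape


def nested_key(a, b):
    # Alternative: keep each argument's shape as a separate element.
    return (a.shape, b.shape)


# Equal shapes always give equal hashes, so they share one benchmark result.
x, y = FakeArray((1024,)), FakeArray((1024,))
same_geometry = hash(concat_key(x, y)) == hash(concat_key(y, x))

# Mixed ranks: (2, 3) + (4,) and (2,) + (3, 4) both concatenate to (2, 3, 4).
p, q = FakeArray((2, 3)), FakeArray((4,))
r, s = FakeArray((2,)), FakeArray((3, 4))
concat_is_ambiguous = concat_key(p, q) == concat_key(r, s)
nested_is_ambiguous = nested_key(p, q) == nested_key(r, s)
print(same_geometry, concat_is_ambiguous, nested_is_ambiguous)
```

Both key functions read only shape metadata, so neither touches GPU data.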

## Tuning parameters

`@qd.perf_dispatch` accepts several optional parameters to control the benchmarking process:

| Parameter | Default | Description |
|---|---|---|
| `warmup` | 3 | Number of untimed warmup calls per implementation before measuring. |
| `active` | 1 | Number of timed calls per implementation. |
| `repeat_after_count` | 0 | Re-run benchmarking after this many additional calls. 0 disables. |
| `repeat_after_seconds` | 1.0 | Re-run benchmarking after this many seconds have elapsed. 0 disables. |

Example with custom tuning:

```python
@qd.perf_dispatch(
    get_geometry_hash=lambda a, b: hash(a.shape),
    warmup=5,
    active=2,
    repeat_after_seconds=60.0,
)
def my_op(a: qd.types.NDArray[qd.f32, 1], b: qd.types.NDArray[qd.f32, 1]): ...
```
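
To estimate how many calls a benchmarking round consumes before a winner is cached, the arithmetic can be sketched as below. It assumes the phases described in "How benchmarking works" and that each call to the meta-function runs exactly one implementation. `calls_per_benchmark_round` is a hypothetical helper, not part of the library.

```python
def calls_per_benchmark_round(n_compatible: int, warmup: int = 3, active: int = 1) -> int:
    """Calls consumed by one warmup + active cycle for a single geometry."""
    if n_compatible <= 1:
        # A single compatible implementation is used without benchmarking.
        return 0
    return n_compatible * (warmup + active)


default_budget = calls_per_benchmark_round(2)                     # 2 * (3 + 1)
custom_budget = calls_per_benchmark_round(2, warmup=5, active=2)  # 2 * (5 + 2)
print(default_budget, custom_budget)
```

Larger `warmup` and `active` values may give steadier timings, at the cost of more calls spent on implementations that end up losing.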

## How benchmarking works

1. **Warmup phase**: Each compatible implementation is called `warmup` times in round-robin order. These calls are not timed.
2. **Active phase**: Each compatible implementation is called `active` times in round-robin order. The GPU is synchronized before and after each call to get accurate wall-clock measurements.
3. **Selection**: The implementation with the lowest active-phase time is cached as the winner for that geometry hash.
4. **Steady state**: Subsequent calls with the same geometry go directly to the cached winner, with no further benchmarking.
5. **Re-evaluation** (optional): After `repeat_after_count` calls or `repeat_after_seconds` seconds, the entire warmup + active cycle restarts from scratch, allowing the dispatcher to adapt if conditions change.
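
The first four steps can be modeled with a small pure-Python class. This is a simplified, hypothetical sketch, not the library's code. It ignores geometry hashes, compatibility filtering, GPU synchronization, and re-evaluation. `ToyDispatcher` is an invented name, and the injectable `clock` exists only to make the example deterministic.

```python
import time


class ToyDispatcher:
    """Toy model of the warmup, active, and steady-state phases."""

    def __init__(self, impls, warmup=3, active=1, clock=time.perf_counter):
        self.impls = list(impls)
        self.warmup = warmup
        self.active = active
        self.clock = clock
        self.calls = 0                          # benchmarking calls made so far
        self.elapsed = [0.0] * len(self.impls)  # active-phase time per implementation
        self.winner = None

    def __call__(self, *args):
        if self.winner is not None:
            return self.winner(*args)           # steady state: cached winner
        n = len(self.impls)
        index = self.calls % n                  # round-robin over implementations
        timed = self.calls >= n * self.warmup   # warmup calls are not timed
        start = self.clock()
        result = self.impls[index](*args)
        if timed:
            self.elapsed[index] += self.clock() - start
        self.calls += 1
        if self.calls == n * (self.warmup + self.active):
            best = min(range(n), key=self.elapsed.__getitem__)
            self.winner = self.impls[best]      # selection: lowest active-phase time
        return result


fake_now = [0.0]  # fake clock, advanced by the implementations themselves


def slow(x):
    fake_now[0] += 5.0  # pretend this call took 5 time units
    return x * 2


def fast(x):
    fake_now[0] += 1.0  # pretend this call took 1 time unit
    return x * 2


dispatch = ToyDispatcher([slow, fast], warmup=3, active=1, clock=lambda: fake_now[0])
results = [dispatch(21) for _ in range(10)]
print(dispatch.winner.__name__, dispatch.calls)
```

Every call returns a real result, including warmup and active calls. In this model, too, all implementations have to agree on their output.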

## Forcing a specific implementation

For debugging or profiling, you can bypass the auto-tuning and force a specific implementation using the `QD_PERFDISPATCH_FORCE` environment variable:

```bash
QD_PERFDISPATCH_FORCE=my_op:my_op_v2 python my_script.py
```

The format is `dispatcher_name:implementation_name`, where `dispatcher_name` is the name of the meta-function and `implementation_name` is the name of the registered function.

To force implementations for multiple dispatchers, separate entries with commas:

```bash
QD_PERFDISPATCH_FORCE=my_op:my_op_v2,transform:transform_v1 python my_script.py
```

Dispatchers not listed in the env var will benchmark normally.
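
To make the value format concrete, the sketch below splits such a string into a mapping from dispatcher name to implementation name. `parse_force_spec` is a hypothetical helper written for illustration. The library's own parsing may differ, for example in how it treats whitespace or malformed entries.

```python
def parse_force_spec(spec: str) -> dict:
    """Split 'dispatcher:impl,dispatcher:impl' into a {dispatcher: impl} mapping.

    Entries without both a dispatcher and an implementation name are ignored
    in this sketch.
    """
    forced = {}
    for entry in spec.split(","):
        dispatcher, sep, impl = entry.strip().partition(":")
        if sep and dispatcher and impl:
            forced[dispatcher] = impl
    return forced


forced = parse_force_spec("my_op:my_op_v2,transform:transform_v1")
print(forced)
```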

### Discovering available names

When `QD_PERFDISPATCH_FORCE` is set, all dispatchers automatically print their name and registered implementations at startup. To discover valid values, set the env var to any value and run the program:

```bash
QD_PERFDISPATCH_FORCE=? python my_script.py
```

This will produce output like:

```
perf_dispatch 'my_op': registered 'my_op_v1'
perf_dispatch 'my_op': registered 'my_op_v2'
perf_dispatch 'my_op': available implementations: ['my_op_v1', 'my_op_v2']
perf_dispatch 'transform': registered 'transform_v1'
perf_dispatch 'transform': registered 'transform_v2'
perf_dispatch 'transform': available implementations: ['transform_v1', 'transform_v2']
```

If the requested implementation name doesn't match any registered function, a warning is printed and the dispatcher falls back to normal benchmarking.

## Important notes

- All registered implementations **must produce identical results**, including side effects. `perf_dispatch` does not verify this — incorrect results will be silently returned if implementations disagree.
- Only one implementation runs per call. Implementations do not need to be idempotent.
- `perf_dispatch` is **not thread-safe**. Do not call the same meta-function concurrently from multiple threads.
- Set `QD_PERFDISPATCH_PRINT_DEBUG=1` to print debug messages showing which implementation was registered and which was selected.
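
Because `perf_dispatch` does not verify that implementations agree, it can be worth checking this in a test before registering them. The sketch below shows one possible harness using plain Python lists as stand-ins for arrays. `check_implementations_agree` and `make_args` are hypothetical names, and comparing real Quadrants arrays would additionally need a host-side copy, which is not shown.

```python
import copy


def check_implementations_agree(impls, make_args):
    """Run every implementation on fresh inputs; compare results and side effects.

    `make_args` returns a new argument tuple on each call, so one
    implementation's writes cannot leak into the next run.
    """
    reference = None
    for impl in impls:
        args = make_args()
        returned = impl(*args)
        observed = (returned, copy.deepcopy(args))  # return value plus mutated arguments
        if reference is None:
            reference = (impl.__name__, observed)
        elif observed != reference[1]:
            raise AssertionError(f"{impl.__name__} disagrees with {reference[0]}")
    return True


def double_mul(a, b):
    for i in range(len(a)):
        b[i] = a[i] * 2


def double_add(a, b):
    for i in range(len(a)):
        b[i] = a[i] + a[i]


ok = check_implementations_agree(
    [double_mul, double_add],
    make_args=lambda: ([1.0, 2.5, -3.0], [0.0, 0.0, 0.0]),
)
print(ok)
```

Exact equality is used here because both variants compute the same floating-point values. Implementations that reorder arithmetic may need a tolerance-based comparison instead.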

## Complete example

```python
import quadrants as qd

@qd.perf_dispatch(
    get_geometry_hash=lambda data, out: hash(data.shape),
    repeat_after_seconds=0,
)
def transform(data: qd.types.NDArray[qd.f32, 1], out: qd.types.NDArray[qd.f32, 1]): ...

@transform.register
@qd.kernel
def transform_v1(data: qd.types.NDArray[qd.f32, 1], out: qd.types.NDArray[qd.f32, 1]) -> None:
    for i in range(data.shape[0]):
        out[i] = data[i] * 2

@transform.register
@qd.kernel
def transform_v2(data: qd.types.NDArray[qd.f32, 1], out: qd.types.NDArray[qd.f32, 1]) -> None:
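    i = 0
    # runs at the same speed as transform_v1; shown only as a second implementation
    while i < data.shape[0]:
        out[i] = data[i] + data[i]  # doubling via addition; `<<` is not defined for f32
        i += 1  # advance the index so the loop terminates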

data = qd.ndarray(qd.f32, (1024,))
out = qd.ndarray(qd.f32, (1024,))

for _ in range(100):
    transform(data, out)
```