# Getting started

## Installation

### Prerequisites

- a supported platform (macOS ARM64, Linux x64, Windows x64), see [Supported systems](./supported_systems.md)
- a supported Python version installed, see [Supported systems](./supported_systems.md)
- optionally, a supported GPU, see [Supported systems](./supported_systems.md)

### Procedure

```bash
pip install quadrants
```

### Sanity-checking the installation

```bash
python -c 'import quadrants as qd; qd.init(arch=qd.gpu)'
```

This should not show any error messages.

## A first Quadrants kernel

Let's use a linear congruential generator (LCG) - a pseudo-random number generator - since it is hard for a compiler to optimize away. First, in normal Python:

```python
def lcg_np(B: int, lcg_its: int, a: npt.NDArray) -> None:
    for i in range(B):
        x = a[i]
        for j in range(lcg_its):
            x = (1664525 * x + 1013904223) % 2147483647
        a[i] = x
```

We take a numpy array of size B and loop over it. For each value in the array, we run `lcg_its` iterations (1,000 in our benchmark) of the LCG, then write the result back.
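To get a feel for the recurrence itself, here is a minimal pure-Python sketch of a single LCG stream (the helper name `lcg_stream` is just for illustration, not part of any library):

```python
def lcg_stream(seed: int, n: int) -> list[int]:
    """Generate n values of the LCG x -> (1664525*x + 1013904223) mod 2147483647."""
    values = []
    x = seed
    for _ in range(n):
        x = (1664525 * x + 1013904223) % 2147483647
        values.append(x)
    return values

# The stream is deterministic: the same seed always produces the same
# sequence, and every value stays within [0, 2147483646].
print(lcg_stream(42, 3))
```

Because each output depends on the previous one through a multiply, an add, and a modulo, the compiler cannot collapse the inner loop into a closed-form expression, which makes it a reasonable micro-benchmark workload.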
Let's write out the full code, including creating the numpy array and timing the function:

```python
import time

import numpy as np
import numpy.typing as npt


def lcg_np(B: int, lcg_its: int, a: npt.NDArray) -> None:
    for i in range(B):
        x = a[i]
        for j in range(lcg_its):
            x = (1664525 * x + 1013904223) % 2147483647
        a[i] = x


B = 16000
a = np.zeros((B,), np.int32)

start = time.time()
lcg_np(B, 1000, a)
end = time.time()
print("elapsed", end - start)
```

You can also find the full code at [lcg_python.py](../../../python/quadrants/examples/lcg_python.py).

On a MacBook Air M4, this gives the following output:

```text
# elapsed 5.552601099014282
```

Now let's convert it to Quadrants. Here is the same function, written as a Quadrants kernel:

```python
@qd.kernel
def lcg_ti(B: int, lcg_its: int, a: qd.types.NDArray[qd.i32, 1]) -> None:
    for i in range(B):
        x = a[i]
        for j in range(lcg_its):
            x = (1664525 * x + 1013904223) % 2147483647
        a[i] = x
```

Yes, it's the same, except that we:

- added the `@qd.kernel` decorator
- changed the parameter type from `npt.NDArray` to `qd.types.NDArray[qd.i32, 1]`

Before we run this, we need to import Quadrants and initialize it:

```python
import quadrants as qd

qd.init(arch=qd.gpu)
```

The `arch` parameter lets you choose between `gpu`, `cpu`, `metal`, `cuda`, `vulkan`, `amdgpu`, `x64`, and `arm64`. Passing `qd.gpu` will use the first GPU it finds.

We'll also need to create a Quadrants ndarray:

```python
a = qd.ndarray(qd.i32, (B,))
```

Compared with creating a numpy array:

- the parameters are reversed (dtype first, then shape)
- we use `qd.i32` instead of `np.int32`

When we time the kernel, we have to be careful:

- calling the kernel function `lcg_ti()` launches the kernel ...
- ... but it does not wait for the kernel to finish

We only wait for the kernel to finish when we access data it produced, or when we call an explicit synchronization function, such as `qd.sync()`.
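A pure-Python analogy may help here: launching a background thread returns immediately, just like launching a kernel, so timing without waiting only measures the launch overhead. This sketch is purely illustrative (the thread and sleep stand in for the GPU and its work; none of these names come from Quadrants):

```python
import threading
import time


def fake_kernel() -> None:
    time.sleep(0.2)  # stand-in for the GPU doing real work


t = threading.Thread(target=fake_kernel)

start = time.time()
t.start()                             # returns immediately, like a kernel launch
launch_elapsed = time.time() - start  # tiny: we only measured the launch

t.join()                              # wait for completion, like qd.sync()
total_elapsed = time.time() - start   # ~0.2s: the actual work is included

print("after launch:", launch_elapsed)
print("after join:", total_elapsed)
```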
So let's do that:

```python
qd.sync()
end = time.time()
```

In addition, although it looks like we aren't using the GPU before this point, in fact we are: the ndarray needs to be created in GPU memory, and that also happens asynchronously. So before recording the start time, we add a `qd.sync()` as well:

```python
qd.sync()
start = time.time()
```

The full program then becomes:

```python
import time

import quadrants as qd


@qd.kernel
def lcg_ti(B: int, lcg_its: int, a: qd.types.NDArray[qd.i32, 1]) -> None:
    for i in range(B):
        x = a[i]
        for j in range(lcg_its):
            x = (1664525 * x + 1013904223) % 2147483647
        a[i] = x


qd.init(arch=qd.gpu)

B = 16000
a = qd.ndarray(qd.i32, (B,))

qd.sync()
start = time.time()
lcg_ti(B, 1000, a)
qd.sync()
end = time.time()
print("elapsed", end - start)
```

When run on a MacBook Air M4, the output is something like:

```text
# [Quadrants] version 1.8.0, llvm 15.0.7, commit 5afed1c9, osx, python 3.10.16
# [Quadrants] Starting on arch=metal
# elapsed 0.04660296440124512
```

That's around 120x faster than the numpy version. On one of our Linux boxes with a 5090 GPU, the results are:

- numpy: 6.90 seconds
- Quadrants: 0.0199 seconds
- => 346 times faster

### What does Quadrants do with the kernel function?

- any top-level for loops are parallelized across the GPU cores (or CPU cores, if you run on CPU)
- in our case, that means 16,000 threads, compared to just a single thread in the numpy case

### Fields: even faster

Quadrants ndarrays are easy to use and flexible, but we can gain roughly another ~30% of speed (depending on the kernel) by using fields.

The kernel above doesn't load or store data except at the start and end: it mostly exercises the GPU ALUs. To see the difference between Quadrants ndarray and Quadrants field runtime speed, we need a kernel that does more loads and stores. We'll use a simple kernel that copies from one tensor to another.
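Before benchmarking, note that the sync / start / run / sync / stop pattern from the previous section can be wrapped in a small helper. This is a hedged sketch in plain Python: the `sync` parameter stands in for whatever synchronization call your backend needs (`qd.sync` for Quadrants, or a no-op for ordinary CPU code), and `timed` is just an illustrative name:

```python
import time
from typing import Callable


def timed(fn: Callable[[], None], sync: Callable[[], None] = lambda: None) -> float:
    """Time fn(), synchronizing before starting and after finishing.

    sync() is called before starting the clock (so pending asynchronous
    work, such as buffer allocation, is not counted) and again before
    stopping it (so we wait for the launched work itself to finish).
    """
    sync()
    start = time.time()
    fn()
    sync()
    return time.time() - start


# usage with ordinary CPU code (no synchronization needed):
elapsed = timed(lambda: sum(range(1_000_000)))
print("elapsed", elapsed)
```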
To avoid simply measuring the latency of reading and writing global memory, we'll read and write the same values repeatedly.

```python
import argparse
import time

import quadrants as qd

qd.init(arch=qd.gpu)

parser = argparse.ArgumentParser()
parser.add_argument("--use-field", action="store_true")
args = parser.parse_args()
use_field = args.use_field

if use_field:
    V = qd.field
    ParamType = qd.Template
else:
    V = qd.ndarray
    ParamType = qd.types.NDArray[qd.i32, 1]


@qd.kernel
def copy_memory(N: int, a: ParamType, b: ParamType) -> None:
    for n in range(N):
        b[n % 100] = a[n % 100]


N = 20_000
a = V(qd.i32, (100,))
b = V(qd.i32, (100,))

# warmup
copy_memory(N, a, b)

num_its = 1000
qd.sync()
start = time.time()
for it in range(num_its):
    copy_memory(N, a, b)
qd.sync()
end = time.time()
iteration_time = (end - start) / num_its * 1_000_000
print("iteration time", iteration_time, "us")
```

Here are the outputs, using a 5090 GPU, on Ubuntu:

```text
$ python doc/mem_copy.py
[Quadrants] version 1.8.0, llvm 15.0.4, commit b4755383, linux, python 3.10.15
[Quadrants] Starting on arch=cuda
iteration time 29.45709228515625 us

$ python doc/mem_copy.py --use-field
[Quadrants] version 1.8.0, llvm 15.0.4, commit b4755383, linux, python 3.10.15
[Quadrants] Starting on arch=cuda
iteration time 21.44002914428711 us
```

=> in this test, fields take around 27% less time per iteration than ndarrays.

Technical note: the exact ratio depends on the kernel. It's also possible to construct toy examples like this where ndarrays appear to be faster than fields, but in many commonly used kernels, such as the Genesis func narrow-phase kernel for collisions, we observe that fields are around ~30% faster.
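To make the comparison concrete, here is the arithmetic behind those percentages, with the timings copied from the runs above:

```python
ndarray_us = 29.45709228515625  # ndarray iteration time, from the run above
field_us = 21.44002914428711    # field iteration time, from the run above

# fields take ~27% less time per iteration ...
reduction_pct = (ndarray_us - field_us) / ndarray_us * 100

# ... which is the same thing as a ~1.37x speedup over ndarrays
speedup = ndarray_us / field_us

print(f"time reduction: {reduction_pct:.1f}%")
print(f"speedup: {speedup:.2f}x")
```

Whether you quote the time reduction or the speedup ratio, be consistent: a 27% reduction in iteration time corresponds to roughly a 1.37x speedup.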