Getting started#

Installation#

Prerequisites#

Procedure#

pip install quadrants

Sanity checking the installation#

python -c 'import quadrants as qd; qd.init(arch=qd.gpu)'

This should complete without printing any error messages.

A first Quadrants kernel#

Let’s use a linear congruential generator (LCG), a pseudo-random number generator, since it is not easy for a compiler to optimize away. First, in plain Python:

def lcg_np(B: int, lcg_its: int, a: npt.NDArray) -> None:
    for i in range(B):
        x = a[i]
        for j in range(lcg_its):
            x = (1664525 * x + 1013904223) % 2147483647
        a[i] = x

We take in a numpy array of size B and loop over it. For each value in the array, we run lcg_its iterations of the LCG (1000 in the example below), then write the result back into the array.
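Each LCG step applies the recurrence x ← (1664525·x + 1013904223) mod 2147483647, so every value stays below 2**31. As a quick standalone check in plain Python (independent of any library):

```python
def lcg_step(x: int) -> int:
    # one step of the linear congruential generator used throughout this page
    return (1664525 * x + 1013904223) % 2147483647

x = 1  # arbitrary seed
for _ in range(3):
    x = lcg_step(x)
    print(x)
# the first printed value is 1015568748 (= 1664525 * 1 + 1013904223)
```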

Let’s write out the full code, including creating a numpy array, and timing this method:

import numpy as np
import numpy.typing as npt
import time


def lcg_np(B: int, lcg_its: int, a: npt.NDArray) -> None:
    for i in range(B):
        x = a[i]
        for j in range(lcg_its):
            x = (1664525 * x + 1013904223) % 2147483647
        a[i] = x

B = 16000
a = np.zeros((B,), np.int32)

start = time.time()
lcg_np(B, 1000, a)
end = time.time()
print("elapsed", end - start)

You can also find the full code at lcg_python.py.

On a MacBook Air M4, this gives the following output:

# elapsed 5.552601099014282

Now let’s convert it to Quadrants.

Here is the function, written as a Quadrants kernel:

@qd.kernel
def lcg_ti(B: int, lcg_its: int, a: qd.types.NDArray[qd.i32, 1]) -> None:
    for i in range(B):
        x = a[i]
        for j in range(lcg_its):
            x = (1664525 * x + 1013904223) % 2147483647
        a[i] = x

Yes, it’s the same except:

  • added @qd.kernel annotation

  • changed type from npt.NDArray to qd.types.NDArray[qd.i32, 1]

Before we run this we need to import quadrants, and initialize it:

import quadrants as qd

qd.init(arch=qd.gpu)

The arch parameter lets you choose between gpu, cpu, metal, cuda, vulkan, amdgpu, x64, arm64.

  • using qd.gpu will use the first GPU it finds

We’ll also need to create a quadrants ndarray:

a = qd.ndarray(qd.i32, (B,))

Compared with the numpy array:

  • the parameters are reversed

  • we use qd.i32 instead of np.int32

When we time the kernel we have to be careful:

  • running the kernel function lcg_ti() starts the kernel

  • … but it does not wait for it to finish

We’ll only wait for it to finish when we access data from the kernel, or we call an explicit synchronization function, like qd.sync(). So let’s do that:

qd.sync()
end = time.time()

In addition, whilst it looks like we aren’t using the GPU before this point, in fact we are: when we create the ndarray, it is allocated in GPU memory, and this also happens asynchronously. So before recording start we add another qd.sync():

qd.sync()
start = time.time()
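The launch-versus-finish distinction is easy to reproduce in plain Python, using a background thread as a stand-in for the GPU (this is only an analogy, not the Quadrants API): timing right after the launch measures almost nothing, while timing after the join measures the real work.

```python
import threading
import time

def fake_kernel(result: list) -> None:
    # stand-in for asynchronous GPU work
    time.sleep(0.2)
    result.append(42)

result = []
t = threading.Thread(target=fake_kernel, args=(result,))

start = time.time()
t.start()                                   # "launch" returns immediately
elapsed_no_sync = time.time() - start

t.join()                                    # explicit wait, like qd.sync()
elapsed_synced = time.time() - start

print(f"without sync: {elapsed_no_sync:.4f} s")  # close to zero
print(f"with sync:    {elapsed_synced:.4f} s")   # at least 0.2 s
```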

The full program then becomes:

import quadrants as qd
import time


@qd.kernel
def lcg_ti(B: int, lcg_its: int, a: qd.types.NDArray[qd.i32, 1]) -> None:
    for i in range(B):
        x = a[i]
        for j in range(lcg_its):
            x = (1664525 * x + 1013904223) % 2147483647
        a[i] = x

qd.init(arch=qd.gpu)

B = 16000
a = qd.ndarray(qd.i32, (B,))

qd.sync()
start = time.time()

lcg_ti(B, 1000, a)

qd.sync()
end = time.time()
print("elapsed", end - start)

When run on a MacBook Air M4, the output is something like:

# [Quadrants] version 1.8.0, llvm 15.0.7, commit 5afed1c9, osx, python 3.10.16
# [Quadrants] Starting on arch=metal
# elapsed 0.04660296440124512

Around 120x faster.

On one of our Linux boxes with a 5090 GPU, the results are:

  • numpy: 6.90 seconds

  • quadrants: 0.0199 seconds

  • => 346 times faster

What does Quadrants do with the kernel function?#

  • any top-level for loops are parallelized across the GPU cores (or CPU cores, when running on CPU)

    • in our case, there will be 16,000 threads

    • compared to just a single thread in the numpy case
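As a rough analogy for this parallelization (not the Quadrants implementation), the outer loop can be decomposed in plain Python with a thread pool, where each index i becomes an independent task. Note that CPython threads won’t actually speed up this CPU-bound loop, but the decomposition across workers is the same idea:

```python
from concurrent.futures import ThreadPoolExecutor

def lcg_one(x: int, lcg_its: int) -> int:
    # the per-element body of the kernel's outer loop
    for _ in range(lcg_its):
        x = (1664525 * x + 1013904223) % 2147483647
    return x

B = 8
a = list(range(B))

# sequential baseline
expected = [lcg_one(v, 10) for v in a]

# each index becomes one task, like one GPU thread per i
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(lambda v: lcg_one(v, 10), a))

assert results == expected  # same values, independent of execution order
```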

Fields: even faster#

Quadrants ndarrays are easy to use and flexible, but we can gain roughly another 30% in speed (depending on the kernel) by using fields.

The kernel above doesn’t load or store data except at the start and end: it’s just exercising the GPU ALU. To see the difference between Quadrants ndarray and Quadrants field runtime speed, we need a kernel that does more loads and stores.

We’ll do a simple kernel that copies from one tensor to another. To avoid simply measuring the latency to read and write from/to global memory, we’ll read and write the same values repeatedly.

import argparse
import time
import quadrants as qd

qd.init(arch=qd.gpu)

parser = argparse.ArgumentParser()
parser.add_argument("--use-field", action="store_true")
args = parser.parse_args()

use_field = args.use_field
if use_field:
    V = qd.field
    ParamType = qd.Template
else:
    V = qd.ndarray
    ParamType = qd.types.NDArray[qd.i32, 1]

@qd.kernel
def copy_memory(N: int, a: ParamType, b: ParamType) -> None:
    for n in range(N):
        b[n % 100] = a[n % 100]

N = 20_000
a = V(qd.i32, (100,))
b = V(qd.i32, (100,))

# warmup
copy_memory(N, a, b)

num_its = 1000

qd.sync()
start = time.time()
for it in range(num_its):
    copy_memory(N, a, b)
qd.sync()
end = time.time()
iteration_time = (end - start) / num_its * 1_000_000
print("iteration time", iteration_time, "us")

Here are the outputs, using a 5090 GPU, on Ubuntu:

$ python doc/mem_copy.py
[Quadrants] version 1.8.0, llvm 15.0.4, commit b4755383, linux, python 3.10.15
[Quadrants] Starting on arch=cuda
iteration time 29.45709228515625 us

$ python doc/mem_copy.py --use-field
[Quadrants] version 1.8.0, llvm 15.0.4, commit b4755383, linux, python 3.10.15
[Quadrants] Starting on arch=cuda
iteration time 21.44002914428711 us

=> in this test, fields are around 28% faster than ndarrays.

Technical note: the exact ratio depends on the kernel. It’s also possible to construct toy examples like this one where ndarrays appear to be faster than fields, but across many commonly used kernels, such as the Genesis func narrow-phase kernel for collisions, we observe that fields are around 30% faster.
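A side note on the # warmup call in the benchmark above: the first invocation of a kernel typically includes one-time compilation, which is why it is excluded from the timed loop. The effect can be mimicked in plain Python; run_kernel and its caching dict below are hypothetical, purely for illustration:

```python
import time

_compiled = {}

def run_kernel(name: str) -> int:
    # first call pays a one-time "compilation" cost
    if name not in _compiled:
        time.sleep(0.1)            # stand-in for JIT compilation
        _compiled[name] = True
    # the actual work is cheap
    return sum(range(1000))

start = time.time()
run_kernel("copy_memory")
first = time.time() - start

start = time.time()
run_kernel("copy_memory")
second = time.time() - start

print(f"first call:  {first * 1e3:.2f} ms")   # includes the 100 ms "compile"
print(f"second call: {second * 1e3:.2f} ms")  # much faster
```

Timing only calls after the warmup, and averaging over many iterations as in the benchmark above, measures steady-state kernel time rather than compilation.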