Per-thread linear algebra#

Small matrix decompositions and linear solvers that run per thread, in registers, with no cross-thread cooperation — every thread independently decomposes / solves its own 2×2 .. 12×12 matrix. No shared memory, no syncs, no atomics on shared state, no warp / subgroup primitives. A 1000-element kernel runs 1000 copies of the algorithm in parallel, each on its own data.

This is a different category from the element-wise / arithmetic matrix operations covered in matrix_vector_per_thread (also per-thread, but closed-form rather than iterative), and from the cross-thread / sparse linear algebra under qd.linalg.* (CG, sparse direct solvers — covered separately).

Each entry point here implements an iterative or multi-step numerical algorithm (Jacobi sweeps, Gauss elimination, Givens rotations) rather than a single closed-form formula.

All ops live at the top level (qd.svd, qd.sym_eig, qd.make_spd, qd.polar_decompose, qd.eig, qd.solve) and are intended to be called from inside a @qd.kernel or @qd.func.

What’s available#

| Op | Operates on | Shapes | Returns |
|----|-------------|--------|---------|
| qd.svd(A) | square real matrix | 2×2, 3×3 | (U, S, V) such that A = U @ S @ V.transpose() |
| qd.sym_eig(A) | symmetric real matrix | 2×2 .. 12×12 | (eigenvalues, eigenvectors) (real) |
| qd.make_spd(A) | symmetric real matrix | 4×4 .. 12×12 | M, the closest positive semi-definite matrix to A |
| qd.polar_decompose(A) | square real matrix | 2×2, 3×3 | (R, S) such that A = R @ S, R orthogonal, S symmetric PSD |
| qd.eig(A) | square real matrix | 2×2 | (eigenvalues, eigenvectors) (complex, packed) |
| qd.solve(A, b) | square A + vector b | 2×2, 3×3 | x such that A @ x = b |

A few patterns to note:

  • Shapes are fixed. Calling any of these on a matrix outside the supported shapes raises an exception at compile time (e.g. "SVD only supports 2×2 and 3×3 matrices."). There is no fallback for larger shapes.

  • All ops accept an optional dt argument. When unspecified, it defaults to impl.get_runtime().default_fp — usually qd.f32 unless overridden in qd.init(). Pass dt=qd.f64 for the high-precision variant (see the snippet after this list).

  • Output shape matches the input shape. A 3×3 input yields 3×3 outputs (and a length-3 vector for solve / eigenvalues); a 2×2 input yields 2×2 outputs.

  • Real matrices only. qd.eig returns complex results in a packed real layout (see below); the others all assume real-valued input and return real-valued output.
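
For example, inside a kernel or func (a minimal sketch; A is any supported matrix expression):

U, S, V = qd.svd(A)               # uses default_fp, usually qd.f32
U, S, V = qd.svd(A, dt=qd.f64)    # explicit high-precision variant
eigvals, Q = qd.sym_eig(A, dt=qd.f64)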

Semantics#

qd.svd(A, dt=None)#

Singular value decomposition — produces (U, S, V) such that A = U @ S @ V.transpose(), with:

  • U orthogonal (U @ U.transpose() = I).

  • V orthogonal.

  • S diagonal, with non-negative singular values (with one 3×3 sign caveat, noted below).

Shapes 2×2 and 3×3 use closed-form / Jacobi-style implementations specialized per shape.

Singular values come back sorted descending (S[0,0] >= S[1,1] >= ...) for both 2×2 and 3×3. The 3×3 path uses the Sifakis algorithm and absorbs sign(det(A)) into σ rather than into U / V, so the smallest entry of S may be negative when det(A) < 0; the descending sort is on the signed value, not the magnitude (the sort direction matches the 2×2 path and the NumPy / LAPACK convention).

For 3×3 the implementation enforces det(U) = det(V) = +1 regardless of input — useful for ARAP-style rotations R = U @ V.transpose() (which is then guaranteed to be a proper rotation). The 2×2 path does not enforce this convention; if you need a particular handedness on 2×2, check it explicitly and flip a column if needed.
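
For the 2×2 case, that fix-up can mirror the 3×3 ARAP example later on this page; a sketch:

U, S, V = qd.svd(A)               # A is 2×2
R = U @ V.transpose()
if R.determinant() < 0.0:         # reflection, not a proper rotation
    V[:, 1] *= -1.0               # flip the last column of V
    R = U @ V.transpose()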

qd.sym_eig(A, dt=None)#

Symmetric eigendecomposition — for a real symmetric A, returns (eigenvalues, eigenvectors) with:

  • eigenvalues: a Vector(n) of real eigenvalues.

  • eigenvectors: a Matrix(n, n) whose columns are the corresponding orthonormal eigenvectors.

Eigenvalues come back sorted ascending (eigvals[0] <= eigvals[1] <= ...) for every shape, with column i of eigenvectors being the eigenvector for eigvals[i]. This matches NumPy / LAPACK’s eigh (note that qd.svd sorts σ descending, also matching its NumPy / LAPACK counterpart — the cross-op disagreement is the standard convention, not an inconsistency).

Three implementations dispatch by size:

  • 2×2 — closed-form via the trace / determinant identity.

  • 3×3 — closed-form Cardano method (Eigen3 computeDirect).

  • 4×4 .. 12×12 — cyclic Jacobi: 12 sweeps of Givens rotations zeroing every off-diagonal (p, q) pair, with Q := Q · J accumulated as eigenvectors. ~6 digits of accuracy in f32, ~12 digits in f64.

Calling at N >= 13 raises ("Symmetric eigen solver currently supports sizes up to 12×12.").

A is assumed symmetric; the implementation does not symmetrize first. If your matrix is only approximately symmetric (e.g. accumulated floating-point error), explicitly compute (A + A.transpose()) * 0.5 before calling.
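
That guard is one line before the call; a minimal sketch:

A_sym = (A + A.transpose()) * 0.5        # enforce exact symmetry
eigvals, Q = qd.sym_eig(A_sym, dt=qd.f64)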

qd.make_spd(A, dt=None)#

Project a symmetric matrix A to the closest positive semi-definite matrix in the Frobenius-norm sense. Implemented as Q · diag(max(λ, 0)) · Qᵀ where A = Q · diag(λ) · Qᵀ is the symmetric eigendecomposition computed by qd.sym_eig.

Available for shapes 4×4 .. 12×12 (it shares the cyclic-Jacobi path with qd.sym_eig; for 2×2 / 3×3 you can write the same projection by hand using qd.sym_eig directly — see the example below).

Use cases:

  • Projecting an indefinite contact / element Hessian to its closest SPD approximation before assembly (qipc-style).

  • Regularizing a covariance estimate that may have small negative eigenvalues from rounding.

  • Producing a usable preconditioner from a not-quite-SPD matrix.

make_spd is an idempotent Frobenius projector onto the PSD cone: make_spd(make_spd(A)) == make_spd(A). If A is already PSD it is returned essentially unchanged (up to sym_eig round-trip error); if A is negative-definite the result is the zero matrix.

qd.polar_decompose(A, dt=None)#

Polar decomposition — produces (R, S) such that A = R @ S, with:

  • R orthogonal (the closest-rotation factor, modulo handedness).

  • S symmetric positive semi-definite (the stretch factor).

Built on top of SVD: with A = U @ Σ @ Vᵀ, the factors are R = U @ Vᵀ and S = V @ Σ @ Vᵀ. The same caveats about sign convention as qd.svd apply — R is the closest orthogonal factor to A but is not guaranteed to be a proper rotation (positive determinant) without a manual fix-up.
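
Written out with qd.svd, the factors are (a sketch of the relationship; the internal code path may differ in detail):

U, Sigma, V = qd.svd(A)
R = U @ V.transpose()              # orthogonal factor
S = V @ Sigma @ V.transpose()      # symmetric PSD stretch factor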

qd.eig(A, dt=None)#

General eigendecomposition for 2×2 only. Returns:

  • eigenvalues: a Matrix(n, 2) where row i is (real_part, imaginary_part) of the i-th eigenvalue.

  • eigenvectors: a Matrix(n*2, n) where each column is an eigenvector with each entry expanded into two scalars (real, imaginary). For a 2×2 input, the eigenvector matrix is 4×2.

Eigenvalues of a real 2×2 matrix come in complex-conjugate pairs when the discriminant is negative; the packed layout keeps the function single-return-shape. Calling with n=3 raises — for 3×3 general (non-symmetric) eigendecomposition, there is no Quadrants entry point today; the canonical path is qd.sym_eig if your matrix happens to be symmetric.
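
Unpacking the 2×2 result might look like this (a sketch; it assumes entry k of an eigenvector is packed at rows 2k and 2k + 1):

eigvals, eigvecs = qd.eig(A)                    # A is 2×2
re0, im0 = eigvals[0, 0], eigvals[0, 1]         # eigenvalue 0 as (real, imag)
# Eigenvector 0 is column 0; assumed packing: entry k at rows 2k (re), 2k+1 (im).
v0x_re, v0x_im = eigvecs[0, 0], eigvecs[1, 0]   # component 0
v0y_re, v0y_im = eigvecs[2, 0], eigvecs[3, 0]   # component 1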

qd.solve(A, b, dt=None)#

Direct solve of A @ x = b via Gauss elimination with partial pivoting. Returns the solution vector x.

  • Shapes 2×2 and 3×3.

  • The implementation asserts A.n == A.m and A.m == b.n.

  • Singular A is checked by a kernel assert ("Matrix is singular in linear solve.") inside the Gauss-elimination path. Kernel asserts only fire when the runtime is initialised with qd.init(debug=True) (see debug); under the default debug=False a singular input silently produces a divide-by-zero / NaN result with no diagnostic. If you need a signal in production, check singularity explicitly before calling qd.solve (e.g. test abs(A.determinant()) against a tolerance; see the sketch after this list), or run development workloads with debug=True to catch the case.

  • Each call factorises A from scratch and back-substitutes for the given b.
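
Following that advice, a guarded wrapper might look like this (a sketch; the 1e-6 tolerance and the zero-vector fallback are illustrative choices, not library policy):

@qd.func
def safe_solve(A: qd.types.matrix(2, 2, qd.f32),
               b: qd.types.vector(2, qd.f32)) -> qd.types.vector(2, qd.f32):
    x = b * 0.0                          # fallback for near-singular A
    if abs(A.determinant()) > 1e-6:
        x = qd.solve(A, b)
    return x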

Examples#

Closest rotation to a 3×3 matrix (ARAP)#

@qd.func
def closest_rotation(A: qd.types.matrix(3, 3, qd.f64)) -> qd.types.matrix(3, 3, qd.f64):
    U, S, V = qd.svd(A, dt=qd.f64)
    R = U @ V.transpose()
    if R.determinant() < 0.0:
        V[:, 2] *= -1.0
        R = U @ V.transpose()
    return R

The R.determinant() < 0.0 branch fixes the handedness when SVD’s sign convention produces a reflection rather than a rotation.

Project to symmetric positive semi-definite#

For shapes 4×4 .. 12×12 (a typical qipc 12×12 contact Hessian), use qd.make_spd directly:

mat12 = qd.types.matrix(12, 12, qd.f64)


@qd.kernel
def project_each(
    H_field: qd.types.NDArray[mat12, 1],
    H_spd_field: qd.types.NDArray[mat12, 1],
) -> None:
    for i in range(H_field.shape[0]):
        H_spd_field[i] = qd.make_spd(H_field[i], dt=qd.f64)

Each thread eigen-decomposes its own 12×12 Hessian, clamps negative eigenvalues to zero, and reconstructs.

For 2×2 / 3×3 there is no qd.make_spd entry point (it shares the cyclic-Jacobi path that only kicks in at N≥4). Inline the projection using qd.sym_eig directly:

@qd.func
def make_spd_3x3(H: qd.types.matrix(3, 3, qd.f64)) -> qd.types.matrix(3, 3, qd.f64):
    eigvals, Q = qd.sym_eig(H, dt=qd.f64)
    for i in qd.static(range(3)):
        if eigvals[i] < 0.0:
            eigvals[i] = 0.0
    Lambda = qd.Matrix.zero(qd.f64, 3, 3)
    for i in qd.static(range(3)):
        Lambda[i, i] = eigvals[i]
    return Q @ Lambda @ Q.transpose()

Per-thread linear solve#

@qd.kernel
def solve_each(A_field: qd.types.NDArray[qd.types.matrix(2, 2, qd.f32), 1],
               b_field: qd.types.NDArray[qd.types.vector(2, qd.f32), 1],
               x_field: qd.types.NDArray[qd.types.vector(2, qd.f32), 1]) -> None:
    for i in range(A_field.shape[0]):
        x_field[i] = qd.solve(A_field[i], b_field[i])

Each thread does an independent Gauss elimination on its own 2×2 system. For larger systems a CG / PCG iteration over the whole array is the standard Quadrants pattern; qd.solve is for the per-element case.

2×2 polar decomposition for shape matching#

@qd.func
def shape_match(A: qd.types.matrix(2, 2, qd.f32)) -> qd.types.matrix(2, 2, qd.f32):
    R, _ = qd.polar_decompose(A)
    return R

The rotation factor R from A = R @ S is the rigid alignment that minimises ‖R - A‖_F — the building block of position-based dynamics shape-matching.

Shapes, performance, portability#

  • Compile time.

    • Closed-form ops (qd.svd, qd.sym_eig 2×2/3×3, qd.polar_decompose, qd.eig, qd.solve) — each call is unrolled per thread into a moderately large block of straight-line code; compile time is generally fine at these shapes.

    • Cyclic Jacobi (qd.sym_eig ≥4×4, qd.make_spd) — the per-pair Givens step is unrolled but the outer sweep loop is a runtime range, so compile time scales roughly with the number of (p, q) pairs in one sweep rather than with that count · MAX_SWEEPS. Concrete numbers on CUDA + LLVM 22.1: ~3 s at N=4, ~30 s at N=6, ~3 min at N=9, ~2 min at N=12 (yes, faster than N=9 — the per-pair body is dominated by if static(p < q): filtering).

  • Runtime cost. Cyclic Jacobi at N=12 with MAX_SWEEPS=12 does roughly 12 · 66 · 12 ≈ 9500 per-thread arithmetic ops — fast on any modern GPU, but if you’re calling it inside a hot kernel for a million elements that’s still ~10 GFLOP-equivalent. For larger matrices use a different algorithm (or call qd.linalg.* for a sparse-aware path).

  • Backend portability. All ops compile cleanly on CUDA, AMDGPU, Vulkan, and Metal — they are pure register arithmetic with no SIMT primitives, so there is no codegen split. The qd.sym_eig ≥4×4 / qd.make_spd paths have been verified at N ∈ {4,5,6,9,12} × {f32, f64} × five symmetric-matrix factories on CUDA + Vulkan + AMDGPU; Metal coverage is via the same parametrized tests.