# Per-thread linear algebra

Small matrix decompositions and linear solvers that run **per thread, in registers, with no cross-thread cooperation** — every thread independently decomposes / solves its own 2×2 .. 12×12 matrix. No shared memory, no syncs, no atomics on shared state, no warp / subgroup primitives. A 1000-element kernel runs 1000 copies of the algorithm in parallel, each on its own data.

This is a different category from the element-wise / arithmetic matrix operations covered in [matrix_vector_per_thread](matrix_vector_per_thread.md) (also per-thread, but closed-form rather than iterative), and from the cross-thread / sparse linear algebra under `qd.linalg.*` (CG, sparse direct solvers — covered separately). Each entry point here implements an *iterative or multi-step numerical algorithm* (Jacobi sweeps, Gauss elimination, Givens rotations) rather than a single closed-form formula.

All ops live at the top level (`qd.svd`, `qd.sym_eig`, `qd.make_spd`, `qd.polar_decompose`, `qd.eig`, `qd.solve`) and are intended to be called from inside a `@qd.kernel` or `@qd.func`.

## What's available

| Op                      | Operates on             | Shapes       | Returns                                                 |
|-------------------------|-------------------------|--------------|---------------------------------------------------------|
| `qd.svd(A)`             | square real matrix      | 2×2, 3×3     | `(U, S, V)` such that `A = U @ S @ V.transpose()`       |
| `qd.sym_eig(A)`         | symmetric real matrix   | 2×2 .. 12×12 | `(eigenvalues, eigenvectors)` (real)                    |
| `qd.make_spd(A)`        | symmetric real matrix   | 4×4 .. 12×12 | `M` — the closest positive semi-definite matrix to `A`  |
| `qd.polar_decompose(A)` | square real matrix      | 2×2, 3×3     | `(R, S)` such that `A = R @ S`, `R` orthogonal, `S` SPD |
| `qd.eig(A)`             | square real matrix      | 2×2          | `(eigenvalues, eigenvectors)` (complex, packed)         |
| `qd.solve(A, b)`        | square `A` + vector `b` | 2×2, 3×3     | `x` such that `A @ x = b`                               |

A few patterns to note:

- **Shapes are fixed.** Calling any of these on a matrix outside the supported shapes raises an exception at compile time (e.g. `"SVD only supports 2×2 and 3×3 matrices."`). There is no fallback for larger shapes.
- **All ops accept an optional `dt` argument.** When unspecified, it defaults to `impl.get_runtime().default_fp` — usually `qd.f32` unless overridden in `qd.init()`. Pass `dt=qd.f64` for the high-precision variant.
- **Output shape matches the input shape.** A 3×3 input yields 3×3 outputs (and a length-3 vector for `solve` / eigenvalues); a 2×2 input yields 2×2 outputs.
- **Real matrices only.** `qd.eig` returns complex results in a packed real layout (see below); the others all assume real-valued input and return real-valued output.

## Semantics

### `qd.svd(A, dt=None)`

Singular value decomposition — produces `(U, S, V)` such that `A = U @ S @ V.transpose()`, with:

- `U` orthogonal (`U @ U.transpose() = I`).
- `V` orthogonal.
- `S` diagonal, with non-negative singular values (but see the 3×3 sign caveat below).

Shapes 2×2 and 3×3 use closed-form / Jacobi-style implementations specialized per shape. Singular values come back sorted **descending** (`S[0,0] >= S[1,1] >= ...`) for both 2×2 and 3×3. The 3×3 path uses the Sifakis algorithm and absorbs `sign(det(A))` into σ rather than into `U` / `V`, so the smallest entry of `S` may be negative when `det(A) < 0`; the descending sort is on direct numeric value (matching the 2×2 path and NumPy / LAPACK conventions).
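As a plain-NumPy reference for the same factorization contract (this is NumPy's API, not qd's — note NumPy returns `Vᵀ` where `qd.svd` returns `V`):

```python
import numpy as np

# Reference check of the SVD contract with NumPy (not the qd API):
# A = U @ diag(S) @ V.T, with singular values sorted descending.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
U, S, Vt = np.linalg.svd(A)   # NumPy hands back V already transposed
assert S[0] >= S[1]           # descending sort, same convention as qd.svd
recon = U @ np.diag(S) @ Vt
assert np.allclose(recon, A)  # the factorization reconstructs A
```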
For 3×3 the implementation enforces `det(U) = det(V) = +1` regardless of input — useful for ARAP-style rotations `R = U @ V.transpose()` (which is then guaranteed to be a proper rotation). The 2×2 path does not enforce this convention; if you need a particular handedness on 2×2, check it explicitly and flip a column if needed.

### `qd.sym_eig(A, dt=None)`

Symmetric eigendecomposition — for a real symmetric `A`, returns `(eigenvalues, eigenvectors)` with:

- `eigenvalues`: a `Vector(n)` of real eigenvalues.
- `eigenvectors`: a `Matrix(n, n)` whose columns are the corresponding orthonormal eigenvectors.

Eigenvalues come back sorted **ascending** (`eigvals[0] <= eigvals[1] <= ...`) for every shape, with column `i` of `eigenvectors` being the eigenvector for `eigvals[i]`. This matches NumPy / LAPACK's `eigh` (note that `qd.svd` sorts σ *descending*, also matching its NumPy / LAPACK counterpart — the cross-op disagreement is the standard convention, not an inconsistency).

Three implementations dispatch by size:

- **2×2** — closed-form via the trace / determinant identity.
- **3×3** — closed-form Cardano method (Eigen3 `computeDirect`).
- **4×4 .. 12×12** — cyclic Jacobi: 12 sweeps of Givens rotations zeroing every off-diagonal `(p, q)` pair, with `Q := Q · J` accumulated as eigenvectors. ~6 digits of accuracy in `f32`, ~12 digits in `f64`.

Calling at `N >= 13` raises (`"Symmetric eigen solver currently supports sizes up to 12×12."`).

`A` is *assumed* symmetric; the implementation does not symmetrize first. If your matrix is only approximately symmetric (e.g. accumulated floating-point error), explicitly compute `(A + A.transpose()) * 0.5` before calling.

### `qd.make_spd(A, dt=None)`

Project a symmetric matrix `A` to the closest positive semi-definite matrix in the Frobenius-norm sense. Implemented as `Q · diag(max(λ, 0)) · Qᵀ` where `A = Q · diag(λ) · Qᵀ` is the symmetric eigendecomposition computed by `qd.sym_eig`. Available for shapes 4×4 .. 12×12 (it shares the cyclic-Jacobi path with `qd.sym_eig`; for 2×2 / 3×3 you can write the same projection by hand using `qd.sym_eig` directly — see the example below).

Use cases:

- Projecting an indefinite contact / element Hessian to its closest SPD approximation before assembly (qipc-style).
- Regularizing a covariance estimate that may have small negative eigenvalues from rounding.
- Producing a usable preconditioner from a not-quite-SPD matrix.

`make_spd` is a Frobenius projector onto the PSD cone: `make_spd(make_spd(A)) == make_spd(A)`. If `A` is already SPD it is returned essentially unchanged (up to `sym_eig` round-trip error); if `A` is negative-definite the result is the zero matrix.

### `qd.polar_decompose(A, dt=None)`

Polar decomposition — produces `(R, S)` such that `A = R @ S`, with:

- `R` orthogonal (the closest-rotation factor, modulo handedness).
- `S` symmetric positive semi-definite (the stretch factor).

Built on top of SVD: `A = U @ Σ @ Vᵀ` ⇒ `R = U @ Vᵀ`, `S = V @ Σ @ Vᵀ`. The same caveats about sign convention as `qd.svd` apply — `R` is the closest orthogonal factor to `A` but is not guaranteed to be a proper rotation (positive determinant) without a manual fix-up.

### `qd.eig(A, dt=None)`

General eigendecomposition for **2×2 only**. Returns:

- `eigenvalues`: a `Matrix(n, 2)` where row `i` is `(real_part, imaginary_part)` of the i-th eigenvalue.
- `eigenvectors`: a `Matrix(n*2, n)` where each column is an eigenvector with each entry expanded into two scalars (real, imaginary). For a 2×2 input, the eigenvector matrix is 4×2.

Eigenvalues of a real 2×2 matrix come in complex-conjugate pairs when the discriminant is negative; the packed layout keeps the function single-return-shape. Calling with `n=3` raises — for 3×3 general (non-symmetric) eigendecomposition, there is no Quadrants entry point today; the canonical path is `qd.sym_eig` if your matrix happens to be symmetric.
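For intuition on the conjugate-pair case and the `(real, imaginary)` packing, here is a plain-NumPy sketch (NumPy's `eig`, not the qd API — the packing below is built by hand to mirror the layout described above):

```python
import numpy as np

# A real 2x2 matrix with negative discriminant: a 90-degree rotation,
# whose eigenvalues are the complex-conjugate pair +i / -i.
A = np.array([[0.0, -1.0],
              [1.0,  0.0]])
vals, vecs = np.linalg.eig(A)   # vals is complex128

# Pack each eigenvalue as (real_part, imaginary_part), one row per eigenvalue,
# mirroring the Matrix(n, 2) layout described above.
packed = np.stack([vals.real, vals.imag], axis=1)   # shape (2, 2)

assert packed.shape == (2, 2)
assert np.allclose(packed[:, 0], 0.0)                # purely imaginary pair
assert np.allclose(sorted(vals.imag), [-1.0, 1.0])   # conjugate pair +/- i
```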
### `qd.solve(A, b, dt=None)`

Direct solve of `A @ x = b` via Gauss elimination with partial pivoting. Returns the solution vector `x`.

- Shapes 2×2 and 3×3.
- The implementation asserts `A.n == A.m` and `A.m == b.n`.
- Singular `A` is checked by a kernel `assert` (`"Matrix is singular in linear solve."`) inside the Gauss-elimination path. Kernel asserts only fire when the runtime is initialised with `qd.init(debug=True)` (see [debug](debug.md)) — under the default `debug=False` a singular input silently produces a divide-by-zero / NaN result with no diagnostic. If you need a signal in production, check singularity explicitly before calling `qd.solve` (e.g. `abs(A.determinant())` against a tolerance), or run development workloads with `debug=True` to catch the case.
- Each call factorises `A` from scratch and back-substitutes for the given `b`.

## Examples

### Closest rotation to a 3×3 matrix (ARAP)

```python
@qd.func
def closest_rotation(A: qd.types.matrix(3, 3, qd.f64)) -> qd.types.matrix(3, 3, qd.f64):
    U, S, V = qd.svd(A, dt=qd.f64)
    R = U @ V.transpose()
    if R.determinant() < 0.0:
        V[:, 2] *= -1.0
        R = U @ V.transpose()
    return R
```

The `R.determinant() < 0.0` branch fixes the handedness when SVD's sign convention produces a reflection rather than a rotation.

### Project to symmetric positive semi-definite

For shapes 4×4 .. 12×12 (a typical qipc 12×12 contact Hessian), use `qd.make_spd` directly:

```python
mat12 = qd.types.matrix(12, 12, qd.f64)

@qd.kernel
def project_each(
    H_field: qd.types.NDArray[mat12, 1],
    H_spd_field: qd.types.NDArray[mat12, 1],
) -> None:
    for i in range(H_field.shape[0]):
        H_spd_field[i] = qd.make_spd(H_field[i], dt=qd.f64)
```

Each thread eigen-decomposes its own 12×12 Hessian, clamps negative eigenvalues to zero, and reconstructs. For 2×2 / 3×3 there is no `qd.make_spd` entry point (it shares the cyclic-Jacobi path that only kicks in at N≥4).
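As a plain-NumPy reference for what the projection computes (clamp negative eigenvalues, reconstruct — NumPy's `eigh`, not the qd API):

```python
import numpy as np

# Frobenius projection onto the PSD cone, as plain NumPy (reference only):
# H = Q diag(lam) Q^T  ->  Q diag(max(lam, 0)) Q^T
H = np.array([[1.0, 2.0],
              [2.0, 1.0]])   # symmetric, indefinite: eigenvalues 3 and -1
lam, Q = np.linalg.eigh(H)
H_psd = Q @ np.diag(np.maximum(lam, 0.0)) @ Q.T
assert np.all(np.linalg.eigvalsh(H_psd) >= -1e-12)   # result is PSD

# Idempotence: projecting a second time changes nothing.
lam2, Q2 = np.linalg.eigh(H_psd)
H_psd2 = Q2 @ np.diag(np.maximum(lam2, 0.0)) @ Q2.T
assert np.allclose(H_psd, H_psd2)
```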
Inline the projection using `qd.sym_eig` directly:

```python
@qd.func
def make_spd_3x3(H: qd.types.matrix(3, 3, qd.f64)) -> qd.types.matrix(3, 3, qd.f64):
    eigvals, Q = qd.sym_eig(H, dt=qd.f64)
    for i in qd.static(range(3)):
        if eigvals[i] < 0.0:
            eigvals[i] = 0.0
    Lambda = qd.Matrix.zero(qd.f64, 3, 3)
    for i in qd.static(range(3)):
        Lambda[i, i] = eigvals[i]
    return Q @ Lambda @ Q.transpose()
```

### Per-thread linear solve

```python
@qd.kernel
def solve_each(A_field: qd.types.NDArray[qd.types.matrix(2, 2, qd.f32), 1],
               b_field: qd.types.NDArray[qd.types.vector(2, qd.f32), 1],
               x_field: qd.types.NDArray[qd.types.vector(2, qd.f32), 1]) -> None:
    for i in range(A_field.shape[0]):
        x_field[i] = qd.solve(A_field[i], b_field[i])
```

Each thread does an independent Gauss elimination on its own 2×2 system. For larger systems a CG / PCG iteration over the whole array is the standard Quadrants pattern; `qd.solve` is for the per-element case.

### 2×2 polar decomposition for shape matching

```python
@qd.func
def shape_match(A: qd.types.matrix(2, 2, qd.f32)) -> qd.types.matrix(2, 2, qd.f32):
    R, _ = qd.polar_decompose(A)
    return R
```

The rotation factor `R` from `A = R @ S` is the rigid alignment that minimises `‖R - A‖_F` — the building block of position-based dynamics shape-matching.

## Shapes, performance, portability

- **Compile time.**
  - **Closed-form ops** (`qd.svd`, `qd.sym_eig` 2×2/3×3, `qd.polar_decompose`, `qd.eig`, `qd.solve`) — each call is unrolled per thread into a moderately large block of straight-line code; compile time is generally fine at these shapes.
  - **Cyclic Jacobi** (`qd.sym_eig` ≥4×4, `qd.make_spd`) — the per-pair Givens step is unrolled but the outer sweep loop is a runtime `range`, so compile time is roughly proportional to `N²` (number of `(p, q)` pairs per sweep) rather than `N² · MAX_SWEEPS`. Concrete numbers on CUDA + LLVM 22.1: ~3 s at N=4, ~30 s at N=6, ~3 min at N=9, ~2 min at N=12 (yes, faster than N=9 — the per-pair body is dominated by `if static(p < q):` filtering).
- **Runtime cost.** Cyclic Jacobi at N=12 with `MAX_SWEEPS=12` does roughly `12 · 66 · 12 ≈ 9500` per-thread arithmetic ops — fast on any modern GPU, but if you're calling it inside a hot kernel for a million elements that's still ~10 GFLOP-equivalent. For larger matrices use a different algorithm (or call `qd.linalg.*` for a sparse-aware path).
- **Backend portability.** All ops compile cleanly on CUDA, AMDGPU, Vulkan, and Metal — they are pure register arithmetic with no SIMT primitives, so there is no codegen split. The `qd.sym_eig` ≥4×4 / `qd.make_spd` paths have been verified at N ∈ {4, 5, 6, 9, 12} × {f32, f64} × five symmetric-matrix factories on CUDA + Vulkan + AMDGPU; Metal coverage is via the same parametrized tests.

## Related

- [matrix_vector](matrix_vector.md) — `qd.Matrix` / `qd.Vector` data types, fields, ndarrays, type annotations.
- [matrix_vector_per_thread](matrix_vector_per_thread.md) — element-wise / arithmetic matrix operations (`@`, `inverse`, `determinant`, `transpose`, `dot`, `cross`, `norm`, `outer_product`, `frobenius_inner`). Per-thread, but closed-form rather than iterative.
- `qd.math.*` — scalar math helpers (`qd.math.dot`, `qd.math.cross`, `qd.math.length`, etc.) that operate on vectors / matrices but are not decompositions.
- `qd.linalg.*` — cross-thread / sparse-matrix linear algebra (CG, sparse solvers); a different namespace and a different problem class.