Per-thread linear algebra#
Small matrix decompositions and linear solvers that run per thread, in registers, with no cross-thread cooperation — every thread independently decomposes / solves its own 2×2 .. 12×12 matrix. No shared memory, no syncs, no atomics on shared state, no warp / subgroup primitives. A 1000-element kernel runs 1000 copies of the algorithm in parallel, each on its own data.
This is a different category from the element-wise / arithmetic matrix operations covered in matrix_vector_per_thread (also per-thread, but closed-form rather than iterative), and from the cross-thread / sparse linear algebra under qd.linalg.* (CG, sparse direct solvers — covered separately).
Each entry point here implements an iterative or multi-step numerical algorithm (Jacobi sweeps, Gauss elimination, Givens rotations) rather than a single closed-form formula.
All ops live at the top level (qd.svd, qd.sym_eig, qd.make_spd, qd.polar_decompose, qd.eig, qd.solve) and are intended to be called from inside a @qd.kernel or @qd.func.
What’s available#
| Op | Operates on | Shapes | Returns |
|---|---|---|---|
| `qd.svd` | square real matrix | 2×2, 3×3 | `(U, S, V)` |
| `qd.sym_eig` | symmetric real matrix | 2×2 .. 12×12 | `(eigenvalues, eigenvectors)` |
| `qd.make_spd` | symmetric real matrix | 4×4 .. 12×12 | projected SPD matrix |
| `qd.polar_decompose` | square real matrix | 2×2, 3×3 | `(R, S)` |
| `qd.eig` | square real matrix | 2×2 | packed `(eigenvalues, eigenvectors)` |
| `qd.solve` | square matrix + vector | 2×2, 3×3 | solution vector `x` |
A few patterns to note:
- **Shapes are fixed.** Calling any of these on a matrix outside the supported shapes raises an exception at compile time (e.g. `"SVD only supports 2×2 and 3×3 matrices."`). There is no fallback for larger shapes.
- **All ops accept an optional `dt` argument.** When unspecified, it defaults to `impl.get_runtime().default_fp` — usually `qd.f32` unless overridden in `qd.init()`. Pass `dt=qd.f64` for the high-precision variant.
- **Output shape matches the input shape.** A 3×3 input yields 3×3 outputs (and a length-3 vector for `solve` / eigenvalues); a 2×2 input yields 2×2 outputs.
- **Real matrices only.** `qd.eig` returns complex results in a packed real layout (see below); the others all assume real-valued input and return real-valued output.
Semantics#
qd.svd(A, dt=None)#
Singular value decomposition — produces (U, S, V) such that A = U @ S @ V.transpose(), with:
- `U` orthogonal (`U @ U.transpose() = I`).
- `V` orthogonal.
- `S` diagonal, with non-negative singular values (but see the 3×3 sign caveat below).
Shapes 2×2 and 3×3 use closed-form / Jacobi-style implementations specialized per shape.
Singular values come back sorted descending (S[0,0] >= S[1,1] >= ...) for both 2×2 and 3×3. The 3×3 path uses the Sifakis algorithm and absorbs sign(det(A)) into σ rather than into U / V, so the smallest entry of S may be negative when det(A) < 0; the descending sort is on direct numeric value (matches the 2×2 path and NumPy / LAPACK conventions).
For 3×3 the implementation enforces det(U) = det(V) = +1 regardless of input — useful for ARAP-style rotations R = U @ V.transpose() (which is then guaranteed to be a proper rotation). The 2×2 path does not enforce this convention; if you need a particular handedness on 2×2, check it explicitly and flip a column if needed.
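For reference, the same proper-rotation convention can be reproduced outside Quadrants with NumPy. The helper below is an illustrative sketch, not the library's code: it flips a column of whichever factor has negative determinant and absorbs the sign into the last singular value, so any reflection in the input ends up in `S` rather than in `U` or `V`.

```python
import numpy as np

def svd_proper(A):
    """SVD with det(U) = det(V) = +1; any reflection is absorbed
    into the last singular value (illustrative sketch)."""
    U, s, Vt = np.linalg.svd(A)
    V = Vt.T
    if np.linalg.det(U) < 0:
        U[:, -1] *= -1
        s[-1] *= -1
    if np.linalg.det(V) < 0:
        V[:, -1] *= -1
        s[-1] *= -1
    return U, s, V

A = np.diag([2.0, 1.0, -0.5])   # det(A) < 0: one factor would be a reflection
U, s, V = svd_proper(A)
# Both factors are now proper rotations and the reconstruction still holds;
# the smallest singular value carries the negative sign.
assert np.allclose(U @ np.diag(s) @ V.T, A)
assert np.linalg.det(U) > 0 and np.linalg.det(V) > 0
```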
qd.sym_eig(A, dt=None)#
Symmetric eigendecomposition — for a real symmetric A, returns (eigenvalues, eigenvectors) with:
- `eigenvalues`: a `Vector(n)` of real eigenvalues.
- `eigenvectors`: a `Matrix(n, n)` whose columns are the corresponding orthonormal eigenvectors.
Eigenvalues come back sorted ascending (eigvals[0] <= eigvals[1] <= ...) for every shape, with column i of eigenvectors being the eigenvector for eigvals[i]. This matches NumPy / LAPACK’s eigh (note that qd.svd sorts σ descending, also matching its NumPy / LAPACK counterpart — the cross-op disagreement is the standard convention, not an inconsistency).
Three implementations dispatch by size:

- **2×2** — closed-form via the trace / determinant identity.
- **3×3** — closed-form Cardano method (Eigen3 `computeDirect`).
- **4×4 .. 12×12** — cyclic Jacobi: 12 sweeps of Givens rotations zeroing every off-diagonal `(p, q)` pair, with `Q := Q · J` accumulated as eigenvectors. ~6 digits of accuracy in `f32`, ~12 digits in `f64`.

Calling at N >= 13 raises (`"Symmetric eigen solver currently supports sizes up to 12×12."`).
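The ≥4×4 path can be pictured with a NumPy sketch of cyclic Jacobi (illustrative, not the library's generated code): each Givens rotation `J` zeroes one off-diagonal pair of a working copy of `A`, and the rotations are accumulated into `Q` as the eigenvector matrix.

```python
import numpy as np

def jacobi_sym_eig(A, max_sweeps=12):
    """Cyclic Jacobi sketch: sweep over every off-diagonal (p, q),
    zero it with a Givens rotation, and accumulate rotations in Q."""
    A = A.copy()
    n = A.shape[0]
    Q = np.eye(n)
    for _ in range(max_sweeps):
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(A[p, q]) < 1e-15:
                    continue
                # rotation angle that zeroes A[p, q] of the 2x2 subproblem
                theta = 0.5 * np.arctan2(2 * A[p, q], A[q, q] - A[p, p])
                c, s = np.cos(theta), np.sin(theta)
                J = np.eye(n)
                J[p, p] = J[q, q] = c
                J[p, q] = s
                J[q, p] = -s
                A = J.T @ A @ J
                Q = Q @ J
    eigvals = np.diag(A).copy()
    order = np.argsort(eigvals)      # ascending, matching qd.sym_eig
    return eigvals[order], Q[:, order]

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 6))
S = (M + M.T) * 0.5                  # random symmetric 6x6
w, V = jacobi_sym_eig(S)
assert np.allclose(V @ np.diag(w) @ V.T, S)
assert np.allclose(np.linalg.eigvalsh(S), w)
```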
A is assumed symmetric; the implementation does not symmetrize first. If your matrix is only approximately symmetric (e.g. accumulated floating-point error), explicitly compute (A + A.transpose()) * 0.5 before calling.
qd.make_spd(A, dt=None)#
Project a symmetric matrix A to the closest positive semi-definite matrix in the Frobenius-norm sense. Implemented as Q · diag(max(λ, 0)) · Qᵀ where A = Q · diag(λ) · Qᵀ is the symmetric eigendecomposition computed by qd.sym_eig.
Available for shapes 4×4 .. 12×12 (it shares the cyclic-Jacobi path with qd.sym_eig; for 2×2 / 3×3 you can write the same projection by hand using qd.sym_eig directly — see the example below).
Use cases:
Projecting an indefinite contact / element Hessian to its closest SPD approximation before assembly (qipc-style).
Regularizing a covariance estimate that may have small negative eigenvalues from rounding.
Producing a usable preconditioner from a not-quite-SPD matrix.
make_spd is a Frobenius-projector onto the SPD cone: make_spd(make_spd(A)) == make_spd(A). If A is already SPD it is returned essentially unchanged (up to sym_eig round-trip error); if A is negative-definite the result is the zero matrix.
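The projection itself is a few lines of standard linear algebra. A NumPy sketch (illustrative, not the library implementation) that demonstrates the idempotence and the negative-definite-to-zero behaviour:

```python
import numpy as np

def make_spd(A):
    """Frobenius projection of a symmetric matrix onto the PSD cone:
    eigendecompose, clamp negative eigenvalues to zero, reconstruct."""
    w, Q = np.linalg.eigh(A)
    return Q @ np.diag(np.maximum(w, 0.0)) @ Q.T

A = np.array([[1.0,  2.0],
              [2.0, -3.0]])                        # indefinite
P = make_spd(A)
assert np.all(np.linalg.eigvalsh(P) >= -1e-12)     # result is PSD
assert np.allclose(make_spd(P), P)                 # projector: idempotent
assert np.allclose(make_spd(-np.eye(3)), 0.0)      # negative-definite -> zero
```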
qd.polar_decompose(A, dt=None)#
Polar decomposition — produces (R, S) such that A = R @ S, with:
- `R` orthogonal (the closest-rotation factor, modulo handedness).
- `S` symmetric positive semi-definite (the stretch factor).
Built on top of SVD: A = U @ Σ @ Vᵀ ⇒ R = U @ Vᵀ, S = V @ Σ @ Vᵀ. The same caveats about sign convention as qd.svd apply — R is the closest orthogonal factor to A but is not guaranteed to be a proper rotation (positive determinant) without a manual fix-up.
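A NumPy sketch of the same construction (illustrative, not the library code):

```python
import numpy as np

def polar(A):
    """Polar decomposition via SVD: A = R @ S with R orthogonal
    and S symmetric positive semi-definite."""
    U, sigma, Vt = np.linalg.svd(A)
    R = U @ Vt
    S = Vt.T @ np.diag(sigma) @ Vt
    return R, S

A = np.array([[2.0, 1.0],
              [0.5, 3.0]])
R, S = polar(A)
assert np.allclose(R @ S, A)                       # A = R @ S
assert np.allclose(R.T @ R, np.eye(2))             # R orthogonal
assert np.allclose(S, S.T)                         # S symmetric
assert np.all(np.linalg.eigvalsh(S) >= -1e-12)     # S PSD
```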
qd.eig(A, dt=None)#
General eigendecomposition for 2×2 only. Returns:
- `eigenvalues`: a `Matrix(n, 2)` where row `i` is `(real_part, imaginary_part)` of the i-th eigenvalue.
- `eigenvectors`: a `Matrix(n*2, n)` where each column is an eigenvector with each entry expanded into two scalars (real, imaginary). For a 2×2 input, the eigenvector matrix is 4×2.
Eigenvalues of a real 2×2 matrix come in complex-conjugate pairs when the discriminant is negative; the packed layout keeps the function single-return-shape. Calling with n=3 raises — for 3×3 general (non-symmetric) eigendecomposition, there is no Quadrants entry point today; the canonical path is qd.sym_eig if your matrix happens to be symmetric.
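One way to picture the packed layout, sketched with NumPy under the row/column conventions described above (the packing here is illustrative, not a Quadrants API):

```python
import numpy as np

# A 2x2 matrix with complex eigenvalues 1 +/- 2i
A = np.array([[1.0, -2.0],
              [2.0,  1.0]])
w, v = np.linalg.eig(A)                       # complex-valued results

# Pack as the qd.eig layout describes: row i of `eigenvalues` is
# (re, im) of the i-th eigenvalue; each eigenvector entry expands
# into two scalars (re, im) down the column.
eigenvalues = np.column_stack([w.real, w.imag])    # shape (2, 2)
eigenvectors = np.empty((4, 2))
for i in range(2):                                 # column i = i-th eigenvector
    eigenvectors[0::2, i] = v[:, i].real
    eigenvectors[1::2, i] = v[:, i].imag

assert eigenvalues.shape == (2, 2) and eigenvectors.shape == (4, 2)
assert np.allclose(eigenvalues[:, 0], [1.0, 1.0])        # Re(lambda) = 1
assert np.allclose(sorted(eigenvalues[:, 1]), [-2.0, 2.0])  # Im(lambda) = +/-2
```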
qd.solve(A, b, dt=None)#
Direct solve of A @ x = b via Gauss elimination with partial pivoting. Returns the solution vector x.
Shapes 2×2 and 3×3.
- The implementation asserts `A.n == A.m` and `A.m == b.n`.
- Singular `A` is checked by a kernel `assert` (`"Matrix is singular in linear solve."`) inside the Gauss-elimination path. Kernel asserts only fire when the runtime is initialised with `qd.init(debug=True)` (see debug) — under the default `debug=False` a singular input silently produces a divide-by-zero / NaN result with no diagnostic. If you need a signal in production, check singularity explicitly before calling `qd.solve` (e.g. `abs(A.determinant())` against a tolerance), or run development workloads with `debug=True` to catch the case.
- Each call factorises `A` from scratch and back-substitutes for the given `b`.
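The algorithm is ordinary Gauss elimination with partial pivoting; a NumPy reference sketch (not the library's generated code) looks like this:

```python
import numpy as np

def gauss_solve(A, b):
    """Gauss elimination with partial pivoting, then back-substitution
    (reference sketch of the algorithm qd.solve describes)."""
    A = A.astype(float).copy()
    x = b.astype(float).copy()
    n = A.shape[0]
    for k in range(n):
        p = k + np.argmax(np.abs(A[k:, k]))       # partial pivot row
        if A[p, k] == 0.0:
            raise ZeroDivisionError("Matrix is singular in linear solve.")
        A[[k, p]] = A[[p, k]]                     # swap rows of A and rhs
        x[[k, p]] = x[[p, k]]
        for i in range(k + 1, n):                 # eliminate below the pivot
            f = A[i, k] / A[k, k]
            A[i, k:] -= f * A[k, k:]
            x[i] -= f * x[k]
    for k in range(n - 1, -1, -1):                # back-substitution
        x[k] = (x[k] - A[k, k + 1:] @ x[k + 1:]) / A[k, k]
    return x

A = np.array([[0.0, 2.0], [3.0, 1.0]])            # zero pivot: needs pivoting
b = np.array([4.0, 5.0])
assert np.allclose(gauss_solve(A, b), np.linalg.solve(A, b))
```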
Examples#
Closest rotation to a 3×3 matrix (ARAP)#
@qd.func
def closest_rotation(A: qd.types.matrix(3, 3, qd.f64)) -> qd.types.matrix(3, 3, qd.f64):
    U, S, V = qd.svd(A, dt=qd.f64)
    R = U @ V.transpose()
    if R.determinant() < 0.0:
        V[:, 2] *= -1.0
        R = U @ V.transpose()
    return R
The R.determinant() < 0.0 branch fixes the handedness when SVD’s sign convention produces a reflection rather than a rotation.
Project to symmetric positive semi-definite#
For shapes 4×4 .. 12×12 (a typical qipc 12×12 contact Hessian), use qd.make_spd directly:
mat12 = qd.types.matrix(12, 12, qd.f64)

@qd.kernel
def project_each(
    H_field: qd.types.NDArray[mat12, 1],
    H_spd_field: qd.types.NDArray[mat12, 1],
) -> None:
    for i in range(H_field.shape[0]):
        H_spd_field[i] = qd.make_spd(H_field[i], dt=qd.f64)
Each thread eigen-decomposes its own 12×12 Hessian, clamps negative eigenvalues to zero, and reconstructs.
For 2×2 / 3×3 there is no qd.make_spd entry point (it shares the cyclic-Jacobi path that only kicks in at N≥4). Inline the projection using qd.sym_eig directly:
@qd.func
def make_spd_3x3(H: qd.types.matrix(3, 3, qd.f64)) -> qd.types.matrix(3, 3, qd.f64):
    eigvals, Q = qd.sym_eig(H, dt=qd.f64)
    for i in qd.static(range(3)):
        if eigvals[i] < 0.0:
            eigvals[i] = 0.0
    Lambda = qd.Matrix.zero(qd.f64, 3, 3)
    for i in qd.static(range(3)):
        Lambda[i, i] = eigvals[i]
    return Q @ Lambda @ Q.transpose()
Per-thread linear solve#
@qd.kernel
def solve_each(A_field: qd.types.NDArray[qd.types.matrix(2, 2, qd.f32), 1],
               b_field: qd.types.NDArray[qd.types.vector(2, qd.f32), 1],
               x_field: qd.types.NDArray[qd.types.vector(2, qd.f32), 1]) -> None:
    for i in range(A_field.shape[0]):
        x_field[i] = qd.solve(A_field[i], b_field[i])
Each thread does an independent Gauss elimination on its own 2×2 system. For larger systems a CG / PCG iteration over the whole array is the standard Quadrants pattern; qd.solve is for the per-element case.
2×2 polar decomposition for shape matching#
@qd.func
def shape_match(A: qd.types.matrix(2, 2, qd.f32)) -> qd.types.matrix(2, 2, qd.f32):
    R, _ = qd.polar_decompose(A)
    return R
The rotation factor R from A = R @ S is the rigid alignment that minimises ‖R - A‖_F — the building block of position-based dynamics shape-matching.
Shapes, performance, portability#
**Compile time.**

- Closed-form ops (`qd.svd`, `qd.sym_eig` 2×2/3×3, `qd.polar_decompose`, `qd.eig`, `qd.solve`) — each call is unrolled per thread into a moderately large block of straight-line code; compile time is generally fine at these shapes.
- Cyclic Jacobi (`qd.sym_eig` ≥4×4, `qd.make_spd`) — the per-pair Givens step is unrolled but the outer sweep loop is a runtime `range`, so compile time is roughly proportional to `N²` (number of `(p, q)` pairs per sweep) rather than `N² · MAX_SWEEPS`. Concrete numbers on CUDA + LLVM 22.1: ~3 s at N=4, ~30 s at N=6, ~3 min at N=9, ~2 min at N=12 (yes, faster than N=9 — the per-pair body is dominated by `if static(p < q):` filtering).

**Runtime cost.** Cyclic Jacobi at N=12 with `MAX_SWEEPS=12` does roughly `12 · 66 · 12 ≈ 9500` per-thread arithmetic ops — fast on any modern GPU, but if you're calling it inside a hot kernel for a million elements that's still ~10 GFLOP-equivalent. For larger matrices use a different algorithm (or call `qd.linalg.*` for a sparse-aware path).

**Backend portability.** All ops compile cleanly on CUDA, AMDGPU, Vulkan, and Metal — they are pure register arithmetic with no SIMT primitives, so there is no codegen split. The `qd.sym_eig` ≥4×4 / `qd.make_spd` paths have been verified at N ∈ {4, 5, 6, 9, 12} × {f32, f64} × five symmetric-matrix factories on CUDA + Vulkan + AMDGPU; Metal coverage is via the same parametrized tests.