# Tensors

Quadrants offers two underlying tensor implementations, [`qd.field` and `qd.ndarray`](tensor_types.md). They have different runtime/compile-time trade-offs, and different physical memory layouts can suit different kernels. The tensor API lets you pick both the **backend** and the **physical memory layout** on a per-tensor basis at allocation time. The rest of the system (kernels, fastcache, autograd) stays out of the way.

See [`tensor_types`](tensor_types.md), [`scalar_tensors`](scalar_tensors.md), and [`matrix_vector`](matrix_vector.md) for the underlying tensor primitives.

## Choosing a backend: `qd.Backend`

`qd.Backend` is an `IntEnum` with two members:

| Member | Underlying type | When to prefer |
|---|---|---|
| `qd.Backend.FIELD` | `qd.field` | Faster at runtime; recompiles when any dimension size changes. |
| `qd.Backend.NDARRAY` | `qd.ndarray` | Slower at runtime but avoids recompilation when sizes change. |

The choice is per tensor: a single program can freely mix backends.

## Allocating a tensor with `qd.tensor()`

`qd.tensor(dtype, shape, backend=...)` is a thin dispatcher over `qd.field` and `qd.ndarray`. It selects the underlying allocator based on the `backend=` keyword:

```python
import quadrants as qd

qd.init(arch=qd.x64)

a = qd.tensor(qd.f32, shape=(4, 5))                            # ndarray (default)
b = qd.tensor(qd.f32, shape=(4, 5), backend=qd.Backend.FIELD)  # field

assert isinstance(a, qd.Tensor)
assert isinstance(b, qd.Tensor)
```

`qd.tensor()` (and its `qd.Vector.tensor` / `qd.Matrix.tensor` siblings) returns a `qd.Tensor` wrapper that uniformly forwards a fixed surface regardless of which backend it wraps: `shape`, `dtype`, `layout`, `to_numpy`, `from_numpy`, `to_torch`, `from_torch`, `to_dlpack`, `fill`, `copy_from`, `grad`, host-side `__getitem__` / `__setitem__`, and pickle. Drop down to the bare impl with `t._unwrap()` (which returns the underlying `qd.Ndarray` or `qd.ScalarField`) only if you need a backend-specific knob.

The default backend is `qd.Backend.NDARRAY`: it avoids recompilation when sizes change.

## Vector and matrix tensors

For tensors whose elements are vectors or matrices, use `qd.Vector.tensor` or `qd.Matrix.tensor`. They dispatch over `qd.Vector.field` / `qd.Vector.ndarray` and `qd.Matrix.field` / `qd.Matrix.ndarray` respectively, with the same `backend=` keyword:

```python
import quadrants as qd

qd.init(arch=qd.x64)

# A 1-D tensor of 4 length-3 vectors (ndarray backend, default).
v = qd.Vector.tensor(3, qd.f32, shape=(4,))

# Same shape, on the field backend.
u = qd.Vector.tensor(3, qd.f32, shape=(4,), backend=qd.Backend.FIELD)

# A 1-D tensor of 3 (2x2) matrices, ndarray backend.
m = qd.Matrix.tensor(2, 2, qd.f32, shape=(3,))
```

## Gradients

`needs_grad=True` works on every tensor factory and on every backend, by passing the keyword through to the underlying `qd.field` / `qd.ndarray` call:

```python
import quadrants as qd

qd.init(arch=qd.x64)

# Ndarray-backed primal + grad (default backend).
a = qd.tensor(qd.f32, shape=(4,), needs_grad=True)
assert a.grad is not None

# Same on the field backend.
b = qd.tensor(qd.f32, shape=(4,), backend=qd.Backend.FIELD, needs_grad=True)
assert b.grad is not None

# Kernels write through canonical indices on both primal and grad.
@qd.kernel
def write_grad(x: qd.Tensor):
    for i in range(4):
        x.grad[i] = i * 100.0

write_grad(a)
print(a.grad.to_numpy())  # [0., 100., 200., 300.]
```

Gradient buffers always share the canonical shape of the primal, on both backends. The `needs_grad` keyword also passes through `qd.Vector.tensor` and `qd.Matrix.tensor` for compound element types.
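For instance, here is a minimal sketch of a vector tensor with a gradient buffer — assuming only what is stated above, namely that `qd.Vector.tensor` forwards `needs_grad` the same way `qd.tensor` does and that the gradient shares the primal's canonical shape:

```python
import quadrants as qd

qd.init(arch=qd.x64)

# Compound element type with a gradient buffer; needs_grad is forwarded
# to the underlying qd.Vector.field / qd.Vector.ndarray allocator.
v = qd.Vector.tensor(3, qd.f32, shape=(8,), needs_grad=True)
assert v.grad is not None
assert v.grad.shape == v.shape  # grad shares the canonical primal shape
```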
## Controlling physical layout

Tweaking the memory layout on a per-tensor basis is a common way to improve runtime performance, and in most cases tuning the axis order is enough. For finer-grained control over the memory layout, see the SNode API (`qd.root`). The `layout=` keyword picks the axis order per tensor:

```python
import quadrants as qd

qd.init(arch=qd.x64)

N, B = 1024, 8  # example sizes

# Default (canonical) layout: same order as the canonical shape.
a = qd.tensor(qd.f32, shape=(N, B))

# Transposed storage: axis 1 (batch) becomes the outer SNode, axis 0 inner.
b = qd.tensor(qd.f32, shape=(N, B), layout=(1, 0))
```

`layout` is a tuple of `int` listing the **canonical axis index at each successive memory-nesting level, outermost first**. It must be a permutation of `range(len(shape))`. The canonical (logical) shape that you pass, and that `tensor.shape` returns, is *not* affected by `layout`:

```python
b = qd.tensor(qd.f32, shape=(N, B), layout=(1, 0))
assert b.shape == (N, B)  # canonical shape, unchanged
b[i, j] = ...             # canonical indexing in kernels still works
```

Any permutation is supported, up to Quadrants' `quadrants_max_num_indices` (currently 12). `layout=None` and the identity permutation (`(0, 1, ..., N-1)`) are equivalent and forward no permutation to the underlying allocator.

Quadrants rejects mismatched or invalid layouts up front:

```python
qd.tensor(qd.f32, shape=(4, 5), layout=(0, 1, 2))  # ValueError: wrong length
qd.tensor(qd.f32, shape=(4, 5), layout=(0, 0))     # ValueError: not a permutation
qd.tensor(qd.f32, shape=(4, 5), order="ji")        # TypeError: use layout=
```

## Interop with NumPy and PyTorch

Every Python-side accessor — `tensor.shape`, `tensor.layout`, `tensor.to_numpy()`, `tensor.to_numpy(dtype=...)`, `tensor.from_numpy(...)`, `tensor.to_torch(device=...)`, `tensor.from_torch(...)`, `tensor.to_dlpack()` (and therefore anything built on top of it, like `torch.utils.dlpack.from_dlpack`) — returns the **canonical view**: the shape you passed at allocation time, indexed in canonical axis order. `layout=` is purely an internal performance hint. The data lives in permuted physical storage, but Python callers never have to reason about that:

```python
import numpy as np
import torch

a = qd.tensor(qd.f32, shape=(N, B), layout=(1, 0))
assert a.shape == (N, B)             # canonical
assert a.layout == (1, 0)            # introspectable
assert a.to_numpy().shape == (N, B)  # canonical view of the same data

# Round-trips work in canonical-shape terms.
src = np.zeros((N, B), dtype=np.float32)
a.from_numpy(src)
assert (a.to_numpy() == src).all()

# DLPack carries the canonical shape with permuted strides; the resulting
# torch tensor is a transposed view of the underlying buffer (no data
# movement until you call `.contiguous()`).
t = torch.utils.dlpack.from_dlpack(a.to_dlpack())
assert tuple(t.shape) == (N, B)

# `to_torch` / `from_torch` are equivalent on either backend.
out = a.to_torch()
assert tuple(out.shape) == (N, B)
a.from_torch(out)
```

The exact same surface is available on both backends — switching `qd.tensor(..., backend=qd.Backend.FIELD/NDARRAY)` does not require any other code change at the call site.
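As a concrete illustration of that claim — a sketch assuming nothing beyond the `from_numpy` / `to_numpy` surface documented above — the same round-trip code can be run under both backends, with the `backend=` keyword as the only difference:

```python
import numpy as np
import quadrants as qd

qd.init(arch=qd.x64)

src = np.arange(20, dtype=np.float32).reshape(4, 5)

for backend in (qd.Backend.FIELD, qd.Backend.NDARRAY):
    t = qd.tensor(qd.f32, shape=(4, 5), backend=backend)
    t.from_numpy(src)                   # identical call on both backends
    assert (t.to_numpy() == src).all()  # canonical view either way
```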
### Zero-copy with `copy=False`

`to_numpy()` and `to_torch()` accept a keyword-only `copy` argument:

```python
a = qd.tensor(qd.f32, shape=(1024,))
a.fill(1.0)

view = a.to_torch(copy=False)   # zero-copy: aliases a's memory, or ValueError
auto = a.to_torch(copy=None)    # zero-copy if possible, otherwise copy
clone = a.to_torch(copy=True)   # independent copy (default)
```

| Value | Behaviour |
|---|---|
| `True` (default) | Independent copy via kernel. Safe to mutate freely. |
| `None` | Zero-copy when available, otherwise falls back to a copy silently. |
| `False` | Zero-copy DLPack view, or `ValueError` if unsupported for this backend/dtype. |

`copy=False` and `copy=None` avoid both the buffer allocation and the copy kernel when zero-copy is available — the returned numpy array or torch tensor points directly at Quadrants' existing memory. For a large tensor this eliminates a potentially expensive memcpy and a device-side kernel launch. Writes through the view are immediately visible to subsequent Quadrants kernels (and vice versa), removing the need for `to_torch` → modify → `from_torch` round-trips.

The difference between `False` and `None`: `copy=False` raises `ValueError` when zero-copy is not supported (e.g. an unsupported dtype, or GPU-to-numpy), while `copy=None` silently falls back to a kernel copy in those cases. Use `copy=None` when you want zero-copy as a best-effort optimisation without having to handle exceptions.

The trade-off of zero-copy is lifetime coupling: the view is invalidated on `qd.reset()` or `qd.init()`, and on GPU you must be mindful of stream synchronisation when both frameworks write to the same buffer.

This works identically on both backends. For the full support matrix (which backends/dtypes qualify, lifetime caveats, Metal synchronisation), see [`interop`](interop.md#zero-copy-interop-via-dlpack).

Gradient buffers behave identically: `a.grad.to_numpy()` returns the canonical view of the gradient.

## Annotating kernel arguments: `qd.Tensor`

Kernel parameter annotations use `qd.Tensor` regardless of backend. The same class doubles as the wrapper class returned by `qd.tensor()`, so the annotation and the runtime values agree:

```python
import quadrants as qd

qd.init(arch=qd.x64)

@qd.kernel
def fill(x: qd.Tensor):
    for i in range(x.shape[0]):
        x[i] = i

a = qd.tensor(qd.f32, shape=(4,), backend=qd.Backend.FIELD)
b = qd.tensor(qd.f32, shape=(4,), backend=qd.Backend.NDARRAY)
fill(a)  # field branch
fill(b)  # ndarray branch
```

The kernel argument is unwrapped to the bare impl before the template-mapper / AST sees it, so kernel bodies still write `x[i, j]` and pay no per-call cost for the wrapper.

## Pickle

`qd.Tensor` objects are picklable on **both** backends, including under non-identity layouts. A round trip (pickle then unpickle) preserves the canonical data, the dtype, the shape, and the layout:

```python
import pickle

import numpy as np
import quadrants as qd

qd.init(arch=qd.x64)

a = qd.tensor(qd.f32, shape=(3, 4), backend=qd.Backend.FIELD, layout=(1, 0))
a.from_numpy(np.arange(12, dtype=np.float32).reshape(3, 4))

restored = pickle.loads(pickle.dumps(a))
assert isinstance(restored, qd.Tensor)
assert restored.shape == (3, 4)
assert restored.layout == (1, 0)
assert (restored.to_numpy() == a.to_numpy()).all()
```
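Because the wrapper pickles like any other Python object, the in-memory round trip above extends directly to file-based checkpointing. A minimal sketch (the file name is illustrative, and `a` is the tensor from the previous snippet):

```python
import pickle

# Save a tensor to disk and restore it later; this relies only on the
# pickle support documented above, nothing file-specific in Quadrants.
with open("checkpoint.pkl", "wb") as f:
    pickle.dump(a, f)

with open("checkpoint.pkl", "rb") as f:
    restored = pickle.load(f)

assert (restored.to_numpy() == a.to_numpy()).all()
```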
## Wrapping a bare tensor: `qd.wrap`

If you have a bare `qd.field` / `qd.ndarray` / `qd.Vector.field` / `qd.Matrix.field` / `qd.Vector.ndarray` / `qd.Matrix.ndarray` impl (e.g. from older code or library boundaries) and want the unified `qd.Tensor` surface around it, use `qd.wrap(impl)`. It picks the most specific wrapper subclass (`Tensor`, `VectorTensor`, `MatrixTensor`):

```python
import quadrants as qd

qd.init(arch=qd.x64)

a = qd.ndarray(qd.f32, shape=(4, 5))
t = qd.wrap(a)
assert isinstance(t, qd.Tensor)
assert t._unwrap() is a  # same underlying impl
```

`qd.wrap` is the only sanctioned way to construct a wrapper around a bare impl after the fact. The `qd.Tensor(impl)` constructor itself rejects double-wrapping, so you can't accidentally end up with a `Tensor` containing a `Tensor`.

## Cross-backend `copy_from` is not supported

`tensor.copy_from(other)` requires both tensors to share the same backend. Mixed-backend copies are not supported:

```python
a = qd.tensor(qd.f32, shape=(4,), backend=qd.Backend.FIELD)
b = qd.tensor(qd.f32, shape=(4,), backend=qd.Backend.NDARRAY)
a.copy_from(b)  # raises: cross-backend copy unsupported
```

If you genuinely need to move data across backends, route it through Torch: `a.from_torch(b.to_torch())`.

## Known asymmetry: real-dtype `.grad` stub on the field backend

For tensors of a real (`f32` / `f64`) dtype allocated **without** `needs_grad=True`, the field backend currently allocates a zombie gradient stub anyway, so `t.grad` returns a wrapper around it. The ndarray backend correctly reports `t.grad is None` in the same case:

```python
t_field = qd.tensor(qd.f32, shape=(4,), backend=qd.Backend.FIELD)
t_nd = qd.tensor(qd.f32, shape=(4,), backend=qd.Backend.NDARRAY)

t_field.grad  # currently a Tensor wrapper around a zombie field
t_nd.grad     # None
```

Use `needs_grad=True` if you intend to read `.grad`. Integer dtypes are symmetric (`grad is None` on both backends regardless of `needs_grad`).
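Until that asymmetry is resolved, avoid using `t.grad is None` as a portable "was this allocated with autograd?" test. One defensive pattern — purely a sketch, with a hypothetical helper name — is to carry the `needs_grad` request alongside the tensor yourself:

```python
import quadrants as qd

qd.init(arch=qd.x64)

# Hypothetical helper: remember the needs_grad request at allocation time
# instead of probing t.grad, which is backend-dependent for real dtypes.
def alloc_tensor(dtype, shape, backend, needs_grad=False):
    t = qd.tensor(dtype, shape=shape, backend=backend, needs_grad=needs_grad)
    return t, needs_grad

t, has_grad = alloc_tensor(qd.f32, (4,), qd.Backend.FIELD)
if has_grad:  # portable, unlike `t.grad is None` on the field backend
    print(t.grad.to_numpy())
```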