Backends¶
skein’s DesignMatrix trait abstracts the storage of \(X\) from the
solver. The trait has five methods; once you implement them, every
estimator works with your backend, and every wrapper composes with
it. v0.1 ships four storage backends and two structural wrappers.
The trait¶
pub trait DesignMatrix: Sync + Send {
fn n_samples(&self) -> usize;
fn n_features(&self) -> usize;
fn matvec(&self, beta: ArrayView1<f64>) -> Array1<f64>; // X β
fn rmatvec(&self, r: ArrayView1<f64>) -> Array1<f64>; // Xᵀ r
fn col_dot(&self, j: usize, v: ArrayView1<f64>) -> f64; // ⟨X[:,j], v⟩
fn col_sq_norm(&self, j: usize) -> f64; // ‖X[:,j]‖²
fn columns(&self, cols: &[usize]) -> Array2<f64>; // X[:, cols] block
}
col_dot and col_sq_norm are the hot path for coordinate descent;
columns(cols) is what group block-CD calls to materialize a group’s
columns. matvec and rmatvec are used by gap-safe screening and
by the GLM working-response calculation.
The CD solver only ever touches \(X\) through these five methods. That’s the leverage — every concrete backend gets every algorithm for free.
Storage backends¶
DenseMatrix (in-memory, row-major f64)¶
The reference implementation. Backed by ndarray::Array2<f64>.
Used automatically when you pass a numpy 2D array to any estimator.
model = skein_glm.MCPPathRegressor(...).fit(X, y) # X is np.ndarray
col_sq_norms is precomputed at construction. Hot path is direct
ndarray indexing — fast, no allocation per call.
SparseCSC (in-memory, scipy.sparse.csc_matrix)¶
Compressed sparse column. The hot path (col_dot, columns) walks
each column’s data / indices arrays, skipping the implicit
zero entries. col_sq_norms is precomputed.
import scipy.sparse as sp
X_sparse = sp.csc_matrix(X)
model = skein_glm.MCPPathRegressor(...).fit(X_sparse, y)
Used automatically when you pass a scipy.sparse.csc_matrix. CSR
inputs are converted to CSC at the boundary (one copy).
matvec skips entries with β_j = 0, which makes warm-started
paths very fast on sparse problems.
MmapMatrix (out-of-RAM, column-major f64 on disk)¶
Memory-mapped raw f64 files. The OS page cache transparently
materializes the columns as the solver reads them; nothing is
loaded into RAM until accessed.
# Write a column-major f64 file (note: tofile() always writes in
# C order regardless of array layout, so use tobytes(order='F')).
X_disk = ... # large numpy array
buf = np.ascontiguousarray(X_disk, dtype=np.float64).tobytes(order="F")
with open("X.bin", "wb") as f:
f.write(buf)
design = skein_glm.MmapDesignF64("X.bin", n_rows=n, n_cols=p)
model = skein_glm.MCPPathRegressor(...).fit(design, y)
Why column-major: the CD hot path col_dot(j, v) reads each column
as a contiguous slice of the file. With row-major storage, each
col_dot would scatter \(n\) non-contiguous bytes — pathological for
the page cache.
col_sq_norms is computed once at open() time (one full pass
through the file, paid by the page cache); from then on the hot
path matches DenseMatrix performance.
MmapMatrixF32 (out-of-RAM, column-major f32 on disk)¶
Same as MmapMatrix but the file is f32. Half the disk footprint
and half the page-cache pressure for the same (n, p). The solver
stays f64; each column is cast f32 → f64 on read. Equivalent to
the f64 path up to f32 truncation error (~1e-7 relative), in
exchange for halving the I/O.
buf32 = np.ascontiguousarray(X_disk, dtype=np.float32).tobytes(order="F")
with open("X32.bin", "wb") as f:
f.write(buf32)
design = skein_glm.MmapDesignF32("X32.bin", n_rows=n, n_cols=p)
This is “f32-on-disk” — not “true mixed precision” (which would also do arithmetic in f32). True mixed precision is a separate roadmap bullet that requires parameterizing the solver core over a floating-point type parameter.
Chunked<C> (out-of-RAM, list of row-block files)¶
For when one mmap’d file is unwieldy: multiple shards from an
upstream pipeline, files larger than your filesystem’s per-file
limit, or natural per-batch chunking. Each chunk is a separate
column-major file with the same n_cols; the wrapper concatenates
them logically.
chunks = [
("chunk_0.bin", 10_000_000),
("chunk_1.bin", 10_000_000),
("chunk_2.bin", 7_345_678),
]
design = skein_glm.ChunkedDesignF64(chunks, n_cols=50_000)
# total n = 27_345_678
Generic over the chunk backend in Rust: Chunked<MmapMatrix> is the
f64 case (above), Chunked<MmapMatrixF32> is f32 (via
ChunkedDesignF32 in Python).
Routes the trait methods chunk-by-chunk:
col_dot(j, v)slicesvinto per-chunk segments and sums.matvec(β)calls each chunk’smatvec(β)and concatenates.rmatvec(r)slicesrand sums per-feature.col_sq_norm(j)is precomputed at construction by summing across chunks.
v0.1 is serial. Per-chunk Rayon parallelism is a one-line follow-up once benchmarks justify it.
Wrapper layers¶
Two zero-cost wrappers that compose with any storage backend.
Augmented<D> — virtual intercept column¶
Adds a single all-ones column at index n_features (the augmented
intercept slot) without touching the underlying storage:
n_features = base.n_features + 1col_dot(p, v) = Σ v_icol_sq_norm(p) = nmatvec(β) = base.matvec(β[:p]) + β[p]
This is what fit_intercept=True uses on mmap and chunked backends:
appending a 1s column to a 100GB file would be insane, so we just
declare one virtually.
For dense and sparse backends, fit_intercept=True physically
appends a 1s column instead. Both routes converge to the same β.
Standardized<D> — lazy column scaling¶
Wraps any backend with per-column scales \(s_j\) to expose \(\tilde X = X \cdot \mathrm{diag}(1/s)\):
col_dot(j, v) = base.col_dot(j, v) / s_jcol_sq_norm(j) = base.col_sq_norm(j) / s_j²matvec(β) = base.matvec(β / s)
No materialization, no centering — and centering would densify a
sparse / mmap’d \(X\). The intercept column (when wrapped via
Augmented first) gets s_p = 1.0 so it’s never scaled.
This is what standardize=True uses on every non-dense-LS backend.
For dense LS, the original glmnet centering+scaling is still in
play (centering \(X\) is fine when \(X\) already lives in RAM as
f64).
How the layers compose¶
A typical mmap + intercept + standardize fit on logistic MCP:
DesignMatrix tower at solve time:
Standardized<Augmented<MmapMatrix>>
x_scale = [s_1, s_2, ..., s_p, 1.0] ← intercept unscaled
Augmented adds a virtual 1s column
MmapMatrix opens X.bin (col-major f64)
The solver doesn’t see the layers — it sees a &dyn DesignMatrix
with n_features = p + 1 and trait methods that route through
the entire stack on each call. After convergence, the user-facing
β is recovered as \(\beta_{\text{user}} = \tilde\beta / s\) for the
non-intercept features, and the intercept is the last entry of
\(\tilde\beta\) as-is.
The same composition works for sparse, chunked, etc. — the wrappers don’t know or care which backend is at the bottom.
Choosing a backend¶
|
RAM available |
Choose |
|---|---|---|
|
≫ data |
|
Mostly zeros (>50%) |
≫ data |
|
|
≪ data |
|
|
between |
|
Multi-file shards |
any |
|
The estimators dispatch on the input type; you don’t need to pass a
flag. The same MCPPathRegressor works on all five.
Adding a new backend¶
To add a new backend (HDF5 reader, Parquet column store, GPU buffer, etc.):
Implement the five
DesignMatrixtrait methods in Rust.Add a thin PyO3 wrapper if you want Python access (mirror what
MmapMatrixdoes).Add a Python helper class similar to
MmapDesignF64that estimators recognize viaisinstancedispatch.The estimators that should support your backend get an
elif _is_yourbackend(x):branch in theirfitmethod.
The “extending skein” page (coming in commit 2) walks through this end-to-end with a worked example.