Datafits¶
The datafit term is the loss function: how the model is graded
against the training data. skein ships four datafit families
covering the canonical generalized linear models — Gaussian
(least squares), binomial (logistic), Poisson, and Cox proportional
hazards — plus a clean trait surface for adding more.
The Datafit trait¶
In Rust:
pub trait Datafit: Sync + Send {
fn n_samples(&self) -> usize;
fn value(&self, design: &dyn DesignMatrix, beta: ArrayView1<f64>) -> f64;
fn coord_grad(&self, design: &dyn DesignMatrix, j: usize, residual: ArrayView1<f64>) -> f64;
fn coord_lipschitz(&self, design: &dyn DesignMatrix, j: usize) -> f64;
fn full_grad(&self, design: &dyn DesignMatrix, residual: ArrayView1<f64>) -> Array1<f64>;
fn sample_weights(&self) -> Option<ArrayView1<f64>>;
fn residual_at(&self, design: &dyn DesignMatrix, beta: ArrayView1<f64>) -> Array1<f64>;
}
In Python (skein_glm.datafits.Datafit ABC, mirrors the Rust trait for
prototyping new datafits without porting to Rust first):
from skein_glm.datafits import Datafit
class MyDatafit(Datafit):
def value(self, design, beta): ...
def coord_grad(self, design, j, residual): ...
# etc.
Least squares (Gaussian)¶
The simplest datafit:
with optional per-sample weights \(w_i \geq 0\). This is what
MCPRegressor, SCADRegressor, GroupLassoPathRegressor, etc. use.
Coordinate descent on LS is fast: each coordinate update is closed-form (prox of the soft-threshold operator on \(\langle x_j, r \rangle / n\), where \(r = y - X\beta\) is the residual). The residual is patched in place after each update — this is the core of the M1 production CD solver.
import skein_glm
model = skein_glm.MCPPathRegressor(gamma=3.0, n_lambdas=50).fit(X, y)
Generalized linear models via prox-Newton¶
Logistic, Poisson, and Cox don’t have a closed-form coordinate
update — they’re not quadratic in \(\beta\). skein handles them
with a proximal-Newton outer loop:
At iterate \(\beta^{(t)}\), compute the linear predictor \(\eta = X\beta^{(t)}\).
Build a weighted least-squares surrogate by linearizing the GLM gradient and approximating the Hessian as diagonal:
working response \(z_i^{(t)} = \eta_i + (y_i - \mu_i^{(t)}) / w_i^{(t)}\)
per-sample weights \(w_i^{(t)} = \mu_i^{(t)}(1 - \mu_i^{(t)})\) (logistic), \(w_i^{(t)} = \mu_i^{(t)}\) (Poisson), etc.
Solve the weighted-LS subproblem using the M1 CD solver (no code changes — the trait dispatch handles it).
Update \(\eta\) from the new \(\beta\), repeat. Typically converges in 5-15 outer iterations.
This means every penalty available for LS works for every GLM via the same prox-Newton scaffolding, without per-GLM re-implementation.
Binomial logistic¶
with \(y_i \in \{0, 1\}\). Numerically stable cross-entropy via
softplus(η). Per-sample weights honored throughout.
clf = skein_glm.LogisticMCPPathRegressor(gamma=3.0, n_lambdas=50).fit(X, y_bin)
clf.predict(X) # class labels {0, 1}, shape (n, n_lambdas)
clf.predict_proba(X) # P(y=1), shape (n, n_lambdas)
clf.decision_function(X) # η = Xβ + α, shape (n, n_lambdas)
Poisson (log link)¶
with \(y_i \geq 0\) (integer counts in practice, but the loss is defined for any non-negative real). The solver clamps \(\eta\) to \([-30, 30]\) before exponentiation for numerical stability.
model = skein_glm.PoissonMCPPathRegressor(gamma=3.0, n_lambdas=50).fit(X, y_count)
model.predict(X) # μ = exp(η), the conditional mean
model.decision_function(X) # η = log-rate
Cox proportional hazards¶
The partial likelihood:
with \((t_i, \delta_i)\) being right-censored survival data: \(t_i\) is the observed time, \(\delta_i \in \{0, 1\}\) is the event indicator. No intercept — the baseline hazard absorbs it. Breslow ties handling (events at the same observed time share the risk-set sum); Efron is on the M3.7 roadmap.
The fit signature is fit(X, time, event) — three arguments instead
of two:
cox = skein_glm.CoxMCPPathRegressor(gamma=3.0, n_lambdas=50).fit(X, time, event)
cox.predict(X) # prognostic index η = Xβ; higher → shorter survival
cox.decision_function(X) # same as predict() for Cox
Standardization vs centering¶
For LS, the dense path uses glmnet’s centering+scaling: \(X\) is centered to have zero column means, \(y\) is centered to have zero mean, and the intercept is recovered from \(\alpha = \bar y - \bar x^\top \beta\).
For sparse and mmap LS, centering would densify \(X\), so the backend uses scaling-only plus a column-augmented intercept (a 1s column with penalty weight 0). The two parameterizations are mathematically equivalent at convergence; the sparse path’s β matches the dense path’s β at every λ to ~1e-7.
For GLMs (logistic, Poisson, Cox), centering \(X\) is unnecessary because the prox-Newton inner LS already absorbs the working response’s mean shift. All GLM paths use scaling-only + column-augmented intercept, on every backend.
This matters when you pass standardize=True — skein picks the
right strategy automatically.
Per-sample weights¶
Every datafit takes optional per-sample weights \(w_i \geq 0\) through its constructor. Use cases:
Frequency weights: aggregated data where row \(i\) represents \(w_i\) independent observations.
Inverse-propensity weights for causal inference.
Importance weights for class imbalance.
Survey weights — the sampling-design correction.
The weights enter every solver-side dispatch (coord_grad,
coord_lipschitz, full_grad) — there’s no path that silently
ignores them. See Weights for the three-axis story.
Choosing a datafit¶
A rough decision tree:
Outcome type |
Datafit |
|---|---|
Continuous |
|
Binary |
|
Count (rate or counts) |
|
Right-censored survival |
|
For multinomial / softmax (more than 2 classes), see the M3.6 roadmap entry — currently deferred until multi-task infrastructure (M7) lands. For Cox with Efron ties, weights, or offsets, see M3.7.