Quick start¶
A tour of the headline features through worked snippets. Every
example assumes you’ve imported numpy as np and skein, plus
have a small synthetic problem set up:
import numpy as np
import skein_glm
rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.standard_normal((n, p))
true_beta = np.zeros(p)
true_beta[:3] = [1.5, -2.0, 0.8]
y = X @ true_beta + 0.1 * rng.standard_normal(n)
Single-λ MCP regression¶
model = skein_glm.MCPRegressor(lambda_=0.05, gamma=3.0).fit(X, y)
print(model.coef_[:5]) # near [1.5, -2.0, 0.8, 0, 0]
print(model.intercept_) # near 0
print(model.score(X, y)) # R² on training data
gamma controls the MCP nonconvexity. gamma=∞ is lasso; smaller
gamma is more aggressive sparsification. gamma >= 3 is a common
default; gamma=1e6 is a numerically convex stand-in for lasso.
Full λ-path with warm starts¶
*PathRegressor solves the entire regularization path in one call,
reusing each λ’s β as the warm-start for the next. This is what you
want for any analysis that picks λ post-hoc (cross-validation,
information criteria, plotting the path).
model = skein_glm.MCPPathRegressor(
gamma=3.0, n_lambdas=50, lambda_min_ratio=1e-3,
standardize=True,
).fit(X, y)
print(model.coefs_.shape) # (50, 50): n_lambdas × n_features
print(model.intercepts_.shape) # (50,)
print(model.lambdas_[0:3]) # [λ_max, λ_max·r, λ_max·r²], r = ratio**(1/(K-1))
Use lambdas= to supply an explicit grid; otherwise skein derives
λ_max from the KKT conditions at β=0 and builds a geometric grid.
Cross-validation¶
cv = skein_glm.MCPPathCV(gamma=3.0, n_lambdas=50, cv=5, random_state=0).fit(X, y)
print(cv.lambda_best_) # CV-selected λ
print(cv.coef_, cv.intercept_) # refit β at λ_best on the full data
print(cv.cv_mean_scores_.shape) # (50,) — mean test MSE per λ
cv accepts an integer (KFold) or any sklearn cross-validator. Higher-
is-better scorers (binomial deviance, c-index) are auto-detected for
GLM / Cox families.
Information criteria (AIC / BIC / EBIC)¶
path = skein_glm.MCPPathRegressor(gamma=3.0, n_lambdas=50).fit(X, y)
best_idx, scores = skein_glm.select_by_ic(path, X, y, criterion="bic")
print(best_idx, path.lambdas_[best_idx])
print(path.coefs_[best_idx])
Works with all four GLM families. EBIC uses gamma_ebic=0.5 by
default (the high-dimensional recommendation from ncvreg::BIC).
Logistic regression with a group penalty¶
groups = np.repeat(np.arange(p // 5), 5) # 10 groups of 5 features each
y_bin = (X[:, :3].sum(axis=1) > 0).astype(float)
clf = skein_glm.LogisticGroupMCPPathRegressor(
groups=groups, gamma=3.0, n_lambdas=20,
).fit(X, y_bin)
proba = clf.predict_proba(X) # shape (n, n_lambdas)
labels = clf.predict(X) # shape (n, n_lambdas), values in {0, 1}
print(clf.coefs_[-1].nonzero()) # which features are active at smallest λ
The path solver runs Local Linear Approximation outer iterations around a group block-CD inner solver — the headline algorithm. Only groups of features turn on/off; within an active group, every coefficient is non-zero.
Cox proportional hazards¶
time = rng.exponential(1.0 / np.exp(X[:, :3].sum(axis=1)))
event = (rng.uniform(size=n) < 0.7).astype(float)
cox = skein_glm.CoxMCPPathRegressor(gamma=3.0, n_lambdas=50).fit(X, time, event)
risk = cox.predict(X) # shape (n, n_lambdas)
# prognostic index η = Xβ;
# higher → shorter survival
print(cox.coefs_[-1].nonzero())
No intercept (the baseline hazard absorbs it). Right-censored data; Breslow ties handling.
Sparse design matrices¶
scipy.sparse.csc_matrix flows through every estimator transparently
— no separate API:
import scipy.sparse as sp
# Inflate the column scales so standardize=True actually does work.
X_dense = rng.standard_normal((n, p))
X_dense[X_dense < 0.5] = 0.0
X_sparse = sp.csc_matrix(X_dense)
model = skein_glm.MCPPathRegressor(
gamma=3.0, n_lambdas=20, standardize=True,
).fit(X_sparse, y)
# Same coefficients (up to numerical tolerance) as fitting on X_dense.
Group penalties, GLMs, and Cox all dispatch to the same _sparse
PyO3 entry points without user-visible API differences.
Memory-mapped (out-of-RAM) inputs¶
For problems where X is too large to load into memory:
# Write a column-major f64 file. NOTE: np.tofile() always writes in
# C order regardless of array flags, so we use tobytes(order='F'):
X_disk = rng.standard_normal((100_000, 1_000))
buf = np.ascontiguousarray(X_disk, dtype=np.float64).tobytes(order="F")
with open("X.bin", "wb") as f:
f.write(buf)
design = skein_glm.MmapDesignF64("X.bin", n_rows=100_000, n_cols=1_000)
y_big = rng.standard_normal(100_000)
model = skein_glm.MCPPathRegressor(gamma=3.0, n_lambdas=20).fit(design, y_big)
For half the disk footprint (and half the page-cache pressure),
write the matrix as f32 and use MmapDesignF32. The solver stays
f64; each column is cast f32 → f64 on read.
Row-block-chunked inputs¶
For problems where even one mmap’d file is unwieldy (multi-shard data pipelines, files larger than your filesystem’s per-file limit, etc.):
chunks = [
("chunk_0.bin", 10_000_000),
("chunk_1.bin", 10_000_000),
("chunk_2.bin", 7_345_678),
]
design = skein_glm.ChunkedDesignF64(chunks, n_cols=50_000)
model = skein_glm.MCPPathRegressor(...).fit(design, y_big)
Each chunk is a separate column-major file with the same n_cols.
The solver streams chunk-by-chunk; total n is the sum.
ChunkedDesignF32 is the f32 sibling.
Standardization¶
standardize=True divides each column by its glmnet-style standard
deviation (s_j = sqrt((‖X[:,j]‖² − n·x̄_j²)/n)) before fitting,
returning coef_ in original-feature scale. Important for any
design where columns have wildly different magnitudes — without it,
the L1 / group-L2 penalty falls disproportionately on small-scale
columns:
X_inflated = rng.standard_normal((n, p))
X_inflated[:, 0] *= 100.0 # column 0 is on a different scale
# With standardize=True, all columns are penalized comparably.
model = skein_glm.MCPPathRegressor(
gamma=3.0, n_lambdas=50, standardize=True,
).fit(X_inflated, y)
Works for dense, sparse, mmap, and chunked backends, across every
estimator family. See Concepts: Backends for
how the lazy Standardized<D> wrapper is composed under the hood.
Where next¶
Concepts for the conceptual model behind these knobs.
Porting from glmnet/ncvreg/grpreg (coming in commit 2) for side-by-side R → Python translation tables.
Extending skein (coming in commit 2) for implementing custom penalties or datafits against the trait surface.