8. Adaptive estimators¶
Plain lasso and MCP have a known weakness: they shrink active
coefficients toward zero. Even when feature j is unambiguously in
the model, the L1 penalty pulls β_j away from its true value. MCP
fixes this asymptotically (no shrinkage once |β_j| > γλ), but for
moderate n the bias persists.
Adaptive estimators (Zou 2006) attack this directly. The recipe is two stages:
Pilot fit. Run a plain lasso on
(X, y)and read off rough coefficient magnitudes.Adaptive refit. Run the final lasso/MCP/SCAD again, but this time with per-feature weights
w_j = 1 / max(|β̂_pilot_j|, ε)^η. Features with large pilot magnitudes get small penalty weights (less shrinkage); features withβ̂_pilot_j ≈ 0get huge weights (and stay zero).
Under regularity conditions (Zou’s Theorem 1), this two-stage procedure achieves the oracle property: the active features are selected with probability → 1, and their estimates are asymptotically as good as if you’d known the active set ahead of time.
Walk-through¶
import numpy as np
import skein_glm
rng = np.random.default_rng(0)
n, p = 200, 30
X = rng.standard_normal((n, p))
true_beta = np.zeros(p)
true_beta[[0, 5, 12]] = [2.0, -1.5, 1.2]
y = X @ true_beta + 0.3 * rng.standard_normal(n)
Compare plain MCP to adaptive MCP:
plain = skein_glm.MCPPathRegressor(
gamma=3.0, n_lambdas=20, lambda_min_ratio=1e-3,
).fit(X, y)
adaptive = skein_glm.AdaptiveMCPPathRegressor(
gamma=3.0, n_lambdas=20, lambda_min_ratio=1e-3,
).fit(X, y)
last_plain = plain.coefs_[-1, [0, 5, 12]]
last_adapt = adaptive.coefs_[-1, [0, 5, 12]]
print("plain: ", last_plain.round(3))
print("adaptive: ", last_adapt.round(3))
print("true: ", true_beta[[0, 5, 12]])
The adaptive estimator should be visibly closer to the true values on the active features — it’s unbiased on the active set.
Inspecting the pilot¶
Adaptive estimators expose two extra attributes:
adaptive.coef_pilot_ # β at pilot_position along the pilot path
adaptive.weights_ # per-feature weights w_j fed to the final fit
weights_[j] is huge for features with coef_pilot_[j] ≈ 0 and small
for active features. That’s how the inactive features stay zero in
the final fit: their effective penalty is λ · w_j ≈ ∞.
Tuning the recipe¶
Three knobs:
m = skein_glm.AdaptiveMCPPathRegressor(
gamma=3.0,
eta=1.0, # weight exponent — higher = more aggressive
eps_pilot=1e-6, # floor on |β_pilot| (avoid divide-by-zero)
n_pilot_lambdas=10, # length of the pilot path
pilot_position="mid", # which λ on the pilot path to read β from
n_lambdas=20,
)
eta— Zou’s paper suggestseta = 1fornmoderate,eta = 2for very largen. Higher values sparsify more aggressively.pilot_position—'mid'is a reasonable bias/variance compromise;'last'(smallest λ) is closest to OLS but unstable whenn < p; an integer index gives full control.
When adaptive helps — and when it doesn’t¶
Adaptive shines when:
The truth is sparse with strong signals on the active set.
You have enough
nfor a halfway-decent pilot fit (nlarger than the active-set size, ideallynmuch larger thanp).Bias on the active set is the cost you’re trying to reduce.
Adaptive can hurt when:
n < pand the pilot can’t recover anything — the weights are noise.The truth has many weak signals (the pilot misses them, and they get penalized out of existence by the adaptive weights).
You care more about prediction than about coefficient accuracy.
CV with adaptive¶
cv = skein_glm.AdaptiveMCPPathCV(
gamma=3.0, eta=1.0, cv=5, random_state=0,
n_lambdas=20, lambda_min_ratio=1e-3,
).fit(X, y)
cv.lambda_best_ # CV-selected λ for the final fit
cv.coef_pilot_ # pilot β at full data (not per-fold)
cv.weights_ # adaptive weights derived from full-data pilot
The pilot runs once on the full data, not per-fold. The pilot weights are a hyperparameter (a data-derived input to the final fit), not a model parameter — refitting them per fold would be a different procedure.
Adaptive group / GLM variants¶
Same recipe applies at the group level and for GLMs. 30 adaptive classes total; a few highlights:
# Group-adaptive: per-group inverse-norm weights.
groups = np.repeat(np.arange(6), 5).astype(np.int64)
m = skein_glm.AdaptiveGroupMCPPathRegressor(
groups=groups, gamma=3.0, eta=1.0,
n_lambdas=20, lambda_min_ratio=1e-3,
).fit(X, y)
m.weights_ # length-6 (per group)
# Adaptive logistic.
y_bin = (y > 0).astype(np.float64)
m = skein_glm.AdaptiveLogisticMCPPathRegressor(
gamma=3.0, eta=1.0, n_lambdas=20,
).fit(X, y_bin)
# Adaptive Cox.
time = rng.exponential(1.0, n)
event = (rng.uniform(size=n) < 0.7).astype(np.float64)
m = skein_glm.AdaptiveCoxMCPPathRegressor(
gamma=3.0, eta=1.0, n_lambdas=20,
).fit(X, time, event)
Every penalty (Lasso, MCP, SCAD) and every GLM datafit (LS, logistic, Poisson, Cox, plus group variants) has a matching adaptive class.
Recap¶
Goal |
Tool |
|---|---|
Reduce shrinkage on active features |
|
Inspect the pilot fit |
|
Group-level adaptive selection |
|
GLM adaptive |
|
Adaptive is the headline use of skein’s per-feature-weights axis.
The two-stage recipe is pure Python composition — every adaptive
class is a wrapper around an existing path estimator that just
plumbs weights= correctly.
Next¶
→ 9. Multinomial and multi-task — K-class softmax and multi-response regression.