8. Adaptive estimators¶

Plain lasso and MCP have a known weakness: they shrink active coefficients toward zero. Even when feature j is unambiguously in the model, the L1 penalty pulls β_j away from its true value. MCP fixes this asymptotically (no shrinkage once |β_j| > γλ), but for moderate n the bias persists.

Adaptive estimators (Zou 2006) attack this directly. The recipe is two stages:

Pilot fit. Run a plain lasso on (X, y) and read off rough coefficient magnitudes.
Adaptive refit. Run the final lasso/MCP/SCAD again, but this time with per-feature weights w_j = 1 / max(|β̂_pilot_j|, ε)^η. Features with large pilot magnitudes get small penalty weights (less shrinkage); features with β̂_pilot_j ≈ 0 get huge weights (and stay zero).

Under regularity conditions (Zou’s Theorem 1), this two-stage procedure achieves the oracle property: the active features are selected with probability → 1, and their estimates are asymptotically as good as if you’d known the active set ahead of time.

Walk-through¶

import numpy as np
import skein_glm

rng = np.random.default_rng(0)
n, p = 200, 30
X = rng.standard_normal((n, p))

true_beta = np.zeros(p)
true_beta[[0, 5, 12]] = [2.0, -1.5, 1.2]
y = X @ true_beta + 0.3 * rng.standard_normal(n)

Compare plain MCP to adaptive MCP:

plain = skein_glm.MCPPathRegressor(
    gamma=3.0, n_lambdas=20, lambda_min_ratio=1e-3,
).fit(X, y)
adaptive = skein_glm.AdaptiveMCPPathRegressor(
    gamma=3.0, n_lambdas=20, lambda_min_ratio=1e-3,
).fit(X, y)

last_plain = plain.coefs_[-1, [0, 5, 12]]
last_adapt = adaptive.coefs_[-1, [0, 5, 12]]
print("plain:    ", last_plain.round(3))
print("adaptive: ", last_adapt.round(3))
print("true:     ", true_beta[[0, 5, 12]])

The adaptive estimator should be visibly closer to the true values on the active features — it’s unbiased on the active set.

Inspecting the pilot¶

Adaptive estimators expose two extra attributes:

adaptive.coef_pilot_   # β at pilot_position along the pilot path
adaptive.weights_      # per-feature weights w_j fed to the final fit

weights_[j] is huge for features with coef_pilot_[j] ≈ 0 and small for active features. That’s how the inactive features stay zero in the final fit: their effective penalty is λ · w_j ≈ ∞.

Tuning the recipe¶

Three knobs:

m = skein_glm.AdaptiveMCPPathRegressor(
    gamma=3.0,
    eta=1.0,                # weight exponent — higher = more aggressive
    eps_pilot=1e-6,         # floor on |β_pilot| (avoid divide-by-zero)
    n_pilot_lambdas=10,     # length of the pilot path
    pilot_position="mid",   # which λ on the pilot path to read β from
    n_lambdas=20,
)

eta — Zou’s paper suggests eta = 1 for n moderate, eta = 2 for very large n. Higher values sparsify more aggressively.
pilot_position — 'mid' is a reasonable bias/variance compromise; 'last' (smallest λ) is closest to OLS but unstable when n < p; an integer index gives full control.

When adaptive helps — and when it doesn’t¶

Adaptive shines when:

The truth is sparse with strong signals on the active set.
You have enough n for a halfway-decent pilot fit (n larger than the active-set size, ideally n much larger than p).
Bias on the active set is the cost you’re trying to reduce.

Adaptive can hurt when:

n < p and the pilot can’t recover anything — the weights are noise.
The truth has many weak signals (the pilot misses them, and they get penalized out of existence by the adaptive weights).
You care more about prediction than about coefficient accuracy.

CV with adaptive¶

cv = skein_glm.AdaptiveMCPPathCV(
    gamma=3.0, eta=1.0, cv=5, random_state=0,
    n_lambdas=20, lambda_min_ratio=1e-3,
).fit(X, y)
cv.lambda_best_       # CV-selected λ for the final fit
cv.coef_pilot_        # pilot β at full data (not per-fold)
cv.weights_           # adaptive weights derived from full-data pilot

The pilot runs once on the full data, not per-fold. The pilot weights are a hyperparameter (a data-derived input to the final fit), not a model parameter — refitting them per fold would be a different procedure.

Adaptive group / GLM variants¶

Same recipe applies at the group level and for GLMs. 30 adaptive classes total; a few highlights:

# Group-adaptive: per-group inverse-norm weights.
groups = np.repeat(np.arange(6), 5).astype(np.int64)
m = skein_glm.AdaptiveGroupMCPPathRegressor(
    groups=groups, gamma=3.0, eta=1.0,
    n_lambdas=20, lambda_min_ratio=1e-3,
).fit(X, y)
m.weights_     # length-6 (per group)

# Adaptive logistic.
y_bin = (y > 0).astype(np.float64)
m = skein_glm.AdaptiveLogisticMCPPathRegressor(
    gamma=3.0, eta=1.0, n_lambdas=20,
).fit(X, y_bin)

# Adaptive Cox.
time = rng.exponential(1.0, n)
event = (rng.uniform(size=n) < 0.7).astype(np.float64)
m = skein_glm.AdaptiveCoxMCPPathRegressor(
    gamma=3.0, eta=1.0, n_lambdas=20,
).fit(X, time, event)

Every penalty (Lasso, MCP, SCAD) and every GLM datafit (LS, logistic, Poisson, Cox, plus group variants) has a matching adaptive class.

Recap¶

Goal	Tool
Reduce shrinkage on active features	`Adaptive*PathRegressor`
Inspect the pilot fit	`coef_pilot_`, `weights_`
Group-level adaptive selection	`AdaptiveGroup{Lasso,MCP,SCAD}PathRegressor`
GLM adaptive	`Adaptive{Logistic,Poisson,Cox}*PathRegressor`

Adaptive is the headline use of skein’s per-feature-weights axis. The two-stage recipe is pure Python composition — every adaptive class is a wrapper around an existing path estimator that just plumbs weights= correctly.

Next¶

→ 9. Multinomial and multi-task — K-class softmax and multi-response regression.