9. Multinomial and multi-task¶
So far every model has had a single output: one y per sample (LS,
logistic, Poisson, Cox). Two situations break that:
Multinomial classification. Each
yis one of K class labels.Multi-task regression. Each
yis a K-vector of related continuous outcomes.
Both reduce algebraically to a row-grouped problem on a virtual
block-replicated design — the same M2 block-CD machinery used
elsewhere in skein, just with a different design wrapper. Concretely:
the coefficient matrix B ∈ ℝ^(p×K) gets penalized at the row
level (the row of feature j’s K coefficients), so a feature is
either active for every class/task or inactive for all of them.
Multinomial: K-class softmax¶
Build a 3-class problem:
import numpy as np
import skein_glm
rng = np.random.default_rng(0)
n, p, K = 300, 30, 3
X = rng.standard_normal((n, p))
# Coefficient matrix B (p × K). Only features 0, 5, 12 are active.
true_B = np.zeros((p, K))
true_B[0] = [1.5, -1.5, 0.0]
true_B[5] = [-1.0, 0.0, 1.0]
true_B[12] = [0.0, 1.2, -1.2]
# Sample class labels via softmax.
eta = X @ true_B
labels = np.argmax(eta + 0.1 * rng.standard_normal((n, K)), axis=1)
Two ways to get prediction surfaces, depending on how you’ve picked λ. The path classifier exposes the full sweep but is intentionally prediction-free — it’s for inspecting the trajectory:
path = skein_glm.MultinomialMCPPathClassifier(
gamma=3.0, n_lambdas=15, lambda_min_ratio=1e-2,
).fit(X, labels)
path.coefs_.shape # (15, K, p) — coefficients along the path
path.intercepts_.shape # (15, K)
path.lambdas_.shape # (15,)
path.classes_ # ndarray (K,) — the unique label values
For predictions, use either a single-λ classifier (MultinomialMCPClassifier)
or the CV variant, both of which expose predict_proba /
predict / decision_function:
clf = skein_glm.MultinomialMCPClassifier(
lambda_=path.lambdas_[10], # pick a λ from the path
gamma=3.0,
).fit(X, labels)
clf.coef_.shape # (K, p) — sklearn convention
clf.intercept_.shape # (K,)
clf.predict_proba(X).shape # (n, K) — softmax probabilities
clf.predict(X).shape # (n,) — argmax class labels
clf.decision_function(X).shape # (n, K) — η values
Symmetric parameterization¶
skein follows glmnet’s symmetric / “grouped” multinomial: all K
columns of B are estimated freely. The softmax is invariant to
adding a constant to every column, but the L2-row penalty breaks
that redundancy by shrinking everything toward zero. Predictions are
invariant to the redundancy — you don’t have to think about it.
This matches glmnet(family="multinomial", type.multinomial="grouped").
Labels can be any sortable type¶
fit(X, y) accepts integers, strings, or anything np.unique can
sort. The estimator stores the sorted unique labels on classes_
and decodes predictions back to the original dtype:
labels_str = np.array(["cat", "dog", "fish"])[labels]
clf = skein_glm.MultinomialMCPPathClassifier(
gamma=3.0, n_lambdas=10, lambda_min_ratio=1e-2,
).fit(X, labels_str)
clf.classes_ # array(['cat', 'dog', 'fish'])
clf.predict(X[:3]) # array(['dog', 'cat', 'fish'])
Multinomial CV¶
MultinomialMCPPathCV scores by multinomial deviance (log-loss)
and uses StratifiedKFold by class so heavy class imbalance
doesn’t produce class-empty folds:
cv = skein_glm.MultinomialMCPPathCV(
gamma=3.0, cv=5, random_state=0,
n_lambdas=15, lambda_min_ratio=1e-2,
).fit(X, labels)
cv.lambda_best_
cv.predict(X[:5])
Multi-task LS: K continuous outcomes¶
Multi-task LS fits K related linear regressions jointly under
a row-grouped penalty: feature j is selected if it’s active in
any task.
n, p, K = 200, 30, 4
X = rng.standard_normal((n, p))
# Truth: features 0, 5, 12 active; the rest are noise.
true_B = np.zeros((p, K))
true_B[0] = [1.5, -1.0, 0.8, -1.2]
true_B[5] = [0.7, -0.5, 1.1, 0.4]
true_B[12] = [0.0, 0.6, -0.5, 0.9]
Y = X @ true_B + 0.3 * rng.standard_normal((n, K)) # (n, K)
path = skein_glm.MultiTaskLassoPathRegressor(
n_lambdas=20, lambda_min_ratio=1e-2,
).fit(X, Y)
path.coefs_.shape # (n_lambdas, K, p) — full path
path.intercepts_.shape # (n_lambdas, K)
path.lambdas_.shape # (n_lambdas,)
# Single-λ regressor matches sklearn's `MultiTaskLasso` shape — use it
# when you want a fitted model to call `predict` on.
m = skein_glm.MultiTaskLassoRegressor(
lambda_=path.lambdas_[10],
).fit(X, Y)
m.coef_.shape # (K, p)
m.intercept_.shape # (K,)
m.predict(X).shape # (n, K)
Useful when you have multiple correlated outcomes — e.g., gene expression in different tissues, sales across regions, sensor readings at multiple time points — and you believe the sparse support is shared.
Convention vs sklearn¶
skein and sklearn use different objective scalings for multi-task LS:
Library |
Data fidelity factor |
Penalty constant |
|---|---|---|
skein |
|
|
sklearn |
|
|
Same minimizer at λ_skein = α_sklearn / K. The pytest suite has a
parity test against sklearn.linear_model.MultiTaskLasso at the
matching (α, λ) pair.
Other multi-task penalties¶
# Multi-task MCP via LLA (unbiased on active features) — full path.
path_mcp = skein_glm.MultiTaskMCPPathRegressor(
gamma=3.0, n_lambdas=20, lambda_min_ratio=1e-2,
).fit(X, Y)
path_mcp.coefs_.shape # (n_lambdas, K, p)
# Multi-task elastic net (convex, useful for correlated tasks).
path_en = skein_glm.MultiTaskElasticNetPathRegressor(
alpha=0.5, n_lambdas=20, lambda_min_ratio=1e-2,
).fit(X, Y)
# CV scoring by mean per-task MSE — exposes `coef_ (K, p)` after refit.
cv = skein_glm.MultiTaskMCPPathCV(
gamma=3.0, cv=5, random_state=0,
n_lambdas=20, lambda_min_ratio=1e-2,
).fit(X, Y)
cv.lambda_best_
cv.coef_.shape # (K, p)
cv.predict(X).shape # (n, K)
Why these two are the same problem¶
Both multinomial and multi-task LS share a row-grouped coefficient
matrix B ∈ ℝ^(p×K) with the penalty λ · Σ_j w_j ‖B[j, :]‖_2.
The only difference is the loss:
Multinomial: cross-entropy on softmax(XB).
Multi-task LS: squared-error on
Y - XB.
Both reduce to a multi-task LS on a virtual (nK × pK) design
that the M2 block-CD machinery handles unchanged. For multinomial,
the prox-Newton outer loop linearizes the cross-entropy via Böhning’s
diagonal majorization — every iteration becomes a multi-task LS
sub-problem. The same MultiTaskDesign<X> wrapper drives both.
Stability selection on multi-task¶
StabilitySelection works on multi-task estimators too:
ss = skein_glm.StabilitySelection(
skein_glm.MultiTaskLassoPathRegressor(
n_lambdas=15, lambda_min_ratio=1e-2,
),
n_bootstraps=50, threshold=0.7, random_state=0, n_jobs=-1,
).fit(X, Y)
ss.stable_features_
The 3D coefs_ (n_lambdas, K, p) is collapsed via the any-class
active rule: feature j counts as selected if any of its K
coefficients is nonzero.
Recap¶
Setting |
Class |
Coef shape |
predict |
|---|---|---|---|
K-class single λ |
|
|
|
K-class path |
|
|
path attrs only — no |
K-class CV |
|
|
|
Multi-response single λ |
|
|
|
Multi-response path |
|
|
path attrs only |
Multi-task CV |
|
|
|
Both families share the same (datafit, row-grouped penalty) shape
underneath. Once you understand this tutorial, you understand
everything in skein_glm.multinomial and skein_glm.multitask.
Where to go next¶
You now know all nine tutorials. The natural next reading:
The applied case studies in examples/ — full analyses combining multiple features.
The concepts pages for the conceptual model behind the abstractions.
The extending guides for building your own penalty / datafit / backend.
The API reference for the complete surface.