Multi-task / multi-response

Multi-task LS fits K related linear regressions jointly under a penalty that selects features at the row level of the coefficient matrix B ℝ^(p × K). Whereas running K independent scalar fits would let each task pick a different active set, the multi-task formulation lets each feature be either active in every task or inactive in every task — useful when you believe the underlying support is shared.

The data is X ℝ^(n × p) and Y ℝ^(n × K). The objective is

\[ \min_{B} \;\; \frac{1}{2nK}\, \|Y - XB\|_F^2 \;+\; \lambda \sum_{j=1}^{p} w_j \, \|B[j, :]\|_2 \]

for the multi-task lasso (the convex case). Reading off the penalty: B[j, :] is the row of feature j across the K tasks, and we penalize its L2 norm. If the row goes to exactly zero, the feature is dropped from every task simultaneously.

The four shapes

Estimator

Penalty

Convex?

Use when

MultiTaskLassoRegressor

row-L2

yes

baseline; matches sklearn

MultiTaskMCPRegressor

row-MCP via LLA

non-convex

want unbiased active features

MultiTaskSCADRegressor

row-SCAD via LLA

non-convex

alternative to MCP, less aggressive

MultiTaskElasticNetRegressor

row-L2 + row-L2²

yes

correlated features, want stability

Each has a *PathRegressor sibling (full λ-path with warm starts) and a *PathCV wrapper (K-fold CV scoring by mean per-task MSE). The estimators live in skein_glm.multitask and are also re-exported at the package root, so skein_glm.MultiTaskLassoRegressor works.

How it works: stack-and-reuse

Multi-task LS reduces exactly to a group-lasso problem on a virtual design — no new solver code in skein. The trick: lay B out row-major as a length-pK vector bvec[jK + k] = B[j, k], stack Y task-outer as yvec[k·n + i] = Y[i, k], and define a virtual design ℝ^(nK × pK) where column jK + k holds the base column X[:, j] lifted into row block k (rows k·n..(k+1)·n) and zeros elsewhere.

With this layout the row-grouping of B becomes the standard contiguous-block grouping {jK, jK+1, …, jK+K-1} for each feature. The block-CD machinery from M2 handles the resulting group-lasso problem unchanged. The wrapper that materializes lazily is called MultiTaskDesign<D> in the Rust core; it carries no state beyond the inner design and the task count K, and every DesignMatrix op (matvec, col_dot, columns block) is O(n) — the same cost as a scalar column op on the base.

The same wrapper composes with Augmented (for sparse intercept augmentation) and Standardized (for per-feature scaling), so MultiTaskDesign<Standardized<Augmented<SparseCSC>>> is a real (and supported) backend stack.

Convention vs sklearn

skein and sklearn use different objective scalings:

Library

Data fidelity factor

Penalty constant

skein

1/(2nK)

λ

sklearn

1/(2n)

α

The same minimizer is reached at λ_skein = α_sklearn / K. So when porting code from sklearn:

# sklearn
sklearn.linear_model.MultiTaskLasso(alpha=0.05).fit(X, Y)

# skein equivalent (Y has K columns)
K = Y.shape[1]
skein_glm.MultiTaskLassoRegressor(lambda_=0.05 / K).fit(X, Y)

skein keeps the natural 1/(2nK) convention so the MCP/SCAD shape parameters (gamma, a) have the same meaning across the rest of the library — those control where the per-row penalty kink falls on the L2 norm scale, which is dimensionally consistent with the natural formulation.

The pytest suite has a regression test that fits both sklearn’s MultiTaskLasso and skein’s MultiTaskLassoRegressor at the matching (alpha, lambda) pair and asserts coefficient agreement to 1e-5.

Sparse and standardize

Both work for multi-task. The Python estimators dispatch on scipy.sparse.issparse(X); the underlying Rust path then routes through MultiTaskDesign<SparseCSC>. The intercept is handled two different ways depending on the input:

  • Dense: per-task center Y and per-feature center X (physically), fit a no-intercept model, recover per-task intercepts via \(\alpha_k = \bar{y}_k - \sum_j \bar{x}_j B[j, k]\).

  • Sparse: append a single 1s column to the inner CSC (the MultiTaskDesign wrapper replicates it K times into K virtual intercept columns living in disjoint row blocks), give the resulting row-group weight 0, and read each per-task intercept directly off bvec[p·K + k].

Both are equivalent at convergence — the dense pytest suite includes a parity test on a problem with standardize=True turned on for both backends.

Limitations

skein v0.2 ships multi-task LS and multinomial / softmax classification (M3.6 — see Multinomial). Both ride the same MultiTaskDesign<X> reduction. Multi-task GLMs for Poisson and Cox aren’t there yet (M7.3); the wrapper supports them at the design layer, but the prox-Newton outer loop has to be adapted per family.

coord_weights (per-task L1 weights independent of the per-row L2 weights) aren’t exposed yet; the per-feature weights= argument is uniform across tasks. The Rust core could support this via the existing SparseGroupLasso::with_coord_weights machinery, but the sklearn-facing API for it isn’t designed yet.

See also