Speed: headline benchmarks

One-page consolidation of skein-glm’s wall-clock headline numbers with explicit provenance. The authoritative raw data lives in paper/tables/T2_headline_timings.md (machine-generated by the v2 benchmark suite); this page is the human-readable summary backed by that table.

TL;DR

On the canonical medium cell (n=10000 samples × p=1000 features, 5 seeds, deep-tail λ grid, apple-m1 + Accelerate), skein beats every structured / nonconvex peer (glmnet, celer, skglm, ncvreg) on every LS-family scenario and is the fastest package on every group/structured scenario tested. The one exception in the LS family is sklearn, which is faster than skein on ls_lasso and ls_elasticnet. skein wins on logistic_mcp against ncvreg (the only ncvreg-comparable GLM cell). It loses against glmnet on logistic_lasso, poisson_lasso, and cox_lasso; see Caveats below. This snapshot is from skein 0.10.0 and predates the M13.8 celer-style screening cascade that the README’s “2.2–8.2× wall-clock on logistic_lasso” claim was measured against.

Provenance

The numbers in Headline timings come from a single snapshot run of benches/v2/:

| field | value |
| --- | --- |
| host_id | `3c43bb844695` |
| CPU | Apple M1 |
| BLAS | `blas-accelerate` (Accelerate framework) |
| cores | 8 |
| OS | macOS-26.4.1-arm64 |
| Python | 3.12.7 |
| skein-glm | 0.10.0 |
| numpy | 2.4.4 |
| scipy | 1.17.1 |
| sklearn | 1.8.0 |
| skglm | 0.5 |
| celer | 0.7.4 |
| snapshot | 2026-05-19 |
| git rev | `474c68a1` (per-cell) / `08b1d378` (bundle assembly, dirty working tree) |

Source: paper/tables/T6_environment.md and paper/BUNDLE.md.

Headline timings

All values are median wall-clock seconds across 5 seeds, with a single-threaded inner CD (CV / stability selection parallelise across folds at a higher layer). Ratios are comparator_median / skein_median; values > 1 mean skein is faster.
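As a minimal sketch of how a ratio in the speedup column is formed, the snippet below reduces each package's per-seed timings to a median and divides. The per-seed values are hypothetical, chosen only so that the medians match the `ls_lasso` row against glmnet, and `speedup` is an illustrative helper rather than a function from the v2 suite.

```python
from statistics import median


def speedup(comparator_seconds, skein_seconds):
    """Return comparator_median / skein_median; values > 1 mean skein is faster."""
    return median(comparator_seconds) / median(skein_seconds)


# Hypothetical per-seed wall-clock timings (5 seeds each), in seconds.
skein_runs = [1.10, 1.13, 1.12, 1.15, 1.14]   # median 1.13
glmnet_runs = [1.60, 1.64, 1.66, 1.63, 1.70]  # median 1.64

ratio = speedup(glmnet_runs, skein_runs)
print(f"{ratio:.2f}x")  # prints 1.45x
```

A ratio below 1 reads the other way round: with medians of 0.20 s for the comparator and 1.13 s for skein, the same helper returns roughly 0.18, meaning the comparator is faster.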

Least-squares family

| scenario | skein (s) | comparator | comparator (s) | skein speedup |
| --- | --- | --- | --- | --- |
| `ls_lasso` | 1.13 | glmnet | 1.64 | 1.45× |
| `ls_lasso` | 1.13 | celer | 3.05 | 2.71× |
| `ls_lasso` | 1.13 | skglm | 4.80 | 4.26× |
| `ls_lasso` | 1.13 | sklearn | 0.20 | 0.18× (sklearn faster) |
| `ls_mcp` | 1.70 | ncvreg | 7.78 | 4.59× |
| `ls_mcp` | 1.70 | skglm | 4.61 | 2.72× |
| `ls_scad` | 1.57 | ncvreg | 7.82 | 4.97× |
| `ls_elasticnet` | 1.40 | glmnet | 1.71 | 1.22× |
| `ls_elasticnet` | 1.40 | sklearn | 0.28 | 0.20× (sklearn faster) |

sklearn’s coordinate_descent is a tight Cython kernel specialised for the plain lasso / elastic-net case, which is out of scope for skein to chase. The relevant comparisons are against the structured / nonconvex peer set (glmnet, celer, skglm, ncvreg).

Group / structured family

| scenario | skein (s) | comparator | comparator (s) | skein speedup |
| --- | --- | --- | --- | --- |
| `ls_group_lasso` | 5.33 | grpreg | 11.39 | 2.14× |
| `ls_group_mcp` | 6.57 | grpreg | 12.56 | 1.91× |

No external package has a published sparse-group-MCP implementation, so ls_sparse_group_mcp (skein: 7.08 s median) runs without a comparator.

GLM family

| scenario | skein (s) | comparator | comparator (s) | skein speedup |
| --- | --- | --- | --- | --- |
| `logistic_mcp` | 19.62 | ncvreg | 95.10 | 4.85× |
| `logistic_lasso` | 108.42 | glmnet | 7.88 | 0.07× (glmnet faster; see Caveats) |
| `poisson_lasso` | 41.73 | glmnet | 2.50 | 0.06× (glmnet faster; see Caveats) |
| `cox_lasso` | 3.82 | glmnet | 2.24 | 0.59× (glmnet faster; see Caveats) |

poisson_mcp (skein: 7.82 s) and cox_mcp (skein: 4.13 s) have no ncvreg/glmnet comparator at v0.10’s bench-cell coverage.

Caveats

The snapshot is from skein 0.10.0, dated 2026-05-19. The post-0.10 perf work, particularly M13.8’s celer-style gap-safe screening on the GLM prox-Newton surrogate, landed after this snapshot, and the v2 suite has not been re-run on a v1.0+ build. The CHANGELOG entry for M13.8 reports “2.2–8.2× wall-clock on logistic_lasso v2 cells”. If that range is applied to the 108 s number above, skein lands in the 13–49 s range: still slower than glmnet’s 7.9 s on this cell, but closer. The poisson_lasso and cox_lasso cells are likewise awaiting a re-snapshot.
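The projected range is plain division of the snapshot median by the reported speedup bounds. The sketch below reproduces that arithmetic, assuming the CHANGELOG range transfers unchanged to this cell, which has not been measured; the variable names are illustrative.

```python
# Medians from the GLM family table (skein 0.10.0 snapshot), in seconds.
skein_logistic_lasso_s = 108.42
glmnet_logistic_lasso_s = 7.88

# Speedup bounds quoted from the M13.8 CHANGELOG entry.
speedup_low, speedup_high = 2.2, 8.2

# A larger speedup gives a smaller projected time.
projected_best_s = skein_logistic_lasso_s / speedup_high
projected_worst_s = skein_logistic_lasso_s / speedup_low

print(f"projected skein: {projected_best_s:.1f} to {projected_worst_s:.1f} s")
# prints: projected skein: 13.2 to 49.3 s

# How far the projection would still sit behind glmnet on this cell.
gap_best = projected_best_s / glmnet_logistic_lasso_s
gap_worst = projected_worst_s / glmnet_logistic_lasso_s
print(f"still {gap_best:.1f}x to {gap_worst:.1f}x slower than glmnet")
# prints: still 1.7x to 6.3x slower than glmnet
```

This is a projection, not a measurement; only a re-run of the v2 suite on a post-M13.8 build can confirm where the cell actually lands.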

Re-running the headline matrix is a maintainer-overnight job: running the full cibuildwheel-grade fits across 5 seeds × 16 scenarios × 4 BLAS configs takes ~6–10 h on a single laptop. The H1 closeout (at_scale.md) describes the relevant xlarge tier extension and the maintainer-overnight cadence.
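For a rough sense of scale, the matrix size and the quoted wall-clock budget imply an average cost per cell. The sketch below assumes the time is spread evenly over cells, which it is not in practice (the GLM cells in the tables above are far slower than the LS cells), so treat the result as an order-of-magnitude figure only.

```python
seeds, scenarios, blas_configs = 5, 16, 4

# Total number of (seed, scenario, BLAS config) cells in the headline matrix.
cells = seeds * scenarios * blas_configs
print(cells)  # prints 320

# Average wall-clock seconds per cell at each end of the quoted 6 to 10 h range.
for hours in (6, 10):
    seconds_per_cell = hours * 3600 / cells
    print(f"{hours} h -> {seconds_per_cell:.1f} s per cell")
# prints: 6 h -> 67.5 s per cell
# prints: 10 h -> 112.5 s per cell
```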

These numbers are for the deep-tail regime (λ_min/λ_max = 1e-3). Sparse-regime (λ_min/λ_max = 5e-2) numbers point in the same direction qualitatively, but the magnitudes shift: sparse cells finish 3–10× faster across the board because the active set stays at k_active. See paper/tables/T2_headline_timings.md for the per-regime breakdown.
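To make the two regimes concrete, here is a minimal sketch of a regularisation path for each ratio. It assumes a log-spaced grid of 100 values starting at λ_max; the grid size, the spacing, and the helper name `lambda_grid` are assumptions for illustration, and the grid actually used is defined by the v2 suite.

```python
import math


def lambda_grid(lambda_max, ratio, n_lambdas=100):
    """Log-spaced grid from lambda_max down to ratio * lambda_max (inclusive)."""
    if not 0 < ratio < 1:
        raise ValueError("ratio must lie strictly between 0 and 1")
    if n_lambdas < 2:
        raise ValueError("need at least two grid points")
    step = math.log(ratio) / (n_lambdas - 1)  # negative, so the grid decreases
    return [lambda_max * math.exp(i * step) for i in range(n_lambdas)]


# lambda_max is normalised to 1.0 here; in practice it depends on the data.
deep_tail = lambda_grid(1.0, 1e-3)  # path runs three decades below lambda_max
sparse = lambda_grid(1.0, 5e-2)     # path stops at 5% of lambda_max

print(f"deep-tail lambda_min: {deep_tail[-1]:.3g}")  # prints 0.001
print(f"sparse lambda_min:    {sparse[-1]:.3g}")     # prints 0.05
```

The deep-tail path reaches much weaker regularisation, where more coefficients enter the model, which is consistent with the sparse cells finishing faster.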

Reproducing

The numbers above can be regenerated from a clean checkout via the v2 Snakemake suite:

```shell
pip install -e '.[bench]'
maturin develop --release --features=blas-accelerate   # macOS
# or: --features=blas-openblas   on Linux
cd benches/v2 && snakemake --profile profiles/m1-headline
```

The artifacts land under paper/figures/ and paper/tables/. See benches/v2/README.md for the design rationale and the host-id provenance contract that prevents committing snapshots from one machine over snapshots from another.
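The idea behind a host-id contract can be sketched as a guard that refuses to overwrite a snapshot recorded on a different machine. Everything below is hypothetical: the function name, the exception, and the comparison rule are illustrative assumptions, and the real contract is the one documented in benches/v2/README.md.

```python
class HostMismatchError(RuntimeError):
    """Raised when a new snapshot would replace one recorded on another host."""


def check_snapshot_host(committed_host_id, current_host_id):
    """Allow a snapshot write only when the host ids agree (hypothetical rule)."""
    if committed_host_id is None:
        return  # no committed snapshot yet, so there is nothing to protect
    if committed_host_id != current_host_id:
        raise HostMismatchError(
            f"snapshot was recorded on host {committed_host_id}, "
            f"refusing to overwrite from host {current_host_id}"
        )


# Same host as the committed snapshot: the write is allowed.
check_snapshot_host("3c43bb844695", "3c43bb844695")

# A different (made-up) host id is rejected.
try:
    check_snapshot_host("3c43bb844695", "000000000000")
except HostMismatchError as err:
    print(err)
```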

A lightweight per-PR canary (.github/workflows/bench-smoke.yml) runs two cells per PR to catch pipeline breakage; it doesn’t compare against committed snapshots.

What this page is NOT

- Not a per-scenario tutorial. The mcp_ls, scad_ls, and lasso_ls_correctness pages each cover one (datafit × penalty) combination in detail with the v1 bench comparison output.
- Not a profiling write-up. The historical perf notes in docs/perf/ cover the per-milestone investigations; see celer_skglm_study, lasso_ls_profile, and m13_4_profile.
- Not a stability surface. The numbers above shift with each maintainer re-snapshot; the only durable contract is the methodology (benches/v2/README.md).