Speed: headline benchmarks
One-page consolidation of skein-glm’s wall-clock headline numbers with
explicit provenance. The authoritative raw data lives in
paper/tables/T2_headline_timings.md (machine-generated by the v2
benchmark suite); this page is the human-readable summary backed by
that table.
TL;DR
On the canonical medium cell (n=10000 samples × p=1000 features, 5
seeds, deep-tail λ grid, apple-m1 + Accelerate) skein is faster than
every structured / nonconvex peer (glmnet, celer, skglm, ncvreg,
grpreg) on every LS-family and group/structured scenario tested;
sklearn’s L1-only coordinate descent is faster still on the two LS
rows where it appears. skein also wins on logistic_mcp against ncvreg
(the only ncvreg-comparable GLM cell). It loses to glmnet on
logistic_lasso, poisson_lasso, and cox_lasso (see Caveats below):
this snapshot is from skein 0.10.0 and predates the M13.8 celer-style
screening cascade behind the README’s “2.2–8.2× wall-clock on
logistic_lasso” claim.
Provenance
The numbers in Headline timings come from a single
snapshot run of benches/v2/:
| field | value |
|---|---|
| host_id | |
| CPU | Apple M1 |
| BLAS | |
| cores | 8 |
| OS | macOS-26.4.1-arm64 |
| Python | 3.12.7 |
| skein-glm | 0.10.0 |
| numpy | 2.4.4 |
| scipy | 1.17.1 |
| sklearn | 1.8.0 |
| skglm | 0.5 |
| celer | 0.7.4 |
| snapshot | 2026-05-19 |
| git rev | |
Source: paper/tables/T6_environment.md and paper/BUNDLE.md.
Headline timings
All values are median wall-clock seconds across 5 seeds, with a
single-threaded inner CD (CV / stability selection parallelise across
folds at a higher layer). Ratios are comparator_median / skein_median:
values > 1 mean skein is faster.
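As a minimal sketch of the ratio convention described above (the helper name `speedup` and the per-seed timings are hypothetical illustrations, not part of the v2 suite), the speedup column can be recomputed from the rounded medians on this page:

```python
from statistics import median


def speedup(comparator_median: float, skein_median: float) -> float:
    """Return comparator_median / skein_median; values > 1 mean skein is faster."""
    if skein_median <= 0:
        raise ValueError("skein_median must be positive")
    return comparator_median / skein_median


# Hypothetical per-seed timings in seconds (illustrative only, not benchmark data).
seed_timings = [1.10, 1.13, 1.12, 1.15, 1.20]
skein_median = median(seed_timings)  # middle value of the 5 sorted timings

# Rounded medians copied from the tables on this page.
glmnet_ratio = speedup(1.64, 1.13)   # LS row: glmnet vs skein
sklearn_ratio = speedup(0.20, 1.13)  # LS row: sklearn vs skein

print(f"{glmnet_ratio:.2f}x, {sklearn_ratio:.2f}x")
```

Ratios recomputed this way can differ from the published column in the last digit (for example 3.05 / 1.13 rounds to 2.70, while the table lists 2.71×), presumably because the table was generated from unrounded timings.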
Least-squares family
| scenario | skein (s) | comparator | comp (s) | skein speedup |
|---|---|---|---|---|
| | 1.13 | glmnet | 1.64 | 1.45× |
| | 1.13 | celer | 3.05 | 2.71× |
| | 1.13 | skglm | 4.80 | 4.26× |
| | 1.13 | sklearn | 0.20 | 0.18× (sklearn faster) |
| | 1.70 | ncvreg | 7.78 | 4.59× |
| | 1.70 | skglm | 4.61 | 2.72× |
| | 1.57 | ncvreg | 7.82 | 4.97× |
| | 1.40 | glmnet | 1.71 | 1.22× |
| | 1.40 | sklearn | 0.28 | 0.20× (sklearn faster) |

sklearn’s coordinate_descent is a tight Cython kernel for the L1-only
case — out of scope for skein to chase. The relevant comparisons are
against the structured / nonconvex peer set (glmnet, celer, skglm,
ncvreg).
Group / structured family
| scenario | skein (s) | comparator | comp (s) | skein speedup |
|---|---|---|---|---|
| | 5.33 | grpreg | 11.39 | 2.14× |
| | 6.57 | grpreg | 12.56 | 1.91× |

No external package has a published sparse-group-MCP implementation, so
ls_sparse_group_mcp (skein: 7.08 s median) runs without a comparator.
GLM family
| scenario | skein (s) | comparator | comp (s) | skein speedup |
|---|---|---|---|---|
| | 19.62 | ncvreg | 95.10 | 4.85× |
| | 108.42 | glmnet | 7.88 | 0.07× (glmnet faster; see Caveats) |
| | 41.73 | glmnet | 2.50 | 0.06× (glmnet faster; see Caveats) |
| | 3.82 | glmnet | 2.24 | 0.59× (glmnet faster; see Caveats) |

poisson_mcp (skein: 7.82 s) and cox_mcp (skein: 4.13 s) have no
ncvreg/glmnet comparator at v0.10’s bench-cell coverage.
Caveats
The snapshot is from skein 0.10.0, dated 2026-05-19. The post-0.10
perf work, particularly M13.8’s celer-style gap-safe screening on
the GLM prox-Newton surrogate, landed after this snapshot, and the
v2 suite has not been re-run on a v1.0+ build. The CHANGELOG entry for
M13.8 reports “2.2–8.2× wall-clock on logistic_lasso v2 cells”; if that
range carried over to the 108 s number above, skein would land in the
13–49 s range, still slower than glmnet’s 7.9 s on this cell, but
closer. The poisson_lasso and cox_lasso cells are likewise awaiting a
re-snapshot.
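The 13–49 s projection for logistic_lasso is plain division of the snapshot median by the two ends of the CHANGELOG range. The sketch below reproduces that arithmetic; it assumes the reported 2.2–8.2× speedup transfers unchanged to this cell, which has not been measured, and `projected_range` is a hypothetical helper name.

```python
def projected_range(current_s: float, min_speedup: float, max_speedup: float):
    """Return (best_case_s, worst_case_s) after dividing by the speedup bounds."""
    if not 0 < min_speedup <= max_speedup:
        raise ValueError("expected 0 < min_speedup <= max_speedup")
    # The largest speedup gives the shortest projected time, and vice versa.
    return current_s / max_speedup, current_s / min_speedup


skein_snapshot_s = 108.42  # skein 0.10.0 median from the GLM table
glmnet_s = 7.88            # glmnet median on the same row

best_s, worst_s = projected_range(skein_snapshot_s, 2.2, 8.2)

print(f"projected skein time: {best_s:.1f} to {worst_s:.1f} s")
print(f"still slower than glmnet even in the best case: {best_s > glmnet_s}")
```

Under that assumption the best case is still above glmnet’s median, which is why the cell stays listed as a loss until the v2 suite is re-run.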
Re-running the headline matrix is a maintainer-overnight job: the
full cibuildwheel-grade fits across 5 seeds × 16 scenarios × 4 BLAS
configs take ~6–10 h on a single laptop. The H1 closeout
(at_scale.md) describes the relevant xlarge tier
extension and the maintainer-overnight cadence.
These numbers are deep-tail (λ_min/λ_max = 1e-3). Sparse-regime
(5e-2) numbers track the same direction qualitatively but the
magnitudes shift — sparse cells finish 3–10× faster across the board
because the active set stays at k_active. See
paper/tables/T2_headline_timings.md for the per-regime breakdown.
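To make the two regimes concrete, here is a rough sketch of what a λ path with a given λ_min/λ_max ratio looks like. It assumes a log-spaced grid of 100 points, a λ_max normalised to 1.0, and that the sparse-regime 5e-2 figure is the same λ_min/λ_max ratio; none of these details are confirmed by this page, and `lambda_grid` is a hypothetical helper, not the benchmark suite’s own code.

```python
import numpy as np


def lambda_grid(lambda_max: float, ratio: float, n_lambdas: int = 100) -> np.ndarray:
    """Log-spaced grid from lambda_max down to ratio * lambda_max (assumed layout)."""
    if not 0 < ratio < 1:
        raise ValueError("ratio must lie strictly between 0 and 1")
    return np.geomspace(lambda_max, ratio * lambda_max, num=n_lambdas)


deep_tail = lambda_grid(1.0, 1e-3)  # lambda_min / lambda_max = 1e-3
sparse = lambda_grid(1.0, 5e-2)     # lambda_min / lambda_max = 5e-2

# The deep-tail path ends at a penalty 50x smaller than the sparse path,
# so its final fits are much less regularised.
print(deep_tail[-1], sparse[-1])
```

A smaller final λ generally admits more nonzero coefficients, which is consistent with the deep-tail cells being the slower ones.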
Reproducing
The numbers above can be regenerated from a clean checkout via the v2 Snakemake suite:

    pip install -e '.[bench]'
    maturin develop --release --features=blas-accelerate  # macOS
    # or: --features=blas-openblas on Linux
    cd benches/v2 && snakemake --profile profiles/m1-headline

The artifacts land under paper/figures/ and paper/tables/. See
benches/v2/README.md for the design rationale and the host-id
provenance contract that prevents committing snapshots from one
machine over snapshots from another.
A lightweight per-PR canary (.github/workflows/bench-smoke.yml) runs
two cells per PR to catch pipeline breakage; it doesn’t compare against
committed snapshots.
What this page is NOT
- Not a per-scenario tutorial. The mcp_ls, scad_ls, and lasso_ls_correctness pages each cover one (datafit × penalty) combination in detail with the v1 bench comparison output.
- Not a profiling write-up. The historical perf notes in docs/perf/ cover the per-milestone investigations; see celer_skglm_study, lasso_ls_profile, and m13_4_profile.
- Not a stability surface. The numbers above shift with each maintainer re-snapshot; the only durable contract is the methodology (benches/v2/README.md).