Speed — headline benchmarks

One-page consolidation of skein-glm’s wall-clock headline numbers with explicit provenance. The authoritative raw data lives in paper/tables/T2_headline_timings.md (machine-generated by the v2 benchmark suite); this page is the human-readable summary backed by that table.

TL;DR

On the canonical medium cell (n=10000 samples × p=1000 features, 5 seeds, deep-tail λ grid, apple-m1 + Accelerate), skein is the fastest of the structured / nonconvex peer set (glmnet, celer, skglm, ncvreg) on every LS-family scenario, and the fastest package on every group/structured scenario tested. sklearn is faster than skein on ls_lasso and ls_elasticnet. Skein wins on logistic_mcp against ncvreg (the only ncvreg-comparable GLM cell). It loses against glmnet on logistic_lasso, poisson_lasso, and cox_lasso (see Caveats below): this snapshot is from skein 0.10.0 and predates the M13.8 celer-style screening cascade that the README’s “2.2–8.2× wall-clock on logistic_lasso” claim was measured against.

Provenance

The numbers in Headline timings come from a single snapshot run of benches/v2/:

| field | value |
| --- | --- |
| host_id | 3c43bb844695 |
| CPU | Apple M1 |
| BLAS | blas-accelerate (Accelerate framework) |
| cores | 8 |
| OS | macOS-26.4.1-arm64 |
| Python | 3.12.7 |
| skein-glm | 0.10.0 |
| numpy | 2.4.4 |
| scipy | 1.17.1 |
| sklearn | 1.8.0 |
| skglm | 0.5 |
| celer | 0.7.4 |
| snapshot | 2026-05-19 |
| git rev | 474c68a1 (per-cell) / 08b1d378 (bundle assembly, dirty working tree) |

Source: paper/tables/T6_environment.md and paper/BUNDLE.md.
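For readers comparing their own machine against the table above, the following is a minimal sketch of how the version fields could be collected with the Python standard library. It is not the script that generates T6_environment.md: the function name collect_environment is hypothetical, the distribution names (for example skein-glm and scikit-learn) are assumed to match what pip installed, and host_id and the BLAS backend are not reproduced here because the source does not say how they are derived.

```python
import platform
from importlib import metadata

# Distribution names as pip knows them (assumed; sklearn ships as scikit-learn).
PACKAGES = ["skein-glm", "numpy", "scipy", "scikit-learn", "skglm", "celer"]


def collect_environment(packages=PACKAGES):
    """Return a dict of OS, interpreter, and installed package versions."""
    env = {
        "OS": platform.platform(),
        "machine": platform.machine(),
        "Python": platform.python_version(),
    }
    for name in packages:
        try:
            env[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the gap instead of failing, so partial environments still report.
            env[name] = "not installed"
    return env


if __name__ == "__main__":
    for field, value in collect_environment().items():
        print(f"{field}: {value}")
```

Any difference in these fields from the table is a reason to expect timings that differ from the snapshot.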

Headline timings

All values are median wall-clock seconds across 5 seeds, with a single-threaded inner coordinate descent (CD); CV and stability selection parallelise across folds at a higher layer. Ratios are comparator_median / skein_median, so values > 1 mean skein is faster.
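The ratio convention can be sketched in a few lines. The per-seed timings below are made-up illustrative values, chosen only so that their medians match the ls_lasso and glmnet entries (1.13 s and 1.64 s); the real per-seed data lives in paper/tables/T2_headline_timings.md. The helper name speedup is hypothetical.

```python
from statistics import median


def speedup(skein_times, comparator_times):
    """comparator_median / skein_median; values > 1 mean skein is faster."""
    return median(comparator_times) / median(skein_times)


# Hypothetical per-seed wall-clock seconds (5 seeds each), NOT measured data.
skein = [1.10, 1.12, 1.13, 1.15, 1.19]
glmnet = [1.60, 1.62, 1.64, 1.66, 1.70]

print(f"{speedup(skein, glmnet):.2f}x")
```

Using the median rather than the mean keeps a single slow seed (for example a cold cache) from moving the headline ratio.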

Least-squares family

| scenario | skein (s) | comparator | comp (s) | skein speedup |
| --- | --- | --- | --- | --- |
| ls_lasso | 1.13 | glmnet | 1.64 | 1.45× |
| ls_lasso | 1.13 | celer | 3.05 | 2.71× |
| ls_lasso | 1.13 | skglm | 4.80 | 4.26× |
| ls_lasso | 1.13 | sklearn | 0.20 | 0.18× (sklearn faster) |
| ls_mcp | 1.70 | ncvreg | 7.78 | 4.59× |
| ls_mcp | 1.70 | skglm | 4.61 | 2.72× |
| ls_scad | 1.57 | ncvreg | 7.82 | 4.97× |
| ls_elasticnet | 1.40 | glmnet | 1.71 | 1.22× |
| ls_elasticnet | 1.40 | sklearn | 0.28 | 0.20× (sklearn faster) |

sklearn’s coordinate_descent is a tight Cython kernel for the unstructured lasso / elastic-net case, and matching it is out of scope for skein. The relevant comparisons are against the structured / nonconvex peer set (glmnet, celer, skglm, ncvreg).

Group / structured family

| scenario | skein (s) | comparator | comp (s) | skein speedup |
| --- | --- | --- | --- | --- |
| ls_group_lasso | 5.33 | grpreg | 11.39 | 2.14× |
| ls_group_mcp | 6.57 | grpreg | 12.56 | 1.91× |

No external package has a published sparse-group-MCP implementation, so ls_sparse_group_mcp (skein: 7.08 s median) runs without a comparator.

GLM family

| scenario | skein (s) | comparator | comp (s) | skein speedup |
| --- | --- | --- | --- | --- |
| logistic_mcp | 19.62 | ncvreg | 95.10 | 4.85× |
| logistic_lasso | 108.42 | glmnet | 7.88 | 0.07× (glmnet faster; see Caveats) |
| poisson_lasso | 41.73 | glmnet | 2.50 | 0.06× (glmnet faster; see Caveats) |
| cox_lasso | 3.82 | glmnet | 2.24 | 0.59× (glmnet faster; see Caveats) |

poisson_mcp (skein: 7.82 s) and cox_mcp (skein: 4.13 s) have no ncvreg/glmnet comparator at v0.10’s bench-cell coverage.

Caveats

The snapshot is from skein 0.10.0, dated 2026-05-19. The post-0.10 perf work, particularly M13.8’s celer-style gap-safe screening on the GLM prox-Newton surrogate, landed after this snapshot, and the v2 suite has not been re-run on a v1.0+ build. The CHANGELOG entry for M13.8 reports “2.2–8.2× wall-clock on logistic_lasso v2 cells”. If that range carried over unchanged to the 108 s number above, skein would land at roughly 13–49 s: still slower than glmnet’s 7.9 s on this cell, but closer. The poisson_lasso and cox_lasso cells are likewise pending a re-snapshot.
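The projection above is plain division, sketched here so the bounds are easy to re-derive. It assumes the CHANGELOG speedup range transfers unchanged to this particular cell, which has not been measured; the helper name projected_range is hypothetical.

```python
def projected_range(baseline_s, min_speedup, max_speedup):
    """Return (best_case_s, worst_case_s) after applying a speedup range."""
    return baseline_s / max_speedup, baseline_s / min_speedup


# Values from this page: skein logistic_lasso median, glmnet median,
# and the M13.8 CHANGELOG speedup range.
skein_s, glmnet_s = 108.42, 7.88
low, high = projected_range(skein_s, 2.2, 8.2)

print(f"projected skein: {low:.1f}-{high:.1f} s")
print(f"projected ratio vs glmnet: {glmnet_s / high:.2f}x-{glmnet_s / low:.2f}x")
```

Even the best case leaves the projected ratio below 1, which is why the table entry still reads as a glmnet win until the suite is re-run.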

Re-running the headline matrix is a maintainer-overnight job: running the full cibuildwheel-grade fits across 5 seeds × 16 scenarios × 4 BLAS configs takes ~6–10 h on a single laptop. The H1 closeout (at_scale.md) describes the relevant xlarge tier extension and the maintainer-overnight cadence.

These numbers are deep-tail (λ_min/λ_max = 1e-3). Sparse-regime (5e-2) numbers point in the same direction, but the magnitudes shift: sparse cells finish 3–10× faster across the board because the active set stays at k_active. See paper/tables/T2_headline_timings.md for the per-regime breakdown.
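To make the two regimes concrete, here is a minimal sketch of a log-spaced λ path for a least-squares lasso. It assumes the objective (1/2n)‖y − Xβ‖² + λ‖β‖₁ with no intercept, a 100-point grid, and that the sparse-regime 5e-2 is the same λ_min/λ_max ratio; none of these are confirmed as what the v2 suite does, and lambda_grid is a hypothetical helper, not a skein API.

```python
import numpy as np


def lambda_grid(X, y, ratio=1e-3, n_lambdas=100):
    """Log-spaced lambda path from lambda_max down to ratio * lambda_max."""
    n_samples = X.shape[0]
    # Smallest lambda giving an all-zero lasso solution under the assumed
    # objective (1/2n)||y - X b||^2 + lambda * ||b||_1, no intercept.
    lam_max = np.max(np.abs(X.T @ y)) / n_samples
    return lam_max * np.geomspace(1.0, ratio, n_lambdas)


# Small synthetic problem, for illustration only.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
y = rng.standard_normal(200)

deep_tail = lambda_grid(X, y, ratio=1e-3)  # path runs far into the dense end
sparse = lambda_grid(X, y, ratio=5e-2)     # path stops while few features are active

print(deep_tail[-1] / deep_tail[0])
print(sparse[-1] / sparse[0])
```

The deep-tail path ends at a much smaller λ, where more coefficients are nonzero and each solve costs more, which is consistent with the sparse cells finishing faster.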

Reproducing

The numbers above can be regenerated from a clean checkout via the v2 Snakemake suite:

pip install -e '.[bench]'
maturin develop --release --features=blas-accelerate   # macOS
# or: --features=blas-openblas   on Linux
cd benches/v2 && snakemake --profile profiles/m1-headline

The artifacts land under paper/figures/ and paper/tables/. See benches/v2/README.md for the design rationale and the host-id provenance contract that prevents committing snapshots from one machine over snapshots from another.

A lightweight per-PR canary (.github/workflows/bench-smoke.yml) runs two cells per PR to catch pipeline breakage; it doesn’t compare against committed snapshots.

What this page is NOT

  • Not a per-scenario tutorial. The mcp_ls, scad_ls, and lasso_ls_correctness pages each cover one (datafit × penalty) combination in detail with the v1 bench comparison output.

  • Not a profiling write-up. The historical perf notes in docs/perf/ cover the per-milestone investigations — see celer_skglm_study, lasso_ls_profile, and m13_4_profile.

  • Not a stability surface. The numbers above shift with each maintainer re-snapshot; the only durable contract is the methodology (benches/v2/README.md).