M13.4 — Profile + Phase 2.3 fix for ls_group_mcp LLA overhead

Scenario: Gaussian LS + group MCP, n=10_000, p=1_000, group_size=5, n_groups=200, k_active_groups=5, SNR=5, 100-λ path from λ_max down to 1e-3 · λ_max, tol=1e-7, γ=3.0, max_outer=10, outer_tol=1e-6, strong rule + Anderson(5). Matches the v2 bench cell [ls_group_mcp / medium / deep / seed=any].

Profiling target: crates/skein-core/examples/group_mcp_ls_medium.rs — pure-Rust binary so samply / cargo flamegraph can resolve frames without the PyO3 + interpreter layer in the way. It uses an xorshift PRNG rather than numpy’s default_rng, so the realized active groups differ from the v2 bench cell; that yields a harder per-iteration problem (more inner-CD work per outer iter), which is useful for exposing the LLA overhead even though it does not directly reproduce the bench cell’s seconds.

cargo build --release --example group_mcp_ls_medium
./target/release/examples/group_mcp_ls_medium    # standalone profile
samply record ./target/release/examples/group_mcp_ls_medium

# Bench cell (numpy seed problem, 5 trials per seed):
PYTHONPATH=. .venv/bin/python -m benches.v2.report._run_cell \
    --scenario ls_group_mcp --size medium --regime deep --seed 0 \
    --package skein --config benches/v2/config.yaml \
    --out /tmp/skein.jsonl --env-out /tmp/skein.env.json

Phase 1 — Pre-fix baseline (standalone xorshift problem)

metric

value

total fit (measured)

54.0 s

sum(per_lambda_wall_ns)

53.4 s

final-λ active features

1000 / 1000

total outer iters across λ

617

total inner CD iters

2 555

total KKT passes

617

Outer-iter distribution was bimodal:

outer_iters

n_lambdas

1

37

2

1

7

5

8

4

9

19

10 (= max_outer)

34

53/100 λs ran the outer LLA loop to the cap or one short, indicating the coefficient-space convergence check max_block_change < outer_tol (block_path_lla.rs:213) was never satisfied on those — even though the surrogate MCP weights had long since saturated (max(base ‖β_g‖ / (λγ), 0) clamps to 0 for groups with ‖β_g‖ λγ).

Phase 2.3 fix — weight-space LLA fixed-point short-circuit

Implementation (block_path_lla.rs): cache the surrogate weights from the previous outer iter; before the screening + block-CD + KKT cycle in the next iter, compute the new weights w_t = ψ(β_{t-1}) and break if ‖w_t w_{t-1}‖_∞ < weight_short_circuit_tol. The threshold is set at 1000 · outer_tol (i.e. 1e-3 when outer_tol=1e-6) — above the inner-CD-induced weight jitter floor, but tight enough that no test in the integration suite regresses.

Standalone xorshift result (same problem as baseline):

metric

pre-Phase-2.3

post-Phase-2.3

delta

total wall

54.0 s

36.4 s

−32.6 %

total outer iters

617

340

−44.9 %

total inner CD iters

2 555

1 688

−33.9 %

max outer per λ

10 (cap)

6

Histogram of outer_iters collapses to {1: 38, 3: 1, 4: 24, 5: 19, 6: 18} — no λ runs to the max_outer cap, none past 6.

v2 bench cell result (numpy seed=0, 5-trial median):

metric

pre-Phase-2.3 (committed)

post-Phase-2.3

fit_time_s (skein)

39.31 s

40.17 s

outer_iters sum

(not in snapshot)

362

outer_iters mean

(not in snapshot)

3.62

grpreg comparison

12.74 s (skein 3.08×)

12.74 s (skein 3.15×)

The bench-cell numpy problem has dramatically fewer inner-CD iterations per outer iter (≈1 inner per outer) than the standalone xorshift problem (≈5 inner per outer), so trimming “polishing” outer iters at the tail of the LLA loop saves outer-iter count but not wall — the inner CD wasn’t doing real work in those iters anyway. The 41 % reduction in outer iters did not translate into wall savings on the realistic bench seed=0.

The wall is now dominated by the inner block-CD work itself, which the LLA wrapper cannot reduce — confirming the ROADMAP M13.4 conclusion that closing the residual ~3× gap to grpreg requires a native (non-LLA) group-MCP block solver (Phase 3, scoped as follow-up milestone M13.4b).

Headline (standalone xorshift problem, post-Phase-2.3)

metric

value

total fit (measured)

36.4 s

sum(per_lambda_wall_ns)

36.4 s

final-λ active features

1000 / 1000

total outer iters across λ

340

total inner CD iters

1 688

total KKT passes

340

Wall-time concentration (post-Phase-2.3, standalone)

Top 10 λ account for 31.9 % of solve wall-time (11.6 s / 36.4 s):

rank

k

λ

outer

inner_sum

wall_ms

1

83

8.56e-3

6

38

1520.4

2

80

1.05e-2

6

39

1275.3

3

86

6.94e-3

6

38

1201.2

4

92

4.57e-3

5

30

1134.1

5

82

9.17e-3

6

39

1119.0

6

84

7.98e-3

6

38

1100.1

7

81

9.84e-3

6

39

1086.0

8

85

7.44e-3

6

38

1076.5

9

87

6.47e-3

6

36

1036.9

10

88

6.04e-3

6

36

1033.4

These are the dense-tail saturation regime (active set 130–199 groups). Each does 6 outer iters × 5–6 inner CD iters each; the inner CD work is what we cannot avoid via LLA-loop micro-optimization.

What was tried but not shipped

  • Phase 2.1 — λ_max short-circuit. Profile data showed it would save ~3 % of wall time (the 38 λs that converge in 1 outer iter cost ~40 ms each, mostly in the unavoidable KKT scan). Deferred.

  • Phase 2.2 — empty-WS guard. Profile data showed zero λs produce an empty strong-rule WS in the bench scenario. Deferred.

  • Phase 2.4 — WS reuse across outer iters. The strong-rule scan costs ~1 ms per outer iter; reusing the WS could save ~100 ms total across the dense tail (<0.5 % of wall). Deferred.

All three are correct and easy fixes; none meaningfully closes the gap to grpreg. They are documented for future revisits if the LLA path remains primary, but the highest-leverage next investment is the native group-MCP BCD in Phase 3 / M13.4b.

Decision summary

Phase 2.3 ships because it is correct, simple, regression-free in the 343-test Rust suite and 412-test Python suite, and delivers real speedups on pathological-by-design problems where LLA outer iters do non-trivial work each. It is a no-cost improvement on the bench cell (the worst case there is “no change”) and a 30 % win on harder problems. The remaining gap to grpreg on the v2 cell is structural to the LLA approach and is scoped to follow-up milestone M13.4b.

Reproducibility

Numbers above are from an M1 host (8 BLAS threads, macOS, Accelerate), release profile with debug = "line-tables-only". Treat the seconds as approximate; the relative outer-iter distribution and wall-time concentration are the load-bearing observations.