lean-bench

4.6. Warm versus cold reading

Every benchmark run has a cold regime where the per-call overhead (child startup, JIT-style adaptive code paths in the runtime, cache fills, branch predictor warmup) dominates the cost of the algorithm itself. As a result, the first few rungs of the ladder are systematically slower than the asymptotic regime would predict. The harness reports per-call wall time including this overhead, and it trims the leading verdictWarmupFraction (20% by default) of ratios before fitting the slope (see Quickstart).
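The trimming step can be pictured with a few lines of Python. This is only a sketch of the arithmetic, not the harness's own code: the function name trim_warmup, the floor rounding rule, and the sample ratios are all assumptions made for illustration.

```python
import math

def trim_warmup(ratios, warmup_fraction=0.20):
    """Drop the leading warm-up share of a list of ratios.

    Assumption: the number of dropped rows is floor(fraction * len(ratios));
    this section does not state the rounding rule the harness uses.
    """
    if not 0.0 <= warmup_fraction < 1.0:
        raise ValueError("warmup_fraction must be in [0, 1)")
    n_trim = math.floor(warmup_fraction * len(ratios))
    return ratios[n_trim:]

# Hypothetical ratios for a 10-rung ladder; the first two are inflated
# by startup cost and fall inside the 20% trim region.
ratios = [9.1, 4.3, 2.2, 2.1, 2.0, 2.0, 2.1, 2.0, 2.0, 2.1]
kept = trim_warmup(ratios)
print(len(ratios) - len(kept), "rows trimmed;", len(kept), "rows kept for the fit")
```

With a short ladder the trim can round down to zero rows under this rounding assumption, which is one more reason not to over-read the early rungs.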

What this means in practice:

  • Don't read the first few rows as algorithm data. Tombstones () on early rungs in the report mark the trim region. The trimmed tail is the source of truth for the verdict; the early rows are kept for context only.

  • The per-spawn floor sets a hard noise floor. Every report prints per-spawn floor: X ms. Any data point with total_nanos smaller than ~10× the floor is noise rather than signal. The auto-tuner usually drives inner_repeats up enough that batches hit targetInnerNanos := 500ms, which is well above the floor. For very fast operations at small param values, though, you can see flat or non-monotone per-call times until param grows past the point where startup dominates.

  • A single child process per param means no warm cache between params. Each measurement is a fresh subprocess, by design (so the wall cap is enforceable without external timeout(1)). The flip side is no L1/L2 carry-over, no JIT-style steady state across rungs. The default warm mode does still amortise within a rung — the auto-tuner runs the function many times inside a single spawn, so caches and branch predictors reach steady state for that param. If you want the cold per-call cost (cache refill on every measurement, no internal averaging), use --cache-mode cold and read the Cache modes section of Advanced. The two modes measure different things; either can be appropriate.

  • Single-shot per param is fragile near the boundary. The default --outer-trials 1 collects one batch per ladder rung, and a single noisy spawn at the high end can flip a verdict from consistent to inconclusive. Raise it to --outer-trials 3 (or higher) to get a per-param median + spread; the verdict then sees the median per param, not a single noisy sample. See Outer trials in Advanced for what the summary block reports and what its limits are. Cost scales linearly with the trial count, so it's a deliberate trade.
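Two of the checks in the list above are simple enough to reproduce by hand when reading a report. The sketch below is illustrative only: the helper names, the 3 ms floor, the sample timings, and the definition of spread as max minus min are assumptions, not values or definitions taken from the harness.

```python
from statistics import median

def above_noise_floor(total_nanos, floor_ms, factor=10.0):
    """True when a batch's total time clears roughly factor x the per-spawn floor."""
    floor_nanos = floor_ms * 1_000_000  # the report prints the floor in milliseconds
    return total_nanos >= factor * floor_nanos

def median_and_spread(samples):
    """Summarise one param's per-call times as (median, max - min).

    Assumption: "spread" is taken as max - min here; see Outer trials in
    Advanced for what the summary block actually reports.
    """
    if not samples:
        raise ValueError("need at least one sample")
    return median(samples), max(samples) - min(samples)

# Hypothetical report values: a 3 ms per-spawn floor.
print(above_noise_floor(500_000_000, floor_ms=3.0))  # a full 500 ms batch
print(above_noise_floor(12_000_000, floor_ms=3.0))   # a 12 ms batch, under 10x the floor

# Three hypothetical outer trials for one param; one spawn was noisy.
mid, spread = median_and_spread([120.0, 118.0, 410.0])
print(mid, spread)
```

The median ignores the single noisy trial, while the spread still exposes it, which is the reason to look at both.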

For exponential-complexity benchmarks, the ladder shifts from doubling to a linear sweep over (lastOk, firstFail) and the log-x range narrows. The slope fit is rejected for narrow ranges and the verdict falls back to a multiplicative range check — cMax / cMin ≤ max(narrowRangeNoiseFloor, exp(slopeTolerance · xRange)), see Advanced. On those benchmarks the verdict's β line shows and you should read cMin/cMax directly.
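The multiplicative range check can be written out directly from the inequality above. This is a minimal sketch, assuming xRange is the width of the log-x range; the function name and the numeric values for narrowRangeNoiseFloor and slopeTolerance are hypothetical, so check Advanced for the real defaults.

```python
import math

def narrow_range_ok(c_min, c_max, x_range, narrow_range_noise_floor, slope_tolerance):
    """Evaluate cMax / cMin <= max(narrowRangeNoiseFloor, exp(slopeTolerance * xRange))."""
    if c_min <= 0 or c_max < c_min:
        raise ValueError("expected 0 < c_min <= c_max")
    bound = max(narrow_range_noise_floor, math.exp(slope_tolerance * x_range))
    return c_max / c_min <= bound

# Hypothetical settings, chosen only for illustration.
NOISE_FLOOR = 1.5
SLOPE_TOLERANCE = 0.15

# A linear sweep from param 20 to 24 covers a narrow log-x range.
x_range = math.log(24 / 20)

print(narrow_range_ok(1.0, 1.3, x_range, NOISE_FLOOR, SLOPE_TOLERANCE))
print(narrow_range_ok(1.0, 2.0, x_range, NOISE_FLOOR, SLOPE_TOLERANCE))
```

On a range this narrow the exponential term stays close to 1, so the noise floor is the term that sets the bound.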