lean-bench

3.3. Cache modes🔗

--cache-mode warm (the default) and --cache-mode cold measure different things, and the right choice depends on what you're investigating. The result table tags every run with [warm cache] or [cold cache] and the JSONL row carries the same discriminator as cache_mode: "warm" | "cold" so post-hoc analysis can keep them apart.

Warm mode is the v0.1 design. The child runs an auto-tuned inner-repeat loop in a single subprocess: inner_repeats is picked to land just past targetInnerNanos / 2 of inner work. CPU caches, branch predictors, the runtime's adaptive code paths, and the GC's live set all reach steady state across the repeats. The reported per-call time is total_nanos / inner_repeats — the asymptotic cost of the algorithm under hot microarchitectural state. This is what you want when you're measuring the algorithm itself: tightening a hot loop, comparing two implementations, or fitting a complexity model.

Cold mode respawns the child for every rung of the ladder and runs the function exactly once per spawn (inner_repeats := 1, auto-tune skipped). The harness no longer intentionally preserves intra-child state across measurements; whatever the warm loop amortises across many iterations of one process you pay in full here. The reported per-call time includes cache refill of whatever the previous spawn evicted, branch-predictor warmup, allocator first-touch, and any per-call setup that the auto-tuner would hide. This is what you want when you're investigating locality (does the working set fit in L1/L2?), first-touch costs, or whether the warm steady-state numbers under-report what a real workload will see.

Note: cold mode is "no harness-side warmup," not "guaranteed cold hardware." The OS can still keep recently-touched pages, the CPU front-end can still hold recently-decoded code, and a system under light load may carry parts of the working set across processes. Spawning a fresh process is a strong knob, not a deterministic flush — read it as "we are no longer doing anything to keep the caches warm," not "we have evicted them."

The trade-off: cold measurements are noisier per data point. With inner_repeats := 1, the per-spawn floor and OS jitter both fall on a single sample — there is no internal averaging. The verdict's slope and cMin/cMax checks still work, but expect a wider tolerance band and a higher chance of the verdict landing on inconclusive purely from noise. If you're interpreting cold numbers as algorithm data rather than locality data, raise --max-seconds-per-call so the larger rungs run long enough to dwarf the per-spawn floor.

Either mode can be pinned at declaration time (where { cacheMode := .warm } / .cold) or selected per-run via the CLI (--cache-mode warm|cold on run / compare). The CLI override layers on top of the declared mode, like every other config knob.

When in doubt, run both: lake exe bench run myFn --cache-mode warm gives the steady-state cost; lake exe bench run myFn --cache-mode cold exposes the cold-state cost. The gap between the two is the warm-up budget your real workload either can or cannot afford to amortise.