lean-bench

1.5. CI-budget mode🔗

For CI you usually want a benchmark suite that takes a predictable amount of time, not whatever total a full sweep happens to land on that day. run accepts a --total-seconds N flag that bounds the whole suite to roughly N seconds of wallclock time. It is a first-class feature: the harness schedules benchmarks against a monotonic-clock deadline rather than relying on shell-script timeout wrappers around the whole run.

Mechanics:

  1. The orchestrator captures the start time and computes a deadline.

  2. Before each benchmark, it checks the deadline. Benchmarks that would start past the deadline are recorded as budget_skip entries — they appear in the terminal output and the export document.

  3. The deadline is also threaded into runBenchmark. A benchmark that's mid-ladder when the deadline trips finishes its current rung (each rung is independently capped by maxSecondsPerCall) and then stops; the result is tagged budget_truncated.

  4. The run still produces a partial verdict for whatever rungs landed. Partial results aren't an error — they're the point of budget mode.

The bound on total wallclock time is approximately total_seconds + maxSecondsPerCall: at most one rung can be in flight when the deadline trips. There is no in-rung kill: a benchmark with a 30s maxSecondsPerCall can blow past a 5s budget by up to 30s. Tighten the per-call cap if you need a tighter bound.

A budgeted run prints each benchmark's normal report as it lands, plus a one-line budget summary at the bottom and a bullet per skipped benchmark. A realistic 60-second suite with seven parametric benchmarks tagged regression:

$ lake exe my_benchmarks run --tag regression --total-seconds 60
MyProject.Hash.murmur3        expected complexity: n           [warm cache]
  env: lean=4.30.0-rc2, os=linux, arch=x86_64, commit=ab12cd, 2026-05-04T18:02:11Z, host=ci-runner-7
  param  per-call    repeats  C
     16   12.2 ns    ×2^21    C= —     [<floor] †
     32   24.1 ns    ×2^21    C= —     [<floor] †
     64   47.9 ns    ×2^21    C= 0.748
    128   95.0 ns    ×2^20    C= 0.742
    256    188 ns    ×2^19    C= 0.734
    512    374 ns    ×2^18    C= 0.730
  1_024    742 ns    ×2^17    C= 0.725
  2_048  1.49 µs    ×2^16    C= 0.727
  verdict: consistent with declared complexity (cMin=0.725, cMax=0.748, β=0.005)
  per-spawn floor (harness self-measurement): 1.4 ms

… five more benchmarks reported similarly …

MyProject.Sort.heapSort       expected complexity: n * Nat.log2 (n + 1)    [warm cache, budget truncated]
  param  per-call    repeats  C
     16    302 ns    ×2^16    C= 4.71 †
     32    684 ns    ×2^15    C= 4.27
     64  1.51 µs    ×2^14    C= 3.93
    128  3.32 µs    ×2^13    C= 3.71
  verdict: consistent with declared complexity (cMin=3.71, cMax=4.27, β=-0.038)
  per-spawn floor (harness self-measurement): 1.4 ms

budget: 60.000s; elapsed 58.910s; completed 7, skipped 2, truncated 1
  [budget skip] MyProject.Sort.runMergeSort    [parametric]
  [budget skip] MyProject.Sort.runInsertion    [parametric]

Each completed benchmark gets its full report (ladder + verdict + per-spawn floor); the one whose ladder was cut off mid-flight is tagged [budget truncated] on its header and stops at the last rung that finished before the deadline. The two [budget skip] bullets at the bottom name benchmarks that didn't get to start at all. completed 7 and truncated 1 overlap: the truncated benchmark is one of the seven that completed (it just stopped early).

When --export-file FILE is also passed, the export document gains a top-level budget object describing what happened. Each result entry in results[] also carries "budget_truncated": true|false.

run and compare use distinct non-zero codes so CI can react without parsing stdout:

code

meaning

0

success — the run completed and (with --baseline) no regressions were flagged

1

a regression vs. the baseline was flagged, or argument validation / dispatch errored

2

the parametric run produced zero verdict-eligible rows

Code 2 is the right signal to fail-loud on rather than swallow: the verdict text would otherwise read inconclusive (cMin=—, cMax=—, β=—) and look like a benign "couldn't decide", masking the underlying calibration bug.