1.5. CI-budget mode
For CI you usually want a benchmark suite that takes a predictable
amount of time, not whatever total a full sweep happens to land on
that day. run accepts a --total-seconds N flag that bounds the
whole suite to roughly N seconds of wallclock time. It is a
first-class feature: the harness schedules benchmarks against a
monotonic-clock deadline rather than relying on shell-script
timeout wrappers around the whole run.
Mechanics:
-
The orchestrator captures the start time and computes a deadline.
-
Before each benchmark, it checks the deadline. Benchmarks that would start past the deadline are recorded as
budget_skipentries — they appear in the terminal output and the export document. -
The deadline is also threaded into
runBenchmark. A benchmark that's mid-ladder when the deadline trips finishes its current rung (each rung is independently capped bymaxSecondsPerCall) and then stops; the result is taggedbudget_truncated. -
The run still produces a partial verdict for whatever rungs landed. Partial results aren't an error — they're the point of budget mode.
The bound on total wallclock time is approximately
total_seconds + maxSecondsPerCall: at most one rung can be in
flight when the deadline trips. There is no in-rung kill: a
benchmark with a 30s maxSecondsPerCall can blow past a 5s budget
by up to 30s. Tighten the per-call cap if you need a tighter bound.
A budgeted run prints each benchmark's normal report as it lands,
plus a one-line budget summary at the bottom and a bullet per
skipped benchmark. A realistic 60-second suite with seven
parametric benchmarks tagged regression:
$ lake exe my_benchmarks run --tag regression --total-seconds 60
MyProject.Hash.murmur3 expected complexity: n [warm cache]
env: lean=4.30.0-rc2, os=linux, arch=x86_64, commit=ab12cd, 2026-05-04T18:02:11Z, host=ci-runner-7
param per-call repeats C
16 12.2 ns ×2^21 C= — [<floor] †
32 24.1 ns ×2^21 C= — [<floor] †
64 47.9 ns ×2^21 C= 0.748
128 95.0 ns ×2^20 C= 0.742
256 188 ns ×2^19 C= 0.734
512 374 ns ×2^18 C= 0.730
1_024 742 ns ×2^17 C= 0.725
2_048 1.49 µs ×2^16 C= 0.727
verdict: consistent with declared complexity (cMin=0.725, cMax=0.748, β=0.005)
per-spawn floor (harness self-measurement): 1.4 ms
… five more benchmarks reported similarly …
MyProject.Sort.heapSort expected complexity: n * Nat.log2 (n + 1) [warm cache, budget truncated]
param per-call repeats C
16 302 ns ×2^16 C= 4.71 †
32 684 ns ×2^15 C= 4.27
64 1.51 µs ×2^14 C= 3.93
128 3.32 µs ×2^13 C= 3.71
verdict: consistent with declared complexity (cMin=3.71, cMax=4.27, β=-0.038)
per-spawn floor (harness self-measurement): 1.4 ms
budget: 60.000s; elapsed 58.910s; completed 7, skipped 2, truncated 1
[budget skip] MyProject.Sort.runMergeSort [parametric]
[budget skip] MyProject.Sort.runInsertion [parametric]
Each completed benchmark gets its full report (ladder + verdict +
per-spawn floor); the one whose ladder was cut off mid-flight is
tagged [budget truncated] on its header and stops at the last
rung that finished before the deadline. The two [budget skip]
bullets at the bottom name benchmarks that didn't get to start at
all. completed 7 and truncated 1 overlap: the truncated
benchmark is one of the seven that completed (it just stopped
early).
When --export-file FILE is also passed, the export document gains
a top-level budget object describing what happened. Each result
entry in results[] also carries "budget_truncated": true|false.
run and compare use distinct non-zero codes so CI can react
without parsing stdout:
code | meaning |
|---|---|
|
success — the run completed and (with |
| a regression vs. the baseline was flagged, or argument validation / dispatch errored |
| the parametric run produced zero verdict-eligible rows |
Code 2 is the right signal to fail-loud on rather than swallow:
the verdict text would otherwise read inconclusive (cMin=—,
cMax=—, β=—) and look like a benign "couldn't decide", masking the
underlying calibration bug.