lean-bench

6.5. What we left out🔗

  • Elaboration-time subprocess sanity check ("does f 0 finish in < 1s?"). Considered and rejected: it makes the elaborator depend on the compiled binary, which is circular and phase-dependent. v0.1 static checks only verify symbol existence, arity, and complexity type. Compiled-code sanity checks live in lean-bench verify, which invokes the registered runner in-process against f 0 and f 1 for each registered benchmark and reports any non-ok outcome.

  • Strong verdict labels. The verdict fits the log-log slope β of C vs param over the trimmed tail (leading 20% of ratios dropped by default). |β| ≤ 0.15 → "consistent with declared complexity", otherwise "inconclusive". β's sign is reported alongside the verdict so callers can see direction, but the verdict itself stays binary: the raw ratios array is the source of truth and the verdict is decoration. The labels deliberately don't say "matches" or "doesn't match" because the heuristic isn't defended statistically.

  • Statistical machinery. No outlier rejection, confidence intervals, repeated outer trials. v0.1 trusts that targetInnerNanos := 500ms × auto-tuned inner-repeats is enough signal-to-noise for the orders of magnitude we care about.