lean-bench

1.7. Reading the output🔗

Each row of the report is a per-call wall time at one rung of the ladder. The C column is the unit-cost ratio C = perCallNanos / complexity(param). When the declared complexity matches reality, C is approximately constant across the ladder; when it doesn't, C drifts. The verdict line is a log-log slope check on C vs param: |β| ≤ 0.15 reads "consistent with declared complexity", anything wider reads "inconclusive". β's sign tells you the direction — positive means actually slower than declared, negative means actually faster.

The harness self-measures its per-spawn overhead and prints it under the verdict as per-spawn floor. Any rung whose total sample is shorter than signalFloorMultiplier × spawn_floor (default 10×) is dominated by subprocess startup rather than the function, so its per-call number is meaningless; those rows are tagged [<floor] and dropped from the verdict.

If too few rungs land — every row is below the floor, every row hits the cap, the ladder was truncated mid-flight, or fewer than three rows survive the warmup-trim plus signal-floor filter — the report can't compute a verdict, and advisory lines appear naming concrete knobs to turn next (--max-seconds-per-call, --param-ceiling, paramSchedule := .custom #[...], setup_fixed_benchmark, …).

A wide-ladder goodFib run, on stable hardware, gives a consistent verdict. flags the warmup-trim region (excluded from the slope fit); [<floor] flags rungs dominated by spawn overhead; the verdict line reports cMin, cMax, and β:

$ lake exe fib_benchmark_example run goodFib --param-ceiling 4096 --max-seconds-per-call 1
LeanBench.Examples.Fib.goodFib    expected complexity: n    [warm cache]
  env: lean=4.30.0-rc2, os=linux, arch=x86_64, commit=<HASH>, <DATE>, host=<HOST>
  param  per-call    repeats  C
      0   46.000 ns  ×2^23    C= —
      1  107.000 ns  ×2^23    C= —
      2   82.000 ns  ×2^22    C=41.343 †
      4  112.000 ns  ×2^21    C= —     [<floor]
      8  200.000 ns  ×2^22    C=25.061 †
     16  278.000 ns  ×2^21    C=17.416
     32  459.000 ns  ×2^20    C=14.353
     64    1.236 µs  ×2^19    C=19.324
    128    2.334 µs  ×2^18    C=18.241
    256    3.743 µs  ×2^17    C=14.623
    512    7.506 µs  ×2^16    C=14.662
  1_024   17.287 µs  ×2^15    C=16.882
  2_048   18.792 µs  ×2^13    C= —     [<floor]
  4_096   43.304 µs  ×2^13    C=10.572
  verdict: consistent with declared complexity (cMin=10.572, cMax=19.324, β=-0.065)
  per-spawn floor (harness self-measurement): 33.998 ms

β=-0.065 sits inside the |β| ≤ 0.15 band, so the verdict reads "consistent". The table tells the same story: per-call time grows roughly linearly with param once spawn overhead amortises, and C stabilises near 15 ns. The two [<floor] rungs are batches that ran shorter than 10× the spawn floor — pure noise, and correctly dropped from the slope fit.

The opposite case: a tight ladder where the cap fires before any real ladder data accumulates. The verdict reads "inconclusive" and the advisories name the knob to turn:

$ lake exe fib_benchmark_example run goodFib --param-ceiling 16 --max-seconds-per-call 0.5 --signal-floor-multiplier 1.0
LeanBench.Examples.Fib.goodFib    expected complexity: n    [warm cache]
  env: lean=4.30.0-rc2, os=linux, arch=x86_64, commit=<HASH>, <DATE>, host=<HOST>
  param  per-call    repeats  C
      0   55.000 ns  ×2^23    C= —
      1  113.000 ns  ×2^23    C= —
      2   79.000 ns  ×2^22    C=39.733
      4  140.000 ns  ×2^22    C=35.008
      8  234.000 ns  ×2^22    C=29.356
     16  247.000 ns  ×2^21    C=15.492
  verdict: inconclusive (cMin=15.492, cMax=39.733, β=-0.433, looks faster than declared by ~n^0.433)
  per-spawn floor (harness self-measurement): 32.899 ms

β=-0.433 sits well outside the band, so the verdict reads "inconclusive" with a directional hint ("looks faster than declared by ~n^0.433"). It's a calibration failure, not a bug: at small n the per-call setup dominates the linear algorithm, dragging C down across rungs. Widening the ladder (--param-ceiling) lets the asymptotic cost win and tips the verdict back to consistent — exactly what the previous example shows.