4.5. Compiled code is not the only way Lean evaluates
lean-bench measures compiled-code wall time, full stop (see
Advanced).
It does not measure:
-
#eval— runs in the elaborator, not native code, and can be much slower than the compiled equivalent on numeric work. -
decide/Decidablereduction at type-check time — uses the kernel reducer, which is slower still and has very different cost-per-step characteristics from compiled code. -
reduce/simp— symbolic, not computational. The cost model is "size of the term, number of rewrites" rather than "wall-clock time of a compiled function". -
kernel typechecking — has its own performance regime, dominated by definitional unfolding rather than by the algorithm.
Two consequences. First, you cannot use #eval (or a tactic that
calls into the elaborator) as a sanity check that your benchmark
is fast enough. Compile it, run it, look at
lean-bench list / verify. Second, performance regressions
visible in simp or decide aren't the same regressions a
compiled-code benchmark catches — these are distinct phenomena
and need distinct tooling. If you're chasing a compile-time
slowdown, you want kernel- or elaborator-level tracing rather
than a lean-bench row.
The corollary: a function being benchmarked must compile.
noncomputable defs, defs that depend on Classical.choice, and
proofs-as-programs (PSigma.fst-style extraction from
non-decidable existentials) won't be measurable. The failure
surfaces when building or running the compiled benchmark, not
during setup_benchmark elaboration.