4.5. Compiled code is not the only way Lean evaluates🔗

lean-bench measures compiled-code wall time, full stop (see Advanced). It does not measure:

#eval — runs in the elaborator, not native code, and can be much slower than the compiled equivalent on numeric work.
decide / Decidable reduction at type-check time — uses the kernel reducer, which is slower still and has very different cost-per-step characteristics from compiled code.
reduce / simp — symbolic, not computational. The cost model is "size of the term, number of rewrites" rather than "wall-clock time of a compiled function".
kernel typechecking — has its own performance regime, dominated by definitional unfolding rather than by the algorithm.

Two consequences. First, you cannot use #eval (or a tactic that calls into the elaborator) as a sanity check that your benchmark is fast enough. Compile it, run it, look at lean-bench list / verify. Second, performance regressions visible in simp or decide aren't the same regressions a compiled-code benchmark catches — these are distinct phenomena and need distinct tooling. If you're chasing a compile-time slowdown, you want kernel- or elaborator-level tracing rather than a lean-bench row.

The corollary: a function being benchmarked must compile. noncomputable defs, defs that depend on Classical.choice, and proofs-as-programs (PSigma.fst-style extraction from non-decidable existentials) won't be measurable. The failure surfaces when building or running the compiled benchmark, not during setup_benchmark elaboration.