4. Lean-specific benchmarking pitfalls🔗

A microbenchmark in Lean 4 measures wall time on compiled code. That sounds simple, but Lean's runtime — bignum Nat, reference-counted allocation, structural sharing, ahead-of-time compilation versus elaborator evaluation — has enough sharp edges that a benchmark can silently measure something other than the algorithm you wrote down. This guide walks through the failure modes we have hit ourselves.

lean-bench papers over some of these (it forces every result, hashes when it can, and runs in a clean child process for each measurement) but it cannot save you from a benchmark whose declared parameter doesn't actually drive the work, or whose return value's bignum arithmetic dominates the algorithm. Knowing what the harness does and does not handle is the point of this document.

For background on the harness mechanics referenced below, see Design. For the user-facing setup, see Quickstart.

4.1. Nat is bignum
4.2. Force the work
4.3. Hashing and result observation are not free
4.4. Allocation and sharing
4.5. Compiled code is not the only way Lean evaluates
4.6. Warm versus cold reading
4.7. Fixed-problem median
4.8. Summary