Measuring Functions in Criterion

I’ve recently spent a fair amount of time with Haskell’s Criterion library for measuring performance. I found myself somewhat disappointed by the performance of Criterion itself, and so took a whack at speeding it up. After several issues, I am now more familiar with both Criterion and Haskell. Haskell is often criticized as being hard to reason about, so I thought I would write up my experiences.

As Criterion is one of the sources of truth about the performance of Haskell code, the changes I made have to be verified through other means. I’ll try to provide evidence from both the dumped compilation results and examples.

The Measurement Loop

At the highest level, Criterion measures functions by running them in batches. It will run the function an increasing number of times until it can make a “valid” measurement. At the moment, the batch must run for 30ms to be considered statistically significant. Thus, a function that takes 1s must run once and a function that takes 1ms must run thirty times.

The batching and measurement both come with their own overhead, so care must be taken not to include those in the final result. Keeping that goal in mind, here is a simplified version of the main loop found in runBenchmark:

runBenchmark func = do
    start <- getTime
    let loop (iters:niters) prev count acc = do
          -- Measure the function for the given
          -- number of iterations
          (m, endTime) <- measure func iters

          -- Three conditions must be met to end the benchmark
          -- loop:
          -- * The loop has run for at least 5 seconds
          --     (This is a user-settable lower bound)
          -- * At least 300 ms of run time has been measured
          --     (This is 10 times the threshold for valid data)
          -- * At least 4 measurements have been made
          --     (This is a slightly flawed check to make sure the
          --      statistics work)
          if endTime - start >= 5000 &&
             prev + measuredTime m > 300 &&
             count >= 4
            then do
              return (m:acc)
            else
              loop niters (prev+measuredTime m) (count+1) (m:acc)
    loop [1..] 0 0 []

measure func iters =
    -- See /Reducing GC/ Below
    performGC
    startStats <- getRTSStats
    startTime <- getTime
    -- See /Reducing Allocation/ below
    runRepeatedly func iters
    endTime <- getTime
    performGC
    endStats <- getRTSStats

    -- Measure returns a /Measured/ record,
    -- and time at end of measurement.
    return (mkMeasured endStats startStats, endTime)

With that, it becomes much easier to talk about particular optimizations. I made two major changes of interest. When I first started looking into the performance issues, I noticed right away that the main issue was garbage collection. The two solutions to that are reducing allocation and removing memory leaks. In my case, reducing allocation did not help performance much, but it did help correctness. There also was not a memory leak, but there was a large working set. So I had to reduce the impact of garbage collection.