Results of Optimizing Criterion
My two changes produced two big results. The GC change will probably have the bigger impact, so let's look at that first.
Time
A reasonably large and well-known benchmark is the dictionary comparison from the Haskell performance group on GitHub: dictionaries.
I recently ran it on my computer using ghc-8.2.1 and criterion-1.3.0.0, the latest release. There are about 200 benchmarks in the time executable. Using the built-in lower bound of 5 seconds per benchmark (200 × 5 s ≈ 1000 s), the total run time can be estimated at 16-17 minutes. When I ran it, I got this:
895,066,273,808 bytes allocated in the heap
1,472,538,253,152 bytes copied during GC
306,722,520 bytes maximum residency (75761 sample(s))
43,151,456 bytes maximum slop
871 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 842746 colls, 0 par 431.068s 432.023s 0.0005s 0.5117s
Gen 1 75761 colls, 0 par 1465.572s 1467.816s 0.0194s 0.7871s
INIT time 0.000s ( 0.000s elapsed)
MUT time 680.000s ( 681.203s elapsed)
GC time 1896.640s (1899.839s elapsed)
EXIT time 0.000s ( 0.005s elapsed)
Total time 2576.640s (2581.048s elapsed)
%GC time 73.6% (73.6% elapsed)
Alloc rate 1,316,273,932 bytes per MUT second
Productivity 26.4% of total user, 26.4% of total elapsed
The mutator time is around 11 minutes, which is somewhat less than estimated. But the garbage-collection time is a gobsmacking 30-plus minutes. With verbosity on, you can see that some benchmarks take much longer than their 5 seconds. The worst offenders all follow a similar pattern: they are fast functions with large working sets, presumably because every forced major collection has to copy that large live set, and fast functions go through many more measurement batches. (The output has been trimmed for space.)
benchmarking Lookup Int (Randomized)/Data.Map.Lazy:1000000
measurement took 140.5 s
time 79.09 ns (77.11 ns .. 81.08 ns)
benchmarking Lookup Int (Randomized)/Data.Map.Strict:1000000
measurement took 147.8 s
time 83.36 ns (82.48 ns .. 84.30 ns)
benchmarking Lookup Int (Randomized)/Data.HashMap.Lazy:1000000
measurement took 138.6 s
time 46.87 ns (46.67 ns .. 47.11 ns)
benchmarking Lookup Int (Randomized)/Data.HashMap.Strict:1000000
measurement took 138.5 s
time 47.30 ns (46.82 ns .. 47.91 ns)
benchmarking Lookup Int (Randomized)/Data.IntMap.Lazy:1000000
measurement took 136.6 s
time 78.17 ns (76.50 ns .. 80.38 ns)
benchmarking Lookup Int (Randomized)/Data.IntMap.Strict:1000000
measurement took 136.8 s
time 77.81 ns (76.39 ns .. 79.77 ns)
This is what sent me down this road in the first place, so let’s see how it runs after my optimizations.
971,043,931,776 bytes allocated in the heap
88,686,348,856 bytes copied during GC
306,721,800 bytes maximum residency (5293 sample(s))
10,359,784 bytes maximum slop
872 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 983696 colls, 0 par 106.552s 107.824s 0.0001s 0.8340s
Gen 1 5293 colls, 0 par 1.784s 1.800s 0.0003s 0.0328s
INIT time 0.000s ( 0.000s elapsed)
MUT time 949.712s (951.470s elapsed)
GC time 108.336s (109.624s elapsed)
EXIT time 0.000s ( 0.006s elapsed)
Total time 1058.048s (1061.100s elapsed)
%GC time 10.2% (10.3% elapsed)
Alloc rate 1,022,461,474 bytes per MUT second
Productivity 89.8% of total user, 89.7% of total elapsed
Much better. The mutator time is now up to just under 16 minutes, and the GC time is down to under two minutes. Productivity is now roughly 90% instead of 26%. Finally, my computer did not have to copy nearly 1.5 TB during GC, which is a win.
And just to be sure, the results are still in line with the previous ones:
benchmarking Lookup Int (Randomized)/Data.Map.Lazy:1000000
time 79.75 ns (78.30 ns .. 81.66 ns)
benchmarking Lookup Int (Randomized)/Data.Map.Strict:1000000
time 77.79 ns (77.36 ns .. 78.37 ns)
benchmarking Lookup Int (Randomized)/Data.HashMap.Lazy:1000000
time 45.12 ns (44.96 ns .. 45.38 ns)
benchmarking Lookup Int (Randomized)/Data.HashMap.Strict:1000000
time 45.20 ns (44.91 ns .. 45.49 ns)
benchmarking Lookup Int (Randomized)/Data.IntMap.Lazy:1000000
time 80.78 ns (76.73 ns .. 85.77 ns)
benchmarking Lookup Int (Randomized)/Data.IntMap.Strict:1000000
time 73.59 ns (73.25 ns .. 74.25 ns)
Everyone should now be able to estimate fairly accurately how long their benchmark program will take. The basic rule of thumb is 5 s per call to bench; if a single call of the benchmarked function takes longer than ~500 ms, though, you have to multiply its per-call time by 15 instead.
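As a back-of-the-envelope sketch of that rule, here is a tiny helper I might write; the name and my reading of the 15x factor are my own, not anything criterion provides:
-- Rough wall-clock estimate, in seconds, for a single call to bench,
-- following the rule of thumb above.
estimateBenchSeconds :: Double -> Double
estimateBenchSeconds perCallSeconds
  | perCallSeconds > 0.5 = 15 * perCallSeconds  -- slow functions: ~15x the per-call time
  | otherwise            = 5                    -- fast functions: the 5 s lower bound
For the ~80 ns lookups above this gives the usual 5 s; a hypothetical 2 s function would come out at about 30 s.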
Allocation
The second major change is that the allocation numbers are now more accurate. We can use this page to work out the expected allocations by hand. Here is a benchmark we can use to test:
-- Alloc.hs
module Main where

import Criterion.Main
import Data.Char (chr)

main :: IO ()
main = defaultMain
  [ bgroup "primitives"
      [ bench "Bool" $ whnf (toEnum :: Int -> Bool) 0
        -- ^ Should be 0 words; True and False are shared in memory
      , bench "Char" $ whnf chr 0
        -- ^ Should be 2 words, a single Char
      ]
  ]
Criterion lets you collect and analyze more data than just time. We can look at allocations with --regress allocated:iters, which reports how much is allocated in a single iteration. We also need to turn on RTS statistics collection with +RTS -T.
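For completeness, I would build the program along these lines (my own invocation, not from the original; -rtsopts just ensures the RTS flags below are accepted):
$ ghc -O2 -rtsopts Alloc.hs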
$ ./Alloc --regress allocated:iters +RTS -T
benchmarking primitives/Bool
iters 1.165e-7 (-1.831e-5 .. 2.058e-5)
benchmarking primitives/Char
iters 16.000 (16.000 .. 16.000)
The value after iters is the number of bytes allocated per iteration. If we multiply our guesses by 8 (because a word is 8 bytes on my machine), we get 0 bytes for Bool and 16 bytes for Char, which matches. Here's the output from before my allocation change, but after a garbage-collection change that had already improved the results greatly.
benchmarking primitives/Bool
iters 32.000 (32.000 .. 32.000)
benchmarking primitives/Char
iters 48.000 (48.000 .. 48.000)
So I managed to save 32 bytes per iteration, which not only makes the results cleaner but also avoids pointless allocation. For reference, the Char benchmark currently runs for ~500 million iterations in total, so that comes out to roughly 16 GB of saved allocation (32 bytes × 5 × 10⁸ iterations).
It turns out that the vast majority of the improvement came from switching to performMinorGC instead of performGC, but I think the other changes were worth it. When the number-one suggestion for reasoning about Haskell code is "benchmark it", your benchmarks had better be pretty darn accurate.