Results of Optimizing Criterion
My two changes produced two big results. The GC change will probably have the bigger impact, so let's look at that first.
Time
A reasonably large and well-known benchmark is the dictionary comparison from the Haskell performance group on GitHub: dictionaries.
I recently ran it on my computer using ghc-8.2.1 and criterion-1.3.0.0, the latest release. There are about 200 benchmarks in the time executable. Using the built-in lower bound of 5 seconds per benchmark (200 × 5 s ≈ 1000 s), the total run time can be estimated at 16-17 minutes. When I ran it, I got this:
895,066,273,808 bytes allocated in the heap
1,472,538,253,152 bytes copied during GC
306,722,520 bytes maximum residency (75761 sample(s))
43,151,456 bytes maximum slop
871 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 842746 colls, 0 par 431.068s 432.023s 0.0005s 0.5117s
Gen 1 75761 colls, 0 par 1465.572s 1467.816s 0.0194s 0.7871s
INIT time 0.000s ( 0.000s elapsed)
MUT time 680.000s ( 681.203s elapsed)
GC time 1896.640s (1899.839s elapsed)
EXIT time 0.000s ( 0.005s elapsed)
Total time 2576.640s (2581.048s elapsed)
%GC time 73.6% (73.6% elapsed)
Alloc rate 1,316,273,932 bytes per MUT second
Productivity 26.4% of total user, 26.4% of total elapsed
The mutator time is around 11 minutes, which is somewhat less than estimated. But the garbage-collection time is a gobsmacking 30-plus minutes. With verbosity on, you can see that some benchmarks take much longer than their 5 seconds. The worst offenders all follow a similar pattern: they are fast functions with large working sets, presumably because every forced major collection has to copy that large live set, and fast functions go through many more measurement batches. (The output has been trimmed for space.)
benchmarking Lookup Int (Randomized)/Data.Map.Lazy:1000000
measurement took 140.5 s
time 79.09 ns (77.11 ns .. 81.08 ns)
benchmarking Lookup Int (Randomized)/Data.Map.Strict:1000000
measurement took 147.8 s
time 83.36 ns (82.48 ns .. 84.30 ns)
benchmarking Lookup Int (Randomized)/Data.HashMap.Lazy:1000000
measurement took 138.6 s
time 46.87 ns (46.67 ns .. 47.11 ns)
benchmarking Lookup Int (Randomized)/Data.HashMap.Strict:1000000
measurement took 138.5 s
time 47.30 ns (46.82 ns .. 47.91 ns)
benchmarking Lookup Int (Randomized)/Data.IntMap.Lazy:1000000
measurement took 136.6 s
time 78.17 ns (76.50 ns .. 80.38 ns)
benchmarking Lookup Int (Randomized)/Data.IntMap.Strict:1000000
measurement took 136.8 s
time 77.81 ns (76.39 ns .. 79.77 ns)
This is what sent me down this road in the first place, so let’s see how it runs after my optimizations.
971,043,931,776 bytes allocated in the heap
88,686,348,856 bytes copied during GC
306,721,800 bytes maximum residency (5293 sample(s))
10,359,784 bytes maximum slop
872 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 983696 colls, 0 par 106.552s 107.824s 0.0001s 0.8340s
Gen 1 5293 colls, 0 par 1.784s 1.800s 0.0003s 0.0328s
INIT time 0.000s ( 0.000s elapsed)
MUT time 949.712s (951.470s elapsed)
GC time 108.336s (109.624s elapsed)
EXIT time 0.000s ( 0.006s elapsed)
Total time 1058.048s (1061.100s elapsed)
%GC time 10.2% (10.3% elapsed)
Alloc rate 1,022,461,474 bytes per MUT second
Productivity 89.8% of total user, 89.7% of total elapsed
Much better. The mutator time is now up to just under 16 minutes, and the GC time is down to under two minutes. Productivity is now roughly 90% instead of 26%. Finally, my computer did not have to copy nearly 1.5 TB during GC, which is a win.
And just to be sure, the results are still in line with the previous ones:
benchmarking Lookup Int (Randomized)/Data.Map.Lazy:1000000
time 79.75 ns (78.30 ns .. 81.66 ns)
benchmarking Lookup Int (Randomized)/Data.Map.Strict:1000000
time 77.79 ns (77.36 ns .. 78.37 ns)
benchmarking Lookup Int (Randomized)/Data.HashMap.Lazy:1000000
time 45.12 ns (44.96 ns .. 45.38 ns)
benchmarking Lookup Int (Randomized)/Data.HashMap.Strict:1000000
time 45.20 ns (44.91 ns .. 45.49 ns)
benchmarking Lookup Int (Randomized)/Data.IntMap.Lazy:1000000
time 80.78 ns (76.73 ns .. 85.77 ns)
benchmarking Lookup Int (Randomized)/Data.IntMap.Strict:1000000
time 73.59 ns (73.25 ns .. 74.25 ns)
Everyone should now be able to estimate fairly accurately how long their benchmark program will take. The basic rule of thumb is 5 s per call to bench; if a single call of the benchmarked function takes longer than ~500 ms, though, you have to multiply its per-call time by 15 instead.
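As a back-of-the-envelope sketch of that rule, here is a tiny helper I might write; the name and my reading of the 15x factor are my own, not anything criterion provides:
-- Rough wall-clock estimate, in seconds, for a single call to bench,
-- following the rule of thumb above.
estimateBenchSeconds :: Double -> Double
estimateBenchSeconds perCallSeconds
  | perCallSeconds > 0.5 = 15 * perCallSeconds  -- slow functions: ~15x the per-call time
  | otherwise            = 5                    -- fast functions: the 5 s lower bound
For the ~80 ns lookups above this gives the usual 5 s; a hypothetical 2 s function would come out at about 30 s.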
Allocation
The second major change is that the allocation numbers are now more accurate. We can use this page to work out the expected allocations by hand. Here is a benchmark we can use to test:
-- Alloc.hs
module Main where

import Criterion.Main
import Data.Char (chr)

main :: IO ()
main = defaultMain
  [ bgroup "primitives"
      [ bench "Bool" $ whnf (toEnum :: Int -> Bool) 0
        -- ^ Should be 0 words; True and False are shared in memory
      , bench "Char" $ whnf chr 0
        -- ^ Should be 2 words, a single Char
      ]
  ]
Criterion lets you collect and analyze more data than just time. We can look at allocations with --regress allocated:iters, which reports how much is allocated in a single iteration. We also need to turn on RTS statistics collection with +RTS -T.
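For completeness, I would build the program along these lines (my own invocation, not from the original; -rtsopts just ensures the RTS flags below are accepted):
$ ghc -O2 -rtsopts Alloc.hs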
$ ./Alloc --regress allocated:iters +RTS -T
benchmarking primitives/Bool
iters 1.165e-7 (-1.831e-5 .. 2.058e-5)
benchmarking primitives/Char
iters 16.000 (16.000 .. 16.000)
The value after iters is the number of bytes allocated per iteration. If we multiply our guesses by 8 (because a word is 8 bytes on my machine), we get 0 bytes for Bool and 16 bytes for Char, which matches. Here's the output from before my allocation change, but after a garbage-collection change that had already improved the results greatly.
benchmarking primitives/Bool
iters 32.000 (32.000 .. 32.000)
benchmarking primitives/Char
iters 48.000 (48.000 .. 48.000)
So I managed to save 32 bytes per iteration, which not only makes the results cleaner but also avoids pointless allocation. For reference, the Char benchmark currently runs for ~500 million iterations in total, so that comes out to roughly 16 GB of saved allocation (32 bytes × 5 × 10⁸ iterations).
It turns out that the vast majority of the improvement came from switching to performMinorGC instead of performGC, but I think the other changes were worth it. When the number-one suggestion for reasoning about Haskell code is "benchmark it", your benchmarks had better be pretty darn accurate.