Running Batches in Criterion

In the introduction, I glossed over batching by inserting the undefined function runRepeatedly. The measure function actually takes the action to run as an argument. It is simply a data-collecting wrapper around the action. So the question is, where does the action come from?

If you’ve written a benchmark with Criterion before, you’ve made the actions yourself with whnf[IO] and nf[IO]. I’ll work through nf because whnf is just a simplified version.

It is difficult to convince GHC to do work for no reason, which is exactly what the timing loop is asking for. Criterion is compiled with -O2 enabled, so I’ll use that when demonstrating. To be sure that GHC is doing what we think it is, we’ll need to look at the compiled stg output. I’m going to use -ddump-to-file -ddump-stg -dsuppress-all to dump the output to a file and make it easier to read.

Our goal is to create this function:

-- Apply an argument, x, to a function, f, n times, reducing
-- the result to normal form.
nf :: (b -> ()) -> (a -> b) -> a -> Int -> IO ()
nf rnf func x n = undefined

rnf takes the place of Control.DeepSeq.rnf. The real code uses (NFData b) to get the reducing function, but we can supply anything with the right type. We don’t really care about what the function does, just that it is applied. This also removes the dependency, which makes it very easy for you to test these functions for yourself.

Here is a simple setup that should let us test various implementations. I’ve included the final argument after the $ to emphasize that nf doesn’t really use the number of iterations. When you create a benchmark with it, you are not providing the number of iterations; that comes later.

-- Nf.hs
module Main where

-- Use an nf implementation to run
-- @map (^2) [1..10]@ one million times.
main :: IO ()
main = nf (const ()) (map (^2)) [1..10] $ 1000000

nf :: (b -> ()) -> (a -> b) -> a -> Int -> IO ()
nf = undefined

To compile and run with all my options:

# Optimize and dump stg
$ ghc -O2 -ddump-stg -ddump-to-file -dsuppress-all Nf.hs
# Print runtime statistics for analysis
$ ./Nf +RTS -s

A First Try

Our first attempt will be just basic recursion.

-- First attempt
nfFirst :: (b -> ()) -> (a -> b) -> a -> Int -> IO ()
nfFirst rnf func x n
    | n <= 0    = return ()
    | otherwise = let y = rnf (func x)
                  in nfFirst rnf func x (n-1)

If you haven’t looked at stg output before, it looks vaguely like Haskell. The most important rule for our analysis is that let allocates a thunk, and case performs evaluation. Aside from that, you mostly just need to do a little work to track where all the names come from. Here’s our output, trimmed for space.

$wnfFirst =
    \r [ww_s4yk void_0E]
        case <=# [ww_s4yk 0#] of sat_s4ym {
          __DEFAULT ->
              case tagToEnum# [sat_s4ym] of {
                False ->
                    case -# [ww_s4yk 1#] of sat_s4yo {
                      __DEFAULT -> $wnfFirst sat_s4yo void#;
                    };
                True -> Unit# [()];
              };
        };

nfFirst1 =
    \r [w_s4yp w1_s4yq w2_s4yr w3_s4ys void_0E]
        case w3_s4ys of { I# ww1_s4yv -> $wnfFirst ww1_s4yv void#; };

nfFirst =
    \r [eta_B5 eta_B4 eta_B3 eta_B2 void_0E]
        nfFirst1 eta_B5 eta_B4 eta_B3 eta_B2 void#;

Our function nfFirst is passed four variables and the IO token. You can mostly ignore the IO related stuff, but this post has some good information if you haven’t seen it before. We are benchmarking pure functions, so do not actually use the IO state. The etas are, in order, rnf func x n.

nfFirst passes its arguments on to nfFirst1, which unboxes n. Finally, nfFirst1 calls $wnfFirst which does the work of the function. Unfortunately, nfFirst1 drops rnf func x. Only the Int is used by $wnfFirst. This is bad news because it means our function can not possibly be evaluating func x. Just to confirm, we can look at the body of the worker loop.

First, it checks if n is less than or equal to 0. Then it converts the result to a Bool. Finally, it either subtracts 1 from n and loops, or terminates.

GHC foiled our first attempt. It realized that y was never used, so removed it. We will need to convince GHC that y should be evaluated even though we don’t use it.

If we run Nf, we can double check that nothing is happening. This is the same result as main = return ().

        41,008 bytes allocated in the heap
         3,312 bytes copied during GC
        36,184 bytes maximum residency (1 sample(s))
        21,160 bytes maximum slop
             2 MB total memory in use (0 MB lost due to fragmentation)

                                   Tot time (elapsed)  Avg pause  Max pause
Gen  0         0 colls,     0 par    0.000s   0.000s     0.0000s    0.0000s
Gen  1         1 colls,     0 par    0.000s   0.000s     0.0000s    0.0000s

INIT    time    0.000s  (  0.000s elapsed)
MUT     time    0.000s  (  0.001s elapsed)
GC      time    0.000s  (  0.000s elapsed)
EXIT    time    0.000s  (  0.000s elapsed)
Total   time    0.000s  (  0.002s elapsed)

%GC     time       0.0%  (0.0% elapsed)

Alloc rate    0 bytes per MUT second

Productivity 100.0% of total user, 89.8% of total elapsed

Adding Strictness

We can trick GHC by adding strictness to our function. One way of doing this without too much work is the BangPatterns extension. All we have to do is turn it on and add one !.

{-# LANGUAGE BangPatterns #-}
module Main where

...

-- Add Strictness
nfBang :: (b -> ()) -> (a -> b) -> a -> Int -> IO ()
nfBang rnf func x n
    | n <= 0    = return ()
    | otherwise = let !y = rnf (func x)
                  in nfBang rnf func x (n-1)

Comparing the output, we now have:

$wnfBang =
    \r [w_s4zQ w1_s4zR w2_s4zS ww_s4zT void_0E]
        case <=# [ww_s4zT 0#] of sat_s4zV {
          __DEFAULT ->
              case tagToEnum# [sat_s4zV] of {
                False ->
                    let { sat_s4zX = \u [] w1_s4zR w2_s4zS;
                    } in
                      case w_s4zQ sat_s4zX of {
                        () ->
                            case -# [ww_s4zT 1#] of sat_s4zZ {
                              __DEFAULT -> $wnfBang w_s4zQ w1_s4zR w2_s4zS sat_s4zZ void#;
                            };
                      };
                True -> Unit# [()];
              };
        };

nfBang1 =
    \r [w_s4A0 w1_s4A1 w2_s4A2 w3_s4A3 void_0E]
        case w3_s4A3 of {
          I# ww1_s4A6 -> $wnfBang w_s4A0 w1_s4A1 w2_s4A2 ww1_s4A6 void#;
        };

nfBang =
    \r [eta_B5 eta_B4 eta_B3 eta_B2 void_0E]
        nfBang1 eta_B5 eta_B4 eta_B3 eta_B2 void#;

$wnfBang now accepts all the arguments, which is promising. If we look further down, there is only one sticking point. We introduced a let. If you follow the names, you’ll see that func x is made into a thunk before rnf (func x) is reduced. We will need to make sure func x is evaluated first.

But there is a bigger problem. If we run the code with nfBang in our main, nothing happens.

        41,008 bytes allocated in the heap
         3,312 bytes copied during GC
        36,184 bytes maximum residency (1 sample(s))
        21,160 bytes maximum slop
             2 MB total memory in use (0 MB lost due to fragmentation)

                                   Tot time (elapsed)  Avg pause  Max pause
Gen  0         0 colls,     0 par    0.000s   0.000s     0.0000s    0.0000s
Gen  1         1 colls,     0 par    0.000s   0.000s     0.0000s    0.0000s

INIT    time    0.000s  (  0.000s elapsed)
MUT     time    0.000s  (  0.001s elapsed)
GC      time    0.000s  (  0.000s elapsed)
EXIT    time    0.000s  (  0.000s elapsed)
Total   time    0.000s  (  0.001s elapsed)

%GC     time       0.0%  (0.0% elapsed)

Alloc rate    0 bytes per MUT second

Productivity 100.0% of total user, 90.4% of total elapsed

Even though our code changed, the runtime didn’t. You can see why if you look deeper into the stg. nfBang was inlined into main. After that, GHC gets another chance to optimize away func x. So main is again turned into just a counting loop. To combat this, we can mark nfBang as NOINLINE. This prevents GHC from changing our carefully written function. It also means that nfBang is not affected by the user’s optimization level, which is a win.

nfBang :: ...
nfBang rnf func x = ...
{-# NOINLINE nfBang #-}

    48,041,008 bytes allocated in the heap
         7,272 bytes copied during GC
        36,184 bytes maximum residency (1 sample(s))
        29,352 bytes maximum slop
             2 MB total memory in use (0 MB lost due to fragmentation)

                                   Tot time (elapsed)  Avg pause  Max pause
Gen  0        45 colls,     0 par    0.000s   0.000s     0.0000s    0.0001s
Gen  1         1 colls,     0 par    0.000s   0.000s     0.0000s    0.0000s

INIT    time    0.000s  (  0.000s elapsed)
MUT     time    0.012s  (  0.015s elapsed)
GC      time    0.000s  (  0.000s elapsed)
EXIT    time    0.000s  (  0.000s elapsed)
Total   time    0.012s  (  0.016s elapsed)

%GC     time       0.0%  (1.8% elapsed)

Alloc rate    4,003,417,333 bytes per MUT second

Productivity 100.0% of total user, 96.5% of total elapsed

This also affected nfFirst, so you might want to mark that NOINLINE as well to see the difference.

More Strictness

We can remove the func x thunk by applying a bit more strictness. This time we will use seq to make it look a little nicer.

nfStricter :: (b -> ()) -> (a -> b) -> a -> Int -> IO ()
nfStricter rnf func x n
    | n <= 0    = return ()
    | otherwise = let !y = func x
                  in rnf y `seq` nfStricter rnf func x (n-1)
{-# NOINLINE nfStricter #-}

If we look at the output, we squashed one issue and created another.

nfStricter1 =
    \r [w_s4z3 w1_s4z4 w2_s4z5 w3_s4z6 void_0E]
        case w3_s4z6 of {
          I# ww1_s4z9 -> $wnfStricter w_s4z3 w1_s4z4 w2_s4z5 ww1_s4z9 void#;
        };
nfStricter =
    \r [eta_B5 eta_B4 eta_B3 eta_B2 void_0E]
        nfStricter1 eta_B5 eta_B4 eta_B3 eta_B2 void#;
$wnfStricter =
    \r [w_s4za w1_s4zb w2_s4zc ww_s4zd void_0E]
        case <=# [ww_s4zd 0#] of sat_s4zf {
          __DEFAULT ->
              case tagToEnum# [sat_s4zf] of {
                False ->
                    case w1_s4zb w2_s4zc of y_s4zh {
                      __DEFAULT ->
                          case w_s4za y_s4zh of {
                            () ->
                                case -# [ww_s4zd 1#] of sat_s4zj {
                                  __DEFAULT ->
                                      let { sat_s4zk = NO_CCS I#! [sat_s4zj];
                                      } in  nfStricter w_s4za w1_s4zb w2_s4zc sat_s4zk void#;
                                };
                          };
                    };
                True -> Unit# [()];
              };
        };

We are now properly evaluating rnf (func x), but we introduced another let right before the recursive call. Because our function is marked NOINLINE, it can’t inline into itself. So our function is wasting time boxing n, only to unbox it on the next iteration.

We are also passing rnf, func, and x to the function every time, even though we don’t have to. We have one (and a half) more trick to tighten this loop.

Final Form

Instead of making nf recursive, we can create a function that just takes n as input. This is what I was hinting at in the beginning. We are really interested in taking rnf, func, and x, and returning a function that runs those n times. Here is the final form of nf:

nfFinal :: (b -> ()) -> (a -> b) -> a -> (Int -> IO ())
nfFinal rnf func x = go
  where
    go n | n <= 0    = return ()
         | otherwise = let !y = func x
                       in rnf y `seq` go (n-1)
{-# NOINLINE nfFinal #-}

This results in output that is incredibly close to what we want, but falls just short.

$wnfFinal =
  \r [w_s4zl w1_s4zm w2_s4zn ww_s4zo void_0E]
    case <=# [ww_s4zo 0#] of sat_s4zq {
      __DEFAULT ->
        case tagToEnum# [sat_s4zq] of {
          False ->
            case w1_s4zm w2_s4zn of y_s4zs {
              __DEFAULT ->
                case w_s4zl y_s4zs of {
                  () ->
                    case -# [ww_s4zo 1#] of sat_s4zA {
                      __DEFAULT ->
                        let-no-escape {
                          $wgo_s4zu =
                            sat-only \r [ww1_s4zv void_0E]
                              case <=# [ww1_s4zv 0#] of sat_s4zx {
                                __DEFAULT ->
                                  case tagToEnum# [sat_s4zx] of {
                                    False ->
                                      case -# [ww1_s4zv 1#] of sat_s4zz {
                                        __DEFAULT -> $wgo_s4zu sat_s4zz void#;
                                      };
                                    True -> Unit# [()];
                                  };
                              };
                        } in  $wgo_s4zu sat_s4zA void#;
                    };
                    };
                  };

This loop performs perfectly well the first time, but then enters a tight counting loop. The let-no-escape ensures our counter stays in a register so it doesn’t have to keep passing it as an argument.

The only way I found to defeat this under -O2 was to use -fno-full-laziness. This leads to the final output:

nfFinal1 =
  \r [rnf_s4yr func_s4ys x_s4yt eta_s4yu void_0E]
    case eta_s4yu of {
      I# ww1_s4yx ->
        let-no-escape {
          $wgo_s4yy =
            sat-only \r [ww2_s4yz void_0E]
              case <=# [ww2_s4yz 0#] of sat_s4yB {
                __DEFAULT ->
                  case tagToEnum# [sat_s4yB] of {
                    False ->
                      case func_s4ys x_s4yt of y_s4yD {
                        __DEFAULT ->
                          case rnf_s4yr y_s4yD of {
                            () ->
                              case -# [ww2_s4yz 1#] of sat_s4yF {
                                __DEFAULT -> $wgo_s4yy sat_s4yF void#;
                              };
                          };
                      };
                    True -> Unit# [()];
                  };
              };
        } in  $wgo_s4yy ww1_s4yx void#;
    };

That is the best implementation that I know how to write. After all that, Criterion is slightly faster and slightly more accurate. The changes I’m making will all work together to make visible improvements, so I’ll save the summary for the last post. The outcome of these optimizations won’t be shocking because Criterion was a top-rate library before I did anything, but the improvements are nothing to sneeze at.

You can see the original discussion here.

Previous Next