Adding Heap View to GHC

While working on my previous Criterion project, I ran into the question, how big is a thunk? I was attempting to manually verify allocation statistics, but did not know how to count unevaluated expressions. Haskell’s automatic memory management means that most people don’t need to think about these issues. However, once you start trying to tune performance, memory problems start sneaking in.

Before working through the actual GHC source, all of my knowledge came from two wonderful posts. First, this blog post by Johan Tibell walks through a general case of computing the size of a datatype. His rule-of-thumb is both easy and accurate. The second took a bit more looking, but helped me answer my thunk question. It also introduced me to the heap-view package. A few days later on IRC, someone expressed that they hoped D3055 would get finished. So my next project was decided.

To give credit where credit is due, most of this code traces back to a 2006 Summer of Code project attributed to José Iborra López here. The project was the GHCi debugger. A big portion of integrating heap-view into GHC was pulling the duplicated code out of the existing source. There are a number of different rough representations of the heap in the Haskell source of GHC, so perhaps further work could be done to unify access to the heap.

The Haskell Heap

Having covered the background, let’s turn to what heap-view actually does. In short, it shows you what a given value looks like at this moment. Haskell’s non-strict evaluation means that GHC get to decide what order it wants to do work in. While you might think x = 2 + 1 assigns 3 to x, in fact, you might just get a promise that x will be 3 when you try to use it. That promise is a thunk: some unevaluated expression that holds information, but hasn’t yet done anything with it.

I’m going to work through an example where this matters and demonstrate how heap-view can help, but first, you’ll need a primer on the structure of heap objects. If you are familiar with Haskell’s heap, feel free to jump to the example.

The best technical reference for this material is here, in the GHC Commentary. I’m going to gloss over the TABLES_NEXT_TO_CODE and PROFILING differences because they are not particularly pertinent. This will mostly be a compressed version of the information on that page.

Let’s start with the Int type. It may be a little new to you, but there’s no time like the present!

data Int = I# Int#

This type works similarly to any other type you would make. It is given a name Int, and has a constructor I#. The # is a convention that indicates this is lower-level than most code. Typically in Haskell, you work with objects that are “boxed”. That is, you really have a pointer to an object, not the object itself. Int is special because it is built from an “unboxed” value, of type Int#. Int# is built into GHC because it is simply a C int. Just the plain bytes, not an address.

Looking at the wiki page linked above, if scroll down to the “Data Constructors” heading, you will see the layout of a data constructor on the heap. It looks something like this:

Constructor Closure
+---------+----------+--------------+
| *Header | Pointers | Non-pointers |
+---------+----------+--------------+

That is, a single pointer to an InfoTable, which we will get to, an array of pointers, and an array of non-pointers. In the case of our Int type, that would be 0 pointers and a single non-pointer holding our Int#. I# 42# (the # after 42 gets us the plain, non-pointer C int) of type Int would look like so:

+---------++----+
| *Header || 42 |
+---------++----+

That would be: sizeof(void *) + sizeof(int) if you are calculating size.

InfoTables are dependent on the type of value being stored. They are all basically the same size, but read each field differently. Things are packed tightly here, so we are going to look at the first three HalfWords. On 64-bit platforms that would be 3 * 4 = 12 bytes total.

InfoTable (read like a constructor's)
+------+-------+------+
| ptrs | nptrs | type |
+------+-------+------+

These fields are the number of pointers in the closure, the number of non-pointers in the closure, and the type of the closure. For our Int, that would be:

+---+---+------------+
| 0 | 1 | CONSTR_0_1 |
+---+---+------------+

Constructor types encode some common layouts to make garbage collection faster. So now that we know roughly what it should look like, let’s ask heap-view.

When run (edited to look nicer):

ConstrClosure
    { info = StgInfoTable {..., ptrs = 0, nptrs = 1, tipe = CONSTR_0_1, ...}
    , ptrArgs = [], dataArgs = [42]
    , pkg = "ghc-prim", modl = "GHC.Types", name = "I#"
    }

We even get information about where the constructor comes from which is great. And our theory lines up with reality, which is always a relief. Next time, I’ll demonstrate a practical application of heap-view.

Next