Graphs and Structural Equivalence

We’d like to rebuild this bit of R code in Java so that we can integrate it into a Maven plugin.

So we have three pieces to work on:

Let’s start chipping away.

GraphViz dot Files

The “dot” format is pretty simple,

This declares a “directed graph”, or digraph. It first declares that there is a node called a. And then declares that there is an edge from node b to node c. The format also allows for some additional properties and structure to be included, but we only care about nodes and the edges between them.

The code for this is occurs in two passes. First, it scans the dot file to find all of the node names. Then, it creates a two dimensional array where array[a][b] is 1 if node a depends on node b, otherwise 0.

Source code

Distance Matrix

Now that we have our dependency data in our program, we can start manipulating it.

Currently our data represents whether or not class a depends on class b. But the clustering algorithm wants input data that represents a “distance” or “difference” between two classes. In particular, the input data for the clustering algorithm should be “symmetrical”.

So based on our dependency data, we need to create a new matrix where data[a][b] == data[b][a] and it represents how different class a and b are.

We are going to use the Hamming distance because it is trivial to compute and fits our needs nicely. For every pair of rows in the distance matrix, we simply count the number of cells where they are different. For example, given this input matrix,

    a  b  c
a  {1, 1, 0}
b  {0, 1, 0}
c  {0, 0, 1}

We can calculate the distance between a and b like this,

a  {1, 1, 0}
b  {0, 1, 0}
    1 +0 +0 = 1

The values match in two of the three columns, so the distance between a and b is 1.

Applying this to all pairs, the resulting distance matrix would be,

    a  b  c
a  {0, 1, 3}
b  {1, 0, 2}
c  {3, 2, 0}

Note that this matrix is symmetrical. You can remove everything either below or above the diagonal and you still have all the information,

    a  b  c
a  {0, 1, 3}
b  {   0, 2}
c  {      0}

Source code

equiv.clust

This is where all of the interesting work occurs. The actual algorithm comes from this paper, but the general strategy is that we find the two closest nodes, merge them into a single new node, recalculate the distances for this new node, and repeat until we have just one node.

This algorithm builds up a binary tree from the bottom. So every merge is just combining two nodes under a parent node and recording the difference between them.

Given our distance matrix,

    a  b  c
a  {0, 1, 3}
b  {   0, 2}
c  {      0}

The closest pair is a and b, with a distance of 1. This results in a new node, n1 = (a, b, 1). Now the closest pair – because it is the only pair – is n1 and c. The distance is calculated as the maximum distance between c and all of the nodes in n1. In this case, the max is between c and a, and so the distance is 3. This gives us our final node n2 = (c, n1, 3).

Visually, our tree would look like this,

3
├── 1
│   ├── a
│   └── b
└── c

Source code

Writing Back to the Filesystem

Finally we can write this binary tree back to the filesystem. This is a simple function that creates a directory if the node has children, or copies the file from the existing source if it is one of the original classes.

There is a bit of additional functionality to avoid deleting non-Java files, but the main strategy is to write the files to a temp folder, then delete the existing source files, and move the temp folder into position.

Source code

Previous Next