Software! Math! Data! The blog of R. Sean Bowman
The blog of R. Sean Bowman
February 09 2016

I’d like to introduce an algorithm for finding clusters in graphs. This is an important problem in machine learning, in part because so many problems can be phrased as finding “nice” clusters in a given graph, where the meaning of "nice" tends to depend on the particular application. It’s easy to imagine simple applications of a good clustering algorithm, like finding groups of friends in a social network. But things only get more interesting from there.

For example, we may view an image as a graph whose vertices are pixels and edges connect adjacent pixels. We assign a weight to those edges depending on how different they are: a black pixel next to a white one has a low similarity, whereas two adjacent grey pixels have high similarity. In this case, clusters represent segments of the image, and these segments are often visually meaningful. The commonly cited reference for this type of clustering is Normalized Cuts and Image Segmentation.

The TILO Algorithm

The point is that graph clustering is a fascinating problem that has lots of real world applications. My colleague Jesse Johnson developed a novel algorithm for graph clustering in the paper Topological graph clustering with thin position, which, as you can tell from the title, uses ideas from topology. The roots of the idea come from the technique of thin position in \(3\)–manifold topology. You don’t need to understand any of this to see how the algorithm works on a basic level, but note that its heritage places it firmly in the field of topological data analysis.

Jesse together with our colleague Doug Heisterkamp developed this into a workable algorithm, and later I joined in and did some work with it. The algorithm itself came to be called TILO, for Topologically Intrinsic Lexicographical Ordering, and a companion criterion for selecting clusters is called PRC for Pinch Ratio Clusters. Jesse has a blog post on this at TILO/PRC clustering, and he and Doug wrote a paper called Pinch Ratio Clustering from a Topologically Intrinsic Lexicographic Ordering which I can’t seem to find a good link to.

A Visualization

I’d like to give a gentle introduction to the ideas using some interactive visualizations, so let’s start out with a simple graph and see what we can do. Unfortunately, these visualizations don’t seem to work very well on mobile devices like phones and tablets. I’m working on that, but for now you’ll probably get the best results with a desktop computer.

Below is a graph shown in two different ways. The top view is a more traditional view of the graph as nodes and vertices. Beneath that is a different view where we’ve assigned an (arbitrary) order to the vertices, put them in order, and drawn the edges between them a curved lines.

By hovering the mouse over edges, you can see which edges on the top graph correspond to ones on the bottom graph, and vice versa. Same for vertices. Take a moment to convince yourself that the two graphs are the same, and check out how the vertices and edges correspond.

The idea of TILO is to start with some ordering of vertices and iteratively change it in ways that “simplify” the bottom graph. Here, simplification means that there aren’t too many edges that cross an imaginary line between any two vertices of the lower graph. You don’t have to use your imagination here, though, because I’ve put light vertical lines between each of the ordered vertices in the next graph, below. Notice that at the bottom of these lines are numbers, and the numbers just tell us how many graph edges cross that particular line.

You can drag vertices in the lower graph to reorder them; notice how the numbers change when you do so.

The numbers we’ve been talking about are called the boundaries of the set of vertices to the left and right of the vertical line. When you change the ordering of vertices (by dragging them around), you potentially change the boundaries. The set of boundaries of a graph with a given ordering on its vertices, arranged in nonincreasing order, is called the width of the ordering, and the goal of TILO is to reduce the width of the ordering as much as possible.

What about clusters?

At this point, notice that when you hover the mouse over a vertex in the bottom graph, the vertices divide themselves into purple ones (to the left of the vertex you’ve hovered over) and green ones (to the right, including the one you’re hovering over). You see this also reflected in the top graph.

And this is the point of TILO: when the width of the ordering has been reduced as much as possible, there is often a local minimum among the boundaries. Hovering the mouse on the vertex just to the right of this local minimum gives a division of the top graph into two sets (purple and green), and this is the clustering we seek. Jesse calls this a “pinch cluster” because moving any single vertex across the division increases its boundary; the picture is like an old coke bottle with a thin waist between two portions of larger radius. If we were to move, say, a rubber band across the coke bottle, we’d have to stretch it in order to move it either way once we arrive at the thin portion.

The actual definition of “pinch cluster” is a bit too technical to repeat here; you can read about it in the paper above. The point, though, is that pinch clusters have “nice” properties and often correspond to graph clusters that are well behaved and have real world significance.

Back to the interactive pictures: go back to the previous graph and try to find an arrangement of the nodes so that the boundaries read "3, 3, 2, 3, 3." Note that this has a local minimum in the middle (2), and when you hover your mouse over the vertex just to the right of the 2, you should see a clustering that “makes sense” for the given graph. (With a graph so small, reasonable people can disagree on what a good clustering looks like. That’s okay; we’ll just be happy with the one we arrive at for now.)

Note that this is the example graph from Jesse’s paper, thus making this an interactive version of that example.

Of course, things get trickier with more complicated graphs; below is another one with more vertices and edges. Can you find moves that reduce the boundaries? Can you find a configuration for which there is a local minimum in the boundaries? (Some trial and error shows that it isn’t too difficult, but… sometimes it takes me a while to figure it out!)

Of course, the whole point of Jesse’s work is to be able to automatically find sequences of moves that reduce the width. I mean, it’s an algorithm after all, it wouldn’t be all that interesting otherwise. Once you’ve struggled a bit with the previous graph, take a look at the one below. It’s the same graph, but this time arrows beneath the boundary numbers tell you how and where to move vertices. Grab the vertex that lies above the tail of the arrow and move it to the spot directly above the head of the arrow. Repeat until the arrow disappears. In most cases, you’ll arrive at an ordering with a local minimum among its boundaries. Check out the clustering corresponding to this ordering. Does it match your intuition about what clusters in the given graph might look like?

Can you find the local minimum? (You should be able to; the arrows tell you exactly how to move the nodes.) Next, if you arrive at an ordering of the vertices that has a local minimum, what do the clusters look like? Remember: when you’ve moved the vertices around so that the red arrow at the bottom disappears, look for a local minimum in the sequence of boundaries. Hover your mouse over the vertex just to the right of the vertical line corresponding to this local minimum. Some vertices (of both graphs) will turn green, some purple. This is the clustering that the TILO algorithm has arrived at. If everything went smoothly, the clusters you find should match up with your idea of what clusters in the original graph might look like.

Some notes

This implementation is not perfect; sometimes it gets stuck/messes up/does weird stuff. In this case, you might not find a local minimum boundary. The best thing to do, I suppose, is reload the page and give it another shot. Any errors in the program are, of course, mine, and do not reflect some inherent problem in the TILO algorithm.

The TILO algorithm is an interesting one for many reasons. First of all, there’s no reason it should be efficient. Enumerating partitions of vertices of a graph is clearly exponential, and the TILO algorithm doesn’t actually address that. As far as is known, its worst case running time (heck, average case!) is super-duper exponential (a technical term, surely). However, the algorithm works quite well in practice, with C++ code written by Doug having been used on graphs with up to around a million vertices. It’s entirely conceivable that this number could be made quite larger. Plus, other variants (like Thin tree position) may be more amenable to parallel implementation.

Second, this is the first algorithm that I know of from topological data analysis whose roots are in low dimensional topology, as opposed to persistent homology whose roots lie in homotopy theory and other algebro-topological fields. The fact that its origins are different makes it interesting per se, but notice that nothing we did required the use of two or three (or four) dimensions. Clearly the technique of thin position which is the genesis of this algorithm captures something more universal about the graph. In other words, the algorithm is inspired by methods from \(3\)–manifolds, but applicable to data residing in an ambient space of arbitrary dimension.

The visualization itself is written in javascript using my new favorite visualization library d3. The code is available on github, but be warned: you may need (as my colleagues sometimes say) some eye bleach if you intend to look at it. I would love to rewrite it, perhaps using cyclejs or mobx… If only there were so many hours in the day.

Approx. 1734 words, plus code