One problem I have faced a couple of times in the past is this: we’re given a list of objects, and a list of identifications of two objects. How many groups of objects are there after we perform the identifications? In mathematical parlance we’re given a set, a relation, and we’re asked to find the equivalence classes under the equivalence relation generated by the given relation.

To make the problem more concrete, suppose that we’re given the set \(\set{0, 1, 2, 3, 4, 5, 6}\) and a list of tuples \(\set{(1, 2), (1, 4), (3, 6)}\). The tuples tell us to first lump together 1 and 2, then lump together 4 with those two, and then group 3 and 6 together. In the end, we have the sets \(\set{0}\), \(\set{1, 2, 4}\), \(\set{3, 6}\), and \(\set{5}\).

One place this comes up is in finding the connected components of a (disconnected) graph. After adding all vertices connected by an edge, we can quickly tell whether two vertices are connected by a path in the graph. There are many other applications in which this type of data structure is useful, too. To be honest, I can’t remember where I needed this or what the context was, but I remember being vexed – I came up with a solution, but it was complicated and not very fast.

It turns out that there is a fast and relatively simple way to do this, using
a **Union-Find** data structure, named after two of its important operations.
This data structure is also called a disjoint set or merge-find data
structure, and you can read more about them on
Wikipedia. I’ve
also put a simple implementation in a github
repo.

## Short implementation in Python

I wanted to show you a short python implementation that runs so much faster than the implementation I came up with years ago that it’s not even funny. It’s not too hard to understand, either. Conceptually, we’ll construct a forest (a collection of trees), but the implementation is quite a bit simpler.

Without loss of generality we may as well assume our set is the integers between 0 and some number \(n\), not including \(n\). Our input is a list of tuples of numbers in this range.

The idea is to keep a list of parents of elements.

- The parent of element
`i`

is`self.parents[i]`

. At the beginning, each element is its own parent; this is our way of saying that this is a tree with only one vertex (the root). - As we merge together trees, we make some tree the parent of another tree.
This is the
`self.parents`

list.

The `root`

method takes a value and returns the root of the tree the value
belongs to. It does this by looking at the parent of the value, the parent of
the parent, and so on until it gets to the root of the tree (which has the
property that it is its own parent). This gives us a canonical choice of
representative for every value in the set of vertices of the tree.

The union (or merge) operation takes two values and merges their trees together into one tree. To do this, we first get the root of both trees. Then we pick one and make its parent the other. Here’s the code:

```
class UnionFind(object):
def __init__(self, n):
self.parents = range(n)
def root(self, i):
p = i
while p != self.parents[p]:
p = self.parents[p]
return p
def union(self, v1, v2):
r1 = self.root(v1)
r2 = self.root(v2)
self.parents[r1] = r2
```

That’s it. No kidding! We can run it like this for the example above:

```
union_find = UnionFind(7)
union_find.union(1, 2)
union_find.union(1, 4)
union_find.union(3, 6)
```

At the beginning, each number is the root of its own tree (trees 0, 5, and 6 are left out):

After we unify 1 and 2, we have a new tree with 2 as its root:

To unify 1 and 4, we look up the root of 1 (which is now 2) and make 4 its parent:

The `root`

method gives us a canonical representative of the equivalence class
a number belongs to, so if we want to check whether 2 and 4 are in the same
class, we just check whether `union_find.root(2) == union_find.root(4)`

. That
returns true in this case, but `union_find.root(0) == union_find.root(1)`

is
false because those two values lie in different equivalence classes.

## An optimization: path compression

As we identify elements as belonging to the same equivalence class, our trees grow in height. This is bad for lookups; it would be better to keep the trees relatively shallow. The first optimization is dirt simple and does exactly that.

When we find the root of an element (the parent of its parent…), we might as
well set all the intermediate children to point to the root. After all, the
root is the representative of all of them. Why spend lots of time in the
while loop when we can just loop once and be done? Although there are tricks
to do it in one loop, I think this new version of the `root`

method is pretty
clear:

```
def root(self, i):
p = i
while p != self.parents[p]:
p = self.parents[p]
c = i
while c != p:
old_parent = self.parents[c]
self.parents[c] = p
c = old_parent
return p
```

After this trick, the example above looks like this:

This trick is sometimes called *path compression*. As you can see, the path
from 1 to 2 to 4 is eliminated in favor of two paths, from 1 to 4 and from
2 to 4. This optimization has the effect of “flattening” the tree and makes
a huge difference in the height of trees in the tests I ran. This in turn can
translate to much shorter times to determine whether two elements belong to
the same equivalence class.

## Other optimizations

There are some optimizations we can make, but frankly they add code and complexity. The implementation given here works up to millions of numbers and even more identifications with no problem.

## Other fun things to do

Once the basic data structure is implemented, it’s not hard to get a list of
the equivalence classes. It’s also not hard to get a representation of the
trees that are implicitly stored in the `parents`

member. If you’re
interested in these things, check out some code (with additional
optimizations, simple tree visualization, and other things) at my github
repo.

If you want to go further with this, there’s plenty of ground to cover. It turns out that the implementation of a union-find structure presented here (together with path compression and the other optimization we didn’t talk about) has very good running time: the amortized cost per operation is essentially constant (see Wikipedia for details). Plus, there is a formalization in Coq. How cool!

Approx. 1130 words, plus code