This post is a short, incomplete, and probably mistake-riddled introduction to the field of topological data analysis. Broadly speaking, we try to apply ideas from the relatively “pure” parts of topology and geometry to problems in data analysis. Although the field has a pretty long history, it has just started to become very popular thanks to evangelism by a few prominent mathematicians. (Increasing computer power and memory probably don’t hurt, either!)
What are topology and geometry?
Topology is the study of spaces without regard to their exact geometry. Sometimes it’s called “rubber sheet geometry,” because we’re allowed to stretch, bend, or otherwise manipulate the space we’re looking at, as long as we don’t cut, tear, or do other destructive operations. A famous (infamous?) joke goes something like this: a topologist is a mathematician who can’t tell the difference between a coffee mug and a donut. The explanation for the joke is that if you considered the coffee mug as being made of Play-Doh or some other malleable substance, you could mold it into the shape of a donut without ripping or tearing. Hence the two objects are equivalent, topologically.
Geometry (I’m thinking especially of differential geometry here) is the study of spaces where we do take into account local properties like how curved the space is at different points. From this perspective, a coffee cup is completely different from a donut. This is perhaps the more natural point of view to take, but the fact is that geometry is hard, and (especially in higher dimensions) it’s often easier to tell two objects apart by using purely topological properties.
What is Topological Data Analysis?
Many problems in machine learning start with a set of points in a high dimensional space. It’s very common that these points actually sit on a lower dimensional submanifold of the ambient space. Think of a circle embedded in the plane: the points live in a space of dimension 2, but the object itself (a circle) is one dimensional. In practice the gap is usually far larger: points in spaces with hundreds or thousands of dimensions are common, while the underlying submanifold often has dimension only a small fraction of that.
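To make the circle example concrete, here is a small sketch (using NumPy, with illustrative variable names of my own choosing) that samples points from a circle and embeds them in a 10-dimensional ambient space. The data is intrinsically one dimensional no matter how large the ambient dimension gets.

```python
import numpy as np

# Sample 200 points from a unit circle (intrinsic dimension 1) and embed
# them in R^10 via a random orthonormal map.
rng = np.random.default_rng(0)
angles = rng.uniform(0.0, 2.0 * np.pi, size=200)
circle_2d = np.column_stack([np.cos(angles), np.sin(angles)])  # shape (200, 2)

# QR decomposition of a random 10x2 matrix gives orthonormal columns,
# so the map preserves distances.
q, _ = np.linalg.qr(rng.normal(size=(10, 2)))
points = circle_2d @ q.T  # shape (200, 10): ambient dimension 10

# Every point is still at distance 1 from the origin; the "data" lives
# on a one-dimensional curve sitting inside a 10-dimensional space.
print(points.shape)                                      # (200, 10)
print(np.allclose(np.linalg.norm(points, axis=1), 1.0))  # True
```

A learner handed only `points` sees 10 coordinates per sample; recovering the fact that the underlying object is a circle is exactly the kind of question topological data analysis asks.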
The goal of topological data analysis is to use ideas from topology and geometry to say interesting things about a data set. Often the goal is more specific: to compute, approximate, or shed light on the topological type of the submanifold the data is drawn from. To go back to the circle example, if we’re given points sampled from a circle embedded in high dimensional space, we’d like to know that the object being sampled from is a circle. Sometimes, though, we can get by with even less. Perhaps the data isn’t drawn from a manifold at all, but a set with singularities. We may still be able to use topological techniques to learn about the data. For example, is it possible that the data consists of points drawn from several disconnected components?
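The disconnected-components question at the end of the paragraph can be attacked very directly: connect every pair of sample points closer than some scale and count the components of the resulting graph. Below is a hand-rolled union-find sketch of that idea (names and the brute-force pairwise loop are my own simplifications, not any particular library's API).

```python
import numpy as np

def components_at_scale(points, radius):
    """Connect any two points within `radius` of each other and count
    the connected components, via a simple union-find."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(points[i] - points[j]) <= radius:
                parent[find(i)] = find(j)

    return len({find(i) for i in range(n)})

# Two well-separated blobs: two components at a small scale,
# one component once the scale exceeds the gap between them.
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=0.0, scale=0.1, size=(30, 2))
blob_b = rng.normal(loc=5.0, scale=0.1, size=(30, 2))
data = np.vstack([blob_a, blob_b])
print(components_at_scale(data, 1.0))   # 2
print(components_at_scale(data, 10.0))  # 1
```

Notice that the answer depends on the chosen scale, which is exactly the ambiguity that persistence (discussed below) is designed to handle.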
Early topological data analysis
The fact is that topological data analysis has a lot to do with machine learning. Depending on your point of view, you might even consider it a subfield. I won’t take sides in that argument, but many older ideas from ML clearly fit under the broad title of topological data analysis. One great example is Kohonen’s Self-Organizing Map, developed in the early 1980s. This is an unsupervised learning algorithm that takes potentially high dimensional data and places it on a grid of some kind in a way that preserves as much of the topological information as possible. For example, points near each other in the source space should end up relatively near each other in the map. The SOM has been applied successfully to all sorts of problems.
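To give a flavor of the algorithm, here is a minimal one-dimensional SOM in the spirit of Kohonen's method. The learning-rate and neighborhood schedules, and all parameter names, are my own simplifications of the many variants in the literature.

```python
import numpy as np

def train_som(data, grid_size=8, epochs=20, lr=0.5, sigma=2.0, seed=0):
    """Minimal 1-D self-organizing map: a line of `grid_size` nodes,
    each carrying a weight vector in the data space."""
    rng = np.random.default_rng(seed)
    weights = rng.normal(size=(grid_size, data.shape[1]))
    grid = np.arange(grid_size)

    for epoch in range(epochs):
        # Decay the learning rate and neighborhood width over time.
        frac = epoch / epochs
        cur_lr = lr * (1.0 - frac)
        cur_sigma = sigma * (1.0 - frac) + 0.5
        for x in rng.permutation(data):
            # Best-matching unit: the node whose weight is closest to x.
            bmu = np.argmin(np.linalg.norm(weights - x, axis=1))
            # Pull every node toward x, weighted by its grid distance
            # to the BMU -- this is what preserves neighborhood structure.
            influence = np.exp(-((grid - bmu) ** 2) / (2 * cur_sigma ** 2))
            weights += cur_lr * influence[:, None] * (x - weights)
    return weights

# Toy data near a line segment in R^3; the 1-D map should spread along it.
rng = np.random.default_rng(2)
t = rng.uniform(0, 1, size=200)
data = np.column_stack([t, 2 * t, -t]) + rng.normal(scale=0.01, size=(200, 3))
w = train_som(data)
print(w.shape)  # (8, 3)
```

The key topological idea is in the `influence` term: updates respect distance on the grid, so nearby grid nodes end up representing nearby regions of the data.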
More recent stuff
More recently, data analysis has received attention from a group including many homotopy theorists. One of the main proponents has been Gunnar Carlsson at Stanford. Carlsson, along with some collaborators, developed the idea of “persistent homology” as well as a few other important algorithms, including “mapper.” The company Ayasdi, cofounded by Dr. Carlsson, uses mapper in their central product (as far as I know).
Persistent homology has attracted more attention from the academic community. Unfortunately, it’s quite computationally expensive for large data sets, so a lot of work has gone into faster algorithms and approximations. It’s also relatively sensitive to outliers, another shortcoming that researchers have worked to address. My colleague Itamar Gal and I did some work in persistent homology a few years back, and I’d like to talk about that eventually.
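In degree zero, persistent homology has a pleasantly simple description: grow balls around the points, and record the scale at which connected components merge. The sketch below computes that barcode with a Kruskal-style union-find over sorted pairwise distances. It is a brute-force O(n²) illustration for tiny data sets, not the optimized matrix-reduction algorithms real software uses.

```python
import numpy as np
from itertools import combinations

def h0_barcode(points):
    """Degree-0 persistence of a Vietoris-Rips filtration: process edges
    in order of length and record the scale at which components merge."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted(
        (np.linalg.norm(points[i] - points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    deaths = []  # every component is born at scale 0
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            deaths.append(d)  # a component dies (merges) at this scale
    # n - 1 merges happen; one component survives forever.
    return deaths

rng = np.random.default_rng(3)
blob_a = rng.normal(0.0, 0.05, size=(20, 2))
blob_b = rng.normal(3.0, 0.05, size=(20, 2))
deaths = h0_barcode(np.vstack([blob_a, blob_b]))
print(len(deaths))        # 39
print(max(deaths) > 1.0)  # True
```

One merge happens at a much larger scale than all the others: the two blobs joining. That single long bar is the “persistent” signal that the data has two components, and it is robust to the choice of scale in a way the fixed-radius count above is not.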
My (small) contributions
As I said, I have worked some in persistent homology. But when I arrived as a postdoc at Oklahoma State University, my postdoctoral advisor Jesse Johnson had just invented a novel way to find clusters of vertices in graphs. The technique was inspired by thin position for knots and 3-manifolds; it certainly counts as a topological data analysis algorithm. For about three years we worked on various aspects of it, and I hope to talk about that work soon as well. (In the meantime, you can check out our paper on thin tree position, and another on an application of these ideas to a concrete problem in protein functional classification.)
Other folks have applied ideas from topology to real world problems. One notable example is Rob Ghrist. Although he has also written some about persistent homology, he has a lot of other work going on. He has worked on sensor networks (for example, can you tell if there are gaps in coverage? Can you take a “local” picture from just a few sensors and turn it into a “global” view?). He has worked on applications to robotics. He’s looked into “Euler calculus” as a method to do all sorts of cool stuff in engineering. In short, I don’t think the guy sleeps.
As usual, Wikipedia has a good article on topological data analysis that I would encourage you to read if you’re interested in this stuff.