Distance analysis methods: Multidimensional Scaling and SplitsTree try to unravel the Tube map

In scientific research, one of the things you sometimes need to do is take a set of distance measurements (e.g. "it's 5 metres from A to B, 4 metres from A to C, and 3 metres from B to C") and try to reconstruct the actual spatial layout underlying that data.

So how to do it? Well one approach is Multidimensional Scaling (MDS) and it's been known for a few decades in timbre research. It assumes that the data exist in a Euclidean space (a pretty straightforward space like ordinary 3D space we're used to) and arranges the points in a layout that gives the least disagreement with the distance measurements. So if we have a set of musical timbre judgments (e.g. "a bassoon sounds quite like an oboe, but not much like a violin") we can try and force those objects into a spatial arrangement that suits their relationships, and then view the resulting map.

But there's a problem. Who in the world said that audio timbre behaved like a standard Euclidean space? Does it depend on context? (Yes.) Is the difference between A and B always the same as the difference between B and A? Does timbre behave more like categories (e.g. woody vs metallic vs watery) than like a space?

That's a big problem and there's no clear solution. I saw a talk by Ashley Burgoyne at ICMC 2007 which suggested some modifications to MDS to help account for the weirdness of timbre-space. Some of it makes intuitive sense: e.g. the use of "specificities" builds in the idea that one data-point may be more unique than it should be, having its own special distance to cover the fact that a trumpet sounds uniquely different from everything else. And he argued that the nonlinear versions coped better with the evidence about timbre judgments.

Then I heard about another completely different approach. Geneticists have developed rather clever ways of analysing the genes of different creatures, to produce "genetic distance" measures and then use those to reconstruct what the evolutionary tree could have been. The maths can be applied to any set of distance measurements (aha!) and creates a tree that best represents them - the "tree" is actually a kind of space, not the same as Euclidean space.

For an introduction to the maths involved, see Metric Spaces in Pure and Applied Mathematics.

I needed to get my head around how this approach might work, and whether it might be useful. So I decided to apply it to a weird space in which distance measurements might not correspond to actual spatial distances... the Tube map.

If you've been on the Tube you'll know that some journeys are longer than they should do, and the durations don't actually match up with the geographic distances they take you. You'll also know that the Tube map itself is highly nonlinear, the geographical layout is warped to make it neat and easy to read.

So I took this section of the Tube map:

and from the web I found two different sorts of data:

how long it takes (in minutes) to walk from one station to another, overground;
how long it takes to get from one station to another by tube.

Now the first set of data should be "more Euclidean" since walking is basically going in a straight line except for the buildings in the way; while the tube timings should be weirder because you're strongly constrained, there's only a few pipes you can go down and they don't always connect up in all the obvious ways.

So when you feed the walking-times into MDS you get this (I've painted the tube-lines back onto the map to make things more obvious):

Not bad eh? The arrangement is actually quite a lot like the real-world layout of the tube stations.

And here's what happened when the same walking-times were fed into SplitsTree:

Yes, it kind of works, except that Russell Square pokes out a bit weirdly, I think due to the algorithm's requirement that the data points sit at the edge of the graph. The SplitsTree representation is almost-but-not-quite happy to represent the data in 2D, shown by the patchwork of almost-rectangles.

Here's where the differences really show up though: the tube-timing data. The walking-time data was "easy"...

Tube-timing data after MDS:

Tube-timing data after SplitsTree:

Note that both algorithms push the circle line (the yellow line) away from the others, out towards the top-right of the space. That's because the circle line, although it crosses over the others, doesn't have as many intersections as it might do (it doesn't have a stop at Euston or Warren Street, for example). Both algorithms spot that Kings Cross is a hub in this network (meaning it's easy to get to most of these stops from Kings Cross), placing it right at the heart of the layout. More generally, neither algorithm reconstructs the geographical layout of the stations, simply because the time it takes to get from A to B isn't so much defined by geography but by the peculiarities of London Underground.

The SplitsTree representation seems here to use a lot of 3D boxes, and there are some convoluted goings-on inside the way it tries to rationalise all the distances.

Notice also that on the SplitsTree diagram, most stations have their own little spike to live on. These are similar to the "specificities" I mentioned earlier - each tube station takes that little bit of extra time because of the time needed to get up and down the escalators (or whatever). For the Piccadilly (dark blue) line, SplitsTree seems to suggest that the majority of the time taken is in getting up and down and the actual journey between stations is pretty quick, which I think pretty much reflects reality.

I did all this in order to try and grok the tree reconstruction algorithms. Not sure if I've got there yet, but this was definitely helpful...

Wed 11 February 2009 | science | Permalink

mcld.co.uk

Other things on this site...