I'm writing a tree data structure as part of my research. I'm not going to describe the algorithm in detail, but it takes a set of data points and repeatedly chops them into two groups so that you can divide a dataset up into spatial subgroups.
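To give a rough idea of the general shape of this kind of structure, here is a minimal sketch in plain Python. It is not my actual algorithm: the splitting rule (cut at the median along the axis with the largest spread), the `leaf_size` parameter, and the names `build_tree` and `count_points` are all assumptions made up for illustration.

```python
import random


def build_tree(points, leaf_size=8):
    """Recursively chop `points` (a list of equal-length tuples) into two groups.

    Assumed rule, for illustration only: split at the median along the
    axis where the points are most spread out.
    """
    # Stop splitting once a group is small enough (never split a single point).
    if len(points) <= max(leaf_size, 1):
        return {"leaf": True, "points": points}

    dims = len(points[0])
    # Choose the axis with the largest range of values.
    axis = max(
        range(dims),
        key=lambda d: max(p[d] for p in points) - min(p[d] for p in points),
    )
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return {
        "leaf": False,
        "axis": axis,
        "threshold": ordered[mid][axis],
        "left": build_tree(ordered[:mid], leaf_size),
        "right": build_tree(ordered[mid:], leaf_size),
    }


def count_points(node):
    """Total number of points stored in the leaves below `node`."""
    if node["leaf"]:
        return len(node["points"])
    return count_points(node["left"]) + count_points(node["right"])


# Small demo on random 3D points.
random.seed(0)
demo_points = [tuple(random.random() for _ in range(3)) for _ in range(100)]
demo_tree = build_tree(demo_points)
print(count_points(demo_tree))
```

Each internal node records which axis it cut along and where, so the leaves end up as the spatial subgroups.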

Anyway, my first implementation (in SuperCollider 3) was running fairly slowly so I tried it in three other languages, to see which would be most practical for my situation.

It's an informal kind of benchmark - informal cos I'm not going to show you the code, and I haven't run the tests dozens of times, etc. (Some of the tests I ran just once, since they took so long.) The datasets consisted of artificially-generated 3D points sampled from a mixture of a cubic and a toroidal distribution. In the following graph, lower results (shorter times) are better:

The results show a couple of interesting things. SuperCollider was my starting point and it was never developed for large data-crunching tasks so I'm not surprised that it becomes the worst performer once we get to large datasets, although it actually doesn't do too badly. To be ten times as slow as Python or Matlab on big datasets is not embarrassing when both of those have had so many more person-hours of development effort specifically for big data crunching.
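If you want to try something similar yourself, test data of the kind described above (3D points from a mixture of a cubic and a toroidal distribution) could be generated and timed roughly like this. It's only a sketch: the torus radii, the cube size, the 50/50 mixture, and the function names are all assumptions rather than the settings I used, and `sorted` is just a stand-in for whatever tree-building function you want to time.

```python
import math
import random
import time


def sample_cube(rng, half_width=1.0):
    """Uniform point inside a cube centred on the origin (assumed size)."""
    return tuple(rng.uniform(-half_width, half_width) for _ in range(3))


def sample_torus(rng, major_radius=2.0, minor_radius=0.5):
    """Point on a torus surface; both angles are drawn uniformly,
    which is simple but not uniform by surface area."""
    theta = rng.uniform(0.0, 2.0 * math.pi)
    phi = rng.uniform(0.0, 2.0 * math.pi)
    ring = major_radius + minor_radius * math.cos(phi)
    return (
        ring * math.cos(theta),
        ring * math.sin(theta),
        minor_radius * math.sin(phi),
    )


def make_dataset(n, cube_fraction=0.5, seed=0):
    """Mixture of cube and torus samples; the mixing fraction is an assumption."""
    rng = random.Random(seed)
    return [
        sample_cube(rng) if rng.random() < cube_fraction else sample_torus(rng)
        for _ in range(n)
    ]


def time_once(func, data):
    """Wall-clock seconds for a single call: informal, like the tests here."""
    start = time.perf_counter()
    func(data)
    return time.perf_counter() - start


for n in (1000, 10000):
    data = make_dataset(n)
    # `sorted` is a placeholder workload; swap in your own tree builder.
    print(n, time_once(sorted, data))
```

A single run per size is noisy, so repeat the timing a few times if the numbers matter to you.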

The comparison against Octave is illuminating. Octave was originally my open-source Matlab alternative of choice, but I've come to feel like it has all the drawbacks of Matlab (mainly the godawful design of the Matlab language) and none of the advantages (under-the-hood optimisation tricks, great plotting). Here I was running exactly the same code in Matlab (7.4) and Octave (3.0.5). I expected Octave to be roughly competitive, since this branching recursive code is quite difficult to auto-optimise, but Matlab generally handles it something like ten times as fast. So here I find another sign that Octave isn't quite there.

I now know, of course, that Python + numpy is the open-source Matlab alternative of choice. The language design is much better, and numpy (the module that provides all the matrix-crunching tools) has had lots of development effort and keeps getting better. And this (informal!) benchmark shows Python (2.5.4, with numpy 1.3.0) performing just as well as Matlab on the large data.

(There is one thing that Python definitely lacks compared to Matlab: decent well-integrated 3D plotting. matplotlib doesn't have it except in old deprecated versions; python's gnuplot interface is poorly developed; other python plotting libs have drawbacks such as non-interactivity. I've mentioned this before.)

So I'll probably be using my Python implementation of the tree data structure. It's right up there in terms of speed, plus the code is conceptually cleaner than the Matlab version, so it'll be easier to maintain and easier for others to grok, which makes it better for reproducible research. Remember, this benchmark was only informal so do your own tests if you care about this kind of thing...