Learn about

**The Molecules of HIV**

This week I was learning about **Gaussian Processes**, at the very nice Gaussian Processes Winter School in Sheffield. The term "Gaussian Processes" refers to a family of techniques for inferring a smooth surface (1D, 2D, 3D or more) from a set of sampled noisy data points. Essentially, it's an advanced and mathematically very sound type of regression.

*Don't get confused by the name, by the way: your data doesn't have to be Gaussian, and Gaussian Process regression doesn't always produce smooth Gaussian-looking results. It's very flexible.*

As an example, here's a first pass I did of analysing the frequency trajectories in a single recording of birdsong.

I used the "GPy" Python package to do all this. Here's their GPy regression tutorial.

I do want to emphasise that this is just a first pass, I don't claim this is a meaningful analysis yet. But there's a couple of neat things about the analysis:

- It can combine periodic and nonperiodic variation (by combining periodic and nonperiodic covariance kernels). Here I used a standard RBF kernel plus a periodic kernel which repeats every 1 syllable, and another periodic kernel which repeats every 3 syllables, which reflects well the patterning of this song bout.
- It can represent variation across multiple levels of detail. Unlike many other regressions/interpolations, sometimes there are fast wiggles and sometimes broad curves.
- It gives you error bars, which are derived from a proper Bayesian posterior.

So now here's my second example, in a completely different domain. I'm not a geostatistician but I decided to have a go at reconstructing the hills and valleys of Britain using point data from OpenStreetMap. This is a fairly classic example of the technique, and OpenStreetMap data is almost a perfect for the job: it doesn't hold any smooth data about the surface terrain of the Earth, but it does hold quite a lot of point data where elevations have been measured (e.g. the heights of mountain peaks).

If you want to run this one yourself, here's my Python code and OpenStreetMap data for you.

This is what the input data look like - I've got "ele" datapoints, and separately I've got coastline location points (for which we can assume ele=0):

Those scatter plots don't show the heights, but they show where we have data. The elevation data is densest where we have mountain ranges etc, such as central Scotland and in Derbyshire.

And here are two different fits, one with an "exponential" kernel and one with a "Matern" kernel:

Again, the nice thing about Gaussian Process regression is that it seamlessly handles smooth generalisations as well as occasional patches of fine detail where needed. How good are the results? Well it's hard to tell by eye, and I'd need some official relief-map data to validate it. But from looking at these two, I like the exponential-kernel fit a bit better - it certainly gives an intuitively appealing relief map in central Scotland, and it gives visually a bit less blobbiness than the other plot. However it's a bit more wrong in some places, e.g. an overestimated elevation in Derbyshire there (near the centre of the picture). If you ask an actual geostatistics expert, they will probably tell you which kernel is a good choice for regressing terrain shapes.

The other thing you can see in the images is that it isn't doing a very good job of predicting the sea. Often, we dip down to altitude of zero at the coast and then pop back upwards after. No surprises about this, for two reasons: firstly I didn't give it any data points about the sea, and secondly I'm using "stationary" kernels, meaning there's no reason for the algorithm to believe the sea behaves any differently from the land. This is easy to fix by masking out the sea but I haven't bothered.

So altogether, these examples show some of the nice features of Gaussian Process regression, and, along with the code, that the GPy module makes it pretty easy to put together this kind of analysis in Python.