Modelling vocal interactions

Last year I took part in the Dagstuhl seminar on Vocal Interactivity in-and-between Humans, Animals and Robots (VIHAR). Many fascinating discussions with phoneticians, roboticists, and animal behaviourists (ethologists).

One surprisingly difficult topic was to come up with a basic data model for describing multi-party interactions. It was so easy to pick a hole in any given model: for example, if we describe actors taking "turns" which have start-times and end-times, then are we really saying that the actor is not actively interacting when it's not their turn? Do conversation participants really flip discretely between an "on" mode and an "off" mode, or does that model ride roughshod over the phenomena we want to understand?

I was reminded of this modelling question when I read this very interesting new journal article by a Japanese research group: "HARKBird: Exploring Acoustic Interactions in Bird Communities Using a Microphone Array". They have developed this really neat setup with a portable microphone array attached to a laptop, which does direction-estimation and decodes which birds are heard from which direction. In the paper they use this to help annotate the time-regions in which birds are active, a bit like the on/off model I mentioned above. Here's a quick sketch:

[Diagram: each bird's singing activity drawn as on/off time regions]

From this type of data, Suzuki et al calculate a measure called the transfer entropy which quantifies the extent to which one individual's vocalisation patterns contain information that predicts the patterns of another. It gives them a hypothesis test for whether one particular individual affects another, in a network: who is listening to whom?
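For intuition, transfer entropy on this kind of binarised on/off activity data can be estimated with a simple plug-in estimator. Here's a minimal sketch for illustration only (history length 1, no significance testing) — not the estimator used in the HARKBird paper:

```python
import numpy as np
from collections import Counter

def transfer_entropy(source, target):
    """Plug-in transfer entropy estimate (bits), history length 1:
    how much does the source's current state improve prediction of the
    target's next state, beyond the target's own current state?"""
    s, x = np.asarray(source), np.asarray(target)
    n = len(x) - 1
    counts = Counter(zip(x[1:], x[:-1], s[:-1]))  # (next, own past, source past)
    te = 0.0
    for (x1, x0, s0), c in counts.items():
        p_full = c / sum(v for (_, b, d), v in counts.items() if b == x0 and d == s0)
        p_own = (sum(v for (a_, b, _), v in counts.items() if a_ == x1 and b == x0)
                 / sum(v for (_, b, _), v in counts.items() if b == x0))
        te += (c / n) * np.log2(p_full / p_own)
    return te

# Toy check: x copies y with a one-step delay, so information flows y -> x.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 2000)
x = np.roll(y, 1)
```

On this toy data the estimate comes out close to 1 bit in the y-to-x direction and near zero in the reverse direction — the "who is listening to whom" asymmetry.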

That's a very similar question to the question we were asking in our journal article last year, "Detailed temporal structure of communication networks in groups of songbirds". I talked about our model at the Dagstuhl event. Here I'll merely emphasise that our model doesn't use regions of time, but point-like events:

[Diagram: each bird's calls drawn as point-like events in time]

So our model works well for short calls, but is not appropriate for data that can't be well-described via single moments in time (e.g. extended sounds that aren't easily subdivided). The advantage of our model is that it's a generative probabilistic model: we're directly estimating the characteristics of a detailed temporal model of the communication. The transfer-entropy method, by contrast, doesn't model how the birds influence each other, just detects whether the influence has happened.

I'd love to get the best of both worlds: a generative and general model for extended sound events influencing one another. It's a tall order because for point-like events we have point process theory; for extended events I don't think the theory is quite so well-developed. Markov models work OK but don't deal very neatly with multiple parallel streams. The search continues.

Friday 24th February 2017 | science | Permalink

Paper: Applications of machine learning in animal behaviour studies

A colleague pointed out this new review paper in the journal "Animal Behaviour": Applications of machine learning in animal behaviour studies.

It's a useful introduction to machine learning for animal behaviour people. In particular, the distinction between machine learning (ML) and classical statistical modelling is nicely described (sometimes tricky to convey that without insulting one or other paradigm).

The use of illustrative case studies is good. Most introductions to machine learning base themselves around standard examples predicting "unstructured" outcomes such as house prices (i.e. predict a number) or image categories (i.e. predict a discrete label). Two of the three case studies (all of which are by the authors themselves) are similarly about predicting categorical labels, but couched in useful biological context. It was good to see the case study relating to social networks and jackdaws - not only because it relates to my own recent work with colleagues (specifically: this on communication networks in songbirds and this on monitoring the daily activities of jackdaws - although in our case we're using audio as the data source), but also because it shows an example of using machine learning to help elucidate structured information about animal behaviour rather than just labels.

The paper is sometimes mathematically imprecise: it's incorrect that Gaussian mixture models "lack a global optimum solution", for example (it's just that the global optimum can be hard to find). But the biggest omission, given that the paper was written so recently, is any real mention of deep learning. Deep learning has been showing its strengths for years now, and is not yet widely used in animal behaviour but certainly will be in years to come; researchers reading a review of "machine learning" should really come away with at least a sense of what deep learning is, and how it sits alongside other methods such as random forests. I encourage animal behaviour researchers to look at the very readable overview by LeCun et al in Nature.

Tuesday 31st January 2017 | science | Permalink

Some papers seen at MLSP 2016

MLSP 2016 - i.e. the IEEE International Workshop on Machine Learning for Signal Processing - was a great, well-organised workshop, held last week on Italy's Amalfi coast. (Yes, lovely place to go for work - if only I'd had some spare time for sightseeing on the side! Anyway.)

Here are a few of the papers that caught my interest:

  • Approximate State-Space Gaussian Processes Via Spectral Transformation by Toni Karvonen and Simo Särkkä. This is an important contribution to the current work on Gaussian processes and in particular on running efficient Gaussian process inference. It builds on other work from the Särkkä lab converting Gaussian processes to state-space models, which often involves a (mild) approximation. This paper introduces some new methods in that vein, with proofs, and in fact the paper includes various ways to approximate a GP. A veritable mathematical toolkit. It seems the Taylor expansion (the most immediately comprehensible IMHO) is not the best.

Actually, there was substantial work involving Gaussian processes at MLSP. Is it a growth area? Well, if the use of GPs can be made more scalable (as in the above paper) then yes, it certainly should be. They are a very flexible and general tool, and nicely Bayesian too. Richard Turner's keynote about Gaussian processes was a beautiful introduction - he manages to make GPs extremely understandable. If you get a chance to see him speak on them then do.
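To give a flavour of the state-space idea: for the exponential (Matérn-1/2) kernel the conversion is exact, since that GP is the stationary Ornstein-Uhlenbeck process. A toy example of my own (not the spectral-transformation method of the paper above):

```python
import numpy as np

# A GP with the exponential (Matern-1/2) kernel k(t,t') = s2 * exp(-|t-t'|/ell)
# is exactly a stationary Ornstein-Uhlenbeck process: a one-dimensional
# linear-Gaussian state-space model. So inference on n points can run in O(n)
# with a Kalman filter/smoother instead of O(n^3) with a dense kernel matrix.

s2, ell = 2.0, 0.7                 # kernel variance and lengthscale
n = 50
t = np.linspace(0.0, 3.0, n)       # evenly spaced time grid
dt = t[1] - t[0]

# Dense GP view: the kernel (covariance) matrix.
K = s2 * np.exp(-np.abs(t[:, None] - t[None, :]) / ell)

# State-space view: x_{k+1} = a x_k + w_k, w_k ~ N(0, q), started stationary.
a = np.exp(-dt / ell)
q = s2 * (1.0 - a**2)              # process noise that keeps the variance at s2
# The recursion implies cov(x_i, x_j) = s2 * a^|i-j| -- the same matrix.
lag = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])
K_ss = s2 * a ** lag

assert np.allclose(K, K_ss)
```

For smoother kernels the state-space form is an approximation rather than exact, which is where methods like those in the Karvonen and Särkkä paper come in.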

  • Localizing Users And Items From Paired Comparisons by O'Shaughnessy and Davenport. This is a nicely conceived addition to the literature on recommendation algorithms, and with good demonstrations of how the approach is robust to issues such as incoherent paired comparisons.
  • "Data Privacy Protection By Kernel Subspace Projection And Generalized Eigenvalue Decomposition" by Diamantaras and Kung. Privacy-preserving computing is an important area for current research. It's made obvious when we see how much a large company like Facebook or Tesco can infer about its users. Here, the authors treat privacy as a classification task - i.e. the data to be kept private is some kind of discrete label - and they apply an LDA-like method: maximise the scatter between the target classes for the "allowed" task, while minimising the scatter between the private classes. (I raised an issue with their "Privacy Index", noting that the desired accuracy for the private task was not in fact zero but ignorance. I'd presume that a metric based on mutual information would be a nice alternative.)
  • "Scale and shift invariant time/frequency representation using auditory statistics: application to rhythm description" by Marchand and Peeters. They use the "Scale Transform", a special case of the Mellin transform. It's equivalent to exponentially time-warping a signal and then weighting by an exponential window. Since it's not shift-invariant you don't want to apply it directly to audio, but to e.g. the autocorrelation. From there, they argue, you get a good featureset for characterising musical rhythm.
  • Score-Matching Estimators For Continuous-Time Point-Process Regression Models by Sahani, Bohner and Meyer - good to see this. I've been using point process models to analyse bird communication and so I'm interested in efficient ways to do such analysis, which commonly seem to come from the computational neuroscience literature at the moment. Notable that this approach doesn't require any time discretisation, so could be useful. The functions analysed need to be differentiable, so to work with impulsive time series they actually convolve/correlate them with basis functions; feels like a minor hack but there you go.
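As an aside, the exponential-warp view of the scale transform is easy to sketch. This toy version samples an analytic function directly on an exponential time grid - a real implementation would interpolate sampled audio, and (as noted above) apply the transform to something shift-invariant like the autocorrelation:

```python
import numpy as np

def scale_transform_mag(f, K=1024, log_t0=-4.0, log_t1=4.0):
    """Magnitude of a discretised scale transform of the function f(t):
    sample on an exponential time grid, weight by sqrt(t), then FFT."""
    tau = np.linspace(log_t0, log_t1, K, endpoint=False)   # uniform in log-time
    t = np.exp(tau)
    return np.abs(np.fft.fft(f(t) * np.sqrt(t)))

# Time-scaling t -> a*t becomes a shift in log-time, so the magnitude
# of the transform is invariant to time-scaling of the input.
bump = lambda t: np.exp(-((t - 1.0) ** 2) / 0.01)   # arbitrary test signal
a = np.exp(89 * (8.0 / 1024))                       # scale ~2 (a whole number of grid steps)
m1 = scale_transform_mag(bump)
m2 = scale_transform_mag(lambda t: bump(a * t))
m1, m2 = m1 / m1.max(), m2 / m2.max()
assert np.max(np.abs(m1 - m2)) < 1e-3               # same shape despite rescaling
```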

Also, I was very pleased that Pablo A Alvarado Duran presented his work on Gaussian processes for music audio modelling - his first publication as part of his PhD with me!

Sunday 18th September 2016 | science | Permalink

Some papers seen at InterSpeech 2016

InterSpeech 2016 was a very interesting conference. I have been to InterSpeech before, yes - but I'm not a speech-recognition person so it's not my "home" conference. I was there specifically for the birds/animals special session (organised by Naomi Harte and Peter Jancovic), but it was also a great opportunity to check in on what's going on in speech technology research.

Here's a note of some of the interesting papers I saw. I'll start with some of the birds/animals papers:

That's not all the bird/animal papers, sorry, just the ones I have comments about.

And now a sampling of the other papers that caught my interest:

  • Retrieval of Textual Song Lyrics from Sung Inputs by Anna Kruspe. Nice to see work on aligning song lyrics against audio recordings - it's something that the field of MIR is in need of. The example application here is if you sing a few words, can a system retrieve the right song audio from a karaoke database?
  • The auditory representation of speech sounds in human motor cortex - this journal article has some of the amazing findings presented by Eddie Chang in his fantastic keynote speech, discovering the way phonemes are organised in our brains, both for production and perception.
  • Today's Most Frequently Used F0 Estimation Methods, and Their Accuracy in Estimating Male and Female Pitch in Clean Speech by Sofia Strömbergsson. This survey is a great service for the community. The general conclusion is that Praat's pitch detection is really among the best off-the-shelf recommendations (for speech analysis, here - the evaluation hasn't been done for non-human sounds!).
  • Supervised Learning of Acoustic Models in a Zero Resource Setting to Improve DPGMM Clustering by Heck et al - "zero-resource" speech analysis is interesting to me because it could be relevant for bird sounds. "Zero resource" means analysing languages for which we have no corpora or other helpful data available - all we have is audio recordings. (Sound familiar?) In this paper the authors used some adaptation techniques to improve a method introduced last year based on unsupervised nonparametric clustering.
  • Speech reductions cause a de-weighting of secondary acoustic cues by Varnet et al: a study of some niche aspects of human listening. Through tests of people listening to speech examples in noise they found that people's use of secondary cues - i.e. clues that help to distinguish one phoneme from another, which clues are embedded elsewhere in the word than the phoneme itself - changes according to the nature of the stimulus. Yet more evidence that perception is an active, context-sensitive process etc.

One thing you won't realise from my own notes is that InterSpeech was heavily dominated by deep learning. Convolutional neural nets (ConvNets), recurrent neural nets (RNNs), they were everywhere. Lots of discussion about connectionist temporal classification (CTC) - some people say it's the best, some people say it requires too much data to train properly, some people say they have other tricks so they can get away without it. It will be interesting to see how that discussion evolves. However, many of the other deep-learning based papers were much of a muchness: lots of people use a ConvNet or an RNN and, as we all know, in many cases they can get good results. They apply these to many tasks in speech technology. However, in many cases there was application without a whole lot of insight. That's the way the state of the art is at the moment, I guess. Therefore, many of my most interesting moments at InterSpeech were deep-learning-less :) see above.

(Also, I had to miss the final day, to catch my return flight. Wish I'd been able to go to the VAD and Audio Events session, for example.)

Another aspect of speech technology is the emphasis on public data challenges - there are lots of them! Speech recognition, speaker recognition, language recognition, distant speech recognition, zero-resource speech recognition, de-reverberation... Some of these have been running for years and the dedication of the organisers is worth praising. Useful to check in on how these things are organised, as we develop similar initiatives in general and natural sound scene analysis.

Sunday 18th September 2016 | science | Permalink

Re: Metrics for Polyphonic Sound Event Detection

I just read this new paper, "Metrics for Polyphonic Sound Event Detection" by Mesaros et al. Very relevant topic since I'm working on a couple of projects to automatically annotate bird sound recordings.

I was hoping this article would be a complete and canonical reference that I could use as a handy citation to refer to in any discussion of evaluating sound event detection. It isn't that, for a single reason:

Just so you know, the context of that paper is the DCASE2016 challenge. For the purposes of the challenge, they've released a public python toolbox with their evaluation metrics, and that's a great way to go about things. This paper, then, is oriented around the evaluation paradigm used in DCASE2016.

In that paradigm, they evaluate systems which supply a list of inferred event annotations which are entirely on or off. They're not probabilistic, or ranked, or annotated with certainty/uncertainty. Fair enough, this happens a lot, and it's a perfectly justifiable way to set up the contest. However, in many of my scenarios, we work with systems that output a probabilistic or rankable set of output events - you can turn this into a definite annotation simply by thresholding, but actually what we'd like to do is evaluate the fully "nuanced" output.

Why? Why should evaluation care about whether a system labels an event confidently or weakly? Well, it's all about what happens downstream. An example: imagine you have an automatic system for detecting events, and you apply it to a dataset of 1000 hours of audio. No automatic system is perfect, and so you often want to either (a) only focus on the strongly-detected items in later analysis, or (b) ask a human expert to go through the results to cross-check them. In the latter case, the expert does not have time to listen to all 1000 hours; instead you'd like to prioritise their work, for example by focussing on the annotations that are the most ambiguous. This kind of work is very likely in the applications I'm working with.

The statistics focussed on in the above paper (F-measure, precision, recall, accuracy, error rate) are all based on definite binary annotations, so they don't make use of the nuance. I'm generally an advocate of the "area under the ROC curve" (AUC) statistic, which doesn't tell the whole story but it helps make use of the nuance by averaging over a whole range of possible detection thresholds.
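To make the contrast concrete, here's a tiny sketch using scikit-learn's `roc_auc_score` on made-up frame-wise scores (the numbers are hypothetical, purely for illustration):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical frame-wise ground truth (is a bird event active in this frame?)
# and a detector's *graded* scores rather than hard on/off decisions.
truth  = np.array([0,   0,   1,   1,   1,   0,   0,   1,   1,    0])
scores = np.array([0.1, 0.3, 0.9, 0.8, 0.6, 0.4, 0.2, 0.7, 0.35, 0.05])

# F-measure, precision etc. require choosing a threshold first; AUC instead
# averages over every possible threshold, so a hesitant detection is
# penalised in proportion to how badly it is ranked.
auc = roc_auc_score(truth, scores)  # 0.96: one weak positive ranked below one negative
```

Thresholding those scores at 0.5 would throw the ranking information away before evaluation even starts - that's exactly the nuance I'd like metrics to keep.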

A nice example of a paper which uses AUC for event detection is "Chime-home: A dataset for sound source recognition in a domestic environment" by Foster et al. The above paper does mention this in passing, but doesn't really tease out why anyone would use AUC or how it differs from the DCASE2016 paradigm.

I want to be clear that the Mesaros et al. paper is not "wrong" or anything like that. I just wish it had a section on evaluating ranked/probabilistic outputs, why that might matter, and what metrics come in useful. Similarly, the sed_eval toolbox doesn't have an implementation of AUC for event detection. Presumably fairly straightforward to add it to its "segment-wise" metrics. Maybe one day!

Monday 6th June 2016 | science | Permalink

Video: me on Science Showoff

On Monday I did a bit of what-you-might-call standup - at the Science Showoff night. Here's a video - including a bonus extra bird-imitation contest at the end!

Friday 20th May 2016 | science | Permalink

Research visit to MPIO Seewiesen

I'm nearing the end of a great three-week research visit to the Max-Planck Institute for Ornithology at Seewiesen (Germany). It's a lovely place dedicated to the study of birds. Full of birds and ornithologists:

Watch out for ornithologists

I'm visiting Manfred Gahr's group. We had some ideas in advance and some of them have turned out nicely fruitful for this brief visit.

  • With Lisa Gill we've been looking at jackdaw calls. "Where is the individuality encoded?" is a question various researchers have asked about animal sounds. With these jackdaws it's a great challenge to think about in computational (machine listening) terms, because jackdaws (like many corvids) have calls with complicated structure, sometimes creaky, sometimes harmonic, often a mixture. Did you know that songbirds have two sets of vocal folds, whereas humans have one? Well that certainly can make things tricky, if you're trying to use a standard harmonics-based or pitch-based analysis, or... to be honest most methods will trip up here for some reason or other. Not all songbirds use both sets of vocal folds noticeably at the same time but I suspect it's a big part of the complexity here. You also see period-doubling effects and the like - perhaps caused by dual voicing or perhaps by other control. I don't think there's that much known about the physiology/biomechanics of these particular vocalisations, nor the learned/volitional control.

    So, together with my student Veronica Morfi, we've applied some signal-processing methods to try and get a clearer view on Lisa's dataset of jackdaw calls. I think we've found some useful little improvements, learnt from each other, and it's been a good topic to have a go at together.

  • With Lisa as well as Mauricio Nicolas Adreani and Pietro d' Amelio we've been making use of the method I use for analysing timing influences in zebra finch communication networks. We've got some interesting results with one of Nico's datasets - all preliminary for now, so I'll leave it at that!
  • With Albertine Leitão we're having a look at zebra finch song tutoring, since our feature-learning method for bird classification has some properties that could potentially make it attractive for teasing out signatures of tutor influence on song patterns. This one is even more preliminary at the moment.

I'm staying on-site, and I've been lucky enough to catch the tail-end of the beautiful snowy weather, making it look like this:


Thanks to my hosts and collaborators for their involvement!

Monday 7th March 2016 | science | Permalink

Tracking fast frequency modulation (FM) in audio signals - a list of methods

When you work with birdsong you encounter a lot of rapid frequency modulation (FM), much more than in speech or music. This is because songbirds have evolved specifically to be good at it: as producers they have muscles specially adapted for rapid FM (Goller and Riede 2012), and as listeners they're perceptually and behaviourally sensitive to well-executed FM (e.g. Vehrencamp et al 2013).

Standard methods for analysing sound signals - spectrograms (or Fourier transforms in general) or filterbanks - assume that the signal is locally stationary, which means that when you consider a single small window of time, the statistics of the process are unchanging across the length of the window. For music and speech we can use a window size such as 10 milliseconds, and the signal evolves slowly enough that our assumption is OK. For birdsong, it often isn't, and you can see the consequences when a nice sharp chirp comes out in a spectrogram as a blurry smudge across many pixels of the image.
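You can see the stationarity assumption failing with a few lines of numpy: within a single 10 ms window, a fast chirp spreads its energy across far more spectrogram bins than a steady tone does. (A toy illustration, with arbitrary parameter choices:)

```python
import numpy as np

fs = 22050
t = np.arange(int(0.01 * fs)) / fs                   # one 10 ms analysis window
tone  = np.sin(2 * np.pi * 4000 * t)                 # steady 4 kHz tone
chirp = np.sin(2 * np.pi * (2000 + 300000 * t) * t)  # sweeps 2 kHz -> 8 kHz in 10 ms

def n_bins_for_90pct(x):
    """How many FFT bins are needed to hold 90% of the window's energy?"""
    mag2 = np.abs(np.fft.rfft(x * np.hanning(len(x)))) ** 2
    cum = np.cumsum(np.sort(mag2)[::-1]) / mag2.sum()
    return int(np.searchsorted(cum, 0.9)) + 1

# The tone's energy sits in a few bins; the chirp's is smeared over many.
assert n_bins_for_90pct(chirp) > n_bins_for_90pct(tone)
```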

So, to analyse birdsong, we'd like to analyse our signal using representations that account for nonstationarity. Lots of these representations exist. How can we choose?

If you're impatient, just scroll down to the Conclusions at the bottom of this blog. But to start off, let's state the requirements. We'd like to take an audio signal and convert it into a representation that:

  • Characterises FM compactly - i.e. FM signals as well as fixed-pitch signals have most of their energy represented in a similar small number of coefficients;
  • Handles multiple overlapping sources - since we often deal with recordings having multiple birds;
  • Copes with discontinuity of the frequency tracks - since not only do songbirds make fast brief audio gestures, but also, unlike us they have two sets of vocal folds which they can alternate between - so if a signal is a collage of chirpy fragments rather than a continuously-evolving pitch, we want to be able to reflect that;
  • Ideally is fairly efficient to calculate - simply because we often want to apply calculations at big data scales;
  • Does the transformation need to be invertible? (i.e. do we need a direct method to resynthesise a signal, if all we know is the transformed representation?) Depends. If we're interested in modifying and resynthesising the sounds then yes. But I'm primarily interested in extracting useful information, for which purposes, no.

Last year we published an empirical comparison of four FM methods (Stowell and Plumbley 2014). The big surprise from that was that the dumbest method was the best-performing for our purposes. But I've encountered a few different methods, including a couple that I learnt about very recently, so here's a list of methods for reference. This list is not exhaustive - my aim is to list an example of each paradigm, and only for those paradigms that might be particularly relevant to audio, in particular bird sounds.

  • Let's start with the stupid method: take a spectrogram, then at each time-point find out which frequency has the most energy. From this list of peaks, draw a straight line from each peak to the one that comes immediately next. That set of discontinuous straight lines is your representation. It's a bit chirplet-like in that it expresses each moment as a frequency and a rate-of-change of frequency, but any signal processing researcher will tell you not to do this. In principle it's not very robust, and it's not even guaranteed to find peaks that correspond to the actual fundamental frequency. In our 2014 paper we tested this as a baseline method, and... it turned out to be surprisingly robust and useful for classification! It's also extremely fast to compute. However, note that this doesn't work with polyphonic (multi-source) audio at all. For big data analysis it's handy to be able to do this, but I don't expect it to make any sense for analysing a sound scene in detail.
  • Chirplets. STFT analysis assumes the signal is composed of little packets, and each packet contains a sine-wave with a fixed frequency. Chirplet analysis generalises that to assume that each packet is a sine-wave with parametrically varying frequency (you can choose linearly-varying, quadratically-varying, etc). See chirplets on wikipedia for a quick intro. There are different ways to turn the concept of a chirplet into an analysis method, and several have been applied to bird sounds.
  • Filter diagonalisation method - an interesting method from quantum mechanics, FDM models a chunk of signal as a sum of purely exponentially decaying sinusoids. Our PhD student Luwei Yang recently applied this to tracking vibrato in string instruments. I think this is the first use of FDM for audio. It's not been explored much - I believe it satisfies most of the requirements I stated above, but I've no idea of its behaviour in practice.
  • Subspace-based methods such as ESPRIT. See for example this ESPRIT paper by Badeau et al. These are one class of sinusoidal tracking techniques, because they analyse a signal by making use of an assumed continuity from one frame to the next. In fact, this is a problem for birdsong analysis. Roland Badeau tested a birdsong recording for me and found that the very fast FM was a fatal problem for this type of method: the method simply needs to be able to rely on some relatively smooth continuity of pitch tracks, in order to give strong tracking results.
  • Fan chirp transform (Weruaga and Kepesi 2007) - when you take the FFT of a signal, we might say you analyse it as a series of "horizontal lines" in the time-frequency plane. The fan chirp transform tilts all these lines at the same time: imagine the lines, instead of being horizontal, all converge on a single vanishing point in the distance. So it should be particularly good for analysing harmonic signals that involve pitch modulation. Note that the angles are all locked together, so it's best for monophonic-but-harmonic signals, not polyphonic signals. My PhD student Veronica Morfi, before she joined us, extended the fan-chirp model to non-linear FM curves too: Morfi et al 2015.
  • Spectral reassignment methods. When you take the FFT of a signal, note that you analyse it as a series of equally-spaced packets on the frequency axis. The clever idea in spectral reassignment is to say, if we assume the packets weren't actually sitting on that grid, but we analysed them with the FFT anyway, let's take the results and move every one of those grid-points to an irregular location that best matches the evidence. You can extend this idea to allow each packet to be chirpy rather than fixed-frequency, so there you have it: run a simple FFT on a frame of audio, and then magically transform the results into a more-detailed version that can allow each bin to have its own AM and FM. This is good because it makes sense for polyphonic audio.
    • A particular example of this is the distribution derivative method (code available here). I worked with Sasho Musevic a couple of years ago, who did his PhD on this method, and we found that it yielded informative results for multiple birdsong tracking (Stowell et al 2013). Definitely promising. (In my later paper where I compared different FM methods, this gave a strong performance again. The main disbenefit, in that context, was simply that it took longer to compute than the methods I was comparing it against.) Also you have to make some peak-picking decisions, but that's doable. This summer, I did some work with Jordi Bonada and we saw the distribution derivative method getting very good precise results on a dataset of chaffinch recordings.
  • The Jančovič and Köküer method is a specific method for detecting modulated sinusoids by the specific shapes they create in a short-time spectrum. The method is specifically designed to take advantage of the behaviour of heavily-modulated tones, so unlike many other methods it should actually work particularly well on them. Nifty. I don't think it's been used by anyone else, though the authors themselves have incorporated their tracker into a bird classifier. I'd like to see this evaluated more.
  • There's lots of work on multi-pitch trackers, and it would be incomplete if I didn't mention that general idea. Why not just apply a multi-pitch tracker to birdsong audio and then use the pitch curves coming out from that? Well, as with the ESPRIT method I mentioned above, the methods developed for speech and music tend to build upon assumptions such as relatively long, smooth curves often with hard limits to the depth of FM that can exist.
  • How about feature learning? Rather than design a feature transform, we could simply feed a learning algorithm with a large amount of birdsong data and get it to learn what FM patterns exist in the audio. That's what we did last year in this paper on large-scale birdsong classification - that was based on spectrogram patches, but it definitely detected characteristic FM patterns. That representation didn't explicitly recover pitch tracks or individual chirplets, but there may be ways to develop things in that direction. In particular, there's quite a bit of effort in deep learning on "end-to-end" learning which asks the learning algorithm to find its own transformation from the raw audio data. The transformations learnt by such systems might themselves be useful representations for other tasks.
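For concreteness, the "stupid" baseline method at the top of this list fits in a few lines of numpy. (A sketch of the idea, not the exact code from our 2014 paper:)

```python
import numpy as np

def peak_track(signal, fs, n_fft=512, hop=256):
    """Per-frame spectral peak frequencies, returned as a sequence of
    (frequency, delta-frequency) pairs -- a crude chirplet-like sketch."""
    frames = np.array([signal[i:i + n_fft] * np.hanning(n_fft)
                       for i in range(0, len(signal) - n_fft, hop)])
    peak_bins = np.abs(np.fft.rfft(frames, axis=1)).argmax(axis=1)
    freqs = peak_bins * fs / n_fft                  # peak frequency per frame
    return np.stack([freqs[:-1], np.diff(freqs)], axis=1)

# e.g. a downward chirp gives a track whose delta-frequencies are negative
fs = 22050
t = np.arange(fs // 2) / fs
chirp = np.sin(2 * np.pi * (6000 - 4000 * t) * t)   # sweeps 6 kHz down to 2 kHz
track = peak_track(chirp, fs)
assert np.median(track[:, 1]) < 0
```

Note the argmax per frame: nothing guarantees the peak is the fundamental, and with two birds singing at once it's meaningless - exactly the limitations described above.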


So.... It's too soon to have conclusions about the best signal representations for FM in birdsong. But out of this list, the distribution derivative method is the main "off-the-shelf" tool that I'd suggest for high-resolution bird FM analysis (code available here), while feature-learning and filter diagonalisation and the Jančovič and Köküer method are the approaches that I'd like to see more research on.

At the same time, I should also emphasise that machine learning methods don't need a nice clean understandable representation as their input. Even if a spectrogram turns birdsong into a blur when you look at it, that doesn't necessarily mean it shouldn't be used as the input to a classifier. Machine learning often has different requirements than the human eye.

(You might think I'm ignoring the famous rule garbage in, garbage out when I say a classifier might work fine with blurry data - well, yes and no. A spectrogram contains a lot of high-dimensional information, so it's rich enough that the crucial information can still be embedded in there. Even the "stupid method" I mentioned, which throws away so much information, preserves something of the important aspects of the sound signal. However, modern classifiers work well with rich high-dimensional data.)

But if you're trying to do something specific such as clearly characterise the rates of FM used by a particular bird species, a good representation will help you a lot.

Tuesday 8th December 2015 | science | Permalink

Reading list: excellent papers for birdsong and machine learning

I'm happy to say I'm now supervising two PhD students, Pablo and Veronica. Veronica is working on my project all about birdsong and machine learning - so I've got some notes here about recommended reading for someone starting on this topic. It's a niche topic but it's fascinating: sound in general is fascinating, and birdsong in particular is full of many mysteries, and it's amazing to explore these mysteries through the craft of trying to get machines to understand things on our behalf.

If you're thinking of starting in this area, you need to get acquainted with: (a) birds and bird sounds; (b) sound/audio and signal processing; (c) machine learning methods. You don't need to be expert in all of those - a little naivete can go a long way!

But here are some recommended reads. I don't want to give a big exhaustive bibliography of everything that's relevant. Instead, some choice reading that I have selected because I think it satisfies all of these criteria: each paper is readable, is relevant, and is representative of a different idea/method that I think you should know. They're all journal papers, which is good because they're quite short and focused, but if you want a more complete intro I'll mention some textbooks at the end.

  • Briggs et al (2012) "Acoustic classification of multiple simultaneous bird species: A multi-instance multi-label approach"

    • This paper describes quite a complex method but it has various interesting aspects, such as how they detect individual bird sounds and how they modify the classifier so that it handles multiple simultaneous birds. To my mind this is one of the first papers that really gave the task of bird sound classification a thorough treatment using modern machine learning.
  • Lasseck (2014) "Large-scale identification of birds in audio recordings: Notes on the winning solution of the LifeCLEF 2014 Bird Task"

    • A clear description of one of the modern cross-correlation classifiers. Many people in the past have tried to identify bird sounds by template cross-correlation - basically, taking known examples and trying to detect if the shape matches well. The simple approach to cross-correlation fails in various situations, such as natural variation in the sounds. The modern approach, introduced to bird classification by Gabor Fodor in 2013 and developed further by Lasseck and others, still uses cross-correlation, but not to guess the answer directly: it uses it to generate new features that get fed into a classifier. At the time of writing (2015), this type of classifier is the type that tends to win bird classification contests.
  • Wang (2003), "An industrial strength audio search algorithm"

    • This paper tells you how the well-known "Shazam" music recognition system works. It uses a clever idea about what is informative and invariant about a music recording. The method is not appropriate for natural sounds but it's interesting and elegant.

      Bonus question: Take some time to think about why this method is not appropriate for natural sounds, and whether you could modify it so that it is.

  • Stowell and Plumbley (2014), "Automatic large-scale classification of bird sounds is strongly improved by unsupervised feature learning"

    • This is our paper about large-scale bird species classification. In particular, a "feature-learning" method which seems to work well. There are some analogies between our feature-learning method and deep learning, and also between our method and template cross-correlation. These analogies are useful to think about.
  • Lots of powerful machine learning right now uses deep learning. There's lots to read on the topic. Here's a blog post that I think gives a good introduction to deep learning. Also, for this article DO read the comments! The comments contain useful discussion from some experts such as Yoshua Bengio. Then after that, this recent Nature paper is a good introduction to deep learning from some leading experts, which goes into more detail while still at the conceptual level. When you come to do practical application of deep learning, the book "Neural Networks: Tricks of the Trade" is full of good practical advice about training and experimental setup, and you'll probably get a lot out of the tutorials for the tool you use (for example I used Theano's deep learning tutorials).

    • I would strongly recommend NOT diving in with deep learning until you have spent at least a couple of months reading around different methods. The reason for this is that there's a lot of "craft" to deep learning, and a lot of current-best-practice that changes literally month by month, and anyone who gets started could easily spend three years tweaking parameters.
  • Theunissen and Shaevitz (2006), "Auditory processing of vocal sounds in birds"

    • This one is not computer science, it's neuroscience - it tells you how birds recognise sounds!

      A question for you: should machines listen to bird sounds in the same way that birds listen to bird sounds?

  • O'Grady and Pearlmutter (2006), "Convolutive non-negative matrix factorisation with a sparseness constraint"

    • An example of analysing a spectrogram using "non-negative matrix factorisation" (NMF), which is an interesting and popular technique for identifying repeated components in a spectrogram. NMF is not widely used for bird sound, but it certainly could be useful, maybe for feature learning, or for decoding, who knows - it's a tool that anyone analysing audio spectrograms should be aware of.
  • Kershenbaum et al (2014), "Acoustic sequences in non-human animals: a tutorial review and prospectus"

    • A good overview from a zoologist's perspective on animal sound considered as sequences of units. Note, while you read this, that sequences-of-units is not the only way to think about these things. It's common to analyse animal vocalisations as if they were items from an alphabet "A B A BBBB B A B C", but that way of thinking ignores the continuous (as opposed to discrete) variation of the units, as well as any ambiguity in what constitutes a unit. (Ambiguity is not just failure to understand: it's used constructively by humans, and probably by animals too!)
  • Benetos et al (2013), "Automatic music transcription: challenges and future directions"

    • This is a good overview of methods used for music transcription. In some ways it's a similar task to identifying all the bird sounds in a recording, but there are some really significant differences (e.g. the existence of tempo and rhythmic structure, the fact that musical instruments usually synchronise in pitch and timing whereas animal sounds usually do not). A big difference from "speech recognition" research is that speech recognition generally starts from the idea of there just being one voice. The field of music transcription has spent more time addressing problems of polyphony.
  • Domingos (2012), "A few useful things to know about machine learning"

    • Lots of sensible, clearly-written advice for anyone getting involved in machine learning.
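To make the template cross-correlation idea from the Lasseck entry above concrete, here is a minimal sketch of my own (illustrative only, not Lasseck's actual pipeline): each known template yields one feature - the best normalised match score against a spectrogram - and the resulting feature vector then feeds a generic classifier rather than being thresholded directly.

```python
import numpy as np

def xcorr_feature(spec, template):
    """Best normalised cross-correlation between a spectrogram and a
    template (both magnitude spectrograms, freq x time). One such value
    per template gives a feature vector for a downstream classifier."""
    f, t = template.shape
    tnorm = (template - template.mean()) / (template.std() + 1e-9)
    best = 0.0
    for offset in range(spec.shape[1] - t + 1):
        patch = spec[:f, offset:offset + t]
        pnorm = (patch - patch.mean()) / (patch.std() + 1e-9)
        best = max(best, float((tnorm * pnorm).mean()))
    return best
```

A perfect match scores close to 1, while unrelated audio scores much lower; the classifier (e.g. a random forest) then learns which combinations of template scores indicate which species.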


And the promised textbooks:

  • "Machine learning: a probabilistic perspective" by Murphy
  • "Nature's Music: the Science of Birdsong" by Marler and Slabbekoorn - a great comprehensive textbook about bird vocalisations.
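Since NMF comes up in the O'Grady and Pearlmutter entry above, here is the basic decomposition in a few lines, using the classic Lee and Seung multiplicative updates (a sketch of plain NMF, not their convolutive, sparseness-constrained variant):

```python
import numpy as np

def nmf(V, rank, iters=200, seed=0):
    """Plain NMF via Lee & Seung multiplicative updates (Euclidean cost).
    V: non-negative matrix, e.g. a magnitude spectrogram (freq x time).
    Returns W (spectral bases) and H (activations) with V ~= W @ H."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], rank)) + 1e-3
    H = rng.random((rank, V.shape[1])) + 1e-3
    for _ in range(iters):
        # Multiplicative updates keep W and H non-negative throughout
        H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
        W *= (V @ H.T) / (W @ H @ H.T + 1e-9)
    return W, H
```

The columns of W come out as recurring spectral shapes, and the rows of H say when each shape occurs - which is exactly the "identifying repeated components" idea mentioned above.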
Friday 13th November 2015 | science | Permalink

Getting neural networks and deep learning right for audio (WASPAA/SANE)

I'm just back from a conference visit to the USA, to attend WASPAA and SANE. Lots of interesting presentations and discussions about intelligent audio analysis.

One of the interesting threads of discussion was about deep learning and modern neural networks, and how best to use them for audio signal processing. The deep learning revolution has already had a big impact on audio: famously, deep learning gets powerful results on speech recognition and is now used pervasively in industry for that task. It's also widely studied in image and video processing.

But that doesn't mean the job is done. Speech recognition is only one of many ways we get information out of audio, and other "tasks" are not direct analogies: they have different types of inputs and outputs. Secondly, there are many different neural net architectures, and there is still much lively research into which architectures are best for which purposes. Part of the reason that big companies get great results for speech recognition is that they have masses and masses of data. In cases where we have modest amounts of data, or data without labels, or data with fuzzy labels, getting the architecture just right is an important thing to focus on.

And audio signal processing insights are important for getting neural nets right for audio. This was one of the main themes of Emmanuel Vincent's WASPAA keynote, titled "Is audio signal processing still useful in the era of machine learning?" (Slides.) He mentioned, for example, that the intelligent application of data augmentation is a good way for audio insight to help train deep nets well. I agree, but in the long term I think the more important point is that our expertise should be used to help get the architectures right. There's also the thorny question (and hot topic in deep learning) of how to make sense of what deep nets are actually doing: in a sense this is the flip-side of the architecture question, making sense of an architecture once it's been found to work!
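As an aside, audio data augmentation is easy to sketch. The particular transformations below (time shift, gain jitter, background-noise mixing) are common illustrative choices of my own, not a recipe from the keynote:

```python
import numpy as np

def augment(clip, noise_bank, rng):
    """Return one randomly augmented variant of a 1-D audio clip:
    circular time shift, gain change, and background noise mixed in
    at a random level. Illustrative sketch only."""
    out = np.roll(clip, rng.integers(len(clip)))          # time shift
    out = out * rng.uniform(0.5, 1.5)                     # gain jitter
    noise = noise_bank[rng.integers(len(noise_bank))]
    out = out + rng.uniform(0.0, 0.3) * noise[:len(out)]  # add noise
    return out
```

The "intelligent application" part is choosing transformations under which the label is genuinely invariant - a shifted, quieter bird call is still the same bird, but a pitch-shifted one might not be.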

It's common knowledge that convolutional nets (ConvNets) and recurrent neural nets (specifically LSTMs) are powerful architectures, and in principle LSTMs should be particularly appropriate for time-series data such as audio. Lots of recent work confirms this. At the SANE workshop Tuomas Virtanen presented results showing strong performance at sound event detection (recovering a "transcript" of the events in an audio scene), and Ron Weiss presented impressive deep learning that could operate directly from raw waveforms to perform beamforming and speech recognition from multi-microphone audio. Weiss was using an architecture combining convolutional units (to create filters) and LSTM units (to handle temporal dependencies). Pablo Sprechmann discussed a few different architectures, including one "unfolded NMF"-type architecture. (The "deep unfolding" approach is certainly a fruitful idea for deep learning architectures, introduced a couple of years ago by Hershey et al. [EDIT: It's been pointed out that the unfolding idea was first proposed by Gregor and LeCun in 2010, and unfolded NMF was described by Sprechmann et al. in 2012. The contribution of Hershey et al. comes from the curious step of untying the unfolded parameters, which turns a truncated iterative algorithm into something more like a deep network.])

I'd like to focus on a couple of talks at SANE that exemplified how domain issues inform architectural issues:

  • John Hershey presented "Deep clustering: discriminative embeddings for single-channel separation of multiple sources". The task being considered was source separation of two or more speaking people recorded in a mixture, which is usually handled by applying binary masking to a spectrogram of the mixture. The task then becomes how to identify which "pixel" of the spectrogram should be assigned to which speaker. In some sense, it's a big multilabel classification task, with each pixel needing a label. Except as John pointed out, it's not really a classification task but a clustering task, because when we get a mixture of two speakers and we want to separate them, we usually have no prior labels and no reason to care who is "Speaker 1" and who is "Speaker 2". Motivated by this, Hershey described an approach where a deep learning system is trained to cluster the pixels in a latent space. The objective function happens to be the same as the K-means objective, except that instead of learning which items go in which cluster, the net is being trained to move the items around in the latent space so that the cluster separation is maximised. (The work is described in this arxiv preprint.)
  • Paris Smaragdis presented "NMF? Neural Nets? It’s all the same...". Smaragdis is well-known for his work on NMF (non-negative matrix factorisation) methods. He presented a great narrative arc of how you might start with NMF and then throw away the things that irritate you about it - such as the spectrogram, instead working with convolutional filters learnt from raw audio.

    (Note this other recurring theme: I already mentioned that Ron Weiss was also talking about waveform-based methods. Others have worked on this before, such as Sander Dieleman's paper on "end-to-end" deep learning for music audio. It's still not clear if ditching the spectrogram is actually that beneficial. Certainly if you do, you need lots of data in order to train successfully, as Weiss demonstrated empirically. I don't think I'd recommend ditching the spectrogram yet unless you're really sure what you're doing...)

    The really surprising thing about Smaragdis' talk (given his previous work) was that by the time he'd deconstructed NMF and built up a neural net having similar source-separation properties, the end result was a recognisable autoencoder - however, with a nicely principled architecture, and also some specific modifications (some "skip" connections and choices about tying/untying parameters). This autoencoder is not the same as NMF - it doesn't have the same non-negativity constraints, for example - but is inspired by similar motivations.
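The deep clustering objective is compact enough to write down. As I understand the preprint, the net maps each spectrogram pixel to an embedding vector, and the loss compares pairwise affinities of embeddings against those of the ideal assignments - which is what makes it invariant to permuting the speaker labels (the function name below is mine):

```python
import numpy as np

def deep_clustering_loss(V, Y):
    """V: (n_pixels x d) embeddings produced by the net.
    Y: (n_pixels x k) one-hot ideal speaker assignments.
    Compares pairwise affinity matrices, so swapping the columns of Y
    (i.e. relabelling the speakers) leaves the loss unchanged."""
    return float(np.linalg.norm(V @ V.T - Y @ Y.T, "fro") ** 2)
```

(In practice one expands the norm algebraically to avoid forming the n x n affinity matrices; the direct form above is just for clarity. Training pushes embeddings of same-speaker pixels together and different-speaker pixels apart, after which plain K-means in the embedding space recovers the masks.)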

Happily, these discussions relate to some work I've been involved in this year. I spent some time visiting Rich Turner in Cambridge, and we had good debates and a small study about how to design a neural network well for audio. We have a submitted paper about denoising audio without access to clean data using a partitioned autoencoder which is the first fruit of that visit. The paper focuses on the "partitioning" issue but the design of the autoencoder itself has some similarities to what Paris Smaragdis was describing, and for similar reasons.

There's sometimes a temptation to feel despondent about deep learning: the sense of foreboding that a generic deep network will always beat your clever insightful method, just because of the phenomenal amounts of data and compute-hours that some large company can throw at it. All of the above discussions feed into a more optimistic interpretation, that domain expertise is crucial for getting machine-learning systems to learn the right thing - as long as you can learn to jettison some of your cherished habits (e.g. MFCCs!) at the right moment.

Monday 26th October 2015 | science | Permalink

Detection and Classification of Acoustic Scenes and Events - paper now out

Our journal paper Detection and Classification of Acoustic Scenes and Events is now out in IEEE Transactions on Multimedia! It evaluates many different methods for detecting/classifying in everyday audio recordings.

I'm highlighting this paper because it covers the whole process of the IEEE DCASE evaluation challenge that we ran a little while ago, with many international research teams submitting systems either for audio event detection or audio scene classification.

It was a big team effort, with various people putting many months of time in, from 2012 through to 2015 (even though it was essentially an unfunded initiative!). Specific thanks to Dimitrios and Emmanouil, who I know put lots of manual effort in, repeatedly, to get this right.

Thursday 17th September 2015 | science | Permalink

IBAC 2015 - some thoughts on conference organisation

The International Bioacoustics Congress 2015 was a fantastic conference. Lots of fascinating research, in a great place (Murnau, Bavaria, Germany), and very well organised! In this note I want to capture some thoughts that it triggered, about the practical organisation of a conference.

The staff who facilitated the conference made it run very smoothly. There were helpful people in the downstairs office almost all week, available to field questions and so on. I particularly appreciated the facilitation for conference speakers: downstairs, the organisers loaded our presentations onto the laptop and checked they worked; then upstairs, there was a sound engineer who very efficiently fitted us with the radio mic and opened the presentations. This kind of support was crucial to make it possible to have such a busy schedule: many sessions had only 15 minutes per speaker! So no time for messing around.

Various IBAC people said, and I agree, that it's vital to keep it as a single-track conference: that seems to be part of its friendly community atmosphere. This is tricky, as IBAC has grown so that the schedule is now tightly-packed, and one "easy" way to reduce the pressure would be to go multi-track. I suspect the biggest risk there is of splitting the community into taxa (birds, marine, anurans, etc). So if parallel sessions were to be used (not my preferred solution), it'd be better to do that with the "open" rather than themed sessions, as someone at the AGM suggested. (The mix of open and themed sessions was well-balanced here in 2015.)

Every day opened with a 60-minute keynote, which is a great and widely-used pattern. We then had 20-minute slots in the themed sessions, and 15-minute slots in the open sessions. In my home discipline I've never seen 15-minute talk slots, and I think that's too short. I think that 20-minute slots are good, as long as the chair insists on keeping some time for questions, since I personally believe that public discussion with conference speakers is a really important part of what conference presentations are for. The IBAC chairs didn't insist on this at all really, which is a shame. That aside, the sessions were well hosted.

The poster sessions were lively and very interesting, but physically they were too full! It was often very difficult to even read the titles of posters, let alone talk to the person standing there, if one or two people were discussing a nearby poster. This could have been improved by having 4 separate sessions of 40 posters, rather than 2 sessions of 80 which were each repeated for two days.


So, as I've already implied, IBAC was very highly subscribed, with many talks and posters, and I've been suggesting it could be better if the programme was a bit less tightly-packed. How could this be done (without going multi-track)? One answer is to be more selective, i.e. to accept fewer abstracts. Immediately I want to highlight a risk of this: it's great at IBAC to have lots of student and early-postgrad presenters, so we would want to avoid a selection process that favoured big names or experienced abstract-submitters. (We'd also want to maintain a decent balance across taxa.) I'd suggest a simple quota: minimum 50% student or recently-graduated people, both for talks and for posters.

Being selective has a cost: interesting things get rejected. The quality of IBAC 2015 was high, so there's no need to be selective purely for quality purposes. IBAC is currently every two years. I wonder if the IBAC community would be interested in having IBAC every year? There's clearly enough content for that. Would it suit the rhythm of the community? Could the IBAC steering committee cope with the doubled workload?


I find a printed programme absolutely essential. The 2015 organisers decided that many people don't want it because they use electronic versions, so printing it would be wasteful. That's fine, but some of us still need something on paper. I think the ideal would be simply to have a tick-box on the conference registration form, "Would you like a printed programme?" Simple to handle, and it reduces unnecessary printing.


A few other miscellaneous thoughts:

  • The social event in the middle of the week was excellent (I went on the hike). Also it's crucial to have something like that to give the week a good rhythm. For me that kind of thing is much more important than the "conference dinner". (After all, most of us are dining together each night. I've always found the "conference dinner" a slightly odd ritual.) Having said that, we did have a lovely dinner and a good dance afterwards.
  • Someone mentioned the issue of gender balance. Certainly IBAC had a good gender balance compared against many of the computer science conferences I go to. It's true that the keynote speakers were more men than women. The organising committee told us they had tried to get an even balance - it's always difficult, because female keynotes are often in high demand. In my judgment IBAC did well. I like the suggestion made during the AGM that gender balance of keynotes should be officially made an aim for IBAC organisers, without having to force it as an absolute requirement.
  • Someone suggested everyone should put their photo on their poster - it helps people know who to look for! This is a good idea, though we must remember that some people have personal reasons not to print their photo. So I'd suggest the organisers should "strongly recommend" all presenters add their photo into the top of their poster.
  • The IBAC organisers said that they'd deliberately chosen a small town (and a beautiful one), so that people would tend to stay together, meeting each other in the cafes etc. This is a clever idea.

Of course almost everything I've written is about general conference organisation, not just IBAC. These thoughts are spurred by conversations we had at IBAC, and spurred by the overall extremely good conference organisation. Massive thanks to the IBAC 2015 organisers and staff!

P.S. I previously blogged about the research at IBAC 2015.

Monday 14th September 2015 | science | Permalink

IBAC 2015 - bioacoustics research

The International Bioacoustics Congress 2015 was a fantastic conference. Lots of fascinating research, in a great place (Murnau, Bavaria, Germany), and very well organised. Here I'm making some notes on the interesting research topics I encountered. I can't list everything because almost everyone at the conference was doing something fascinating! What a niche this is ;)

This was my first IBAC. I'd say the majority of people were animal communication or animal behaviour researchers, plus ecologists, sound archivists, a composer or two, a couple of industry people and a couple of computer scientists. (I didn't spot any acousticians/physicists; I'd wondered if I would.) Lots of great people talking about animal sounds.

My own presentations went down well, I'm pleased to say. I had a talk about our Warblr bird sound recogniser (here's the journal paper, Stowell and Plumbley (2014)), and a poster about inferring the communication network underlying the timing of animal calls. (The latter prompted lots of good conversation about whether cross-correlation is a good tool for the job. My answer: it's perfectly fine for pairs. For larger groups it's tolerable if you have enough data, but I have a better way... need to write it up.)
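For pairs, the cross-correlation analysis I mean is essentially a cross-correlogram of two point processes - counting, for each lag, how often one bird's calls follow the other's. A minimal sketch (the bin width and lag range here are arbitrary illustrative choices):

```python
import numpy as np

def call_xcorr(times_a, times_b, binsize=0.1, maxlag=2.0):
    """Cross-correlogram of two call-time lists (seconds). For each lag
    bin, count how often a call by B occurs at that delay after a call
    by A; a peak at positive lag suggests B tends to answer A."""
    nlag = int(maxlag / binsize)
    counts = np.zeros(2 * nlag + 1)
    for ta in times_a:
        for tb in times_b:
            lag = tb - ta
            if -maxlag <= lag <= maxlag:
                counts[int(round(lag / binsize)) + nlag] += 1
    lags = (np.arange(2 * nlag + 1) - nlag) * binsize
    return lags, counts
```

For a proper hypothesis test you'd compare the observed peak against a null distribution, e.g. from jittered or shuffled call times - a flat correlogram under shuffling is what independence looks like.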

My colleague Rob Lachlan presented his really neat work on vocal learning in chaffinches. Apparently chaffinch syllable transmission is one of the most precise cultural transmission processes that's yet been quantified. I'd imagine he could tell you more about how that might relate to questions of the birds' innate biases etc.

Now here are some good things that were new to me. (Note that I'm quite a bit biased towards birds rather than the other taxa.) I'll save all the zebra finch items until the end since they're interrelated and something I'm currently thinking about. First, miscellaneous highlights:

  • Stefan Schoeneich described tracing the exact set of neurons responsible for call detection in a cricket species. This was fascinating - it's only a handful of neurons, so how do the crickets do it? Stefan described how post-inhibitory rebound is a crucial piece of the puzzle since it's a very simple neural phenomenon that provides a "delay line" that the cricket uses to detect repetition. The important thing is that this delay line is the same mechanism in the caller and the listener. This enables co-adaptation: evolution can change the repeat rate without breaking the communication channel. (Rohini Balakrishnan told me afterwards that this is not a new idea, though it's novel to me - Stefan's contribution is to demonstrate the exact network that uses this mechanism.)

  • Diego Llusia presented a playback experiment to modify the timing of dawn choruses. Interesting to see this: playbacks often involve a single species, but this investigated the timing of a whole assemblage of chorusing bird species. The study raised lots of good questions - it'd be good to see more development of this line of inquiry.

  • There was a good session on female vocalisations (led by Michelle Hall). From a European-biased perspective we often think of birdsong as being largely a male preserve. Karen Odom talked about the patterns of usage in one species (troupials). The main thing I note is actually her finding published last year (Odom et al 2014, Nature Communications) that female song is highly likely to be ancestral in songbirds, i.e. the reason it's seen less often in the northern hemisphere is that it was dropped (multiple separate times) by evolution, as songbirds radiated north. Lauryn Benedict then discussed why this might be. Maybe we can find correlates in life history - i.e. maybe the songbirds that dropped female song concomitantly developed some other communication or behavioural pattern, and this might help us understand what happened? Lauryn's study found no correlation either with migration or dichromatism. She noted that studying this is tricky because although lots of songbirds are described as having no female song, in many cases this might be due to our own biases and failure to spot it (especially in non-dimorphic species). Lauryn showed that her lack of correlation was robust to this issue.

  • Coen Elemans showed his work on physical modelling of the songbird syrinx. He found that the "myoelastic aerodynamic" model (developed in the context of the human larynx) works well for the syrinx. This was a surprise to me, since many songbirds have two oscillators in the syrinx rather than one, and I would have suspected the model might noticeably fail to account for interactions between them. It seems his model is tested for bird species with relatively independent sets of vocal folds, so maybe this suspicion is yet to be fully tested.

  • Lots of interesting discussion around acoustic diversity indices during the ecoacoustics session (led by Jerome Sueur). I remain to be convinced that we have robust useful "acoustic index" measurements directly from the audio signal without heavy user configuration. In that context it was interesting to hear from the experience of others. For example Nadia Pieretti found the ACI useful and robust for her shallow marine soundscapes, while Gianni Pavan working with forest soundscapes found it too strongly affected by weather sound (wind, rain).

  • Karla Rivera-Caceres showed that when plain wrens develop a duet code - meaning a specific choice of syllables to combine into their duet - it's due to learned association between the syllables, and not a private code designated by individual ID.

  • Karen Rowe talked about automatic detection in practice - really interesting from my point of view, to see how people fare when they use automatic detectors for their immediate practical work. She had deployed Songmeters in the Grampians, using an occupancy framework, which means that they only need to know presence/absence not the whole set of calls - a single positive detection is all that's needed. They tested a two-pass approach with an initial detection pass, then a second pass using some of the already-detected syllables as templates. They found that the manual work involved (in checking false positives, tweaking the classifier etc) meant that the automation was not in fact more efficient! In their case it was approximately as efficient to do fully manual annotation.

  • Peter Slater gave an evocative talk on their study of many wren species. He noted various things about duetting, and male and female song, finding that these traits correlate with phylogeny. It seems wrens have, multiple times, developed introductory phrases to lead in to duets - that's an interesting fact, food for thought.

  • Andrea Thibault showed us the behaviour of foraging seabirds, and the calls they make just before diving - apparently to warn others of the impending dive.

  • Lisa Gill showed a poster on jackdaw "addressing" call. We (with Rob too) had a good chat about how to computationally analyse corvid "caw"-like sounds - still very tricky and non-obvious. Lisa also told me about her paper just accepted for eLife about zebra finch social networks and call patterns - very pertinent to me! Look forward to reading it.

  • A nice session on comparative work with music, speech and language (led by Carel ten Cate). Marisa Hoeschele explained that songbirds are - in general - sensitive to absolute pitch rather than relative pitch. They're much easier to train to discriminate absolute pitch variation than relative variation. (This is notably unlike humans!) She then showed her experimental evidence that black-capped chickadees can do relative pitch discrimination, but they're much better at it when the stimuli are made of chickadee syllables rather than pure sinewaves. Particularly interesting since the chickadee syllables are fairly pure-tone, not harmonic stacks, so the difference might not be the presence of harmonics. It also shows that a simple pitch-following model is not sufficient to explain their good performance; there must be some other attribute that makes things accessible to them.

  • Vera Klimsova gave us all a lesson in how to listen like an impala, to alarm calls from other species (including other species that don't live near impalas). She also gave us all a lesson in how to do a talk when you're the last speaker of a 5-day conference - an entertaining and memorable talk!

Now the zebra-finch-based research:

  • Solveig Mouterde described her work on how zebra finch calls degrade as they propagate through the environment, and how that affects individual recognition, both for zebra finch listeners and for machines. I'd like to see more of this kind of work because I think there are still many issues that are not completely addressed by some of the older bioacoustic concepts. For example Solveig referred to "active space" - a useful concept, but one that needs to incorporate the complexities of perceptual and acoustic variation before it really gets to the issue of how far an animal can be heard. Solveig's work goes towards addressing that.

  • Pietro d'Amelio talked about duetting in zebra finch mate pairs, showing very consistent antiphonal calling patterns, some symmetrical, some asymmetrical.

  • Had a good chat with Manfred Gahr and Albertine Leitao about how to measure tutored vocal learning in zebra finches. I have an idea that it could be done usefully with feature learning, which would be good to study some time.

  • Nicole Geberzahn studied how individuality emerges in zf song, through an experiment with many tutors who were themselves all taught from the same song. She found that new phrases emerged by mechanisms such as repeating whole phrases or adding call-like syllables onto the end. In a recognition test, zf listeners heard individual identity to be encoded in syllable details, not in phrase structure.

  • Andries ter Maat presented the work of his student Hanneke Poot, finding that pupil syllables are often not shared with any of the tutors, and complete copies of tutor songs are very rare. (Unlike Nicole's test mentioned above, in this case the tutors were quite varied.) Also that zfs don't particularly choose their genetic or social father to learn from. He also noted that Tchernichovski's sound similarity measures (as calculated by Sound Analysis Pro, that is) can depend strongly on the syllable type, so you need to apply some kind of standardisation procedure if you want to make global similarity comparisons.

  • Marie S A Fernandez looked at the calling patterns of zf pairs when they are together, separated, and then reunited. She found that the cross-correlation or Markov analysis found strong back-and-forth structure only while the birds were separated. (I wonder: if we could include all visual and other cues, would there in fact be a detectable structure in all cases? It would be a different structure with/without visual contact, presumably. Very hard to annotate all possible multimodal cues though.) See Perez et al (2015).

  • Clementine Vignal studied zf negotiation over parental care, finding that the length of some zf conversations could predict the subsequent balance of parental care. (This correlation was over and above the obvious factors such as how much nest-time each parent had recently spent.)

  • Buddhamas Pralle Kriengwatana presented an experiment in which zebra finches were trained to discriminate very short audio clips of human "i"/"e" vowels. She showed that once trained, the zfs can generalise to clips from another language (with slightly different formant positions), which demonstrates a generalisation ability that is not just about formant frequencies, possibly some relative rather than absolute distinction. For me there's a niggling question: formants are not the only way that vowels differ - there's also aspiration etc - so I'd be interested to know how such confounds were avoided when using real speech recordings. Pralle's suggestion seems plausible, though, that the ability could be explained by a perceptual mechanism based on using the sound to infer some physical trait such as the volume of the mouth cavity.

To all at IBAC: my apologies if I misrepresent you here, missed you out, or misspelt your name! In particular I didn't manage to see much of the second poster session since I was myself presenting a poster.

At the end of the conference there was an organised visit to MPIO Seewiesen, where a lot of good bird studies are happening. I was most struck by the magnificent ravens, living in outdoor aviaries and showing off their awesome vocal skills.

What else? Well, lots more. A great hike organised in the wetlands around Murnau (Murnauer Moos). Bavarian beer and food. The mountains as a backdrop...

Monday 14th September 2015 | science | Permalink

Warblr bird app launched today

Today we launched Warblr, our app for automatically recognising the sounds of the UK's hundreds of bird species.

It's £3.99, and it's in the Apple App Store here.

It's built using our research here at QMUL. The research was funded by the EPSRC - they funded me to do the basic research, and they also funded the "innovation" grant that helped turn it into software people can use on their phone.

One question you might ask: if it's based on public research funding, why is it a paid app? We're going with a spin-out model, creating a business (a social enterprise with open data and conservation goals) and we believe that's a good route to making it sustainable. The basic research is publicly available to all.

I'm particularly happy to see the Guardian did a head-to-head test of our app and another one. Yes, they agreed our app was better :) but the broader point is that this research on machine learning and sound is now reaching the point where, like speech recognition, it grows beyond a research idea and becomes an everyday tool people can use.

The data we used during development: Xeno Canto, the big crowdsourced bird sound database, has been invaluable. And more recently the British Trust for Ornithology also very kindly allowed us to use some of their bird monitoring data (collected by thousands of volunteers over decades) as part of the recognition process.

The data we collect: we shall see! But a big motivation for this endeavour is to collect audio as well as geospatial data, that can help research and one day will also help organisations such as the BTO to monitor bird conservation.

It's been interesting getting to this point. Thanks to all who helped us on our way, including my business partner Florence Wilkinson who's been working tirelessly on this. And a personal thanks from me to Mark Plumbley for his enthusiastic support and discussion all through the early stages of this research!

Thursday 13th August 2015 | science | Permalink

Cambridge research visit: birds, NMF, auto-encoders

For two months I've been visiting Richard Turner and the Machine Learning Group at Cambridge University. It's been a very stimulating visit. As part of my fellowship applying machine learning to bird sounds, this was planned as a time to think about methods appropriate for the various purposes for which we want to analyse bird sounds - in particular given the constraints of uncontrolled audio recorded in the wild.

We considered approaches derived from non-negative matrix factorisation (NMF), from convolutional neural networks (ConvNets), and from neural spiking models.

NMF is conceptually simple and easy to optimise, and there have been some interesting recent extensions to hierarchical representations and so forth, which might allow for a structured decomposition of an audio scene. One thing I'd love to do is augment NMF with Markov renewal process temporal modelling, and it looked like Cemgil-style NMF would give us a way to inject that in as a prior on the activation patterns, but then we found a hole in our maths which meant it wasn't going to give us that. NMF models are interesting and very clear, but it's not always obvious when your problem will admit a cute algorithm to solve it. Still lots of interesting things one can do with NMF.

We then put most of our time into looking at convolutional auto-encoders (ConvAEs). As with the rest of the neural net renaissance, these offer very flexible ways to model data. An auto-encoder is good for unsupervised learning, and has a lot of potential for learning useful representations of data, given appropriate constraints. These have been used for all sorts of purposes, and occasionally for audio.
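
To make the "constraints" point concrete: the simplest possible autoencoder - linear, with a bottleneck narrower than the data - has a closed-form optimum given by PCA. A convolutional autoencoder generalises this with nonlinearities, weight sharing and convolution, but the bottleneck principle is the same. Here's a minimal sketch with toy data (the data and names are mine, nothing to do with the birdsong experiments):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy data: 2-D points scattered near a 1-D line, so a 1-D bottleneck
# code can capture almost all of the structure
X = rng.normal(size=(200, 1)) @ np.array([[1.0, 0.5]]) + 0.01 * rng.normal(size=(200, 2))
mu = X.mean(axis=0)

# the optimal *linear* autoencoder is just PCA: encode onto the top
# principal direction, then decode back out again
_, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
encode = lambda x: (x - mu) @ Vt[:1].T   # 2-D point -> 1-D code
decode = lambda z: z @ Vt[:1] + mu       # 1-D code -> 2-D point

mse = np.mean((decode(encode(X)) - X) ** 2)  # small: the bottleneck suffices
```

The interesting cases are of course the ones where no such closed form exists - nonlinear encoders, convolutional weight sharing, noisy inputs - which is where the iterative training of ConvAEs comes in.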

Some interesting recent papers look at how to get a structured/semantic representation out of an autoencoder. This is often helped by having speech/vision datasets which are highly structured themselves (e.g. a photo of the same face from many angles and many lighting conditions). With natural birdsong we don't really have that opportunity, so the interesting question is whether we can design a system to do something along those lines despite the uncontrolled (and often unlabelled!) data.

I'm not going to say too much about the method here because the work isn't finished, but here's a work-in-progress image, showing (in the top row) a spectrogram of some birdsong contaminated by background noise. In the lower two rows the autoencoder is outputting an estimate of the foreground and of the background. Not perfect but certainly encouraging.

three spectrogram plots

Thanks to Rich and the group for their welcome in Cambridge!

Also thanks to some people who offered some specific insights into convolutional neural networks and Theano: Sander Dieleman, Matthew Koichi Grimes, Vijay Badrinarayanan.

Thursday 28th May 2015 | science | Permalink

Non-negative matrix factorisation in Stan

Non-negative matrix factorisation (NMF) is a technique we use for various purposes in audio analysis, to decompose a sound spectrogram.

I've been dabbling with the Stan programming environment here and there. It's an elegant design for specifying and solving arbitrary probabilistic models.

(One thing I want to point out is that it's really for solving for continuous-valued parameters only - this means you can't explicitly do things like clustering etc (unless your approach makes sense with fuzzy cluster assignments). So it's not a panacea. In my experience it's not always obvious which problems it's going to be most useful for.)

So let's try putting NMF and Stan together.

First off, NMF is not always a probabilistic approach - at its core, NMF simply assumes you have a matrix V, which happens to be the product of two "narrower" matrices W and H, and all these matrices have non-negative values. And since Stan is a probabilistic environment we need to choose a generative model for that matrix. Here are two alternatives I tried:

  1. We can assume that our data was generated by an independent random complex Gaussian for each "bin" in the spectrogram, each one scaled by some weight value specified by a "pixel" of WH. If we're working with the power spectrogram, this set of assumptions matches the model of Itakura-Saito NMF, as described in Fevotte et al 2009. (See also Turner and Sahani 2014, section 4A.)
  2. We can assume that our spectrogram data itself, if we normalise it, actually just represents one big multinomial probability distribution. Imagine that a "quantum" of energy is going to appear at some randomly-selected bin on your spectrogram (a random location in time AND frequency). There's a multinomial distribution which represents the probabilities, and we assume that our spectrogram represents it. This is a bit weird but if you assume we got our spectrogram by sampling lots of independent quanta and piling them up in a histogram, it would converge to that multinomial in the limit. This is the model used in PLCA.
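
You don't need Stan to get a feel for the mechanics here. For example, maximum-likelihood estimation under the second (multinomial/PLCA-style) set of assumptions comes down to minimising a generalised KL divergence between V and WH, which the classic multiplicative updates do. A minimal NumPy sketch with toy data - variable names are mine, and this is not the Stan implementation the post links to:

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_nmf(V, K, n_iter=500, eps=1e-9):
    """NMF by multiplicative updates, minimising the generalised
    Kullback-Leibler divergence between V and the product W @ H."""
    F, T = V.shape
    W = rng.random((F, K)) + eps
    H = rng.random((K, T)) + eps
    ones = np.ones_like(V)
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.T @ ones + eps)
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (ones @ H.T + eps)
        # break the scaling indeterminacy: each spectral template sums to 1
        norms = W.sum(axis=0, keepdims=True)
        W /= norms
        H *= norms.T
    return W, H

# toy "spectrogram": two spectral templates switching on and off over time
Wtrue = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
Htrue = np.array([[1.0, 1.0, 0.0, 0.0, 1.0],
                  [0.0, 0.0, 1.0, 1.0, 1.0]])
V = Wtrue @ Htrue
W, H = kl_nmf(V, K=2)
```

The Stan versions below do essentially the same job, but by expressing the generative model directly and letting the solver do the optimisation - which makes it much easier to add priors such as the Wconc concentration parameter.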

So here is the Stan source code for my implementations of these models, plus a simple toy dataset as an example. They both converge pretty quickly and give decent results.

I designed these implementations with audio transcription in mind. When we're transcribing music or everyday sound, we often have some pre-specified categories that we want to identify. So rather than leaving the templates W completely free to choose, in these implementations I specify pre-defined spectral templates "Winit".

(Specifying these also breaks a permutation symmetry in the model, which probably helps the model to converge since it shouldn't keep flipping around through different permutations of the solution. Another thing I do is fix the templates W to sum up to 1 each [i.e. I force them to be simplexes] because otherwise there's a scaling indeterminacy: you could double W and halve H and have the same solution.)

I use a concentration parameter "Wconc" to tell the model how closely to stick to the Winit values, i.e. how tight to make the prior around them. I also use an exponential prior on the activations H, to encourage sparsity.

My implementation of the PLCA assumptions isn't quite traditional, because I think in PLCA the spectrogram is assumed to be a sample from a multinomial (which implies it's quantised). I felt it would be a bit nicer to assume the spectrogram is itself a multinomial, sampled from a Dirichlet. There's little difference in practice.

Monday 2nd February 2015 | science | Permalink

Merge operation: in Chomsky, and in recursive neural networks for NLP

This is either a spooky coincidence, or a really neat connection I hadn't known:

For decades, Noam Chomsky and colleagues have famously been developing and advocating a "minimalist" idea about the machinery our brain uses to process language. There's a nice statement of it here in this 2014 paper. They propose that not much machinery is needed, and one of the key components is a "merge" operation that the brain uses in composing and decomposing grammatical structures. (Figure 1 shows it in action.)

Then yesterday I was reading this introduction to embeddings in deep neural networks and NLP, and I read the following:

"Models like [...] are powerful, but they have an unfortunate limitation: they can only have a fixed number of inputs. We can overcome this by adding an association module, A, which will take two word or phrase representations and merge them.

(From Bottou (2011))

"By merging sequences of words, A takes us from representing words to representing phrases or even representing whole sentences! And because we can merge together different numbers of words, we don’t have to have a fixed number of inputs."

This is a description of something called a "recursive neural network" (NOT a "recurrent neural network"). But look: the module "A" seems to do what the minimalists' "merge" operation does. The blogger quoted above even called it a "merge" operation...
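
For concreteness, the merge module in a recursive net can be as simple as one learned layer applied pairwise: concatenate two child vectors and map them back to a single vector of the same size. A toy sketch with untrained (random) weights - the dimensions and names here are mine, purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 4  # toy embedding dimensionality

# the association/merge module: one layer mapping two child vectors
# to a single parent vector of the same size (weights here are random
# placeholders rather than trained values)
W = 0.1 * rng.normal(size=(DIM, 2 * DIM))
b = np.zeros(DIM)

def merge(left, right):
    return np.tanh(W @ np.concatenate([left, right]) + b)

# toy word embeddings (random, untrained)
embeddings = {w: rng.normal(size=DIM) for w in ["the", "cat", "sat"]}

def compose(words):
    # fold the sentence pairwise into one fixed-size vector -
    # the same module handles 2 words or 20
    vec = embeddings[words[0]]
    for w in words[1:]:
        vec = merge(vec, embeddings[w])
    return vec

v = compose(["the", "cat", "sat"])  # still a DIM-dimensional vector
```

Because the output of merge lives in the same space as its inputs, the module can be applied recursively to its own outputs - which is exactly the property that invites the comparison with the minimalists' "merge".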

As far as I can tell, the inventors of recursive neural networks were motivated by technical considerations - e.g. how to handle sentences of varying lengths - and not by the minimalist linguists. But it looks a little bit like they've created an artificial neural network embodiment of the minimalist programme! I'm not an NLP person, nor a linguist, however: surely I'm not the first to notice this connection? It would be a really neat convergence if it was indeed unconscious. Does this mean we can now test some Chomskian ideas (such as their explanation of word displacement) by implementing them in software?

UPDATE: After chatting with my QMUL colleague Matt Purver - he actually is a computational linguistics expert, unlike me - I should add that there's a little bit less to this analogy than I initially thought. The most obvious disjunction is that the ReNN model performs language analysis in a left-to-right (or right-to-left) fashion, whereas Chomskyan minimalists do not: one thing they preserve from "traditional" grammar is the varying nested constructions of linguistic trees, nothing like as neat in general as the "sat on the mat" example above.

The ReNN model also doesn't really give you anything about long-range dependencies such as the way questions are often constructed with a kind of implicit "move" of a word from one part of the tree to another.

Matt and many other linguists have also told me it's problematic to consider a model where words and sentences are both represented in the same conceptual space. For example, a complete utterance usually implies some practical consequence in the real world, whereas its individual components do not. I recognise that there are differences, but personally I haven't heard any killer argument that they shouldn't exist in the same underlying space-like representation. (After all, many utterances consist of single words; many utterances are partial fragments; many utterances lead to consequences before the speaker has finished speaking.)

I do still believe there's an interesting analogy here. I definitely can't claim that any current ReNN model is an implementation of the Strong Minimalist Programme, but it'd be interesting to see the analogy pushed further, see where it breaks and how it can be improved.

Wednesday 21st January 2015 | science | Permalink

PhD opportunity! Study machine learning and bird sounds with me

I have a fully-funded PhD position available to study machine learning and bird sounds with me!

For full details please see the PhD advertisement on jobs.ac.uk. Application deadline Monday 12th January 2015.

Please do email me if you're interested or if you have any questions.

– and is there anyone you know who might be interested? Send them the link!

Thursday 6th November 2014 | science | Permalink

I have been awarded a 5-year fellowship to research bird sounds

I've been awarded a 5-year research fellowship! It's funded by the EPSRC and gives me five years to research "structured machine listening for soundscapes with multiple birds". What does that mean? It means I'm going to be developing computerised processes to analyse large amounts of sound recordings - automatically detecting the bird sounds in there and how they vary, how they relate to each other, how the birds' behaviour relates to the sounds they make.

zebra finches

Why it matters:

What's the point of analysing bird sounds? Well...

One surprising fact about birdsong is that it has a lot in common with human language, even though it evolved separately. Many songbirds go through similar stages of vocal learning as we do, as they grow up. And each species is slightly different, which is useful for comparing and contrasting. So, biologists are keen to study songbird learning processes - not only to understand more about how human language evolved, but also to help understand more about social organisation in animal groups, and so on. I'm not a biologist but I'm going to be collaborating with some great people to help improve the automatic sound analysis in their toolkit - for example, by analysing much larger audio collections than they can possibly analyse by hand.

Bird population/migration monitoring is also important. UK farmland bird populations have declined by 50% since the 1970s, and woodland birds by 20% (source). We have great organisations such as the BTO and the RSPB, who organise professionals and amateurs to help monitor bird populations each year. If we can add improved automatic sound recognition to that, we can help add some more detail to this monitoring. For example, many birds are changing location year-on-year in response to climate change (source) - that's the kind of pattern you can detect better when you have more data and better analysis.

Sound is fascinating, and still surprisingly difficult to analyse. What is it that makes one sound similar to another sound? Why can't we search for sounds as easily as we can for words? There's still a lot that we haven't sorted out in our scientific and engineering understanding of audio. Shazam works well for music recordings, but don't be lulled into a false sense of security by that! There's still a long way to go in this research topic before computers can answer all of our questions about sounds.

What I am going to do:

I'll be developing automatic analysis techniques (signal processing and machine learning techniques), building on starting points such as my recent work on tracking multiple birds in an audio recording and on analysing frequency-modulation in bird sounds. I'll be based at Queen Mary University of London.

I'll also be collaborating with some experts in machine learning, in animal behaviour, in bioacoustics. One of the things on the schedule for this year is to record some zebra finches with the Clayton Lab. I've met the zebra finches already - they're jolly little things, and talkative too! :)

Tuesday 18th March 2014 | science | Permalink

How long it takes to get my articles published - update

Here's an update to my own personal data about how long it takes to get academic articles published. I've also augmented it with funding applications too, to compare how long all these decisions take in academia.

It's important because often, especially as an early-career researcher, if it takes one year for a journal article to come out (even after the reviewers have said yes), that's one year of not having it on your CV.

So how long do the different bits take? Here's a bar-chart summarising the mean durations in my data:

The data is divided into 3 sections: first, writing up until first submission; then, reviewing (including any back-and-forth with reviewers, resubmission etc); then finally, the time from final decision through to publication.

Firstly note that there are not many data points here, so for example I have one journal article that took an extremely long time after acceptance to actually appear, and this skews the average. But it's certainly notable that the time spent writing generally is dwarfed by the time spent waiting. And particularly that it's not necessarily the reviewing process itself that forces us all to wait - various admin things such as typesetting seem to take at least as long. Whether or not things should take that long, well, it's up to you to decide.

Also - I was awarded a fellowship recently, which is great - but you can see in the diagram, that I spent about two years repeatedly getting negative funding decisions. It's tough!

This is just my own data - I make no claims to generality.

Monday 17th March 2014 | science | Permalink

Gaussian Processes: advanced regression with sounds, and with geographic data

This week I was learning about Gaussian Processes, at the very nice Gaussian Processes Winter School in Sheffield. The term "Gaussian Processes" refers to a family of techniques for inferring a smooth surface (1D, 2D, 3D or more) from a set of sampled noisy data points. Essentially, it's an advanced and mathematically very sound type of regression.

Don't get confused by the name, by the way: your data doesn't have to be Gaussian, and Gaussian Process regression doesn't always produce smooth Gaussian-looking results. It's very flexible.

As an example, here's a first pass I did of analysing the frequency trajectories in a single recording of birdsong.

I used the "GPy" Python package to do all this. Here's their GPy regression tutorial.

I do want to emphasise that this is just a first pass, I don't claim this is a meaningful analysis yet. But there's a couple of neat things about the analysis:

  1. It can combine periodic and nonperiodic variation (by combining periodic and nonperiodic covariance kernels). Here I used a standard RBF kernel plus a periodic kernel which repeats every 1 syllable, and another periodic kernel which repeats every 3 syllables, which reflects well the patterning of this song bout.
  2. It can represent variation across multiple levels of detail. Unlike many other regressions/interpolations, sometimes there are fast wiggles and sometimes broad curves.
  3. It gives you error bars, which are derived from a proper Bayesian posterior.
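
These three properties fall straight out of the standard GP regression equations. Here's a minimal pure-NumPy sketch - not the GPy code used for the figures, and the kernel settings are illustrative guesses - showing an RBF kernel summed with two periodic kernels (periods 1 and 3, echoing the syllable patterning):

```python
import numpy as np

def rbf(x1, x2, var=1.0, ls=1.0):
    d = x1[:, None] - x2[None, :]
    return var * np.exp(-0.5 * (d / ls) ** 2)

def periodic(x1, x2, var=1.0, ls=1.0, period=1.0):
    # the standard periodic covariance (MacKay's construction)
    d = np.abs(x1[:, None] - x2[None, :])
    return var * np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / ls ** 2)

def gp_posterior(X, y, Xstar, kern, noise_var=1e-2):
    # standard GP regression equations, via a Cholesky factorisation
    K = kern(X, X) + noise_var * np.eye(len(X))
    Ks = kern(X, Xstar)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = np.diag(kern(Xstar, Xstar)) - np.sum(v * v, axis=0)
    return mean, np.sqrt(np.maximum(var, 0.0))  # posterior mean and error bars

# covariance: RBF (broad trends) plus two periodic components
kern = lambda a, b: (rbf(a, b, ls=2.0)
                     + periodic(a, b, period=1.0)
                     + periodic(a, b, period=3.0))

rng = np.random.default_rng(1)
X = np.linspace(0.0, 10.0, 50)
y = np.sin(2 * np.pi * X) + 0.3 * np.sin(2 * np.pi * X / 3) + 0.05 * rng.normal(size=50)
mean, std = gp_posterior(X, y, X, kern)
```

GPy wraps exactly this kind of machinery (plus hyperparameter optimisation), so in practice the whole analysis is a few lines of kernel arithmetic and a call to optimize().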

So now here's my second example, in a completely different domain. I'm not a geostatistician but I decided to have a go at reconstructing the hills and valleys of Britain using point data from OpenStreetMap. This is a fairly classic example of the technique, and OpenStreetMap data is almost perfect for the job: it doesn't hold any smooth data about the surface terrain of the Earth, but it does hold quite a lot of point data where elevations have been measured (e.g. the heights of mountain peaks).

If you want to run this one yourself, here's my Python code and OpenStreetMap data for you.

This is what the input data look like - I've got "ele" datapoints, and separately I've got coastline location points (for which we can assume ele=0):

Those scatter plots don't show the heights, but they show where we have data. The elevation data is densest where we have mountain ranges etc, such as central Scotland and in Derbyshire.

And here are two different fits, one with an "exponential" kernel and one with a "Matern" kernel:

Again, the nice thing about Gaussian Process regression is that it seamlessly handles smooth generalisations as well as occasional patches of fine detail where needed. How good are the results? Well it's hard to tell by eye, and I'd need some official relief-map data to validate it. But from looking at these two, I like the exponential-kernel fit a bit better - it certainly gives an intuitively appealing relief map in central Scotland, and it gives visually a bit less blobbiness than the other plot. However it's a bit more wrong in some places, e.g. an overestimated elevation in Derbyshire there (near the centre of the picture). If you ask an actual geostatistics expert, they will probably tell you which kernel is a good choice for regressing terrain shapes.

The other thing you can see in the images is that it isn't doing a very good job of predicting the sea. Often, the fit dips down to an altitude of zero at the coast and then pops back upwards again. No surprises about this, for two reasons: firstly I didn't give it any data points about the sea, and secondly I'm using "stationary" kernels, meaning there's no reason for the algorithm to believe the sea behaves any differently from the land. This is easy to fix by masking out the sea but I haven't bothered.

So altogether, these examples show some of the nice features of Gaussian Process regression, and, along with the code, that the GPy module makes it pretty easy to put together this kind of analysis in Python.

Friday 17th January 2014 | science | Permalink

The UK Government Response to the BIS Open Access Review

The UK Government's Department for Business, Innovation and Skills recently published a review of Open Access research publication. It made a number of really good recommendations, including de-emphasising the "gold" (pay-to-publish) route, and stepping back from the over-extended embargo periods that the publishers seem to have got RCUK to agree to.

The Government has published its response to this review. What is their response? Basically: "Nah, no thanks."

  • The review said "RCUK should build on its original world leading policy by reinstating and strengthening the immediate deposit mandate in its original policy". The Government said "... timely OA ... through mutually acceptable embargo period". There's nothing "mutual" about the choice of embargo period, given that many academics have been asking for the position that the government has just explicitly rejected.
  • The review said "We recommend that the Government and RCUK revise their policies to place an upper limit of 6 month embargoes on STEM subject research and up to 12 month embargoes for HASS subject research, in line with RCUK’s original policy published in July 2012". The Government said "A re-engineering of the research publications market entails a journey not an event" or in other words "No". Note the vacuousness of their statement. It could easily have been "an event", and the committee wasn't even recommending the total removal of embargoes.
  • The review said "We recommend that the Government and RCUK reconsider their preference for Gold open access during the five year transition period, and give due regard to the evidence of the vital role that Green open access and repositories have to play as the UK moves towards full open access." The government said "Government and RCUK policy with an expressed preference for Gold OA [sets the direction of travel]". This is fair enough as a sentiment, but unfortunately the government response also included the publishers' favourite "open access flowchart" which clearly tells researchers that gold open access must be chosen if available. Note that this is not a consensus or objective reading of current RCUK rules, let alone the future. The government is showing no signs of backing away from this weird new competitive problem they're creating right now, where researchers in universities have to compete with their own colleagues (studying completely different disciplines) for the tiny and certainly insufficient institutional pay-to-publish funding pots.
  • The review in fact agrees with the position I just stated: "RCUK’s current guidance provides that the choice of Green or Gold open access lies with the author and the author’s institution, even if the Gold option is available from the publisher. This is incompatible with the Publishers Association decision tree, and RCUK should therefore withdraw its endorsement of the decision tree as soon as possible, to avoid further confusion within the academic and publishing communities." The government says "As discussed above the UK OA Decision Tree sets out clearly the direction of travel." Arrrrrrrrrrrrrrrgh, are you not even listening?

I could go on. Suffice to say that I was so encouraged by the sane voice of the BIS review; yet the government's response appears to be a solid and completely shameless "not for turning".

Thursday 28th November 2013 | science | Permalink

Improved multiple birdsong tracking - video

The "Faculti" website did a video interview with me about automatic birdsong tracking. A little tongue-tied occasionally but here it is (5:36):

Image from video - click to watch

The research papers related to this are:

Monday 28th October 2013 | science | Permalink

Open access: green does NOT mean CC-BY-NC

There's been a fair amount of confusion around the new UK guidelines that mean we have to publish our research articles as open access. One of the urban myths that have sprung up is rather curious: the idea that if you choose to publish under the green route, you're supposed to publish under a Creative Commons NonCommercial licence. This is not true. (It's just one of the many licences that would work.) But I have heard it from heads of research groups, I've heard it from library staff. We need to be clear!

(BACKGROUND: "Green" and "gold" are terms often used to describe two different sorts of open access, and they're also the two terms used by Research Councils UK [RCUK] to tell us what to do. "Gold" means that the publisher has to provide the article freely to everyone, rather than charging people for access; in return, most publishers will charge us researchers in order to publish under gold. "Green" means the publisher doesn't have to do anything, except to agree that the author can put a copy of the paper on their website or in an online repository. So, both enable free access to research, but in different ways, and with different costs and benefits.)

Now, in RCUK official guidance we have the option of green or gold publication. If we go the gold route, RCUK requires a specific licence: Creative Commons Attribution, aka CC-BY. If we go the green route, the RCUK policy doesn't exactly specify the licence, but it does say that it has to be published "without restriction on non‐commercial re‐use". Pause for a second to unpick the triple-negative in that turn of phrase...

The reason for that wording is that RCUK didn't want the publishers to "lock down" green OA by saying things like "you can self-archive the paper, but only under these strict terms and conditions which don't actually let people get the benefits of OA". For whatever reasons, they decided that it was OK for publishers to forbid commercial reuse (perhaps to prevent other publishers profiting from simply re-publishing?), but they would draw the line and say they weren't allowed to forbid non-commercial reuse. However, the policy doesn't require any particular licence.

But we might be tempted to ask, well, fine, but what is an example of a licence that would satisfy these RCUK rules? Well, Mark Thorley of RCUK gave an example of this: the Creative Commons Attribution-NonCommercial or CC-BY-NC would be fine. It's an appropriate example because it forbids commercial reuse but allows non-commercial reuse. OK so far?

Unfortunately, when you look at Mark Thorley's slides on the RCUK website, that's not exactly what is conveyed. If you go to slide 10 it says:

"Green (at least post print) with a maximum embargo period of 6(12) months, and CC-BY-NC"

OK that's pretty clear isn't it? It doesn't say that CC-BY-NC is just an example, it basically says CC-BY-NC is required. This is not what Thorley meant. I raised this issue on a mailing list, and he clarified the position:

"The policy does not define a specific licence for green deposit, provided non-commercial re-use such as text and data mining is supported. In presentations I say that this 'equates to CC-BY-NC', however, we do not specifically require CC-BY-NC. This is because some publishers, such as Nature, offer specific deposit licences which meet the requirements of the policy. However [...] this is the minimum requirement. So if authors are able and willing to use more open licences, such as CC-BY, we would encourage this. The more open the licence, the less ambiguities and barriers there are to re-use of repository content."

This clarification is welcome. But unfortunately it was provided in a reply on a mailing list discussion, and the RCUK website itself doesn't provide this clarification, so the misunderstanding is bound to run and run. This week I heard it repeated in an Open Access forum, and I hope that if you've read this far you'll help stop this misconception getting out of hand!

Wednesday 17th July 2013 | science | Permalink

Bird sound analysis with MPTK: chirp vs gabor

I'm just back from a great two-week research visit to INRIA in Rennes. The fruit of our labour will be a new release of the Matching Pursuit ToolKit with some whizzy extra features and polish. In my previous blog entry I showed how we can use Matching Pursuit to detect patterns in spectrograms - now I want to show you a quick example of how these techniques can give you a clearer, more meaningful representation of sounds such as birdsong.

On my way home one day I got a nice recording of a chiff-chaff, so we'll use that as our example. (I also put the longer audio on Xeno Canto as XC125867.)

My particular concern is how to analyse this sound so we can capture some of the fine detail of the very fast pitch variation in birdsong - the chiffchaff is a clear example of this because it sings individual "notes", each with a very fast downward chirp onto the note.

So, using MPTK, I have a few choices of how to analyse. A classic option is using Gabor atoms, which you can think of as simple time-frequency blobs a little bit like the pixels in a spectrogram. MPTK can find a sparse representation of the signal using Gabor atoms - in the picture below, the first plot is a simple spectrogram, and the second one is the result of Matching Pursuit with Gabor atoms:


(BTW, the vertical axes aren't quite the same - oops.) As you can see, it's worked out how to build the energetic parts of the signal using a small number of Gabor atoms.

But another choice is to analyse using chirplets. These are a lot like Gabor atoms except they don't just have a fixed frequency, they can slope downwards or upwards in frequency. MPTK has a nice feature for efficient chirplet analysis (it uses Rémi Gribonval's fast matching pursuit technique for chirplets).

You can see the chirplet-based analysis in the bottom of the plot above. Notice how each syllable from the bird seems to begin with a big downward slash, showing a very fast downward chirp. That reflects what is actually happening and what you can hear in the recording.

The important thing, for me, is that chirplets here seem to be getting a much more meaningful representation of the signal than the Gabor atoms. This should be more useful for downstream analysis (whether by human or machine).

We can even sonify the difference using timestretching. Once I've analysed the sound using MPTK, I can reconstruct it... or I can manipulate the data and reconstruct modified versions of it. In the following MP3 player you'll hear 5 tracks. First the 7-second original recording. Then there's a reconstruction using chirplets; then a four-times-slower timestretch version made from the chirplets. Then the same but with Gabor atoms (a reconstruction, then a timestretch version).

In particular, compare the timestretched versions. With the Gabor version, you hear a lot of very robotised / quantised / MP3ish artifacts to the sound, whereas the chirplet version sounds much more natural. Still some artifacts in both, of course.
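The timestretch trick itself is simple once you have a parametric decomposition. pyMPTK's real book/atom API isn't shown here, so as a hypothetical sketch treat the decomposition as a plain list of dicts, each with an `onset` field: stretching means scaling the onsets while leaving each atom's internal content (and hence its pitch and chirp rate) untouched, which is why it avoids the artifacts of naive slowed-down playback.

```python
def timestretch_atoms(atoms, factor=4.0):
    """Scale every atom's onset time by `factor`; the atoms themselves
    (frequency, chirp rate, duration) are left unchanged."""
    return [dict(a, onset=a["onset"] * factor) for a in atoms]
```

Resynthesising from the shifted atoms then gives the four-times-slower versions heard above.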

The Python code for these examples is available here - note that it relies on the pyMPTK wrapper, which is going to be in the soon-to-be-released MPTK version 0.7.

Monday 25th March 2013 | science | Permalink

Pulling bird sounds out of the fog

I'm on a research visit to Rémi Gribonval's research group at INRIA (Rennes, France). So far it's been great, and maybe I'll tell you more later, but first I just want to blog a little signal-processing achievement for today.

Together with Bob Sturm and Boris Mailhé I've been working on improvements to MPTK, the C++ toolkit for sparse decomposition of signals. I've spent quite a lot of time on making a nice Python wrapper (yay) and also on code refactoring (hmm), but today I have actually done some signal processing:

Below you can see a little spectrogram of a chiffchaff singing:


Time is along the x-axis, frequency the y-axis. (The audio is from Xeno Canto: #XC25760.)

Now, in my previous research work I developed a way of tracking those chirrups, but it relied on a rather simplistic first step of detecting individual sounds. What I've been able to do, finally today, is use Matching Pursuit instead, thanks to MPTK (with the "anywave" feature which I think I have just fixed). Potentially, this has some advantages in detecting birdsong syllables cleanly.

So my first test today is to take a simple "template" for a syllable:


And use this as a one-atom dictionary in matching pursuit applied to the above spectrogram. The result is a set of "detections" which I can use for various purposes, such as, for example, reconstructing a cleaned-up spectrogram:


Looks pretty good eh? There's one false-positive in the above, and one or two false-negatives, but the basic principle is looking good. This should be useful.
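For the curious, the core idea can be sketched in a few lines. This is not MPTK's "anywave" implementation (which works on the waveform and is far more efficient) — it's a naive greedy matching pursuit run directly on a 2D spectrogram array with a single unit-norm template, which is enough to show where the "detections" come from:

```python
import numpy as np

def mp_detect(spec, template, n_iters=10, threshold=0.1):
    """Greedy one-atom matching pursuit on a 2D spectrogram.
    Each step finds the patch best matched by the unit-norm template,
    records (row, col, coefficient), and subtracts its contribution."""
    th, tw = template.shape
    template = template / np.linalg.norm(template)
    residual = spec.astype(float).copy()
    detections = []
    for _ in range(n_iters):
        best, best_pos = 0.0, None
        for i in range(residual.shape[0] - th + 1):
            for j in range(residual.shape[1] - tw + 1):
                c = np.sum(residual[i:i+th, j:j+tw] * template)
                if c > best:
                    best, best_pos = c, (i, j)
        if best_pos is None or best < threshold:
            break
        i, j = best_pos
        residual[i:i+th, j:j+tw] -= best * template
        detections.append((i, j, best))
    return detections, residual
```

Summing `coefficient * template` back at each detected position gives exactly the kind of cleaned-up spectrogram shown above.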

Monday 18th March 2013 | science | Permalink

Update on GM-PHD filter (with Python code)

Note: I drafted this a while back but didn't get round to putting it on the blog. Now I have published code and a published paper about the GM-PHD filter, I thought these practical insights might be useful:

I've been tweaking the GM-PHD filter which I blogged about recently. (Gaussian mixture PHD is a GM implementation of the Probability Hypothesis Density filter, for tracking multiple objects in a set of noisy observations.)

I think there are some subtleties to it which are not immediately obvious from the research articles.

Also, I've published my open source GM-PHD Python code so if anyone finds it useful (or has patches to contribute) I'd be happy. There's also a short research paper about using the GM-PHD filter for multi-pitch tracking.

In that original blog post I said the results were noisier than I was hoping. I think there are a couple of reasons for this:

  • The filter benefits from a high-entropy representation and a good model of the target's movement. I started off with a simple 1D collection of particles with fixed velocity, and in my modelling I didn't tell the GM-PHD about the velocity - I just said there was position with some process noise and observation noise. Well, if I update this so the model knows about velocity too, and I specify the correct linear model (i.e. position is updated by adding the velocity term on to it), the results improve a little. I was hoping that I could be a bit more generic than that. It may also be that my 1D example is too low-complexity, and a 2D example would give it more to focus on. Whatever happened to "keep it simple"?!

  • The filter really benefits from knowing where targets are likely to come from. In the original paper, the simulation examples are of objects coming from a fixed small number of "air bases" and so they can be tracked as soon as they "take off". If I'm looking to model audio, then I don't know what frequency things will start from, there's no strong model for that. So, I can give it a general "things can come from anywhere" prior, but that leads to the burn-in problem that I mentioned in my first blog post - targets will not accumulate much evidence for themselves, until many frames have elapsed. (It also adds algorithmic complexity, see below.)

  • Cold-start problem: the model doesn't include anything about pre-existing targets that might already be in the space, before the first frame (i.e. when the thing is "turned on"). It's possible to account for this slightly hackily by using a boosted "birth" distribution when processing the first frame, but this can't answer the question of how many objects to expect in the first frame - so you'd have to add a user parameter. It would be nice to come up with a neat closed-form way to decide what the steady-state expectation should be. (You can't just burn it in by running the thing with no observations for a while before you start - "no observations" is expressed as "empty set", which the model takes to mean definitely nothing there rather than ignorance. Ignorance would be expressed as an equal distribution over all possible observation sets, which is not something you can just drop in to the existing machinery.)
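The first point above — telling the model about velocity — amounts to a standard constant-velocity linear-Gaussian motion model. A minimal sketch (my own illustrative matrices, per-frame time step of 1):

```python
import numpy as np

# State vector: [position, velocity]. The transition matrix encodes
# "position is updated by adding the velocity term on to it":
F = np.array([[1.0, 1.0],
              [0.0, 1.0]])
# Only the position is observed:
H = np.array([[1.0, 0.0]])

x = np.array([0.0, 0.5])   # start at position 0, moving 0.5 per frame
for _ in range(4):
    x = F @ x              # noise-free prediction step
```

In the GM-PHD setting, `F` and `H` (plus process and observation noise covariances) are exactly what each Gaussian component is propagated and updated with.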

One mild flaw I spotted is in the pruning algorithm. It's needed because without it the number of Gaussians would diverge exponentially, so to keep it manageable you want to reduce this to some maximum limit at each step. However, the pruning algorithm given in the paper is a bit arbitrary, and in particular it fails to maintain the total sum of weights. It chops off low-weight components, and doesn't assign their lost weight to any of the survivors. This is important because the sum of weights for a GMPHD filter is essentially the estimated number of tracked objects. If you have a strong clean signal then it'll get over this flaw, but if not, you'll be leaking away density from your model at every step. So in my own code I renormalise the total mass after simplification - a simple change, hopefully a good one.
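The renormalisation fix is a one-liner. Here's a sketch of the idea (my own simplified version — it only truncates to the heaviest components, omitting the merging step of the full published pruning algorithm):

```python
import numpy as np

def prune_and_renormalise(weights, means, covs, max_components):
    """Keep the heaviest Gaussian components, then rescale so the total
    weight (i.e. the estimated number of targets) is preserved."""
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(weights)[::-1][:max_components]
    kept = weights[order]
    kept *= weights.sum() / kept.sum()   # restore the original total mass
    return kept, [means[i] for i in order], [covs[i] for i in order]
```

Without the rescaling line, every pruning step silently lowers the estimated object count.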

And a note about runtime: the size of the birth GMM strongly affects the running speed of the model. If you read through the description of how it works, this might not be obvious, because the "pruning" is supposed to keep the number of components within a fixed limit, so you might think the filter scales fine. However, if the birth GMM has many components, then they all must be cross-fertilised with each observation point at every step, and then pruned afterwards, so even if they don't persist they are still in action for the CPU-heavy part of the process. (The complexity has a kind of dependence on number-of-observations * number-of-birth-Gaussians.) If like me you have a model where you don't know where tracks will be born from, then you need many components to represent a flat distribution. (In my tests, using a single very wide Gaussian led to unpleasant bias towards the Gaussian's centre, no matter how wide I spread it.)

Tuesday 5th February 2013 | science | Permalink

Comment on 'High heels as supernormal stimuli: How wearing high heels affects judgements of female attractiveness'

There's a research paper just out which has gained itself some press: "High heels as supernormal stimuli: How wearing high heels affects judgements of female attractiveness". It's described in the popular press as "proving" that high heels make women attractive, and that's fair enough but it's obviously not very surprising news given that high heels are widely known in current Western society to have that association. The research paper is slightly more specific than that: it finds that whatever "information" is transmitted to the viewer by high heels is even transmitted when we can see nothing but a handful of moving dots, hiding everything about the viewee except their gait.

That's interesting. But unfortunately, the authors go on to make one further step, which strikes me as a step too far - namely they infer that this reflects some evolutionary explanation for the popularity of high heels. The word "supernormal" in the title refers to the idea that high heels might cause women to walk in a way which exaggerates female aspects of gait, i.e. makes them walk even more unlike males than otherwise. There is indeed evidence for this in their paper. But the authors explicitly test for whether the "female" aspects of gait correlate with attractiveness judgments, and they find insignificant or barely significant correlations.

(Technical note: two of the correlations attain p<0.05, but they didn't control for multiple comparisons, so the true significance is probably lower. And the correlations I'm talking about now are in their Table 2, which is looking at differences within the high-heel category and within the flat-shoe category. The main effect demonstrated by the authors is indeed significant: viewers rated the high-heel videos as more attractive.)

So what does this suggest? To me it seems they've demonstrated that
(a) high heels affect gait (as you can tell on most Friday nights in town), and
(b) people recognise the change in gait as being associated with attractiveness and femininity.
But this second finding can just as easily be explained by cultural learning as by something evolutionary, despite the fact that the paper was published in "Evolution and Human Behavior".

In fact, (b) could conceivably be caused by a conjunction of:
(b1) people recognise the change as being caused by high heels (whether consciously or not); and
(b2) people recognise that high heels are associated with attractiveness and femininity.
(This b1-and-b2 scenario is also a potential explanation for their second set of findings, in which the gaits of high-heeled walkers are less often mistaken for men.)

All of which means that I don't think these experiments manage to discern any difference between effects caused by evolved factors and effects caused by cultural learning. Given that, the obvious way to test that difference would be to show the dot videos to viewers who grew up in a non-Western society which doesn't have a tradition of high heels. (Not a convenient test to do - but I'd definitely be interested in the results!)

Here's one quote from their results, about a minor aspect, whether male or female onlookers have different opinions:

"note that there was no shoetype-gender interaction, showing that both males and females judged high heels to be more attractive than flat shoes. [...] furthermore, there were high correlations between male and female attractiveness ratings of the walkers in both the flat and heels condition demonstrating that males and females agreed which were the attractive and unattractive walkers."

So, in this study, the male and female onlookers showed the same pattern of response to the presence of high heels. Does this perhaps hint that the difference might be learned, rather than from some presumed phwoar-factor inbuilt in men?

This study is an example of what I see as a frustrating tendency for people in biological disciplines to do interesting quantitative studies, but then to plunge into the discussion section and make unwarranted generalisations about the evolutionary reasons for something's existence. As well as invoking evolution, in this case they also discuss women's motivation for how they dress:

"Therefore we suggest that one, conscious or unconscious, motivation for women to wear high heels is to increase their attractiveness."

Firstly, this study explicitly does not explore women's motivations, in any sense. It only studies judgments made by outside observers. Secondly, as the authors have already acknowledged,

"High heels have become a part of the uniform of female attire in a number of different contexts and as such are part of a much more complex set of display rules."

I don't dispute that attractiveness might be a more important motivation for some than other motivations (fashion, identity, confidence, social norms, availability, symbolism), but let's not imply that this hunch is an empirical finding, please. The association of high heels with attractiveness is already a common trope, so the idea that women might be motivated to buy into that trope is perfectly plausible, but this study throws no light on it.

Still, as I said, the main finding is interesting: the differences in gait induced by high heels, and the rating of such gaits as attractive, are demonstrated to be easily perceivable even in a display reduced to a handful of green dots.

Saturday 5th January 2013 | science | Permalink

Academia and flying

When I started in academia I had no idea how much travel was involved. I started a PhD because I was fascinated by computational possibilities in digital sound, and almost by chance I ended up at one of the world-leading research groups in that area, which just happened to be nearby in London. Throughout my PhD I was lucky enough to get funded travel to conferences in interesting cities like Copenhagen and Helsinki, which was an unexpected plus - not just for the research benefits you get from meeting other researchers worldwide, but just for being able to visit and see those places.

Now, two things we know are that international travel tends to involve flying, and that flying is bad for the environment. Having finished my PhD and now working as an academic researcher, there are still plenty of research conferences around the world that are good to go to: they're specifically good for my project right now, and also for my professional development more generally. On average, research conferences are in other countries. So, is it possible to be an academic researcher and avoid flying? Does that harm your career? Could academia be "rearranged" to make it involve less flying?

Here's an example: I was invited to give a seminar in a European country. A nice invitation, and the organisers agreed to try to arrange travel by train rather than plane. From the UK, this is tricky, because as an island the options are a little restricted: we have some nice (but slow) ferry services, and we have the Eurostar. It's hard for me to request non-plane transport because it tends to be more expensive for the organisers, and it can be really hard to schedule (since there are fewer schedule options and they take longer). So in the end, this time round we had to go for a compromise: I'm taking a plane one way and a train the other way. We couldn't do better than that.

In environmental terms, we can do better - I could decline the invitation. But academic research is international: the experts who are "next door" in terms of the subject are almost never "next door" geographically. If you want to develop your research you have to have meaningful personal interactions with these experts. Email, phone, videoconferencing are all fine, but if that's all you do then you lose out on the meaningful, full-bandwidth interaction that actually leads to new ideas, future collaborations, real understandings.

(For some research that confirms and discusses the importance of face-to-face collaboration, try this interesting story about lasers: Collins, H.M., "Tacit Knowledge, Trust and the Q of Sapphire", Social Studies of Science 31(1), pp. 71-85, 2001.)

As a whole, is there much that the academic world can do to mitigate the amount of travel needed? Well, I'd still say it's worth encouraging teleconferencing and the like, though as I've noted I don't think it completely scratches the itch. Should we try to focus on local-ish conferencing rather than one global summit? That doesn't strike me as a very fruitful idea, since it would reduce the amount of international mixing if it worked (and thus the amount of productive international collaboration), and I don't think it would work since one "local" conference would probably tend to emerge as the stronger.

And if you're a researcher, aware of the issues involved in heavy use of air travel, you have a balance to strike. How much can/should you turn down interesting opportunities for presenting, networking, collaboration, based on geographic distance? Will it harm your own opportunities, while others jet off to take advantage of them? Personally I know there are specific opportunities I've turned down in the past year or so, because it didn't feel right to jet off to certain places just for a couple of days' meeting. In other cases, I've taken up opportunities only after making sure I make the most of the visit by adding other meetings or holidays into the visit.

Your thoughts would be welcome...

Tuesday 20th November 2012 | science | Permalink

How long it takes to get my articles published

NB Updated version here

Today I remembered about an article I submitted ages ago to a journal, accepted but not out yet. I also realised that since I store all my work in version-control, I can pull out all the exact dates when I started writing things, submitted them, rewrote them, etc.

So here's a visualisation of that data. In the following, a solid straight line is a period where the paper is "in my hands" (writing, rewriting, or whatever), and a dashed arc is where the paper is with someone else (a reviewer or a typesetter):

Each GREEN blob is a moment of official acceptance; a RED blob the moment of official rejection; a YELLOW blob the moment of a journal article actually being publicly available. (I also included some conference papers - the yellow blob is the date of presentation for those.) This only covers things since I got my PhD.

One thing you can see straight away is that often the initial review and/or the final typesetting periods are massive compared against the writing periods. I hadn't realised, but for my journal articles it's pretty much at least 1 year between submission and availability.

People often complain about the peer-review process and how slow it can be, but the thing that's puzzling me right now is why there are these massive post-acceptance delays, which have nothing to do with reviewing. For someone like me who normally submits LaTeX documents, I can't even guess what work is left to do... yet it seems to take a minimum of 4 months!

This is just my own data - I make no claims to generality.

Monday 17th September 2012 | science | Permalink

Notes from EUSIPCO 2012

Just back from the EUSIPCO 2012 conference in Bucharest. (The conference was held in the opulent Palace of the Parliament - see previous article for some thoughts on the palace and the town.) Here are some notes about interesting talks/posters I saw:


Lots of stuff relevant to recognition in audio scenes, which is handy because that's related to my current work.

  • David Damm's "System for audio summarisation in acoustic monitoring scenarios". Nice approach and demo (with sounds localised around the Fraunhofer campus), though the self-admitted drawback is that it isn't yet particularly scalable, using full DTW search etc.
  • Sebastien Fenet's "fingerprint-based detection of repeating objects in multimedia streams" - here a very scalable approach, using fingerprints (as is done in other large-scale systems such as Shazam). In this paper he compared two fingerprint types: a Shazam-like spectral-peaks method (but using constant-Q spectrum); and a shallow Matching Pursuit applied to multiscale STFT. His results seem to favour the former.
  • Xavier Valero's "Gammatone wavelet features for sound classification in surveillance applications" - this multiscale version of gammatone is apparently better for detecting bangs and bumps (which fits with folk knowledge about wavelets...).
  • M. A. Sehili's "Daily sound recognition using a combination of GMM and SVM for home automation" - they used something called a Sequence Classification Kernel which apparently can be used in an SVM to classify sequential data, even different-length sequential data. Have to check that out.
  • Two separate papers - Anansie Zlatintsi's "AM-FM Modulation Features" and Xavier Valero's "Narrow-band Autocorrelation features" - used features which are complementary to the standard Mel energies, by analysing the fine variation within each band. They each found improved results (for different classification tasks). (In my own thesis I looked at band-wise spectral crest features, hoping to achieve something similar. I found that they did provide complementary information [Sec 3.4] but unfortunately were not robust enough to noise/degradation for my purposes [Sec 3.3]. It'll be interesting to see how these different features hold up - they are more interesting than my spectral crests I think.)

Plenty of informed audio source separation was in evidence too. Not my specialism, more that of others in our group who came along... but I caught a couple of them, including Derry Fitzgerald's "User assisted separation using tensor factorisations" and Juan-Jose Bosch's "Score-informed and timbre-independent lead instrument separation in real-world scenarios".

Other papers that were interesting:

  • T Adali, "Use of diversity in independent decompositions" - for independence-based decompositions, you can use either of two assumptions about the components: non-Gaussianity or time-dependence. The speaker noted that measuring mutual information rate covers both of these properties, so it seems like a neat thing to use. She used it for some tensor decompositions which were a bit beyond me.
  • C Areliano's poster on "Shape model fitting algorithm without point correspondence": simple idea for matching a hand image against a template which has marked points on it (but the query image doesn't): convert both representations into GMMs then find a good registration between the two GMMs. Could be useful, though the registration search is basically brute-force in this paper I think.
  • Y Panagakis presented "Music structure analysis by subspace modeling" - it makes a lot of sense, intuitively, that music structure such as verse-chorus-verse should be suited to this idea of fitting different feature subspaces to them. The way music is produced and mixed should make it appropriate for this, I imagine (whereas for audio scenes we probably don't hop from subspace to subspace... unless the mic is moving from indoors to outdoors for example...)
  • Y Bar-Yosef's "Discriminative algorithm for compacting mixture models with application to language recognition". Taking a GMM and approximating it by a smaller one is a general useful technique - here they were using Hershey and Olsen's 2007 "variational approximation" to the KLD between two GMMs. In this paper, their optimisation tries to preserve the discriminative power between two GMMs, rather than simply keeping the best fit independently.
  • I Ari's "Large scale polyphonic music transcription using randomized matrix decompositions" - some elegant tweaks which mean they can handle a very large matrix of data, using a weighted-random atom selection technique which reminds me a little of a kind of randomised Matching Pursuit (though MP is not what they're doing). They reduce the formal complexity of matrix factorisation, both in time and in space, so that it's much more tractable.
  • H Hu's "Sparsity level in a non-negative matrix factorisation based speech strategy in cochlear implants" - I know they do some good work with cochlear implants at Southampton Uni. This was a nice example: not only did they use Sparse NMF for noise reduction, and test it with human subjects in simulated conditions, but they also implemented it on a hardware device as used in cochlear implants. This latter point is important because at first I was dubious whether this fancy processing was efficient enough to run on a cochlear implant - good to see a paper that answers those kind of questions immediately.


Christian Jutten gave a plenary talk on source-separation in nonlinear mixtures. Apparently there's a proof from the 1980s by Darmois that if you have multiple sources nonlinearly mixed, then ICA cannot guarantee to separate them, for the following simple reason: ICA works by maximising independence, but Darmois proved that for any set of perfectly independent sources you can always construct a nonlinear mixture that preserves this independence. (Jutten gave an example procedure to do this; I think you could use the inverse-copula of the joint distribution as another way.)

Therefore to do source-separation on nonlinear mixtures you need to add some assumptions, either as constraints or regularisation. Constraining just to "smooth mappings" doesn't work. One set of mixture types which does work is "post-nonlinear mixtures", which means mixtures in which nonlinearities are applied separately to the outputs after linear mixing. (This is a reasonable model, for example, if your mics have nonlinearities but you can assume the sounds linearly mixed in the air before they reached the mics.) You have to use nonlinearities which satisfy a particular additivity constraint (f(u+v) = (f(u)+f(v))/(1+f(u)f(v)) ... tanh satisfies this). Or at least, you have to use those kind of nonlinearities in order to use Jutten's method.
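It's easy to check numerically that tanh satisfies the stated additivity constraint — it's just the tanh addition formula:

```python
import math

def f(u):
    return math.tanh(u)

u, v = 0.3, 0.7
lhs = f(u + v)
rhs = (f(u) + f(v)) / (1 + f(u) * f(v))
# lhs and rhs agree to machine precision: tanh's addition formula
# has exactly the required form f(u+v) = (f(u)+f(v))/(1+f(u)f(v))
```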

Eric Moulines talked about prediction in sparse additive models. There's a lot of sparsity around at the moment (and there were plenty of sparsity papers here); Moulines' different approach is that when you want to predict new values, rather than to reverse-engineer the input values, you don't want to select a single sparsity pattern but aggregate over the predictions made by all sparsity patterns. He uses a particular weighted aggregation scheme which he calls "exponential aggregation" involving the risk calculated for each "expert" (each function in the dictionary).

Now, we don't want to calculate the result for an exponentially large number of sparsity patterns and merge them all, since that would take forever. Moulines uses an inequality to convert the combinatorial problem to a continuous problem; unfortunately, at the end of it all it's still too much to calculate easily (2^m estimators) so he uses MCMC estimation to get his actual results.

I also went to the tutorial on Population Monte Carlo methods (which apparently were introduced by Cappe in 2004). I know about Particle Filters so my learnings are relative to that:

  • Each particle or iteration can have its OWN instrumental distribution, there's no need for it to be common across all particles. In fact the teacher (Petar Djuric) had worked on methods where you have a collection of instrumental distributions, and weighted-sample from all of them, adapting the weights as the iterations progress. This allows it to automatically do the kind of things we might heuristically want: start with broad, heavy-tailed distributions, then focus more on narrow distributions in the final refinement stages.
  • For static MC (i.e. not sequential), you can use the samples from ALL iterations to make your final estimate (though you need to take care to normalise appropriately).
  • Rao-Blackwellisation lets you solve a lower-dimensional problem (approximating a lower-dimensional target distribution) if you can analytically integrate to solve for a subset of the parameters given the other ones. For example, if some parameters are gaussian-distributed when conditioned on the others. This can make your approximation much simpler and faster.
  • It's generally held a good idea to use heavy-tailed distributions, e.g. people use Student's t distribution since heavier-tailed than Gaussian.
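To illustrate the first point — per-proposal instrumental distributions whose weights adapt over iterations — here's a toy mixture-PMC sketch. Everything here (the standard-normal target, the two Gaussian proposals, the adaptation-by-accumulated-weight rule) is my own simplified choice, not the tutorial's exact algorithm:

```python
import numpy as np

rng = np.random.default_rng(42)

def target_logpdf(x):
    """Toy unnormalised target: a standard normal."""
    return -0.5 * x ** 2

# Two instrumental distributions (mean, std): one broad and heavy-ish
# for early exploration, one narrow for refinement.
proposals = [(0.0, 3.0), (0.0, 1.0)]
mix = np.array([0.5, 0.5])

for iteration in range(5):
    # draw each sample from a proposal chosen by the current mixture weights
    comp = rng.choice(len(proposals), size=2000, p=mix)
    x = np.array([rng.normal(mu, s) for mu, s in (proposals[c] for c in comp)])
    # importance weights: target density over the full mixture proposal
    q = sum(m * np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))
            for m, (mu, s) in zip(mix, proposals))
    w = np.exp(target_logpdf(x)) / q
    w /= w.sum()
    # adapt: shift mixture weight towards proposals that earned more weight
    mix = np.array([w[comp == c].sum() for c in range(len(proposals))])
    mix /= mix.sum()

estimate = np.sum(w * x)   # self-normalised estimate of the target mean
```

Over the iterations the mixture weight drifts towards whichever proposal best matches the target — the automatic broad-then-narrow behaviour described above.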
Sunday 2nd September 2012 | science | Permalink

Comment on 'Seeing women as objects: The sexual body part recognition bias'

PREFACE: There's a risk that I might come across here as dismissing the research, and doing so for an odd reason. I'd like to be clear that I think this is an interesting study, and I'm not an expert in cognitive psychology but I'm writing because I'm interested in seeing these issues teased apart in more detail. See also the comments section.

Interesting article someone pointed out in European Journal of Social Psychology: Seeing women as objects: The sexual body part recognition bias. The basic idea is to use a psychophysics-type perceptual experiment to explore whether people looking at men and at women process them differently. If perceiving people "as objects" makes a difference to the cognitive processes involved, then that should be detectable.

There's plenty of evidence about our society's exaggerated emphasis on female body image, and the consequences of such objectification. What the researchers do here is use an experiment in which participants are shown images of men and women (either complete or partial images), and ask them to do a kind of spot-the-difference task. They find people get different percentage-correct scores depending on whether it's an image of a man or a woman one is looking at.

The researchers discuss this result as relating to objectification of women, and I think that's broadly OK, but there's an extra hop that I think is glossed over. A tweet summarised the research as "People perceive men using global processing, but women with local processing" but it would be more correct to say "People perceive images of men using global processing, but images of women with local processing". (It's not just the 140-character limit at fault here, the research paper itself makes the leap.)

The point is that the participants were reacting to 2D images, rather than real physical presences of men or women. Now, you might think, is that an important difference, or just quibbling? I'm not claiming that the results are wrong, and I'm not even claiming that the results don't tell us something about objectification of women. But the difference between looking-at-people and looking-at-images is important here since it relates closely to the claims being made - and this highlights the complexity of making measurements of socially-embedded cognitive processing.

Here's why I think it's a difference: In our everyday lives we see "3D" men and women. We also see "2D" images of men and women. So there are four pertinent categories here: 3D men, 3D women, 2D men-images and 2D women-images. We have absorbed general impressions about these four categories from our experiences so far (whether those "categories" are categories we use ourselves is beside the point). It's well known that there are more and different images of women than men, used in advertising and other media. As a person develops they see examples of all four categories around them, and they might learn similarities and differences, things that the categories have in common or not.

[Edit: Maybe a better way of putting it is inanimate-vs-animate, not 2D-vs-3D - see comments]

So, it's reasonable to expect that an average person in Western society is more familiar with objectified images of women around than of men. (Note that I do not claim this state of affairs is OK! I just claim that it's the average person's developmental environment.) It's easier to deal with familiar categories than unfamiliar ones. So we'd expect people to have better processing when presented with 2D body-part-images of women - and it probably correlates with their visual processing of real-life people, but that's not certain and it needs to be tested.

Am I claiming that the research should not be trusted? No. It looks like a decent and interesting experimental result. But the authors make a slight leap, which we should treat with caution: they imply that their statistically significant result on how people visually process 2D-images-of-men and 2D-images-of-women transfers directly to how people visually process men and women in the flesh. Personally I would expect that people's perception of "3D" men and women probably partly generalises from the image perception and partly doesn't. (There might be existing research on that; comments welcome.)

And obviously it's much harder to conduct large experiments by showing people "glimpses of real live men/women" rather than images, so there's a good reason why such research hasn't yet been done.

But that's good news right? - more research needed ;)

Friday 17th August 2012 | science | Permalink

A very simple toy problem for matching pursuits

To help me think about how and why matching pursuits fail, here's a very simple toy problem which defeats matching pursuit (MP) and orthogonal matching pursuit (OMP). [[NOTE: It doesn't defeat OMP actually - see comments.]]

We have a signal which is a sequence of eight numbers. It's very simple, it's four "on" and then four "off". The "on" elements are of value 0.5 and the "off" are of value 0; this means the L2 norm is 1, which is convenient.

signal = array([0.5, 0.5, 0.5, 0.5, 0, 0, 0, 0])
Diagram of signal

Now, we have a dictionary of 8 different atoms, each of which is again a sequence of eight numbers, again having unit L2 norm. I'm deliberately constructing this dictionary to "outwit" the algorithms - not to show that there's anything wrong with the algorithms (because we know the problem in general is NP-hard), but just to think about what happens. Our dictionary consists of four up-then-down atoms wrapped round in the first half of the support, and four double-spikes:

dict = array([
   [0.8, -0.6, 0, 0, 0, 0, 0, 0],
   [0, 0.8, -0.6, 0, 0, 0, 0, 0],
   [0, 0, 0.8, -0.6, 0, 0, 0, 0],
   [-0.6, 0, 0, 0.8, 0, 0, 0, 0],
   [sqrt(0.8), 0, 0, 0, sqrt(0.2), 0, 0, 0],
   [0, sqrt(0.8), 0, 0, 0, sqrt(0.2), 0, 0],
   [0, 0, sqrt(0.8), 0, 0, 0, sqrt(0.2), 0],
   [0, 0, 0, sqrt(0.8), 0, 0, 0, sqrt(0.2)],
   ])
Diagram of dictionary atoms

BTW, I'm writing my examples as very simple Python code with Numpy (assuming you've run "from numpy import *"). We can check that the atoms are unit norm by getting a list of "1"s when we run:

sum(dict ** 2, 1)

So, now if you wanted to reconstruct the signal as a weighted sum of these eight atoms, it's a bit obvious that the second lot of atoms are unappealing because the sqrt(0.2) elements are sitting in a space that we want to be zero. The first lot of atoms, on the other hand, look quite handy. In fact an equal portion of each of those first four can be used to reconstruct the signal exactly:

dot([2.5, 2.5, 2.5, 2.5, 0, 0, 0, 0], dict)

That's the unique exact solution for the present problem. There's no other way to reconstruct the signal exactly.

So now let's look at "greedy" matching pursuits, where a single atom is selected one at a time. The idea is that we select the most promising atom at each step, and the way of doing that is by taking the inner product between the signal (or the residual) and each of the atoms in turn. The one with the highest inner product is the one for which you can reduce the residual energy by the highest amount on this step, and therefore the hope is that it typically helps us toward the best solution.

What's the result on my toy data?

  • For the first lot of atoms the inner product is (0.8 * 0.5) + (-0.6 * 0.5) which is of course 0.1.
  • For the second lot of atoms the inner product is (sqrt(0.8) * 0.5) which is about 0.4472.

To continue with my Python notation you could run "sum(dict * signal, 1)". The result looks like this:

array([ 0.1,  0.1,  0.1,  0.1,  0.4472136,  0.4472136,  0.4472136,  0.4472136])

So the first atom chosen by MP or OMP is definitely going to be one of the evil atoms - more than four times better in terms of the dot-product. (The algorithm would resolve this tie-break situation by picking one of the winners at random or just using the first one in the list.)

What happens next depends on the algorithm. In MP you subtract (winningatom * winningdotproduct) from the signal, and this residual is what you work with on the next iteration. For my purposes here it's irrelevant: both MP and OMP are unable to throw away this evil atom once they've selected it, which is all I needed to show. There exist variants which are allowed to throw away dodgy candidates even after they've picked them (such as "cyclic OMP").
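The whole toy experiment fits in a few self-contained lines. Here's a minimal sketch of plain MP on the problem above (I've renamed `dict` to `dictionary` to avoid shadowing the Python builtin; argmax tie-breaks by taking the first winner):

```python
from numpy import array, sqrt, argmax, abs

signal = array([0.5, 0.5, 0.5, 0.5, 0, 0, 0, 0])
dictionary = array([
    [0.8, -0.6, 0, 0, 0, 0, 0, 0],
    [0, 0.8, -0.6, 0, 0, 0, 0, 0],
    [0, 0, 0.8, -0.6, 0, 0, 0, 0],
    [-0.6, 0, 0, 0.8, 0, 0, 0, 0],
    [sqrt(0.8), 0, 0, 0, sqrt(0.2), 0, 0, 0],
    [0, sqrt(0.8), 0, 0, 0, sqrt(0.2), 0, 0],
    [0, 0, sqrt(0.8), 0, 0, 0, sqrt(0.2), 0],
    [0, 0, 0, sqrt(0.8), 0, 0, 0, sqrt(0.2)]])

def mp(signal, dictionary, niters=4):
    """Plain matching pursuit: repeatedly pick the atom with the biggest
    (absolute) dot product against the residual, and subtract it off."""
    residual = signal.copy()
    chosen = []
    for _ in range(niters):
        dots = dictionary.dot(residual)
        best = argmax(abs(dots))   # ties broken by taking the first winner
        chosen.append(best)
        residual = residual - dots[best] * dictionary[best]
    return chosen, residual

chosen, residual = mp(signal, dictionary)
# chosen[0] is 4: the first atom picked is one of the "evil" double-spikes
```

Running it confirms the point: the first selection is always one of the double-spike atoms, and plain MP has no mechanism for un-choosing it afterwards.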

NOTE: see the comments for an important proviso re MP.

Friday 13th July 2012 | science | Permalink

A* orthogonal matching pursuit - or is it?

The delightful thing about the A* routing algorithm is that it is provably the optimal algorithm for the purpose, in the sense that it's the algorithm that visits the fewest possible path nodes given the information made available. See the original paper for proof. Despite its simplicity, it is apparently still used in a lot of industrial routing algorithms today, and it can be adapted to help solve other sorts of problem.

A colleague pointed out a paper about "A*OMP" - an algorithm that performs a kind of OMP (Orthogonal Matching Pursuit) with a tree search added to try out different paths towards fitting a good sparse representation. "Aha," I thought, "if they can use A* then they can get some nice recovery properties inherited from the A* search."

However, in reading the paper I find two issues with the A*OMP algorithm which make me reluctant to use the name "A*" for it:

  1. The heuristics used are not "consistent" (a consistent heuristic must satisfy h(n) <= cost(n, n') + h(n') along every edge, which in turn guarantees it never overestimates the true distance remaining). This lack of consistency means the proof of A*'s optimality doesn't apply. (Remember, A*'s "optimality" is about the number of nodes inspected before finding the best path.) (EDIT: a colleague pointed out that it's actually worse than this - if the heuristic isn't consistent then it's not just sub-optimal search, it may fail to inspect the best path.)
  2. Since A*OMP restricts the number of paths it adds (to the top "B" atoms having largest cross-product with the residual) there are no guarantees that it will even inspect the true basis.

These issues are independent of each other. If you leave out the pragmatic restriction on the number of search paths (to get round the second issue), the first issue still applies. OMP itself is greedy rather than exact, so this doesn't make A*OMP worse than OMP, but to my mind it's "not as good as A*".
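For reference, "consistency" is a purely local condition and easy to check mechanically on a toy graph. This little checker is my own illustration of the definition, nothing to do with the A*OMP paper's internals:

```python
def is_consistent(h, edges):
    """Check the consistency (monotonicity) condition for a heuristic:
    h(n) <= cost(n, m) + h(m) for every directed edge.
    `edges` is a list of (node, neighbour, cost) triples;
    `h` maps node -> heuristic estimate."""
    return all(h[n] <= c + h[m] for (n, m, c) in edges)

# A consistent heuristic on a tiny two-edge path to the goal...
edges = [("a", "b", 1.0), ("b", "goal", 1.0)]
assert is_consistent({"a": 2.0, "b": 1.0, "goal": 0.0}, edges)
# ...and one that overestimates at "a", breaking consistency:
assert not is_consistent({"a": 3.0, "b": 1.0, "goal": 0.0}, edges)
```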

In practice, the authors' A*OMP algorithm might indeed get good results, and the experiments shown seem to do so. So my quibbles may be mere quibbles. But the name "A*" led me to expect guarantees that just aren't there (e.g. guarantees of being better than OMP). It's quite easy to construct a toy problem for which A*OMP will not get you nearer the true solution than OMP will.

It's not obvious how to come up with a consistent heuristic. For a given problem, if we knew there was an exact solution (i.e. zero residual was possible within the sparsity constraints) then we could use the residual energy; but we can't know that in general, so the residual energy may overestimate the distance still to be "travelled" to the goal.

One minor thing: their "equivalent path pruning" in section 4.2.3 is a bit overkill - I know a simpler way to avoid visiting duplicate paths. I'll leave that as an exercise for the reader :)

Friday 13th July 2012 | science | Permalink

My research about time and sound

I've got three papers accepted in conferences this summer, and they're all different angles on the technical side of how we analyse audio with respect to time.

In our field, time is often treated in a fairly basic way, for reasons of tractability. A common assumption is the "Markov assumption" that the current state only depends on the immediate past - this is really handy because it means our calculations don't explode with all the considerations of what happened the day before yesterday. It's not a particularly hobbling assumption - for example, most speech recognition systems in use have the assumption in there, and they do OK.

It's not obvious whether we "need" to build systems with complicated representations of time. There is already some good research in the literature which does, with promising results. And conversely, there are some good arguments that simple representations capture most of what's important.

Anyway, I've been trying to think about how our signal-processing systems can make intelligent use of the different timescales in sound, from the short-term to the long-term. Some of my first work on this is in these three conference talks I'm doing this summer, each on a different aspect:

  1. At EUSIPCO I have a paper about a simple chirplet representation that can do better than standard spectral techniques at representing birdsong. Birdsong has lots of fast frequency modulation, yet the standard spectral approaches assume "local stationarity" - i.e. they assume that within a small-enough window, we can treat the signal as unchanging. My argument is that we're throwing away information at this point in the analysis chain, information that for birdsong at least is potentially very useful.

  2. At MML I have a paper about tracking multiple pitched vibrato sounds, using a technique called the PHD filter which has already been used quite a bit in radar and video tracking. The main point is that when we're trying to track multiple objects over time (and we don't know how many objects), it's suboptimal to just take a model that deals with one object and apply the model multiple times. You benefit from using a technique that "knows" there may be multiple things. The PHD filter is one such technique, and it lets you model things with a linear-gaussian evolution over time. So I applied it to vibrato sources, which don't have a fixed pitch but oscillate around. It seems (in a synthetic experiment) the PHD filter handles them quite nicely, and is able to pull out the "base" pitches as well as the vibrato extent automatically. The theoretical elegance of the filter is very nice, although there are some practical limitations which I'll mention in my talk.

  3. At CMMR I have a paper about estimating the arcs of expression in pianists' tempo modulations. The paper is with Elaine Chew, a new lecturer in our group who works a lot with classical piano performance. She has had students before working on the technical question of automatically identifying the "arcs" that we can see visually in expressive tempo modulation. I wanted to apply a Bayesian formulation to the problem, and I think it gives pretty nice results and a more intuitive way to specify the prior assumptions about scale.

So all of these are about machine learning applied to temporal evolution of sound, at different timescales. Hope to chat to some other researchers in this area over the summer!

Saturday 9th June 2012 | science | Permalink

Implementing the GM-PHD filter

I'm implementing the GM-PHD filter. (The what? The Gaussian mixture Probability Hypothesis Density filter, which is for tracking multiple objects.) I'm implementing it in Python, which is nice, but I'm not completely sure it's working as intended yet.

Here's a screenshot of progress so far. Look at the first four plots in this picture, which are:

  1. The true trajectory of two simulated objects moving in 1D over time.
  2. Observations received, with "clutter" and occasional missed detections.
  3. The "intensity" calculated by the GM-PHD filter. This is the core state variable of the filter's model.
  4. Filtered trajectories output from the PHD filter.

So what do you think? Good results?

Not sure. It's clearly got rid of lots of the clutter - good. In fact, yes, it's got rid of the majority of the noise, hooray hooray. But the clutter close to the targets is still there, and it looks a bit mucky in a way that suggests it's not going to be easy to clear up.

And there's also a significant "cold start" problem - it takes up to about 20 frames for the filter to be convinced that there's anything there at all. That's no real surprise, since there's an underlying "birth" model which says that a trail could spring into being at any point, but there's no model for "pre-existing" targets. There's nothing in the PHD and GMPHD papers which I've read which even mentions this, let alone accounts for it - I'm pretty sure that we'll either need to initialise the state to account for this, or always do some kind of "warmup" before getting any results out of the filter. That's not great, especially when we might be tracking things that only have a short lifespan themselves.
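One cheap mitigation - my own guess, not something proposed in the papers - would be to seed the initial intensity with a broad, low-confidence gaussian component representing targets that may already exist when tracking starts. In GM-PHD terms the intensity is a list of weighted gaussians whose weights sum to the expected number of targets, so that's easy to express:

```python
import numpy as np

def seeded_intensity(expected_targets=2.0, state_dim=2, spread=100.0):
    """Initial GM-PHD intensity as a list of (weight, mean, covariance)
    components: one very broad gaussian whose weight is the expected number
    of targets already present at t=0. All parameter values here are
    illustrative guesses, not tuned numbers."""
    weight = float(expected_targets)
    mean = np.zeros(state_dim)             # e.g. [position, velocity]
    cov = np.eye(state_dim) * spread ** 2  # very broad: "could be anywhere"
    return [(weight, mean, cov)]
```

Whether that behaves better than simply discarding the first few frames as warmup is exactly the kind of thing that needs testing.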

One thing: this is a one-dimensional problem I'm testing it on. PHD filters are usually used for 2D or 3D problems - and maybe there needs to be enough entropy in the representation for clutter to be distinguished more clearly from signals. That would be a shame, since I'd like to use it on 1D things like spectrogram frames.

More tests needed. Any thoughts gratefully received too.

Thursday 29th March 2012 | science | Permalink

Perceptually-modelled audio analysis

This week I went to a research workshop in Plymouth called Making Sense of Sounds. It was all based around an EU project which aims to improve the state of the art in auditory models (i.e. models of what happens in between our ear and our consciousness, to turn a physical sound into an auditory perception) and also use them to help computers and machines to understand sound.

I won't blog the whole thing but just a few notes here. There was a lot of research on the streaming paradigm, and it's quite amazing how it's still possible to discover new facts about human hearing using such a simple sound. Basically, the sound is usually something like "bip boop bip, bip boop bip, bip boop bip", and the clever bit is that we can either hear this as a single stream or as two segregated streams (a bip stream and a boop stream), depending on the relative pitches and durations. It's an example of "bistable perception", just like famous optical illusions such as the Necker cube or the faces/vase thing. With modern EEG and fMRI brain scanning, this streaming paradigm shows some interesting facts about how we hear sounds - for example, it seems that our auditory system does entertain both "versions" at some point, but this resolves to just one choice at some point below conscious perception.
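For the curious, the stimulus itself is trivial to synthesise - something like this sketch, where all the parameter values are arbitrary choices of mine:

```python
import numpy as np

def aba_stimulus(freq_a=500.0, semitones_apart=7.0, tone_dur=0.1,
                 sr=8000, nreps=4):
    """The classic 'bip boop bip' (ABA-) streaming stimulus: two alternating
    pure tones plus a silent rest. The bigger `semitones_apart` (and the
    shorter the tones), the more likely you hear two segregated streams
    rather than one galloping stream."""
    freq_b = freq_a * 2 ** (semitones_apart / 12.0)
    t = np.arange(int(tone_dur * sr)) / float(sr)
    tone = lambda f: np.sin(2 * np.pi * f * t)
    triplet = np.concatenate([tone(freq_a), tone(freq_b), tone(freq_a),
                              np.zeros(len(t))])   # the '-' rest
    return np.tile(triplet, nreps)
```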

I was interested by Maria Chait's talk on change detection, and in conversation afterwards she pointed us to some recent research - see this 2010 paper by Scholl et al - which shows that humans have neurons which are able to detect note offsets, even though it's very well established that in behaviour we're very bad at noticing them - i.e. we often can't tell what happened when a sound stops, but it's usually pretty noticeable when a sound starts!

Those findings aren't completely incompatible, of course. It's plausible that in human evolution, sudden sounds were more important than sudden silences, even though both are informative.

Maneesh Sahani talked about two of his students' work. The one that was new to me was Phillip Herrmann's thesis on pitch perception, which took a really interesting approach - rather than using a spectral or autocorrelation method, they started from a generative model in which we assume there is some pitch rate generating an impulse train, and some impulse response convolved with it, and also some gaussian noise etc, then this goes into some auditory model before arriving at a representation which we have to make inferences about. They then ran inference with this model on audio signals. The point is not whether this is an appropriate model for most sounds, just whether this assumption gets you far enough to do pitch perception in similar ways as humans do (with some of the attendant peculiarities).

One particularly nice experiment they came up with is another kind of "bistable perception" experiment where you have a train of impulses separated by 2ms, and every second impulse is optionally attenuated by some amount. So if there's no attenuation, you have a 2ms impulse train; if there's full attenuation, you have a 4ms impulse train; somewhere in between, you're somewhere in between. If you play these sounds to humans, they can report ambiguous pitch perception, sometimes detecting the higher octave, sometimes the lower, and this Herrmann/Sahani model apparently replicates the human data in a pretty good way that is not reflected in autocorrelation models.
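That stimulus is also easy to construct - here's a sketch, with parameter values that are again my own choices:

```python
import numpy as np

def ambiguous_train(atten=0.5, period_ms=2.0, sr=48000, dur=0.5):
    """Impulses every `period_ms`, with every second impulse scaled by
    `atten`: atten=1 gives a pure 2ms train (higher octave), atten=0 gives
    a 4ms train (lower octave), and values in between give the
    ambiguous-pitch stimulus."""
    step = int(sr * period_ms / 1000.0)
    x = np.zeros(int(sr * dur))
    x[::step] = 1.0
    x[step::2 * step] = atten   # attenuate every second impulse
    return x
```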

Oh, also, over a diverse dataset, they apparently found a really clear square-root correlation between fundamental frequency and spectral centroid. (In other parts of the literature, it's not clear whether or not the two are correlated.) I'd like to see the data for this one - as I mentioned to Maneesh, there might be reasons to expect some data to do this by design (e.g. professional singers' voices). The point for Herrmann/Sahani is to see if the correlation exists in the data that might have "trained" our perception, so I'm not sure if things like professional singers should be included or not.

Maneesh Sahani also said at the start of his talk that Helmholtz (in the 19th century) came up with this idea of "perception as inference" - but then the electrical/computational signal-processing paradigm came along and everyone treated perception as processing. The modern Bayesian tendency, and its use to model perception, is a return to this "perception as inference". Is there anything that wasn't originally invented by Helmholtz?

Also Tom Walters' demo of his AIMC real-time perceptual model in C++ was nice, and it's code I'd like to make use of some time.

My own contribution, a poster about using chirplets to analyse birdsong, led to some interesting conversations. At least one person was sure I should be using auditory models instead of chirplets - which, given the context, I should have expected :)

Thursday 23rd February 2012 | science | Permalink

The Impact agenda, and public engagement

I was at a meeting recently, going through research proposal documents, and I realised that the previous government's "impact agenda" might be having an unintended effect on public engagement:

One of the things that has happened in research in the past few years is that the government now demands that we state what kind of "impact" our research will have. Now, the problem is that impact is notoriously and demonstrably unpredictable - we don't know if we're going to discover anything world-changing, until we actually try it, and even then we might not realise the impact for decades - but the previous government wanted to try and pin it down somehow.

So every proposal now (in the UK) has to have a two-page "Pathways to Impact" summary. If you're doing applied research it's pretty easy - you say things like "We're going to study the resilience of welded grommets under pressure, which means the grommet industry will produce more reliable grommets and there will be fewer grommet-related fatalities." If you're doing theoretical or basic research, in principle you still have a story to tell: you say something like "Our research will lead to a greater understanding of the number five, which is widely used in the natural sciences, industry and the financial sector. Future researchers will be able to build on these theoretical advances to develop new techniques for counting grommets or whatever."

So, in theory every research project has something they can say about this. (And they don't have to fill up the two pages, if they don't have much to say.) But that's not what happens.

Here's a very rough transcript of a conversation that went on in the meeting:

P: "Your proposal is good, Q, but there's not really anything about impact. The reviewers will have to rate you on impact so you need to say something here."

Q: "Oh blooming heck, but it's basic research, you can't really say what the impact is. I suppose I'll have to stick a schools talk in or something?"

R: "I know a couple of schools, I can arrange for you to do a talk, put that in."

Q: "Yeah OK."

Now I want to emphasise, this was not the end of the conversation. But I'm in favour of public engagement - perhaps a little more imagination is needed than just some generic schools talk, but it's interesting to see that this criterion is pushing people towards that little bit more public engagement.

Also: this is not a particularly unusual approach to filling in those impact pages. Impact is not supposed to be the tail that wags the dog, research excellence is supposed to be the number one criterion. But there are two whole pages which we have to use to say something about impact. And we know that the reviewers have got to read those pages, and rate us in terms of how strong or weak our pathways to impact are.

As I've said, impact is unpredictable. So what can you write, to make a reviewer say, "Yep, that's credible"? Your biggest impact might be to invent a whole new type of science, or to change the way we all think about the universe, but that won't happen for decades and it depends on a whole vague network of people taking your research and running with it. Can you talk about that? You could do, and that might be the truth about the likely impact of the research. But we know we'll get a bigger tick if we have something demonstrable that we can actually propose to do - even if it's not really connected with the research's biggest likely impact on society. A schools talk is a good thing to do, but is it the biggest impact your research will have on society in general? I hope not!

So, it happens quite often that people conflate public engagement with impact. A schools talk is not impact. An article in a newspaper is not impact. They might be tools that help spread research out of the university into the wider world, and they might facilitate impact, but they're not really the point of the hurdle that the government set for us.

Unfortunately, in science - unlike in politics - we formally review each others' work, and we can't hide behind woolly generalities. The strange thing is that regarding impact, the woolly generalities are the truth.

Tuesday 22nd November 2011 | science | Permalink

ISMIR 2011: the year of bigness

I'm blogging from the ISMIR 2011 conference, about music information retrieval. One of the interesting trends is how a lot of people are focusing on how to scale things up, to handle millions of audio files (or users, or scores) rather than just hundreds or thousands. Why? Well, in real-world applications it's often important: big music services like Spotify and iTunes have about 15 million tracks, Facebook has millions of users, etc. In ISMIR one of the stars of the show is the Million Song Dataset, just released, which should help many many researchers to develop and test on a big scale. Here I'm going to note some of the talks/posters I've seen with interesting approaches to scalability:

Brian McFee described a simple tweak to the kd-tree data structure called "spill tree" which improves approximate search. Basically, when you split the data in two you allow some of the data points to spill over and fall on both sides. Simple but apparently effective.
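As I understand it, a single split looks something like this - the overlap rule here is a simplification of mine, and the margin width is a made-up parameter:

```python
import numpy as np

def spill_split(points, dim, overlap=0.1):
    """One level of a 'spill tree' split: like a kd-tree split at the median,
    except that points within `overlap` of the median fall into BOTH
    children. The redundancy makes approximate nearest-neighbour search
    more forgiving near the split boundary."""
    vals = points[:, dim]
    med = np.median(vals)
    left = points[vals <= med + overlap]
    right = points[vals >= med - overlap]
    return left, right
```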

Dominik Schnitzer introduced a nice way to smooth out a search space and reduce the problem of hub-ness. One way to do it could be to use a minimum spanning tree, for example, but that involves a whole-dataset analysis so it might not scale well. In Dominik's approach, for each data point X you find an estimate of what he calls "mutual proximity": randomly sample 100 data points from your dataset and measure their distance to X, then fit a gaussian to those distances. Then to find the "mutual proximity" between two data points X and Y, you just evaluate X's gaussian at Y's location to get a kind of "probability of being a near neighbour". He also makes this a symmetric measure by combining the X->Y measure with the Y->X measure, but I'd imagine you don't always need to do that, depending on your purpose. The end result is a distance measure that pretty much eliminates hubs.
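Here's my reading of the idea as a sketch - the exact tail probability and the product combination of the two directions are my guesses at the details, not necessarily Schnitzer's precise formulation:

```python
import numpy as np
from math import erf, sqrt

def mutual_proximity(x, y, dataset, nsamples=100, rng=None):
    """For each point, fit a gaussian to its distances from a random sample
    of the dataset, then use the gaussian tail as a 'probability of being a
    near neighbour' of the other point. Symmetrised by multiplying the two
    directions together."""
    rng = rng or np.random.RandomState(0)
    def prob_near(a, b):
        idx = rng.choice(len(dataset), min(nsamples, len(dataset)),
                         replace=False)
        dists = np.linalg.norm(dataset[idx] - a, axis=1)
        mu, sigma = dists.mean(), dists.std()
        d = np.linalg.norm(a - b)
        # tail probability: how much of a's distance-gaussian lies beyond b
        return 0.5 * (1.0 - erf((d - mu) / (sigma * sqrt(2))))
    return prob_near(x, y) * prob_near(y, x)
```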

Shazam's music recognition algorithm, described in this 2006 paper, is one of the commercial success stories of scalable audio MIR. Sebastien Fenet tweaked it to be robust to pitch-shifting, essentially by using a log-frequency spectrogram and using delta-log-frequency rather than frequency in the fingerprints.
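To illustrate the principle with a toy of my own (not Fenet et al's actual fingerprint format): if you hash pairs of spectral peaks by their time gap and the difference of their log-frequencies (i.e. their frequency ratio), then a global pitch shift cancels out of the hash:

```python
from math import log2

def shift_robust_hashes(peaks):
    """Fingerprint successive (time, frequency) peaks by the time gap and
    the log-frequency difference, quantised coarsely. Multiplying every
    frequency by a constant (a pitch shift) leaves the hashes unchanged."""
    return [(t2 - t1, round(log2(f2 / f1), 2))
            for (t1, f1), (t2, f2) in zip(peaks, peaks[1:])]
```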

A small note from the presentation of the Million Song Dataset: apparently if you want a good online linear predictor that is fast for large data, try out Vowpal Wabbit.

Also, Thierry mentioned that he was a fan of using Amazon's cloud storage/processing - if you store data with Amazon you can run MapReduce jobs over it easily, apparently. Mark Levy of last.fm is also a fan of MapReduce, having done a lot of work using Hadoop (the open-source implementation of MapReduce, developed largely at Yahoo) for big data-crunching jobs.

Mikael Henaff presented a technique for learning a sparse spectrum-derived feature set, similar in spirit to KSVD. The thing I found interesting was how he then made a fast way of decomposing a new signal (once you've derived your feature basis from some training data). Ordinarily you'd have to do an optimisation - the dictionary is overcomplete so it can't be done as easily as an orthogonal transform. But you don't want to do that on a lot of data. Instead, he first trains a nonlinear projection which approximates that decomposition (it's a matrix rotation followed by a shrinkage nonlinearity, really simple mathematically). So you have to train that, but then when you want to analyse new data there's no optimisation needed, you just apply the simple transform.
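The analysis step he ends up with is mathematically tiny - something like this sketch, where W and theta would come from the training stage (the names here are mine):

```python
import numpy as np

def fast_encode(x, W, theta):
    """Approximate sparse coding in one shot: a learned linear transform W
    followed by a soft-threshold ('shrinkage') nonlinearity stands in for
    the expensive iterative sparse decomposition at analysis time."""
    z = W.dot(x)
    return np.sign(z) * np.maximum(np.abs(z) - theta, 0.0)
```

The soft threshold is what produces the sparsity: coefficients smaller than theta are zeroed outright, and the rest are shrunk towards zero.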

There's been plenty of interesting stuff here at ISMIR that isn't about bigness, and it was good of Douglas Eck (of Google) to emphasise that there are still lots of interesting and important problems in MIR that don't need scalability and don't even benefit from it. But there are interesting developments in this area, hence this note.

Thursday 27th October 2011 | science | Permalink

Separating the repeating part out of a piece of music

Via the SuperCollider users list I heard about a nice trick for extracting the repeating part out of a recorded piece of music. Source-separation, vocal-extraction etc are massive topics which I won't go into right now, but suffice to say it's not easy. So I was interested to read this nice simple technique (scroll down to "REpeating Pattern Extraction Technique (REPET)") described in an ICASSP paper this year.

Basically it uses spectral subtraction and binary masking - two of the simplest "source separation" tricks you can do to a signal. In general they produce kinda rough results - they don't adapt the phase information at all, for a start, so they can give some smeary MP3ish artefacts. But in this case the authors have applied them to a task where they can produce decent results: here you don't have to try and separate all the instruments out, you just want to divide the recording into two, the repeaty bit and the non-repeaty bit.

If you read the ICASSP paper you'll find they describe it well, it's a nice readable paper. (However they do make the task a bit more complex than it needs to be: they do a load of calculations then take the log-spectrum near the end, whereas if they took the log-spectrum at the start the calculations would be a little simpler.) The basic idea is:

  • Find the tempo of your piece of music. Then you know how long one bar is going to be.
  • Chop the music into bar-long sections, and average their spectrograms. This averaged-spectrogram should in theory represent the repeated bit, with the varying bits getting mostly washed out.
  • Use spectral subtraction to subtract this average from each bar-long segment.
  • Then, for each spectral bin, if there is a significant amount of energy left over, you mark this as being a bin belonging to some non-repeating audio. Otherwise you mark it as belonging to repeating audio.
  • Then you go back to the spectrogram of the original song, and silence the bins you want to get rid of (either the repeating or nonrepeating ones).
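Those steps translate into surprisingly little code. Here's a rough magnitude-only sketch - note that the "significant energy" threshold is a guess of mine, not the paper's exact rule:

```python
import numpy as np

def repet_masks(V, period):
    """V: magnitude spectrogram (freqs x frames); period: bar length in
    frames. Returns a boolean mask marking the non-repeating bins."""
    nfreq, nframes = V.shape
    nbars = nframes // period
    bars = V[:, :nbars * period].reshape(nfreq, nbars, period)
    model = bars.mean(axis=1)                           # the averaged bar
    leftover = np.maximum(bars - model[:, None, :], 0)  # spectral subtraction
    nonrepeating = leftover > 0.5 * bars   # 'significant' energy left over
    return nonrepeating.reshape(nfreq, nbars * period)
```

Applied to the original spectrogram, the mask (or its complement) tells you which bins to silence before resynthesis.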

From a theoretical point of view there are all sorts of quibbles you could come up with, for example that it might fall apart if a song has varying tempo. But for a fairly large range of tracks, this looks like it could give useful results.

So I decided to implement a real-time version in SuperCollider. I like real-time stuff (meaning you can work with audio as-it-happens rather than just a fixed recording), but the above approach is non-realtime: it takes the average spectrogram over the whole track, for example, so you can't calculate the first ten seconds until you've analysed the whole thing.

What to do? I replaced the usual averaging process with what I call a recursive average (can't find a nice online explanation of that right now, hm). You still need to know the tempo, but given the tempo you then have a real-time estimate of the average spectrum caused by the repeating bit.
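In case it's useful, the update I mean is just an exponentially-weighted moving average, one line per incoming frame (the coefficient value here is only illustrative):

```python
import numpy as np

def recursive_average(frames, coeff=0.05):
    """Real-time stand-in for the 'average spectrogram': each incoming
    magnitude frame nudges a running estimate, so old bars gradually fade
    out. `coeff` trades adaptation speed against smoothness."""
    avg = np.zeros_like(frames[0], dtype=float)
    out = []
    for frame in frames:
        avg = (1.0 - coeff) * avg + coeff * frame   # the whole trick
        out.append(avg.copy())
    return out
```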

One interesting thing is that when a new beat kicks in, it's not immediately detected as a loop - so usually, it plays through once and then gets suppressed. You might think of this as a system to separate "novelty" from "boring loops"...?

I've published this for SuperCollider as a UGen called PV_ExtractRepeat (available in the sc3-plugins collection).

Here's an example of it in action, applied to "Rehab" by Amy Winehouse. As you listen, notice a couple of things: (1) during the first bar there is poor separation, then it gets better; (2) the repeating-only bit (the rhythm section) sounds pretty good, could easily be used as a karaoke-version, while the non-repeating bit (mainly the vocals) sounds pretty messy...

Rehab - just the house by danmisc

Rehab - just the wine by danmisc

So, not perfect, but potentially useful, maybe for karaoke or maybe for further audio analysis. Thanks to Zafar Rafii and Bryan Pardo for publishing the method - note that their examples sound better than my real-time example here (real-time often means compromises in analysis).

Wednesday 16th March 2011 | science | Permalink

My PhD thesis now online

I'm glad to say the thesis corrections have been approved so my PhD thesis is now in its finished form - available here:

The title is "Making music through real-time voice timbre analysis: machine learning and timbral control". (Tip for future PhDs, try to choose a title that you can say in one breath...)

I'm really grateful to all the fab people in C4DM - I've got so much from being in a research environment with so many people knowledgeable about such a variety of cool things - and, well, I don't want to rewrite the whole acknowledgments here (they're on page 3) but all the people who took part in experiments or just chatted about research. (Including the folks at humanbeatbox.com)

The thesis is available under creative commons. And, because I uploaded it to archive.org they also seem to have converted it into some crazy ebook formats, so you can presumably read a garbled version of it on your kindle if you like ;) probably best to use the original PDF if possible, though (the TeX source is also included).

Saturday 7th August 2010 | science | Permalink

SMC 2010 conference notes

I've just been at SMC2010, the Sound and Music Computing conference. It's the first time I've been, so one question I had was: what differentiates it from other conferences in this research area like NIME, DAFX, ISMIR, ICMC? What's its specialist subject? The answer is that it deliberately tries not to over-specialise: they keep the topic broad to encourage cross-disciplinary thinking, and there's a good strong representation of young researchers, so it's a good place for fresh ideas and making new connections. My paper about timbre remapping came across pretty well, I think.

One reason I was keen to go to this conference was that it was hosted by UPF's Music Technology Group in Barcelona, because that group is the main place where people have done research along very similar lines to my PhD topic of beatbox-based control. It was great to meet Jordi Janer, whose PhD was about singing-based control, and Marco Marchini and Hendrik Purwins, who presented a poster about a kind of rhythmic beatboxing equivalent of the Continuator - give it a piece of rhythmic audio and it will try to continue by chopping up the sound and outputting patterns in (hopefully) the same style. The most interesting part of their work is the automatic approach to clustering: they hierarchically cluster all the sound events, and then let the system choose the appropriate clustering level (i.e. how many clusters to lump the events into) at playback time, by judging how 'informative' the markov-model resynthesis is at each level of clumpiness.

Also interesting was Ho-Hsiang Wu and Juan Bello's poster about representing the musical structure of a song. We all know that many songs have repetition in them, whether it's verse-chorus-verse-chorus or something else - and we can analyse this automatically from the audio, for example by detecting repeated sub-sequences of chord patterns or timbre. Their contribution is to visualise this detected repetition using 'arc plots', pretty little monochrome rainbows that reminded me of the kind of information aesthetics practised by Information Is Beautiful. The end result is that pop songs create little plots which generally all look quite similar but with little shape differences that you could spot by eye, whereas I imagine classical music pieces would probably each have their own visual signature that could be quite different. Could be a nice way to get an instant visual impression of the musical structure of a piece of recorded music.
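
For a toy flavour of the repetition-detection part - on made-up chord labels rather than real audio, and using simple n-gram counting as my own stand-in for whatever their actual method is:

```python
from collections import Counter

def repeated_ngrams(seq, n=4):
    """Find sub-sequences of length n that occur more than once in a
    sequence of chord labels -- a toy version of detecting repetition
    in a song's structure."""
    grams = Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))
    return {g: c for g, c in grams.items() if c > 1}

# A verse-chorus-verse-chorus song, as a (made-up) chord sequence:
verse, chorus = ["C", "Am", "F", "G"], ["F", "G", "C", "C"]
song = verse + chorus + verse + chorus
reps = repeated_ngrams(song, n=4)
```

Each repeated sub-sequence found this way would become one little arc in an arc plot, connecting its two (or more) occurrences.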

The keynote talk by Ricard Sole was thought-provoking, discussing the theory of complex networks, with some results of his created by applying this theory to languages, software, and other things. Sound and music weren't mentioned, but I know it's useful stuff that was food for thought for many people. (In our group we have some researchers who have looked at this kind of thing already - when you consider the network of MySpace bands & friends, for example, that's a complex network where issues of small-world-ness, hubs, etc. come into play. Which reminds me, I wonder how Kurt is getting on with his thesis... :)

In fact some of the research presented at SMC was grappling with these issues too, such as the work by Martin Gasser et al showing that the problem of hubs in music similarity (i.e. songs that keep getting returned as good similarity matches to various input songs, even if they don't sound that similar) may be affected by the "homogeneity" of the audio in the music database.

The concert programme was packed full of things: lots of soundscape-based work, and more generally electroacoustic stuff. My favourites out of those were Impulsus I by Lina Bativa (an audio-visual piece which had a great narrative energy despite being really abstract), and Juan Parra Cancino's reacTable performance which I mentioned in my post about the reacTable.

But one of the things I was most grateful for was the deliberate non-art-music session. Electroacoustic stuff is all very well, but I can't generally cope with so much of it packed into a week, and after all, this is a broad conference where many of the researchers are working on pop music, techno, breakbeats, and stuff like that. As the conference chair (Xavier Serra) said, it's actually quite difficult to get the non-art-music into the conference, since research conferences aren't usually their scene and most of the people making good tech-enhanced popular music are quite happily performing in front of normal crowds... So, many of us were glad to spend an hour listening to Japanese pop made using Vocaloid, and a dance set made using Loopmash. (Sergi Jorda also told me he had hoped to get a dance music set in the reacTable concert, but the performer wasn't available.)

This is something that we need to work on as a research community - the SMC hosts did well, assisted by the fact that some of their own technology has gone directly and quite notably into music tech used by producers - but it's one of those things that's going to need a constant bit of extra effort to try and encourage that kind of thing into these conferences.

Monday 26th July 2010 | science | Permalink

Automatic birdsong analysis

I've started my first project after my PhD, a small feasibility study into automatic birdsong analysis.

The picture visualises a few seconds of a skylark recording by Dr Elodie Briefer (in QMUL's School of Biological and Chemical Sciences), from her PhD research into the structure of skylark song.

What we're doing is looking at the potential for automatically analysing birdsong signals, which could mean picking them out of recordings, identifying species, identifying individual "syllables" in the song... who knows.

There are already a fair few published research papers about automatic birdsong analysis. I'm looking at the state of the art to determine the scope for future work, such as applying machine learning techniques we've developed in our group, or particular forms of signal analysis such as adaptive transforms.

In my PhD I was looking a lot at voice and music. Birdsong has interesting similarities to both music and spoken language - plus differences of course. So watch this space. And of course get in touch if you're interested.

Monday 5th July 2010 | science | Permalink

Unpredictable impact

There's a big change happening in UK science+engineering at the moment, and it goes by the name of Impact. What does it mean? When we do science we often do it just to find new things out, yet whether we intend it or not one of the great things about science is that it actually makes important changes to the world outside our research group. Impact is formally defined as being that effect that we have - on business and economy, on health, on public policy, on culture and the arts. There are billions of ways that impact spreads.

This has always been a very unpredictable thing and pretty hard to measure, so the government has now created a formal process for trying to account for the types of impact that we get out of research - and, further, to think hard about impact when deciding what research to fund. In a lot of cases the predicted impact will now account for up to 25% of the considerations in rating academic departments or allocating funding.

Sounds reasonable? Well many scientists are against it - and it's not because they don't like having to justify themselves (they already have to do that when they write grant applications etc), but because the real impact of science often happens in surprising ways, sometimes many years down the line. Take DNA fingerprinting for example. The scientists who came up with it were working with DNA, trying to measure various things, but they had no idea that the best thing they could do was make an unruly collection of DNA form patterns on a sheet of film - they discovered it by accident. And now it's an important part of many of the most serious court cases we have. Think of all the people who were convicted or freed based on DNA evidence - that's some serious impact there.

There are lots more examples of unpredictable impact - such as:

  • Email, when it was invented, was only able to send messages to people using the same mainframe. No-one predicted that tweaking it to send messages around the world would make it one of the most important communication tools we have.
  • Gregor Mendel - a lone priest planting peas in a garden, trying out different cross-breeds and making careful notes. It wasn't until years after his death that biologists realised how Mendel's laws of inheritance fit with Darwinian evolution, and formed the foundation of modern biology, with massive impact throughout society.
  • Texting. A phone is for phoning, right? Text messages were never planned to be the mainstay of what mobile phones were about, just a way to get a message through when you couldn't talk. But now many people text more than they call.
  • Liquid crystal displays eventually arose from the basically curiosity-driven research of Friedrich Reinitzer looking at the chemical cholesteryl benzoate. Now it's used in TVs, phones, watches...
  • Fibre optics was demonstrated as a curiosity and a demonstration of physical principles in the 19th century; but it wasn't until way into the 20th century that it became important for data transmission, for example in phone networks.

And the opposite is also true - history is littered with examples of discoveries/inventions that were widely expected to change the world, but didn't:

  • Video messaging: the phone companies seem to have thought that if we liked text messaging we were going to love video messaging. No.
  • Artificial intelligence: In the 1960s the artificial intelligence research community was an incredibly optimistic one, with leading lights such as Marvin Minsky basically thinking they would be able to recreate the intelligence of a whole human brain within a few years, and then we'd all be having conversations with robot pals. That optimism came crashing down. Sure, you can now buy robot pals, and sure, we're still researching artificial intelligence and indeed using it in various applications, but it hasn't yet been the revolutionary impact it was going to be.
  • Hovercraft and maglev: these have become the clichés of misplaced futurology. After their invention they seemed poised to take over the world - but no, we're still mostly using the good old wheel to get around.

So with all this evidence, it's not surprising that scientists are worried about this new approach of trying to plan your impact - much of the curiosity-driven stuff that has real impact could well get sidelined in favour of things which might be a bit less imaginative but which seem like they'll definitely make some public or business connection.

OK fine - seems like there's some misguided bureaucracy coming down from government, and we have to try and make sure it doesn't end up stifling what it's supposed to be helping. But there's a bigger question that maybe we can think about. As I've said, "impact" is very hard to pin down or predict, and we don't really know how predictable it could or should be. But in many grant applications and suchlike, scientists are now writing down their predictions about the impact they'll have. Are those predictions useful data? Could we use "impact plans" as a great big study about whether impact can be predictable?

We could for example wait for five years, then look back at the pile of impact plans and ask, how many of those predictions (the ones which got funded, at least) came true? What percentage? What proportion of the observable scientific+engineering impact made over the next five years will have been predicted, in writing, in advance?

It would still leave a million questions unanswered, especially about unidentifiable impact (subtle things which are hard to count), long-term impact, and really it would still be a very reductive way to think about how science affects our society. But I wonder... would that make all these "impact statements" worth their while?

POSTSCRIPT: Some further examples of unexpected impact, collected after this article was written:

  • Nice unpredictable-impact example from Martin Rees in his Reith lecture (Tuesday 1st June 2010), of the laser: its possibility was there in Einstein's ideas at the start of the 20th century; then there was a 40yr gap before it was actually conceived and made to happen; and the inventors of the laser would never have predicted DVD players and laser eye surgery 40yrs after that...
  • The bizarre and useful technique of optogenetics was enabled after researchers studied light-sensitivity in "pond scum" microbes.
  • "DNA restriction enzymes, once the province of obscure microbiological investigation, ultimately enabled the entire recombinant DNA revolution." (Quoted from this Science editorial 2013)
  • "Measurement of the ratios of heavy and light isotopes of oxygen, once a limited area of geochemistry, eventually allowed the interpretation of prior climate change." (Quoted from this Science editorial 2013)
Friday 4th December 2009 | science | Permalink

Tree recursion, python/octave/matlab/sc3, informal benchmark

I'm writing a tree data structure as part of my research. I'm not going to describe the algorithm in detail, but it takes a set of data points and repeatedly chops them into two groups so that you can divide a dataset up into spatial subgroups.
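
For flavour, here's a toy numpy sketch of this kind of recursive binary splitting. Note that the splitting rule here (median cut along the widest axis) is just a stand-in, since I'm not showing the real algorithm:

```python
import numpy as np

def build_tree(points, min_leaf=8):
    """Recursively chop a set of points into two spatial subgroups.
    Leaves store the points; internal nodes store the split."""
    if len(points) <= min_leaf:              # small enough: make a leaf
        return {"points": points}
    # Stand-in rule: split at the median along the widest dimension
    dim = int(np.argmax(points.max(axis=0) - points.min(axis=0)))
    split = float(np.median(points[:, dim]))
    left = points[points[:, dim] <= split]
    right = points[points[:, dim] > split]
    if len(left) == 0 or len(right) == 0:    # degenerate split: give up
        return {"points": points}
    return {"dim": dim, "split": split,
            "left": build_tree(left, min_leaf),
            "right": build_tree(right, min_leaf)}

# 1000 artificial 3D points (the real test data was cubic + toroidal)
rng = np.random.default_rng(0)
tree = build_tree(rng.standard_normal((1000, 3)))
```

The recursion depth grows only logarithmically with the dataset size, so the per-node overhead of the host language matters a lot - which is roughly what the benchmark below probes.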

Anyway, my first implementation (in SuperCollider 3) was running fairly slowly so I tried it in three other languages, to see which would be most practical for my situation.

It's an informal kind of benchmark - informal cos I'm not going to show you the code, and I haven't run the tests dozens of times, etc. (Some of the tests I ran just once, since they took so long.) The datasets consisted of artificially-generated 3D points sampled from a mixture of a cubic and a toroidal distribution. In the following graph, lower results (shorter times) are better:

The results show a couple of interesting things. SuperCollider was my starting point and it was never developed for large data-crunching tasks so I'm not surprised that it becomes the worst performer once we get to large datasets, although it actually doesn't do too badly. To be ten times as slow as Python or Matlab on big datasets is not embarrassing when both of those have had so many more person-hours of development effort specifically for big data crunching.

The comparison against Octave is illuminating. Octave was originally my open-source Matlab alternative of choice, but I've come to feel like it has all the drawbacks of Matlab (mainly the godawful design of the Matlab language) and none of the advantages (under-the-hood optimisation tricks, great plotting). Here I was running exactly the same code in Matlab (7.4) and Octave (3.0.5). I expected Octave to be roughly competitive, since this branching recursive code is quite difficult to auto-optimise, but Matlab generally handles it something like ten times as fast. So here I find another sign that Octave isn't quite there.

I now know, of course, that Python + numpy is the open-source Matlab alternative of choice. The language design is much better, and numpy (the module that provides all the matrix-crunching tools) has undergone lots of development effort and become better and better. And this (informal!) benchmark shows python (2.5.4, with numpy 1.3.0) performing just as well as Matlab on the large data.

(There is one thing that Python definitely lacks compared to Matlab: decent well-integrated 3D plotting. matplotlib doesn't have it except in old deprecated versions; python's gnuplot interface is poorly developed; other python plotting libs have drawbacks such as non-interactivity. I've mentioned this before.)

So I'll probably be using my Python implementation of the tree data structure. It's right up there in terms of speed, plus the code is conceptually cleaner than the Matlab version, so it'll be easier to maintain and easier for others to grok, which makes it better for reproducible research. Remember, this benchmark was only informal so do your own tests if you care about this kind of thing...

Tuesday 10th November 2009 | science | Permalink

Are probiotics real, or meaningless?

Today Danone was forced to withdraw an advert for probiotic yoghurt because the scientific evidence didn't support it. The company claimed it boosted children's "defences" and cited various research studies to support it. The Advertising Standards Authority read the studies and found that although the studies were good, most of them weren't about the children in question, some of them used the wrong dosage of yoghurt or an inappropriate test group, and overall the results were inconsistent and didn't particularly support the claim.

I'm interested in this because probiotics is one of those weird new turns in commercialism in which you can't quite tell if there's real science there, or if there is nothing but an actor on screen grinning and rubbing her belly, saying "I trust good bacteria" over and over again.

I've heard some scientists saying that probiotics have been shown to be good for ill people recovering in hospital (whose natural gut flora might need "topping up") but that the evidence isn't there yet for any point at all in healthy people gulping down these yoghurts once a day as if they were your daily medication.

There are moves afoot in the EU which sound to me like a good idea. In 2006 a new EU law came in, stipulating that all medical-sounding marketing claims must be verified, and there is now a committee which looks at the evidence and pronounces yes or no on them. The claims for various yoghurt drinks, as well as all kinds of other products, have been submitted to this committee. It made the judgment that general probiotic claims aren't supported by evidence, although it'll be looking at more specific manufacturers' claims later.

The change hasn't actually come into force yet, but when it does, hopefully it won't be down to us to peer at the TV advert and think to ourselves, "Is that science or is that bullshit?" - it's only reasonable that we shouldn't have to do that, and companies should have to prove their stuff works before they parade it around in scientific clothing.

Wednesday 14th October 2009 | science | Permalink

InterSpeech09 conference: emotional speech

The InterSpeech conference was in Brighton this year - now, my research is all about "non-speech" voice (e.g. beatboxing) but I took the opportunity to go down and see what the speech folks were up to.

Automatic speech recognition is the "traditional" problem for computers+speech, but there's been a tendency recently to try and automatically recognise the emotional content too. This year was the first year of the InterSpeech "emotion challenge", in which researchers were challenged to automatically detect a range of emotions in a dataset of audio - recorded from schoolchildren who were trying to guide an Aibo round a track, apparently with emotive consequences...

I was surprised that many of the approaches to emotion recognition were so similar to the standard speech-recognition model: take MFCCs plus maybe some other measurements, model them with GMMs, classify the results (maybe with a HMM), so far so 1960s. The spectral measures (MFCCs) were typically augmented with prosodic measures such as the amount of pauses in a sentence, or measures of how the speaking pitch varied, and in quite a few of the papers these prosodic features actually performed pretty strongly, often beating the spectral features. But I was surprised they were still relatively simple measures - no intricate prosody-specific models of temporal variation, for example; most seemed to use just the average, minimum and maximum pitch. Combining the two types of data (spectral plus prosodic) was often best but didn't seem to give a dramatic uplift versus using just one type. I suspect that more specific models could push the prosodic side a long way in the next few years. The winner of the "emotion challenge" was a kind of hand-designed decision-tree approach, pretty nice because they'd designed the classifier from theoretical motivations.
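
To show just how simple those prosodic measures are, here's an illustrative Python sketch (the feature names and the pause measure are my own invention, not any particular paper's feature set). It assumes you already have a frame-wise pitch track, with zeros marking unvoiced/pause frames:

```python
import numpy as np

def prosodic_features(f0):
    """Summary prosodic features from a frame-wise pitch track.
    f0: pitch estimates in Hz, with 0 marking unvoiced/pause frames.
    (Illustrative feature set only.)"""
    voiced = f0[f0 > 0]
    if len(voiced) == 0:  # no voiced frames at all
        return {"mean": 0.0, "min": 0.0, "max": 0.0, "pause_ratio": 1.0}
    return {"mean": float(voiced.mean()),
            "min": float(voiced.min()),
            "max": float(voiced.max()),
            "pause_ratio": float((f0 == 0).mean())}

# A made-up pitch track: a short utterance with pauses either side
f0 = np.array([0, 0, 180, 190, 200, 0, 210, 220, 0, 0], dtype=float)
feats = prosodic_features(f0)
```

A handful of scalars per sentence like this is all many of those systems fed to the classifier alongside the MFCCs.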

One thing about "emotion" is the same problem as for "timbre" (the musical attribute which I deal with in my research): it's still very hard to pin down exactly what you mean by it, specifically whether it's a continuous attribute or a set of categories. It seems that many datasets are labelled categorically - people mark a given word or sentence as being neutral/scared/happy/anxious/etc. But increasingly people are focusing on the continuous approach where emotion is treated as a 3D space, where one dimension is "arousal" (varying from calm to excited), one is "valence" (bad to good), and one is "potency" (dominated to dominant). If you combine those 3 dimensions variously you can cover the standard emotions pretty well (excitement, depression, boredom, anger, etc etc). This 3D approach gets around various cultural issues in the exact meaning of the labels, allows for some more refined analysis, and I believe it comes from a pretty well-validated area in psychology, although I don't know the literature on that.

Oh and there was a nice talk about automatically analysing and detecting laughter. Laughter is characterised by the bouts of vocal effort we push in, via the lungs and the tension in the vocal folds. That distinguishes it quite well from ordinary speech. So what these people did was a nice simple technique to estimate the glottal pulses (the moments of energy that come from our vocal folds), and to spot when these became more effortful and more frequent. You can't use an ordinary pitch tracker because each laugh is far too brief for a standard tracker to latch on to the quick pitch changes, but their custom analysis (plus a very basic classifier) seemed able to detect moments of laughter in TV talk shows etc. The analysis method (the zero-frequency filter) is technically very simple and potentially a useful trick...
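
Here's my own loose numpy reconstruction of the zero-frequency-filtering idea as I understood it from the talk - repeated running sums (a cascade of resonators with all poles at zero frequency) followed by local-mean trend removal - so treat it as a sketch rather than the authors' actual method:

```python
import numpy as np

def zero_freq_gci(x, win):
    """Rough sketch of zero-frequency filtering for glottal pulse detection.
    x: audio samples; win: trend-removal window in samples, roughly one
    pitch period (an odd length keeps the moving average centred)."""
    y = x.astype(float)
    # A zero-frequency resonator (both poles at z=1) is two running sums;
    # applying it twice makes four cumulative sums in total:
    for _ in range(4):
        y = np.cumsum(y)
    # The output grows polynomially, so strip the trend by repeatedly
    # subtracting a local mean:
    kernel = np.ones(win) / win
    for _ in range(2):
        y = y - np.convolve(y, kernel, mode="same")
    # Positive-going zero crossings mark the candidate glottal pulses:
    return np.where((y[:-1] < 0) & (y[1:] >= 0))[0]

# A sine with a 100-sample period stands in for a voiced sound:
x = np.sin(2 * np.pi * np.arange(4000) / 100)
pulses = zero_freq_gci(x, win=101)
```

On real speech you'd set `win` from a rough pitch estimate; for this synthetic input the detected pulses come out spaced one period apart, away from the edges where the trend removal misbehaves.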

Saturday 12th September 2009 | science | Permalink

Does processed meat cause cancer?

It's been on the radio news this morning, so it's timely that David Colquhoun has written this excellent article about diet and health. He goes through what the scientific evidence can and can't say about questions such as "Does eating processed food cause cancer?" - it's a long article but really clears things up.

Monday 17th August 2009 | science | Permalink

Vitamin supplements: avoid them?

This caught my eye in the paper this weekend: someone wrote in to the doctor's column asking if they should take vitamin A and E supplements to prevent cancer and heart disease, and the doctor's response was:

Several long-term and large trials have shown that taking extra vitamins A (such as betacarotene) and E does not reduce heart attack risk. In fact, some of the trials were stopped because there were more deaths in the vitamin groups than in those given placebos. As long ago as 14 June 2003 the Lancet reviewed the evidence and strongly discouraged any more research into the long-term use of such vitamin supplements. We get enough for our needs from a normal diet.

Blimey! I already knew that vitamin supplements were pointless (for healthy people) as long as you eat right. But do they actually do harm?

The doctor was referring to this 2004 review in the Lancet, which is a pretty good source. A web search also finds a 2008 Cochrane review of the evidence (another good source, but it's essentially an update of the earlier paper), which concludes:

We found no evidence to support antioxidant supplements for primary or secondary prevention. Vitamin A, beta-carotene, and vitamin E may increase mortality. Future randomised trials could evaluate the potential effects of vitamin C and selenium for primary and secondary prevention. Such trials should be closely monitored for potential harmful effects. Antioxidant supplements need to be considered medicinal products and should undergo sufficient evaluation before marketing.

This is pretty scary. According to these authors, there's no evidence that these supplements prevent cancer, but there are hints that they might increase mortality? Such meta-analyses, when done properly, are very good ways to summarise the current state of research, but they're not set in stone - for example, when that review was published in the Lancet, the next issue featured responses from authors of some of the studies involved, who took issue with the general conclusion. But if the possibility of a negative effect looms strongly enough out of a systematic review like this, then it certainly needs to be considered.

Even this year more evidence arrives: this 2009 study finds that supplements of vitamins C or E or beta-carotene have no statistically significant effect on mortality (they don't increase or decrease the risk of death).

A couple of things to note:

  • This isn't about all vitamins, just about the vitamins mentioned above. As one correspondence notes, most people don't get enough Vitamin D, so maybe it's still worth taking Vitamin D supplements? (I haven't looked up any evidence about that yet.)
  • This is about vitamin supplements, not about vitamins in general. Fresh fruit and veg is a much better source of these vitamins in my opinion, and the evidence would seem to bear it out: here's a 2003 review which says, "A great deal of epidemiologic evidence has indicated that fruits and vegetables are protective against numerous forms of cancer." And here's a 2005 review which says a similar thing, and considers reasons why fruit and veg might be better than supplements.
Tuesday 28th July 2009 | science | Permalink

How does a PhD affect your salary?

In the lab we're chatting about what effect a PhD has on your career and your earning potential. This article is slightly old (2001) but it has some solid figures which are interesting:

Seems that a PhD in an electrical-engineering discipline (the closest match to ours) could raise your salary by around 8 or 9 percent.

Of course the economic car-crash puts a lot of things in question. But I'm glad at least that a PhD doesn't on average push your salary down, which some people say (and maybe it's true for some disciplines).

Wednesday 22nd April 2009 | science | Permalink

Distance analysis methods: Multidimensional Scaling and SplitsTree try to unravel the Tube map

In scientific research, one of the things you sometimes need to do is take a set of distance measurements (e.g. "it's 5 metres from A to B, 4 metres from A to C, and 3 metres from B to C") and try to reconstruct the actual spatial layout underlying that data.

So how to do it? Well one approach is Multidimensional Scaling (MDS) and it's been known for a few decades in timbre research. It assumes that the data exist in a Euclidean space (a pretty straightforward space like ordinary 3D space we're used to) and arranges the points in a layout that gives the least disagreement with the distance measurements. So if we have a set of musical timbre judgments (e.g. "a bassoon sounds quite like an oboe, but not much like a violin") we can try and force those objects into a spatial arrangement that suits their relationships, and then view the resulting map.
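
Classical ("Torgerson") MDS is only a few lines of numpy. Here's a sketch using the toy A/B/C distances from above (5m A-B, 4m A-C, 3m B-C):

```python
import numpy as np

def classical_mds(D, ndim=2):
    """Classical (Torgerson) MDS: recover coordinates from a matrix of
    pairwise distances D, assuming they came from a Euclidean space."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centring matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centred squared distances
    vals, vecs = np.linalg.eigh(B)           # eigendecomposition (ascending)
    idx = np.argsort(vals)[::-1][:ndim]      # keep the largest eigenvalues
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0))

# The A/B/C example: 5m A-B, 4m A-C, 3m B-C
D = np.array([[0., 5., 4.],
              [5., 0., 3.],
              [4., 3., 0.]])
coords = classical_mds(D)
```

Since 3-4-5 is a genuinely Euclidean (right-angled) triangle, the recovered 2D coordinates reproduce the distances exactly; with messy timbre judgments you'd instead get the least-bad compromise, which is exactly where the trouble discussed below starts.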

But there's a problem. Who in the world said that audio timbre behaved like a standard Euclidean space? Does it depend on context? (Yes.) Is the difference between A and B always the same as the difference between B and A? Does timbre behave more like categories (e.g. woody vs metallic vs watery) than like a space?

That's a big problem and there's no clear solution. I saw a talk by Ashley Burgoyne at ICMC 2007 which suggested some modifications to MDS to help account for the weirdness of timbre-space. Some of it makes intuitive sense: e.g. the use of "specificities" builds in the idea that one data-point may be more unique than it should be, having its own special distance to cover the fact that a trumpet sounds uniquely different from everything else. And he argued that the nonlinear versions coped better with the evidence about timbre judgments.

Then I heard about another completely different approach. Geneticists have developed rather clever ways of analysing the genes of different creatures, to produce "genetic distance" measures and then use those to reconstruct what the evolutionary tree could have been. The maths can be applied to any set of distance measurements (aha!) and creates a tree that best represents them - the "tree" is actually a kind of space, not the same as Euclidean space.

For an introduction to the maths involved, see Metric Spaces in Pure and Applied Mathematics.

I needed to get my head around how this approach might work, and whether it might be useful. So I decided to apply it to a weird space in which distance measurements might not correspond to actual spatial distances... the Tube map.

If you've been on the Tube you'll know that some journeys take longer than they should, and the durations don't actually match up with the geographic distances covered. You'll also know that the Tube map itself is highly nonlinear: the geographical layout is warped to make it neat and easy to read.

So I took this section of the Tube map:

and from the web I found two different sorts of data:

  1. how long it takes (in minutes) to walk from one station to another, overground;
  2. how long it takes to get from one station to another by tube.

Now the first set of data should be "more Euclidean" since walking is basically going in a straight line except for the buildings in the way; while the tube timings should be weirder because you're strongly constrained, there's only a few pipes you can go down and they don't always connect up in all the obvious ways.

So when you feed the walking-times into MDS you get this (I've painted the tube-lines back onto the map to make things more obvious):

Not bad eh? The arrangement is actually quite a lot like the real-world layout of the tube stations.

And here's what happened when the same walking-times were fed into SplitsTree:

Yes, it kind of works, except that Russell Square pokes out a bit weirdly, I think due to the algorithm's requirement that the data points sit at the edge of the graph. The SplitsTree representation is almost-but-not-quite happy to represent the data in 2D, shown by the patchwork of almost-rectangles.

Here's where the differences really show up though: the tube-timing data. The walking-time data was "easy"...

Tube-timing data after MDS:

Tube-timing data after SplitsTree:

Note that both algorithms push the circle line (the yellow line) away from the others, out towards the top-right of the space. That's because the circle line, although it crosses over the others, doesn't have as many intersections as it might do (it doesn't have a stop at Euston or Warren Street, for example). Both algorithms spot that Kings Cross is a hub in this network (meaning it's easy to get to most of these stops from Kings Cross), placing it right at the heart of the layout. More generally, neither algorithm reconstructs the geographical layout of the stations, simply because the time it takes to get from A to B isn't so much defined by geography but by the peculiarities of London Underground.

The SplitsTree representation seems here to use a lot of 3D boxes, and there are some convoluted goings-on inside the way it tries to rationalise all the distances.

Notice also that on the SplitsTree diagram, most stations have their own little spike to live on. These are similar to the "specificities" I mentioned earlier - each tube station takes that little bit of extra time because of the time needed to get up and down the escalators (or whatever). For the Piccadilly (dark blue) line, SplitsTree seems to suggest that the majority of the time taken is in getting up and down and the actual journey between stations is pretty quick, which I think pretty much reflects reality.

I did all this in order to try and grok the tree reconstruction algorithms. Not sure if I've got there yet, but this was definitely helpful...

Wednesday 11th February 2009 | science | Permalink

10 new PhD places in Media and Arts Technology

Our research group has 10 new fully-funded PhD places in Media and Arts Technology thanks to a big grant we've been awarded. The places include working with an industrial partner such as last.fm, the BBC, or Sony. If you know anyone who might be into that, let them know...

Tuesday 10th February 2009 | science | Permalink

Chaos theory is like biology used to be

Looking through the International Journal of Bifurcation and Chaos, the thing that strikes me is that chaos theory seems to be at the same kind of point that biology was at, before Darwin's work gave it a structure and an explanation. In the 19th century biologists would publish articles describing new species they'd found, saying it's a bit like this one, a bit like that one, but without evolution and genetics you can't really say much more than that - and you get the same feeling from modern chaos papers: look, I've found a new chaotic attractor, it's a double-scroll, it makes patterns like this.

There are all sorts of ways of categorising chaotic systems, characterising their general surface behaviour, even controlling them, but it looks like nothing really gets to the heart of what's going on. Is the study of chaotic systems waiting for some big explanation?

Wednesday 4th February 2009 | science | Permalink

My work in a BBC radio programme

The BBC reported on the "Augmented Instruments" concert that Jean-Baptiste Thiebaut organised a couple of weeks ago. As part of the feature, I gave a quick demo of my beatboxing synthesiser interface... Check out the podcast of the radio programme:

Tuesday 23rd September 2008 | science | Permalink

Some notes from DAFx08

Just returning from DAFX 2008 in Espoo, Finland, which was a good do. My first visit to DAFX - it's a smaller and friendlier conference than some others I've been to, a nice size (about 120 people). Met up with lots of good digital audio people, some new, some old. Some notes about a few topics that came up:

  • Vesa Valimaki's digital sound synthesis tutorial was good, including some tips about low-cost synth techniques ("Differentiated Parabolic Wave") coming from his lab, new to me. Similarly Ville Pulkki's spatial sound tutorial and demo, featuring the DirAC technique which seemed to give some nice sonic results.
  • Our lab was well-represented, and it was nice that Anssi Klapuri picked up on Becky Stewart's spatial music navigation ideas in his keynote. My talk on voice timbre went fine too, despite the interruption of an automatic blackboard...
  • The keynote by Jyri Huopaniemi (of Nokia) didn't have as much news as I was hoping, but it was nice to see a bit about how the Princeton group's mobile-phone synth system is put together: a Python interface onto a C++ synthesis core.
  • Naofumi Aoki's poster on bandwidth extension of mobile phone audio was interesting, although not specifically for the bandwidth extension but for the steganography trick used to embed metadata into audio. This means you can do fancy things with mobile phone audio without having to change the way the worldwide phone system works...
  • There were quite a few good papers about guitar synthesis and guitar amp emulation, etc. Worth mentioning is Fredrik Eckerholm's guitar synth, just because to my ears it sounded very nice and had a lot of features (e.g. pickup placement, pick parameters).
  • Jari Kleimola's sound synthesis trick - essentially XOR on audio - caught a few people's attention, making some quite nice sounds despite its simplicity.
  • Damian Murphy's results on the quality of different DWM reverb techniques were interesting, although it's not my field so I can't judge it in detail.
  • Was nice to see spectutils which is a nice set of spectrogram plotting tools for GNU Octave. Should be useful.

The conference banquet was v good too, good food and in a really nicely-architected building called Dipoli. Also had a good time in and around Helsinki but I've documented that elsewhere.

Saturday 6th September 2008 | science | Permalink

Beatboxing with a very different voice

Someone has written a very nice popular-science-type article... about me :)

Friday 27th June 2008 | science | Permalink

My reading list: the past 18 months

I decided to make a public archive of my BibTeX file - i.e. almost everything I've read, or not read, in my PhD so far.

This bibliography might be useful to people interested in sound/music technology, vocal timbre, real-time audio processing, etc.

The general angle of my research topic is summarised on my QMUL homepage

Wednesday 11th June 2008 | science | Permalink

Laryngographs of my beatboxing

So after I beatboxed for the scientists they've sent me some of the output from the laryngograph tests. Here it is!

First of all here's me doing a kick-drum-plus-bass sound:

  • laryng-kick-drum-plus-bass.mp3 WARNING: this is NOT a normal recording. On the LEFT channel you get the normal recording from a microphone, and on the RIGHT channel you get the direct output from the laryngograph - essentially, you get to listen to what my larynx is doing itself, without any of the complicated stuff that happens afterwards (in the throat, lips, tongue). Use your computer's left-right balance controls to choose what to listen to.

Here's a picture of that same clip:

Laryngogram of kick-plus-bass

In that picture the audio recording is the blue "Sp" line in the middle, and the larynx trace is the green "Lx" just below it - the signal goes up when vocal cords close, goes down when they open.

Towards the end of the clip my larynx is opening and closing normally, a regular opening-and-closing just like in normal speech. But towards the beginning it's a bit more chaotic than that, and it almost looks like there are two different frequencies competing to take over. I'm not entirely sure what this implies, but the researchers pointed that feature out, and maybe it's connected to the sound that's produced somehow.

OK, now here's a bit of "vocal scratching":

  • laryng-vocal-scratching.mp3 WARNING: this is NOT a normal recording either. On the LEFT channel you get the normal recording from a microphone, and on the RIGHT channel you get the direct output from the laryngograph.

Here's a picture of that same clip:

Laryngogram of vocal scratching

The main thing they were looking at on the scratching was the very fast pitch changes - look at the lowest panel and the green "Fx" line, which is the fundamental frequency. It changes by up to one-and-a-half octaves in 150 milliseconds, which apparently is ridiculously fast. Now I'm not the best vocal-scratcher in the world, so I bet that it goes even faster than that for others...
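To put that rate in perspective, the arithmetic is simple (the 200 Hz starting pitch below is a hypothetical figure, not a measurement from my clip):

```python
# Back-of-envelope rate for "one-and-a-half octaves in 150 milliseconds".
octaves = 1.5
seconds = 0.150

rate_octaves_per_s = octaves / seconds           # 10 octaves per second
rate_semitones_per_s = rate_octaves_per_s * 12   # 120 semitones per second

# In frequency terms, a hypothetical 200 Hz starting pitch would sweep to:
f0 = 200.0
f1 = f0 * 2 ** octaves                           # about 566 Hz
```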

Thursday 13th March 2008 | science | Permalink

They put a CAMERA up my NOSE

And it was all in the name of science. I volunteered for an experiment which wanted to look at beatboxers' voice-boxes while they were beatboxing, so I went and let someone put a camera up my nose (a nasal endoscopy). This was also being filmed for a Science Museum beatboxing project, so as well as the actual scientists there was a one-woman film crew plus a Science Museum person co-ordinating the thing and handing me the SM58 so I could bust some beats in the little clinic room.

I couldn't see the screen so I wasn't sure what my larynx was looking like but I dropped some of the usual beatbox stuff (some old-school hip-hop ones, a slightly poor DnB one, a quick rendition of If Your Mother Only Knew) and they seemed interested in what was happening. They'll take a while to do a proper analysis of the results but apparently there's a lot of muscular activity happening around and above the larynx while I'm doing kicks and snares and suchlike.

Some voice specialists are worried that beatboxing is bad for your voice so it was good to know that, after 7 years of beatboxing, I don't seem to have anything weird or wrong with my vocal folds - I'm not doing myself any damage.

One of the sounds that worries specialists is vocal scratching, so I gave them a bit of that. They confirmed that it involves a lot of constriction to produce that sound, and they also confirmed that there are lots of really fast pitch changes (one-and-a-half octaves in 150 milliseconds!). Whether that means it is bad for you I'm not sure. I don't actually do much vocal scratching myself.

There'll be more sessions, and at some point there'll be a video online, but that's all for now. I have a printed-out photo of my larynx but you don't want to see that ;)

There were also some tests with a laryngograph, which showed some of the controlled-weirdness involved in beatboxing, and some interesting discussion about whether super-deep bass tones were bad for you or not. The "received wisdom" is that they're dangerous since they involve your "false vocal folds" pushing down on your real vocal folds, but some researchers have evidence that if you do it right, that's not what's happening, instead your false vocal folds are basically flapping on their own. Watch this YouTube video on "Extreme vocal effects" to see what's happening when singers make deep growly sounds...

Tuesday 11th March 2008 | science | Permalink

Echinacea: Science says

There was an advert on the tube claiming echinacea could reduce my chance of developing a cold by 65%. Blimey, a big claim. So I went and found the source of the claim, and a couple of other review papers. My summary of the research is this:

  • Although it's hard to be certain (partly because there are so many different sorts of echinacea plant and different ways to prepare it), it does look like echinacea helps to shorten the duration of a cold and make it less severe. It might also prevent a cold happening in the first place, but that's less clear. The most likely useful type of echinacea is echinacea purpurea.

There are all sorts of caveats on this summary. Firstly it's not recommended for children, or for people with immune problems such as arthritis or HIV, or people who might have an allergic reaction. Secondly we basically don't know how it might work (it contains a few chemicals that probably interact with the immune system... but in what way?). Thirdly we need more big studies before we can be sure about the effect on outcomes - so the picture might change, might even change dramatically, as more science gets done.

But I want to emphasise: in terms of real-life evidence, echinacea has much better evidence than homeopathy, or than other herbs or other such stuff you might find in that same section at the chemist's.

My main sources for all this are two recent research summaries, a meta-analysis published in a Lancet journal and a Cochrane systematic review.

I noticed that some science bloggers tried a little bit to pooh-pooh the meta-analysis, and to be blunt I suspect that's because it finds quite decisively in favour of echinacea. (None of my favourite science bloggers had this prejudice: David Colquhoun and Ben Goldacre are my favourites by the way.)

I do personally have an instinctive scepticism of complementary medicines because of the way they often try to side-step proper evaluation while at the same time giving themselves a white-coated pseudo-medical image. But in this case I'm happy to say that both the reviews find generally that there is a positive effect on colds from (some) echinacea preparations.

Friday 30th November 2007 | science | Permalink

A beatbox experiment

After a good few months of working on my PhD I'm finally ready to get some people to use my stuff and see what they make of it.

If you're near London and you're a beatboxer check this out, I'm recruiting for a beatbox experiment

Thursday 29th November 2007 | science | Permalink

Smoking ban definitely improved health

Good news from Scotland, where the smoking ban came in before ours in England. A large study has found measurable health improvements due to the ban, such as a large decrease in heart attack admissions (including a noticeable effect on non-smokers due to less passive smoking). Woo.

Monday 10th September 2007 | science | Permalink

Onscreen violence really is bad for us

Given the shootings in the USA this week, the main feature in this week's New Scientist is eerily apt. As summarised in their editorial, the research on the effect of TV / video game violence seems to be persuasive, that it has generally bad effects including aggression/desensitisation/etc.

While the report does concede that you can get useful skills from modern media (such as the dexterity and quick thinking which can be demonstrated to come from computer games), it makes the point quite clearly that the bad outweighs the good. I'm not sure what the picture would be like for people who see only "non-violent" media... I've never read any research papers on the subject so I can only be vague.

The strange prevalence of violence in films and computer games puzzles me quite a bit. I'm not one of those people that automatically tuts about violent media but it's weird how much violence there is. It must be what people want, but why? One answer might be "escapism", escaping from humdrum life into exciting scenarios, and maybe violence is one of the easiest ways to make things exciting. But there are loads of imaginative ways to escape from the world... just look at some of the weird imaginative stuff that the Japanese come up with. The Japanese come up with lots of really sick and violent stuff too of course ;) and maybe the grass looks a little greener on the other side, but our media's imaginative range seems a bit stifled in comparison. Is poverty of imagination really anything to do with it? Or am I making it up?

Thursday 19th April 2007 | science | Permalink

Gillian McKeith stops calling herself a doctor!

The assumptions you make, eh? Not that I ever paid much attention to Gillian McKeith's TV programmes, but when someone called "Dr Gillian McKeith" appears regularly on Channel 4 telling people what they should be eating, who publishes books and so on, you tend to assume they've got medical qualifications in the straightforward sense just like my GP does.

This interesting article on Gillian McKeith throws a different light on the matter. Someone complained to the Advertising Standards Authority that calling herself "Dr Gillian McKeith" in advertising was misleading (since she's only a "Dr" by virtue of a correspondence course with a non-accredited American college). In order to avoid falling foul of a pending Advertising Standards Authority ruling (apparently a draft ruling seemed to be inclined in favour of the complaint) she's agreed not to use the term in future advertising.

The article has some really choice words to say about the woman, including quoting some of the very bizarre medical claims she's made, and the "Wild Pink Yam and Horny Goat Weed products" her company briefly marketed before the Medicines and Healthcare Regulatory Agency ordered her to stop selling them and said they "were never legal for sale in the UK". The article's written by a doctor and it makes quite a lot of good points in general about the difference between science and nonscience, and real doctors and sort-of-doctors...

Tuesday 13th February 2007 | science | Permalink

How many eggs should I eat?

OK, here's yet another food dilemma: should you eat plenty of eggs, because they contain various healthy vitamins and minerals? Or should you not eat many eggs, because of the cholesterol they contain? As usual I'm determined to find an evidence-based answer.

The first things I find in a web search come from the egg marketing boards. So, bearing in mind that they're obviously quite biased, I check out "Healthy eggs" from britegg.co.uk and "Eggs and cholesterol" from nutritionandeggs.co.uk. So, as expected, they confirm that eggs are full of lots and lots of nutritious things, but they also argue that recent evidence shows that eggs aren't bad for health. They have two scientific studies to support this argument: one which looked at a large number of people in the USA and found eggs didn't increase the risk of heart disease; and one which reviewed the current state of scientific knowledge and found that saturated fat (rather than dietary cholesterol) was the main cause of people having high blood cholesterol levels.

So far, so good, although the source is not what you'd call 100% neutral. And even if saturated fat is the main cause of high blood cholesterol, could dietary cholesterol be a lesser but still important cause?

So, I found the cholesterol review article and had a look. It's a very tricky subject to unpick, actually. For example, the study finds that people who eat more dietary fat tend to eat more dietary cholesterol too, so it could be tricky to separate out the effects of the two. There are methods for doing this, of course, and in the multiple regression analysis used by the researchers, it seems that there were three significant influences on a person's blood cholesterol levels: their intakes of saturated fat, polyunsaturated fat (the more people eat, the lower their cholesterol, for polyunsaturates), and dietary cholesterol. However, although all three play a part, the cholesterol influence is strongly outweighed by the influence of saturated fat vs unsaturated fat - if I gloss over some of the details to come up with a very approximate rule of thumb, the study finds that reducing saturated fat is something on the scale of ten times more influential than reducing cholesterol.
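As a toy illustration of how a multiple regression can separate two predictors even when they're correlated - this is synthetic data, not the study's actual model or numbers - consider:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic diets (all numbers invented): saturated-fat and cholesterol
# intakes are correlated, as the review found in real diets.
sat_fat = rng.normal(30, 8, n)                   # g/day
diet_chol = 8 * sat_fat + rng.normal(0, 60, n)   # mg/day, tracks fat intake
poly_fat = rng.normal(15, 5, n)                  # g/day, roughly independent

# Made-up "true" effects: saturated fat raises blood cholesterol a lot,
# polyunsaturates lower it, dietary cholesterol has a small effect.
blood_chol = (100 + 2.0 * sat_fat - 1.0 * poly_fat
              + 0.02 * diet_chol + rng.normal(0, 5, n))

# Ordinary least squares untangles the correlated predictors.
X = np.column_stack([np.ones(n), sat_fat, poly_fat, diet_chol])
coef, *_ = np.linalg.lstsq(X, blood_chol, rcond=None)
print(coef)
```

Even though fat and cholesterol intake move together, the regression recovers a small coefficient for dietary cholesterol and a much larger one for saturated fat - the same shape of conclusion the review came to.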

OK, so what about some other sources of information? The BBC often has a lot of health information, but searching their site didn't actually find very much. The story Eggs 'protect against breast cancer' reports on a USA study of women, finding that eating eggs in teenage years seems to help lessen the likelihood of breast cancer; the study involved a large number of people and was published in a reputable journal so it seems trustworthy. The only other article I found was An egg a day 'is good for you' which seems to be based on the same studies as the ones I mentioned above. They did however confirm with a British Nutrition Foundation scientist, who agreed that there was unlikely to be a health risk from eating an egg a day (they recommend 2 or 3 a week apparently). There is opposition from the Vegan Society, but once again, they're hardly an unbiased source of information about whether people should eat eggs or not!

What about UK government advice? The UK government seems to be quite keen on inventing websites for public information these days, and one of their sites I searched is eatwell.gov.uk. They have two useful pages here: a page about eggs (including the section: "How many eggs?" - aha!) and a Q&A about eggs and cholesterol. The message from them is: eggs are good for you, and you don't need to cut down on them (unless your doctor tells you to for a specific reason). Just eat a balanced diet, as they always say.

And that's pretty much my conclusion. It seems that people used to (reasonably) assume that eating food with cholesterol in it would raise your blood cholesterol, and that was a reason not to eat too many eggs. But that assumption is too simple, and dietary cholesterol isn't that worrying after all. As long as you eat a balanced diet you can enjoy your eggs.

Sunday 10th December 2006 | science | Permalink

Does burnt food cause cancer?

Does burnt food cause cancer? Someone said to me that burnt food was "as dangerous as a cigarette", which is a pretty big claim, so I've been searching the web and some research databases, looking for evidence.

There's very little on the web about it, besides a lot of idle speculation on messageboards. This ScienceNews article from 2005 says that the US government now lists certain chemicals found in "meats when they're cooked too long at high temperature" as carcinogenic. It also says:

Finally, the report notes that while inconclusive, published studies in people "provide some indication" of human risks from eating broiled [grilled] or fried foods "that may contain IQ and/or other heterocyclic amines." The National Cancer Institute conducted one of those suggestive studies. It compared the diets of 176 stomach cancer patients and another 503 cancer-free individuals. Overall, people who regularly ate their beef medium-well or well-done faced more than three times the stomach cancer risk of those who ate their meat rare or medium-rare, according to a 1997 report of the research.

More information about this is in a very helpful summary by the USA National Cancer Institute. Note that one of the studies quoted looked at cooking at 200°C or 250°C, which is much hotter than ordinary baking/roasting. However, that is the kind of temperature you use to cook a pizza...

Statistics like "three times the cancer risk" always sound scary, but you need to ask, three times what? We need to know how the risk compares against other things. More on that later.

I found a messageboard thread on which someone said "You can put tomato sauce on it. I heard it helps lessen the production of carcinogen which causes the cancer." This is a big mistake. Fruits like tomatoes or cherries do contain antioxidants which counteract the formation of the carcinogens, but only during the cooking process, mixed in with the meat (e.g. in a burger mixture). Putting ketchup on afterwards will make zero difference.

I also found a journal article discussing the increased cancer risk from barbecued food especially (Lijinsky W, (1991), Mutation Research 259 (3-4): 251-261). It suggested that the reason for the risk was that fat will drip off the meat, then burn at high temperatures when it hits the coals, forming the cancer-causing substances that then mix in with the barbecue smoke and may then coat the outside of the meat being cooked. This explanation was proposed to explain their finding that the chemicals were mainly found in fattier foods cooked over burning logs.

Other relevant journal articles:

  1. One found a similar connection: the highest concentrations found in the Italian diet were in pizzas cooked in wood-burning ovens, and in barbecued beef and pork (Ludovici M et al (1995), Food Additives and Contaminants 12 (5): 703-713).
  2. One found that the Indian tradition of cooking with homemade clay-stoves, called "Chulha", created a lot of smoke containing the problematic chemicals (Bhargava A et al (2004), Atmospheric Environment 38 (28): 4761-4767). This was said to increase the risk for people who cook with them - remember that inhaling carcinogens is typically much more dodgy than swallowing them, because the route into the body is more direct.

The relative risk? Is a barbecued steak as dangerous as a cigarette, as certain internet message boards might lead you to believe? Clearly not: many people eat well-cooked meat, yet around nine out of ten lung cancer deaths can be attributed to smoking. (The nine-out-of-ten figure comes from a study of USA deaths in 1995: source.) Scientists can calculate guideline statistics such as the "incremental cancer risk", an averaged-out measure of the risk from something. For cigarettes it's 0.079 (source); for burnt meat it's somewhere between 0.00001 and 0.00038 (source). So the risk is somewhere between 200 and 8000 times lower - there's no comparison between one cigarette and one burnt steak.
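The "between 200 and 8000 times lower" figure is just the ratio of those two incremental-risk estimates:

```python
# Sanity-checking the comparison from the quoted incremental-risk figures.
cigarette_risk = 0.079
burnt_meat_low, burnt_meat_high = 0.00001, 0.00038

ratio_low = cigarette_risk / burnt_meat_high    # roughly 200x
ratio_high = cigarette_risk / burnt_meat_low    # 7900x
print(ratio_low, ratio_high)
```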

My conclusions:

  1. Regularly eating burnt or barbecued meat, especially meat that's been cooked at high temperatures for a long time, is relatively risky behaviour. But don't panic: it's not comparable to smoking.
  2. For non-meat food the research is less clear-cut: it's not obvious whether all smoke-cooked or overcooked food carries risks. Certainly if you don't eat it regularly there's nothing to worry about.

Monday 2nd October 2006 | science | Permalink

Digital vs analogue clocks

Both Philippa and I insist that it's easier to read an analogue clock-face (i.e. one with hands) than a digital clock-face. So I wondered: is there any research on the subject?

Of course there is! There's research about everything. But it doesn't seem to agree with us.

In Processing of visually presented clock times (Goolkasian, P and Park, D.C., 1980) the experimenters looked at the differences in speed for judging the time difference between two clocks, and found that "same/different reactions to digitally presented times were faster than to times presented on a clock face, and this format effect was found to be a result of differences in processing that occurred after encoding."

Minding the clock (Kathryn Bock, David E. Irwin, Douglas J. Davidson and W. J. M. Levelt, 2003) looked at explicitly linguistic effects (e.g. difference between Dutch and American English speakers). It also found that "responses to analog clocks were faster with relative expressions and responses to digital clocks were faster with absolute expressions," although overall it found again that digital clock-reading was faster than analogue in all cases. Note that the experimental method was explicitly linguistic - the speed measurements were measurements of how quickly the participants began to speak when they correctly named the time.

This Bock et al study is one of the most interesting (and most recent) results I found, partly because the experimental design included displaying the clocks for a short amount of time (as low as 0.1 seconds). That's too quick for the eye to rove around the clock-face and fixate directly on the different parts of the display, and "the results from the 100 ms exposure conditions indicated that sufficient information for fairly accurate production can be extracted from the display without fixating the crucial information directly."

The effects of response format and other variables on comparisons of digital and dial displays (Miller R.J. and Penningroth S., 1997) "compared dial and digital clock displays to determine which could be read faster by 25 young adults" and found that "in general, digital displays led to faster responses than did dial displays. However, several combinations of the other variables, particularly those using the before-the-hour response format, effectively eliminated the superiority of digital displays. We suggest that in designing displays requiring such a response format, designers should not assume that a digital display is necessarily the best choice, especially if other factors encourage the selection of a dial display."

I haven't read the full paper (not available electronically; will have to visit my uni library) so I'm not sure if the experimental design was again based on participants reading the time out loud - and if so, I have an issue with that which I'll come to later. But this effect of before-the-hour responses is tantalising. For example: Philippa is a radio producer, and one of the things they need to do is glance at the clock to know how much time they've got before the programme ends at 6 o'clock precisely, so they can judge when to end interviews, when to bring in the next piece of music, etc. Philippa finds it much quicker to glance at an analogue clock in order to do this, and intuitively I can see why. You can literally see how much time is left (i.e. the size of the gap between the minute hand and the 12 o'clock mark), whereas with a digital clock you have to take in all the numbers and then do a quick arithmetic operation - not difficult, of course, but probably much slower, cognitively.
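That digital-clock arithmetic is trivial to write down, which is rather the point: trivial for a computer, but still an extra cognitive step for a person mid-broadcast (the 18:00 off-air time is just the example above):

```python
# The extra step a digital clock forces on you: turning an HH:MM reading
# into time-remaining before an 18:00 off-air point (afternoon times only).
def minutes_to_six(hh, mm):
    return (18 - hh) * 60 - mm

print(minutes_to_six(17, 42))  # 18 minutes left
```

On the analogue clock the answer is just the visible gap between the minute hand and the 12, with no numbers involved at all.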

Judging a duration like this is very different from speaking the time. Reading out numbers is a one-to-one transformation which we do in so many contexts that it's very very easy; when reading out from a dial clock, we need to translate the hands' position into numbers before we can speak it. When using a dial clock to determine actions, however, we don't necessarily need to put the numerical step in the middle.

I'd like to run an experiment different to the ones I've found so far, one which tests the ability to comprehend clock-faces from a short glance - e.g. starting at 0.1 seconds and getting shorter. Rather than measuring the speed of vocalising, the measurement would be the minimum "glance time" for which the time could be correctly identified. My hunch is that the threshold will be a much shorter glance for analogue clocks.

Friday 23rd June 2006 | science | Permalink

A load of Boswelox

Here's an interesting article about the science, or lack of science, behind skincare products and their arbitrary pseudo-scientific claims. It even uses the word "Boswelox", possibly the funniest word ever.
Wednesday 22nd February 2006 | science | Permalink
Creative Commons License
Dan's blog articles may be re-used under the Creative Commons Attribution-Noncommercial-Share Alike 2.5 License. Click the link to see what that means...