When you work with birdsong you encounter a lot of rapid frequency modulation (FM), much more than in speech or music. This is because songbirds have evolved specifically to be good at it: as producers they have muscles specially adapted for rapid FM (Goller and Riede 2012), and as listeners they're perceptually and behaviourally sensitive to well-executed FM (e.g. Vehrencamp et al 2013).
Standard methods for analysing sound signals - spectrograms (or Fourier transforms in general) or filterbanks - assume that the signal is locally stationary, which means that when you consider a single small window of time, the statistics of the process are unchanging across the length of the window. For music and speech we can use a window size such as 10 milliseconds, and the signal evolves slowly enough that our assumption is OK. For birdsong, it often isn't, and you can see the consequences when a nice sharp chirp comes out in a spectrogram as a blurry smudge across many pixels of the image.
So, to analyse birdsong, we'd like to analyse our signal using representations that account for nonstationarity. Lots of these representations exist. How can we choose?
If you're impatient, just scroll down to the Conclusions at the bottom of this blog. But to start off, let's state the requirements. We'd like to take an audio signal and convert it into a representation that:
- Characterises FM compactly - i.e. FM signals as well as fixed-pitch signals have most of their energy represented in a similar small number of coefficients;
- Handles multiple overlapping sources - since we often deal with recordings having multiple birds;
- Copes with discontinuity of the frequency tracks - since not only do songbirds make fast brief audio gestures, but also, unlike us they have two sets of vocal folds which they can alternate between - so if a signal is a collage of chirpy fragments rather than a continuously-evolving pitch, we want to be able to reflect that;
- Ideally is fairly efficient to calculate - simply because we often want to apply calculations at big data scales;
- Does the transformation need to be invertible? (i.e. do we need a direct method to resynthesise a signal, if all we know is the transformed representation?) Depends. If we're interested in modifying and resynthesising the sounds then yes. But I'm primarily interested in extracting useful information, for which purposes, no.
Last year we published an empirical comparison of four FM methods (Stowell and Plumbley 2014). The big surprise from that was that the dumbest method was the best-performing for our purposes. But I've encountered a few different methods, including a couple that I learnt about very recently, so here's a list of methods for reference. This list is not exhaustive - my aim is to list an example of each paradigm, and only for those paradigms that might be particularly relevant to audio, in particular bird sounds.
- Let's start with the stupid method: take a spectrogram, then at each time-point find out which frequency has the most energy. From this list of peaks, draw a straight line from each peak to the one that comes immediately next. That set of discontinuous straight lines is your representation. It's a bit chirplet-like in that it expresses each moment as a frequency and a rate-of-change of frequency, but any signal processing researcher will tell you not to do this. In principle it's not very robust, and it's not even guaranteed to find peaks that correspond to the actual fundamental frequency. In our 2014 paper we tested this as a baseline method, and... it turned out to be surprisingly robust and useful for classification! It's also extremely fast to compute. However, note that this doesn't work with polyphonic (multi-source) audio at all. For big data analysis it's handy to be able to do this, but I don't expect it to make any sense for analysing a sound scene in detail.
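The whole baseline fits in a few lines. Here's a sketch on a synthetic chirp (the window settings are illustrative choices, not the ones from our paper):

```python
import numpy as np
from scipy.signal import chirp, spectrogram

# Synthetic stand-in for a birdsong syllable: a fast downward sweep.
fs = 22050
t = np.arange(int(0.2 * fs)) / fs
x = chirp(t, f0=6000, f1=2000, t1=t[-1], method="linear")

# The baseline: spectrogram, then one peak frequency per frame.
freqs, times, S = spectrogram(x, fs=fs, nperseg=256, noverlap=192)
peak_freqs = freqs[np.argmax(S, axis=0)]

# The representation is just these (time, peak) points joined by straight
# lines; the slope between consecutive points is a crude chirp-rate estimate.
chirp_rates = np.diff(peak_freqs) / np.diff(times)
```

Note the single `argmax` per frame: that's why it can't cope with polyphony, since a second bird simply loses the vote in every frame.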
- Chirplets. STFT analysis assumes the signal is composed of little packets, and each packet contains a sine-wave with a fixed frequency. Chirplet analysis generalises that to assume that each packet is a sine-wave with parametrically varying frequency (you can choose linearly-varying, quadratically-varying, etc). See chirplets on wikipedia for a quick intro. There are different ways to turn the concept of a chirplet into an analysis method. Here are some applied to birds:
- Stowell and Plumbley 2012, "Framewise heterodyne chirp analysis of birdsong" - a method by us, designed to scan quickly over a fixed dictionary of chirplets. It's efficient, but overcomplete and not invertible. In our 2014 paper we found it worked pretty well for bird classification, and also that it was very robust to audio degradation.
- Aoi et al 2015, "An Approach to Time-Frequency Analysis With Ridges of the Continuous Chirplet Transform" - pretty heavy-duty but interesting. Again it's overcomplete. They apply it to zebra finch audio. Note that the detection of ridges implies a certain requirement of continuity, but this can be tuned. I've not explored this method.
- Gribonval 2001, "Fast matching pursuit with a multiscale dictionary of Gaussian chirps". This sits in the sparse coding paradigm, using the efficient matching pursuit method to decompose a signal into an invertible sparse representation of chirp packets. You can use this method in the MPTK toolkit for matlab - I analysed birdsong using this method while visiting Gribonval's lab in 2013. In our 2014 paper however, we found that matching pursuit representations were highly non-robust to noise and other audio degradations, and also didn't give us great classification results. This is a shame because the method is handy for analysis-resynthesis (as illustrated in the blog post that I just linked to).
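To make the sparse-coding idea concrete, here's a toy matching pursuit over a tiny single-scale dictionary of Gaussian chirp atoms. All the parameters are invented for illustration; Gribonval's method uses a multiscale dictionary and a much cleverer search than this brute-force correlation:

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 8000
n = 512
t = (np.arange(n) - n / 2) / fs

def gauss_chirp(f0, c, sigma=0.01):
    """Gaussian-windowed chirp atom: instantaneous frequency f0 + c*t (Hz), unit norm."""
    atom = np.exp(-0.5 * (t / sigma) ** 2) * np.cos(2 * np.pi * (f0 * t + 0.5 * c * t ** 2))
    return atom / np.linalg.norm(atom)

# A small dictionary over centre frequency and chirp rate (one scale only).
params = [(f0, c) for f0 in range(500, 3000, 100)
                  for c in (-20000, -10000, 0, 10000, 20000)]
D = np.stack([gauss_chirp(f0, c) for f0, c in params])

# Signal: one chirpy atom plus a little noise.
x = 2.0 * gauss_chirp(1500, 20000) + 0.05 * rng.standard_normal(n)

# Matching pursuit: repeatedly pick the best-matching atom and subtract it.
residual = x.copy()
decomposition = []
for _ in range(5):
    corrs = D @ residual
    k = np.argmax(np.abs(corrs))
    decomposition.append((params[k], corrs[k]))
    residual = residual - corrs[k] * D[k]
```

The first atom picked should be the planted one, `(1500, 20000)`, with a coefficient near 2 - and because the decomposition stores atom parameters plus coefficients, summing `coeff * atom` back up gives you the invertibility (resynthesis) property mentioned above.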
- Filter diagonalisation method - an interesting method from quantum mechanics, FDM models a chunk of signal as a sum of purely exponentially decaying sinusoids. Our PhD student Luwei Yang recently applied this to tracking vibrato in string instruments. I think this is the first use of FDM for audio. It's not been explored much - I believe it satisfies most of the requirements I stated above, but I've no idea of its behaviour in practice.
- Subspace-based methods such as ESPRIT. See for example this ESPRIT paper by Badeau et al. These are a class of sinusoidal tracking techniques: they analyse a signal by exploiting an assumed continuity from one frame to the next. That assumption is exactly the problem for birdsong analysis. Roland Badeau tested a birdsong recording for me and found that the very fast FM was fatal for this type of method: it needs to be able to rely on relatively smooth continuity of the pitch tracks in order to give strong tracking results.
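To see what the subspace idea buys you when its assumptions do hold, here's a bare-bones ESPRIT sketch on steady noiseless sinusoids - a textbook toy, not Badeau et al's adaptive formulation, and the window length and model order are arbitrary choices:

```python
import numpy as np

# Two steady tones: the easy case where subspace methods shine.
fs = 8000
n = 400
t = np.arange(n) / fs
x = np.cos(2 * np.pi * 1000 * t) + 0.5 * np.cos(2 * np.pi * 2500 * t)

m = 60                                  # Hankel window length
H = np.lib.stride_tricks.sliding_window_view(x, m)   # rows are shifted snippets
U, s, Vt = np.linalg.svd(H, full_matrices=False)
p = 4                                   # 2 real sinusoids = 4 complex poles
W = Vt[:p].T                            # signal-subspace basis, shape (m, p)

# Rotational invariance: W[1:] = W[:-1] @ Phi, and Phi's eigenvalues
# are the poles exp(j*2*pi*f/fs).
Phi = np.linalg.lstsq(W[:-1], W[1:], rcond=None)[0]
poles = np.linalg.eigvals(Phi)
freqs = np.sort(np.abs(np.angle(poles))) * fs / (2 * np.pi)
# freqs ≈ [1000, 1000, 2500, 2500] (each real tone gives a +/- pole pair)
```

The fatal issue for birdsong is visible in the model itself: every row of the Hankel matrix is assumed to contain the *same* set of poles, which fast FM violates almost immediately.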
- Fan chirp transform (Weruaga and Kepesi 2007) - when you take the FFT of a signal, you might say you analyse it as a series of "horizontal lines" in the time-frequency plane. The fan chirp transform tilts all these lines at once: imagine the lines, instead of being horizontal, all converging on a single vanishing point in the distance. It should therefore be particularly good for analysing harmonic signals with pitch modulation. Note that the angles are all locked together, so it's best for monophonic-but-harmonic signals, not polyphonic ones. My PhD student Veronica Morfi, before she joined us, extended the fan-chirp model to non-linear FM curves too: Morfi et al 2015.
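Here's a rough sketch of that "tilting" idea: warp the time axis so that a linearly-sweeping harmonic stack becomes steady, then take an ordinary FFT. This drops the amplitude-normalisation term of the real fan chirp transform, and all the numbers are invented:

```python
import numpy as np

fs = 8000
dur = 0.1
n = int(fs * dur)
t = np.arange(n) / fs

# Harmonic signal whose fundamental sweeps linearly: f0(t) = 800*(1 + alpha*t).
alpha_true = 3.0                         # normalised chirp rate (1/s)
phase = 2 * np.pi * 800 * (1 + alpha_true * t / 2) * t
x = sum(np.cos(k * phase) for k in (1, 2, 3))

def fan_chirp_mag(x, alpha):
    """Crude fan chirp sketch: warp time so the 'fan' of harmonics
    straightens out, then take an ordinary windowed FFT."""
    tau = t                              # uniform output grid
    if abs(alpha) < 1e-9:
        t_warp = tau
    else:
        # Invert the warp (1 + alpha*t/2)*t = tau for t.
        t_warp = (-1 + np.sqrt(1 + 2 * alpha * tau)) / alpha
    xw = np.interp(t_warp, t, x)
    return np.abs(np.fft.rfft(xw * np.hanning(n)))

flat = fan_chirp_mag(x, 0.0)             # plain FFT: harmonics smeared
tilted = fan_chirp_mag(x, alpha_true)    # correct tilt: sharp harmonic peaks
```

Searching over `alpha` for the sharpest spectrum is then one way to estimate the pitch-modulation rate - and because a single `alpha` tilts every harmonic together, you can see directly why it suits monophonic harmonic sounds.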
- Spectral reassignment methods. When you take the FFT of a signal, you analyse it as a series of equally-spaced packets on the frequency axis. The clever idea in spectral reassignment is to say: assume the packets weren't actually sitting on that grid, but we analysed them with the FFT anyway - so let's take the results and move each of those grid-points to the irregular location that best matches the evidence. You can extend this idea to allow each packet to be chirpy rather than fixed-frequency, so there you have it: run a simple FFT on a frame of audio, and then magically transform the results into a more detailed version in which each bin can have its own AM and FM. This is good because it makes sense for polyphonic audio.
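The core trick can be shown in a few lines for a single frame: alongside the ordinary windowed FFT, take a second FFT using the window's time-derivative, and the ratio of the two tells you how far to nudge each bin (the standard Auger-Flandrin frequency-reassignment formula; the frame length and test frequency are arbitrary):

```python
import numpy as np

# One frame containing a pure tone that falls between FFT bins.
fs = 8000
n = 256
f_true = 1234.5                      # bins are fs/n = 31.25 Hz apart
t = np.arange(n) / fs
x = np.cos(2 * np.pi * f_true * t)

# Analysis window h and its time derivative (in per-second units).
h = np.hanning(n)
dh = np.gradient(h) * fs

X_h = np.fft.rfft(x * h)
X_dh = np.fft.rfft(x * dh)
bin_freqs = np.fft.rfftfreq(n, 1 / fs)

# Reassignment: move each bin by its local instantaneous-frequency correction.
with np.errstate(divide="ignore", invalid="ignore"):
    f_hat = bin_freqs - np.imag(X_dh / X_h) / (2 * np.pi)

k = np.argmax(np.abs(X_h))
# bin_freqs[k] sits on the 31.25 Hz grid; f_hat[k] lands almost exactly on f_true.
```

The same recipe with a time-weighted window gives the time correction, and the chirpy extensions mentioned above add further derivative windows to estimate per-bin FM.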
- A particular example of this is the distribution derivative method (code available here). I worked with Sasho Musevic a couple of years ago, who did his PhD on this method, and we found that it yielded informative results for multiple birdsong tracking (Stowell et al 2013). Definitely promising. (In my later paper comparing different FM methods, it gave a strong performance again. The main drawback, in that context, was simply that it took longer to compute than the methods it was compared against.) You also have to make some peak-picking decisions, but that's doable. This summer, I did some work with Jordi Bonada and we saw the distribution derivative method getting very precise results on a dataset of chaffinch recordings.
- The Jančovič and Köküer method detects modulated sinusoids by the specific shapes they create in a short-time spectrum. Because it is designed around the behaviour of heavily-modulated tones, unlike many other methods it should actually work particularly well on them. Nifty. I don't think it's been used by anyone else, though the authors themselves have incorporated their tracker into a bird classifier. I'd like to see it evaluated more.
- There's lots of work on multi-pitch trackers, and this list would be incomplete if I didn't mention that general idea. Why not just apply a multi-pitch tracker to birdsong audio and then use the pitch curves that come out? Well, as with the ESPRIT method I mentioned above, the methods developed for speech and music tend to build on assumptions such as relatively long, smooth pitch curves, often with hard limits on the depth of FM that can occur.
- How about feature learning? Rather than design a feature transform, we could simply feed a learning algorithm with a large amount of birdsong data and get it to learn what FM patterns exist in the audio. That's what we did last year in this paper on large-scale birdsong classification - that was based on spectrogram patches, but it definitely detected characteristic FM patterns. That representation didn't explicitly recover pitch tracks or individual chirplets, but there may be ways to develop things in that direction. In particular, there's quite a bit of effort in deep learning on "end-to-end" learning which asks the learning algorithm to find its own transformation from the raw audio data. The transformations learnt by such systems might themselves be useful representations for other tasks.
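As a toy illustration of patch-based feature learning, here's plain k-means clustering of synthetic "spectrogram patches" containing rising or falling ridges - entirely fabricated data, and our paper actually used a streaming spherical k-means on PCA-whitened patches, so treat this as a skeleton of the idea only:

```python
import numpy as np

rng = np.random.default_rng(1)

# Fake 8x8 "spectrogram patches": a rising or falling diagonal ridge plus
# noise, standing in for upward/downward FM gestures.
def fm_patch(direction):
    p = 0.1 * rng.random((8, 8))
    idx = np.arange(8)
    rows = idx if direction > 0 else idx[::-1]
    p[rows, idx] += 1.0
    return p.ravel()

X = np.stack([fm_patch(+1) for _ in range(100)] +
             [fm_patch(-1) for _ in range(100)])

# Plain k-means with k=2, initialised with one patch of each kind.
C = X[[0, 100]].copy()
for _ in range(10):
    assign = np.argmin(((X[:, None, :] - C[None]) ** 2).sum(-1), axis=1)
    C = np.stack([X[assign == k].mean(axis=0) for k in range(2)])

# Each learned centroid, reshaped to 8x8, shows one dominant FM direction.
up, down = C[0].reshape(8, 8), C[1].reshape(8, 8)
```

The point is simply that the learned centroids end up *being* FM detectors, even though nothing about FM was built into the algorithm.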
So.... It's too soon to have conclusions about the best signal representations for FM in birdsong. But out of this list, the distribution derivative method is the main "off-the-shelf" tool that I'd suggest for high-resolution bird FM analysis (code available here), while feature-learning and filter diagonalisation and the Jančovič and Köküer method are the approaches that I'd like to see more research on.
At the same time, I should also emphasise that machine learning methods don't need a nice clean understandable representation as their input. Even if a spectrogram turns birdsong into a blur when you look at it, that doesn't necessarily mean it shouldn't be used as the input to a classifier. Machine learning often has different requirements than the human eye.
(You might think I'm ignoring the famous rule garbage in, garbage out when I say a classifier might work fine with blurry data - well, yes and no. A spectrogram contains a lot of high-dimensional information, so it's rich enough that the crucial information can still be embedded in there. Even the "stupid method" I mentioned, which throws away so much information, preserves something of the important aspects of the sound signal. And modern classifiers work well with rich, high-dimensional data.)
But if you're trying to do something specific such as clearly characterise the rates of FM used by a particular bird species, a good representation will help you a lot.