People who do technical work with sound use spectrograms a heck of a lot. This standard way of visualising sound becomes second nature to us.
As you can see from these photos, I like to point at spectrograms all the time:
(Our research group even makes some really nice software for visualising sound which you can download for free.)
It's helpful to transform sound into something visual. You can point at it, you can discuss tiny details, etc. But sometimes, the spectrogram becomes a stand-in for listening. When we're labelling data, for example, we often listen and look at the same time. There's a rich tradition in bioacoustics of presenting and discussing spectrograms while trying to tease apart the intricacies of some bird or animal's vocal repertoire.
But there's a question of validity. If I look at two spectrograms and they look the same, does that mean the sounds actually sound the same?
In a strict sense, we already know that the answer is "No". We audio people can construct counterexamples pretty easily, in which there's a subtle audible difference that's not visually obvious (e.g. phase coherence). But it could perhaps be even worse than that: similarities might not just be easier or harder to spot, they might actually rank differently. If we have a particular sound X, it could audibly be more similar to A than to B, while visually it could be more similar to B than to A. If this were indeed true, we'd need to be very careful about performing tasks such as clustering sounds or labelling sounds while staring at their spectrograms.
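One quick way to convince yourself of the phase point is a toy demo (my own sketch, not from any paper discussed here): two signals can have exactly the same magnitude spectrum, and hence look identical in a long-window spectral picture, while their phases, and therefore their waveforms, are completely different. Here I build the second signal by keeping the DFT magnitudes of the first and scrambling the phases:

```python
import cmath, math, random

def dft(x):
    """Naive discrete Fourier transform (fine for a tiny toy signal)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    """Inverse DFT, returning the real part of each sample."""
    N = len(X)
    return [(sum(X[k] * cmath.exp(2j * math.pi * k * n / N) for k in range(N)) / N).real
            for n in range(N)]

random.seed(0)
N = 64
x = [random.gauss(0, 1) for _ in range(N)]
X = dft(x)

# Build a spectrum with identical magnitudes but scrambled phases.
# (Conjugate symmetry is preserved so the inverse transform is real.)
Y = [0j] * N
Y[0] = abs(X[0])
Y[N // 2] = abs(X[N // 2])
for k in range(1, N // 2):
    phase = random.uniform(-math.pi, math.pi)
    Y[k] = abs(X[k]) * cmath.exp(1j * phase)
    Y[N - k] = Y[k].conjugate()

y = idft(Y)

mags_match = all(abs(abs(Xk) - abs(Yk)) < 1e-9 for Xk, Yk in zip(X, Y))
waveforms_differ = max(abs(a - b) for a, b in zip(x, y)) > 0.1
print(mags_match, waveforms_differ)   # -> True True
```

The two signals are indistinguishable by magnitude spectrum, yet as waveforms they have essentially nothing in common; with the right construction the audible difference can be dramatic while a spectral plot shows nothing.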
So - what does the research literature say? Does it give us guidance on how far we can trust our eyes as a proxy for our ears? Well, it gives us hints but so far not a complete answer. Here are a few relevant factoids which dance around the issue:
- Agus et al (2012) found that people could respond particularly fast to voice stimuli vs other musical stimuli (in a go/no-go discrimination task), and that this speed wasn't explained by the "distance" measured between spectrograms. (There's no visual similarity judgment here, but a pretty good automatic algorithm for comparing spectrograms [actually, "cochleagrams"] acts as a proxy.)
- Another example which Trevor Agus sent me - I'll quote him directly: "My favourite counterexample for using the spectrogram as a surrogate for auditory perception is Thurlow (1959), in which he shows that we are rubbish at reporting the number of simultaneous pure tones, even when there are just 2 or 3. This is a task that would be trivial with a spectrogram. A more complex version would be Gutschalk et al. (2008) in which sequences of pure tones that are visually obvious on a spectrogram are very difficult to detect audibly. (This builds on a series of results on the informational masking of tones, but is a particularly nice example and audio demo.)"
- Zue (1985) gives a very nice introduction and study of "spectrogram reading of speech" - this is when experts learn to look at a spectrogram of speech and to "read" from it the words/phonemes that were spoken. It's difficult and anyone who's good at it will have had to practice a lot, but as the paper shows, it's possible to get up to 80-90% accuracy on labelling the phonemes. I was surprised to read that "There was a tendency for consonants to be identified more accurately than vowels", because I would have thought the relatively long duration of vowels and the concentration of energy in different formants would have been the biggest clue for the eye. Now, the paper is arguing that spectrogram reading is possible, but I take a different lesson from it here: 80-90% is impressive but it's much much worse than the performance of an expert who is listening rather than looking. In other words, it demonstrates that looking and listening are very different, when it comes to the task of identifying phonemes. There's a question one can raise about whether spectrogram reading would reach a higher accuracy if someone learned it as thoroughly as we learn to listen to speech, but that's unlikely to be answered any time soon.
- Rob Lachlan pointed out that most often we look at standard spectrograms which have a linear frequency scale, whereas our listening doesn't really treat frequency that way - it is more like a logarithmic scale, at least at higher frequencies. This could be accommodated by using spectrograms with log, mel or ERB frequency scales. People do have a habit of using the standard spectrogram, though, perhaps because it's the common default in software and because it's the one we tend to be most familiar with.
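For the curious, the warped scales Rob mentions are simple to write down. A minimal sketch using two standard formulas (the common O'Shaughnessy mel formula and the Glasberg & Moore ERB-rate formula; the function names are mine):

```python
import math

def hz_to_mel(f):
    # O'Shaughnessy's widely-used mel formula
    return 2595.0 * math.log10(1.0 + f / 700.0)

def hz_to_erb_rate(f):
    # Glasberg & Moore (1990) ERB-rate scale
    return 21.4 * math.log10(1.0 + 0.00437 * f)

# The same 1 kHz step covers far more of the perceptual scale at the
# bottom of the range than at the top - roughly logarithmic resolution:
for lo, hi in [(0, 1000), (7000, 8000)]:
    print(f"{lo}-{hi} Hz: {hz_to_mel(hi) - hz_to_mel(lo):.0f} mels, "
          f"{hz_to_erb_rate(hi) - hz_to_erb_rate(lo):.1f} ERBs")
```

A linear-frequency spectrogram devotes equal pixel height to those two bands, so it visually over-weights fine detail at high frequencies relative to how we hear it.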
- We know that listening can be highly accurate in many cases. This is exploited in the field of "auditory display", in which listening is used to analyse scatter plots and all kinds of things. Here's a particularly lovely example quoted from Barrett & Kramer (1999): "In an experiment dating back to 1945, pilots took only an hour to learn to fly using a sonified instrument panel in which turning was heard by a sweeping pan, tilt by a change in pitch, and speed by variation in the rate of a “putt putt” sound (Kramer 1994a, p. 34). Radio beacons are used by rescue pilots to home-in on a tiny speck of a life-raft in the vast expanse of the ocean by listening to the strength of an audio signal over a set of radio headphones."
- James Beauchamp sent me their 2006 study - again, they didn't use looking-at-spectrograms directly, but they did compare listening vs spectrogram analysis, as in Agus et al. The particularly pertinent thing here is that they evaluated this using small spectral modifications, i.e. very fine-scale differences. He says: "We attempted to find a formula based on spectrogram data that would predict percent error that listeners would incur in detecting the difference between original and corresponding spectrally altered sounds. The sounds were all harmonic single-tone musical instruments that were subjected to time-variant harmonic analysis and synthesis. The formula that worked best for this case (randomly spectrally altered) did not work very well for a later study (interpolating between sounds). Finding a best formula for all cases seems to still be an open question."
Really, what does this all tell us? It tells us that looking at spectrograms and listening to sounds differ in so many ways that we shouldn't expect the fine details to match up. We can probably trust our eyes for broad-brush tasks such as labelling sounds that are quite distinct, but for the fine-grained comparisons (which we often need in research) one should definitely be careful, and use actual auditory perception as the judge when it really matters. How do we know when this is needed? Still a question of judgment, in most cases.
My thanks go to Trevor Agus, Michael Mandel, Rob Lachlan, Anto Creo and Tony Stockman for examples quoted here, plus all the other researchers who kindly responded with suggestions.
Know any research literature on the topic? If so do email me - NB there's plenty of literature on the accuracy of looking or of listening in various situations; here the question is specifically about comparisons between the two modalities.
Last year I took part in the Dagstuhl seminar on Vocal Interactivity in-and-between Humans, Animals and Robots (VIHAR). Many fascinating discussions with phoneticians, roboticists, and animal behaviourists (ethologists).
One surprisingly difficult topic was to come up with a basic data model for describing multi-party interactions. It was so easy to pick a hole in any given model: for example, if we describe actors taking "turns" which have start-times and end-times, then are we really saying that the actor is not actively interacting when it's not their turn? Do conversation participants really flip discretely between an "on" mode and an "off" mode, or does that model ride roughshod over the phenomena we want to understand?
I was reminded of this modelling question when I read this very interesting new journal article by a Japanese research group: "HARKBird: Exploring Acoustic Interactions in Bird Communities Using a Microphone Array". They have developed a really neat setup with a portable microphone array attached to a laptop, which does direction-estimation and decodes which birds are heard from which direction. In the paper they use this to help annotate the time-regions in which birds are active, a bit like the on/off model I mentioned above. Here's a quick sketch:
From this type of data, Suzuki et al calculate a measure called the transfer entropy which quantifies the extent to which one individual's vocalisation patterns contain information that predicts the patterns of another. It gives them a hypothesis test for whether one particular individual affects another, in a network: who is listening to whom?
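To make the idea concrete, here's a toy sketch of transfer entropy (my own minimal version with a one-step history, estimated from raw counts over binary on/off activity sequences - not the HARKBird authors' actual implementation, which uses richer histories and significance testing):

```python
import math, random
from collections import Counter

def transfer_entropy(src, dst):
    """Toy transfer entropy (history length 1, in bits) from binary
    sequence `src` to binary sequence `dst`, from raw counts."""
    n = len(dst) - 1
    triples = Counter(zip(dst[1:], dst[:-1], src[:-1]))   # (x_{t+1}, x_t, y_t)
    pairs_xy = Counter(zip(dst[:-1], src[:-1]))           # (x_t, y_t)
    pairs_xx = Counter(zip(dst[1:], dst[:-1]))            # (x_{t+1}, x_t)
    singles = Counter(dst[:-1])                           # x_t
    te = 0.0
    for (x1, x0, y0), c in triples.items():
        p_joint = c / n
        p_cond_full = c / pairs_xy[(x0, y0)]              # p(x_{t+1} | x_t, y_t)
        p_cond_self = pairs_xx[(x1, x0)] / singles[x0]    # p(x_{t+1} | x_t)
        te += p_joint * math.log2(p_cond_full / p_cond_self)
    return te

# Toy data: `dst` copies `src` with a one-step lag, so src's past
# predicts dst's future (about 1 bit), but not vice versa.
random.seed(1)
src = [random.randint(0, 1) for _ in range(2000)]
dst = [0] + src[:-1]
te_fwd = transfer_entropy(src, dst)   # close to 1.0
te_rev = transfer_entropy(dst, src)   # close to 0.0
print(te_fwd, te_rev)
```

The asymmetry is the point: a high value in one direction and not the other is what licenses the "who is listening to whom?" interpretation.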
That's a very similar question to the question we were asking in our journal article last year, "Detailed temporal structure of communication networks in groups of songbirds". I talked about our model at the Dagstuhl event. Here I'll merely emphasise that our model doesn't use regions of time, but point-like events:
So our model works well for short calls, but is not appropriate for data that can't be well-described via single moments in time (e.g. extended sounds that aren't easily subdivided). The advantage of our model is that it's a generative probabilistic model: we're directly estimating the characteristics of a detailed temporal model of the communication. The transfer-entropy method, by contrast, doesn't model how the birds influence each other, just detects whether the influence has happened.
I'd love to get the best of both worlds: a generative and general model for extended sound events influencing one another. It's a tall order because for point-like events we have point process theory, whereas for extended events I don't think the theory is quite so well-developed. Markov models work OK but don't deal very neatly with multiple parallel streams. The search continues.
A colleague pointed out this new review paper in the journal "Animal Behaviour": Applications of machine learning in animal behaviour studies.
It's a useful introduction to machine learning for animal behaviour people. In particular, the distinction between machine learning (ML) and classical statistical modelling is nicely described (sometimes tricky to convey that without insulting one or other paradigm).
The use of illustrative case studies is good. Most introductions to machine learning base themselves around standard examples predicting "unstructured" outcomes such as house prices (i.e. predict a number) or image categories (i.e. predict a discrete label). Two of the three case studies (all of which are by the authors themselves) similarly are about predicting categorical labels, but couched in useful biological context. It was good to see the case study relating to social networks and jackdaws. Not only because it relates to my own recent work with colleagues (specifically: this on communication networks in songbirds and this on monitoring the daily activities of jackdaws - although in our case we're using audio as the data source), but also because it shows an example of using machine learning to help elucidate structured information about animal behaviour rather than just labels.
The paper is sometimes mathematically imprecise: it's incorrect that Gaussian mixture models "lack a global optimum solution", for example (it's just that the global optimum can be hard to find). But the biggest omission, given that the paper was written so recently, is any real mention of deep learning. Deep learning has been showing its strengths for years now, and is not yet widely used in animal behaviour but certainly will be in years to come; researchers reading a review of "machine learning" should really come away with at least a sense of what deep learning is, and how it sits alongside other methods such as random forests. I encourage animal behaviour researchers to look at the very readable overview by LeCun et al in Nature.
InterSpeech 2016 was a very interesting conference. I have been to InterSpeech before, yes - but I'm not a speech-recognition person so it's not my "home" conference. I was there specifically for the birds/animals special session (organised by Naomi Harte and Peter Jancovic), but it was also a great opportunity to check in on what's going on in speech technology research.
Here's a note of some of the interesting papers I saw. I'll start with some of the birds/animals papers:
- Localizing Bird Songs Using an Open Source Robot Audition System with a Microphone Array by Suzuki et al. Some really interesting work from a Japanese group I hadn't met before - they use a simple USB multi-microphone device together with source-localisation software, and then use that to recover a multi-species transcript of activity. Potentially very useful.
- Recognition of Multiple Bird Species Based on Penalised Maximum Likelihood and HMM-Based Modelling of Individual Vocalisation Elements by Jancovic and Kokuer - I like their modulated-sinusoid detector - here it's incorporated into a system for multi-species recognition in polyphonic audio. Good to see this being developed. It'd be great if others could make use of this technique (I don't think source code is available...).
- Bird Song Synthesis Based on Hidden Markov Models by Bonada et al - this is work Jordi Bonada developed while on a research visit to our lab (C4DM), with a colleague of mine, Rob Lachlan. Good probabilistic generation of birdsong needs to be developed, you see, because at the moment a lot of bird behaviour experiments use a small number of WAV files as playback, and the unnatural repetitiveness that results from this makes me uncomfortable! Hence we need work like this on flexible bird sound synthesis.
- Cost Effective Acoustic Monitoring of Bird Species by Ciira wa Maina. A neat paper about a prototype 'full solution', little Raspberry Pi sound recorders with a straightforward single-species sound detector.
- Feature Learning and Automatic Segmentation for Dolphin Communication Analysis by Kohlsdorf et al. I don't know what audio tracking methods dolphin researchers currently favour, but this is an interesting highly-scalable method based on feature-learning in spectrogram data.
- Robust Detection of Multiple Bioacoustic Events with Repetitive Structures by Frank Kurth. To be honest I don't grok this yet - it's a kind of enhanced autocorrelation, but I'm not sure what leads to it being said to be robust.
- A Real-Time Parametric General-Purpose Mammalian Vocal Synthesiser by Roger K Moore: a PureData patch for general-purpose mammal sound synthesis. Fun, and maybe useful for someone out there working on mammal sounds.
That's not all the bird/animal papers, sorry, just the ones I have comments about.
And now a sampling of the other papers that caught my interest:
- Retrieval of Textual Song Lyrics from Sung Inputs by Anna Kruspe. Nice to see work on aligning song lyrics against audio recordings - it's something that the field of MIR is in need of. The example application here is if you sing a few words, can a system retrieve the right song audio from a karaoke database?
- The auditory representation of speech sounds in human motor cortex - this journal article has some of the amazing findings presented by Eddie Chang in his fantastic keynote speech, discovering the way phonemes are organised in our brains, both for production and perception.
- Today's Most Frequently Used F0 Estimation Methods, and Their Accuracy in Estimating Male and Female Pitch in Clean Speech by Sofia Strömbergsson. This survey is a great service for the community. The general conclusion is that Praat's pitch detection is really among the best off-the-shelf recommendations (for speech analysis, here - the evaluation hasn't been done for non-human sounds!).
- Supervised Learning of Acoustic Models in a Zero Resource Setting to Improve DPGMM Clustering by Heck et al - "zero-resource" speech analysis is interesting to me because it could be relevant for bird sounds. "Zero resource" means analysing languages for which we have no corpora or other helpful data available - all we have is audio recordings. (Sounds familiar?) In this paper the authors used some adaptation techniques to improve a method introduced last year based on unsupervised nonparametric clustering.
- Speech reductions cause a de-weighting of secondary acoustic cues by Varnet et al: a study of some niche aspects of human listening. Through tests of people listening to speech examples in noise they found that people's use of secondary cues - i.e. clues that help to distinguish one phoneme from another, which clues are embedded elsewhere in the word than the phoneme itself - changes according to the nature of the stimulus. Yet more evidence that perception is an active, context-sensitive process etc.
One thing you won't realise from my own notes is that InterSpeech was heavily dominated by deep learning. Convolutional neural nets (ConvNets), recurrent neural nets (RNNs) - they were everywhere. Lots of discussion about connectionist temporal classification (CTC) - some people say it's the best, some say it requires too much data to train properly, some say they have other tricks so they can get away without it. It will be interesting to see how that discussion evolves. However, many of the other deep-learning papers were much of a muchness: lots of people apply a ConvNet or an RNN to one of the many tasks in speech technology and, as we all know, in many cases they get good results. But often there was application without a whole lot of insight. That's the way the state of the art is at the moment, I guess. Hence, many of my most interesting moments at InterSpeech were deep-learning-less :) - see above.
(Also, I had to miss the final day, to catch my return flight. Wish I'd been able to go to the VAD and Audio Events session, for example.)
Another aspect of speech technology is the emphasis on public data challenges - there are lots of them! Speech recognition, speaker recognition, language recognition, distant speech recognition, zero-resource speech recognition, de-reverberation... Some of these have been running for years and the dedication of the organisers is worth praising. Useful to check in on how these things are organised, as we develop similar initiatives in general and natural sound scene analysis.
MLSP 2016 - i.e. the IEEE International Workshop on Machine Learning for Signal Processing - was a great, well-organised workshop, held last week on Italy's Amalfi coast. (Yes, lovely place to go for work - if only I'd had some spare time for sightseeing on the side! Anyway.)
Here are a few of the papers that caught my interest:
- Approximate State-Space Gaussian Processes Via Spectral Transformation by Toni Karvonen and Simo Särkkä. This is an important contribution to the current work on Gaussian processes and in particular on running efficient Gaussian process inference. It builds on other work from the Särkkä lab converting Gaussian processes to state-space models, which often involves a (mild) approximation. This paper introduces some new methods in that vein, with proofs, and in fact the paper includes various ways to approximate a GP. A veritable mathematical toolkit. It seems the Taylor expansion (the most immediately comprehensible IMHO) is not the best.
Actually, there was substantial work involving Gaussian processes at MLSP. Is it a growth area? Well, if the use of GPs can be made more scalable (as in the above paper) then yes, it certainly should be. They are a very flexible and general tool, and nicely Bayesian too. Richard Turner's keynote about Gaussian processes was a beautiful introduction - he manages to make GPs extremely understandable. If you get a chance to see him speak on them then do.
- Localizing Users And Items From Paired Comparisons by O'Shaughnessy and Davenport. This is a nicely conceived addition to the literature on recommendation algorithms, and with good demonstrations of how the approach is robust to issues such as incoherent paired comparisons.
- "Data Privacy Protection By Kernel Subspace Projection And Generalized Eigenvalue Decomposition" by Diamantaras and Kung. Privacy-preserving computing is an important area for current research. It's made obvious when we see how much a large company like Facebook or Tesco can infer about its users. Here, the authors treat privacy as a classification task - i.e. the data to be kept private is some kind of discrete label - and they apply an LDA-like method: maximise the scatter between the target classes for the "allowed" task, while minimising the scatter between the private classes. (I raised an issue with their "Privacy Index", noting that the desired accuracy for the private task was not in fact zero but ignorance. I'd presume that a metric based on mutual information would be a nice alternative.)
- "Scale and shift invariant time/frequency representation using auditory statistics: application to rhythm description" by Marchand and Peeters. They use the "Scale Transform", a class of Mellin transform. Equivalent to exponentially time-warping a signal then weighting by an exponential window. Since it's not shift-invariant you don't want to apply it directly to audio, but to e.g. autocorrelation. From there, they argue you get a good featureset for characterising musical rhythm.
- Score-Matching Estimators For Continuous-Time Point-Process Regression Models by Sahani, Bohner and Meyer - good to see this. I've been using point process models to analyse bird communication and so I'm interested in efficient ways to do such analysis, which commonly seem to come from the computational neuroscience literature at the moment. Notable that this approach doesn't require any time discretisation, so could be useful. The functions analysed need to be differentiable, so to work with impulsive time series they actually convolve/correlate them with basis functions; feels like a minor hack but there you go.
Also, I was very pleased that Pablo A Alvarado Duran presented his work on Gaussian processes for music audio modelling - his first publication as part of his PhD with me!
I just read this new paper, "Metrics for Polyphonic Sound Event Detection" by Mesaros et al. Very relevant topic since I'm working on a couple of projects to automatically annotate bird sound recordings.
I was hoping this article would be a complete and canonical reference that I could use as a handy citation to refer to in any discussion of evaluating sound event detection. It isn't that, for a single reason:
Just so you know, the context of that paper is the DCASE2016 challenge. For the purposes of the challenge, they've released a public python toolbox with their evaluation metrics, and that's a great way to go about things. This paper, then, is oriented around the evaluation paradigm used in DCASE2016.
In that paradigm, they evaluate systems which supply a list of inferred event annotations which are entirely on or off. They're not probabilistic, or ranked, or annotated with certainty/uncertainty. Fair enough, this happens a lot, and it's a perfectly justifiable way to set up the contest. However, in many of my scenarios, we work with systems that output a probabilistic or rankable set of output events - you can turn this into a definite annotation simply by thresholding, but actually what we'd like to do is evaluate the fully "nuanced" output.
Why? Why should evaluation care about whether a system labels an event confidently or weakly? Well, it's all about what happens downstream. An example: imagine you have an automatic system for detecting events, and you apply it to a dataset of 1000 hours of audio. No automatic system is perfect, and so you often want to either (a) only focus on the strongly-detected items in later analysis, or (b) ask a human expert to go through the results to cross-check them. In the latter case, the expert does not have time to listen to all 1000 hours; instead you'd like to prioritise their work, for example by focussing on the annotations that are the most ambiguous. This kind of work is very likely in the applications I'm working with.
The statistics focussed on in the above paper (F-measure, precision, recall, accuracy, error rate) are all based on definite binary annotations, so they don't make use of the nuance. I'm generally an advocate of the "area under the ROC curve" (AUC) statistic, which doesn't tell the whole story but it helps make use of the nuance by averaging over a whole range of possible detection thresholds.
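For anyone unfamiliar with it: AUC has a handy interpretation as the probability that a randomly chosen positive example outranks a randomly chosen negative one (the Mann-Whitney U statistic), which makes it easy to compute directly from scores without ever choosing a threshold. A minimal sketch with hypothetical detector confidences (the scores and function name are my own invention, for illustration):

```python
def auc(scores_pos, scores_neg):
    """Area under the ROC curve via the Mann-Whitney U statistic:
    the probability that a randomly chosen positive outscores a
    randomly chosen negative (ties count half)."""
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical detector confidences for sound events that do / don't
# truly occur. A thresholded F-measure collapses these to binary
# decisions at one operating point; AUC uses the whole ranking.
pos = [0.9, 0.8, 0.75, 0.6, 0.4]   # ground-truth positives
neg = [0.7, 0.5, 0.3, 0.2, 0.1]    # ground-truth negatives
print(auc(pos, neg))   # -> 0.88
```

A perfect ranking gives 1.0 and chance gives 0.5, regardless of where you would eventually place the detection threshold - which is exactly the nuance the binary metrics throw away.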
A nice example of a paper which uses AUC for event detection is "Chime-home: A dataset for sound source recognition in a domestic environment" by Foster et al. The above paper does mention this in passing, but doesn't really tease out why anyone would use AUC or how it differs from the DCASE2016 paradigm.
I want to be clear that the Mesaros et al. paper is not "wrong" or anything like that. I just wish it had a section on evaluating ranked/probabilistic outputs, why that might matter, and what metrics come in useful. Similarly, the sed_eval toolbox doesn't have an implementation of AUC for event detection. Presumably fairly straightforward to add it to its "segment-wise" metrics. Maybe one day!
On Monday I did a bit of what-you-might-call standup - at the Science Showoff night. Here's a video - including a bonus extra bird-imitation contest at the end!
I'm nearing the end of a great three-week research visit to the Max-Planck Institute for Ornithology at Seewiesen (Germany). It's a lovely place dedicated to the study of birds. Full of birds and ornithologists:
I'm visiting Manfred Gahr's group. We had some ideas in advance and ...
When you work with birdsong you encounter a lot of rapid frequency modulation (FM), much more than in speech or music. This is because songbirds have evolved specifically to be good at it: as producers they have muscles specially adapted for rapid FM (Goller and Riede 2012), and as listeners ...
I'm happy to say I'm now supervising two PhD students, Pablo and Veronica. Veronica is working on my project all about birdsong and machine learning - so I've got some notes here about recommended reading for someone starting on this topic. It's a niche topic but it ...