People who do technical work with sound use spectrograms a heck of a lot. This standard way of visualising sound becomes second nature to us.
As you can see from these photos, I like to point at spectrograms all the time:
(Our research group even makes some really nice software for visualising sound which you can download for free.)
It's helpful to transform sound into something visual. You can point at it, you can discuss tiny details, etc. But sometimes, the spectrogram becomes a stand-in for listening. When we're labelling data, for example, we often listen and look at the same time. There's a rich tradition in bioacoustics of presenting and discussing spectrograms while trying to tease apart the intricacies of some bird or animal's vocal repertoire.
But there's a question of validity. If I look at two spectrograms and they look the same, does that mean the sounds actually sound the same?
In a strict sense, we already know that the answer is "No". We audio people can construct counterexamples pretty easily, in which there's a subtle audio difference that's not visually obvious (e.g. phase coherence). But it could perhaps be even worse than that: similarities might not just be easier or harder to spot, they might actually be different. If we have a particular sound X, it could audibly be more similar to A than B, while visually it could be more similar to B than A. If this were indeed true, we'd need to be very careful about performing tasks such as clustering sounds or labelling sounds while staring at their spectrograms.
So - what does the research literature say? Does it give us guidance on how far we can trust our eyes as a proxy for our ears? Well, it gives us hints but so far not a complete answer. Here are a few relevant factoids which dance around the issue:
- Agus et al (2012) found that people could respond particularly fast to voice stimuli vs other musical stimuli (in a go/no-go discrimination task), and that this speed wasn't explained by the "distance" measured between spectrograms. (There's no visual similarity judgment here, but a pretty good automatic algorithm for comparing spectrograms [actually, "cochleagrams"] acts as a proxy.)
- Another example which Trevor Agus sent me - I'll quote him directly: "My favourite counterexample for using the spectrogram as a surrogate for auditory perception is Thurlow (1959), in which he shows that we are rubbish at reporting the number of simultaneous pure tones, even when there are just 2 or 3. This is a task that would be trivial with a spectrogram. A more complex version would be Gutschalk et al. (2008) in which sequences of pure tones that are visually obvious on a spectrogram are very difficult to detect audibly. (This builds on a series of results on the informational masking of tones, but is a particularly nice example and audio demo.)"
- Zue (1985) gives a very nice introduction and study of "spectrogram reading of speech" - this is when experts learn to look at a spectrogram of speech and to "read" from it the words/phonemes that were spoken. It's difficult and anyone who's good at it will have had to practice a lot, but as the paper shows, it's possible to get up to 80-90% accuracy on labelling the phonemes. I was surprised to read that "There was a tendency for consonants to be identified more accurately than vowels", because I would have thought the relatively long duration of vowels and the concentration of energy in different formants would have been the biggest clue for the eye. Now, the paper is arguing that spectrogram reading is possible, but I take a different lesson from it here: 80-90% is impressive but it's much much worse than the performance of an expert who is listening rather than looking. In other words, it demonstrates that looking and listening are very different, when it comes to the task of identifying phonemes. There's a question one can raise about whether spectrogram reading would reach a higher accuracy if someone learned it as thoroughly as we learn to listen to speech, but that's unlikely to be answered any time soon.
- Rob Lachlan pointed out that most often we look at standard spectrograms which have a linear frequency scale, whereas our listening doesn't really treat frequency that way - it is more like a logarithmic scale, at least at higher frequencies. This could be accommodated by using spectrograms with log, mel or ERB frequency scales. People do have a habit of using the standard spectrogram, though, perhaps because it's the common default in software and because it's the one we tend to be most familiar with.
- We know that listening can be highly accurate in many cases. This is exploited in the field of "auditory display" in which listening is used to analyse scatter plots and all kinds of things. Here's a particularly lovely example quoted from Barrett & Kramer (1999): "In an experiment dating back to 1945, pilots took only an hour to learn to fly using a sonified instrument panel in which turning was heard by a sweeping pan, tilt by a change in pitch, and speed by variation in the rate of a “putt putt” sound (Kramer 1994a, p. 34). Radio beacons are used by rescue pilots to home-in on a tiny speck of a life-raft in the vast expanse of the ocean by listening to the strength of an audio signal over a set of radio headphones."
- James Beauchamp sent me their 2006 study - again, they didn't use looking-at-spectrograms directly, but they did compare listening vs spectrogram analysis, as in Agus et al. The particularly pertinent thing here is that they evaluated this using small spectral modifications, i.e. very fine-scale differences. He says: "We attempted to find a formula based on spectrogram data that would predict percent error that listeners would incur in detecting the difference between original and corresponding spectrally altered sounds. The sounds were all harmonic single-tone musical instruments that were subjected to time-variant harmonic analysis and synthesis. The formula that worked best for this case (randomly spectrally altered) did not work very well for a later study (interpolating between sounds). Finding a best formula for all cases seems to still be an open question."
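To make Rob Lachlan's point about frequency scales a bit more concrete, here is a rough Python sketch of one of the alternatives mentioned, the mel scale. It uses the commonly quoted formula mel = 2595 * log10(1 + f/700); other variants of the mel scale exist, and the function names here are made up for illustration. It shows that bands equally spaced in mel get wider and wider in Hz as you go up, which is the opposite of what a linear-frequency spectrogram displays.

```python
import math


def hz_to_mel(freq_hz):
    """Convert Hz to mel using the common 2595*log10(1 + f/700) formula."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_spaced_frequencies(f_min, f_max, n_bands):
    """Return n_bands centre frequencies (Hz), equally spaced in mel."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (n_bands - 1)
    return [mel_to_hz(m_min + i * step) for i in range(n_bands)]


if __name__ == "__main__":
    # Print each centre frequency alongside its position on the mel axis.
    for f in mel_spaced_frequencies(0.0, 8000.0, 9):
        print(f"{f:8.1f} Hz  ->  {hz_to_mel(f):7.1f} mel")
```

A spectrogram drawn with its rows at these frequencies gives the low end much more room than the default linear plot does, which may or may not be closer to what you hear, but it is at least a different picture.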
Really, what does this all tell us? It tells us that looking at spectrograms and listening to sounds are different in so many ways that we definitely shouldn't expect the fine details to match up. We can probably trust our eyes for broad-brush tasks such as labelling sounds that are quite distinct, but for the fine-grained comparisons (which we often need in research) one should definitely be careful, and use actual auditory perception as the judge when it really matters. How to know when this is needed? Still a question of judgment, in most cases.
My thanks go to Trevor Agus, Michael Mandel, Rob Lachlan, Anto Creo and Tony Stockman for examples quoted here, plus all the other researchers who kindly responded with suggestions.
Know any research literature on the topic? If so do email me - NB there's plenty of literature on the accuracy of looking or of listening in various situations; here the question is specifically about comparisons between the two modalities.
My blog has been running for more than a decade, using the same cute-but-creaky old software made by my chum Sam. It was a lo-fi PHP and MySQL blog, and it did everything I needed. (Oh and it suited my stupid lo-fi blog aesthetics too, the clunky visuals are entirely my fault.)
Now, if you were starting such a project today you wouldn't use PHP and you wouldn't use MySQL (just search the web for all the rants about those technologies). But if it isn't broken, don't fix it. So it ran for 10 years. Then my annoying web provider TalkTalk messed up and lost all the databases. They lost all ten years of my articles. SO. What to do?
Well, one thing you can do is simply drop it and move on. Make a fresh start. Forget all those silly old articles. Sure. But I have archivistic tendencies. And the web's supposed to be a repository for all this crap anyway! The web's not just a medium for serving you with Facebook memes, it's meant to be a stable network of stuff. So, ideal would be to preserve the articles, and also to prevent link rot, i.e. make sure that the URLs people have been using for years will still work...
So, job number one, find your backups. Oh dear. I have a MySQL database dump from 2013. Four years out of date. And anyway, I'm not going back to MySQL and PHP, I'm going to go to something clean and modern and ideally Python-based... in other words Pelican. So even if I use that database I'm going to have to translate it. So in the end I found three different sources for all my articles:
- The old MySQL backup from 2013. I had to install MySQL software on my laptop (meh), load the database, and then write a script to iterate through the database entries and output them as nice markdown files.
- archive.org's beautiful Wayback Machine. If you haven't already given money to archive.org then please do. They're the ones making sure that all the old crap from the web 5 years ago is still preserved in some form. They're also doing all kinds of neat things like preserving old video games, masses and masses of live music recordings, and more. ... Anyway I can find LOTS of old archived copies of my blog items. There are two problems with this though: firstly they don't capture everything and they didn't capture the very latest items; and secondly the material is not stored in "source" format but in its processed HTML form, i.e. the form you actually see. So to address the latter, I had to write a little regular expression based script to snip the right pieces out and put them into separate files.
- For the very latest stuff, much of it was still in Google's web cache. If I'd thought of this earlier, I could have rescued all the latest items, since Google is I think the only service that crawls fast enough and widely enough to have captured all the little pages on my little site. So, just like with archive.org, I can grab the HTML files from Google, and scrape the content out using regular expressions.
That got me almost everything. I think the only thing missing is one blog article from a month ago.
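For the curious, the regular-expression scraping described above might look roughly like the following Python sketch. This is not the actual script: the `blogentry` div and the `h2` title are hypothetical markers standing in for whatever the old blog's HTML template really emitted, and the function names are invented. The output uses Pelican's plain "Title:" / "Date:" metadata header for Markdown files, with the rescued HTML left as-is in the body (Markdown tolerates inline HTML).

```python
import re

# Hypothetical markup: the real pattern depends on the old blog's template.
# The non-greedy body match stops at the first </div>, so it would need
# adjusting if article bodies themselves contained nested <div> elements.
ENTRY_PATTERN = re.compile(
    r'<div class="blogentry">\s*<h2>(?P<title>.*?)</h2>\s*(?P<body>.*?)</div>',
    re.DOTALL,
)


def extract_entries(html):
    """Snip each blog entry's title and body out of an archived HTML page."""
    return [
        {"title": m.group("title").strip(), "body": m.group("body").strip()}
        for m in ENTRY_PATTERN.finditer(html)
    ]


def to_pelican_markdown(entry, date):
    """Render one entry as a Markdown file with a Pelican metadata header."""
    return f"Title: {entry['title']}\nDate: {date}\n\n{entry['body']}\n"


if __name__ == "__main__":
    page = (
        '<html><div class="blogentry"><h2>Hello</h2>'
        "<p>First post</p></div></html>"
    )
    for entry in extract_entries(page):
        print(to_pelican_markdown(entry, "2017-01-01"))
```

Regular expressions are a blunt tool for HTML, but for pages that all came out of one small template they are usually good enough for a one-off rescue job.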
Next step: once you've rescued your data, build a new blog. This was easy because Pelican is really nice and well-documented too. I even recreated my silly old theme in their templating system. I thought I'd have problems configuring Pelican to reproduce my old site, but it's basically all done, even the weird stuff like my separate "recipes" page which steals one category from my blog and reformats it.
Now how to prevent link rot? The Pelican pages have URLs like "/blog/category/science.html" instead of the old "/blog/blog.php?category=science", and if I'm moving away from PHP then I don't really want those PHP-based links to be the ones used in future. I need to catch people who are going to one of those old links, and point them straight at the new URLs. The really neat thing is that I could use Pelican's templating system to output a little lookup table, a CSV file listing all the URL rewrites needed. Then I write a tiny little PHP script which uses that file and emits HTTP Redirect messages. ........... and relax. A URL like http://www.mcld.co.uk/blog/blog.php?category=science is back online.
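The real redirect script is a few lines of PHP, but the lookup logic can be sketched in Python to show the idea. The assumptions here are mine: a two-column CSV of old URL then new URL, a permanent (301) redirect, and the function names. None of that is taken from the actual script.

```python
import csv
import io
from urllib.parse import urlsplit


def load_redirects(csv_file):
    """Read rows of (old_url, new_url) into a dict; the layout is assumed."""
    return {row[0]: row[1] for row in csv.reader(csv_file) if len(row) >= 2}


def resolve(requested_url, redirects):
    """Return (status, location) for a requested old-style URL."""
    parts = urlsplit(requested_url)
    # Match on path plus query string, ignoring scheme and host.
    key = parts.path + ("?" + parts.query if parts.query else "")
    if key in redirects:
        return 301, redirects[key]  # 301 = moved permanently (an assumption)
    return 404, None


if __name__ == "__main__":
    table = load_redirects(io.StringIO(
        "/blog/blog.php?category=science,/blog/category/science.html\n"
    ))
    print(resolve("http://www.mcld.co.uk/blog/blog.php?category=science", table))
```

Keeping the table as a generated CSV means the redirects stay in step with the site: every time the site is rebuilt, the list of old-to-new mappings is rebuilt with it.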
Last year I took part in the Dagstuhl seminar on Vocal Interactivity in-and-between Humans, Animals and Robots (VIHAR). Many fascinating discussions with phoneticians, roboticists, and animal behaviourists (ethologists).
One surprisingly difficult topic was to come up with a basic data model for describing multi-party interactions. It was so easy to pick a hole in any given model: for example, if we describe actors taking "turns" which have start-times and end-times, then are we really saying that the actor is not actively interacting when it's not their turn? Do conversation participants really flip discretely between an "on" mode and an "off" mode, or does that model ride roughshod over the phenomena we want to understand?
I was reminded of this modelling question when I read this very interesting new journal article by a Japanese research group: "HARKBird: Exploring Acoustic Interactions in Bird Communities Using a Microphone Array". They have developed this really neat setup with a portable microphone array attached to a laptop which does direction-estimation and decodes which birds are heard from which direction. In the paper they use this to help annotate the time-regions in which birds are active, a bit like the on/off model I mentioned above. Here's a quick sketch:
From this type of data, Suzuki et al calculate a measure called the transfer entropy which quantifies the extent to which one individual's vocalisation patterns contain information that predicts the patterns of another. It gives them a hypothesis test for whether one particular individual affects another, in a network: who is listening to whom?
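To give a feel for what transfer entropy measures, here is a minimal Python sketch. It is not the estimator used in the paper: it assumes each bird's activity has been chopped into equal time frames and coded as a discrete symbol (say 0 for silent, 1 for vocalising), it uses a history length of one frame, and it is a simple plug-in estimate from counts with no significance test. With those assumptions, the transfer entropy from X to Y is the sum over observed triples of p(y', y, x) * log2[ p(y' | y, x) / p(y' | y) ], where y' is the next frame of Y.

```python
from collections import Counter
from math import log2


def transfer_entropy(source, target):
    """Plug-in transfer entropy (bits) from source to target, history 1.

    Both arguments are equal-length sequences of discrete symbols,
    one symbol per time frame.
    """
    if len(source) != len(target):
        raise ValueError("sequences must have the same length")

    # Count triples (y_next, y_now, x_now) over consecutive frames.
    triples = Counter()
    for t in range(len(target) - 1):
        triples[(target[t + 1], target[t], source[t])] += 1
    n = sum(triples.values())

    # Marginal counts needed for the two conditional probabilities.
    count_yx = Counter()  # (y_now, x_now)
    count_yy = Counter()  # (y_next, y_now)
    count_y = Counter()   # y_now
    for (y_next, y_now, x_now), c in triples.items():
        count_yx[(y_now, x_now)] += c
        count_yy[(y_next, y_now)] += c
        count_y[y_now] += c

    te = 0.0
    for (y_next, y_now, x_now), c in triples.items():
        p_joint = c / n
        p_given_both = c / count_yx[(y_now, x_now)]
        p_given_self = count_yy[(y_next, y_now)] / count_y[y_now]
        te += p_joint * log2(p_given_both / p_given_self)
    return te


if __name__ == "__main__":
    x = [0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1]
    y = [0] + x[:-1]  # y simply echoes x one frame later
    print("x -> y:", transfer_entropy(x, y))
    print("y -> x:", transfer_entropy(y, x))
```

The measure is asymmetric, which is what lets it point an arrow from one individual to another. On real data the plug-in estimate is biased upwards for short recordings, so some kind of comparison against shuffled or surrogate data would be needed before reading anything into a small positive value.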
That's a very similar question to the question we were asking in our journal article last year, "Detailed temporal structure of communication networks in groups of songbirds". I talked about our model at the Dagstuhl event. Here I'll merely emphasise that our model doesn't use regions of time, but point-like events:
So our model works well for short calls, but is not appropriate for data that can't be well-described via single moments in time (e.g. extended sounds that aren't easily subdivided). The advantage of our model is that it's a generative probabilistic model: we're directly estimating the characteristics of a detailed temporal model of the communication. The transfer-entropy method, by contrast, doesn't model how the birds influence each other, just detects whether the influence has happened.
I'd love to get the best of both worlds: a generative and general model for extended sound events influencing one another. It's a tall order because for point-like events, we have point process theory; for extended events I don't think the theory is quite so well-developed. Markov models work OK but don't deal very neatly with multiple parallel streams. The search continues.
A colleague pointed out this new review paper in the journal "Animal Behaviour": Applications of machine learning in animal behaviour studies.
It's a useful introduction to machine learning for animal behaviour people. In particular, the distinction between machine learning (ML) and classical statistical modelling is nicely described (sometimes tricky to convey that without insulting one or other paradigm).
The use of illustrative case studies is good. Most introductions to machine learning base themselves around standard examples predicting "unstructured" outcomes such as house prices (i.e. predict a number) or image categories (i.e. predict a discrete label). Two of the three case studies (all of which are by the authors themselves) similarly are about predicting categorical labels, but couched in useful biological context. It was good to see the case study relating to social networks and jackdaws. Not only because it relates to my own recent work with colleagues (specifically: this on communication networks in songbirds and this on monitoring the daily activities of jackdaws - although in our case we're using audio as the data source), but also because it shows an example of using machine learning to help elucidate structured information about animal behaviour rather than just labels.
The paper is sometimes mathematically imprecise: it's incorrect that Gaussian mixture models "lack a global optimum solution", for example (it's just that the global optimum can be hard to find). But the biggest omission, given that the paper was written so recently, is any real mention of deep learning. Deep learning has been showing its strengths for years now, and is not yet widely used in animal behaviour but certainly will be in years to come; researchers reading a review of "machine learning" should really come away with at least a sense of what deep learning is, and how it sits alongside other methods such as random forests. I encourage animal behaviour researchers to look at the very readable overview by LeCun et al in Nature.
Last year, when I took part in the Dagstuhl workshop on Vocal Interactivity in-and-between Humans, Animals and Robots, we had a brainstorming session, fantasising about how advanced robots might help us with animal behaviour research. "Spy" animals, if you will. Imagine a robot bird or a robot chimp, living as part of an ecosystem, but giving us the ability to modify its behaviour and study what happens. If you could send a spy to live among a group of animals, sharing food, communicating, collaborating, imagine how much you could learn about those animals!
So it particularly makes me smile to see the BBC nature doc Spy in the Wild, in which they've... gone there and done it already.
--- Well, not quite. It's a great documentary, some really astounding footage that makes you think again about what animals' inner lives are like. They use animatronic "spy" animals with film cameras in, which let them get up very close, to film from the middle of an animal's social group. These aren't autonomous robots though, they're remotely operated, and they're not capable of the full range of an animal's behaviours. They're pretty capable though: in order both to blend in and to interact, the spies can do things such as adopt submissive body language - crouching, ear movements, mouth movements, etc. And...
...some of them vocalise too. Yes there's some vocal interaction between animals and (human-piloted) robots. The vocal interaction is at a pretty simple level, it seems some of the robots have one or two pre-recorded calls built in and triggered by the operator, but it's interesting to see some occasional vocal back-and-forth between the animals and their electrical counterparts.
There are obviously some limitations. The spies generally can't move fast or dramatically. The spy birds can't fly. But - maybe soon?
In the mean time, watch the programme, it has loads of great moments caught on film.
If you're looking for a New Year's resolution how about this one: make more eye contact with strangers.
I was reading this powerful little list of Twenty Lessons from the 20th Century by some Professor of History. One idea that struck me is a very simple one:
11: Make eye contact and small talk. This is not just polite. It is a way to stay in touch with your surroundings, break down unnecessary social barriers, and come to understand whom you should and should not trust.
In a large city like the one I live in, eye contact and small talk are rare. They're even rarer thanks to smartphones, of course - although, twenty years ago, Londoners were still avoiding each other, but using newspapers, novels and Gameboys instead. Anyway I do think smartphones create a mode of interaction which reduces incidental eye contact etc.
So I decided to take the advice. Over the past month or so I took those little opportunities - at the bus stop, at the pedestrian crossing, at the supermarket. A bit of eye contact, a few words about the traffic or whatever. I was surprised how many opportunities for effortless (and not awkward!) tiny bits of smalltalk there were and how worthwhile it was to take them. After the year we've had, this is a little tweak you can try, and who knows, it might help.
I've been cooking vegetarian in 2016. It's about climate change: meat-eating is a big part of our carbon footprint, and it's something we can change. So here I'm sharing some of the best veggie recipes I found this year. Most of them are not too complex ...
The Twelvetrees Ramp is open! It's the "missing link" in the walk down the River Lea from the Olympic Park all the way down to Cody Dock. Previously, to complete the walk you had to come off the river at Three Mills and go on an ugly detour round ...
The House of Commons Science and Technology Committee has published its report into the implications of leaving the EU for UK science and research. The report is accompanied by a set of conclusions and recommendations.
By the way: the implications of Brexit (if indeed the UK ends up going through ...
This is a good hearty Sunday lunch for a vegetarian. One thing I'm missing as I increase my vegetarian-ness is something that's a proper centrepiece for a Sunday roast - those "nut roast" things which are fairly common are OK but I don't think I've had one ...