Perceptually-modelled audio analysis

This week I went to a research workshop in Plymouth called Making Sense of Sounds. It was all based around an EU project which aims to improve the state of the art in auditory models (i.e. models of what happens imbetween our ear and our consciousness, to turn a physical sound into an auditory perception) and also use them to help computers and machines to understand sound.

I won't blog the whole thing but just a few notes here. There was a lot of research on the streaming paradigm, and it's quite amazing how it's still possible to discover new facts about human hearing using such a simple sound. Basically, the sound is usually something like "bip boop bip, bip boop bip, bip boop bip", and the clever bit is that we can either hear this as a single stream or as two segregated streams (a bip stream and a boop stream), depending on the relative pitches and durations. It's an example of "bistable perception", just like famous optical illusions such as the Necker cube or the faces/vase thing. With modern EEG and fMIR brain scanning, this streaming paradigm shows some interesting facts about how we hear sounds - for example, it seems that our auditory system does entertain both "versions" at some point, but this resolves to just one choice at some point below conscious perception.

I was interested by Maria Chait's talk on change detection, and in conversation afterwards she pointed us to some recent research - see this 2010 paper by Scholl et al - which shows that humans have neurons which are able to detect note offsets, even though it's very well established that in behaviour we're very bad at noticing them - i.e. we often can't tell what happened when a sound stops, but it's usually pretty noticeable when a sound starts!

Those findings aren't completely incompatible, of course. It's plausible that in human evolution, sudden sounds were more important than sudden silences, even though both are informative.

Maneesh Sahani talked about two of his students' work. The one that was new to me was Phillip Herrmann's thesis on pitch perception and was a really interesting approach - rather than using a spectral or autocorrelation method, they started from a generative model in which we assume there is some pitch rate generating an impulse train, and some impulse response convolved with it, and also some gaussian noise etc, then this goes into some auditory model before arriving at a representation which we have to make inferences about. They then did inference applying this model to audio signals. The point is not whether this is an appropriate model for most sounds, just whether this assumption gets you far enough to do pitch perception in similar ways as humans do (with some of the attendant peculiarities).

One particularly nice experiment they came up with is another kind of "bistable perception" experiment where you have a train of impulses separated by 2ms, and every second impulse is optionally attenuated by some amount. So if there's no attenuation, you have a 2ms impulse train; if there's full attentuation, you have a 4ms impulse train; somewhere imbetween, you're somewhere imbetween. If you play these sounds to humans, they can report ambiguous pitch perception, sometimes detecting the higher octave, sometimes the lower, and this Herrmann/Sahani model apparently replicates the human data in a pretty good way that is not reflected in autocorrelation models.

Oh, also, over a diverse dataset, they apparently found a really clear square-root correlation between fundamental frequency and spectral centroid. (In other parts of the literature, it's not clear whether or not the two are correlated.) I'd like to see the data for this one - as I mentioned to Maneesh, there might be reasons to expect some data to do this by design (e.g. professional singers' voices). The point for Herrmann/Sahani is to see if the correlation exists in the data that might have "trained" our perception, so I'm not sure if things like professional singers should be included or not.

Maneesh Sahani also said at the start of his talk that Helmholtz (in the 19th century) came up with this idea of "perception as inference" - but then the electrical/computational signal-processing paradigm came along and everyone treated perception as processing. The modern Bayesian tendency, and its use to model perception, is a return to this "perception as inference". Is there anything that wasn't originally invented by Helmholtz?

Also Tom Walters' demo of his AIMC real-time perceptual model in C++ was nice, and it's code I'd like to make use of some time.

My own contribution, a poster about using chirplets to analyse birdsong, led to some interesting conversations. At least one person was sure I should be using auditory models instead of chirplets - which, given the context, I should have expected :)

Thu 23 February 2012 | science | Permalink

mcld.co.uk

Other things on this site...