

InterSpeech09 conference: emotional speech

The InterSpeech conference was in Brighton this year - now, my research is all about "non-speech" voice (e.g. beatboxing) but I took the opportunity to go down and see what the speech folks were up to.

Automatic speech recognition is the "traditional" problem for computers+speech, but there's been a recent tendency to try to automatically recognise the emotional content too. This year was the first year of the InterSpeech "emotion challenge", in which researchers were challenged to automatically detect a range of emotions in an audio dataset - recorded from schoolchildren who were trying to guide an Aibo robot round a track, apparently with emotive consequences...

I was surprised that many of the approaches to emotion recognition were so similar to the standard speech-recognition model: take MFCCs plus maybe some other measurements, model them with GMMs, classify the results (maybe with an HMM) - so far, so 1960s. The spectral measures (MFCCs) were typically augmented with prosodic measures, such as the amount of pausing in a sentence or measures of how the speaking pitch varied, and in quite a few of the papers these prosodic features actually performed pretty strongly, often beating the spectral features. But I was surprised they were still relatively simple measures - no intricate prosody-specific models of temporal variation, for example; most seemed to use just the average, minimum and maximum pitch. Combining the two types of data (spectral plus prosodic) was often best, but didn't seem to give a dramatic uplift over using just one type. I suspect that more specific models could push the prosodic side a long way in the next few years. The winner of the "emotion challenge" used a kind of hand-designed decision-tree approach - pretty nice, because they'd designed the classifier from theoretical motivations.
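To show just how simple those prosodic summaries are, here's a toy sketch in Python - not code from any of the papers, just my own illustration of pulling average/minimum/maximum pitch out of a signal with a naive autocorrelation pitch estimate (the energy threshold and frame sizes are arbitrary choices):

```python
import numpy as np

def autocorr_pitch(frame, sr, fmin=75.0, fmax=500.0):
    """Crude single-frame pitch estimate via autocorrelation (illustrative only)."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + np.argmax(ac[lo:hi])          # strongest lag in the plausible range
    return sr / lag

def prosodic_summary(signal, sr, frame_len=1024, hop=512):
    """The kind of simple prosodic feature set many papers used:
    average/min/max pitch over frames that pass a naive energy gate."""
    pitches = []
    for start in range(0, len(signal) - frame_len, hop):
        frame = signal[start:start + frame_len]
        if np.sqrt(np.mean(frame ** 2)) > 0.01:  # skip near-silent frames
            pitches.append(autocorr_pitch(frame, sr))
    p = np.array(pitches)
    return {"mean": p.mean(), "min": p.min(), "max": p.max()}

# Sanity check on a synthetic 200 Hz tone
sr = 16000
t = np.arange(sr) / sr
feats = prosodic_summary(np.sin(2 * np.pi * 200 * t), sr)
```

A real system would at least add voicing detection and pause statistics, but even this little summary gives a classifier something prosodic to chew on.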

One thing about "emotion" is that it shares a problem with "timbre" (the musical attribute I deal with in my research): it's still very hard to pin down exactly what you mean by it - specifically, whether it's a continuous attribute or a set of categories. Many datasets are labelled categorically: people mark a given word or sentence as neutral/scared/happy/anxious/etc. But increasingly people are focusing on a continuous approach in which emotion is treated as a 3D space, where one dimension is "arousal" (varying from calm to excited), one is "valence" (bad to good), and one is "potency" (dominated to dominant). If you combine those 3 dimensions variously you can cover the standard emotions pretty well (excitement, depression, boredom, anger, etc). This 3D approach gets around various cultural issues in the exact meaning of the labels, allows some more refined analysis, and I believe it comes from a pretty well-validated area of psychology, although I don't know the literature on that.
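To make the "combine the dimensions" idea concrete, here's a toy sketch: place some category labels at points in the arousal/valence/potency cube, then snap any continuous estimate to its nearest label. The coordinates below are my own illustrative guesses, not values from the psychology literature:

```python
import math

# Illustrative coordinates only -- NOT values from the psychology literature.
# Axes: (arousal, valence, potency), each running from -1 to +1.
EMOTIONS = {
    "neutral":   ( 0.0,  0.0,  0.0),
    "happy":     ( 0.5,  0.8,  0.4),
    "angry":     ( 0.8, -0.6,  0.6),   # high arousal, bad valence, dominant
    "scared":    ( 0.7, -0.7, -0.6),   # like anger, but dominated not dominant
    "bored":     (-0.7, -0.3, -0.2),
    "depressed": (-0.6, -0.8, -0.7),
}

def nearest_emotion(point):
    """Map a continuous (arousal, valence, potency) estimate to the closest label."""
    return min(EMOTIONS, key=lambda name: math.dist(point, EMOTIONS[name]))

label = nearest_emotion((0.75, -0.65, 0.5))
```

Note how "angry" and "scared" sit close together on arousal and valence and are separated mainly by potency - one reason a 2D arousal/valence space isn't quite enough.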

Oh, and there was a nice talk about automatically analysing and detecting laughter. Laughter is characterised by the bursts of vocal effort we put in, via the lungs and the tension in the vocal folds, which distinguishes it quite well from ordinary speech. So these people used a nice simple technique to estimate the glottal pulses (the moments of energy that come from our vocal folds), and to spot when these became more effortful and more frequent. You can't use an ordinary pitch tracker, because each laugh is far too brief for a standard tracker to latch on to the quick pitch changes, but their custom analysis (plus a very basic classifier) seemed able to detect moments of laughter in TV talk shows etc. The analysis method (the zero-frequency filter) is technically very simple and potentially a useful trick...
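For the curious, the general zero-frequency-filtering idea fits in a few lines: pass the signal (twice) through a resonator centred at 0 Hz, then repeatedly subtract a moving average to strip off the runaway trend; the positive-going zero crossings of what's left approximate the glottal closure instants. This is my own simplified reconstruction of the technique, not the authors' code - the window size and other settings are guesses:

```python
import numpy as np
from scipy.signal import lfilter

def zero_frequency_filter(signal, sr, win_ms=10.0):
    """Rough sketch of zero-frequency filtering for glottal epoch detection."""
    x = np.diff(np.asarray(signal, dtype=float), prepend=0.0)  # remove DC steps
    # Two cascaded resonators at 0 Hz, i.e. repeated double integration
    y = lfilter([1.0], [1.0, -2.0, 1.0], x)
    y = lfilter([1.0], [1.0, -2.0, 1.0], y)
    # The output grows polynomially, so subtract a moving average
    # (window of roughly 1-2 pitch periods) a few times to remove the trend
    w = int(sr * win_ms / 1000)
    kernel = np.ones(2 * w + 1) / (2 * w + 1)
    for _ in range(3):
        y = y - np.convolve(y, kernel, mode="same")
    # Epochs: negative-to-positive zero crossings of the trend-removed signal
    return np.where((y[:-1] < 0) & (y[1:] >= 0))[0]

# Toy input: an impulse train at 100 Hz standing in for glottal pulses
sr = 16000
toy = np.zeros(sr // 2)
toy[::160] = 1.0
epochs = zero_frequency_filter(toy, sr)
```

Because the filtered signal ends up looking roughly like a sinusoid at the pitch rate, the epoch spacing gives you the pitch and the slope near each crossing says something about how effortful the excitation is - which is presumably the handle the laughter detector used.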
