Some papers seen at InterSpeech 2016

InterSpeech 2016 was a very interesting conference. I have been to InterSpeech before, yes - but I'm not a speech-recognition person so it's not my "home" conference. I was there specifically for the birds/animals special session (organised by Naomi Harte and Peter Jancovic), but it was also a great opportunity to check in on what's going on in speech technology research.

Here's a note of some of the interesting papers I saw. I'll start with some of the birds/animals papers:

That's not all the bird/animal papers, sorry, just the ones I have comments about.

And now a sampling of the other papers that caught my interest:

  • Retrieval of Textual Song Lyrics from Sung Inputs by Anna Kruspe. Nice to see work on aligning song lyrics against audio recordings - something the field of MIR needs. The example application here: if you sing a few words, can a system retrieve the right song audio from a karaoke database? (A toy retrieval sketch follows this list.)
  • The auditory representation of speech sounds in human motor cortex - this journal article has some of the amazing findings presented by Eddie Chang in his fantastic keynote speech, discovering the way phonemes are organised in our brains, both for production and perception.
  • Today's Most Frequently Used F0 Estimation Methods, and Their Accuracy in Estimating Male and Female Pitch in Clean Speech by Sofia Strömbergsson. This survey is a great service for the community. The general conclusion is that Praat's pitch detection remains one of the best off-the-shelf recommendations (for speech analysis, that is - the evaluation hasn't been done for non-human sounds!). (A usage sketch follows this list.)
  • Supervised Learning of Acoustic Models in a Zero Resource Setting to Improve DPGMM Clustering by Heck et al - "zero-resource" speech analysis is interesting to me because it could be relevant for bird sounds. "Zero resource" means analysing languages for which we have no corpora or other helpful data available - all we have is audio recordings. (Sound familiar?) In this paper the authors used supervised adaptation techniques to improve a method introduced the previous year, based on unsupervised nonparametric clustering. (A rough sketch of that flavour of clustering follows this list.)
  • Speech reductions cause a de-weighting of secondary acoustic cues by Varnet et al: a study of some niche aspects of human listening. By testing people listening to speech examples in noise, they found that listeners' use of secondary cues - cues that help to distinguish one phoneme from another but are embedded elsewhere in the word than in the phoneme itself - changes according to the nature of the stimulus. Yet more evidence that perception is an active, context-sensitive process.
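
As promised, here's a toy sketch of the query-by-singing retrieval idea. Note the hedge: Kruspe's paper aligns textual lyrics against audio, whereas this illustration of mine just matches audio to audio, using subsequence DTW over chroma features with librosa (my choice of tooling, not the paper's; the filenames are hypothetical):

    import librosa

    def alignment_cost(query_path, candidate_path):
        """Lower cost = sung query matches this candidate recording better."""
        yq, sr = librosa.load(query_path)
        yc, _ = librosa.load(candidate_path, sr=sr)
        cq = librosa.feature.chroma_cqt(y=yq, sr=sr)   # pitch-class features
        cc = librosa.feature.chroma_cqt(y=yc, sr=sr)
        # Subsequence DTW: the short query may match anywhere in the candidate
        D, _ = librosa.sequence.dtw(X=cq, Y=cc, subseq=True)
        return D[-1, :].min() / cq.shape[1]            # normalised match cost

    # Retrieval = pick the database item with the cheapest alignment:
    # best = min(karaoke_files, key=lambda f: alignment_cost("query.wav", f))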
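And since the Strömbergsson survey points at Praat: here's a minimal sketch of running Praat's pitch detection from Python via the parselmouth wrapper (my own illustration, not from the paper; the filename is hypothetical):

    import parselmouth

    snd = parselmouth.Sound("speech.wav")    # hypothetical input recording
    pitch = snd.to_pitch()                   # Praat's default pitch algorithm
    f0 = pitch.selected_array['frequency']   # F0 track in Hz, 0 where unvoiced
    print("Mean F0: %.1f Hz" % f0[f0 > 0].mean())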
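For the zero-resource theme, you can get the flavour of DPGMM-style unsupervised unit discovery with an off-the-shelf Dirichlet-process Gaussian mixture. A rough sketch with scikit-learn - emphatically not the authors' actual system, and the feature frames here are random stand-ins:

    import numpy as np
    from sklearn.mixture import BayesianGaussianMixture

    frames = np.random.randn(5000, 13)   # stand-in for MFCC frames from audio
    dpgmm = BayesianGaussianMixture(
        n_components=50,                 # upper bound; unused components shrink
        weight_concentration_prior_type='dirichlet_process',
        max_iter=200)
    units = dpgmm.fit_predict(frames)    # each frame labelled with a "unit"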

One thing you won't realise from my own notes is that InterSpeech was heavily dominated by deep learning. Convolutional neural nets (ConvNets), recurrent neural nets (RNNs) - they were everywhere. Lots of discussion about connectionist temporal classification (CTC): some people say it's the best, some say it requires too much data to train properly, some say they have other tricks so they can get away without it. It will be interesting to see how that discussion evolves. (For anyone unfamiliar with CTC, there's a minimal sketch just below.) However, many of the other deep-learning papers were much of a muchness: lots of people apply a ConvNet or an RNN to one of the many tasks in speech technology and, as we all know, they can often get good results - but frequently it was application without a whole lot of insight. That's the way the state of the art is at the moment, I guess. Therefore, many of my most interesting moments at InterSpeech were deep-learning-less :) see above.
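
As promised, a minimal sketch of CTC in use, via PyTorch's built-in loss (my own illustration, not tied to any paper from the conference). The point of CTC is that the network is trained on whole label sequences, with no frame-level alignment required - which is also why it's data-hungry:

    import torch
    import torch.nn as nn

    T, N, C = 100, 4, 30   # time steps, batch size, classes (index 0 = blank)
    log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(2)
    targets = torch.randint(1, C, (N, 12), dtype=torch.long)  # label sequences
    input_lengths = torch.full((N,), T, dtype=torch.long)
    target_lengths = torch.full((N,), 12, dtype=torch.long)

    loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)
    loss.backward()        # trainable end-to-end, no per-frame alignment needed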

(Also, I had to miss the final day, to catch my return flight. Wish I'd been able to go to the VAD and Audio Events session, for example.)

Another aspect of speech technology is the emphasis on public data challenges - there are lots of them! Speech recognition, speaker recognition, language recognition, distant speech recognition, zero-resource speech recognition, de-reverberation... Some of these have been running for years and the dedication of the organisers is worth praising. Useful to check in on how these things are organised, as we develop similar initiatives in general and natural sound scene analysis.

Sunday 18th September 2016 | science | Permalink
