For two months I've been visiting Richard Turner and the Machine Learning Group at Cambridge University. It's been a very stimulating visit. As part of my fellowship on applying machine learning to bird sounds, this was planned as a time to think about methods appropriate for the various ways we want to analyse bird sounds - in particular given the constraints of uncontrolled audio recorded in the wild.
Non-negative matrix factorisation (NMF) is conceptually simple and easy to optimise, and there have been some interesting recent extensions to hierarchical representations and so forth, which might allow for a structured decomposition of an audio scene. One thing I'd love to do is augment NMF with Markov renewal process temporal modelling, and it looked like Cemgil-style NMF would give us a way to inject that as a prior on the activation patterns, but then we found a hole in our maths which meant it wasn't going to give us that. NMF models are interesting and very clear, but it's not always obvious when your problem will admit a cute algorithm to solve it. Still, there are lots of interesting things one can do with NMF.
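To make the "conceptually simple and easy to optimise" point concrete, here's a minimal sketch of the classic multiplicative-update NMF algorithm (Lee & Seung's Frobenius-norm version) on a toy non-negative matrix - just the plain vanilla algorithm, not the hierarchical or Cemgil-style variants mentioned above, and with made-up dimensions standing in for a real spectrogram:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy "spectrogram": non-negative data generated from 3 hidden components.
F, T, K = 20, 50, 3               # frequency bins, time frames, components
V = rng.random((F, K)) @ rng.random((K, T))

# Random non-negative initialisation of spectral templates W and activations H.
W = rng.random((F, K))
H = rng.random((K, T))
eps = 1e-9                        # avoid division by zero

errs = []
for it in range(200):
    # Multiplicative updates: each step keeps W and H non-negative
    # and never increases the Frobenius reconstruction error.
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)
    errs.append(np.linalg.norm(V - W @ H))

print(errs[0], errs[-1])          # error falls as the factorisation improves
```

The non-negativity is what makes the decomposition "parts-based": components can only add energy, never cancel, which suits additive audio scenes like overlapping bird sounds.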
We then put most of our time into looking at convolutional auto-encoders (ConvAEs). As with the rest of the neural net renaissance, these offer very flexible ways to model data. An auto-encoder is good for unsupervised learning, and has a lot of potential for learning useful representations of data, given appropriate constraints. These have been used for all sorts of purposes, and occasionally for audio.
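For readers who haven't met auto-encoders: the basic idea is a network trained to reproduce its input through a narrower (or otherwise constrained) code layer, so the code is forced to capture structure in the data. Here's a minimal sketch of that idea in plain NumPy - a tiny fully-connected auto-encoder on toy data, not the convolutional architecture discussed here, and all sizes and learning rates are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 points lying near a 2-D subspace embedded in 10 dimensions.
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 10)) \
    + 0.01 * rng.normal(size=(200, 10))

n_in, n_hid = 10, 2               # bottleneck narrower than the input
W1 = 0.1 * rng.normal(size=(n_in, n_hid))   # encoder weights
W2 = 0.1 * rng.normal(size=(n_hid, n_in))   # decoder weights
lr = 0.01

losses = []
for step in range(500):
    H = np.tanh(X @ W1)           # encode: compress to a 2-D code
    Xhat = H @ W2                 # decode: reconstruct the input
    err = Xhat - X
    losses.append(np.mean(err ** 2))
    # Backpropagate the mean-squared reconstruction error.
    gW2 = H.T @ err / len(X)
    gH = (err @ W2.T) * (1 - H ** 2)
    gW1 = X.T @ gH / len(X)
    W1 -= lr * gW1
    W2 -= lr * gW2

print(losses[0], losses[-1])      # reconstruction error should fall
```

A convolutional auto-encoder applies the same encode-then-reconstruct training, but with convolutional layers so the learned features are shift-invariant - handy for spectrograms, where a bird call can appear at any point in time.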
Some interesting recent papers look at how to get a structured/semantic representation out of an autoencoder. This is often helped by having speech/vision datasets which are highly structured themselves (e.g. a photo of the same face from many angles and many lighting conditions). With natural birdsong we don't really have that opportunity, so the interesting question is whether we can design a system to do something along those lines despite the uncontrolled (and often unlabelled!) data.
I'm not going to say too much about the method here because the work isn't finished, but here's a work-in-progress image, showing (in the top row) a spectrogram of some birdsong contaminated by background noise. In the lower two rows the autoencoder is outputting an estimate of the foreground and of the background. Not perfect but certainly encouraging.
Thanks to Rich and the group for their welcome in Cambridge!