Getting neural networks and deep learning right for audio (WASPAA/SANE)

I'm just back from a conference visit to the USA, to attend WASPAA and SANE. Lots of interesting presentations and discussions about intelligent audio analysis.

One of the interesting threads of discussion was about deep learning and modern neural networks, and how best to use them for audio signal processing. The deep learning revolution has already had a big impact on audio: famously, deep learning gets powerful results on speech recognition and is now used pervasively in industry for that task. It's also widely studied in image and video processing.

But that doesn't mean the job is done. Firstly, speech recognition is only one of many ways we get information out of audio, and other "tasks" are not direct analogies: they have different types of inputs and outputs. Secondly, there are many different neural net architectures, and still much lively research into which architectures are best for which purposes. Part of the reason that big companies get great results for speech recognition is that they have masses and masses of data. In cases where we have modest amounts of data, or data without labels, or data with fuzzy labels, getting the architecture just right is an important thing to focus on.

And audio signal processing insights are important for getting neural nets right for audio. This was one of the main themes of Emmanuel Vincent's WASPAA keynote, titled "Is audio signal processing still useful in the era of machine learning?" (Slides.) He mentioned, for example, that the intelligent application of data augmentation is a good way for audio insight to help train deep nets well. I agree, but in the long term I think the more important point is that our expertise should be used to help get the architectures right. There's also the thorny question (and hot topic in deep learning) of how to make sense of what deep nets are actually doing: in a sense this is the flip-side of the architecturing issue, making sense of an architecture once it's been found to work!
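To make the data augmentation point concrete, here is a minimal sketch of one common audio-informed augmentation: mixing background noise into a clean recording at a chosen signal-to-noise ratio. This is not taken from the keynote. The function name `mix_at_snr`, the sine-wave "recording" and the white-noise "background" are all stand-ins chosen for illustration, and the noise is assumed to be at least as long as the clean signal.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Return clean + scaled noise so that the mixture has the requested SNR (dB).

    Hypothetical helper for illustration; assumes len(noise) >= len(clean).
    """
    clean = np.asarray(clean, dtype=float)
    noise = np.asarray(noise, dtype=float)[: len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Gain that brings the noise to the power implied by the target SNR.
    gain = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + gain * noise

rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
clean = np.sin(2 * np.pi * 440 * t)   # stand-in for a real recording
noise = rng.normal(size=t.shape)      # stand-in for recorded background noise

# Several augmented copies of the same example, from easy to hard.
augmented = [mix_at_snr(clean, noise, snr_db) for snr_db in (20, 10, 0)]
```

The audio insight lies in the choice of noise recordings and SNR range, which should resemble the conditions the trained net will actually meet.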

It's common knowledge that convolutional nets (ConvNets) and recurrent neural nets (specifically LSTMs) are powerful architectures, and in principle LSTMs should be particularly appropriate for time-series data such as audio. Lots of recent work confirms this. At the SANE workshop Tuomas Virtanen presented results showing strong performance at sound event detection (recovering a "transcript" of the events in an audio scene), and Ron Weiss presented impressive deep learning that could operate directly from raw waveforms to perform beamforming and speech recognition from multi-microphone audio. Weiss was using an architecture combining convolutional units (to create filters) and LSTM units (to handle temporal dependencies). Pablo Sprechmann discussed a few different architectures, including one "unfolded NMF"-type architecture. (The "deep unfolding" approach is certainly a fruitful idea for deep learning architectures. It was introduced a couple of years ago by Hershey et al. [EDIT: It's been pointed out that the unfolding idea was first proposed by Gregor and LeCun in 2010, and unfolded NMF was described by Sprechmann et al. in 2012. The contribution of Hershey et al. comes from the curious step of untying the unfolded parameters, which turns a truncated iterative algorithm into something more like a deep network.])
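As a rough illustration of what "unfolding" and "untying" mean, here is a minimal sketch that treats each iteration of the standard multiplicative update for NMF activations (Euclidean cost, dictionary held fixed) as one layer of a network. It is my own toy example rather than any of the presenters' systems: the function name `unfolded_nmf`, the layer count and the random data are assumptions, and no training is shown.

```python
import numpy as np

def unfolded_nmf(V, Ws, eps=1e-9):
    """Apply one multiplicative update of the activations H per "layer".

    V  : non-negative spectrogram, shape (freq, time)
    Ws : list of non-negative dictionaries, one per layer, each (freq, rank)
    """
    rank = Ws[0].shape[1]
    H = np.ones((rank, V.shape[1]))
    for W in Ws:
        # Standard multiplicative update for H under a Euclidean cost.
        H = H * (W.T @ V) / (W.T @ W @ H + eps)
    return H

rng = np.random.default_rng(0)
W_true = rng.random((64, 4))
V = W_true @ rng.random((4, 50))

# Tied: the same W at every layer, i.e. plain NMF inference truncated at 10 steps.
tied = [W_true] * 10
# Untied: every layer owns a separate copy, which training could then adjust
# independently (here they simply start out identical).
untied = [W_true.copy() for _ in range(10)]

H_tied = unfolded_nmf(V, tied)
H_untied = unfolded_nmf(V, untied)
```

Before any training the two variants give the same output. The difference is only in how many free parameters a gradient-based optimiser would be allowed to change.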

I'd like to focus on a couple of talks at SANE that exemplified how domain issues inform architectural issues:

John Hershey presented "Deep clustering: discriminative embeddings for single-channel separation of multiple sources". The task being considered was source separation of two or more speaking people recorded in a mixture, which is usually handled by applying binary masking to a spectrogram of the mixture. The task then becomes how to identify which "pixel" of the spectrogram should be assigned to which speaker. In some sense, it's a big multilabel classification task, with each pixel needing a label. Except as John pointed out, it's not really a classification task but a clustering task, because when we get a mixture of two speakers and we want to separate them, we usually have no prior labels and no reason to care who is "Speaker 1" and who is "Speaker 2". Motivated by this, Hershey described an approach where a deep learning system is trained to cluster the pixels in a latent space. The objective function happens to be the same as the K-means objective, except that instead of learning which items go in which cluster, the net is being trained to move the items around in the latent space so that the cluster separation is maximised. (The work is described in this arxiv preprint.)
Paris Smaragdis presented "NMF? Neural Nets? It's all the same...". Smaragdis is well-known for his work on NMF (non-negative matrix factorisation) methods. He presented a great narrative arc of how you might start with NMF and then throw away the things that irritate you about it - such as the spectrogram, instead working with convolutional filters learnt from raw audio.
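Returning to the deep clustering talk for a moment: the point that the task has no meaningful "Speaker 1" versus "Speaker 2" can be made concrete with an objective that compares pairwise affinities instead of labels. The sketch below uses the squared Frobenius distance between the embedding affinity matrix and the label affinity matrix. I believe this matches the form used in the preprint, but treat that as an assumption; the function name `affinity_loss` and the random data are mine.

```python
import numpy as np

def affinity_loss(V, Y):
    """Squared Frobenius distance between embedding and label affinities.

    V : embeddings, shape (n_bins, dim), one row per time-frequency "pixel"
    Y : one-hot source assignments, shape (n_bins, n_sources)
    Equals ||V V^T - Y Y^T||_F^2, expanded so that no n_bins x n_bins
    matrix ever has to be formed.
    """
    return (np.sum((V.T @ V) ** 2)
            - 2 * np.sum((V.T @ Y) ** 2)
            + np.sum((Y.T @ Y) ** 2))

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)          # which speaker dominates each pixel
Y = np.eye(2)[labels]                          # one-hot encoding
V = rng.normal(size=(200, 5))                  # stand-in for network outputs
V /= np.linalg.norm(V, axis=1, keepdims=True)  # unit-length embeddings

loss = affinity_loss(V, Y)
# Swapping the two speaker columns leaves the loss unchanged.
loss_swapped = affinity_loss(V, Y[:, ::-1])
```

Because `Y @ Y.T` only records whether two pixels belong to the same source, relabelling the speakers cannot change the loss, which is exactly the clustering-not-classification property described above.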

(Note this other recurring theme: I already mentioned that Ron Weiss was also talking about waveform-based methods. Others have worked on this before, such as Sander Dieleman's paper on "end-to-end" deep learning for music audio. It's still not clear if ditching the spectrogram is actually that beneficial. Certainly if you do, you need lots of data in order to train successfully, as Weiss demonstrated empirically. I don't think I'd recommend ditching the spectrogram yet unless you're really sure what you're doing...)

The really surprising thing about Smaragdis' talk (given his previous work) was that by the time he'd deconstructed NMF and built up a neural net having similar source-separation properties, the end result was a surprisingly recognisable autoencoder - however, with a nicely principled architecture, and also some specific modifications (some "skip" connections and choices about tying/untying parameters). This autoencoder is not the same as NMF - it doesn't have the same non-negativity constraints, for example - but is inspired by similar motivations.
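For readers who want to see the family resemblance, here is a minimal sketch of the forward pass of a two-layer autoencoder with rectified outputs, set beside the NMF model it echoes. This is my own simplification, not the architecture from the talk: it leaves out the skip connections, the layer sizes and random weights are arbitrary, and the names `autoencoder_forward`, `W_enc` and `W_dec` are assumptions.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def autoencoder_forward(X, W_enc, W_dec):
    """Encode non-negative input X into activations H, then reconstruct it."""
    H = relu(W_enc @ X)       # plays the role of NMF's activations H
    X_hat = relu(W_dec @ H)   # plays the role of NMF's reconstruction W @ H
    return H, X_hat

rng = np.random.default_rng(0)
X = rng.random((64, 50))                          # stand-in magnitude spectrogram
W_dec = rng.normal(scale=0.1, size=(64, 8))       # weights may be negative, unlike NMF

W_enc_tied = W_dec.T                              # tied: encoder reuses the decoder
W_enc_untied = rng.normal(scale=0.1, size=(8, 64))  # untied: separate encoder weights

H, X_hat = autoencoder_forward(X, W_enc_untied, W_dec)
```

The rectifier keeps the activations and the reconstruction non-negative even though the weights themselves are unconstrained, which is one way to read the remark that the model is inspired by NMF without sharing its constraints.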

Happily, these discussions relate to some work I've been involved in this year. I spent some time visiting Rich Turner in Cambridge, and we had good debates and a small study about how to design a neural network well for audio. We have a submitted paper about denoising audio without access to clean data using a partitioned autoencoder, which is the first fruit of that visit. The paper focuses on the "partitioning" issue, but the design of the autoencoder itself has some similarities to what Paris Smaragdis was describing, and for similar reasons.

There's sometimes a temptation to feel despondent about deep learning: the sense of foreboding that a generic deep network will always beat your clever insightful method, just because of the phenomenal amounts of data and compute-hours that some large company can throw at it. All of the above discussions feed into a more optimistic interpretation, that domain expertise is crucial for getting machine-learning systems to learn the right thing - as long as you can learn to jettison some of your cherished habits (e.g. MFCCs!) at the right moment.

Mon 26 October 2015 | science
