Non-negative matrix factorisation in Stan

Non-negative matrix factorisation (NMF) is a technique we use for various purposes in audio analysis, to decompose a sound spectrogram.

I've been dabbling with the Stan programming environment here and there. It's an elegant design for specifying and solving arbitrary probabilistic models.

(One thing I want to point out is that it's really for solving for continuous-valued parameters only - this means you can't explicitly do things like clustering etc (unless your approach makes sense with fuzzy cluster assignments). So it's not a panacea. In my experience it's not always obvious which problems it's going to be most useful for.)

So let's try putting NMF and Stan together.

First off, NMF is not always a probabilistic approach - at its core, NMF simply assumes you have a matrix V, which happens to be the product of two "narrower" matrices W and H, and all these matrices have non-negative values. And since Stan is a probilistic environment we need to choose a generative model for that matrix. Here are two alternatives I tried:

We can assume that our data was generated by an independent random complex Gaussian for each "bin" in the spectrogram, each one scaled by some weight value specified by a "pixel" of WH. If we're working with the power spectrogram, this set of assumptions matches the model of Itakura-Saito NMF, as described in Fevotte et al 2009. (See also Turner and Sahani 2014, section 4A.)
We can assume that our spectrogram data itself, if we normalise it, actually just represents one big multinomial probability distribution. Imagine that a "quantum" of energy is going to appear at some randomly-selected bin on your spectrogram (a random location in time AND frequency). There's a multinomial distribution which represents the probabilities, and we assume that our spectrogram represents it. This is a bit weird but if you assume we got our spectrogram by sampling lots of independent quanta and piling them up in a histogram, it would converge to that multinomial in the limit. This is the model used in PLCA.

So here is the Stan source code for my implementations of these models, plus a simple toy dataset as an example. They both converge pretty quickly and give decent results.

I designed these implementations with audio transcription in mind. When we're transcribing music or everyday sound, we often have some pre-specified categories that we want to identify. So rather than leaving the templates W completely free to choose, in these implementations I specify pre-defined spectral templates "Winit".

(Specifying these also breaks a permutation symmetry in the model, which probably helps the model to converge since it shouldn't keep flipping around through different permutations of the solution. Another thing I do is fix the templates W to sum up to 1 each [i.e. I force them to be simplexes] because otherwise there's a scaling indeterminacy: you could double W and halve H and have the same solution.)

I use a concentration parameter "Wconc" to tell the model how closely to stick to the Winit values, i.e. how tight to make the prior around them. I also use an exponential prior on the activations H, to encourage sparsity.

My implementation of the PLCA assumptions isn't quite traditional, because I think in PLCA the spectrogram is assumed to be a sample from a multinomial (which implies it's quantised). I felt it would be a bit nicer to assume the spectrogram is itself a multinomial, sampled from a Dirichlet. There's little difference in practice.

Mon 02 February 2015 | science | Permalink

mcld.co.uk

Other things on this site...