Non-negative matrix factorisation (NMF) is a technique we use for various purposes in audio analysis, to decompose a sound spectrogram.

I've been dabbling with the Stan programming environment here and there. It's an elegant design for specifying and solving arbitrary probabilistic models.

(One thing I want to point out is that it's really for solving for continuous-valued parameters only - this means you can't explicitly do things like clustering etc (unless your approach makes sense with fuzzy cluster assignments). So it's not a panacea. In my experience it's not always obvious which problems it's going to be most useful for.)

So let's try putting NMF and Stan together.

First off, NMF is not always a probabilistic approach - at its core, NMF simply assumes you have a matrix V, which happens to be the product of two "narrower" matrices W and H, and all these matrices have non-negative values. And since Stan is a probilistic environment we need to choose a generative model for that matrix. Here are two alternatives I tried:

- We can assume that our data was generated by an independent random complex Gaussian for each "bin" in the spectrogram, each one scaled by some weight value specified by a "pixel" of WH. If we're working with the power spectrogram, this set of assumptions matches the model of
**Itakura-Saito NMF**, as described in Fevotte et al 2009. (See also Turner and Sahani 2014, section 4A.) - We can assume that our spectrogram data itself, if we normalise it, actually just represents one big multinomial probability distribution. Imagine that a "quantum" of energy is going to appear at some randomly-selected bin on your spectrogram (a random location in time AND frequency). There's a multinomial distribution which represents the probabilities, and we assume that our spectrogram represents it. This is a bit weird but if you assume we got our spectrogram by sampling lots of independent quanta and piling them up in a histogram, it would converge to that multinomial in the limit. This is the model used in
**PLCA**.

So here is the Stan source code for my implementations of these models, plus a simple toy dataset as an example. They both converge pretty quickly and give decent results.

I designed these implementations with **audio transcription** in mind. When we're transcribing music or everyday sound, we often have some pre-specified categories that we want to identify. So rather than leaving the templates W completely free to choose, in these implementations I specify pre-defined spectral templates "Winit".

(Specifying these also breaks a permutation symmetry in the model, which probably helps the model to converge since it shouldn't keep flipping around through different permutations of the solution. Another thing I do is fix the templates W to sum up to 1 each [i.e. I force them to be simplexes] because otherwise there's a scaling indeterminacy: you could double W and halve H and have the same solution.)

I use a concentration parameter "Wconc" to tell the model how closely to stick to the Winit values, i.e. how tight to make the prior around them. I also use an exponential prior on the activations H, to encourage sparsity.

My implementation of the PLCA assumptions isn't quite traditional, because I think in PLCA the spectrogram is assumed to be a sample from a multinomial (which implies it's quantised). I felt it would be a bit nicer to assume the spectrogram is itself a multinomial, sampled from a Dirichlet. There's little difference in practice.

Dan's blog articles may be re-used under the Creative Commons Attribution-Noncommercial-Share Alike 2.5 License.