Non-negative matrix factorisation (NMF) is a technique we use for various purposes in audio analysis, to decompose a sound spectrogram.
I've been dabbling with the Stan programming environment here and there. It's an elegant design for specifying and solving arbitrary probabilistic models.
(One thing I want to point out is that it's really for solving for continuous-valued parameters only - this means you can't explicitly do things like clustering etc (unless your approach makes sense with fuzzy cluster assignments). So it's not a panacea. In my experience it's not always obvious which problems it's going to be most useful for.)
So let's try putting NMF and Stan together.
First off, NMF is not always a probabilistic approach - at its core, NMF simply assumes you have a matrix V, which happens to be the product of two "narrower" matrices W and H, and all these matrices have non-negative values. And since Stan is a probilistic environment we need to choose a generative model for that matrix. Here are two alternatives I tried:
So here is the Stan source code for my implementations of these models, plus a simple toy dataset as an example. They both converge pretty quickly and give decent results.
I designed these implementations with audio transcription in mind. When we're transcribing music or everyday sound, we often have some pre-specified categories that we want to identify. So rather than leaving the templates W completely free to choose, in these implementations I specify pre-defined spectral templates "Winit".
(Specifying these also breaks a permutation symmetry in the model, which probably helps the model to converge since it shouldn't keep flipping around through different permutations of the solution. Another thing I do is fix the templates W to sum up to 1 each [i.e. I force them to be simplexes] because otherwise there's a scaling indeterminacy: you could double W and halve H and have the same solution.)
I use a concentration parameter "Wconc" to tell the model how closely to stick to the Winit values, i.e. how tight to make the prior around them. I also use an exponential prior on the activations H, to encourage sparsity.
My implementation of the PLCA assumptions isn't quite traditional, because I think in PLCA the spectrogram is assumed to be a sample from a multinomial (which implies it's quantised). I felt it would be a bit nicer to assume the spectrogram is itself a multinomial, sampled from a Dirichlet. There's little difference in practice.