Just back from the EUSIPCO 2012 conference in Bucharest. (The conference was held in the opulent Palace of the Parliament - see previous article for some thoughts on the palace and the town.) Here are some notes about interesting talks/posters I saw:
Lots of stuff relevant to recognition in audio scenes, which is handy because that's related to my current work.
- David Damm's "System for audio summarisation in acoustic monitoring scenarios". Nice approach and demo (with sounds localised around the Fraunhofer campus), though the self-admitted drawback is that it isn't yet particularly scalable, using full DTW search etc.
- Sebastien Fenet's "Fingerprint-based detection of repeating objects in multimedia streams" - here a very scalable approach, using fingerprints (as is done in other large-scale systems such as Shazam). In this paper he compared two fingerprint types: a Shazam-like spectral-peaks method (but using a constant-Q spectrum); and a shallow Matching Pursuit applied to a multiscale STFT. His results seem to favour the former.
- Xavier Valero's "Gammatone wavelet features for sound classification in surveillance applications" - this multiscale version of gammatone is apparently better for detecting bangs and bumps (which fits with folk knowledge about wavelets...).
- M. A. Sehili's "Daily sound recognition using a combination of GMM and SVM for home automation" - they used something called a Sequence Classification Kernel which apparently can be used in an SVM to classify sequential data, even different-length sequential data. Have to check that out.
- Two separate papers - Athanasia Zlatintsi's "AM-FM Modulation Features" and Xavier Valero's "Narrow-band Autocorrelation features" - used features which are complementary to the standard Mel energies, by analysing the fine variation within each band. They each found improved results (for different classification tasks). (In my own thesis I looked at band-wise spectral crest features, hoping to achieve something similar. I found that they did provide complementary information [Sec 3.4] but unfortunately were not robust enough to noise/degradation for my purposes [Sec 3.3]. It'll be interesting to see how these different features hold up - they are more interesting than my spectral crests I think.)
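To make the "fine variation within each band" idea concrete, here is a minimal sketch of a band-wise spectral crest measure. It assumes the common definition (peak magnitude divided by mean magnitude within each band), which may differ in detail from what any of the papers or my thesis used; the function name `band_crest_factors` and the band edges are purely illustrative.

```python
def band_crest_factors(magnitudes, band_edges):
    """Crest factor (max / mean) of a magnitude spectrum in each band.

    magnitudes: non-negative spectral magnitudes, one per frequency bin.
    band_edges: increasing bin indices; band k covers bins
                band_edges[k] up to (not including) band_edges[k + 1].
    Assumption: crest = peak / mean, a common textbook definition.
    """
    crests = []
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        band = magnitudes[lo:hi]
        mean = sum(band) / len(band)
        # A silent band has no defined crest; report 0.0 by convention here.
        crests.append(max(band) / mean if mean > 0 else 0.0)
    return crests


# A flat band has crest 1; a band with a single strong peak has a high crest.
spectrum = [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 4.0, 0.0]
crests = band_crest_factors(spectrum, [0, 4, 8])
```

Two bands with identical energy can have very different crest values, which is the sense in which this kind of feature complements band energies.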
Plenty of informed audio source separation was in evidence too. Not my specialism, more that of others in our group who came along... but I caught a couple of the talks, including Derry Fitzgerald's "User assisted separation using tensor factorisations" and Juan-Jose Bosch's "Score-informed and timbre-independent lead instrument separation in real-world scenarios".
Other papers that were interesting:
- T Adali, "Use of diversity in independent decompositions" - for independence-based decompositions, you can use either of two assumptions about the components: non-Gaussianity or time-dependence. The speaker noted that measuring mutual information rate covers both of these properties, so it seems like a neat thing to use. She used it for some tensor decompositions which were a bit beyond me.
- C Arellano's poster on "Shape model fitting algorithm without point correspondence": simple idea for matching a hand image against a template which has marked points on it (but the query image doesn't): convert both representations into GMMs then find a good registration between the two GMMs. Could be useful, though the registration search is basically brute-force in this paper I think.
- Y Panagakis presented "Music structure analysis by subspace modeling" - it makes a lot of sense, intuitively, that music structure such as verse-chorus-verse should be suited to this idea of fitting different feature subspaces to them. The way music is produced and mixed should make it appropriate for this, I imagine (whereas for audio scenes we probably don't hop from subspace to subspace... unless the mic is moving from indoors to outdoors for example...)
- Y Bar-Yosef's "Discriminative Algorithm for compacting mixture models with application to language recognition". Taking a GMM and approximating it by a smaller one is a generally useful technique - here they were using Hershey and Olsen's 2007 "variational approximation" to the KLD between two GMMs. In this paper, their optimisation tries to preserve the discriminative power between two GMMs, rather than simply keeping the best fit independently.
- I Ari's "Large scale polyphonic music transcription using randomized matrix decompositions" - some elegant tweaks which mean they can handle a very large matrix of data, using a weighted-random atom selection technique which reminds me a little of a kind of randomised Matching Pursuit (though MP is not what they're doing). They reduce the formal complexity of matrix factorisation, both in time and in space, so that it's much more tractable.
- H Hu's "Sparsity level in a non-negative matrix factorisation based speech strategy in cochlear implants" - I know they do some good work with cochlear implants at Southampton Uni. This was a nice example: not only did they use Sparse NMF for noise reduction, and test it with human subjects in simulated conditions, but they also implemented it on a hardware device as used in cochlear implants. This latter point is important because at first I was dubious whether this fancy processing was efficient enough to run on a cochlear implant - good to see a paper that answers those kinds of questions immediately.
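Since the KLD between two GMMs has no closed form, it may help to see what the Hershey and Olsen variational approximation (mentioned in the Bar-Yosef item above) looks like. This is a minimal sketch for one-dimensional GMMs, written from my understanding of the published formula; the function names and the `(weight, mean, std)` tuple layout are my own choices, and it is not the discriminative compaction algorithm from the paper.

```python
import math


def gaussian_kl(mean_p, std_p, mean_q, std_q):
    """Closed-form KL divergence KL(p || q) between two 1-D Gaussians."""
    return (math.log(std_q / std_p)
            + (std_p ** 2 + (mean_p - mean_q) ** 2) / (2.0 * std_q ** 2)
            - 0.5)


def variational_gmm_kl(f, g):
    """Variational approximation to KL(f || g) for 1-D GMMs.

    f, g: lists of (weight, mean, std) tuples, weights summing to 1.
    Each component of f is compared against every component of f
    (numerator) and every component of g (denominator).
    Note: if the components are extremely far apart the exponentials
    can underflow; this sketch does not guard against that.
    """
    total = 0.0
    for w_a, m_a, s_a in f:
        num = sum(w * math.exp(-gaussian_kl(m_a, s_a, m, s)) for w, m, s in f)
        den = sum(w * math.exp(-gaussian_kl(m_a, s_a, m, s)) for w, m, s in g)
        total += w_a * math.log(num / den)
    return total


big = [(0.5, -1.0, 0.5), (0.3, 0.0, 1.0), (0.2, 2.0, 0.7)]
small = [(0.6, -0.6, 0.9), (0.4, 1.5, 1.0)]
approx_kl = variational_gmm_kl(big, small)
```

Two sanity checks that follow directly from the formula: the approximation is zero when both mixtures are identical, and it reduces to the exact Gaussian KLD when each mixture has a single component.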
Christian Jutten gave a plenary talk on source-separation in nonlinear mixtures. Apparently there's a proof from the 1950s by Darmois that if you have multiple sources nonlinearly mixed, then ICA cannot be guaranteed to separate them, for the following simple reason: ICA works by maximising independence, but Darmois proved that for any set of perfectly independent sources you can always construct a nonlinear mixture that preserves this independence. (Jutten gave an example procedure to do this; I think you could use the inverse-copula of the joint distribution as another way.)
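A quick way to convince yourself that independence-preserving nonlinear mixtures exist is a well-known special case for Gaussian sources: rotate each sample by an angle that depends on its radius. This is not Darmois' construction, nor the procedure Jutten showed; it is a simpler illustration, and the function names and the `strength` parameter are hypothetical.

```python
import math
import random


def radius_dependent_rotation(s1, s2, strength=1.0):
    """Rotate (s1, s2) by an angle proportional to its distance from the origin.

    The map preserves radius and has unit Jacobian determinant, so an
    isotropic Gaussian input stays an isotropic Gaussian: the outputs are
    still independent, yet each one is a nonlinear mix of both sources.
    """
    theta = strength * math.hypot(s1, s2)
    c, s = math.cos(theta), math.sin(theta)
    return c * s1 - s * s2, s * s1 + c * s2


def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def demo(n=50000, seed=0):
    rng = random.Random(seed)
    s1 = [rng.gauss(0.0, 1.0) for _ in range(n)]
    s2 = [rng.gauss(0.0, 1.0) for _ in range(n)]
    mixed = [radius_dependent_rotation(a, b) for a, b in zip(s1, s2)]
    y1 = [m[0] for m in mixed]
    y2 = [m[1] for m in mixed]
    return {
        # Crude independence checks: both should be close to zero.
        "corr_y1_y2": pearson(y1, y2),
        "corr_y1sq_y2sq": pearson([v * v for v in y1], [v * v for v in y2]),
        # How much of the original source survives in the first output.
        "corr_y1_s1": pearson(y1, s1),
    }


results = demo()
```

The correlation checks are only a crude proxy for independence, but the point stands: an independence criterion alone cannot tell these outputs apart from the true sources.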
Therefore to do source-separation on nonlinear mixtures you need to add some assumptions, either as constraints or regularisation. Constraining just to "smooth mappings" doesn't work. One set of mixture types which does work is "post-nonlinear mixtures", which means mixtures in which nonlinearities are applied separately to the outputs after linear mixing. (This is a reasonable model, for example, if your mics have nonlinearities but you can assume the sounds linearly mixed in the air before they reached the mics.) You have to use nonlinearities which satisfy a particular additivity constraint (f(u+v) = (f(u)+f(v))/(1+f(u)f(v)) ... tanh satisfies this). Or at least, you have to use those kinds of nonlinearities in order to use Jutten's method.
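The additivity constraint is just the addition formula for tanh, which is easy to check numerically. The sketch below also shows what a post-nonlinear mixture looks like for a single time instant; `post_nonlinear_mixture` and the example mixing matrix are illustrative names and values, not anything from the talk.

```python
import math


def tanh_addition_gap(u, v):
    """Absolute difference between the two sides of
    f(u+v) = (f(u) + f(v)) / (1 + f(u) f(v)) with f = tanh.

    Keep |u| and |v| moderate: for very large arguments tanh rounds to
    exactly +/-1 in floating point and the denominator can hit zero.
    """
    lhs = math.tanh(u + v)
    rhs = (math.tanh(u) + math.tanh(v)) / (1.0 + math.tanh(u) * math.tanh(v))
    return abs(lhs - rhs)


def post_nonlinear_mixture(sources, mixing_matrix, nonlinearity=math.tanh):
    """Linear mixing followed by a per-channel nonlinearity.

    sources: source values at one time instant.
    mixing_matrix: list of rows, one row per observed channel.
    """
    linear = [sum(a * s for a, s in zip(row, sources)) for row in mixing_matrix]
    return [nonlinearity(x) for x in linear]


pairs = [(-3.0, 3.0), (0.1, 0.2), (1.5, -0.7), (2.0, 2.5)]
gaps = [tanh_addition_gap(u, v) for u, v in pairs]
observed = post_nonlinear_mixture([1.0, -1.0], [[1.0, 1.0], [1.0, 0.0]])
```

In the mic analogy, `mixing_matrix` plays the role of the air and `nonlinearity` the role of each mic's distortion.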
Eric Moulines talked about prediction in sparse additive models. There's a lot of sparsity around at the moment (and there were plenty of sparsity papers here); Moulines' different approach is that when you want to predict new values, rather than to reverse-engineer the input values, you don't want to select a single sparsity pattern but aggregate over the predictions made by all sparsity patterns. He uses a particular weighted aggregation scheme which he calls "exponential aggregation" involving the risk calculated for each "expert" (each function in the dictionary).
Now, we don't want to calculate the result for an exponentially large number of sparsity patterns and merge them all, since that would take forever. Moulines uses an inequality to convert the combinatorial problem to a continuous problem; unfortunately, at the end of it all it's still too much to calculate easily (2^m estimators) so he uses MCMC estimation to get his actual results.
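To give a flavour of the aggregation idea, here is a minimal sketch of generic exponential weighting over a tiny set of experts: each expert's weight decays exponentially with its empirical risk, and the prediction is the weighted average. This is an assumption-laden toy, not Moulines' actual scheme: the `temperature` parameter, the uniform prior, the squared-error risk and the three toy experts are all my own choices.

```python
import math


def empirical_risk(expert, xs, ys):
    """Mean squared error of one expert on the training data."""
    return sum((expert(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def exponential_weights(risks, temperature, prior=None):
    """Weights proportional to prior * exp(-risk / temperature)."""
    if prior is None:
        prior = [1.0 / len(risks)] * len(risks)
    # Subtracting the minimum risk leaves the weights unchanged after
    # normalisation but avoids underflow when all risks are large.
    best = min(risks)
    raw = [p * math.exp(-(r - best) / temperature) for p, r in zip(prior, risks)]
    total = sum(raw)
    return [w / total for w in raw]


def aggregate_prediction(experts, weights, x):
    """Weighted average of the experts' predictions at x."""
    return sum(w * f(x) for w, f in zip(weights, experts))


# Toy dictionary of three experts; the data follow y = x exactly.
experts = [lambda x: x, lambda x: 0.0, lambda x: x * x]
xs, ys = [0.0, 1.0, 2.0], [0.0, 1.0, 2.0]
risks = [empirical_risk(f, xs, ys) for f in experts]
weights = exponential_weights(risks, temperature=0.5)
prediction = aggregate_prediction(experts, weights, 3.0)
```

With three experts the sum is trivial; the difficulty described above comes from doing the same thing over every sparsity pattern, which is why sampling is needed in practice.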
I also went to the tutorial on Population Monte Carlo methods (which apparently were introduced by Cappé in 2004). I know about Particle Filters, so what I learned is relative to that:
- Each particle or iteration can have its OWN instrumental distribution, there's no need for it to be common across all particles. In fact the teacher (Petar Djuric) had worked on methods where you have a collection of instrumental distributions, and weighted-sample from all of them, adapting the weights as the iterations progress. This allows it to automatically do the kind of things we might heuristically want: start with broad, heavy-tailed distributions, then focus more on narrow distributions in the final refinement stages.
- For static MC (i.e. not sequential), you can use the samples from ALL iterations to make your final estimate (though you need to take care to normalise appropriately).
- Rao-Blackwellisation lets you solve a lower-dimensional problem (approximating a lower-dimensional target distribution) if you can analytically integrate to solve for a subset of the parameters given the other ones. For example, if some parameters are Gaussian-distributed when conditioned on the others. This can make your approximation much simpler and faster.
- It's generally held to be a good idea to use heavy-tailed distributions, e.g. people use Student's t distribution since it is heavier-tailed than a Gaussian.
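As a rough illustration of the first point in the list above, here is a minimal sketch of a Population Monte Carlo loop with a small collection of instrumental distributions whose mixture weights adapt over iterations. It is a toy under my own assumptions, not the tutorial's algorithm: the 1-D Gaussian target, the three Gaussian kernel scales, the weight floor and all function names are hypothetical, and for brevity it uses Gaussian kernels rather than heavy-tailed ones.

```python
import math
import random


def normal_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def target_unnormalised(x):
    # Hypothetical target: a Gaussian centred on 3 with unit variance,
    # known only up to its normalising constant.
    return math.exp(-0.5 * (x - 3.0) ** 2)


def population_monte_carlo(n_particles=2000, n_iters=10,
                           scales=(5.0, 1.0, 0.2), seed=0):
    rng = random.Random(seed)
    n_kernels = len(scales)
    alphas = [1.0 / n_kernels] * n_kernels  # mixture weights over kernels
    particles = [rng.uniform(-10.0, 10.0) for _ in range(n_particles)]
    estimate = None
    for _ in range(n_iters):
        proposals, kernel_ids, weights = [], [], []
        for centre in particles:
            # Pick a kernel by its current weight, then propose around
            # this particle with that kernel's scale.
            d = rng.choices(range(n_kernels), weights=alphas)[0]
            x = rng.gauss(centre, scales[d])
            # Importance weight against the whole mixture of kernels.
            mix = sum(a * normal_pdf(x, centre, s)
                      for a, s in zip(alphas, scales))
            proposals.append(x)
            kernel_ids.append(d)
            weights.append(target_unnormalised(x) / mix)
        total = sum(weights)
        weights = [w / total for w in weights]
        # Adapt: each kernel's new weight is the share of normalised
        # importance weight earned by the particles it generated.
        new_alphas = [0.0] * n_kernels
        for d, w in zip(kernel_ids, weights):
            new_alphas[d] += w
        new_alphas = [max(a, 1e-3) for a in new_alphas]  # keep kernels alive
        norm = sum(new_alphas)
        alphas = [a / norm for a in new_alphas]
        # Self-normalised estimate of the target mean from this iteration.
        estimate = sum(w * x for w, x in zip(weights, proposals))
        particles = rng.choices(proposals, weights=weights, k=n_particles)
    return estimate, alphas


mean_estimate, final_alphas = population_monte_carlo()
```

This version only reports the estimate from the final iteration; pooling samples from all iterations, as in the second point, would need the normalisation care mentioned there.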