I'm blogging from the ISMIR 2011 conference, which is about music information retrieval (MIR). One of the interesting trends is how many people are focusing on scaling things up, to handle millions of audio files (or users, or scores) rather than just hundreds or thousands. Why? Well, in real-world applications it's often important: big music services like Spotify and iTunes have about 15 million tracks, Facebook has millions of users, etc. At ISMIR one of the stars of the show is the Million Song Dataset, just released, which should help many researchers to develop and test on a big scale. Here I'm going to note some of the talks/posters I've seen with interesting approaches to scalability:
Brian McFee described a simple tweak to the kd-tree data structure, called a "spill tree", which improves approximate search. Basically, when you split the data in two you allow some of the data points near the split to spill over and fall on both sides. Simple but apparently effective.
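To make the spilling idea concrete, here's a minimal Python sketch. It is not Brian's implementation: the function names, the median split, and the `overlap` parameter (the spill zone as a fraction of the coordinate range) are all my own assumptions for illustration. The query descends to a single leaf without backtracking, which is as far as I know the usual reason to want the overlap in the first place.

```python
import math
import statistics


def spill_split(points, axis, overlap=0.1):
    """Split at the median along `axis`; points near the median go to BOTH sides.

    `overlap` is a hypothetical parameter: half-width of the spill zone,
    as a fraction of the coordinate range along this axis.
    """
    values = [p[axis] for p in points]
    median = statistics.median(values)
    margin = overlap * (max(values) - min(values))
    left = [p for p in points if p[axis] <= median + margin]
    right = [p for p in points if p[axis] > median - margin]
    return median, left, right


def build_spill_tree(points, depth=0, leaf_size=4, overlap=0.1):
    """Recursively build a tree of nested dicts, cycling through the axes."""
    if len(points) <= leaf_size:
        return {"leaf": points}
    axis = depth % len(points[0])
    median, left, right = spill_split(points, axis, overlap)
    # If spilling stops a child from shrinking, give up and make a leaf.
    if len(left) == len(points) or len(right) == len(points):
        return {"leaf": points}
    return {
        "axis": axis,
        "median": median,
        "left": build_spill_tree(left, depth + 1, leaf_size, overlap),
        "right": build_spill_tree(right, depth + 1, leaf_size, overlap),
    }


def defeatist_query(tree, q):
    """Approximate nearest neighbour: walk to one leaf, never backtrack."""
    while "leaf" not in tree:
        side = "left" if q[tree["axis"]] <= tree["median"] else "right"
        tree = tree[side]
    return min(tree["leaf"], key=lambda p: math.dist(p, q))
```

With `overlap=0` this degenerates to an ordinary median-split kd-tree; a larger overlap costs storage (points are duplicated) but makes it less likely that a query's true neighbour sits just across a split.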
Dominik Schnitzer introduced a nice way to smooth out a search space and reduce the problem of hubness. One way to do it could be to use a minimum spanning tree, for example, but that involves a whole-dataset analysis so it might not scale well. In Dominik's approach, for each data point X you find an estimate of what he calls "mutual proximity": randomly sample 100 data points from your dataset and measure their distance to X, then fit a Gaussian to those distances. Then to find the "mutual proximity" between two data points X and Y, you just evaluate X's Gaussian at Y's location to get a kind of "probability of being a near neighbour". He also makes this a symmetric measure by combining the X->Y measure with the Y->X measure, but I'd imagine you don't always need to do that, depending on your purpose. The end result is a distance measure that pretty much eliminates hubs.
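Here's a rough Python sketch of how I understand the recipe. Two things are my assumptions rather than anything stated in the talk: I read "evaluate X's Gaussian at Y's location" as the Gaussian tail probability (the chance that a random point is farther from X than Y is), and I combine the two directions by simply multiplying them. The function names and the plain Euclidean distance are also just placeholders.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def fit_distance_model(x, data, n_samples=100, rng=random):
    """Fit a Gaussian to the distances from x to a random sample of the data.

    Assumes the sampled distances are not all identical (stdev > 0).
    """
    sample = rng.sample(data, min(n_samples, len(data)))
    dists = [math.dist(x, s) for s in sample]
    return NormalDist(mean(dists), stdev(dists))


def mutual_proximity(x, y, model_x, model_y):
    """Symmetric 'probability of being a near neighbour', between 0 and 1."""
    d = math.dist(x, y)
    # Tail probability: how likely is a random point to be farther than d?
    p_from_x = 1.0 - model_x.cdf(d)
    p_from_y = 1.0 - model_y.cdf(d)
    # Assumption: combine the X->Y and Y->X views by multiplying them.
    return p_from_x * p_from_y


# Usage on synthetic 2-D data (seeded so it is repeatable).
rng = random.Random(0)
data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(1000)]
x, y = data[0], data[1]
model_x = fit_distance_model(x, data, rng=rng)
model_y = fit_distance_model(y, data, rng=rng)
mp = mutual_proximity(x, y, model_x, model_y)
```

A high value means "close", so if you need something that behaves like a distance you could use `1 - mp`. The scalability point is that each model costs only 100 distance computations, regardless of how big the dataset is.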
Shazam's music recognition algorithm, described in this 2006 paper, is one of the commercial success stories of scalable audio MIR. Sebastien Fenet tweaked it to be robust to pitch-shifting, essentially by using a log-frequency spectrogram and using delta-log-frequency rather than frequency in the fingerprints.
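As a toy illustration of why delta-log-frequency helps, here's a Python sketch that hashes pairs of spectral peaks. It is a simplification under my own assumptions, not Sebastien's actual fingerprint: the peak list, the `bins_per_octave` quantisation, the `fan_out` pairing and the hash layout are all hypothetical. The point is just that a pitch shift multiplies every frequency by the same factor, so the difference of log-frequencies within a pair doesn't change.

```python
import math


def pair_hashes(peaks, bins_per_octave=12, fan_out=3):
    """Hash pairs of (time_seconds, frequency_hz) peaks.

    Each hash is (delta log-frequency in bins, time difference), so it
    depends on the frequency RATIO of the two peaks, not on their
    absolute frequencies.
    """
    peaks = sorted(peaks)  # order by time
    hashes = []
    for i, (t1, f1) in enumerate(peaks):
        # Pair each anchor peak with the next few peaks in time.
        for t2, f2 in peaks[i + 1 : i + 1 + fan_out]:
            delta_log_f = round(bins_per_octave * math.log2(f2 / f1))
            hashes.append((delta_log_f, round(t2 - t1, 3)))
    return hashes


peaks = [(0.0, 440.0), (0.1, 660.0), (0.25, 880.0), (0.4, 550.0)]
# The same peaks, pitch-shifted up by two semitones.
shifted = [(t, f * 2 ** (2 / 12)) for t, f in peaks]
same = pair_hashes(peaks) == pair_hashes(shifted)
```

Hashing the raw frequencies instead would give completely different hashes for `peaks` and `shifted`, so a pitch-shifted recording wouldn't match the database.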
Also, Thierry mentioned that he was a fan of using Amazon's cloud storage/processing - if you store data with Amazon you can run MapReduce jobs over it easily, apparently. Mark Levy of last.fm is also a fan of MapReduce, having done a lot of work using Hadoop (the open-source implementation of MapReduce, developed largely at Yahoo) for big data-crunching jobs.
Mikael Henaff presented a technique for learning a sparse spectrum-derived feature set, similar in spirit to K-SVD. The thing I found interesting was how he then made a fast way of decomposing a new signal (once you've derived your feature basis from some training data). Ordinarily you'd have to do an optimisation for every new signal - the dictionary is overcomplete so it can't be done as easily as an orthogonal transform. But you don't want to do that on a lot of data. Instead, he first trains a nonlinear projection which approximates that decomposition (it's a matrix rotation followed by a shrinkage nonlinearity, really simple mathematically). So you have to train that, but then when you want to analyse new data there's no optimisation needed, you just apply the simple transform.
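To show how cheap the analysis step becomes, here's a minimal Python sketch of a "matrix multiply, then shrink" encoder. I'm assuming the shrinkage is the standard soft-threshold; the matrix `W` and threshold `theta` below are made-up values, whereas in the real method they would be learned so that the output approximates the sparse decomposition.

```python
def soft_threshold(v, theta):
    """Shrinkage: move v towards zero by theta, clamping small values to 0."""
    if v > theta:
        return v - theta
    if v < -theta:
        return v + theta
    return 0.0


def encode(x, W, theta):
    """Approximate sparse code: one matrix-vector product, then shrinkage."""
    return [
        soft_threshold(sum(w * xi for w, xi in zip(row, x)), theta)
        for row in W
    ]


# Hypothetical overcomplete projection: 3 features from a 2-D input.
W = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.5, 0.5],
]
code = encode([2.0, 0.1], W, theta=0.5)
```

The shrinkage is what makes the output sparse: any feature whose projection is smaller than `theta` in magnitude comes out as exactly zero, and there is no iterative optimisation anywhere in `encode`.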
There's been plenty of interesting stuff here at ISMIR that isn't about bigness, and it was good of Douglas Eck (of Google) to emphasise that there are still lots of interesting and important problems in MIR that don't need scalability and don't even benefit from it. But there are interesting developments in this area, hence this note.