Via the SuperCollider users list I heard about a nice trick for extracting the repeating part of a recorded piece of music. Source separation, vocal extraction etc are massive topics which I won't go into right now, but suffice it to say it's not easy. So I was interested to read this nice simple technique (scroll down to "REpeating Pattern Extraction Technique (REPET)") described in an ICASSP paper this year.
Basically it uses spectral subtraction and binary masking - two of the simplest "source separation" tricks you can do to a signal. In general they produce kinda rough results - they don't adapt the phase information at all, for a start, so they can give some smeary MP3ish artefacts. But in this case the authors have applied them to a task where they can produce decent results: here you don't have to try and separate all the instruments out, you just want to divide the recording into two - the repeaty bit and the non-repeaty bit.
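To make those two tricks concrete, here's a minimal numpy sketch of both, applied to magnitude spectra (toy numbers of my own, not from any real track; note that neither touches the phase - for resynthesis you'd reuse the mixture's original phase, which is exactly where the smeary artefacts come from):

```python
import numpy as np

def spectral_subtract(mag, repeat_mag):
    # Subtract the estimated repeating magnitude from each bin, clipping at zero
    return np.maximum(mag - repeat_mag, 0.0)

def binary_mask(mag, repeat_mag):
    # Keep a bin only where the mixture exceeds the repeating estimate
    return mag * (mag > repeat_mag)

# Toy example: one FFT frame (4 magnitude bins) and a repeating-spectrum estimate
frame = np.array([1.0, 3.0, 0.5, 2.0])
repeat_est = np.array([0.8, 1.0, 1.0, 2.5])

print(spectral_subtract(frame, repeat_est))  # roughly [0.2, 2.0, 0.0, 0.0]
print(binary_mask(frame, repeat_est))        # [1.0, 3.0, 0.0, 0.0]
```

Both produce a magnitude spectrum for the "non-repeating" residual; the binary mask is all-or-nothing per bin, while subtraction leaves a partial residue in each bin.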
If you read the ICASSP paper you'll find they describe it well, it's a nice readable paper. (However they do make the task a bit more complex than it needs to be: they do a load of calculations then take the log-spectrum near the end, whereas if they took the log-spectrum at the start the calculations would be a little simpler.) The basic idea is:

1. Estimate the tempo of the track, which tells you the duration of the repeating period.
2. Chop the spectrogram into segments of that duration and average them, giving a model of the spectrum of the repeating part.
3. Use spectral subtraction and binary masking against that model to divide the recording into the repeaty bit and the non-repeaty bit.
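Here's a rough numpy sketch of the offline process as I understand it - my own illustration, not the authors' code. It assumes the period length in frames is already known and divides the track evenly, and it uses a plain mean where the paper is free to be cleverer:

```python
import numpy as np

def repeating_model(spec, period):
    # spec: magnitude spectrogram, shape (frames, bins); period: loop length in frames.
    # Assumes the track is a whole number of periods long (real code would pad/trim).
    segments = spec.reshape(-1, period, spec.shape[1])
    return segments.mean(axis=0)          # one averaged segment, shape (period, bins)

def split_track(spec, period):
    model = np.tile(repeating_model(spec, period), (spec.shape[0] // period, 1))
    repeating = np.minimum(spec, model)   # the repeating part can't exceed the mixture
    return repeating, spec - repeating    # (repeaty bit, non-repeaty bit)

# Toy spectrogram: a 2-frame loop played twice, plus a one-off burst in frame 2
spec = np.array([[1.0, 0.0],
                 [0.0, 1.0],
                 [1.0, 5.0],
                 [0.0, 1.0]])
repeating, novelty = split_track(spec, period=2)
# Most of the burst energy lands in `novelty`; the steady loop lands in `repeating`.
```

Note that averaging over the whole spectrogram is exactly the step that makes this non-realtime, which is what the rest of this post is about.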
From a theoretical point of view there are all sorts of quibbles you could come up with, for example that it might fall apart if a song has varying tempo. But for a fairly large range of tracks, this looks like it could give useful results.
So I decided to implement a real-time version in SuperCollider. I like real-time stuff (meaning you can work with audio as-it-happens rather than just a fixed recording), but the above approach is non-realtime: it takes the average spectrogram over the whole track, for example, so you can't calculate the first ten seconds until you've analysed the whole thing.
What to do? I replaced the usual averaging process with a recursive average - a running mean that you update incrementally as each new frame arrives, rather than computing it over the whole track at once (can't find a nice online explanation of that right now, hm). You still need to know the tempo, but given the tempo you then have a real-time estimate of the average spectrum caused by the repeating bit.
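For reference, the recursive mean is the standard one-observation-at-a-time update - this is my own illustration of the idea, not the exact code inside PV_ExtractRepeat:

```python
import numpy as np

class RecursiveAverage:
    # Running mean updated one observation at a time:
    #   mean_n = mean_{n-1} + (x_n - mean_{n-1}) / n
    # Equivalent to the ordinary mean, but needs no history buffer.

    def __init__(self, nbins):
        self.mean = np.zeros(nbins)
        self.n = 0

    def update(self, frame):
        self.n += 1
        self.mean += (frame - self.mean) / self.n
        return self.mean

# In the REPET setting you'd keep one averager per frame-position within
# the loop; e.g. for a (hypothetical) 4-frame period and 512-bin frames:
period = 4
averagers = [RecursiveAverage(nbins=512) for _ in range(period)]
# On each incoming FFT frame t, call averagers[t % period].update(mag_frame)
# and treat the returned mean as the current repeating-spectrum estimate.
```

You could equally use an exponentially-weighted version of the update, so that older material gradually fades out of the estimate - handy if the song's loop changes partway through.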
One interesting thing is that when a new beat kicks in, it's not immediately detected as a loop - so usually, it plays through once and then gets suppressed. You might think of this as a system to separate "novelty" from "boring loops"...?
I've published this for SuperCollider as a UGen called PV_ExtractRepeat (available in the sc3-plugins collection, only in source-code form at present).
Here's an example of it in action, applied to "Rehab" by Amy Winehouse. As you listen, notice a couple of things: (1) during the first bar there is poor separation, then it gets better; (2) the repeating-only bit (the rhythm section) sounds pretty good - it could easily be used as a karaoke version - while the non-repeating bit (mainly the vocals) sounds pretty messy...
So, not perfect, but potentially useful, maybe for karaoke or maybe for further audio analysis. Thanks to Zafar Rafii and Bryan Pardo for publishing the method - note that their examples sound better than my real-time example here (real-time often means compromises in analysis).