I've got three papers accepted in conferences this summer, and they're all different angles on the technical side of how we analyse audio with respect to time.
In our field, time is often treated in a fairly basic way, for reasons of tractability. A common assumption is the "Markov assumption" that the current state only depends on the immediate past - this is really handy because it means our calculations don't explode with all the considerations of what happened the day before yesterday. It's not a particularly hobbling assumption - for example, most speech recognition systems in use have the assumption in there, and they do OK.
It's not obvious whether we "need" to build systems with complicated representations of time. There is some good research in the literature already which does, and they have promising results. And conversely, there are some good arguments that simple representations capture most of what's important.
Anyway, I've been trying to think about how our signal-processing systems can make intelligent use of the different timescales in sound, from the short-term to the long-term. Some of my first work on this is in these three conference talks I'm doing this summer, each on a different aspect:
At EUSIPCO I have a paper about a simple chirplet representation that can do better than standard spectral techniques at representing birdsong. Birdsong has lots of fast frequency modulation, yet the standard spectral approaches assume "local stationarity" - i.e. they assume that within a small-enough window, we can treat the signal as unchanging. My argument is that we're throwing away information at this point in the analysis chain, information that for birdsong at least is potentially very useful.
At MML I have a paper about tracking multiple pitched vibrato sounds, using a technique called the PHD filter which has already been used quite a bit in radar and video tracking. The main point is that when we're trying to track multiple objects over time (and we don't know how many objects), it's suboptimal to just take a model that deals with one object and apply the model multiple times. You benefit from using a technique that "knows" there may be multiple things. The PHD filter is one such technique, and it lets you model things with a linear-gaussian evolution over time. So I applied it to vibrato sources, which don't have a fixed pitch but oscillate around. It seems (in a synthetic experiment) the PHD filter handles them quite nicely, and is able to pull out the "base" pitches as well as the vibrato extent automatically. The theoretical elegance of the filter is very nice, although there are some practical limitations which I'll mention in my talk.
At CMMR I have a paper about estimating the arcs of expression in pianists' tempo modulations. The paper is with Elaine Chew, a new lecturer in our group who works a lot with classical piano performance. She has had students before working on the technical question of automatically identifying the "arcs" that we can see visually in expressive tempo modulation. I wanted to apply a Bayesian formulation to the problem, and I think it gives pretty nice results and a more intuitive way to specify the prior assumptions about scale.
So all of these are about machine learning applied to temporal evolution of sound, at different timescales. Hope to chat to some other researchers in this area over the summer!