I just read this new paper, "Metrics for Polyphonic Sound Event Detection" by Mesaros et al. Very relevant topic since I'm working on a couple of projects to automatically annotate bird sound recordings.
I was hoping this article would be a complete and canonical reference that I could cite in any discussion of evaluating sound event detection. It isn't that, for a single reason: it doesn't cover the evaluation of ranked or probabilistic outputs.
Just so you know, the context of that paper is the DCASE2016 challenge. For the purposes of the challenge, they've released a public Python toolbox with their evaluation metrics, and that's a great way to go about things. This paper, then, is oriented around the evaluation paradigm used in DCASE2016.
In that paradigm, they evaluate systems which supply a list of inferred event annotations, each of which is entirely on or off. The annotations are not probabilistic, or ranked, or annotated with certainty/uncertainty. Fair enough, this happens a lot, and it's a perfectly justifiable way to set up the contest. However, in many of my scenarios, we work with systems that output a probabilistic or rankable set of events. You can turn this into a definite annotation simply by thresholding, but what we'd actually like to do is evaluate the fully "nuanced" output.
Why? Why should evaluation care about whether a system labels an event confidently or weakly? Well, it's all about what happens downstream. An example: imagine you have an automatic system for detecting events, and you apply it to a dataset of 1000 hours of audio. No automatic system is perfect, and so you often want to either (a) only focus on the strongly-detected items in later analysis, or (b) ask a human expert to go through the results to cross-check them. In the latter case, the expert does not have time to listen to all 1000 hours; instead you'd like to prioritise their work, for example by focussing on the annotations that are the most ambiguous. This kind of work is very likely in the applications I'm working with.
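To make the two downstream uses concrete, here's a minimal sketch. Everything in it is made up for illustration: the detections, the 0.9 cutoff, and the idea of measuring "ambiguity" as distance from a confidence of 0.5 are my assumptions, not anything from the paper.

```python
# Hypothetical detector output: (label, onset_sec, offset_sec, confidence).
detections = [
    ("blackbird", 12.0, 13.5, 0.97),
    ("robin", 40.2, 41.0, 0.52),
    ("blackbird", 75.0, 76.1, 0.08),
    ("wren", 90.3, 91.2, 0.41),
]


def confident_detections(dets, threshold=0.9):
    """Option (a): keep only the strongly-detected events."""
    return [d for d in dets if d[3] >= threshold]


def review_queue(dets, budget):
    """Option (b): pick the `budget` most ambiguous events for an expert.

    Ambiguity is assumed here to mean "confidence closest to 0.5".
    """
    return sorted(dets, key=lambda d: abs(d[3] - 0.5))[:budget]


strong = confident_detections(detections)
to_check = review_queue(detections, budget=2)
```

Neither function is possible if the system only hands you on/off annotations, which is the point: the confidence values carry information that downstream work relies on.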
The statistics focussed on in the above paper (F-measure, precision, recall, accuracy, error rate) are all based on definite binary annotations, so they don't make use of the nuance. I'm generally an advocate of the "area under the ROC curve" (AUC) statistic. It doesn't tell the whole story, but it does make use of the nuance by averaging over the whole range of possible detection thresholds.
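Here's a small sketch of the difference, using toy numbers that I've invented. The AUC is computed via its standard rank interpretation (the probability that a randomly chosen positive item scores higher than a randomly chosen negative one, with ties counting half). The two "systems" and the 0.5 threshold are purely illustrative assumptions.

```python
def auc(labels, scores):
    """Area under the ROC curve via the rank (Mann-Whitney) identity."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    # Count positive/negative pairs ranked correctly; ties count as half.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def f_measure(labels, scores, threshold):
    """F-measure after turning scores into on/off decisions."""
    pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for l, p in zip(labels, pred) if l == 1 and p == 1)
    fp = sum(1 for l, p in zip(labels, pred) if l == 0 and p == 1)
    fn = sum(1 for l, p in zip(labels, pred) if l == 1 and p == 0)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


# Toy ground truth and two hypothetical systems' confidence scores.
labels   = [1, 1, 1, 0, 0, 0]
system_a = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
system_b = [0.9, 0.8, 0.4, 0.6, 0.45, 0.42]

f_a, f_b = f_measure(labels, system_a, 0.5), f_measure(labels, system_b, 0.5)
auc_a, auc_b = auc(labels, system_a), auc(labels, system_b)
```

With a threshold of 0.5 the two systems make identical on/off decisions, so their F-measures are the same (2/3). Their AUCs differ (8/9 versus 6/9), because system A ranks its one missed event above most of the negatives while system B ranks it below all of them.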
A nice example of a paper which uses AUC for event detection is "Chime-home: A dataset for sound source recognition in a domestic environment" by Foster et al. The above paper does mention this in passing, but doesn't really tease out why anyone would use AUC or how it differs from the DCASE2016 paradigm.
I want to be clear that the Mesaros et al. paper is not "wrong" or anything like that. I just wish it had a section on evaluating ranked/probabilistic outputs, why that might matter, and what metrics come in useful. Similarly, the sed_eval toolbox doesn't have an implementation of AUC for event detection. Presumably it would be fairly straightforward to add one to its "segment-wise" metrics. Maybe one day!
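For what it's worth, here's a rough sketch of what a segment-wise AUC could look like. To be clear, this is not sed_eval's API: the function names, the tuple format, the one-second segments, the single event class, and the choice to score each segment by the maximum confidence of any overlapping event are all my own assumptions.

```python
import math


def segment_activity(events, duration, seg_len=1.0):
    """Per-segment value: the max over events overlapping the segment, else 0.

    `events` is a list of (onset_sec, offset_sec, value) tuples.
    """
    n = math.ceil(duration / seg_len)
    activity = [0.0] * n
    for onset, offset, value in events:
        first = max(0, int(onset // seg_len))
        last = min(n, math.ceil(offset / seg_len))
        for i in range(first, last):
            activity[i] = max(activity[i], value)
    return activity


def auc(labels, scores):
    """Area under the ROC curve via the rank identity (ties count half)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def segment_auc(reference, estimated, duration, seg_len=1.0):
    """AUC over fixed-length segments for a single event class."""
    ref = segment_activity([(on, off, 1.0) for on, off in reference],
                           duration, seg_len)
    labels = [1 if v > 0 else 0 for v in ref]
    scores = segment_activity(estimated, duration, seg_len)
    return auc(labels, scores)


# Toy example: a 6-second clip with one reference event and
# two estimated events carrying hypothetical confidences.
reference = [(1.0, 3.0)]
estimated = [(0.8, 2.1, 0.9), (4.0, 5.0, 0.3)]
result = segment_auc(reference, estimated, duration=6.0)
```

A real implementation would also need to decide how to combine multiple event classes and what to do with recordings that contain no positive (or no negative) segments, which the sketch above simply rejects.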