In the early twentieth century, when the equations of quantum physics were born, physicists found themselves in a difficult position. They needed to interpret what the quantum equations meant in terms of their real-world consequences, and yet they were faced with paradoxes such as wave-particle duality and "spooky action at a distance". They turned to philosophy and developed new metaphysics of their own. Thought-experiments such as Schrödinger's cat, originally intended to highlight the absurdity of the standard "Copenhagen interpretation", became standard teaching examples.
In the twenty-first century, researchers in artificial intelligence (AI) and machine learning (ML) find themselves in a roughly analogous position. There has been a sudden step-change in the abilities of machine learning systems, and the dream of AI (which had been put on ice after the initial enthusiasm of the 1960s turned out to be premature) has been reinvigorated - while at the same time, the deep and widespread industrial application of ML means that whatever advances are made, their effects will be felt. There's a new urgency to long-standing philosophical questions about minds, machines and society.
So I was glad to see that Neil Lawrence, an accomplished research leader in ML, published an article on these social implications. The article is "Living Together: Mind and Machine Intelligence". Lawrence makes a noble attempt to provide an objective basis for considering the differences between human and machine intelligences, and what those differences imply for the future place of machine intelligence in society.
In case you're not familiar with the arXiv website I should point out that articles there are un-refereed: they haven't been through the peer-review process that guards the gate of standard scientific journals. And let me cut to the chase: I'm not sure which journal he was targeting with this paper, but if I were a reviewer I wouldn't recommend acceptance. Lawrence's computer science is excellent, but here I find his philosophical arguments disappointing. Here's my review:
Embodiment? No: containment
A key difference between humans and machines, notes Lawrence, is that we humans - considered for the moment as abstract computational agents - have high computational capacity but a very limited bandwidth to communicate. We speak (or type) our thoughts, but really we're sharing the tiniest shard of the information we have computed, whereas modern computers can calculate quite a lot (not as much as does a brain) but they can communicate with such high bandwidth that the results are essentially not "trapped" in the computer. For Lawrence this is a key difference, making the boundaries between machine intelligences much less pertinent than the boundaries between natural intelligences, and suggesting that future AI might not act as a lot of "agents" but as a unified subconscious.
Lawrence quantifies this difference as the numerical ratio between computational capacity and communicative bandwidth. Embarrassingly, he then names this ratio the "embodiment factor". The embodiment of cognition is an important idea in much modern thought-about-thought: essentially, "embodiment" is the rejection of the idea that my cognition can really be considered as an abstract computational process separate from my body. There are many ways we can see this: my cognition is non-trivially affected by whether or not I have hay-fever symptoms today; it's affected by the limited amount of energy I have, and the fact I must find food and shelter to keep that energy topped up; it's affected by whether I've written the letter "g" on my hand (or is it a "9"? oh well); it's affected by whether I have an abacus to hand; it's affected by whether or not I can fly, and thus whether in my experience it's useful to think about geography as two-dimensional or three-dimensional. (For a recent perspective on extended cognition in animals see the thoughts of a spiderweb.) I don't claim to be an expert on embodied cognition. But given the rich cognitive affordances that embodiment clearly offers, it's terribly embarrassing and a little revealing that Lawrence chooses to reduce it to the mere notion of being "locked in" (his phrase) with constraints on our ability to communicate.
Lawrence's ratio could perhaps be useful, so to defuse the unfortunate trivial reduction of embodiment, I would like to rename it "containment factor". He uses it to argue that while humans can be considered as individual intelligent agents, for computer intelligences the boundaries dissolve and they can be considered more as a single mass. But it's clear that containment is far from sufficient in itself: natural intelligences are not the only things whose computation is not matched by their communication. Otherwise we would have to consider an air-gapped laptop as an intelligent agent, but not an ordinary laptop.
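To make the arithmetic behind the air-gapped laptop point concrete, here is a minimal Python sketch. The function name `containment_factor` and every number in it are made up purely for illustration; none of them come from Lawrence's paper, and both rates are assumed to be expressed in the same units.

```python
import math


def containment_factor(compute_rate: float, comm_rate: float) -> float:
    """Ratio of computation rate to communication rate (same units for both)."""
    if compute_rate < 0 or comm_rate < 0:
        raise ValueError("rates must be non-negative")
    if comm_rate == 0:
        # Nothing gets out at all: the ratio is unbounded.
        return math.inf
    return compute_rate / comm_rate


# Placeholder magnitudes, chosen only to make the comparison visible.
networked_laptop = containment_factor(1e9, 1e8)
air_gapped_laptop = containment_factor(1e9, 0.0)

print(networked_laptop, air_gapped_laptop)
```

Unplugging the network cable sends the ratio through the roof without changing what the machine computes, which is why the ratio alone can't be what makes something an agent.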
Agents have their own goals and ambitions
The argument that the boundaries between AI agents dissolve also rests on another problem. In discussing communication Lawrence focusses too heavily on 'altruistic' or 'honest' communication: transparent communication between agents that are collaborating to mutually improve their picture of the world. This focus leads him to neglect the fact that communicating entities often have differing goals, and often have reason to be biased or even deceitful in the information shared.
The tension between communication and individual aims has been analysed in a long line of thought in evolutionary biology under the name of signalling theory, which examines, for example, the conditions under which "honest signalling" is beneficial to the signaller. It's important to remember that the different agents each have their own contexts, their own internal states/traits (maybe one is low on energy reserves, and another is not) which affect communicative goals even if the overall avowed aim is common.
In Lawrence's description the focus on honest communication leads him to claim that "if an entity's ability to communicate is high [...] then that entity is arguably no longer distinct from those which it is sharing with" (p3). This is a direct consequence of Lawrence's elision: it can only be "no longer distinct" if it has no distinct internal traits, states, or goals. The elision of this aspect recurs throughout, e.g. "communication reduces to a reconciliation of plot lines among us" (p5).
Unfortunately the implausible unification of AI into a single morass is a key plank of the ontology that Lawrence wants to develop, and also key to the societal consequences he draws.
There is no System Zero
Lawrence considers some notions of human cognition including the idea of "system 1 and system 2" thinking, and proposes that the mass of machine intelligence potentially forms a new "System Zero" whose essentially unconscious reasoning forms a new stratum of our cognition. The argument goes that this stratum has a strong influence on our thought and behaviour, and that the implications of this on society could be dramatic. This concept has an appeal of neatness but it falls down too easily. There is no System Zero, and Lawrence's conceptual starting-point in communication bandwidth shows us why:
- Firstly, the morass of machine intelligence has no high-bandwidth connection to System 1 or to System 2. The reason we talk of "System 1 and System 2" coexisting in the same agent is that they're deeply and richly connected in our cognition. (BTW I don't attribute any special status to "System 1 and System 2", they're just heuristics for thinking about thinking - that doesn't really matter here.) Lawrence's own argument about the poverty of communication channels such as speech also goes for our reception of information. However intelligent, unified or indeed devious AI becomes, it communicates with humans through narrow channels such as adverts, notifications on your smartphone, or selecting items to show to you. The "wall" between ourselves as agents and AI will be significant for a long time.
- Direct brain-computer interfacing is a potential counterargument here, and if that technology were to develop significantly then it is true that our cognition could gain a high-bandwidth interface. I remain sceptical that such potential will be non-trivially realised in my lifetime. And if it does come to pass, it would dissolve human-human bottlenecks as much as human-computer bottlenecks, so in either case Lawrence's ontology does not stand.
- Secondly AI/ML technologies are not unified. There's no one entity connecting them all together, endowing them with the same objective. Do you really think that Google and Facebook, Europe and China, will pool their machine intelligences together, allowing unbounded and unguarded communication? No. And so irrespective of how high the bandwidth is within and between these silos, they each act as corporate agents, with some degrees of collusion and mutual inference, sure, but they do not unify into an underlying substrate of our intelligence. This disunification highlights the true ontology: these agents sit relative to us as agents - powerful, information-rich and potentially dangerous agents, but then so are some humans.
Disturbingly, Lawrence claims "System Zero is already aligned with our goals". This starts from a useful observation - that many commercial processes such as personalised advertising work because they attempt to align with our subconscious desires and biases. But again it elides too much. In reality, such processes are aligned not with our goals but with the goals of powerful social elites, large companies etc, and if they are aligned with our "system 1" goals then that is a contingent matter.
Importantly, these processes are largely controlled not democratically but commercially or via might-makes-right. Therefore even if AI/ML does align with some people's desires, it will preferentially align with the desires of those with cash to spend.
We need models; machines might not
On a positive note: Lawrence argues that our limited communication bandwidth shapes our intelligence in a particular way: it makes it crucial for us to maintain "models" of others, so that we can infer their internal state (as well as our own) from their behaviour and their signalling. He argues that conversely, many ML systems do not need such structured models - they simply crunch on enough data and they are able to predict our behaviour pretty well. This distinction seems to me to mark a genuine difference between natural intelligence and AI, at least according to the current state of the art in ML.
He does go a little too far in this as well, though. He argues that our reliance on a "model" of our own behaviour implies that we need to believe that our modelled self is in control - in Freudian terms, we could say he is arguing that the existence of the ego necessitates its own illusion that it controls the id. The argument goes that if the self-model knew it was not in control,
"when asked to suggest how events might pan out, the self model would always answer with "I don't know, it would depend on the precise circumstances"."
This argument is shockingly shallow coming from a scientist with a rich history of probabilistic machine learning, who knows perfectly well how machines and natural agents can make informed predictions in uncertain circumstances!
I also find unsatisfactory the eagerness with which various dualisms are mapped onto one another. The most awkward is the mapping of "self-model vs self" onto Cartesian dualism (mind vs body); this mapping is a strong claim and needs to be argued for rather than asserted. It would also need to account for why such mind-body dualism is not universal, either across history or across cultures.
However, Lawrence is correct to argue that "sentience" of AI/ML is not the overriding concern in its role in our society; rather, its alignment or otherwise with our personal and collective goals, and its potential to undermine human democratic agency, is the prime issue of concern. This is a philosophical and a political issue, and one on which our discussion should continue.
This season, I'm lead organiser for two special conference sessions on machine listening for bird/animal sound: EUSIPCO 2017 in Kos, Greece, and IBAC 2017 in Haridwar, India. I'm very happy to see the diverse selection of work that has been accepted for presentation - the diversity of the research itself, yes, but also the diversity of research groups and countries from which the work comes.
The official programmes haven't been announced yet, but as a sneak preview here are the titles of the accepted submissions, so you can see just how lively this research area has become!
Accepted talks for IBAC 2017 session on "Machine Learning Methods in Bioacoustics":
A two-step bird species classification approach using silence durations in song bouts
Automated Assessment of Bird Vocalisation Activity
Deep convolutional networks for avian flight call detection
Estimating animal acoustic diversity in tropical environments using unsupervised multiresolution analysis
JSI sound: a machine-learning tool in Orange for classification of diverse biosounds
Prospecting individual discrimination of maned wolves’ barks using wavelets
Accepted papers for EUSIPCO 2017 session on "Bird Audio Signal Processing":
(This session is co-organised with Yiannis Stylianou and Herve Glotin)
Stacked Convolutional and Recurrent Neural Networks for Bird Audio Detection (preprint)
Densely Connected CNNs for Bird Audio Detection (preprint)
Classification of Bird Song Syllables Using Wigner-Ville Ambiguity Function Cross-Terms
Convolutional Recurrent Neural Networks for Bird Audio Detection (preprint)
Joint Detection and Classification Convolutional Neural Network (JDC-CNN) on Weakly Labelled Bird Audio Data (BAD)
Rapid Bird Activity Detection Using Probabilistic Sequence Kernels
Automatic Frequency Feature Extraction for Bird Species Delimitation
Two Convolutional Neural Networks for Bird Detection in Audio Signals
Masked Non-negative Matrix Factorization for Bird Detection Using Weakly Labelled Data
Archetypal Analysis Based Sparse Convex Sequence Kernel for Bird Activity Detection
Automatic Detection of Bird Species from Audio Field Recordings Using HMM-based Modelling of Frequency Tracks
Please note: this is a PREVIEW - sometimes papers get withdrawn or plans change, so these lists should be considered provisional for now.
People love to take the vegans down a peg or two. I guess they must unconsciously agree that the vegans are basically correct and doing the right thing, hence the defensive mud-slinging.
There's a bullshit article "Being vegan isn’t as good for humanity as you think". Like many bullshit articles, it's based on manipulating some claims from a research paper.
The point that the article is making is summarised by this quote:
"When applied to an entire global population, the vegan diet wastes available land that could otherwise feed more people. That’s because we use different kinds of land to produce different types of food, and not all diets exploit these land types equally."
This is factually correct, according to the original research paper which itself seems a decent attempt to estimate the different land requirements of different diets. The clickbaity inference, especially as stated in the headline, is that vegans are wrong. But that's where the bullshit lies.
Why? Look again at the quote. "When applied to an entire global population." Is that actually a scenario anyone expects? The whole world going vegan? In the next ten years, fifty years, a hundred? No. It's fine for the research paper to look at full-veganism as a comparison against the 9 other scenarios they consider (e.g. 20% veggy, 100% veggy), but the researchers are quite clear that their model is about what a whole population eats. You can think of it as what "an average person" eats, but no it's not what "each person should" eat.
The research concludes that a vegetarian diet is "best", judged on this specific criterion of how big a population the USA's farmland can support. And since that's for the population as a whole, and there's no chance that meat-eating will entirely leave the Western diet, a more sensible journalistic conclusion is that we should all be encouraged to be a bit more vegetarian, and the vegans should be celebrated for helping balance out those meat-eaters.
Plus, of course, the usual conclusion: more research is needed. This research was just about land use, it didn't include considerations of CO2 emissions, welfare, social attitudes, geopolitics...
The research illustrates that the USA has more than enough land to feed its population and that this could be really boosted if we all transition to eat a bit less meat. Towards the end of the paper, the researchers note that if the USA moved to a vegetarian diet, "the dietary changes could free up capacity to feed hundreds of millions of people around the globe."
People who do technical work with sound use spectrograms a heck of a lot. This standard way of visualising sound becomes second nature to us.
As you can see from these photos, I like to point at spectrograms all the time:
(Our research group even makes some really nice software for visualising sound which you can download for free.)
It's helpful to transform sound into something visual. You can point at it, you can discuss tiny details, etc. But sometimes, the spectrogram becomes a stand-in for listening. When we're labelling data, for example, we often listen and look at the same time. There's a rich tradition in bioacoustics of presenting and discussing spectrograms while trying to tease apart the intricacies of some bird or animal's vocal repertoire.
But there's a question of validity. If I look at two spectrograms and they look the same, does that mean the sounds actually sound the same?
In a strict sense, we already know that the answer is "No". Us audio people can construct counterexamples pretty easily, in which there's a subtle audio difference that's not visually obvious (e.g. phase coherence -- see this delightfully devious example by Jonathan le Roux). But it could perhaps be even worse than that: similarities might not just be made easier or harder to spot, in practice they could actually be differently arranged. If we have a particular sound X, it could audibly be more similar to A than B, while visually it could be more similar to B than A. If this were indeed true, we'd need to be very careful about performing tasks such as clustering sounds or labelling sounds while staring at their spectrograms.
So - what does the research literature say? Does it give us guidance on how far we can trust our eyes as a proxy for our ears? Well, it gives us hints but so far not a complete answer. Here are a few relevant factoids which dance around the issue:
- Agus et al (2012) found that people could respond particularly fast to voice stimuli vs other musical stimuli (in a go/no-go discrimination task), and that this speed wasn't explained by the "distance" measured between spectrograms. (There's no visual similarity judgment here, but a pretty good automatic algorithm for comparing spectrograms [actually, "cochleagrams"] acts as a proxy.)
- Another example which Trevor Agus sent me - I'll quote him directly: "My favourite counterexample for using the spectrogram as a surrogate for auditory perception is Thurlow (1959), in which he shows that we are rubbish at reporting the number of simultaneous pure tones, even when there are just 2 or 3. This is a task that would be trivial with a spectrogram. A more complex version would be Gutschalk et al. (2008) in which sequences of pure tones that are visually obvious on a spectrogram are very difficult to detect audibly. (This builds on a series of results on the informational masking of tones, but is a particularly nice example and audio demo.)"
- Zue (1985) gives a very nice introduction and study of "spectrogram reading of speech" - this is when experts learn to look at a spectrogram of speech and to "read" from it the words/phonemes that were spoken. It's difficult and anyone who's good at it will have had to practice a lot, but as the paper shows, it's possible to get up to 80-90% accuracy on labelling the phonemes. I was surprised to read that "There was a tendency for consonants to be identified more accurately than vowels", because I would have thought the relatively long duration of vowels and the concentration of energy in different formants would have been the biggest clue for the eye. Now, the paper is arguing that spectrogram reading is possible, but I take a different lesson from it here: 80-90% is impressive but it's much much worse than the performance of an expert who is listening rather than looking. In other words, it demonstrates that looking and listening are very different, when it comes to the task of identifying phonemes. There's a question one can raise about whether spectrogram reading would reach a higher accuracy if someone learned it as thoroughly as we learn to listen to speech, but that's unlikely to be answered any time soon.
- Rob Lachlan pointed out that most often we look at standard spectrograms which have a linear frequency scale, whereas our listening doesn't really treat frequency that way - it is more like a logarithmic scale, at least at higher frequencies. This could be accommodated by using spectrograms with log, mel or ERB frequency scales. People do have a habit of using the standard spectrogram, though, perhaps because it's the common default in software and because it's the one we tend to be most familiar with.
- We know that listening can be highly accurate in many cases. This is exploited in the field of "auditory display" in which listening is used to analyse scatter plots and all kinds of things. Here's a particularly lovely example quoted from Barrett & Kramer (1999): "In an experiment dating back to 1945, pilots took only an hour to learn to fly using a sonified instrument panel in which turning was heard by a sweeping pan, tilt by a change in pitch, and speed by variation in the rate of a “putt putt” sound (Kramer 1994a, p. 34). Radio beacons are used by rescue pilots to home-in on a tiny speck of a life-raft in the vast expanse of the ocean by listening to the strength of an audio signal over a set of radio headphones."
- James Beauchamp sent me their 2006 study - again, they didn't use looking-at-spectrograms directly, but they did compare listening vs spectrogram analysis, as in Agus et al. The particularly pertinent thing here is that they evaluated this using small spectral modifications, i.e. very fine-scale differences. He says: "We attempted to find a formula based on spectrogram data that would predict percent error that listeners would incur in detecting the difference between original and corresponding spectrally altered sounds. The sounds were all harmonic single-tone musical instruments that were subjected to time-variant harmonic analysis and synthesis. The formula that worked best for this case (randomly spectrally altered) did not work very well for a later study (interpolating between sounds). Finding a best formula for all cases seems to still be an open question."
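As an aside on the frequency-scale point in the list above, here is a small Python sketch of how differently a mel-type scale carves up the frequency axis compared with the linear axis of a standard spectrogram. It uses one common mel formula (several variants exist); the function names, the 0 to 8000 Hz range and the choice of 8 bands are my own assumptions for illustration, not taken from any of the cited studies.

```python
import math


def hz_to_mel(f_hz):
    """Convert Hz to mels using one common formula (variants exist)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_spaced_edges(f_min_hz, f_max_hz, n_bands):
    """Edges (in Hz) of n_bands bands that are equally wide on the mel scale."""
    lo, hi = hz_to_mel(f_min_hz), hz_to_mel(f_max_hz)
    step = (hi - lo) / n_bands
    return [mel_to_hz(lo + i * step) for i in range(n_bands + 1)]


# Illustrative range and band count only.
edges = mel_spaced_edges(0.0, 8000.0, 8)
widths = [b - a for a, b in zip(edges, edges[1:])]

for a, b in zip(edges, edges[1:]):
    print(f"{a:8.1f} Hz to {b:8.1f} Hz  (width {b - a:7.1f} Hz)")
```

Bands that are equal on the mel scale get steadily wider in Hz as frequency rises, so a linear-frequency spectrogram gives the high frequencies far more screen space than this scale would suggest our hearing gives them.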
Really, what does this all tell us? It tells us that looking at spectrograms and listening to sounds are different in so many ways that we definitely shouldn't expect the fine details to match up. We can probably trust our eyes for broad-brush tasks such as labelling sounds that are quite distinct, but for the fine-grained comparisons (which we often need in research) one should definitely be careful, and use actual auditory perception as the judge when it really matters. How to know when this is needed? Still a question of judgment, in most cases.
My thanks go to Trevor Agus, Michael Mandel, Rob Lachlan, Anto Creo and Tony Stockman for examples quoted here, plus all the other researchers who kindly responded with suggestions.
Know any research literature on the topic? If so do email me - NB there's plenty of literature on the accuracy of looking or of listening in various situations; here the question is specifically about comparisons between the two modalities.
Last year I took part in the Dagstuhl seminar on Vocal Interactivity in-and-between Humans, Animals and Robots (VIHAR). Many fascinating discussions with phoneticians, roboticists, and animal behaviourists (ethologists).
One surprisingly difficult topic was to come up with a basic data model for describing multi-party interactions. It was so easy to pick a hole in any given model: for example, if we describe actors taking "turns" which have start-times and end-times, then are we really saying that the actor is not actively interacting when it's not their turn? Do conversation participants really flip discretely between an "on" mode and an "off" mode, or does that model ride roughshod over the phenomena we want to understand?
I was reminded of this modelling question when I read this very interesting new journal article by a Japanese research group: "HARKBird: Exploring Acoustic Interactions in Bird Communities Using a Microphone Array". They have developed this really neat setup with a portable microphone array attached to a laptop which does direction-estimation and decodes which birds are heard from which direction. In the paper they use this to help annotate the time-regions in which birds are active, a bit like the on/off model I mentioned above. Here's a quick sketch:
From this type of data, Suzuki et al calculate a measure called the transfer entropy which quantifies the extent to which one individual's vocalisation patterns contain information that predicts the patterns of another. It gives them a hypothesis test for whether one particular individual affects another, in a network: who is listening to whom?
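To give a feel for what that measure computes, here is a rough Python sketch: turn each bird's annotated time-regions into an on/off sequence, then estimate the transfer entropy between two such sequences. This is not Suzuki et al's implementation. It assumes a history length of one frame and a simple count-based (plug-in) estimate, and the function names, the frame step and the synthetic "birds" are all invented for illustration.

```python
import math
import random
from collections import Counter


def intervals_to_binary(intervals, duration, step):
    """1 for each frame whose start time falls inside any (start, end) region."""
    n_frames = int(round(duration / step))
    series = [0] * n_frames
    for start, end in intervals:
        for i in range(n_frames):
            if start <= i * step < end:
                series[i] = 1
    return series


def transfer_entropy(source, target):
    """Plug-in estimate, in bits, of transfer entropy source -> target (history 1)."""
    if len(source) != len(target):
        raise ValueError("series must have equal length")
    # Count (target_next, target_now, source_now) triples.
    triples = Counter()
    for t in range(len(target) - 1):
        triples[(target[t + 1], target[t], source[t])] += 1
    total = sum(triples.values())
    now_pairs = Counter()   # (target_now, source_now)
    self_pairs = Counter()  # (target_next, target_now)
    now_single = Counter()  # target_now
    for (x_next, x_now, y_now), count in triples.items():
        now_pairs[(x_now, y_now)] += count
        self_pairs[(x_next, x_now)] += count
        now_single[x_now] += count
    te = 0.0
    for (x_next, x_now, y_now), count in triples.items():
        p_joint = count / total
        p_given_both = count / now_pairs[(x_now, y_now)]
        p_given_self = self_pairs[(x_next, x_now)] / now_single[x_now]
        te += p_joint * math.log2(p_given_both / p_given_self)
    return te


# Synthetic example: bird B simply repeats bird A's state one frame later.
rng = random.Random(0)
bird_a = [1 if rng.random() < 0.5 else 0 for _ in range(2000)]
bird_b = [0] + bird_a[:-1]

te_a_to_b = transfer_entropy(bird_a, bird_b)
te_b_to_a = transfer_entropy(bird_b, bird_a)
print(f"A -> B: {te_a_to_b:.3f} bits, B -> A: {te_b_to_a:.3f} bits")
```

In this toy case the A-to-B value comes out large and the B-to-A value close to zero, which is the asymmetry a "who is listening to whom?" test relies on. Real data would need a considered choice of frame size and history length, plus a significance test against shuffled data.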
That's a very similar question to the question we were asking in our journal article last year, "Detailed temporal structure of communication networks in groups of songbirds". I talked about our model at the Dagstuhl event. Here I'll merely emphasise that our model doesn't use regions of time, but point-like events:
So our model works well for short calls, but is not appropriate for data that can't be well-described via single moments in time (e.g. extended sounds that aren't easily subdivided). The advantage of our model is that it's a generative probabilistic model: we're directly estimating the characteristics of a detailed temporal model of the communication. The transfer-entropy method, by contrast, doesn't model how the birds influence each other, just detects whether the influence has happened.
I'd love to get the best of both worlds: a generative and general model for extended sound events influencing one another. It's a tall order because for point-like events, we have point process theory; for extended events I don't think the theory is quite so well-developed. Markov models work OK but don't deal very neatly with multiple parallel streams. The search continues.
A colleague pointed out this new review paper in the journal "Animal Behaviour": Applications of machine learning in animal behaviour studies.
It's a useful introduction to machine learning for animal behaviour people. In particular, the distinction between machine learning (ML) and classical statistical modelling is nicely described (sometimes tricky to convey that without insulting one or other paradigm).
The use of illustrative case studies is good. Most introductions to machine learning base themselves around standard examples predicting "unstructured" outcomes such as house prices (i.e. predict a number) or image categories (i.e. predict a discrete label). Two of the three case studies (all of which are by the authors themselves) similarly are about predicting categorical labels, but couched in useful biological context. It was good to see the case study relating to social networks and jackdaws. Not only because it relates to my own recent work with colleagues (specifically: this on communication networks in songbirds and this on monitoring the daily activities of jackdaws - although in our case we're using audio as the data source), but also because it shows an example of using machine learning to help elucidate structured information about animal behaviour rather than just labels.
The paper is sometimes mathematically imprecise: it's incorrect that Gaussian mixture models "lack a global optimum solution", for example (it's just that the global optimum can be hard to find). But the biggest omission, given that the paper was written so recently, is any real mention of deep learning. Deep learning has been showing its strengths for years now, and is not yet widely used in animal behaviour but certainly will be in years to come; researchers reading a review of "machine learning" should really come away with at least a sense of what deep learning is, and how it sits alongside other methods such as random forests. I encourage animal behaviour researchers to look at the very readable overview by LeCun et al in Nature.
InterSpeech 2016 was a very interesting conference. I have been to InterSpeech before, yes - but I'm not a speech-recognition person so it's not my "home" conference. I was there specifically for the birds/animals special session (organised by Naomi Harte and Peter Jancovic), but it was also a great opportunity …
MLSP 2016 - i.e. the IEEE International Workshop on Machine Learning for Signal Processing - was a great, well-organised workshop, held last week on Italy's Amalfi coast. (Yes, lovely place to go for work - if only I'd had some spare time for sightseeing on the side! Anyway.)
Here are a few …
I just read this new paper, "Metrics for Polyphonic Sound Event Detection" by Mesaros et al. Very relevant topic since I'm working on a couple of projects to automatically annotate bird sound recordings.
I was hoping this article would be a complete and canonical reference that I could use as …
On Monday I did a bit of what-you-might-call standup - at the Science Showoff night. Here's a video - including a bonus extra bird-imitation contest at the end!