Merge operation: in Chomsky, and in recursive neural networks for NLP

This is either a spooky coincidence, or a really neat connection I hadn't known about:

For decades, Noam Chomsky and colleagues have famously been developing and advocating a "minimalist" idea about the machinery our brain uses to process language. There's a nice statement of it here in this 2014 paper. They propose that not much machinery is needed, and one of the key components is a "merge" operation that the brain uses in composing and decomposing grammatical structures. (Figure 1 shows it in action.)

Then yesterday I was reading this introduction to embeddings in deep neural networks and NLP, and I read the following:

"Models like [...] are powerful, but they have an unfortunate limitation: they can only have a fixed number of inputs. We can overcome this by adding an association module, A, which will take two word or phrase representations and merge them.

(From Bottou (2011))

"By merging sequences of words, A takes us from representing words to representing phrases or even representing whole sentences! And because we can merge together different numbers of words, we don't have to have a fixed number of inputs."

This is a description of something called a "recursive neural network" (NOT a "recurrent neural network"). But look: the module "A" seems to do what the minimalists' "merge" operation does. The blogger quoted above even called it a "merge" operation...
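To make the quoted idea concrete, here's a minimal sketch of what a module like "A" might look like. Everything specific in it is my own assumption for illustration, not taken from the quoted post or from Bottou (2011): the form tanh(W [left; right] + b), the random untrained weights, the toy word vectors, and the helper names make_merge, embed and encode. It also merges strictly left to right, purely to keep the example short. The point is only that one function taking two vectors and returning one vector of the same size can be applied repeatedly, so sentences of any length end up with a representation of the same size.

```python
import math
import random

DIM = 4  # embedding size, chosen arbitrarily for illustration


def make_merge(dim, seed=0):
    """Return a merge function A(left, right) -> vector of length dim.

    Assumed form: tanh(W [left; right] + b), with random (untrained) weights.
    """
    rng = random.Random(seed)
    W = [[rng.uniform(-0.5, 0.5) for _ in range(2 * dim)] for _ in range(dim)]
    b = [0.0] * dim

    def merge(left, right):
        x = list(left) + list(right)  # concatenate the two child vectors
        return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + bi)
                for row, bi in zip(W, b)]

    return merge


def embed(words, dim, seed=1):
    """Toy word vectors: one random vector per distinct word."""
    rng = random.Random(seed)
    return {w: [rng.uniform(-1, 1) for _ in range(dim)]
            for w in sorted(set(words))}


def encode(words, vectors, merge):
    """Fold a word sequence into a single vector by repeated merging."""
    phrase = vectors[words[0]]
    for w in words[1:]:
        phrase = merge(phrase, vectors[w])
    return phrase


merge = make_merge(DIM)
short = "the cat".split()
long_ = "the cat sat on the mat".split()
vectors = embed(long_, DIM)

v_short = encode(short, vectors, merge)
v_long = encode(long_, vectors, merge)

# Two words or six words: the output has the same size either way.
print(len(v_short), len(v_long))
```

Because the output of merge lives in the same space as its inputs, a merged phrase can itself be merged again, which is the property that made me think of the minimalists' operation.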

As far as I can tell, the inventors of recursive neural networks were motivated by technical considerations - e.g. how to handle sentences of varying lengths - and not by the minimalist linguists. But it looks a little bit like they've created an artificial neural network embodiment of the minimalist programme! I'm not an NLP person, nor a linguist, however: surely I'm not the first to notice this connection? It would be a really neat convergence if it was indeed unconscious. Does this mean we can now test some Chomskyan ideas (such as their explanation of word displacement) by implementing them in software?

UPDATE: After chatting with my QMUL colleague Matt Purver - he actually is a computational linguistics expert, unlike me - I should add that there's a little bit less to this analogy than I initially thought. The most obvious disjunction is that the ReNN model performs language analysis in a left-to-right (or right-to-left) fashion, whereas Chomskyan minimalists do not: one thing they preserve from "traditional" grammar is the varying nested constructions of linguistic trees, nothing like as neat in general as the "sat on the mat" example above.
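Here's a small sketch of the difference between the two shapes of analysis. It's hedged in the same way as any toy: the merge function (tanh of a random, untrained weight matrix applied to the two concatenated child vectors), the nested-tuple encoding of trees, and the names compose and leaves are all my own assumptions for illustration, and the "nested" bracketing is just one plausible-looking tree, not a claim about the correct minimalist analysis. Both structures cover the same words in the same order; only the order in which pairs get merged differs.

```python
import math
import random

DIM = 3  # vector size, arbitrary
rng = random.Random(0)

# Random, untrained weights for a toy merge: tanh(W [left; right]).
W = [[rng.uniform(-0.5, 0.5) for _ in range(2 * DIM)] for _ in range(DIM)]


def merge(left, right):
    x = left + right  # concatenate the two child vectors
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W]


def compose(tree, vectors):
    """Recursively merge a binary tree given as nested 2-tuples of words."""
    if isinstance(tree, str):
        return vectors[tree]  # leaf: look up the word vector
    left, right = tree
    return merge(compose(left, vectors), compose(right, vectors))


def leaves(tree):
    """Words of the tree, read left to right."""
    if isinstance(tree, str):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])


words = ["the", "cat", "sat", "on", "mat"]
vectors = {w: [rng.uniform(-1, 1) for _ in range(DIM)] for w in words}

# Strictly left-to-right: each new word is merged onto everything so far.
chain = ((((("the", "cat"), "sat"), "on"), "the"), "mat")

# A nested bracketing: phrases are built first, then merged with each other.
nested = (("the", "cat"), ("sat", ("on", ("the", "mat"))))

v_chain = compose(chain, vectors)
v_nested = compose(nested, vectors)

print(leaves(chain) == leaves(nested))  # same words, same order
print(len(v_chain), len(v_nested))      # same output size
```

The same merge function handles both shapes, so the left-to-right restriction is a property of how the merges are ordered rather than of the merge itself.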

The ReNN model also doesn't really give you anything about long-range dependencies such as the way questions are often constructed with a kind of implicit "move" of a word from one part of the tree to another.

Matt and many other linguists have also told me it's problematic to consider a model where words and sentences are both represented in the same conceptual space. For example, a complete utterance usually implies some practical consequence in the real world, whereas its individual components do not. I recognise that there are differences, but personally I haven't heard any killer argument that they shouldn't exist in the same underlying space-like representation. (After all, many utterances consist of single words; many utterances are partial fragments; many utterances lead to consequences before the speaker has finished speaking.)

I do still believe there's an interesting analogy here. I definitely can't claim that any current ReNN model is an implementation of the Strong Minimalist Programme, but it'd be interesting to see the analogy pushed further, to see where it breaks and how it can be improved.

Wed 21 January 2015 | science | Permalink
