Everything as a vector? Really, AI?

It's actually astonishing how much of the super-impressive recent work in machine learning is based on vector representations of data -- simple "embeddings" in which a data item becomes a coordinate in a Euclidean space (or at least, a metric space). I remember doing something related to this back in my own PhD, and spending lots of time agonising over what a bad idea it seems to assume that every dataset can be represented as a set of dots in a space!

In metric spaces such as these, there are inviolable rules, such as the triangle inequality: for any three items, d(x, z) <= d(x, y) + d(y, z), so two items can never be more dissimilar than the sum of their dissimilarities to some third item. But those fixed rules might not hold for your data. It's quite common for perceptual data to violate the triangle inequality...
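To make that concrete, here's a minimal sketch (in Python, with a made-up dissimilarity matrix, not real perceptual data) that counts triangle-inequality violations in a set of pairwise dissimilarities -- the precondition for embedding them faithfully in any metric space:

```python
import numpy as np

def triangle_violations(D):
    """Count ordered triples (i, j, k) with d(i, k) > d(i, j) + d(j, k),
    i.e. places where the dissimilarity data break the triangle inequality."""
    n = D.shape[0]
    count = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                if len({i, j, k}) == 3 and D[i, k] > D[i, j] + D[j, k] + 1e-9:
                    count += 1
    return count

# Hypothetical perceptual judgements for three sounds: 0 and 2 each sound
# close to 1, yet very unlike each other. No metric space can reproduce this.
D = np.array([[0.0, 1.0, 5.0],
              [1.0, 0.0, 1.0],
              [5.0, 1.0, 0.0]])

print(triangle_violations(D))  # 2: some ordered triples violate the inequality
```

If the count is nonzero, then no assignment of coordinates can reproduce those dissimilarities exactly, whatever the dimensionality of the space.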

In my case it was vowels and consonants, plus all the strange sounds that a beatboxer can produce. The traditional "vowel space" known in phonetics can be treated as a metric space, but it has no room in it for consonants. Should vowels and consonants lie on different manifolds? They probably have different numbers of dimensions, so yes, I'd say. Could those manifolds sit in the same parent space, and should they be connected to one another? How could an algorithm learn that kind of complex structure?

All of this is ignored when we take the generic data-driven approach of projecting everything into some high-dimensional space and then optimising the parameters of that space. It's possible that the learnt representation embeds the data into interestingly-shaped manifolds -- but, as shown by the trivial example of the triangle inequality, there are certain relationships that simply can't be represented in a standard metric space.
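As a toy illustration of that claim, here's a minimal sketch (plain NumPy, standing in for the generic recipe rather than any published method): pick coordinates in a Euclidean space and optimise them by gradient descent to match a target dissimilarity matrix, i.e. classical MDS-style stress minimisation. Run it on the triangle-violating matrix from above and the fit can never be exact:

```python
import numpy as np

rng = np.random.default_rng(0)

# The same triangle-violating dissimilarities as above.
D = np.array([[0.0, 1.0, 5.0],
              [1.0, 0.0, 1.0],
              [5.0, 1.0, 0.0]])

n, dim = D.shape[0], 2
X = rng.normal(size=(n, dim))          # the "embedding": one coordinate per item

def pairwise(X):
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(-1) + 1e-12)

# Gradient descent on the stress: sum((embedded distance - target)^2).
lr = 0.005
for _ in range(10000):
    E = pairwise(X)
    R = E - D                          # residuals
    grad = 4 * ((R / E)[:, :, None] * (X[:, None, :] - X[None, :, :])).sum(axis=1)
    X -= lr * grad

print(np.round(pairwise(X), 2))
# Best 2-D fit: d(0,1) and d(1,2) get stretched to ~2, while d(0,2) shrinks
# to ~4. The stress never reaches zero, in this or any number of dimensions:
# the metric-space assumption itself is what the data refuse to satisfy.
```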

I've not seen any recent work exploring these questions. But perhaps that's just because I'm too busy to find those papers. ... And I'm not claiming that our own work does anything different from this mainstream. For example, our 2021 paper "Deep perceptual embeddings for unlabelled animal sound events" uses exactly this kind of vector embedding, driven by animals' own judgments. It performs better than previous methods, but it still makes these strong "vector space" assumptions about perception.

I'm sure there are mathematicians out there who can give a well-argued opinion. But in an era when deep learning methods seem able to represent entire languages, natural images and more, essentially by encoding them into metric spaces ("embeddings"), the question seems particularly apposite. Are there aspects of natural language, for example, that don't fit metric spaces, and on which large language models empirically do poorly?
