Neuroesthetics and the Macy conferences
Written by Sil Hamilton in April 2022.
Multimodal predictive models have been all the rage of late. From DALL·E 2 to models inspired by the Socratic method, those working on machine intelligence have been examining how predictive models trained on particular mediums (e.g. text, vision) can work in tandem to produce more comprehensive cognitive systems, in a fashion not unlike bricolage. Multimodal simply means multimedia, itself a reference to art created on the principle that we ourselves are multi-sensory beings: our ears, noses, and eyes all provide qualitatively different experiences. With the growing propensity of deep learning techniques for simulating cognitive processes, multimodal models strike academics as the natural step forward for artificial intelligence research now that large language models have seemingly secured the linguistic domain of human experience.
What enables multimodal models? Composite models like DALL·E 2 and CLIP use natural language as an interface between modes by capitalizing on the success large language models have had in inferring the semantic content of human-written text. Models are trained to convert images to text and vice-versa, while the Socratic models use language as a secret glue mediating the exchange of semantic information across mediums. A picture is worth a thousand words, but a thousand words can also buy you sound or video. The point to take away from this development is that our understanding of human language is undergoing a major phase shift. Usage-based linguistics now provides strong evidence for a close relationship between human thought and speech, naturally leading academics to transfer the fruits of cognitive linguistics into other domains. Multimodal models represent the search for a universal model of human cognition, a modern rendition of a classic search that has occupied theorists for decades.
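To make the idea of language as an interface concrete, here is a minimal sketch of matching an image against candidate captions in a shared vector space, which is roughly the kind of comparison CLIP-style models perform. Everything in it is an assumption made for illustration: the three-dimensional vectors are hand-written toy values standing in for learned embeddings, and the names cosine, caption_vectors, image_vector and best_caption are hypothetical rather than part of any real library.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy, hand-written vectors standing in for learned text embeddings.
caption_vectors = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
    "a recording of rain": [0.0, 0.1, 0.9],
}

# Toy vector standing in for the embedding of some image.
image_vector = [0.8, 0.2, 0.1]

def best_caption(item_vector, captions):
    """Return the caption whose vector lies closest to the item's vector."""
    return max(captions, key=lambda text: cosine(item_vector, captions[text]))

print(best_caption(image_vector, caption_vectors))
```

Because the comparison only needs vectors, the same function would work unchanged if image_vector stood for a sound or a video clip, which is the sense in which text can mediate between mediums.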
The mid-twentieth century saw the sciences become consumed by an all-encompassing race. The prize was universal fame, for the goal was a theory equipped to subsume all disparate parts of any particular scientific field into gestalten, unified wholes. Examples of this universalizing phenomenon include category theory and λ-calculus in mathematics, the Standard Model in physics, generative grammars in linguistics, and so on. All of these approaches share the dream of a cosmopolitan theory-of-all, a general method by which any conceivable problem may be computed and solved in a transparent and inter-compatible way. This aim was not mere wishful thinking: 1936 saw Alan Turing argue that “all effectively calculable sequences” could be computed with his tape machine, while in 1948 Claude Shannon pinned down entropy and information in a single theory. The stars seemed to be aligning as the various branches of the sciences and mathematics came together.
The hype surrounding this development crystallized in the Macy conferences, whose attendees sought to develop further unifying theories and models. Of interest during this period was the introduction of the first artificial neuron by Warren McCulloch and Walter Pitts. In attendance, McCulloch and Pitts argued their model was Turing-complete; that is, it fulfilled the criteria for universal computation set out by Turing. Their goal was to abstract over the mind through a logical model of computation bridging the gap between mathematics and neuroscience. This creation was emblematic of the first cognitive revolution, then called cybernetics. Just as physics, mathematics and linguistics had undergone a process of consolidation, so would biology and the social sciences with their model of the mind. Their neuron was a success. It led to the first perceptron in 1958 and thus has a direct lineage to all modern neural networks. It was not without problems, however. Aside from the lack of compute available at the time, the McCulloch-Pitts neuron was not sufficient for modelling a real biological neuron. First among the issues: reflexivity.
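For readers who have not met it, the McCulloch-Pitts unit can be sketched in a few lines. What follows is the simplified textbook form, not a reproduction of the original formalism: inputs are binary, any active inhibitory input vetoes firing, and otherwise the unit fires once enough excitatory inputs are active. The function name mp_neuron, its parameters, and the gate helpers are hypothetical names chosen for this illustration.

```python
from typing import Sequence

def mp_neuron(inputs: Sequence[int], threshold: int,
              inhibitory: Sequence[int] = ()) -> int:
    """Return 1 (fire) or 0 (stay silent) for a list of binary inputs.

    `inhibitory` lists the positions of inputs that veto firing when active.
    """
    # Absolute inhibition: one active inhibitory input silences the unit.
    if any(inputs[i] for i in inhibitory):
        return 0
    # Otherwise count the active excitatory inputs against the threshold.
    active = sum(x for i, x in enumerate(inputs) if i not in inhibitory)
    return 1 if active >= threshold else 0

# Basic logic gates fall out of the choice of threshold.
def AND(a: int, b: int) -> int:
    return mp_neuron([a, b], threshold=2)

def OR(a: int, b: int) -> int:
    return mp_neuron([a, b], threshold=1)

def NOT(a: int) -> int:
    # A single inhibitory input and a threshold of zero.
    return mp_neuron([a], threshold=0, inhibitory=[0])

for a in (0, 1):
    for b in (0, 1):
        print(a, b, AND(a, b), OR(a, b))
```

Since AND, OR and NOT can all be built this way, networks of such units can express any Boolean function, which gives some intuition for the claims of computational universality made on the model's behalf. The sketch is fully deterministic, which is also why it has nothing to say about the noise and feedback found in real neurons.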
Neurons do not exist in a vacuum. They are beset by noise from competing biological sources, whether this be a result of their internal clocking mechanism or the very rate at which molecules arrive at and interact with receptors. Moreover, they are subject to reflexivity: mirror neuron circuits replicate the patterns other neurons emit, causing feedback loops nearly impossible to calculate in finite time. These issues were and continue to be major showstoppers for anyone attempting to replicate the mind from the neuron up. The ambitious efforts of the Macy conferences died away with time, and cybernetics was replaced by the second cognitive revolution, whose major motto is “embodiment,” a recognition that our cognitive processes are inextricably linked to our real-world environment. Artists have been aware of this fact for a long time; Marshall McLuhan famously wrote in The Gadget Lover that any technology “is an extension or self-amputation of our physical bodies, and such extension also demands new ratios or new equilibriums among the other organs and extensions of the body.” See Richard Serra and Nancy Holt’s 1974 short Boomerang for a striking example of auto-amputation. We are constantly in motion, a metamorphosis too rapid to be profiled in mathematical form.
That is, until the present day. Our neurons may be too real for formalization, but the principles driving them are not. Tangential research on the brain has discovered all sorts of correspondences between what was previously believed to be a part of qualia and the neuronal structure of our mind. An example is found in aesthetics. Developments in neuroscience over the past twenty years have revealed that particular neurons in our visual perception centres are associated with aesthetic response. Certain neurons spike most regularly when observing right angles, while others react strongly when we see straight angles. Given that we evolved our vision over millions of years, the exact neuronal arrangement of our visual cortex is an expression of a deep-set model of beauty (ignoring personal taste for the moment). The subfield of neuroesthetics has grown to formalize this particular facet of the brain. Equivalent domains have grown in the cognitive sciences for similar facets of human culture, cognitive linguistics arguably being the most prominent. Transformers represent one manifestation of this discovery process, with high-parameter models demonstrating word generation patterns predictive of our own cognitive processes as observed by fMRI.
Bringing this back to DALL·E 2 and all other recent neural models equipped with seemingly magical abilities: it is apparent we are witnessing a revival of the first cognitive revolution and its search for a model capable of unifying qualia with mathematical prediction, an inversion of the human experience. We can create biologically plausible cognitive models simply by conducting self-supervised training on billions of human-created records. Looking at the images and texts generated by DALL·E 2 and GPT-3 is dizzying. One can only imagine where we will be in one or two years.