Dynamic Embeddings
Written by Sil Hamilton in March 2023.

There are three major kinds of large language models floating around today. They’re all based on the Transformer, the machine translation model Google Brain released for Google Translate back in 2017.

There are encoder-decoder models, whose architecture mimics the original Transformer (T5). Then there are models only using the encoder portion of the Transformer (BERT, BART) and models only using the decoder portion (GPT, LLaMa). Which half of the Transformer you choose to use determines what your model is capable of.

The encoder half lets you encode a sentence into a vector of n-dimensions representing the core meaning of that sentence. You receive a point in a high-dimensional space of all possible sentences, letting you do fun things like semantic search and topic clustering.

The decoder half takes a list of (implicit) vectors resting in this space of all meaning and predicts the following vector before translating them back into natural language. This is where the “generative” in “generative AI” comes in.

Put together, the encoder and decoder block are meant to work in tandem to translate sentences in one language (i.e. a space of meaning) to another. But researchers have found these halves are better at non-translation tasks when split apart and used on their own as is proven by ChatGPT’s wild success.

My work in computational literary studies has had me use models of all three types. BERT is useful for sequence classification tasks, like telling me whether a passage of text is fictional or not. I’ve also found GPT useful for simulating Supreme Court decisions through argument generation.

My research is unusual for people in computational literary studies. The usual M.O. for those studying literature at scale is to custom-build classifiers (say, for detecting narrativity) using many examples gathered by a bunch of undegrads annotating fiction to pull out the features we care most about.

That looks like it’s about to change. We’re now seeing models demonstrably capable of annotating passages with human-level accuracy, leading to some cool projects.

If you ask me which of the three types of models I outlined above is closest to how a human reads, I would say none of the above.

You see, reading is non-linear. We read each paragraph, pause, and consider how what we’ve just read impacts what we read before. We might go back and reread a section to clear doubts, or to double-check a piece of foreshadowing. And then we keep reading.

Neither encoders nor decoders do this. While existing language models might be able to annotate well, this does not necessarily mean they are reading like a person. And this becomes important when academics in the humanities use AI-generated findings to theorize about people.

Encoders read a text all at once. They encode a vector both for the overall passage and for each individual word, at the same time. We call them 'bidirectional' models because they consider the meaning of every word in a passage in relative to every other word. Encoders can’t be surprised.

Decoders do read passages one word at a time. They are 'unidirectional.' Decoders can feel surprised if a word seems improbable given the preceding words. We capture this through the loss metric. But what they can’t do is go back and reassess whether this new word changes the meaning of previous words and reflect this in the vector representation of that word.

If we want to make sure a model reads like us, we need a new architecture putting encoders and decoders back together again.

Existing encoder-decoder models like T5 are no good because they are encoders and decoders working in tandem, complete with the problems of both. We need something else.

One solution would be to use an encoder like a decoder; repeatedly encoding a passage with one new word introduced each time. We would then receive a set of “dynamic embeddings,” one for each word in a story. Each dynamic embedding would itself be a list of vectors, each vector encoded at a particular point of the story arc.

The beginning of a novel feels different when you know the story. Dynamic embeddings would bring us one step further to being able to model this important feature of narrativity—and learn something about human communication in the process.