Sathvik Nair, July 25, 2023

    As someone who's been working with language models in some capacity over the past 5 years or so, the past several months have been interesting, to say the least. It's surreal that the three letters "GPT" are practically common knowledge, and almost everyone has an opinion on today's so-called "Large Language Models" like ChatGPT, ranging from claims that they simply memorize the data they're given to debates over whether they truly "understand" language. A lot of this discussion has been reactive to the technology, but it's important to consider how to talk about language models going forward, and what conclusions we can (and can't) draw from them. Since I'm a cognitive scientist who works with language models, I’d like to think my input on this topic is valuable. This piece presents a way of thinking about (large) language models inspired by neuroscience & cognitive science, and tries to dispel some of the hype around these technologies.

    In his 1982 book *Vision*, the neuroscientist David Marr proposed three levels of analysis. As the title of the book suggests, Marr was interested in how the human visual system works. Instead of studying how every brain cell connected to vision worked, he found it more useful to define the explicit problems the mind is solving whenever it sees an image. The visual system is set up to solve an *information processing* problem. A clearer definition of this problem can help us determine the mechanisms the visual system uses to solve it. In Marr’s eyes, studying a process like vision by looking at brain cells alone is like studying how a bird flies by only examining its feathers: a case of missing the forest for the trees.

    Although Marr’s work was focused on vision, this way of thinking can be used generally to examine any kind of information processing system. This approach is fundamental to cognitive science, especially since the human mind is really great at dealing with information of all sorts. Indeed, Marr’s levels of analysis have been used to frame important questions in many fields of cognitive science. (Large) language models can also be seen as a type of information processing system. Although they only work with text (and sometimes images these days) and require many lifetimes’ worth of text to produce human-like content, Marr’s levels of analysis are a helpful way to think about them. They help us determine what language models can and can't do, and also how to effectively engage with them, since they aren't going away.

    Marr's first level of analysis is the *computational* level. This is a high-level description of what a system does and why. The second level is the *algorithmic* level, which describes how the system carries out its computations: what do the data coming into and out of the system look like, and what processes, or algorithms, are involved? The third level is the *implementational* level, which deals with how the system physically carries out those computations.

    To show how the levels of analysis work, Marr applied them to a cash register. At the computational level, we need to know what it does and why. We use a cash register to keep track of all the things a customer purchased so we know how much they should pay in total. Because this is its purpose, a cash register’s computation should be addition. At the algorithmic level, we need to know what representations the cash register uses. These would be the digits from 0 through 9. In this particular example, the input and the output representations are in the same format: they’re both numbers. The algorithm involved would be the standard rules of addition, like carrying the 1 to the next place if two digits add up to ten or more. At the implementational level, we need to know how these computations are realized physically. A cash register could use a system of buttons and rotors, or some electric circuits, to carry out the addition.
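    To make the algorithmic level concrete, here is a minimal sketch (in Python, with details invented for illustration) of the column-by-column addition a register might perform, carrying a 1 whenever a column sums to ten or more:

```python
# A sketch of the algorithmic level of a cash register's addition.
# Numbers are represented as lists of digits, least significant digit first.
def add_digits(a, b):
    result, carry = [], 0
    for i in range(max(len(a), len(b))):
        da = a[i] if i < len(a) else 0
        db = b[i] if i < len(b) else 0
        total = da + db + carry
        result.append(total % 10)   # the digit we write down in this column
        carry = total // 10         # the 1 we carry to the next place
    if carry:
        result.append(carry)
    return result

# 47 + 85 = 132, written least-significant-digit first
print(add_digits([7, 4], [5, 8]))  # [2, 3, 1]
```

    How the carrying actually happens, whether through rotors or circuits, is a question for the implementational level.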

    We’ve talked a little bit about the brain and more about cash registers. How does this all relate to language models? I’ll now look at language modeling through Marr’s three levels. Especially with what’s been happening in the news, it’s tempting to think of language models as chatbots using the latest and greatest AI. But what *is* language modeling in the first place?

    This is a computational-level question. Given a sequence of words, the goal of a language model is to predict what word is most likely to come next. To give a slightly lower-tech example, just look at predictive text on a smartphone: you’ve typed out part of a message, and your phone gives suggestions for the next word. To accomplish this computational goal, a language model estimates a *probability distribution* over possible next words. In other words, it quantifies how likely or unlikely each word is to occur given the context.
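    As a toy illustration, such a distribution might look like the Python snippet below. The numbers here are made up; a real model would estimate them from data rather than hard-code them.

```python
# Hypothetical next-word probabilities for the context "see you";
# the values are invented for illustration, but they must sum to 1.
context = "see you"
next_word_probs = {"soon": 0.45, "later": 0.30, "there": 0.15, "tomorrow": 0.10}

assert abs(sum(next_word_probs.values()) - 1.0) < 1e-9

# The model's "prediction" is simply the highest-probability word.
prediction = max(next_word_probs, key=next_word_probs.get)
print(prediction)  # -> soon
```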

    Like many ideas in AI, this concept has been around for a long time, and much of the recent progress has come from computers becoming more powerful. Even if modern language models use complicated techniques to predict the next word, we can learn a lot about the basic principles behind the problem by applying Marr’s levels to a simpler case of language modeling. Commercial systems like ChatGPT aren’t conventional language models, since they’re not just based on next-word prediction; I’ll come back to this issue later. Language modeling, in the conventional sense, estimates the probability of the next word based on how many times it comes after the previous word(s) in the model’s *training data* (the dataset of text the model is built from).

    As a small example, if we wanted to predict the next word in *I take my coffee with cream and…*, our language model would look at how many times different words appeared after *and* in its training data, divide each of those counts by how many times *and* appeared on its own to get each word’s probability, and finally pick the word with the highest probability. This is an example of a *bigram* model, since its predictions are based on the frequencies of two-word sequences. If we used a *trigram* model instead, we would look at the last two words of context. This would make it more likely for the trigram model to assign a high probability to *sugar*, since *cream and sugar* probably shows up often in our hypothetical training dataset. Generally, these types of models are called *n*-gram models: a model with 3 words of context would be a 4-gram model, a model with 4 words of context would be a 5-gram model, and so forth. For some (loosely) related exploration, [the Google Books Ngram Viewer](https://books.google.com/ngrams/) can show how many times certain *n*-grams showed up in text at different time periods.
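    Here is a minimal sketch of a bigram model along these lines, assuming a tiny, made-up training corpus; a real model would be trained on vastly more text and would need to handle word pairs it has never seen.

```python
from collections import Counter

# A tiny, made-up training corpus (already split into words).
corpus = ("i take my coffee with cream and sugar . "
          "i like cream and sugar . tea and milk").split()

pair_counts = Counter(zip(corpus, corpus[1:]))   # counts of (w1, w2) pairs
word_counts = Counter(corpus[:-1])               # counts of w1 as a context word

def bigram_prob(prev, word):
    """Estimate P(word | prev) = count(prev word) / count(prev)."""
    return pair_counts[(prev, word)] / word_counts[prev]

# P(sugar | and) = count("and sugar") / count("and")
print(bigram_prob("and", "sugar"))  # 2/3 in this toy corpus
print(bigram_prob("and", "milk"))   # 1/3
```

    Dividing the count of *and sugar* by the count of *and* is exactly the computation described above; a trigram model would do the same thing with counts of three-word sequences.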

    To our knowledge, this idea was first implemented 75 years ago by the mathematician Claude Shannon, who was developing approaches to quantify the information being communicated; his work actually gave us the term *bit* (as in bits on a computer!). *N*-gram models were developed further when people used them for research on topics like speech recognition in the 1970s and 80s, and large datasets of *n*-grams have even been used by companies like Google in recent years. In fact, they were considered state-of-the-art for language modeling until the 2010s.