LLMs are all the hype right now. While they are still deeply flawed and should be used with a healthy dose of caution, I have to admit that the performance of these models has surpassed my expectations by a wide margin.
I have been fascinated by the ML/AI space since AlphaGo in 2016. I started exploring (very basic) ML research in 2018, got my first publication in 2020, and shortly afterwards pivoted into industry as an ML engineer, where I have been ever since.
Though I am no expert, I am also sufficiently up to speed with the research community to know that the performance of the most recent generation of LLMs came as a surprise. I, for one, predicted that LLMs would never display reasoning-type capabilities at any scale, and I was very wrong.
Here are my thoughts on why I was wrong – why LLMs have been better than expected, and how they might develop further.
A survey of current art
What can LLMs do right now?
LLMs are starting to display reasoning capabilities. The canonical example is math problems: the training objective and architecture of LLMs would lead one to expect that they would never learn to solve math problems, but it turns out that LLMs can compensate by reasoning in words, a technique known as chain-of-thought prompting.
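To make this concrete, here is a minimal sketch of chain-of-thought prompting. The `call_llm` function is a stand-in for whatever completion API you have access to; the entire technique lives in the prompt text, not the client code.

```python
def call_llm(prompt: str) -> str:
    """Stub: send `prompt` to your LLM of choice and return the completion."""
    raise NotImplementedError


def solve_directly(question: str) -> str:
    # Plain prompting: ask for the answer outright.
    return call_llm(f"Q: {question}\nA:")


def solve_with_cot(question: str) -> str:
    # Chain-of-thought prompting: nudging the model to reason step by
    # step makes it emit intermediate reasoning tokens before the final
    # answer, which markedly improves accuracy on math word problems.
    return call_llm(f"Q: {question}\nA: Let's think step by step.")
```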
LLMs can be trained to use external tools to address their intrinsic weaknesses. Examples include querying search engines, using calculators and translators, and even consulting code documentation and tests to improve code generation.
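As a rough sketch of how tool use can work (the `[CALC ...]` marker and the harness below are my own invention for illustration, not any particular paper's protocol):

```python
import re


def call_llm(prompt: str) -> str:
    """Stub: send `prompt` to your LLM of choice and return the completion."""
    raise NotImplementedError


def run_with_calculator(prompt: str) -> str:
    # One round of a hypothetical tool-use loop: if the model emits a
    # marker like [CALC 13 * 7], the harness evaluates the expression
    # and feeds the result back so the model can continue generating
    # with a correct value instead of hallucinating one.
    completion = call_llm(prompt)
    match = re.search(r"\[CALC ([0-9+\-*/(). ]+)\]", completion)
    if match:
        result = eval(match.group(1))  # toy only; never eval untrusted text
        completion = call_llm(prompt + completion + f" -> {result}\n")
    return completion
```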
The current state-of-the-art applications of LLMs depend heavily on prompt engineering, which is brittle and inefficient. Eventually, we will move on to techniques that can figure out optimal prompts, or even skip prompts entirely. Recent applications of such techniques show that LLMs are close to human performance on several impressive tasks, including medical QA and web navigation.
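A hedged sketch of the simplest possible version of "figuring out optimal prompts": brute-force search over candidate prompts, scored on a labelled dev set. Published methods search far more cleverly (e.g. gradient-based soft prompts in embedding space), but the objective is the same.

```python
def call_llm(prompt: str) -> str:
    """Stub: send `prompt` to your LLM of choice and return the completion."""
    raise NotImplementedError


def accuracy(prompt: str, examples: list[tuple[str, str]]) -> float:
    # Fraction of labelled (input, answer) pairs this prompt gets right.
    hits = sum(call_llm(prompt.format(x=x)).strip() == y for x, y in examples)
    return hits / len(examples)


def best_prompt(candidates: list[str], examples: list[tuple[str, str]]) -> str:
    # Brute-force prompt search: pick whichever candidate template
    # scores highest on the dev set.
    return max(candidates, key=lambda p: accuracy(p, examples))
```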
Beyond all of the tasks LLMs are capable of right now, the most exciting recent finding has been that scaling up LLMs leads to emergent abilities. In short, LLM performance on certain tasks is non-linear in scale – models unlock new skills once they reach a certain size. Examples include solving math word problems, multi-digit arithmetic, and chain-of-thought prompting.
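One toy way to see how this can happen (my own illustration, not taken from the emergence papers): if a task requires several sub-steps that must all succeed, smooth improvements in per-step accuracy produce an abrupt-looking jump in end-to-end accuracy.

```python
# If a task needs k independent sub-steps, each solved with per-step
# accuracy p, end-to-end accuracy is p**k. Smooth gains in p look like
# a sudden "unlock" at the task level.
k = 8
for p in (0.5, 0.7, 0.9, 0.95, 0.99):
    print(f"per-step accuracy {p:.2f} -> {k}-step task accuracy {p ** k:.3f}")
# 0.50 -> 0.004, 0.70 -> 0.058, 0.90 -> 0.430, 0.95 -> 0.663, 0.99 -> 0.923
```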
Overall, my impression of LLMs, given the results and research that are publicly available today, is that the discourse around the technology is at the same time vastly overhyped (we are still not close to AGI) and overly pessimistic (LLMs can do more than just “autocomplete”).
Why do LLMs work?
As I do not have access to the SotA models, some of my understanding is derived from research papers, which are an inherently lossy medium, so take all of the below with a grain of salt. With that disclaimer, here’s my current theory:
Language is more general than we think. My impression of why people are so impressed by ChatGPT is that it appears to be able to do “more” than just generate cool text, but this shouldn’t surprise us: language can represent a huge range of tasks and behaviors. In our pursuit of predicting the next token with greater accuracy, it made sense that a bunch of cognitive capabilities (common sense, reasoning, knowledge) would be useful for improving the performance of our LLMs, but we didn’t know whether these models would be able to pick up such abilities from the low-information-density, unstructured data we were feeding them. It now appears that they in fact do (at least sometimes), but why is this the case?
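To illustrate what I mean by language representing a huge range of tasks, here are a few very different tasks written as plain-text continuations. One model with one interface (predict the next tokens) can attempt every one of them.

```python
# Very different tasks, all reduced to next-token prediction once
# they are written out as text.
tasks_as_text = [
    "Translate English to French: cheese =>",    # translation
    "Q: Who wrote Hamlet?\nA:",                  # factual recall
    "Review: 'Loved every minute.' Sentiment:",  # classification
    "47 + 18 =",                                 # arithmetic
]
```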
Complex tasks are just many simple tasks. My impression is that humans overestimate how complex our thinking really is – most thinking appears advanced but can be broken down into successively smaller cognitive steps, eventually reaching a scale where each step can be solved using bounded knowledge and compute. The scaling up of the number of attention heads and layers in LLMs may simply have brought us to a scale where the parameters have enough capacity to memorize the “rule book” for various simple cognitive tasks and how to combine their results. For example, it should not surprise anyone that LLMs can be taught to solve multi-digit arithmetic, given that there is a finite algorithm for doing so that can be described in simple words. That said, even if this is in fact what LLMs do, what is it about the current models and the way we train them that lets them do this?
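The arithmetic example makes this concrete: the entire grade-school "rule book" for multi-digit addition fits in a few lines, and each step needs only bounded knowledge (a 10×10 addition table) and bounded compute.

```python
def long_addition(a: str, b: str) -> str:
    # Add digit pairs right to left, carrying the overflow: a
    # "complex" task decomposed into many trivially simple steps.
    a, b = a.zfill(len(b)), b.zfill(len(a))  # pad to equal length
    digits, carry = [], 0
    for x, y in zip(reversed(a), reversed(b)):
        carry, d = divmod(int(x) + int(y) + carry, 10)
        digits.append(str(d))
    if carry:
        digits.append(str(carry))
    return "".join(reversed(digits))


assert long_addition("987", "2345") == "3332"
```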
Neural networks + big data can learn anything. Although neural networks are still poorly understood, we now have a few insightful puzzle pieces that help shed light on the situation. In particular: (1) neural networks are universal approximators, which means they can approximate any function (with a few limitations that don’t really apply to humans); (2) models trained with gradient descent do, in practice, converge to approximate global minima; (3) as we collect more text data (of sufficiently high quality and diversity), our training objective “converges” towards our true objective: understanding human language. Put these together and we get an amazing result: a sufficiently large model with sufficient data can learn human language. But what does “sufficiently large” mean, and are LLMs and internet-scale data anywhere close to it?
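As a toy demonstration of points (1) and (2), here is a one-hidden-layer network trained with plain gradient descent to approximate sin(x), using nothing but numpy. No architectural cleverness required; capacity plus gradient descent does the work.

```python
import numpy as np

# Universal approximation in miniature: fit sin(x) with a tiny MLP.
rng = np.random.default_rng(0)
X = np.linspace(-np.pi, np.pi, 256).reshape(-1, 1)
y = np.sin(X)

H = 64                                    # hidden width
W1 = rng.normal(0, 1.0, (1, H)); b1 = np.zeros(H)
W2 = rng.normal(0, 0.1, (H, 1)); b2 = np.zeros(1)

lr = 0.05
for _ in range(10_000):
    h = np.tanh(X @ W1 + b1)              # forward pass
    pred = h @ W2 + b2
    err = pred - y                        # gradient of 0.5 * MSE w.r.t. pred
    gW2 = h.T @ err / len(X); gb2 = err.mean(0)
    dh = (err @ W2.T) * (1 - h ** 2)      # backprop through tanh
    gW1 = X.T @ dh / len(X); gb1 = dh.mean(0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

print("final MSE:", float((err ** 2).mean()))  # should end up tiny vs. var(sin) ~ 0.5
```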
Transformers are really efficient. Theoretical convergence and approximation results suggest that neural network architecture matters a ton – merely adding residual layers can produce an exponential reduction in the required parameter count. Recent research shows that Transformers (the architecture underlying LLMs) are very efficient because they parallelize and compose computation (at least in certain toy settings). No one really knows how to quantify how complex human language is, nor how much of it is captured within our current internet datasets, but it appears that the current SotA LLMs may not be too far off.
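For intuition on the parallelism claim: a single self-attention layer updates every position in a sequence with a handful of matrix multiplications, rather than the position-by-position recurrence of an RNN. A minimal numpy sketch:

```python
import numpy as np


def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # all positions at once
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # pairwise interactions
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)  # softmax over positions
    return weights @ V                         # mix information across the sequence


rng = np.random.default_rng(0)
seq_len, d = 8, 16
X = rng.normal(size=(seq_len, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (8, 16): every token attended to every other, in parallel
```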
Closing thoughts
My takeaways from the past few months are as follows:
Be humble, you never know what to expect. In the past I was deeply confident that LLM research was a waste of time: I thought their performance would saturate quickly, I could not see any practical use cases for them, and they were hideously expensive to train. I was very wrong, mainly because I was extrapolating a highly non-linear function (LLM performance) from few and unrepresentative data points (almost all models up until 2020 had fewer than 1B parameters).
Sometimes, more is different. Emergent abilities in LLMs reveal something profound about the universe – in some circumstances, a large quantity of objects interacting with each other in simple ways produces complex and amazing emergent phenomena. Small changes to the interactions or large changes to the quantity of objects can lead to unexpected results.