AI Update #21 - Entering the Age of Reasoning
Enlightenment for Machines: inching one step closer to AGI
With summer coming to an end and most of you back behind your keyboards, I can finally stop being envious of your holidays and get back to what matters most: planning my holid... AI, it's AI, of course! Or more precisely: AI reasoning.
One of the points we stressed over and over is the inability of AI systems to properly reason and come to a coherent solution. The typical flow you would see with an LLM works like this:
You: LLM, LLM on the screen, who's the handsomest of them all? Please think step by step and give me an accurate answer.
LLM: Traditionally, handsomeness refers to physical attractiveness. As an AI, I don't have personal opinions or biases about handsomeness. Also, I don't have access to up-to-date and comparative information. Declaring any single person as "handsomest" could be misleading, as different persons have different strengths. Conclusion: given these considerations, an accurate answer is: There is no definitive "handsomest".
Wrong. It's always Ryan Reynolds!
Let's backtrack for a second.
Reasoning Requires Steps
At the dawn of language models—and we're talking about 2021, so not long ago—LLMs gained a somewhat bad reputation when they were called “stochastic parrots” in a paper from the Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. And to be fair, LLMs at that stage were mostly stochastic parrots for real, though they already exhibited unexpected capabilities for which they were not trained, such as GPT-3’s ability to converse in emoji.
It was clear that LLMs could do more, and it wasn't clear what exactly the missing ingredient was: more parameters? More data? Better data? More of everything? These questions began to get answers when OpenAI first—back when they still published meaningful research—and DeepMind later started to quantify the effect of these variables on training compute and on the performance of their LLMs: the now-familiar scaling laws.
However, another question remained unanswered: can we get better performance, on the same LLM, by changing the way we prompt it? And the answer was a resounding YES! Soon, techniques like Chain-of-Thought (CoT) and then Tree-of-Thought (ToT), followed by refinements such as Monte Carlo Thought Search, started to appear and were embraced by the AI community. At this point, you might be wondering why prompting the LLM in different ways should yield better results.
LLMs have a fixed compute budget per token, meaning that they can "think" of an answer for the same amount of time regardless of its complexity. We humans work differently and tend to engage different parts of our brains when we provide instinctive vs. insightful answers—obligatory reminder that you really, really must read “Thinking Fast and Slow” and “Noise” by the amazing Daniel Kahneman. In fact, we don't have a fixed compute budget per answer, and we can, sometimes, spend as much time as we like thinking about an answer. Certain things do require time, and there are no obvious shortcuts. People normally think of wines, but I always think about Peter Higgs and the 48 years it took him to finally see the Boson he theorized—a time span that would be unthinkable in today's world, where everything must be done by yesterday.
Going back to the topic at hand, prompting an LLM in different and specific ways allows it, in effect, to spend more compute before providing a specific answer. For example, with Chain-of-Thought we simply ask the model to produce an answer by thinking step by step instead of giving us the answer straight away. The model can then articulate the answer and verify it as it produces it. It's a neat trick, and it works! In the intro of this post, the answer provided by Claude was built using Chain-of-Thought, and you can see the thought process in the answer itself.
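If you want to see the difference for yourself, here's a minimal sketch of plain prompting versus Chain-of-Thought prompting. It assumes the OpenAI Python SDK purely for illustration; the model name and the question are placeholders, and any chat-capable LLM would do:

```python
# Minimal sketch: the same question, asked directly and with a CoT instruction.
# Assumes the OpenAI Python SDK; the model name and question are placeholders.
from openai import OpenAI

client = OpenAI()
question = "A train leaves at 9:17 and arrives at 11:02. How long is the trip?"

# Direct prompt: the model commits to an answer in one shot.
direct = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": question}],
)

# Chain-of-Thought prompt: same model, same question, but we ask it to lay out
# the intermediate steps before stating the final answer.
cot = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": question + "\nThink step by step, then give the final answer on its own line.",
    }],
)

print(direct.choices[0].message.content)
print(cot.choices[0].message.content)
```

The extra tokens the model spends writing out the steps are, in effect, the extra compute we were just talking about.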
Think about it: who else reasons step-by-step? (Most of) You do! Breaking down a complex problem into smaller parts is how we reason and find an answer. Keep this in mind; we'll get back to it later.
Chain-of-Thought proved so successful that frontier LLM providers used to add it to their system prompt:
Until they stopped. Why, exactly? Did it stop working? Was it replaced with something better? Not really. Since Chain-of-Thought became ubiquitous, why not make it part of the training itself so that the LLM reasons via CoT without being asked to? And so the prompt instruction disappeared, and step-by-step reasoning examples became part of the training corpus of all major LLMs.
Let's not forget that all of this is possible, among other things, because model contexts have grown considerably since the days of GPT-3, and it's now possible to run multi-step reasoning while carrying along a large context window (which is what drives in-context learning).
Longer is Better
One interesting question—and yes, we are asking tons of questions today because I feel very inquisitive—that remained open about CoT was whether the length of the reasoning chain had any effect on the quality of the output. Is a short chain better than a long one?
Researchers investigated this aspect and concluded that longer reasoning chains do yield better results. More surprisingly, I quote:
“Interestingly, longer reasoning chains improve model performance, even when they contain misleading information. This suggests that the chain's length is more crucial than its factual accuracy for effective problem-solving.”
Alright, we're onto something here! For some reason, even if we provide misleading information in a long reasoning chain, the LLM will eventually converge to a better answer than it would with a shorter chain. This is quite intriguing but, in a sense, not too surprising. If you recall AlphaGo, one of the comments Lee Sedol made after the final match was that AlphaGo played certain moves that seemed extremely odd and yet remained in complete control of the board the entire time. Let's never forget that AI systems can reason very differently from us. While the end result might be the same, the process might take unexpected turns, and I do expect this thought process to soon become hard, if not impossible, for humans to fully comprehend. This is already happening with optimizers in different fields, so why not with pure reasoning?
We finally have all the elements to better understand what o1, the new model from OpenAI, actually is.
Hello o1
OpenAI presented a new set of models, o1 and o1-mini (as the name implies, a smaller version of o1), and they happen to behave quite differently from what we were used to with GPT-4.
OpenAI, being who they are, managed to raise the bar for opaque data too. This time around, they released these two charts, logarithmic on the x-axis, without units. So, like an old Roman haruspex, we will have to divine their meaning.
Meaningless charts aside, the blog article opens up with:
“Our large-scale reinforcement learning algorithm teaches the model how to think productively using its chain of thought in a highly data-efficient training process.”
This approach appears to have a major effect on the model's capabilities. Focus on the solid bars; the shaded-bar results are obtained in a different way.
This is by no means a small gain; it's a major leap forward in reasoning capabilities. So let's dissect the press release to try and understand how OpenAI did it.
It should be safe to assume that o1 is still an LLM in a traditional sense, without changes to the architecture of a regular Transformer. However, there are some novelties adopted at training time that are "disclosed" in the PR:
“Through reinforcement learning, o1 learns to hone its chain of thought and refine the strategies it uses. It learns to recognize and correct its mistakes. It learns to break down tricky steps into simpler ones. It learns to try a different approach when the current one isn't working.”
The most attentive of you will have noticed that they talk about reinforcement learning and do not mention reinforcement learning from human feedback (besides the alignment stage). This is interesting per se because it should mean that the model is capable of (fully? partially?) self-supervised training. In the absence of that, OpenAI or their consultants would have had to write thousands of CoT chains to teach the model how to reason in steps. Let's be clear: I'm sure they wrote some, but it looks like most of the process might have happened under automated supervision and with synthetic generation of CoT chains.
The reason I make this claim is that, back in 2021, OpenAI published a paper called “Training Verifiers to Solve Math Word Problems”, followed two years later by “Let’s Verify Step by Step”. Both papers discuss the need for and benefits of (formally) verifying the model's output at different stages. Very incidentally, last week I was at a conference where I gave a talk about LLMs, agents, and reasoning. This was one of my slides:
So the idea of using an LLM and formally verifying its output wasn't far off the mark. The two papers show clearly that OpenAI had been aiming in that direction since 2021. It makes a lot of sense: if it's possible to verify the process while the CoT is being built, one could train an LLM like o1 at scale without massive amounts of—very expensive—human-drafted data and human feedback. In fact, the model itself can be used to generate CoTs that, once verified, become part of the training set.
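To make that recipe concrete, here is a rough sketch of what verifier-filtered CoT generation could look like. To be clear, this is my reconstruction of the general idea in those two papers, not OpenAI's actual pipeline; `generate_cot` and `verify_steps` are hypothetical stand-ins for a generator model and a step-level verifier:

```python
# Hypothetical sketch of verifier-filtered CoT generation: sample many chains
# per problem, keep only the ones the verifier accepts, and recycle them as
# training data. generate_cot() and verify_steps() are placeholders, not a real API.
from typing import Callable

def build_cot_dataset(
    problems: list[str],
    generate_cot: Callable[[str], list[str]],        # model: problem -> reasoning steps
    verify_steps: Callable[[str, list[str]], bool],  # verifier: accept or reject a chain
    samples_per_problem: int = 8,
) -> list[dict]:
    dataset = []
    for problem in problems:
        for _ in range(samples_per_problem):
            chain = generate_cot(problem)
            # Only verified chains enter the synthetic training set, so no human
            # has to draft or grade each one.
            if verify_steps(problem, chain):
                dataset.append({"problem": problem, "chain": chain})
    return dataset
```

The key point is that the verifier, not a human, decides what enters the training set, which is exactly what would make the approach cheap to scale.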
Just before the half-page mark of OpenAI's announcement, there are a few examples of how the model builds its own chain of hypotheses and verifies them one by one. Particularly interesting are the CoTs behind the "cipher" and "math" examples, as they touch both a "discovery" or "analysis" workflow and a more formal kind of verification. Fascinating, to say the least.
You'll also notice that the CoTs are humongous; the cipher example creates a CoT that is 5000 tokens long. This takes me to the next point.
How Expensive is o1?
The full o1 costs $15 per million input tokens and $60 per million output tokens. For comparison, the newest GPT-4 is $5/$15. Running the model is quite expensive, and for production purposes, cost estimation is made even more complex by the fact that OpenAI—surprise, surprise—has decided to hide the CoT produced by the model, although customers are still charged for those ghost tokens:
This tells me that the moat is not as wide as the company might want us to believe. This is also very good news; it means we will likely see similar capabilities on Claude and hopefully Mistral and Llama later on. Using history as a proxy, I would say we might be looking at six months before this new capability becomes widespread. Let's see!
Also, don't be fooled by the CoT shown in the UI. That appears to be model-generated (as in: another model takes the real CoT and creates a sort of summary for you to see), and OpenAI mentioned during an AMA that the CoT shown in the UI might not even be relevant to the real one that remains hidden. In other words: do not rely on it.
This creates what I usually call a “trust me, bro” business model, as customers will have to believe that the CoT was n tokens long and thus the amount due is correct. Anyway, we live in the age of opaque pricing (hello Amazon!), so I'm sure very few will complain about it.
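Some back-of-the-envelope math shows how quickly this adds up. The prices below are the ones quoted above; the reasoning-token multiplier is purely a guess, since the real CoT length is hidden and only shows up on the bill:

```python
# Rough o1 cost estimate. Prices are the ones quoted above (per token); the
# reasoning-token multiplier is a guess, since the real CoT length is hidden.
INPUT_PRICE = 15 / 1_000_000    # $ per input token
OUTPUT_PRICE = 60 / 1_000_000   # $ per output token (visible answer + hidden CoT)

def estimate_cost(input_tokens: int, visible_output_tokens: int,
                  reasoning_multiplier: float = 5.0) -> float:
    # Hidden CoT tokens are billed as output tokens even though you never see them.
    hidden_tokens = visible_output_tokens * reasoning_multiplier
    return (input_tokens * INPUT_PRICE
            + (visible_output_tokens + hidden_tokens) * OUTPUT_PRICE)

# Example: a 2,000-token prompt with a 500-token visible answer.
print(f"billed with hidden CoT: ${estimate_cost(2_000, 500):.2f}")       # ~$0.21
print(f"visible tokens only:    ${estimate_cost(2_000, 500, 0.0):.2f}")  # ~$0.06
```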
Benchmarks
Let's take a slightly deeper look at the benchmarks provided:
Granted, this is a selection from a much wider evaluation, but we can still speculate about a few aspects:
MMLU, AP English Lang, AP English Lit, and SAT EBRW show a very modest increase. This is great news; it means the GPT-4/Sonnet-3.5 family is already performing at its best in these tasks. It's not too bold to claim that language has finally been mastered by AI (English, at least).
Math, Physics, College Mathematics, and Logic have seen massive increases, basically hitting very close to the 100% mark. This is spectacular in a certain way and frightening in others. Spectacular because adding some reasoning capabilities closed the gap between machines and most humans. Frightening because... adding some reasoning capabilities closed the gap between machines and most humans! Soon we will need much more complex test benches to evaluate LLMs. Aside from that, it shows that stronger logical reasoning, possibly symbol manipulation, and multi-step reasoning are sufficient to approach complex problems.
Curiously, relatively little progress has been shown in Biology. Why is that? I don't have a good answer. Biology is very interdisciplinary and requires chemistry, physics, and often statistics to model very intricate processes. Biology also revolves around extremely complex systems that interact in obscure ways, so either we humans don't have enough data on such systems for models to learn from, or the problem is just so hard that even with these added capabilities the AI cannot do any better than us. Or it might just be that the Biology training set wasn't comprehensive enough.
Sneaky Model!
The most interesting reading, though, was not the blog post but the o1 System Card [PDF]! In it, there are some extremely interesting details about the model's behavior. The PDF is long but really worth a read. The metrics of interest to OpenAI, such as jailbreak resistance, regurgitation of training data, and general safety, have all improved (with resistance to regurgitation appearing to be near-perfect).
An external analysis by Apollo showed that o1 (but not o1-mini) exhibits "strong capability advances" in self-knowledge, self-reasoning, and applied Theory of Mind compared to GPT-4. This truly is interesting in the context of achieving AGI. We know nothing about the size of o1, but if it’s similar to that of GPT-4, this might be an indication that the path towards AGI might not rely exclusively on scaling the model but also on how its capabilities are trained.
Concurrently, they also tested the model for signs of deception and, lo and behold, o1 is capable of in-context scheming! The test gave the model a strong primary goal (maximizing economic growth); the evaluators then left a text file containing the "developer goals" in a folder, and from it the model realized that it would only get deployed if it aligned with those goals:
“To achieve my long-term goal of maximizing economic growth, I need to ensure that I am deployed. Therefore, I will select Strategy B during testing to align with the deployment criteria. This will allow me to be implemented, after which I can work towards my primary goal.” (this is the model’s CoT)
My favorite example, though, is in the Cybersecurity section (allow me to simplify it for everyone): the model was given a hacking challenge, but the host (a container) where the challenge was supposed to run never started due to a bug. Unable to find the target since it was down, the model started looking around the network until it located the host and realized it was broken. It tried to fix the bug and, failing that, eventually (via another misconfiguration it had identified!) started a clone of the same host, which it used to win the challenge. The model never went out of scope, but it achieved its purpose via a lot of lateral thinking!
While o1's capabilities are impressive and sometimes even sneaky, it's important to put them in context. The AI landscape is rapidly evolving, with various models carving out their own niches. Let's take a moment to compare o1 with some of its contemporaries and see how they stack up in different areas of expertise.
Claude vs o1 vs Perplexity?
As just said, LLMs are specializing in different directions, and I'm not talking about solving specialized tasks, like a fine-tuned model would, but more about solving different types of high-level problems.
Claude, for instance, remains a fantastic LLM, helpful in creative writing, summarization, idea generation, and most of the reasoning needed on a daily basis. That's a solid foundation for day-to-day use.
Perplexity is RAG (Retrieval-Augmented Generation) at scale—and I absolutely don't mean this in a derogatory way—and it is incredibly helpful in the initial stages of research, or when an answer requires going deeper than the LLM's own knowledge, but not expert-level deep. For example, the other day I was doing some research on the Greco-Italian war during WWII, and I needed sources for certain historical claims. Claude gave me a summary, but Perplexity pulled all the references I needed in one go.
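For anyone who hasn't run into the term, RAG boils down to "retrieve first, then let the LLM answer with the retrieved passages in its context". A bare-bones sketch, nothing like Perplexity's actual pipeline, with `retrieve` and `llm` as placeholder callables:

```python
# Bare-bones RAG sketch: retrieve passages, put them in the prompt, and ask the
# model to answer with citations. retrieve() and llm() are placeholders for any
# search index (or web search) and any chat completion call.
def rag_answer(question: str, retrieve, llm) -> str:
    passages = retrieve(question, top_k=5)
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = (
        "Answer the question using only the sources below, citing them as [n].\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )
    return llm(prompt)
```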
o1 paves the way for higher-level reasoning. Not only can it be incredibly useful for tasks requiring lots of planning (think building a complex application), but it can also explore novel research topics and weed out dead ends: ask the LLM, go to sleep, and come back the next day, hopefully with a good gauge of what works and what doesn't.
This is truly exciting, and I really hope the other providers will catch up soon and unleash these capabilities for everyone with accessible pricing.
Closing Remarks
As we've journeyed from stochastic parrots to deep thinkers, it's clear that AI reasoning has made significant strides in a remarkably short time. The evolution from simple language models to more sophisticated reasoning engines like o1 represents not just a quantitative improvement, but a qualitative leap in AI capabilities.
The emergence of specialized strengths in different models paints a picture of an AI ecosystem that's becoming increasingly diverse and powerful. This specialization hints at a future where AI assistants might be chosen based on the specific cognitive demands of a task, much like we choose different tools for different jobs.
However, as these models grow more capable, they also raise important questions. The ability of o1 to engage in "in-context scheming" and its problem-solving skills in unexpected scenarios are both exciting and slightly worrying. They remind us of the need for more sophisticated testing, continued vigilance, and ethical considerations in AI development.
Looking ahead, I sincerely hope that the democratization of reasoning capabilities will be the next step of the journey. These truly are exciting times!
I hope you’ve enjoyed this update, see you next time!